Package com.yahoo.language.simple
Class SimpleDetector
java.lang.Object
com.yahoo.language.simple.SimpleDetector
- All Implemented Interfaces:
Detector
Includes functionality for determining the langCode from a sample or from the encoding.
There are two ways to guess a String's langCode: by encoding and by character
set. If the encoding is available, it is a very good indication of the langCode. If the encoding is not available,
the actual characters in the string can be used to make an educated guess at the String's langCode. Recall that a
String in Java is Unicode, so we can simply look at the Unicode blocks of the characters in the string.
Unfortunately, this is not 100% foolproof. From what I've been able to determine, Korean characters do not overlap with
Japanese or Chinese characters, so their presence is a good indication of Korean. If a string contains phonetic
Japanese (kana), this is a good indication of Japanese. However, Japanese and Chinese characters occupy many of the same
character blocks, so if there are no definitive signs of Japanese, the String is assumed to be Chinese.
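The character-block heuristic described above can be illustrated with a minimal sketch. This is not the SimpleDetector implementation: the class name UnicodeBlockGuess, the Guess enum, and the exact set of blocks checked are assumptions made for illustration only, and the real class may inspect more blocks or weigh them differently.

```java
import java.lang.Character.UnicodeBlock;

// Illustrative sketch only; names and block choices are hypothetical.
final class UnicodeBlockGuess {

    // Hypothetical result labels used only by this sketch.
    enum Guess { KOREAN, JAPANESE, CHINESE, UNKNOWN }

    static Guess guess(String input) {
        boolean sawKana = false;
        boolean sawHan = false;
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            UnicodeBlock block = UnicodeBlock.of(codePoint); // may be null for unassigned code points
            // Hangul does not overlap with Japanese or Chinese, so it is treated as decisive.
            if (block == UnicodeBlock.HANGUL_SYLLABLES
                    || block == UnicodeBlock.HANGUL_JAMO
                    || block == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO) {
                return Guess.KOREAN;
            }
            if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
                sawKana = true; // phonetic Japanese
            } else if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                sawHan = true;  // shared by Japanese and Chinese
            }
        }
        if (sawKana) return Guess.JAPANESE;
        // Ideographs without any kana: fall back to Chinese, as the description states.
        if (sawHan) return Guess.CHINESE;
        return Guess.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(guess("\uD55C\uAD6D\uC5B4"));       // Hangul sample
        System.out.println(guess("\u65E5\u672C\u8A9E\u306E")); // ideographs plus one hiragana
        System.out.println(guess("\u4E2D\u6587"));             // ideographs only
    }
}
```

The fallback order mirrors the reasoning in the description: kana is checked before ideographs because a Japanese text normally mixes both, whereas ideographs alone cannot distinguish the two languages.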
- Author:
- Rich Pito, bjorncs
-
Constructor Summary
-
Method Summary
detect(byte[] input, int offset, int length, Hint hint)
- Detects language and encoding of the supplied byte array, possibly using a language/encoding hint.
detect(ByteBuffer input, Hint hint)
- Detects language and encoding of the supplied ByteBuffer, possibly using a language/encoding hint.
detect(String input, Hint hint)
- Detects language of the supplied String, possibly using a language hint.
guessEncoding(byte[] input)
guessEncoding(byte[] input, int offset, int length)
guessLanguage(byte[] buf, int offset, int length)
guessLanguage(String input)
-
Constructor Details
-
SimpleDetector
public SimpleDetector()
-
Method Details
-
detect
Description copied from interface: Detector
Detects language and encoding of the supplied byte array, possibly using a language/encoding hint.
- Specified by:
- detect in interface Detector
- Parameters:
- input - the buffer that is to be inspected
- offset - the offset to detect from
- length - the size to detect from
- hint - a hint to the detector, or null for no hint
- Returns:
- an array of possible language/encoding pairs, sorted by decreasing confidence (possibly empty, but never null)
-
detect
Description copied from interface: Detector
Detects language and encoding of the supplied ByteBuffer, possibly using a language/encoding hint.
- Specified by:
- detect in interface Detector
- Parameters:
- input - the buffer that is to be inspected, from its current position to its limit
- hint - a hint to the detector, or null for no hint
- Returns:
- an array of possible language/encoding pairs, sorted by decreasing confidence (possibly empty, but never null)
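To make "from its current position to its limit" concrete, the following sketch shows which bytes of a ByteBuffer fall inside that window. It uses only java.nio and does not call SimpleDetector; the class name BufferWindowSketch and the helper remainingBytes are hypothetical. Whether detect itself advances the buffer's position is not stated here, so the sketch reads through a duplicate and leaves the caller's buffer untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not part of the SimpleDetector API.
final class BufferWindowSketch {

    // Copies the bytes between position and limit without moving the caller's position.
    static byte[] remainingBytes(ByteBuffer input) {
        ByteBuffer view = input.duplicate(); // shares content, has its own position and limit
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        byte[] raw = "xxhelloyy".getBytes(StandardCharsets.US_ASCII);
        // wrap(array, offset, length) sets position = offset and limit = offset + length.
        ByteBuffer buffer = ByteBuffer.wrap(raw, 2, 5);
        byte[] window = remainingBytes(buffer);
        System.out.println(new String(window, StandardCharsets.US_ASCII));
        System.out.println(buffer.position());
    }
}
```

ByteBuffer.wrap(array, offset, length) produces the same window that the byte-array overload describes with its offset and length parameters, which is one way to relate the two detect variants.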
-
detect
Description copied from interface: Detector
Detects language of the supplied String, possibly using a language hint.
-
guessLanguage
-
guessLanguage
-
guessEncoding
-
guessEncoding
-