Class SimpleDetector

  • All Implemented Interfaces:
    Detector

    public class SimpleDetector
    extends java.lang.Object
    implements Detector
    Includes functionality for determining the langCode from a sample or from the encoding. There are two ways to guess a String's langCode: by encoding and by character set. If the encoding is available, it is a very good indication of the langCode. If the encoding is not available, the actual characters in the string can be used to make an educated guess at the String's langCode. Recall that a String in Java is Unicode, so we can simply look at the Unicode blocks of the characters in the string. Unfortunately, this is not 100% foolproof. From what I've been able to determine, Korean characters do not overlap with Japanese or Chinese characters, so their presence is a good indication of Korean. If a string contains phonetic Japanese (kana), this is a good indication of Japanese. However, Japanese and Chinese characters occupy many of the same character blocks, so if there are no definitive signs of Japanese, the String is assumed to be Chinese.
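The two strategies described above can be sketched with JDK classes only. This is a minimal illustration under stated assumptions, not the code SimpleDetector actually runs: the class name `CjkGuessSketch`, the method names, the list of encodings, and the choice to return ISO 639-1 codes as plain strings (rather than `Language` values) are all hypothetical.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of SimpleDetector. */
final class CjkGuessSketch {

    /** Guesses a language code from an encoding name, or returns null if the encoding is not telling. */
    static String fromEncoding(String encoding) {
        if (encoding == null) return null;
        switch (encoding.toLowerCase(Locale.ROOT)) {
            case "shift_jis":
            case "euc-jp":
            case "iso-2022-jp":
                return "ja";
            case "euc-kr":
                return "ko";
            case "gb2312":
            case "gbk":
            case "big5":
                return "zh";
            default:
                return null; // e.g. UTF-8 carries no language information
        }
    }

    /** Guesses a language code from the Unicode blocks of the characters, or returns null. */
    static String fromText(String text) {
        boolean sawHan = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeBlock block = UnicodeBlock.of(cp);
            // Hangul does not overlap with Japanese or Chinese blocks.
            if (block == UnicodeBlock.HANGUL_SYLLABLES
                    || block == UnicodeBlock.HANGUL_JAMO
                    || block == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO) {
                return "ko";
            }
            // Kana is a definitive sign of Japanese.
            if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
                return "ja";
            }
            // Han ideographs are shared by Japanese and Chinese; remember them and keep looking.
            if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                sawHan = true;
            }
        }
        // No kana found: fall back to Chinese if any ideograph was seen.
        return sawHan ? "zh" : null;
    }
}
```

Note that the sketch returns as soon as it sees Hangul or kana, so a mixed Korean/Japanese string is classified by whichever script appears first; a real detector may weigh the evidence differently.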
    Author:
    Rich Pito, bjorncs
    • Constructor Summary

      Constructors 
      Constructor Description
      SimpleDetector()  
    • Method Summary

      Modifier and Type Method Description
      Detection detect(byte[] input, int offset, int length, Hint hint)
      Detects language and encoding of the supplied byte array, possibly using a language/encoding hint.
      Detection detect(java.lang.String input, Hint hint)
      Detects language of the supplied String, possibly using a language hint.
      Detection detect(java.nio.ByteBuffer input, Hint hint)
      Detects language and encoding of the supplied ByteBuffer, possibly using a language/encoding hint.
      java.lang.String guessEncoding(byte[] input)
      Language guessLanguage(byte[] buf, int offset, int length)
      Language guessLanguage(java.lang.String input)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • SimpleDetector

        public SimpleDetector()
    • Method Detail

      • detect

        public Detection detect(byte[] input,
                                int offset,
                                int length,
                                Hint hint)
        Description copied from interface: Detector
        Detects language and encoding of the supplied byte array, possibly using a language/encoding hint.
        Specified by:
        detect in interface Detector
        Parameters:
        input - the buffer that is to be inspected
        offset - the index in input at which inspection starts
        length - the number of bytes to inspect, starting at offset
        hint - a hint to the detector, or null for no hint
        Returns:
        a Detection describing the detected language and encoding (never null)
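To make the `offset`/`length` window concrete, the following sketch bounds-checks such a window and decodes only the bytes inside it. It is an illustration under assumptions, not SimpleDetector's code: the class and method names are hypothetical, and UTF-8 is assumed purely for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

/** Hypothetical helper for illustration; not part of SimpleDetector. */
final class WindowSketch {

    /** Decodes input[offset .. offset + length) as UTF-8 after validating the window. */
    static String decodeWindow(byte[] input, int offset, int length) {
        // Throws IndexOutOfBoundsException if the window does not fit (Java 9+).
        Objects.checkFromIndexSize(offset, length, input.length);
        return new String(input, offset, length, StandardCharsets.UTF_8);
    }
}
```

Bytes outside the window are never looked at, which is why a caller can pass a larger shared buffer without copying it first.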
      • detect

        public Detection detect(java.nio.ByteBuffer input,
                                Hint hint)
        Description copied from interface: Detector
        Detects language and encoding of the supplied ByteBuffer, possibly using a language/encoding hint.
        Specified by:
        detect in interface Detector
        Parameters:
        input - the buffer that is to be inspected, from its current position to its limit
        hint - a hint to the detector, or null for no hint
        Returns:
        a Detection describing the detected language and encoding (never null)
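The phrase "from its current position to its limit" can be illustrated with a small helper that copies exactly those bytes out of a ByteBuffer. This is a sketch with hypothetical names; whether SimpleDetector itself leaves the caller's position untouched is not stated on this page, but the sketch does so by reading through a duplicate.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of SimpleDetector. */
final class BufferWindowSketch {

    /** Copies the bytes between position and limit without moving the caller's position. */
    static byte[] remainingBytes(ByteBuffer input) {
        // duplicate() shares the content but has its own position and limit.
        ByteBuffer view = input.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }
}
```

A caller that wants only part of a buffer inspected can therefore set `position` and `limit` before the call instead of slicing the underlying array.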
      • detect

        public Detection detect(java.lang.String input,
                                Hint hint)
        Description copied from interface: Detector
        Detects language of the supplied String, possibly using a language hint.
        Specified by:
        detect in interface Detector
        Parameters:
        input - the string that is to be inspected
        hint - a hint to the detector, or null for no hint
        Returns:
        a Detection describing the detection result (never null)
      • guessLanguage

        public Language guessLanguage(byte[] buf,
                                      int offset,
                                      int length)
      • guessLanguage

        public Language guessLanguage(java.lang.String input)
      • guessEncoding

        public java.lang.String guessEncoding(byte[] input)