Class DefaultTextTokenizer

  • All Implemented Interfaces:
    TextTokenizer

    @API(EXPERIMENTAL)
    public class DefaultTextTokenizer
    extends Object
    implements TextTokenizer
    This is the default tokenizer used by full-text indexes. It splits the text on whitespace, normalizes the input into Unicode Normalization Form KD (NFKD, compatibility decomposition), case-folds the input to lower case, and strips all diacritical marks. This is appropriate for exact matching in the many languages that use whitespace as their word separator (e.g., most European languages, Korean, and Semitic languages), but it does not handle highly synthetic languages particularly well, nor does it handle languages like Chinese, Japanese, or Thai that do not generally use whitespace to indicate word boundaries.
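    The normalization steps described above can be sketched in standard Java using java.text.Normalizer. This is an illustrative approximation, not the class's actual implementation; in particular, String.toLowerCase is used here as a stand-in for true Unicode case folding, and the helper name normalize is hypothetical.

    ```java
    import java.text.Normalizer;

    public class Main {
        // Sketch of the pipeline: NFKD decomposition, then lower-casing,
        // then stripping combining marks (Unicode category M).
        static String normalize(String token) {
            String decomposed = Normalizer.normalize(token, Normalizer.Form.NFKD);
            return decomposed.toLowerCase().replaceAll("\\p{M}", "");
        }

        public static void main(String[] args) {
            // NFKD splits "é" into "e" plus a combining accent, which is then stripped.
            System.out.println(normalize("Café"));  // prints "cafe"
        }
    }
    ```

    Because both documents and queries pass through the same pipeline, "Café", "CAFE", and "cafe" all normalize to the same token and therefore match exactly.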
    • Field Detail

      • NAME

        @Nonnull
        public static final String NAME
        The name of the default tokenizer. This can be used to explicitly require the default tokenizer in a text index.
        See Also:
        Constant Field Values
    • Method Detail

      • instance

        @Nonnull
        public static DefaultTextTokenizer instance()
        Get this class's singleton. This text tokenizer maintains no state, so only one instance is needed.
        Returns:
        this tokenizer's singleton instance
      • tokenize

        @Nonnull
        public Iterator<String> tokenize​(@Nonnull
                                         String text,
                                         int version,
                                         @Nonnull
                                         TextTokenizer.TokenizerMode mode)
        Tokenize the text based on whitespace. This normalizes the input using the NFKD (compatibility decomposition) normal form, case-folds to lower case, and then strips out diacritical marks. It makes no other attempt to stem words into their base forms, nor does it attempt to split words in synthetic languages or in languages that do not use whitespace as a word separator. This tokenizer behaves identically when used to tokenize documents at index time and when used to tokenize query strings.
        Specified by:
        tokenize in interface TextTokenizer
        Parameters:
        text - source text to split
        version - version of the tokenizer to use to split the text
        mode - ignored as this tokenizer operates the same way at index and query time
        Returns:
        an iterator over whitespace-separated tokens
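        A self-contained sketch of the behavior described for tokenize, written against standard Java only (the method here is a hypothetical stand-in, not the library's implementation, and it omits the version and mode parameters since the text notes that mode is ignored):

        ```java
        import java.text.Normalizer;
        import java.util.Arrays;
        import java.util.Iterator;

        public class Main {
            // Stand-in for tokenize: split on runs of whitespace, then
            // NFKD-normalize, lower-case, and strip diacritics from each token.
            static Iterator<String> tokenize(String text) {
                return Arrays.stream(text.trim().split("\\s+"))
                        .filter(t -> !t.isEmpty())
                        .map(t -> Normalizer.normalize(t, Normalizer.Form.NFKD)
                                .toLowerCase()
                                .replaceAll("\\p{M}", ""))
                        .iterator();
            }

            public static void main(String[] args) {
                Iterator<String> tokens = tokenize("Visitez le CAFÉ");
                while (tokens.hasNext()) {
                    System.out.println(tokens.next());  // visitez, le, cafe
                }
            }
        }
        ```

        Note that the whitespace split happens before normalization, so a query string and an indexed document containing the same words always produce the same token stream.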
      • getName

        @Nonnull
        public String getName()
        Get the name for this tokenizer. For default tokenizers, the name is "default".
        Specified by:
        getName in interface TextTokenizer
        Returns:
        the name of the default tokenizer
      • getMaxVersion

        public int getMaxVersion()
        Get the maximum supported version. Currently, there is only one version of this tokenizer, so the maximum version is the same as the minimum version.
        Specified by:
        getMaxVersion in interface TextTokenizer
        Returns:
        the maximum version supported by this tokenizer