Class DefaultTextTokenizer
- java.lang.Object
-
- com.apple.foundationdb.record.provider.common.text.DefaultTextTokenizer
-
- All Implemented Interfaces:
TextTokenizer
@API(EXPERIMENTAL) public class DefaultTextTokenizer extends Object implements TextTokenizer
This is the default tokenizer used by full-text indexes. It splits the text on whitespace, normalizes the input into Unicode normalization form KD (compatibility decomposition), case-folds the input to lower case, and strips all diacritical marks. This is appropriate for exact matching in many languages (those that use whitespace as their word separator, e.g., most European languages, Korean, Semitic languages, etc.), but it does not handle highly synthetic languages particularly well, nor does it handle languages like Chinese, Japanese, or Thai that do not generally use whitespace to indicate word boundaries.
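The whitespace limitation described above can be illustrated with plain JDK string handling. This is a minimal sketch, not the tokenizer's own code: the class `WhitespaceSplitLimitation` and the method `splitOnWhitespace` are hypothetical names, and the split uses Java's default `\s` character class, which may differ from what the tokenizer treats as whitespace.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the Record Layer API.
final class WhitespaceSplitLimitation {

    // Split on runs of whitespace, returning no tokens for blank input.
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // English separates words with spaces, so each word is its own token.
        System.out.println(splitOnWhitespace("the quick brown fox")); // prints [the, quick, brown, fox]

        // A Chinese sentence of several words written without spaces
        // (given as Unicode escapes) comes back as a single token.
        System.out.println(splitOnWhitespace("\u6211\u559c\u6b22\u8bfb\u4e66").size()); // prints 1
    }
}
```

With whitespace as the only boundary signal, a query for one word inside the unspaced sentence would not match the single long token, which is why such languages generally need a different tokenizer.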
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface com.apple.foundationdb.record.provider.common.text.TextTokenizer
TextTokenizer.TokenizerMode
-
-
Field Summary
Fields
Modifier and Type    Field    Description
static String        NAME     The name of the default tokenizer.
-
Fields inherited from interface com.apple.foundationdb.record.provider.common.text.TextTokenizer
GLOBAL_MIN_VERSION
-
-
Method Summary
Modifier and Type              Method                                                                  Description
int                            getMaxVersion()                                                         Get the maximum supported version.
String                         getName()                                                               Get the name for this tokenizer.
static DefaultTextTokenizer    instance()                                                              Get this class's singleton.
Iterator<String>               tokenize(String text, int version, TextTokenizer.TokenizerMode mode)    Tokenize the text based on whitespace.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.apple.foundationdb.record.provider.common.text.TextTokenizer
getMinVersion, tokenizeToList, tokenizeToMap, validateVersion
-
-
-
-
Field Detail
-
NAME
@Nonnull public static final String NAME
The name of the default tokenizer. This can be used to explicitly require the default tokenizer in a text index.
- See Also:
- Constant Field Values
-
-
Method Detail
-
instance
@Nonnull public static DefaultTextTokenizer instance()
Get this class's singleton. This text tokenizer maintains no state, so only one instance is needed.
- Returns:
- this tokenizer's singleton instance
-
tokenize
@Nonnull public Iterator<String> tokenize(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)
Tokenize the text based on whitespace. This normalizes the input using the NFKD (compatibility decomposition) normal form, case-folds to lower case, and then strips out diacritical marks. It makes no other attempt to stem words into their base forms, nor does it attempt to split words in synthetic languages or in languages that do not use whitespace as a word separator. This tokenizer performs identically when used to tokenize documents at index time and when used to tokenize query strings.
- Specified by:
tokenize in interface TextTokenizer
- Parameters:
text - source text to split
version - version of the tokenizer to use to split the text
mode - ignored as this tokenizer operates the same way at index and query time
- Returns:
- an iterator over whitespace-separated tokens
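The steps described for this method can be approximated with the JDK alone. The following is a sketch under stated assumptions rather than the actual implementation: the class name `NfkdWhitespaceTokenizerSketch` is hypothetical, `toLowerCase(Locale.ROOT)` stands in for case folding, diacritical marks are taken to be the Unicode "Mark" category, and the order of the steps is a guess based on the description.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the DefaultTextTokenizer source.
final class NfkdWhitespaceTokenizerSketch {

    static Iterator<String> tokenize(String text) {
        // 1. NFKD: accented letters decompose into a base letter plus a
        //    combining mark, and compatibility characters such as the
        //    "fi" ligature expand to their plain equivalents.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);

        // 2. Lower-case with a fixed locale (assumed stand-in for case folding).
        String lowered = decomposed.toLowerCase(Locale.ROOT);

        // 3. Drop the combining marks left behind by the decomposition.
        String stripped = lowered.replaceAll("\\p{M}+", "");

        // 4. Split on whitespace runs, skipping any empty leading piece.
        List<String> tokens = new ArrayList<>();
        for (String token : stripped.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.iterator();
    }

    public static void main(String[] args) {
        // Input is "Cafe Noel  ECOLE" with accents, written as Unicode escapes.
        Iterator<String> it = tokenize("Caf\u00e9 No\u00ebl  \u00c9COLE");
        while (it.hasNext()) {
            System.out.println(it.next()); // prints cafe, noel, ecole on separate lines
        }
    }
}
```

Because the same function would run on both documents and query strings, a query for "ecole" matches text containing the accented, upper-case form, which is the exact-matching behavior described above.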
-
getName
@Nonnull public String getName()
Get the name for this tokenizer. For the default tokenizer, the name is "default".
- Specified by:
getName in interface TextTokenizer
- Returns:
- the name of the default tokenizer
-
getMaxVersion
public int getMaxVersion()
Get the maximum supported version. Currently, there is only one version of this tokenizer, so the maximum version is the same as the minimum version.
- Specified by:
getMaxVersion in interface TextTokenizer
- Returns:
- the maximum version supported by this tokenizer
-
-