Interface TextTokenizer

- All Known Implementing Classes:
  DefaultTextTokenizer

@API(EXPERIMENTAL) public interface TextTokenizer
An interface to tokenize text fields. Implementations of this interface should handle normalization, stemming, case-folding, stop-word removal, and all other analysis steps needed to generate a token list from raw text. When indexing, the TextIndexMaintainer will use an instance of this class to generate the token list from text and use that to generate the position list for each token. Each implementation should also provide a TextTokenizerFactory that instantiates instances of that class. The factory class can then be picked up by a TextTokenizerRegistry and used by the TextIndexMaintainer to tokenize text while indexing.

To correctly maintain indexes, it is important that each tokenizer be deterministic for a given input, and that the tokenizing logic be frozen once data are written to the database with that tokenizer. To support backwards-incompatible tokenizer updates (for example, adding or removing stop words or making different normalization decisions), one can create a new "version" of that tokenizer. The version number is passed to the tokenizer at tokenize-time through the version parameter. One should continue to support older versions until all data that were written with the old version have been migrated to the new version. At that time, one can drop support for the older version by increasing the value returned by getMinVersion() to exclude the older tokenizers.

- See Also:
  TextTokenizerRegistry, TextIndexMaintainer
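To illustrate the versioning contract described above, here is a hypothetical stand-alone sketch (not the shipped DefaultTextTokenizer, and simplified to omit the TokenizerMode parameter): version 0 splits only on whitespace, while version 1 also splits on punctuation. Because old behavior is kept behind the version check, data written at version 0 can still be re-tokenized identically.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Locale;

// Hypothetical version-aware tokenizer sketch. Deterministic for a given
// (text, version) pair, as the interface contract requires.
final class VersionedTokenizer {
    public Iterator<String> tokenize(String text, int version) {
        String normalized = text.toLowerCase(Locale.ROOT);
        String[] tokens = (version >= 1)
                ? normalized.split("[^\\p{L}\\p{N}]+")   // v1: split on punctuation too
                : normalized.split("\\s+");              // v0: whitespace only
        return Arrays.asList(tokens).iterator();
    }

    public int getMinVersion() { return 0; }
    public int getMaxVersion() { return 1; }
}
```

A tokenizer upgraded this way keeps returning the old token stream for documents indexed at version 0 until they are migrated, at which point getMinVersion() can be raised.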
Nested Class Summary

- static class TextTokenizer.TokenizerMode
  Mode that can be used to alter tokenizer behavior depending on the context it is used in.
Field Summary

- static int GLOBAL_MIN_VERSION
  The absolute minimum tokenizer version.
Method Summary

- int getMaxVersion()
  The maximum supported version of this tokenizer.
- default int getMinVersion()
  The minimum supported version of this tokenizer.
- String getName()
  Get the name of this tokenizer.
- Iterator<? extends CharSequence> tokenize(String text, int version, TextTokenizer.TokenizerMode mode)
  Create a stream of tokens from the given input text.
- default List<String> tokenizeToList(String text, int version, TextTokenizer.TokenizerMode mode)
  Create a list of tokens from the given input text.
- default Map<String,List<Integer>> tokenizeToMap(String text, int version, TextTokenizer.TokenizerMode mode)
  Create a map from tokens to their offset lists from the given input text.
- default void validateVersion(int version)
  Verify that the provided version is supported by this tokenizer.
Field Detail

GLOBAL_MIN_VERSION

static final int GLOBAL_MIN_VERSION

The absolute minimum tokenizer version. All tokenizers should begin at this version and work their way up. If no explicit tokenizer version is included in the meta-data, the index maintainer will use this version.

- See Also:
  Constant Field Values
Method Detail

tokenize

@Nonnull Iterator<? extends CharSequence> tokenize(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)

Create a stream of tokens from the given input text. This should encapsulate all analysis done on the text to produce a sensible token list, i.e., the user should not assume that any normalization or text processing is done on the results from this function. To indicate the presence of an un-indexed token (like a stop word), this should emit the empty string in its place (so that the position list is correct). The version parameter of this method can be used to maintain old behavior when necessary as the tokenizer is updated.

- Parameters:
  text - source text to tokenize
  version - version of the tokenizer to use
  mode - whether this tokenizer is being used to index a document or query a set of documents
- Returns:
  a stream of tokens retrieved from the text
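The stop-word convention above can be sketched as follows. This is a hypothetical stand-alone example (the stop-word set and the omission of the TokenizerMode parameter are assumptions for illustration): removed words are replaced with the empty string so that downstream position lists stay aligned with the original text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch of a stop-word-aware tokenize(): stop words are emitted as ""
// placeholders rather than dropped, preserving token positions.
final class StopWordTokenizer {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of");

    public Iterator<String> tokenize(String text, int version) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Emit the empty string in place of a stop word.
            tokens.add(STOP_WORDS.contains(word) ? "" : word);
        }
        return tokens.iterator();
    }
}
```

Tokenizing "the quick fox" this way yields a placeholder at position 0, so "quick" and "fox" keep positions 1 and 2.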
tokenizeToMap

@Nonnull default Map<String,List<Integer>> tokenizeToMap(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)

Create a map from tokens to their offset lists from the given input text. This should be consistent with the tokenize() function in that it should apply the same analysis to the token list as that function does (or call that function directly). By default, this calls tokenize() to produce a token stream and then inserts each token into a map. It keeps track of the current number of tokens and updates the value of the map with additional offsets. More exotic implementations of this function could, for example, decide to stop tokenizing the source text after reaching a maximum number of unique tokens or write a different offset list than would be produced by default. But if that behavior changes, it is the responsibility of the tokenizer maintainer to bump the version of this tokenizer so that the old behavior can be reliably replicated at a future date.

The TextIndexMaintainer will use this method to tokenize a document into a map and place each entry into the database. This method is not used by queries, which use the tokenize method instead.

- Parameters:
  text - source text to tokenize
  version - version of the tokenizer to use
  mode - whether this tokenizer is being used to index a document or query a set of documents
- Returns:
  a mapping from token to a list of offsets in the original text
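A minimal sketch of the default map-building step, written against an already-produced token stream rather than the full interface. One assumption here: empty-string placeholders (stop words) advance the position counter but are not stored as map keys, which keeps the recorded offsets consistent with the original text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of the default tokenizeToMap() behavior: walk the token stream,
// tracking each token's position, and append that position to the token's
// offset list.
final class TokenMapper {
    static Map<String, List<Integer>> tokenizeToMap(Iterator<? extends CharSequence> tokens) {
        Map<String, List<Integer>> offsets = new HashMap<>();
        int position = 0;
        while (tokens.hasNext()) {
            String token = tokens.next().toString();
            if (!token.isEmpty()) {  // skip stop-word placeholders
                offsets.computeIfAbsent(token, ignore -> new ArrayList<>()).add(position);
            }
            position++;  // placeholders still advance the position
        }
        return offsets;
    }
}
```

For the stream ["hello", "", "world", "hello"], this produces {hello=[0, 3], world=[2]}.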
tokenizeToList

default List<String> tokenizeToList(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)

Create a list of tokens from the given input text. By default, this will just run the tokenize() method on the given text at the given version and then add all of the elements to a list.

- Parameters:
  text - source text to tokenize
  version - version of the tokenizer to use
  mode - whether this tokenizer is being used to index a document or query a set of documents
- Returns:
  a list of tokens retrieved from the text
getMinVersion

default int getMinVersion()

The minimum supported version of this tokenizer. By default, this is the global minimum version, which indicates that this tokenizer can tokenize strings at all versions that this tokenizer type has ever been able to tokenize. However, if this tokenizer has dropped support for some older format, then this function should be implemented to return a higher version.

- Returns:
  the minimum supported tokenizer version
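A small hypothetical sketch of dropping an old version: once all data written at version 0 has been migrated, a tokenizer can override getMinVersion() to exclude it.

```java
// Hypothetical tokenizer that has migrated away from its version-0 format.
final class MigratedTokenizer {
    static final int GLOBAL_MIN_VERSION = 0;  // mirrors TextTokenizer.GLOBAL_MIN_VERSION

    public int getMinVersion() {
        return 1;  // version 0 is no longer accepted after migration
    }

    public int getMaxVersion() {
        return 1;  // must be >= getMinVersion()
    }
}
```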
getMaxVersion

int getMaxVersion()

The maximum supported version of this tokenizer. This should be greater than or equal to the minimum version.

- Returns:
  the maximum supported tokenizer version
getName

@Nonnull String getName()

Get the name of this tokenizer. This should be the same name as is returned by the getName() method of the corresponding TextTokenizerFactory.

- Returns:
  this tokenizer's name
validateVersion

default void validateVersion(int version)

Verify that the provided version is supported by this tokenizer. This makes sure that the version given is greater than or equal to the minimum version and less than or equal to the maximum version.

- Parameters:
  version - tokenizer version to verify is in bounds
- Throws:
  RecordCoreArgumentException - if the version is out of bounds
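The bounds check can be sketched as follows. This stand-alone version takes the bounds as constructor arguments and throws IllegalArgumentException in place of the Record Layer's RecordCoreArgumentException, which is not available outside that library.

```java
// Sketch of the default validateVersion() logic: the version must lie in
// the closed interval [minVersion, maxVersion].
final class VersionValidator {
    private final int minVersion;
    private final int maxVersion;

    VersionValidator(int minVersion, int maxVersion) {
        this.minVersion = minVersion;
        this.maxVersion = maxVersion;
    }

    public void validateVersion(int version) {
        if (version < minVersion || version > maxVersion) {
            // Stand-in for RecordCoreArgumentException.
            throw new IllegalArgumentException(
                    "tokenizer version " + version + " out of bounds ["
                    + minVersion + ", " + maxVersion + "]");
        }
    }
}
```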