@API(value=EXPERIMENTAL) public interface TextTokenizer
The TextIndexMaintainer will use an instance of this interface to generate the token list from text, and will use that list to generate the position list for each token. Each implementation should also provide a TextTokenizerFactory that creates instances of that implementation. The factory class can then be picked up by a TextTokenizerRegistry and used by the TextIndexMaintainer to tokenize text while indexing.
To correctly maintain indexes, it is important that each tokenizer be deterministic for a given input, and that the tokenizing logic be frozen once data are written to the database with that tokenizer. To support backwards-incompatible tokenizer updates (for example, adding or removing stop words or making different normalization decisions), one can create a new "version" of that tokenizer. The version number is passed to the tokenizer at tokenize-time through the version parameter. One should continue to support using older versions until such time as all data that were written with the old version have been migrated to the new version. At that time, one can drop support for the older version by increasing the value returned by getMinVersion() to exclude the older tokenizers.
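The versioning contract above can be illustrated with a minimal sketch. Everything in it is hypothetical: `VersionedTokenizerSketch` is a simplified stand-in for this interface (it omits the mode parameter and the other default methods), the value of `GLOBAL_MIN_VERSION` is assumed, and `StopWordTokenizerSketch` is an invented whitespace tokenizer whose version 1 introduced stop-word removal while version 0 stays frozen.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, simplified stand-in for the documented interface.
interface VersionedTokenizerSketch {
    int GLOBAL_MIN_VERSION = 0; // assumed value, for illustration only

    Iterator<? extends CharSequence> tokenize(String text, int version);

    int getMaxVersion();

    // Mirrors the documented default: support every version ever released.
    default int getMinVersion() {
        return GLOBAL_MIN_VERSION;
    }
}

// Hypothetical tokenizer: version 0 splits on whitespace and lower-cases;
// version 1 additionally drops a fixed set of stop words.
final class StopWordTokenizerSketch implements VersionedTokenizerSketch {
    private static final Set<String> STOP_WORDS_V1 = Set.of("the", "a", "of");

    @Override
    public Iterator<String> tokenize(String text, int version) {
        if (version < getMinVersion() || version > getMaxVersion()) {
            throw new IllegalArgumentException("unsupported tokenizer version: " + version);
        }
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // produced only when the input is blank
            }
            String token = raw.toLowerCase(Locale.ROOT);
            // The version 0 logic is frozen; only version 1 and later drop stop words.
            if (version >= 1 && STOP_WORDS_V1.contains(token)) {
                continue;
            }
            tokens.add(token);
        }
        return tokens.iterator();
    }

    @Override
    public int getMaxVersion() {
        return 1;
    }
}
```

Because the same input must always produce the same tokens at a given version, the version 0 branch is never edited. Once all data written at version 0 have been migrated, the sketch could override `getMinVersion()` to return 1.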
See Also: TextTokenizerRegistry, TextIndexMaintainer
Modifier and Type | Interface and Description |
---|---|
static class | TextTokenizer.TokenizerMode: Mode that can be used to alter tokenizer behavior depending on the context it is used in. |
Modifier and Type | Field and Description |
---|---|
static int | GLOBAL_MIN_VERSION: The absolute minimum tokenizer version. |
Modifier and Type | Method and Description |
---|---|
int | getMaxVersion(): The maximum supported version of this tokenizer. |
default int | getMinVersion(): The minimum supported version of this tokenizer. |
String | getName(): Get the name of this tokenizer. |
Iterator<? extends CharSequence> | tokenize(String text, int version, TextTokenizer.TokenizerMode mode): Create a stream of tokens from the given input text. |
default List<String> | tokenizeToList(String text, int version, TextTokenizer.TokenizerMode mode): Create a list of tokens from the given input text. |
default Map<String,List<Integer>> | tokenizeToMap(String text, int version, TextTokenizer.TokenizerMode mode): Create a map from tokens to their offset lists from the given input text. |
default void | validateVersion(int version): Verify that the provided version is supported by this tokenizer. |
static final int GLOBAL_MIN_VERSION
@Nonnull Iterator<? extends CharSequence> tokenize(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)
Create a stream of tokens from the given input text.
Parameters:
text - source text to tokenize
version - version of the tokenizer to use
mode - whether this tokenizer is being used to index a document or query a set of documents
@Nonnull default Map<String,List<Integer>> tokenizeToMap(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)
Create a map from tokens to their offset lists from the given input text. This should be consistent with the tokenize() function in that it should apply the same analysis on the token list as that function does (or call that function directly). By default, this calls tokenize() to produce a token stream and then inserts each token into a map. It keeps track of the current number of tokens and updates the value of the map with additional offsets. More exotic implementations of this function could, for example, decide to stop tokenizing the source text after reaching a maximum number of unique tokens or write a different offset list than would be done by default. But if that behavior changes, it is the responsibility of the tokenizer maintainer to bump the version of this tokenizer so that the old behavior can be reliably replicated at a future date.
The TextIndexMaintainer will use this method to tokenize a document into a map and place each entry into the database. This method is not used by queries (which use the tokenize method instead).
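The default behavior described above (walk the token stream, count tokens, and append each token's position to its offset list) can be sketched as follows. This is an illustration rather than the library's implementation: the class and method names are hypothetical, and it assumes that offsets are zero-based token positions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the documented API.
final class TokenOffsetsSketch {
    // Consume a token stream and record, for each distinct token,
    // the positions in the stream at which it occurs.
    static Map<String, List<Integer>> toOffsetMap(Iterator<? extends CharSequence> tokens) {
        Map<String, List<Integer>> offsets = new LinkedHashMap<>();
        int position = 0; // running count of tokens seen so far
        while (tokens.hasNext()) {
            String token = tokens.next().toString();
            offsets.computeIfAbsent(token, ignored -> new ArrayList<>()).add(position);
            position++;
        }
        return offsets;
    }

    public static void main(String[] args) {
        Iterator<String> stream = List.of("to", "be", "or", "not", "to", "be").iterator();
        // prints {to=[0, 4], be=[1, 5], or=[2], not=[3]}
        System.out.println(toOffsetMap(stream));
    }
}
```

A `LinkedHashMap` is used here only so that the printed order is predictable; the documented return type is a plain `Map`, so callers should not rely on iteration order.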
Parameters:
text - source text to tokenize
version - version of the tokenizer to use
mode - whether this tokenizer is being used to index a document or query a set of documents
default List<String> tokenizeToList(@Nonnull String text, int version, @Nonnull TextTokenizer.TokenizerMode mode)
Create a list of tokens from the given input text. By default, this calls the tokenize() method on the given text at the given version and then adds all of the elements to a list.
Parameters:
text - source text to tokenize
version - version of the tokenizer to use
mode - whether this tokenizer is being used to index a document or query a set of documents
default int getMinVersion()
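The minimum supported version of this tokenizer. By default, this is the global minimum version, which indicates that this tokenizer can tokenize strings at all versions that this tokenizer type has ever been able to tokenize. However, if this tokenizer has dropped support for some older format, then this function should be implemented to return the lowest version that is still supported.
int getMaxVersion()
The maximum supported version of this tokenizer.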
@Nonnull String getName()
Get the name of this tokenizer. This name corresponds to the one returned by the TextTokenizerFactory's getName() method.
default void validateVersion(int version)
Verify that the provided version is supported by this tokenizer.
Parameters:
version - tokenizer version to verify is in bounds
Throws:
RecordCoreArgumentException - if the version is out of bounds
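A bounds check of this kind might look like the sketch below. It is an assumption about the shape of the check, not the library's code: the class name is hypothetical, the bounds are passed in explicitly rather than read from getMinVersion() and getMaxVersion(), both bounds are assumed to be inclusive, and `IllegalArgumentException` stands in for RecordCoreArgumentException so that the example has no external dependencies.

```java
// Hypothetical helper, not part of the documented API.
final class VersionBoundsSketch {
    // Reject any version outside the inclusive range [minVersion, maxVersion].
    static void validateVersion(int version, int minVersion, int maxVersion) {
        if (version < minVersion || version > maxVersion) {
            // The real method is documented to throw RecordCoreArgumentException.
            throw new IllegalArgumentException(
                    "tokenizer version " + version + " is out of bounds ["
                            + minVersion + ", " + maxVersion + "]");
        }
    }

    public static void main(String[] args) {
        validateVersion(1, 0, 2); // in bounds, returns normally
        try {
            validateVersion(3, 0, 2);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```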