TextTokenizer (fdb-record-layer-core 2.8.89.0 API)

All Known Implementing Classes:

DefaultTextTokenizer
```
@API(value=EXPERIMENTAL)
public interface TextTokenizer
```
An interface to tokenize text fields. Implementations of this interface should handle normalization, stemming, case-folding, stop-word removal, and all other analysis steps to generate a token list from raw text. When indexing, the TextIndexMaintainer will use an instance of an implementing class to generate the token list from text and use that list to generate the position list for each token. Each implementation should also provide a TextTokenizerFactory that creates instances of that class. The factory class can then be picked up by a TextTokenizerRegistry and used by the TextIndexMaintainer to tokenize text while indexing.
To correctly maintain indexes, it is important that each tokenizer be deterministic for a given input, and that the tokenizing logic be frozen once data are written to the database with that tokenizer. To support backwards-incompatible tokenizer updates (for example, adding or removing stop words or making different normalization decisions), one can create a new "version" of that tokenizer. The version number is passed to the tokenizer at tokenize-time through the version parameter. One should continue to support using older versions until such time as all data that were written with the old version have been migrated to the new version. At that time, one can drop support for the older version by increasing the value returned by getMinVersion() to exclude the older tokenizers.
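
The versioning rule above can be illustrated with a minimal, stand-alone sketch. The class `VersionedTokenizerSketch`, its constants, and its normalization rules are hypothetical and are not part of the library; a real tokenizer would implement TextTokenizer and also receive a TokenizerMode. The point is only that version 0 behavior stays frozen while version 1 makes a different normalization decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; it does not implement the real TextTokenizer interface.
final class VersionedTokenizerSketch {
    static final int MIN_VERSION = 0; // stands in for GLOBAL_MIN_VERSION (assumed value)
    static final int MAX_VERSION = 1;

    static List<String> tokenize(String text, int version) {
        if (version < MIN_VERSION || version > MAX_VERSION) {
            throw new IllegalArgumentException("unsupported tokenizer version: " + version);
        }
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            // Version 0: lower-case only. This rule must never change once data are written.
            String token = word.toLowerCase(Locale.ROOT);
            if (version >= 1) {
                // Version 1: additionally strip everything that is not a letter or digit.
                token = token.replaceAll("[^\\p{L}\\p{N}]", "");
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

Under these assumptions, `tokenize("Hello, World!", 0)` keeps the punctuation attached to each token, while `tokenize("Hello, World!", 1)` removes it. Data indexed at version 0 can therefore still be queried consistently after version 1 is introduced.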

See Also:

TextTokenizerRegistry, TextIndexMaintainer

Nested Class Summary

Nested Classes
Modifier and Type	Interface and Description
`static class`	`TextTokenizer.TokenizerMode` Mode that can be used to alter tokenizer behavior depending on the context it is used in.

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`GLOBAL_MIN_VERSION` The absolute minimum tokenizer version.

Method Summary

Modifier and Type	Method and Description
`int`	`getMaxVersion()` The maximum supported version of this tokenizer.
`default int`	`getMinVersion()` The minimum supported version of this tokenizer.
`String`	`getName()` Get the name of this tokenizer.
`Iterator<? extends CharSequence>`	`tokenize(String text, int version, TextTokenizer.TokenizerMode mode)` Create a stream of tokens from the given input text.
`default List<String>`	`tokenizeToList(String text, int version, TextTokenizer.TokenizerMode mode)` Create a list of tokens from the given input text.
`default Map<String,List<Integer>>`	`tokenizeToMap(String text, int version, TextTokenizer.TokenizerMode mode)` Create a map from tokens to their offset lists from the given input text.
`default void`	`validateVersion(int version)` Verify that the provided version is supported by this tokenizer.

- Field Detail
  - GLOBAL_MIN_VERSION
```
static final int GLOBAL_MIN_VERSION
```
    The absolute minimum tokenizer version. All tokenizers should begin at this version and work their way up. If no explicit tokenizer version is included in the meta-data, the index maintainer will use this version.
    
    See Also:
    
    Constant Field Values
- Method Detail
  - tokenize
```
@Nonnull
Iterator<? extends CharSequence> tokenize(@Nonnull
                                                   String text,
                                                   int version,
                                                   @Nonnull
                                                   TextTokenizer.TokenizerMode mode)
```
    Create a stream of tokens from the given input text. This should encapsulate all analysis done on the text to produce a sensible token list, i.e., the user should not assume that any normalization or text processing is done on the results from this function. To indicate the presence of an un-indexed token (like a stop word), this should emit the empty string in its place (so that the position list is correct). The version parameter of this method can be used to maintain old behavior when necessary as the tokenizer is updated.
    
    Parameters:
    
    text - source text to tokenize
    
    version - version of the tokenizer to use
    
    mode - whether this tokenizer is being used to index a document or query a set of documents
    
    Returns:
    
    a stream of tokens retrieved from the text
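
    The empty-string convention for un-indexed tokens can be sketched as follows. The class `StopWordPlaceholderSketch` and its stop-word set are hypothetical, and the method omits the version and mode parameters so that the example compiles without the library; it is an illustration of the convention, not the behavior of any shipped tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; it does not implement the real TextTokenizer interface.
final class StopWordPlaceholderSketch {
    // Assumed stop-word list, chosen for the example.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    static Iterator<? extends CharSequence> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            String normalized = word.toLowerCase(Locale.ROOT);
            // Emit "" in place of a stop word so later tokens keep their positions.
            tokens.add(STOP_WORDS.contains(normalized) ? "" : normalized);
        }
        return tokens.iterator();
    }
}
```

    For the input "the wizard of oz", this sketch yields four tokens, with empty strings at positions 0 and 2, so "wizard" remains at position 1 and "oz" at position 3.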
  - tokenizeToMap
```
@Nonnull
default Map<String,List<Integer>> tokenizeToMap(@Nonnull
                                                         String text,
                                                         int version,
                                                         @Nonnull
                                                         TextTokenizer.TokenizerMode mode)
```
    Create a map from tokens to their offset lists from the given input text. This should be consistent with the tokenize() function in that it should apply the same analysis on the token list as that function does (or call that function directly). By default, this calls tokenize() to produce a token stream and then inserts each token into a map. It keeps track of the current number of tokens and updates the value of the map with additional offsets. More exotic implementations of this function could, for example, decide to stop tokenizing the source text after reaching a maximum number of unique tokens or write a different offset list than would be done by default. But if that behavior changes, it is the responsibility of the tokenizer maintainer to bump the version of this tokenizer so that the old behavior can be reliably replicated at a future date.
    The TextIndexMaintainer will use this method to tokenize a document into a map and place each entry into the database. This method is not used by queries (which use the tokenize method instead).
    
    Parameters:
    
    text - source text to tokenize
    
    version - version of the tokenizer to use
    
    mode - whether this tokenizer is being used to index a document or query a set of documents
    
    Returns:
    
    a mapping from token to a list of offsets in the original text
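
    The default behavior described above (walk the token stream, count tokens, and append each token's position to its list) might look roughly like the sketch below. The class and method names are hypothetical. The sketch assumes that offsets are token positions and that empty placeholder tokens advance the position without being stored; neither detail is confirmed by this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the library's implementation.
final class TokenOffsetMapSketch {
    static Map<String, List<Integer>> toOffsetMap(Iterator<? extends CharSequence> tokens) {
        Map<String, List<Integer>> offsets = new HashMap<>();
        int position = 0; // number of tokens seen so far
        while (tokens.hasNext()) {
            String token = tokens.next().toString();
            if (!token.isEmpty()) {
                // Assumption: placeholders for un-indexed tokens are not stored.
                offsets.computeIfAbsent(token, ignored -> new ArrayList<>()).add(position);
            }
            position++; // placeholders still consume a position
        }
        return offsets;
    }
}
```

    Given the token stream "", "cat", "", "hat", "cat", this sketch maps "cat" to positions 1 and 4 and "hat" to position 3.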
  - tokenizeToList
```
default List<String> tokenizeToList(@Nonnull
                                    String text,
                                    int version,
                                    @Nonnull
                                    TextTokenizer.TokenizerMode mode)
```
    Create a list of tokens from the given input text. By default, this will just run the tokenize() method on the given text at the given version and then add all of the elements to a list.
    
    Parameters:
    
    text - source text to tokenize
    
    version - version of the tokenizer to use
    
    mode - whether this tokenizer is being used to index a document or query a set of documents
    
    Returns:
    
    a list of tokens retrieved from the text
  - getMinVersion
```
default int getMinVersion()
```
    The minimum supported version of this tokenizer. By default, this is the global minimum version, which indicates that this tokenizer can tokenize strings at all versions that this tokenizer type has ever been able to tokenize. However, if this tokenizer has dropped support for some older format, then this function should be implemented to return the lowest version that is still supported.
    
    Returns:
    
    the minimum supported tokenizer version
  - getMaxVersion
```
int getMaxVersion()
```
    The maximum supported version of this tokenizer. This should be greater than or equal to the minimum version.
    
    Returns:
    
    the maximum supported tokenizer version
  - getName
```
@Nonnull
String getName()
```
    Get the name of this tokenizer. This should be the same name as is returned by the corresponding TextTokenizerFactory's getName() method.
    
    Returns:
    
    this tokenizer's name
  - validateVersion
```
default void validateVersion(int version)
```
    Verify that the provided version is supported by this tokenizer. This makes sure that the version given is greater than or equal to the minimum version and less than or equal to the maximum version.
    
    Parameters:
    
    version - tokenizer version to verify is in bounds
    
    Throws:
    
    RecordCoreArgumentException - if the version is out of bounds
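
    The bounds check can be sketched in isolation as shown below. The class `VersionBoundsSketch` is hypothetical, the bounds are passed explicitly instead of being read from getMinVersion() and getMaxVersion(), and IllegalArgumentException stands in for RecordCoreArgumentException so that the example compiles without the library.

```java
// Hypothetical illustration only; not the library's implementation.
final class VersionBoundsSketch {
    static void validateVersion(int version, int minVersion, int maxVersion) {
        // Both bounds are inclusive, as described above.
        if (version < minVersion || version > maxVersion) {
            throw new IllegalArgumentException(
                    "tokenizer version " + version + " is outside the supported range ["
                            + minVersion + ", " + maxVersion + "]");
        }
    }
}
```

    With a supported range of 0 to 2, versions 0, 1, and 2 pass silently and version 3 is rejected.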
