Interface TextTokenizer

  • All Known Implementing Classes:
    DefaultTextTokenizer

    @API(EXPERIMENTAL)
    public interface TextTokenizer
    An interface to tokenize text fields. Implementations of this interface should handle normalization, stemming, case-folding, stop-word removal, and all other analysis steps needed to generate a token list from raw text. When indexing, the TextIndexMaintainer will use an instance of this class to generate the token list from text and use that to generate the position list for each token. Each implementation should also provide a TextTokenizerFactory that creates instances of that class. The factory class can then be picked up by a TextTokenizerRegistry and used by the TextIndexMaintainer to tokenize text while indexing.

    To correctly maintain indexes, it is important that each tokenizer be deterministic for a given input, and that the tokenizing logic be frozen once data are written to the database with that tokenizer. To support backwards-incompatible tokenizer updates (for example, adding or removing stop words or making different normalization decisions), one can create a new "version" of that tokenizer. The version number is passed to the tokenizer at tokenize-time through the version parameter. One should continue to support using older versions until such time as all data that were written with the old version have been migrated to the new version. At that time, one can drop support for the older version by increasing the value returned by getMinVersion() to exclude the older tokenizers.
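
    As a rough illustration (not part of this interface's documentation), a versioned implementation might look like the following sketch. The class name, the whitespace-splitting strategy, and the stop-word lists are all hypothetical, and the sketch assumes that tokenize and getMaxVersion are the only abstract methods (the other methods shown below have defaults); a corresponding TextTokenizerFactory, not shown, would hand out instances of this class.

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.Iterator;
        import java.util.Locale;
        import java.util.Set;
        import javax.annotation.Nonnull;

        // Hypothetical tokenizer: splits on whitespace, lower-cases, and blanks out
        // stop words. Version 1 added "an" to the stop-word list, so version 0
        // behavior remains reproducible for data written before the change.
        public class SimpleTextTokenizer implements TextTokenizer {
            private static final Set<String> STOP_WORDS_V0 = new HashSet<>(Arrays.asList("the", "a"));
            private static final Set<String> STOP_WORDS_V1 = new HashSet<>(Arrays.asList("the", "a", "an"));

            @Nonnull
            @Override
            public Iterator<? extends CharSequence> tokenize(@Nonnull String text, int version,
                                                             @Nonnull TokenizerMode mode) {
                validateVersion(version);
                final Set<String> stopWords = (version == 0) ? STOP_WORDS_V0 : STOP_WORDS_V1;
                return Arrays.stream(text.split("\\s+"))
                        .map(word -> word.toLowerCase(Locale.ROOT))
                        // emit the empty string for stop words so token positions stay aligned
                        .map(word -> stopWords.contains(word) ? "" : word)
                        .iterator();
            }

            @Override
            public int getMaxVersion() {
                return 1; // bumped when the stop-word list changed
            }
        }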

    See Also:
    TextTokenizerRegistry, TextIndexMaintainer
    • Field Detail

      • GLOBAL_MIN_VERSION

        static final int GLOBAL_MIN_VERSION
        The absolute minimum tokenizer version. All tokenizers should begin at this version and work their way up. If no explicit tokenizer version is included in the meta-data, the index maintainer will use this version.
        See Also:
        Constant Field Values
    • Method Detail

      • tokenize

        @Nonnull
        Iterator<? extends CharSequence> tokenize(@Nonnull
                                                  String text,
                                                  int version,
                                                  @Nonnull
                                                  TextTokenizer.TokenizerMode mode)
        Create a stream of tokens from the given input text. This should encapsulate all analysis done on the text to produce a sensible token list; that is, the user should not need to perform any further normalization or processing on the tokens this function returns. To indicate the presence of an un-indexed token (such as a stop word), this should emit the empty string in its place (so that the position list remains correct). The version parameter of this method can be used to maintain old behavior when necessary as the tokenizer is updated.
        Parameters:
        text - source text to tokenize
        version - version of the tokenizer to use
        mode - whether this tokenizer is being used to index a document or query a set of documents
        Returns:
        a stream of tokens retrieved from the text
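
        For example, with the hypothetical SimpleTextTokenizer sketched above (the TokenizerMode constant used here is an assumption):

            // Tokenize at version 1; "the" is a stop word, so an empty token holds its place.
            Iterator<? extends CharSequence> tokens = new SimpleTextTokenizer()
                    .tokenize("the quick fox", 1, TextTokenizer.TokenizerMode.INDEX);
            // Yields: "", "quick", "fox" -- "quick" keeps position 1 and "fox" position 2,
            // matching their positions in the original text.
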
      • tokenizeToMap

        @Nonnull
        default Map<String,List<Integer>> tokenizeToMap(@Nonnull
                                                              String text,
                                                              int version,
                                                              @Nonnull
                                                              TextTokenizer.TokenizerMode mode)
        Create a map from tokens to their offset lists from the given input text. This should be consistent with the tokenize() function in that it should apply the same analysis to the token list as that function does (or call that function directly). By default, this calls tokenize() to produce a token stream and then inserts each token into a map, tracking the current token position and appending each position to the corresponding token's offset list. More exotic implementations of this function could, for example, decide to stop tokenizing the source text after reaching a maximum number of unique tokens, or write a different offset list than would be produced by default. But if that behavior changes, it is the responsibility of the tokenizer maintainer to bump the version of this tokenizer so that the old behavior can be reliably replicated at a future date.

        The TextIndexMaintainer will use this method to tokenize a document into a map and place each entry into the database. This method is not used by queries, which use the tokenize method instead.

        Parameters:
        text - source text to tokenize
        version - version of the tokenizer to use
        mode - whether this tokenizer is being used to index a document or query a set of documents
        Returns:
        a mapping from token to a list of offsets in the original text
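
        In sketch form, the default behavior described above amounts to the following (the helper name is hypothetical, and the exclusion of empty placeholder tokens from the map is inferred from the tokenize() documentation; assumes java.util.ArrayList, HashMap, Iterator, List, and Map):

            // Walk the token stream, tracking each token's position; append that
            // position to the token's offset list. Empty placeholder tokens advance
            // the position counter but are not stored in the map.
            static Map<String, List<Integer>> tokenizeToMapSketch(TextTokenizer tokenizer, String text,
                                                                  int version, TextTokenizer.TokenizerMode mode) {
                final Map<String, List<Integer>> offsets = new HashMap<>();
                final Iterator<? extends CharSequence> tokens = tokenizer.tokenize(text, version, mode);
                int position = 0;
                while (tokens.hasNext()) {
                    final String token = tokens.next().toString();
                    if (!token.isEmpty()) {
                        offsets.computeIfAbsent(token, ignore -> new ArrayList<>()).add(position);
                    }
                    position++;
                }
                return offsets;
            }
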
      • tokenizeToList

        default List<String> tokenizeToList(@Nonnull
                                            String text,
                                            int version,
                                            @Nonnull
                                            TextTokenizer.TokenizerMode mode)
        Create a list of tokens from the given input text. By default, this will just run the tokenize() method on the given text at the given version and then add all of the elements to a list.
        Parameters:
        text - source text to tokenize
        version - version of the tokenizer to use
        mode - whether this tokenizer is being used to index a document or query a set of documents
        Returns:
        a list of tokens retrieved from the text
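
        In sketch form, the default is just a collection step over tokenize() (helper name hypothetical):

            // A plausible shape for the default: drain the token stream into a list,
            // preserving empty placeholder tokens so positions can be recovered.
            default List<String> tokenizeToListSketch(String text, int version, TextTokenizer.TokenizerMode mode) {
                final List<String> tokens = new ArrayList<>();
                tokenize(text, version, mode).forEachRemaining(token -> tokens.add(token.toString()));
                return tokens;
            }
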
      • getMinVersion

        default int getMinVersion()
        The minimum supported version of this tokenizer. By default, this is the global minimum version, which indicates that this tokenizer can tokenize strings at all versions that this tokenizer type has ever been able to tokenize. However, if this tokenizer has dropped support for some older format, then this method should be overridden to return the smallest version that it can still tokenize.
        Returns:
        the minimum supported tokenizer version
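
        For instance, once all version-0 data has been re-indexed, an implementation might drop that version with an override like this sketch:

            @Override
            public int getMinVersion() {
                return 1; // version-0 data has been fully re-indexed, so support can be dropped
            }
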
      • getMaxVersion

        int getMaxVersion()
        The maximum supported version of this tokenizer. This should be greater than or equal to the minimum version.
        Returns:
        the maximum supported tokenizer version
      • validateVersion

        default void validateVersion(int version)
        Verify that the provided version is supported by this tokenizer. This makes sure that the version given is greater than or equal to the minimum version and less than or equal to the maximum version.
        Parameters:
        version - tokenizer version to verify is in bounds
        Throws:
        RecordCoreArgumentException - if the version is out of bounds
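
        The documented behavior reduces to a bounds check along these lines (a sketch; the actual message and exception construction may differ):

            default void validateVersion(int version) {
                if (version < getMinVersion() || version > getMaxVersion()) {
                    throw new RecordCoreArgumentException("unsupported text tokenizer version: " + version
                            + " (supported versions: " + getMinVersion() + " through " + getMaxVersion() + ")");
                }
            }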