Class TextIndexMaintainer


  • @API(EXPERIMENTAL)
    public class TextIndexMaintainer
    extends StandardIndexMaintainer
    The index maintainer class for full-text indexes. This takes an expression whose first column (not counting grouping columns) is of type string. It will split the text found at that column using a TextTokenizer and then write separate index keys for each token found in the text. The index then supports queries on the tokenized text (see the query sketch after this list), such as:
    • All records containing all elements from a set of tokens: Query.field(fieldName).text().containsAll(tokens)
    • All records containing any elements from a set of tokens: Query.field(fieldName).text().containsAny(tokens)
    • All records containing all elements from a set of tokens within some maximum span: Query.field(fieldName).text().containsAll(tokens, span)
    • All records containing an exact phrase (modulo normalization and stop-word removal done by the tokenizer): Query.field(fieldName).text().containsPhrase(phrase)
    • All records containing at least one token that begins with a given prefix: Query.field(fieldName).text().containsPrefix(prefix)
    • All records containing at least one token that begins with any of a set of prefixes: Query.field(fieldName).text().containsAnyPrefix(prefixes)
    • All records containing at least one token that begins with each of a set of prefixes: Query.field(fieldName).text().containsAllPrefixes(prefixes)
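    For example, a containsAll query might be constructed as in the following sketch. The record type "SimpleDocument" and its "text" field are hypothetical stand-ins for a record type with a string field covered by a TEXT index.

        // Find all SimpleDocuments whose "text" field contains all of the given tokens
        RecordQuery query = RecordQuery.newBuilder()
                .setRecordType("SimpleDocument")
                .setFilter(Query.field("text").text().containsAll("fdb record layer"))
                .build();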

    One can specify a tokenizer to use by setting the "textTokenizerName" and "textTokenizerVersion" options on the index. If no tokenizer name is given, the DefaultTextTokenizer is used, and if no version is specified, version 0 is assumed. There should be one TextTokenizer implementation registered under that name, along with a TextTokenizerFactory implementation that supplies instances of that tokenizer. The tokenizer version used to serialize each record is stored by this index maintainer, so if an index's tokenizer version changes, this index maintainer will continue to use the older tokenizer version to tokenize the fields of any records indexed prior to the version change. This guarantees that for every record, the same tokenizer version is used when inserting it and when deleting it. To re-tokenize a record following a tokenizer version change, save the record again: a record tokenized with an older version will be re-indexed using the newer version when it is next saved.
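    For example, one might declare a text index with an explicit tokenizer as in the following sketch. The record type "MyDocument", its "text" field, and the tokenizer name "default" are illustrative assumptions; the option constants are defined in IndexOptions.

        // A TEXT index on the "text" field, pinned to the "default" tokenizer at version 0
        Index textIndex = new Index(
                "MyDocument$text",
                Key.Expressions.field("text"),
                IndexTypes.TEXT,
                Map.of(IndexOptions.TEXT_TOKENIZER_NAME_OPTION, "default",
                       IndexOptions.TEXT_TOKENIZER_VERSION_OPTION, "0"));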

    Because each update will add a conflict range for each token included in each indexed text field per record, index updates can be particularly taxing on the resolver process within the FoundationDB cluster. Some use cases can therefore benefit from having fewer, larger conflict ranges per transaction to lessen the work done. The trade-off is potentially less parallelism, as there is a larger chance of conflicts between records that arrive simultaneously. Note, however, that the underlying data structure of the text index already makes it likely that two simultaneously updated records sharing common tokens will conflict, so this setting might not produce more conflicts in practice. To enable adding conflict ranges over larger areas, set the "textAddAggressiveConflictRanges" option to true. Warning: This feature is currently experimental and may change at any moment without prior notice.
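    As a sketch, enabling this option only adds one more entry to the index's option map (the constant for the option name is IndexOptions.TEXT_ADD_AGGRESSIVE_CONFLICT_RANGES_OPTION); the rest of the index definition is as in the example above.

        // Experimental: trade finer-grained conflict detection for less resolver work
        Map<String, String> options = Map.of(
                IndexOptions.TEXT_TOKENIZER_NAME_OPTION, "default",
                IndexOptions.TEXT_ADD_AGGRESSIVE_CONFLICT_RANGES_OPTION, "true");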

    Note: This index is under active development and should be considered experimental. At the current time, this index will be correctly updated on insert and removal and can be manually scanned, but it will only be selected by the query planner in limited circumstances to satisfy full-text queries. For example, the query planner will not select this index if the query involves sorts or if the filter uses the position list to determine the relative positions of tokens within a document.

    • Method Detail

      • getTokenizer

        @Nonnull
        public static TextTokenizer getTokenizer(@Nonnull Index index)
        Get the text tokenizer associated with this index. This uses the value of the "textTokenizerName" option to determine the name of the tokenizer and then looks up the tokenizer in the tokenizer registry.
        Parameters:
        index - the index to get the tokenizer of
        Returns:
        the tokenizer associated with this index
      • getIndexTokenizerVersion

        public static int getIndexTokenizerVersion(@Nonnull Index index)
        Get the tokenizer version associated with this index. This will parse the "textTokenizerVersion" option and produce an integer value from it. If none is specified, this returns the global minimum tokenizer version.
        Parameters:
        index - the index to get the tokenizer version of
        Returns:
        the tokenizer version associated with the given index
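        As a usage sketch, the tokenizer configuration of a TEXT index can be inspected with these two static methods; the index variable is assumed to hold a TEXT index obtained from the record meta-data.

          TextTokenizer tokenizer = TextIndexMaintainer.getTokenizer(index);
          int tokenizerVersion = TextIndexMaintainer.getIndexTokenizerVersion(index);
          // e.g., "default" at version 0 when neither option is set on the index
          System.out.println(tokenizer.getName() + " at version " + tokenizerVersion);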
      • update

        @Nonnull
        public <M extends Message> CompletableFuture<Void> update(@Nullable FDBIndexableRecord<M> oldRecord,
                                                                   @Nullable FDBIndexableRecord<M> newRecord)
        Updates an associated text index with the data associated with a new record. Unlike most standard indexes, the text index behaves somewhat differently if a record was previously written with this index but with an older tokenizer version: in that case, it will always re-index the record and will write index entries to the database even if they are unchanged. The record will then be registered as having been written at the new tokenizer version, so subsequent updates will not have to do any additional work for unchanged fields.
        Overrides:
        update in class StandardIndexMaintainer
        Type Parameters:
        M - type of message
        Parameters:
        oldRecord - the previous stored record or null if a new record is being created
        newRecord - the new record or null if an old record is being deleted
        Returns:
        a future that is complete when the record update is done
        See Also:
        IndexMaintainer.update(FDBIndexableRecord, FDBIndexableRecord)
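        A minimal sketch of forcing re-tokenization after a tokenizer version bump: load and re-save the record, which routes through this method. The recordStore and primaryKey variables are assumed to exist.

          FDBStoredRecord<Message> stored = recordStore.loadRecord(primaryKey);
          if (stored != null) {
              // Re-saving re-indexes the text with the newer tokenizer version
              recordStore.saveRecord(stored.getRecord());
          }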
      • canDeleteWhere

        public boolean canDeleteWhere(@Nonnull QueryToKeyMatcher matcher,
                                      @Nonnull Key.Evaluated evaluated)
        Indicates whether the expression allows for this index to perform a FDBRecordStoreBase.deleteRecordsWhere(QueryComponent) operation. A text index can only delete records that are aligned with its grouping key, as once text from the index has been tokenized, there is no efficient way to remove all of the documents within the grouped part of the index.
        Overrides:
        canDeleteWhere in class StandardIndexMaintainer
        Parameters:
        matcher - object to match the grouping key to a query component
        evaluated - an evaluated key that might align with this index's grouping key
        Returns:
        whether the index maintainer can remove all records matching matcher
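        As a sketch, for a text index grouped by a hypothetical "group" field, a delete aligned with that grouping key might look like:

          // Removes all records (and their text index entries) in group 1066
          recordStore.deleteRecordsWhere(Query.field("group").equalsValue(1066L));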
      • scan

        @Nonnull
        public RecordCursor<IndexEntry> scan(@Nonnull IndexScanType scanType,
                                             @Nonnull TupleRange range,
                                             @Nullable byte[] continuation,
                                             @Nonnull ScanProperties scanProperties)
        Scan this index between a range of tokens. This index type requires that it be scanned only by text token. The range to scan can otherwise be between any two tokens, and scans over a prefix are supported by passing a value of range that uses PREFIX_STRING as both endpoint types. The keys of the returned index entries include, in the column used for the text field of the index's root expression, the token that was found in the index during the scan. The value portion of each index entry will be a tuple whose first element is the position list for that entry within its associated record's field.
        Specified by:
        scan in class IndexMaintainer
        Parameters:
        scanType - the type of scan to perform
        range - the range to scan
        continuation - any continuation from a previous scan invocation
        scanProperties - skip, limit and other properties of the scan
        Returns:
        a cursor over all index entries in range
        Throws:
        RecordCoreException - if scanType is not IndexScanType.BY_TEXT_TOKEN
        See Also:
        TextCursor
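        A minimal sketch of a manual scan for a single token, assuming a record store whose meta-data contains the hypothetical index "MyDocument$text":

          // Scan all index entries for the token "civil"
          TupleRange range = TupleRange.allOf(Tuple.from("civil"));
          RecordCursor<IndexEntry> cursor = recordStore.scanIndex(
                  recordStore.getRecordMetaData().getIndex("MyDocument$text"),
                  IndexScanType.BY_TEXT_TOKEN,
                  range,
                  null,  // no continuation
                  ScanProperties.FORWARD_SCAN);

        The value of each returned IndexEntry then begins with the position list for that token within the record's text field, as described above.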