Class Text


  • @API(EXPERIMENTAL)
    public abstract class Text
    extends Object
    Predicates that can be applied to a field that has been indexed with a full-text index. These allow for querying on properties of the text contents, e.g., whether the text contains a given token, token list, or phrase. Most of the methods here that allow for multiple tokens to be supplied can either be given a single string or a list. If a single string is given, then the string will be tokenized later using an appropriate tokenizer. If a list is given, then the assumption is that the user has already tokenized the string and the list is the result of that tokenization.

    This type allows the user to specify a "tokenizer name". If one is given, then it will use this tokenizer to tokenize the query string (if not pre-tokenized) and will require that if an index is used, it uses the tokenizer provided. If no tokenizer is specified, then it will allow itself to be matched against any text index on the field and apply the index's tokenizer to the query string. If no suitable index can be found and a full scan with a post-filter has to be done, then a fallback tokenizer will be used both to tokenize the query string as well as to tokenize the record's text. By default, this is the DefaultTextTokenizer (with name "default"), but one can specify a different one if one wishes.

    This should be created by calling the text() method on a query Field or OneOfThem instance. For example, one might call: Query.field("text").text() to create a predicate on the text field's contents.

    See Also:
    TextIndexMaintainer, TextTokenizer, DefaultTextTokenizer
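To make the single-String versus pre-tokenized-List distinction above concrete, the sketch below models what "the string will be tokenized later using an appropriate tokenizer" might look like. `SimpleTokenizer` is a hypothetical stand-in for illustration only; the real predicates delegate to a TextTokenizer such as DefaultTextTokenizer, whose rules may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for "an appropriate tokenizer": lower-case the text
// and split on whitespace. Passing a String to a predicate implies this kind
// of tokenization happens later; passing a List<String> implies the caller
// has already produced tokens like these.
final class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }
}
```

Under this model, `containsAll("Civil War")` and `containsAll(List.of("civil", "war"))` would describe the same search, with the second form skipping tokenization entirely.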
    • Method Detail

      • contains

        @Nonnull
        public QueryComponent contains​(@Nonnull
                                       String token)
        Checks if the field contains a token. This token should either be generated by the tokenizer associated with this text predicate or should be a plausible token that the tokenizer could have generated. This token will not be further sanitized or normalized before searching for it in the text.
        Parameters:
        token - the token to search for
        Returns:
        a new component for doing the actual evaluation
      • containsPrefix

        @Nonnull
        public QueryComponent containsPrefix​(@Nonnull
                                             String prefix)
        Checks if the field contains any token matching the provided prefix. This should be the beginning of a token that could be generated by the tokenizer associated with this text predicate. No additional sanitization or normalization of this prefix will be performed before searching for it in the text.
        Parameters:
        prefix - the prefix to search for
        Returns:
        a new component for doing the actual evaluation
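The matching rule described above reduces to a per-token prefix test. The following is a simplified semantic sketch, not the index-backed implementation:

```java
import java.util.List;

// Toy model of containsPrefix semantics: the predicate matches when any
// token of the (already tokenized) text begins with the given prefix.
final class PrefixMatch {
    public static boolean anyTokenHasPrefix(List<String> tokens, String prefix) {
        return tokens.stream().anyMatch(t -> t.startsWith(prefix));
    }
}
```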
      • containsAll

        @Nonnull
        public QueryComponent containsAll​(@Nonnull
                                          String tokens)
        Checks if the field contains all of the provided tokens. At query evaluation time, the tokens provided here will be tokenized into a list of tokens. This predicate will then return Boolean.TRUE if all of the tokens (except stop words) are present in the text field, Boolean.FALSE if any of them are not, and null if either the field is null or if the token list contains only stop words or is empty. If the same token appears multiple times in the token list, then the token need only appear once in the searched text to satisfy the filter (i.e., it is not required to appear as many times in the text as it does in the token list).
        Parameters:
        tokens - the tokens to search for
        Returns:
        a new component for doing the actual evaluation
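The three-valued result described above (Boolean.TRUE, Boolean.FALSE, or null) can be sketched as follows. The stop-word set here is hypothetical and exists only to illustrate the null case; real stop-word handling belongs to the tokenizer:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Semantic sketch of containsAll: TRUE when every non-stop-word query token
// is present in the text, FALSE when one is missing, and null when the query
// reduces to nothing (empty, or stop words only).
final class ContainsAllSemantics {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an");

    public static Boolean containsAll(List<String> textTokens, List<String> queryTokens) {
        List<String> meaningful = queryTokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
        if (meaningful.isEmpty()) {
            return null; // empty query or stop words only
        }
        // A token repeated in the query need only appear once in the text.
        return meaningful.stream().allMatch(textTokens::contains);
    }
}
```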
      • containsAll

        @Nonnull
        public QueryComponent containsAll​(@Nonnull
                                          List<String> tokens)
        Checks if the field contains all of the provided tokens. This behaves like containsAll(String), except that the token list is assumed to have already been tokenized with an appropriate tokenizer. No further sanitization or normalization is performed on the tokens before searching for them in the text.
        Parameters:
        tokens - the tokens to search for
        Returns:
        a new component for doing the actual evaluation
      • containsAll

        @Nonnull
        public QueryComponent containsAll​(@Nonnull
                                          String tokens,
                                          int maxDistance)
        Checks if the field text contains all of the provided tokens within a given number of tokens. For example, in the string "a b c" (tokenized by whitespace), tokens "a" and "c" are a distance of two tokens of each other, so containsAll("a c", 2) when evaluated against that string would return Boolean.TRUE, but containsAll("a c", 1) would return Boolean.FALSE. Stop words in the query string are ignored, and if there are no tokens in the string (or all tokens are stop words), this will evaluate to null. It will also evaluate to null if the field is null. If the same token appears multiple times in the token list, then the token need only appear once in the searched text to satisfy the filter (i.e., it is not required to appear as many times in the text as it does in the token list).
        Parameters:
        tokens - the tokens to search for
        maxDistance - the maximum distance (expressed in number of tokens) to allow between found tokens
        Returns:
        a new component for doing the actual evaluation
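The "a b c" example above can be reproduced with a small positional check. For brevity this sketch handles exactly two query tokens, which is enough to illustrate the distance rule; it is not the planner's algorithm:

```java
import java.util.List;

// Toy model of the maxDistance rule: two query tokens satisfy the predicate
// when some pair of their positions in the text is at most maxDistance apart.
final class DistanceCheck {
    public static boolean withinDistance(List<String> text, String t1, String t2, int maxDistance) {
        for (int i = 0; i < text.size(); i++) {
            for (int j = 0; j < text.size(); j++) {
                if (text.get(i).equals(t1) && text.get(j).equals(t2)
                        && Math.abs(i - j) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Against the tokens of "a b c", this reports true for ("a", "c") with maxDistance 2 and false with maxDistance 1, matching the documented example.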
      • containsAll

        @Nonnull
        public QueryComponent containsAll​(@Nonnull
                                          List<String> tokens,
                                          int maxDistance)
        Checks if the field text contains all of the provided tokens within a given number of tokens. This behaves like containsAll(String, int) except that the token list is assumed to have already been tokenized with an appropriate tokenizer. No further sanitization or normalization is performed on the tokens before searching for them in the text.
        Parameters:
        tokens - the tokens to search for
        maxDistance - the maximum distance (expressed in number of tokens) to allow between found tokens
        Returns:
        a new component for doing the actual evaluation
      • containsAllPrefixes

        @Nonnull
        public QueryComponent containsAllPrefixes​(@Nonnull
                                                  String tokenPrefixes)
        Checks if the field contains tokens matching all of the given prefixes. The given String will be tokenized into multiple tokens using an appropriate tokenizer. This variant of containsAllPrefixes is strict, i.e., the planner will ensure that it does not return any false positives when evaluated with an index scan. However, the scan can be made more efficient (if false positives are acceptable) by using one of the other variants of this function and supplying false for the strict parameter.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        Returns:
        a new component for doing the actual evaluation
        See Also:
        containsAllPrefixes(String, boolean)
      • containsAllPrefixes

        @Nonnull
        public QueryComponent containsAllPrefixes​(@Nonnull
                                                  String tokenPrefixes,
                                                  boolean strict)
        Checks if the field contains tokens matching all of the given prefixes. The given String will be tokenized into multiple tokens using an appropriate tokenizer. The strict parameter determines whether this comparison is strictly evaluated against an index. If the parameter is set to true, then this will return no false positives, but it may require additional reads during query execution to filter out any false positives that occur internally.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        strict - true if this should not return false positives
        Returns:
        a new component for doing the actual evaluation
      • containsAllPrefixes

        @Nonnull
        public QueryComponent containsAllPrefixes​(@Nonnull
                                                  String tokenPrefixes,
                                                  boolean strict,
                                                  long expectedRecords,
                                                  double falsePositivePercentage)
        Checks if the field contains tokens matching all of the given prefixes. The given String will be tokenized into multiple tokens using an appropriate tokenizer. The strict parameter behaves the same way here as it does in the other overload of containsAllPrefixes(). The expectedRecords and falsePositivePercentage flags can be used to tweak the behavior of underlying probabilistic data structures used during query execution. See the Comparisons.TextContainsAllPrefixesComparison class for more details.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        strict - true if this should not return any false positives
        expectedRecords - the expected number of records read for each prefix
        falsePositivePercentage - an acceptable percentage of false positives for each token prefix
        Returns:
        a new component for doing the actual evaluation
        See Also:
        Comparisons.TextContainsAllPrefixesComparison, containsAllPrefixes(String, boolean)
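The expectedRecords and falsePositivePercentage knobs suggest a Bloom-filter-style structure. Assuming that (an assumption; see Comparisons.TextContainsAllPrefixesComparison for the actual data structure), the classic sizing formula shows how the two parameters trade memory for accuracy:

```java
// Back-of-the-envelope sizing, assuming a Bloom-filter-like structure:
// required bits m = -n * ln(p) / (ln 2)^2, where n is the expected number of
// elements and p the acceptable false-positive rate.
final class BloomSizing {
    public static long requiredBits(long expectedRecords, double falsePositiveRate) {
        double ln2 = Math.log(2.0);
        return (long) Math.ceil(-expectedRecords * Math.log(falsePositiveRate) / (ln2 * ln2));
    }
}
```

For example, 1,000 expected records at a 1% false-positive rate needs on the order of 9,600 bits, i.e., roughly 10 bits per element; halving the rate costs about one extra bit per element.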
      • containsAllPrefixes

        @Nonnull
        public QueryComponent containsAllPrefixes​(@Nonnull
                                                  List<String> tokenPrefixes)
        Checks if the field contains tokens matching all of the given prefixes. This will produce a component that behaves exactly like the component returned by the variant of containsAllPrefixes(String) that takes a single String, but this method assumes the token prefixes given are already tokenized and normalized.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        Returns:
        a new component for doing the actual evaluation
        See Also:
        containsAllPrefixes(String)
      • containsAllPrefixes

        @Nonnull
        public QueryComponent containsAllPrefixes​(@Nonnull
                                                  List<String> tokenPrefixes,
                                                  boolean strict)
        Checks if the field contains tokens matching all of the given prefixes. This will produce a component that behaves exactly like the component returned by the variant of containsAllPrefixes(String, boolean) that takes a single String, but this method assumes the token prefixes given are already tokenized and normalized.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        strict - true if this should not return any false positives
        Returns:
        a new component for doing the actual evaluation
        See Also:
        containsAllPrefixes(String, boolean)
      • containsAllPrefixes

        @Nonnull
        public QueryComponent containsAllPrefixes​(@Nonnull
                                                  List<String> tokenPrefixes,
                                                  boolean strict,
                                                  long expectedRecords,
                                                  double falsePositivePercentage)
        Checks if the field contains tokens matching all of the given prefixes. This will produce a component that behaves exactly like the component returned by the variant of containsAllPrefixes(String, boolean, long, double) that takes a single String, but this method assumes the token prefixes given are already tokenized and normalized.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        strict - true if this should not return any false positives
        expectedRecords - the expected number of records read for each prefix
        falsePositivePercentage - an acceptable percentage of false positives for each token prefix
        Returns:
        a new component for doing the actual evaluation
        See Also:
        containsAllPrefixes(String, boolean, long, double)
      • containsPhrase

        @Nonnull
        public QueryComponent containsPhrase​(@Nonnull
                                             String phrase)
        Checks if the field contains the given phrase. This will match the given field if the given phrase (when tokenized) forms a sublist of the original text's tokens. If the tokenization process removes any stop words from the phrase, this will match documents that contain any token in the place of the stop word. This will return Boolean.TRUE if all of the tokens (except stop words) can be found in the given document in the correct order, Boolean.FALSE if any cannot, and null if the phrase is empty or contains only stop words or if the field itself is null.
        Parameters:
        phrase - the phrase to search for
        Returns:
        a new component for doing the actual evaluation
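The phrase rule above (contiguous sublist, with a removed stop word matching any token) can be modeled as follows. This follows the convention from containsPhrase(List) of using an empty string to mark a removed stop word; it is a semantic sketch, not the index-backed implementation:

```java
import java.util.List;

// Toy model of containsPhrase: the tokenized phrase must occur as a contiguous
// sublist of the text's tokens, where an empty string in the phrase acts as a
// one-token wildcard (the position a stop word was removed from).
final class PhraseMatch {
    public static boolean containsPhrase(List<String> text, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= text.size(); start++) {
            boolean match = true;
            for (int k = 0; k < phrase.size(); k++) {
                String p = phrase.get(k);
                // Empty string = stop-word placeholder, matches any token.
                if (!p.isEmpty() && !p.equals(text.get(start + k))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

For instance, the phrase tokens ["against", "", "wind"] (with "the" removed as a stop word) match the text tokens of "against the wind", because the placeholder matches the token in the stop word's position.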
      • containsPhrase

        @Nonnull
        public QueryComponent containsPhrase​(@Nonnull
                                             List<String> phraseTokens)
        Checks if the field text contains the given phrase. This behaves like containsPhrase(String) except that the token list is assumed to have already been tokenized with an appropriate tokenizer. No further sanitization or normalization is performed on the tokens before searching for them in the text. It is assumed that the order of the tokens in the list is the same as the order of the tokens in the original phrase and that there are no gaps (except as indicated by including the empty string to indicate that there was a stop word in the original phrase).
        Parameters:
        phraseTokens - the tokens to search for in the order they appear in the phrase
        Returns:
        a new component for doing the actual evaluation
      • containsAny

        @Nonnull
        public QueryComponent containsAny​(@Nonnull
                                          String tokens)
        Checks if the field contains any of the provided tokens. At query evaluation time, the tokens provided here will be tokenized into a list of tokens. This predicate will then return Boolean.TRUE if any of the tokens (not counting stop words) are present, Boolean.FALSE if all of them are not, and null if either the field is null or if the token list contains only stop words or is empty.
        Parameters:
        tokens - the tokens to search for
        Returns:
        a new component for doing the actual evaluation
      • containsAny

        @Nonnull
        public QueryComponent containsAny​(@Nonnull
                                          List<String> tokens)
        Checks if the field contains any of the provided tokens. This behaves like containsAny(String), except that the token list is assumed to have already been tokenized with an appropriate tokenizer. No further sanitization or normalization is performed on the tokens before searching for them in the text.
        Parameters:
        tokens - the tokens to search for
        Returns:
        a new component for doing the actual evaluation
      • containsAnyPrefix

        @Nonnull
        public QueryComponent containsAnyPrefix​(@Nonnull
                                                String tokenPrefixes)
        Checks if the field contains a token that matches any of the given prefixes. At query evaluation time, the string given is tokenized using an appropriate tokenizer.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        Returns:
        a new component for doing the actual evaluation
      • containsAnyPrefix

        @Nonnull
        public QueryComponent containsAnyPrefix​(@Nonnull
                                                List<String> tokenPrefixes)
        Checks if the field contains a token that matches any of the given prefixes. This behaves like the variant of containsAnyPrefix(String) that takes a single String except that it assumes the token prefix list has already been tokenized and normalized.
        Parameters:
        tokenPrefixes - the token prefixes to search for
        Returns:
        a new component for doing the actual evaluation
        See Also:
        containsAnyPrefix(String)