Class BaseTextTokenizer

  • All Implemented Interfaces:
    TextTokenizer
    Direct Known Subclasses:
    EnglishTextTokenizer, UniversalTextTokenizer

    public abstract class BaseTextTokenizer
    extends Object
    implements TextTokenizer
    An abstract text tokenizer which tokenizes a given string. It discards certain words, known as stop words, depending on the language chosen.
    Since:
    2.1.0
    Author:
    Anindya Chatterjee
    • Constructor Detail

      • BaseTextTokenizer

        public BaseTextTokenizer()
    • Method Detail

      • tokenize

        public Set<String> tokenize(String text)
        Description copied from interface: TextTokenizer
        Tokenizes a text and discards all stop words from it.
        Specified by:
        tokenize in interface TextTokenizer
        Parameters:
        text - the text to tokenize
        Returns:
        the set of tokens.
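
        Example:
        The contract described above (split a text into tokens, drop stop words, return a set) can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class name StopWordTokenizerSketch, the two-argument static tokenize helper, the splitting rule, the lower-casing step, and the sample stop-word list are all assumptions made for illustration. The page above does not state how the real class splits text, whether it normalizes case, or where its stop words come from.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of stop-word filtering; NOT the library's code.
public class StopWordTokenizerSketch {

    // Splits the text on runs of characters that are neither letters nor digits,
    // lower-cases each piece (an assumption), and keeps only the pieces that are
    // not in the supplied stop-word set. Duplicates collapse because the result
    // is a Set; LinkedHashSet preserves first-seen order for readable output.
    public static Set<String> tokenize(String text, Set<String> stopWords) {
        Set<String> tokens = new LinkedHashSet<>();
        if (text == null || text.isEmpty()) {
            return tokens;
        }
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample stop words chosen for this example only.
        Set<String> stopWords = Set.of("a", "the", "is", "of");
        System.out.println(tokenize("The name of the fox is Rex", stopWords));
        // prints [name, fox, rex]
    }
}
```

        In the real class hierarchy, a concrete subclass such as EnglishTextTokenizer would presumably supply the language-specific stop words, so callers only pass the text to tokenize(String).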