Class SimpleTokenizer

  • All Implemented Interfaces:
    Tokenizer

    public class SimpleTokenizer
    extends java.lang.Object
    implements Tokenizer

    A tokenizer that splits the input on whitespace, normalizes and transforms the tokens using the supplied implementations, and stems them using the KStem algorithm.

    This class is not thread-safe.
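
Since an instance should not be shared between threads, one common pattern is to give each thread its own instance. The following is a minimal sketch of that pattern using `ThreadLocal`. `StatefulTokenizer` is a hypothetical stand-in with mutable internal state, included only so the example runs without this library on the classpath; it is not this class's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class PerThreadTokenizerSketch {

    // Hypothetical stand-in for a tokenizer that keeps mutable state between calls.
    static class StatefulTokenizer {
        // Scratch buffer reused across calls: this is what makes sharing unsafe.
        private final StringBuilder buffer = new StringBuilder();

        List<String> tokenize(String input) {
            List<String> tokens = new ArrayList<>();
            buffer.setLength(0);
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {
                    flush(tokens);
                } else {
                    buffer.append(c);
                }
            }
            flush(tokens);
            return tokens;
        }

        // Emits the buffered characters as one token, if there are any.
        private void flush(List<String> tokens) {
            if (buffer.length() > 0) {
                tokens.add(buffer.toString());
                buffer.setLength(0);
            }
        }
    }

    // Each thread lazily creates and then reuses its own instance.
    private static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StatefulTokenizer::new);

    static List<String> tokenize(String input) {
        return TOKENIZER.get().tokenize(input);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> System.out.println(
                Thread.currentThread().getName() + " " + tokenize("one two three"));
        Thread a = new Thread(work, "a");
        Thread b = new Thread(work, "b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

With this library available, the same pattern would presumably apply by replacing the stand-in with `ThreadLocal.withInitial(SimpleTokenizer::new)`, which relies only on the no-argument constructor listed under Constructor Detail.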

    Author:
    Mathias Mølster Lidal, bratseth
    • Constructor Detail

      • SimpleTokenizer

        public SimpleTokenizer()
      • SimpleTokenizer

        public SimpleTokenizer(Normalizer normalizer)
    • Method Detail

      • tokenize

        public java.lang.Iterable<Token> tokenize(java.lang.String input,
                                                  Language language,
                                                  StemMode stemMode,
                                                  boolean removeAccents)
        Description copied from interface: Tokenizer
        Returns the tokens produced from an input string under the rules of the given Language and the additional options.
        Specified by:
        tokenize in interface Tokenizer
        Parameters:
        input - the string to tokenize. May be arbitrarily large.
        language - the language of the input string.
        stemMode - the stem mode applied to the returned tokens.
        removeAccents - if true, accents and similar marks are removed from the returned tokens.
        Returns:
        the tokens of the input String.
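
As a rough illustration of the behavior described above (whitespace splitting with optional accent removal), here is a minimal, self-contained sketch built only on the JDK. It does not call this class or the Tokenizer interface: `TokenizeSketch` and its `tokenize` helper are hypothetical, the language and stemMode parameters are left out, and accent removal is approximated with `java.text.Normalizer`, which may differ from what this class actually does.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {

    // Hypothetical helper, not part of the documented API.
    static List<String> tokenize(String input, boolean removeAccents) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // empty or all-whitespace input yields no tokens
            }
            String token = part;
            if (removeAccents) {
                // Decompose accented characters (NFD), then drop the combining marks.
                token = Normalizer.normalize(token, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}+", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only: "Crème  brûlée à Paris".
        String input = "Cr\u00e8me  br\u00fbl\u00e9e \u00e0 Paris";
        System.out.println(tokenize(input, true));  // prints [Creme, brulee, a, Paris]
        System.out.println(tokenize(input, false)); // accents kept
    }
}
```

Stemming is omitted here on purpose, since a faithful KStem implementation is well beyond a short example.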