Interface Tokenizer

All Known Implementing Classes:
SimpleTokenizer

public interface Tokenizer
Language-sensitive tokenization of a text string.
Author:
Mathias Mølster Lidal
  • Method Summary

    Modifier and Type
    Method
    Description
    Iterable<Token>
    tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
    Returns the tokens produced from the input string under the rules of the given Language and the additional options.
  • Method Details

    • tokenize

      Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
      Returns the tokens produced from the input string under the rules of the given Language and the additional options.
      Parameters:
      input - the string to tokenize. May be arbitrarily large.
      language - the language of the input string.
      stemMode - the stem mode applied to the returned tokens.
      removeAccents - if true, accents and similar diacritical marks are removed from the returned tokens.
      Returns:
      the tokens of the input String.
      Throws:
      ProcessingException - If the underlying library throws an Exception.
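Example (illustrative sketch only):
The following shows how an implementation of tokenize might be written and called. So that the example compiles on its own, Language, StemMode, Token and Tokenizer are declared here as simplified, hypothetical stand-ins for the library's types. WhitespaceTokenizer is likewise invented for illustration and is not SimpleTokenizer. It splits on whitespace, lowercases, ignores the language and stem mode, and strips accents with java.text.Normalizer when removeAccents is true.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizerSketch {

    // Hypothetical stand-ins for the library's Language and StemMode types.
    enum Language { ENGLISH, FRENCH }
    enum StemMode { NONE, ALL }

    // Hypothetical, simplified token: the original text and its normalized form.
    static final class Token {
        final String original;
        final String tokenString;

        Token(String original, String tokenString) {
            this.original = original;
            this.tokenString = tokenString;
        }
    }

    // Same method signature as the interface documented above.
    interface Tokenizer {
        Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents);
    }

    // Assumption: a toy implementation that ignores language and stemMode.
    static final class WhitespaceTokenizer implements Tokenizer {
        @Override
        public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents) {
            List<Token> tokens = new ArrayList<>();
            for (String word : input.trim().split("\\s+")) {
                if (word.isEmpty()) continue; // empty or all-whitespace input
                String normalized = word.toLowerCase(Locale.ROOT);
                if (removeAccents) normalized = stripAccents(normalized);
                tokens.add(new Token(word, normalized));
            }
            return tokens;
        }

        // Decompose accented characters (NFD), then drop the combining marks.
        static String stripAccents(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        }
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        // "Caf\u00e9 cr\u00e8me" is "Café crème"; the normalized forms are "cafe" and "creme".
        for (Token t : tokenizer.tokenize("Caf\u00e9 cr\u00e8me", Language.FRENCH, StemMode.NONE, true)) {
            System.out.println(t.original + " -> " + t.tokenString);
        }
    }
}
```

Because tokenize returns an Iterable, callers can consume tokens in a for-each loop without assuming a particular collection type, which suits the "arbitrarily large" input noted for the input parameter.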