Package com.yahoo.language.simple
Class SimpleTokenizer
java.lang.Object
com.yahoo.language.simple.SimpleTokenizer
- All Implemented Interfaces:
Tokenizer
A tokenizer that splits on whitespace, normalizes and transforms tokens using the given implementations, and stems them using the kstem algorithm.
This class is not thread-safe.
- Author:
- Mathias Mølster Lidal, bratseth
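Because this class is documented as unsafe for concurrent use, one common way to use such a tokenizer from several threads is to give each thread its own instance. The following is a minimal sketch of that pattern using only the JDK. `StatefulTokenizer` and `PerThreadTokenizerSketch` are hypothetical stand-ins, not part of this package, and whether SimpleTokenizer keeps comparable internal state is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

public class PerThreadTokenizerSketch {

    /** Hypothetical stand-in for a tokenizer that keeps mutable state between calls. */
    static final class StatefulTokenizer {
        // Mutable scratch buffer: sharing one instance across threads would corrupt it.
        private final StringBuilder buffer = new StringBuilder();

        List<String> tokenize(String input) {
            List<String> tokens = new ArrayList<>();
            buffer.setLength(0);
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {
                    flush(tokens);
                } else {
                    buffer.append(c);
                }
            }
            flush(tokens);
            return tokens;
        }

        // Emit the buffered characters as one token, if there are any.
        private void flush(List<String> tokens) {
            if (buffer.length() > 0) {
                tokens.add(buffer.toString());
                buffer.setLength(0);
            }
        }
    }

    // One tokenizer per thread, created lazily on first use by that thread.
    private static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StatefulTokenizer::new);

    public static List<String> tokenize(String input) {
        return TOKENIZER.get().tokenize(input);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> System.out.println(tokenize("first thread input")));
        Thread b = new Thread(() -> System.out.println(tokenize("second thread input")));
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

Creating a new tokenizer per request is a simpler alternative when construction is cheap; the per-thread variant only avoids repeated construction.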
Constructor Summary
- SimpleTokenizer()
- SimpleTokenizer(Normalizer normalizer)
- SimpleTokenizer(Normalizer normalizer, Transformer transformer)
- SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)
Method Summary
- Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents): Tokenize the input, applying this tokenizer's transform to each token string.
- tokenize: Tokenize the input, and apply the given transform to each token string.
Constructor Details
SimpleTokenizer
public SimpleTokenizer()
SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer)
SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer, Transformer transformer)
SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)
Method Details
tokenize
public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
Tokenize the input, applying this tokenizer's transform to each token string.
- Specified by:
- tokenize in interface Tokenizer
- Parameters:
- input - the string to tokenize. May be arbitrarily large.
- language - the language of the input string.
- stemMode - the stem mode applied to the returned tokens.
- removeAccents - if true, accents and similar are removed from the returned tokens.
- Returns:
- the tokens of the input String.
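To illustrate what the removeAccents flag refers to, here is a minimal sketch of one well-known way to strip accents with the JDK: decompose the string to Unicode NFD and drop the combining marks. This is an assumption for illustration only; the source does not say how this class removes accents. `AccentRemovalSketch` is a hypothetical name, and `java.text.Normalizer` is the JDK class, unrelated to this package's Normalizer.

```java
import java.text.Normalizer;

public class AccentRemovalSketch {

    /** Strips combining marks after canonical decomposition (NFD). */
    public static String removeAccents(String token) {
        // NFD splits a precomposed letter such as U+00E9 into 'e' plus U+0301.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // \p{M} matches the Unicode mark categories, which include combining accents.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAccents("caf\u00e9"));   // e with acute accent
        System.out.println(removeAccents("na\u00efve"));  // i with diaeresis
        // Letters without a canonical decomposition, such as U+00F8, pass through unchanged.
        System.out.println(removeAccents("M\u00f8lster"));
    }
}
```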
tokenize
Tokenize the input, and apply the given transform to each token string.
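The idea of applying a caller-supplied transform to each token can be sketched with the JDK alone. The signature of this overload is not shown above, so the `Function<String, String>` parameter, the `List<String>` return type, and the class name `TransformingTokenizerSketch` are all assumptions made for illustration, not the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class TransformingTokenizerSketch {

    /** Splits on whitespace and applies the given transform to each token string. */
    public static List<String> tokenize(String input, Function<String, String> transform) {
        List<String> tokens = new ArrayList<>();
        for (String raw : input.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // splitting an empty string yields one empty element
            }
            tokens.add(transform.apply(raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Lowercasing as the transform; any String -> String function would do.
        System.out.println(tokenize("Hello  Big World", s -> s.toLowerCase(Locale.ROOT)));
        // Function.identity() leaves the tokens untouched.
        System.out.println(tokenize("Hello  Big World", Function.identity()));
    }
}
```

Passing the transform as a function keeps the splitting logic independent of any particular normalization or stemming step.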