Package com.yahoo.language.simple
Class SimpleTokenizer
java.lang.Object
com.yahoo.language.simple.SimpleTokenizer
- All Implemented Interfaces:
Tokenizer
A tokenizer which splits on whitespace, normalizes and transforms using the given implementations and stems using the kstem algorithm.
This class is not thread-safe.
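Because instances must not be shared between threads, one common approach is to give each thread its own instance. The following is a minimal sketch of that pattern using `ThreadLocal`. `StatefulTokenizer` is a hypothetical stand-in defined here so the example is self-contained. It is not part of this package, and its internal buffer is only an assumed illustration of why a tokenizer might be unsafe to share. In real code the factory passed to `ThreadLocal.withInitial` would construct a `SimpleTokenizer` instead.

```java
import java.util.ArrayList;
import java.util.List;

public class PerThreadTokenizerSketch {

    /** Hypothetical stand-in for a tokenizer that keeps mutable state between calls. */
    static final class StatefulTokenizer {
        // Scratch state reused across calls: this is what makes sharing an instance unsafe
        private final StringBuilder buffer = new StringBuilder();

        List<String> tokenize(String input) {
            List<String> tokens = new ArrayList<>();
            buffer.setLength(0);
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {
                    flush(tokens);
                } else {
                    buffer.append(c);
                }
            }
            flush(tokens);
            return tokens;
        }

        /** Emits the buffered characters as one token, if there are any. */
        private void flush(List<String> tokens) {
            if (buffer.length() > 0) {
                tokens.add(buffer.toString());
                buffer.setLength(0);
            }
        }
    }

    // One instance per thread, created lazily on first use in that thread
    private static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StatefulTokenizer::new);

    static List<String> tokenize(String input) {
        return TOKENIZER.get().tokenize(input);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () ->
                System.out.println(Thread.currentThread().getName() + " " + tokenize("a b c"));
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

Creating a new instance per call is a simpler alternative when construction is cheap; the thread-local variant only avoids repeated construction.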
- Author:
- Mathias Mølster Lidal, bratseth
Constructor Summary
Constructor
SimpleTokenizer()
SimpleTokenizer(Normalizer normalizer)
SimpleTokenizer(Normalizer normalizer, Transformer transformer)
SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.yahoo.language.process.Tokenizer
getReplacementTerm
Constructor Details
SimpleTokenizer
public SimpleTokenizer()
SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer)
SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer, Transformer transformer)
SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)
Method Details
tokenize
public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
Description copied from interface: Tokenizer
Returns the tokens produced from an input string under the rules of the given Language and additional options.
- Specified by:
tokenize in interface Tokenizer
- Parameters:
input - the string to tokenize. May be arbitrarily large.
language - the language of the input string.
stemMode - the stem mode applied on the returned tokens.
removeAccents - if true, accents and similar are removed from the returned tokens.
- Returns:
the tokens of the input String.
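To make the steps named in the class description concrete, here is a rough, self-contained sketch of a split, normalize, and transform pipeline. It is not the implementation of this class. The use of `java.text.Normalizer` with NFKC for the normalization step and the removal of combining marks for the `removeAccents` step are assumptions chosen for illustration. Stemming (kstem), language handling, and special tokens are left out, and plain strings are returned rather than `Token` objects.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class TokenizePipelineSketch {

    /**
     * Splits the input on whitespace, normalizes each piece and optionally
     * removes accents. Illustrative only: no stemming is performed.
     */
    static List<String> tokenize(String input, boolean removeAccents) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (piece.isEmpty()) continue; // only happens for empty or blank input

            // Normalization step (assumed here to be Unicode NFKC)
            String token = Normalizer.normalize(piece, Normalizer.Form.NFKC);

            if (removeAccents) {
                // Transform step: decompose, then drop the combining marks
                token = Normalizer.normalize(token, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}+", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes spell "Crème  Brûlée recipes" independently of source encoding
        String input = "Cr\u00e8me  Br\u00fbl\u00e9e recipes";
        System.out.println(tokenize(input, true));  // prints [Creme, Brulee, recipes]
        System.out.println(tokenize(input, false)); // accents are kept
    }
}
```

The `removeAccents` flag in this sketch mirrors the parameter of the same name on `tokenize`, but only in spirit: the transformation actually applied by this class is whatever the configured `Transformer` does.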