Package com.yahoo.language.simple
Class SimpleTokenizer
java.lang.Object
    com.yahoo.language.simple.SimpleTokenizer

All Implemented Interfaces:
Tokenizer
public class SimpleTokenizer extends Object implements Tokenizer
A tokenizer which splits on whitespace, normalizes and transforms using the given implementations, and stems using the kstem algorithm.
This class is not thread-safe.
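Since instances must not be shared between threads, one common pattern is to give each thread its own instance. The following is a minimal sketch of that pattern, not part of this API: StandInTokenizer and PerThreadTokenizer are hypothetical names, and StandInTokenizer only stands in for SimpleTokenizer so that the example compiles without the Vespa libraries on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SimpleTokenizer (an assumption for illustration only).
// It keeps unsynchronized mutable state between calls, which is what makes
// sharing a single instance across threads unsafe.
class StandInTokenizer {
    private final List<String> buffer = new ArrayList<>();

    List<String> tokenize(String input) {
        buffer.clear();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) buffer.add(part);
        }
        return new ArrayList<>(buffer); // hand out a copy, keep the buffer private
    }
}

// Hypothetical helper: every thread lazily creates and then reuses its own tokenizer.
class PerThreadTokenizer {
    private static final ThreadLocal<StandInTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StandInTokenizer::new);

    // Returns the instance owned by the calling thread.
    static StandInTokenizer current() {
        return TOKENIZER.get();
    }

    static List<String> tokenize(String input) {
        return current().tokenize(input);
    }
}
```

With the real class, the supplier passed to ThreadLocal.withInitial would presumably be SimpleTokenizer::new. Creating a new instance per call is another option when construction is cheap enough.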
Author:
Mathias Mølster Lidal, bratseth
Constructor Summary
Constructors
SimpleTokenizer()
SimpleTokenizer(Normalizer normalizer)
SimpleTokenizer(Normalizer normalizer, Transformer transformer)
SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)

Method Summary
Modifier and Type: Iterable<Token>
Method: tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
Description: Returns the tokens produced from an input string under the rules of the given Language and additional options.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.yahoo.language.process.Tokenizer
getReplacementTerm

Constructor Detail

SimpleTokenizer
public SimpleTokenizer()

SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer)

SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer, Transformer transformer)

SimpleTokenizer
public SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)

Method Detail

tokenize
public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
Description copied from interface: Tokenizer
Returns the tokens produced from an input string under the rules of the given Language and additional options.
Specified by:
tokenize in interface Tokenizer
Parameters:
input - the string to tokenize. May be arbitrarily large.
language - the language of the input string.
stemMode - the stem mode applied on the returned tokens.
removeAccents - if true, accents and similar are removed from the returned tokens.
Returns:
the tokens of the input String.
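
To make the effect of the removeAccents parameter concrete, the sketch below splits a string on whitespace and optionally strips accents using only the JDK. It is an illustration of the idea under stated assumptions, not this class's implementation: AccentRemovalSketch and its methods are hypothetical names, language handling and stemming are left out, and the real normalization may cover more cases ("accents and similar") than the combining marks removed here.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it does not use or reproduce SimpleTokenizer.
class AccentRemovalSketch {

    // Decomposes characters (NFD) and drops the combining marks,
    // so that an accented letter is reduced to its base letter.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Splits on runs of whitespace; strips accents from each token if requested.
    static List<String> tokenize(String input, boolean removeAccents) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (part.isEmpty()) continue; // blank input yields no tokens
            tokens.add(removeAccents ? stripAccents(part) : part);
        }
        return tokens;
    }
}
```

Under these assumptions, calling tokenize with removeAccents set to false returns the whitespace-separated parts unchanged, while true returns the same parts with accented letters reduced to their base letters.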