Package com.yahoo.language.process
Interface Tokenizer
All Known Implementing Classes:
OpenNlpTokenizer, SimpleTokenizer
public interface Tokenizer
Language-sensitive tokenization of a text string.
Author:
Mathias Mølster Lidal
Method Summary
default String getReplacementTerm(String tokenString)
Deprecated. Replacements are already applied in tokens returned by tokenize.
Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
Returns the tokens produced from an input string under the rules of the given Language and additional options.
Method Detail
tokenize
Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
Returns the tokens produced from an input string under the rules of the given Language and additional options.
Parameters:
input - the string to tokenize. May be arbitrarily large.
language - the language of the input string.
stemMode - the stem mode applied to the returned tokens.
removeAccents - if true, accents and similar marks are removed from the returned tokens.
Returns:
the tokens of the input String.
Throws:
ProcessingException - if the underlying library throws an Exception.
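The tokenize contract can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the nested Language, StemMode, Token and Tokenizer types are local stand-ins that only mirror the names in the signature, and WhitespaceTokenizer is a hypothetical implementation that splits on whitespace and ignores language and stemMode. It is not how OpenNlpTokenizer or SimpleTokenizer work.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class TokenizerSketch {

    // Local stand-ins (assumptions) so the sketch compiles without the library.
    enum Language { ENGLISH, FRENCH }
    enum StemMode { NONE, ALL }

    static final class Token {
        private final String tokenString;
        Token(String tokenString) { this.tokenString = tokenString; }
        String getTokenString() { return tokenString; }
    }

    // Mirrors the documented method signature only.
    interface Tokenizer {
        Iterable<Token> tokenize(String input, Language language,
                                 StemMode stemMode, boolean removeAccents);
    }

    // Hypothetical implementation: whitespace splitting, no stemming,
    // no language-specific rules.
    static final class WhitespaceTokenizer implements Tokenizer {
        @Override
        public Iterable<Token> tokenize(String input, Language language,
                                        StemMode stemMode, boolean removeAccents) {
            List<Token> tokens = new ArrayList<>();
            for (String part : input.trim().split("\\s+")) {
                if (part.isEmpty()) continue; // empty or blank input yields no tokens
                if (removeAccents) {
                    // Decompose characters, then drop the combining marks.
                    part = Normalizer.normalize(part, Normalizer.Form.NFD)
                                     .replaceAll("\\p{M}", "");
                }
                tokens.add(new Token(part));
            }
            return tokens;
        }
    }

    // Collects the token strings of an input, consuming the returned Iterable.
    static List<String> tokenStrings(Tokenizer tokenizer, String input,
                                     boolean removeAccents) {
        List<String> result = new ArrayList<>();
        for (Token token : tokenizer.tokenize(input, Language.FRENCH,
                                              StemMode.NONE, removeAccents)) {
            result.add(token.getTokenString());
        }
        return result;
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        System.out.println(tokenStrings(tokenizer, "caf\u00e9 au lait", true));
    }
}
```

Because the method returns an Iterable<Token> rather than a List, callers should rely only on iteration; the sketch copies the token strings into a list purely for convenience.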
getReplacementTerm
@Deprecated default String getReplacementTerm(String tokenString)
Deprecated. Replacements are already applied in tokens returned by tokenize.
Not used.