Package com.yahoo.language.process
Interface Tokenizer
public interface Tokenizer
Language-sensitive tokenization of a text string.
Author:
Mathias Mølster Lidal
Method Summary
Modifier and Type: default java.lang.String
Method: getReplacementTerm(java.lang.String tokenString)
Description: Returns a replacement for an input token string.

Modifier and Type: java.lang.Iterable<Token>
Method: tokenize(java.lang.String input, Language language, StemMode stemMode, boolean removeAccents)
Description: Returns the tokens produced from an input string under the rules of the given Language and additional options.
Method Detail
tokenize
java.lang.Iterable<Token> tokenize(java.lang.String input, Language language, StemMode stemMode, boolean removeAccents)
Returns the tokens produced from an input string under the rules of the given Language and additional options.
Parameters:
input - the string to tokenize. May be arbitrarily large.
language - the language of the input string.
stemMode - the stem mode applied to the returned tokens.
removeAccents - if true, accents and similar are removed from the returned tokens.
Returns:
the tokens of the input String.
Throws:
ProcessingException - if the underlying library throws an Exception.
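The shape of tokenize can be illustrated with a deliberately naive sketch. It does not use the real com.yahoo.language types: SketchLanguage, SketchStemMode, SketchToken and SketchTokenizer are hypothetical stand-ins defined here so the example compiles on its own, and WhitespaceTokenizer is an illustrative implementation, not the one shipped with the library. It splits on whitespace, ignores language and stemMode entirely, and approximates removeAccents by stripping combining marks after Unicode decomposition. ProcessingException is not modeled.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the real Language and StemMode types.
enum SketchLanguage { ENGLISH, FRENCH }
enum SketchStemMode { NONE, ALL }

// Hypothetical stand-in for Token; only the token string is modeled.
final class SketchToken {
    private final String tokenString;

    SketchToken(String tokenString) {
        this.tokenString = tokenString;
    }

    String getTokenString() {
        return tokenString;
    }
}

// Mirrors the parameter list of Tokenizer.tokenize using the stand-in types.
interface SketchTokenizer {
    Iterable<SketchToken> tokenize(String input, SketchLanguage language,
                                   SketchStemMode stemMode, boolean removeAccents);
}

// Naive implementation: whitespace splitting only, no stemming,
// no language-specific rules.
final class WhitespaceTokenizer implements SketchTokenizer {
    @Override
    public Iterable<SketchToken> tokenize(String input, SketchLanguage language,
                                          SketchStemMode stemMode, boolean removeAccents) {
        List<SketchToken> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // empty or all-whitespace input yields no tokens
            }
            String term = part;
            if (removeAccents) {
                // Decompose accented characters, then drop the combining marks.
                term = Normalizer.normalize(term, Normalizer.Form.NFD)
                                 .replaceAll("\\p{M}+", "");
            }
            tokens.add(new SketchToken(term));
        }
        return tokens;
    }
}

final class TokenizeDemo {
    public static void main(String[] args) {
        SketchTokenizer tokenizer = new WhitespaceTokenizer();
        // The escapes spell an accented French phrase followed by "recipe".
        for (SketchToken token : tokenizer.tokenize("Cr\u00e8me br\u00fbl\u00e9e recipe",
                SketchLanguage.FRENCH, SketchStemMode.NONE, true)) {
            System.out.println(token.getTokenString());
        }
    }
}
```

Returning an Iterable rather than a List matches the declared return type and leaves room for a lazy implementation, which matters given that the input may be arbitrarily large. This sketch builds the full list eagerly for simplicity.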
getReplacementTerm
default java.lang.String getReplacementTerm(java.lang.String tokenString)
Returns a replacement for an input token string. This accepts strings returned by Token.getTokenString and returns a replacement which will be used as the index token. The input token string is returned if there is no replacement.
This default implementation always returns the input token string.
Parameters:
tokenString - the token string of the term to look up a replacement for
Returns:
the replacement, if any, or the argument token string if not
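The contract above (return the replacement if there is one, otherwise the input) can be sketched as follows. ReplacementSketch is a hypothetical stand-in that mirrors only this one default method, not the real Tokenizer interface, and TableReplacement with its lookup table is an assumption made for illustration. Nothing here describes how any actual implementation chooses its replacements.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in mirroring only the getReplacementTerm part of Tokenizer.
interface ReplacementSketch {
    // Default behavior as documented: always return the input token string.
    default String getReplacementTerm(String tokenString) {
        return tokenString;
    }
}

// Inherits the default, so every token string maps to itself.
final class PassThroughReplacement implements ReplacementSketch {
}

// Overrides the default with a table lookup (illustrative only).
final class TableReplacement implements ReplacementSketch {
    private final Map<String, String> replacements;

    TableReplacement(Map<String, String> replacements) {
        this.replacements = new HashMap<>(replacements); // defensive copy
    }

    @Override
    public String getReplacementTerm(String tokenString) {
        // Fall back to the argument when no replacement is registered.
        return replacements.getOrDefault(tokenString, tokenString);
    }
}

final class ReplacementDemo {
    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("colour", "color");
        ReplacementSketch replacer = new TableReplacement(table);
        System.out.println(replacer.getReplacementTerm("colour"));
        System.out.println(replacer.getReplacementTerm("flavour"));
    }
}
```

Because the method is a default method, implementations that have no replacement logic need no extra code, as PassThroughReplacement shows.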