public abstract class BaseTextTokenizer extends java.lang.Object implements TextTokenizer
An abstract text tokenizer that tokenizes a given string. It discards certain words, known as stop words, depending on the language chosen.
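The language-dependent part of this behavior can be illustrated with a minimal, stand-alone sketch. Everything in it is an assumption made for illustration: the class name `StopWordsByLanguageSketch`, the helpers `stopWordsFor` and `isStopWord`, the use of `java.util.Locale` to select a language, and the tiny sample word lists. None of it is taken from this class, which may choose its stop words quite differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; it does not use or extend the real BaseTextTokenizer.
public class StopWordsByLanguageSketch {

    // Assumed sample data: a few stop words keyed by ISO 639 language code.
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "und")));
    }

    // Returns the stop words for the locale's language, or an empty set if none are known.
    static Set<String> stopWordsFor(Locale locale) {
        return STOP_WORDS.getOrDefault(locale.getLanguage(), Collections.emptySet());
    }

    // Lower-cases the word and looks it up in the set for the chosen language.
    static boolean isStopWord(String word, Locale locale) {
        return stopWordsFor(locale).contains(word.toLowerCase(locale));
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("Die", Locale.GERMAN));  // true
        System.out.println(isStopWord("die", Locale.ENGLISH)); // false
    }
}
```

The same string can be a stop word in one language and an ordinary token in another, which is why the stop-word set has to be chosen per language rather than fixed globally.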
Constructor and Description |
---|
BaseTextTokenizer() |
Modifier and Type | Method and Description |
---|---|
protected java.lang.String | convertWord(java.lang.String word) Converts a word to lower case and checks whether it is a known stop word. |
java.util.Set<java.lang.String> | tokenize(java.lang.String text) Tokenizes a text and discards all stop words from it. |
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
stopWords
public java.util.Set<java.lang.String> tokenize(java.lang.String text) throws java.io.IOException
Description copied from interface: TextTokenizer
Tokenizes a text and discards all stop words from it.
Specified by: tokenize in interface TextTokenizer
Parameters: text - the text to tokenize
Throws: java.io.IOException - if a low-level I/O error occurs.
protected java.lang.String convertWord(java.lang.String word)
Converts a word to lower case and checks whether it is a known stop word. If it is, the word is discarded and is not considered a valid token.
Parameters: word - the word
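How convertWord and tokenize might fit together can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the actual implementation: the class name `StopWordTokenizerSketch`, the sample stop-word list, the rule of splitting on runs of non-letter characters, and the convention of returning `null` for a discarded word are all hypothetical. The documentation above only states that a stop word is discarded; it does not say how that is signaled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not extend the real BaseTextTokenizer.
public class StopWordTokenizerSketch {

    // Assumed sample stop words; the real list depends on the chosen language.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "the", "of"));

    // Lower-cases the word; returns null (an assumed convention) if it is a stop word.
    static String convertWord(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return STOP_WORDS.contains(lower) ? null : lower;
    }

    // Splits on runs of non-letter characters (an assumed rule) and keeps
    // every converted word that is not a stop word.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields a leading empty string if the text starts with a separator
            }
            String converted = convertWord(raw);
            if (converted != null) {
                tokens.add(converted);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The Cat and the Hat")); // prints [cat, hat]
    }
}
```

Because the result is a `Set`, repeated words collapse into a single token, which matches the `java.util.Set<java.lang.String>` return type documented for tokenize.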