Package org.dizitart.no2.index.fulltext
Class BaseTextTokenizer
- java.lang.Object
  - org.dizitart.no2.index.fulltext.BaseTextTokenizer
- All Implemented Interfaces: TextTokenizer
- Direct Known Subclasses: EnglishTextTokenizer, UniversalTextTokenizer
public abstract class BaseTextTokenizer extends Object implements TextTokenizer
An abstract text tokenizer that splits a given string into tokens. It discards certain words, known as stop words, which depend on the chosen language.
- Since: 2.1.0
- Author: Anindya Chatterjee
Constructor Summary
Constructor: BaseTextTokenizer()
Method Summary
Modifier and Type: Set<String>
Method: tokenize(String text)
Description: Tokenizes a text and discards all stop words from it.
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.dizitart.no2.index.fulltext.TextTokenizer
getLanguage, stopWords
Method Detail
-
tokenize
public Set<String> tokenize(String text)
Description copied from interface: TextTokenizer
Tokenizes a text and discards all stop words from it.
- Specified by: tokenize in interface TextTokenizer
- Parameters: text - the text to tokenize
- Returns: the set of tokens.
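The following is a minimal, self-contained sketch of the tokenize-and-filter behavior described above. It does not use or reproduce the Nitrite classes: `TokenizerSketch`, `SketchTokenizer`, `TinyEnglishTokenizer`, the delimiter set, the lower-casing step, and the six-word stop list are all hypothetical stand-ins chosen for illustration. The library's actual splitting and case-handling rules may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

class TokenizerSketch {

    // Hypothetical stand-in for a base tokenizer; NOT the Nitrite BaseTextTokenizer.
    abstract static class SketchTokenizer {

        // Subclasses supply the language-specific stop words (assumed lower-case here).
        abstract Set<String> stopWords();

        // Splits the text on whitespace and basic punctuation (an assumption),
        // normalizes each word to lower case, and drops stop words.
        Set<String> tokenize(String text) {
            Set<String> tokens = new HashSet<>();
            if (text == null) {
                return tokens; // null input yields an empty set in this sketch
            }
            StringTokenizer st = new StringTokenizer(text, " \t\n\r\f.,;:!?\"'()");
            while (st.hasMoreTokens()) {
                String word = st.nextToken().toLowerCase(Locale.ROOT);
                if (!stopWords().contains(word)) {
                    tokens.add(word);
                }
            }
            return tokens;
        }
    }

    // Hypothetical subclass with a deliberately tiny English stop-word list.
    static class TinyEnglishTokenizer extends SketchTokenizer {
        private static final Set<String> STOP =
                new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

        @Override
        Set<String> stopWords() {
            return STOP;
        }
    }

    public static void main(String[] args) {
        Set<String> tokens =
                new TinyEnglishTokenizer().tokenize("The quick fox is a friend of the dog.");
        // The result is an unordered set containing: quick, fox, friend, dog
        System.out.println(tokens);
    }
}
```

Because the return type is a `Set<String>`, duplicate words collapse into a single token and the iteration order is unspecified, so callers should not rely on token position or frequency.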