BaseTextTokenizer (Nitrite 3.1.0 API)


java.lang.Object
- org.dizitart.no2.fulltext.BaseTextTokenizer

All Implemented Interfaces:

TextTokenizer

Direct Known Subclasses:

UniversalTextTokenizer
```
public abstract class BaseTextTokenizer
extends java.lang.Object
implements TextTokenizer
```
An abstract text tokenizer that tokenizes a given string. It discards certain words, known as stop words, depending on the language chosen.

Since:

2.1.0

Constructor Summary

Constructors
Constructor and Description

BaseTextTokenizer()

Method Summary

Modifier and Type	Method and Description
`protected java.lang.String`	`convertWord(java.lang.String word)` Converts a `word` into all lower case and checks if it is a known stop word.
`java.util.Set<java.lang.String>`	`tokenize(java.lang.String text)` Tokenizes a `text` and discards all stop words from it.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.dizitart.no2.fulltext.TextTokenizer
stopWords

- Constructor Detail
  - BaseTextTokenizer
```
public BaseTextTokenizer()
```
- Method Detail
  - tokenize
```
public java.util.Set<java.lang.String> tokenize(java.lang.String text)
                                         throws java.io.IOException
```
    Description copied from interface: TextTokenizer
    
    Tokenizes a text and discards all stop words from it.
    
    Specified by:
    
    tokenize in interface TextTokenizer
    
    Parameters:
    
    text - the text to tokenize
    
    Returns:
    
    the set of tokens.
    
    Throws:
    
    java.io.IOException - if a low-level I/O error occurs.
  - convertWord
```
protected java.lang.String convertWord(java.lang.String word)
```
    Converts a word into all lower case and checks if it is a known stop word. If it is, then the word will be discarded and will not be considered as a valid token.
    
    Parameters:
    
    word - the word
    
    Returns:
    
    the tokenized word in all lower case.
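
The behaviour described for `tokenize` and `convertWord` can be illustrated with a small self-contained sketch. It does not use Nitrite's classes: `StopWordTokenizerSketch`, its constructor, the word-splitting regular expression, and the use of `null` to signal a discarded stop word are all assumptions made for illustration, not the library's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a stop-word-aware tokenizer; NOT the Nitrite class.
class StopWordTokenizerSketch {
    private final Set<String> stopWords;

    StopWordTokenizerSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Lower-cases the word and returns null if it is a stop word.
    // Returning null for a discarded word is an assumption of this sketch.
    String convertWord(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return stopWords.contains(lower) ? null : lower;
    }

    // Splits the text on runs of non-letter, non-digit characters (an assumed
    // splitting rule) and keeps every converted word that was not discarded.
    Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter yields an empty first element
            }
            String token = convertWord(raw);
            if (token != null) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With the stop words `the`, `a`, and `of`, calling `tokenize("The Lord of the Rings")` on this sketch yields the set `[lord, rings]`: each word is lower-cased first, so `The` and `the` are both matched against the stop-word list.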
