Package org.predict4all.nlp.parser
Class Tokenizer
- java.lang.Object
-
- org.predict4all.nlp.parser.Tokenizer
-
public class Tokenizer extends java.lang.Object
Takes raw text and creates tokens from it. Tokens are the lowest-level text units, such as words and punctuation (spaces included).
The resulting tokens can then be used by other NLP tasks to generate or use data.
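To make the idea of "lowest text units" concrete, here is a minimal, self-contained sketch of a tokenizer that emits words, punctuation, and spaces as separate tokens. It is an illustration only and not the predict4all implementation: the class name `TokenizerSketch`, the use of plain `String` tokens instead of `Token`, and the splitting rule (runs of letters or digits form a word, every other character is its own token) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT the predict4all Tokenizer.
// Class name, String tokens, and splitting rule are assumptions.
final class TokenizerSketch {

    // Splits raw text into word tokens and single-character tokens
    // (punctuation, spaces), preserving every character of the input.
    static List<String> tokenize(String rawText) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < rawText.length()) {
            char c = rawText.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Group a run of letters/digits into a single word token.
                int start = i;
                while (i < rawText.length()
                        && Character.isLetterOrDigit(rawText.charAt(i))) {
                    i++;
                }
                tokens.add(rawText.substring(start, i));
            } else {
                // Any other character (space, punctuation) is its own token.
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print each token between brackets so spaces stay visible.
        for (String token : tokenize("Hello, world!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because spaces are kept as tokens, concatenating the tokens of this sketch reproduces the input exactly, which is one reason a tokenizer might treat separators as tokens rather than discard them.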
Field Summary
Fields
Modifier and Type | Field | Description
static java.text.DecimalFormat | TOKEN_COUNT_FORMAT |
-
Constructor Summary
Constructors
Constructor | Description
Tokenizer(LanguageModel languageModel) |
-
Method Summary
Methods
Modifier and Type | Method | Description
java.util.List<Token> | tokenize(java.lang.String rawText) |
java.util.List<TrainerTask> | tokenize(TrainingCorpus corpus) |
Constructor Detail
-
Tokenizer
public Tokenizer(LanguageModel languageModel)
-
-
Method Detail
-
tokenize
public java.util.List<Token> tokenize(java.lang.String rawText) throws java.io.IOException
- Throws:
java.io.IOException
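Since `tokenize(java.lang.String)` declares the checked `java.io.IOException`, every caller has to catch or propagate it. The sketch below shows one possible way to handle that when tokenizing several texts in a loop: wrap the checked exception in `java.io.UncheckedIOException`. The interface `RawTextTokenizer` is a hypothetical stand-in that only mirrors the shape of the method signature (with `String` tokens instead of `Token`) so the example compiles without the library; `tokenizeAll` and the whitespace splitter are likewise assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: shows handling of a checked IOException
// declared by a tokenize-style method. Not part of predict4all.
final class CheckedTokenizeSketch {

    // Hypothetical stand-in mirroring only the shape of tokenize(String):
    // it returns a list and declares IOException.
    interface RawTextTokenizer {
        List<String> tokenize(String rawText) throws IOException;
    }

    // Tokenizes each text in order; an IOException from the tokenizer is
    // rethrown as UncheckedIOException with the original as its cause.
    static List<List<String>> tokenizeAll(RawTextTokenizer tokenizer,
                                          List<String> texts) {
        List<List<String>> result = new ArrayList<>();
        for (String text : texts) {
            try {
                result.add(tokenizer.tokenize(text));
            } catch (IOException e) {
                throw new UncheckedIOException(
                        "Tokenization failed for: " + text, e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder tokenizer for the demo: splits on single spaces.
        RawTextTokenizer splitter = text -> Arrays.asList(text.split(" "));
        System.out.println(tokenizeAll(splitter, Arrays.asList("a b", "c")));
    }
}
```

Wrapping is only one option; a caller that can recover from the failure would typically catch `IOException` directly instead.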
-
tokenize
public java.util.List<TrainerTask> tokenize(TrainingCorpus corpus)