Package opennlp.tools.tokenize
Contains classes related to finding tokens or words in a string. All tokenizers implement the Tokenizer interface. Currently there are the learnable TokenizerME, the WhitespaceTokenizer, and the SimpleTokenizer, which is a character class tokenizer.
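
All three tokenizers can be driven through the shared Tokenizer interface. The following is a minimal usage sketch; the class name TokenizerSketch and the model path "en-token.bin" are assumptions (the model must be a pre-trained tokenizer model, which is not part of this package):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;

    public class TokenizerSketch {

        public static void main(String[] args) throws Exception {
            String text = "Mr. Smith isn't here!";

            // WhitespaceTokenizer splits on whitespace only.
            String[] ws = WhitespaceTokenizer.INSTANCE.tokenize(text);
            // -> [Mr., Smith, isn't, here!]

            // SimpleTokenizer splits wherever the character class changes.
            String[] simple = SimpleTokenizer.INSTANCE.tokenize(text);
            // -> [Mr, ., Smith, isn, ', t, here, !]

            // The learnable tokenizer needs a trained TokenizerModel.
            // "en-token.bin" is an assumed model path.
            try (InputStream modelIn = new FileInputStream("en-token.bin")) {
                Tokenizer learnable = new TokenizerME(new TokenizerModel(modelIn));
                String[] tokens = learnable.tokenize(text);
            }
        }
    }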
Class Summary:

DefaultTokenContextGenerator - Generates events for maxent decisions for tokenization.
Detokenizer - A Detokenizer merges tokens back to their untokenized representation.
Detokenizer.DetokenizationOperation - This enum contains an operation for every token to merge the tokens together to their detokenized form.
DetokenizerEvaluator - The DetokenizerEvaluator measures the performance of the given Detokenizer with the provided reference TokenSamples.
DictionaryDetokenizer - A rule-based detokenizer.
SimpleTokenizer - Performs tokenization using character classes.
TokenContextGenerator - Interface for TokenizerME context generators.
Tokenizer - The interface for tokenizers, which segment a string into its tokens.
TokenizerEvaluator - The TokenizerEvaluator measures the performance of the given Tokenizer with the provided reference TokenSamples.
TokenizerFactory - The factory that provides Tokenizer default implementations and resources.
TokenizerME - A Tokenizer for converting raw text into separated tokens.
TokenizerModel - The TokenizerModel is the model used by a learnable Tokenizer.
TokenizerStream - The TokenizerStream uses a tokenizer to tokenize the input string and output TokenSamples.
TokenSample - A TokenSample is text with token spans.
TokenSampleStream - This class is a stream filter which reads in string encoded samples and creates TokenSamples out of them.
TokSpanEventStream - This class reads the TokenSamples from the given Iterator and converts them into Events which can be used by the maxent library for training.
WhitespaceTokenizer - This tokenizer uses white spaces to tokenize the input text.
WhitespaceTokenStream - This stream formats TokenSamples into whitespace separated token strings.
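
The sample and stream classes above fit together in a training pipeline for the learnable tokenizer. The following is a minimal sketch, assuming a hypothetical file train.txt whose lines encode token boundaries with the default <SPLIT> separator understood by TokenSampleStream:

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TokenizerTrainingSketch {

        public static void main(String[] args) throws Exception {
            // train.txt (hypothetical): one sentence per line, with token
            // splits that are not already marked by whitespace encoded as
            // <SPLIT>, e.g. "Hello<SPLIT>, world<SPLIT>!"
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("train.txt")),
                    StandardCharsets.UTF_8);

            // TokenSampleStream turns each encoded line into a TokenSample;
            // during training the samples are converted into maxent events.
            try (ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {
                TokenizerModel model = TokenizerME.train(
                        samples,
                        new TokenizerFactory("en", null, true, null),
                        TrainingParameters.defaultParams());

                Tokenizer tokenizer = new TokenizerME(model);
                System.out.println(String.join(" | ",
                        tokenizer.tokenize("Hello, world!")));
            }
        }
    }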