Class TokenizerME
- All Implemented Interfaces:
Tokenizer
This tokenizer requires a statistical model to tokenize text. It reproduces
the tokenization observed in the training data used to create the model.
The TokenizerModel class encapsulates the model and provides
methods to create it from its binary representation.
A tokenizer instance is not thread safe. Each thread must instantiate its
own tokenizer; these instances can share a single TokenizerModel instance
to save memory.
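The one-tokenizer-per-thread rule can be sketched with a ThreadLocal. This is a minimal sketch, not OpenNLP code: SharedModel and StatefulTokenizer are hypothetical stand-ins for TokenizerModel and TokenizerME so that the example compiles with the JDK alone, and the whitespace split is only a placeholder for the statistical tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadTokenizerSketch {

    // Hypothetical stand-in for TokenizerModel: immutable, so safe to share.
    static final class SharedModel {
        final String delimiterRegex;
        SharedModel(String delimiterRegex) { this.delimiterRegex = delimiterRegex; }
    }

    // Hypothetical stand-in for TokenizerME: it keeps state from the last
    // call, which is why one instance must not be used by several threads.
    static final class StatefulTokenizer {
        private final SharedModel model;
        private String[] lastTokens = new String[0];
        StatefulTokenizer(SharedModel model) { this.model = model; }
        String[] tokenize(String text) {
            lastTokens = text.trim().split(model.delimiterRegex);
            return lastTokens;
        }
    }

    // One model instance for the whole application.
    static final SharedModel MODEL = new SharedModel("\\s+");

    // One tokenizer per thread, each wrapping the same shared model.
    static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(MODEL));

    // Tokenizes every text on a thread pool; results keep the input order.
    static List<String[]> tokenizeAll(List<String> texts, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String[]>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<String[]> task = () -> TOKENIZER.get().tokenize(text);
                futures.add(pool.submit(task));
            }
            List<String[]> results = new ArrayList<>();
            for (Future<String[]> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real classes, the ThreadLocal initializer would presumably construct a new TokenizerME around the shared TokenizerModel instead of the stand-ins.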
To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters)
method can be used.
Sample usage:
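import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
InputStream modelIn;
...
// Load the model from its binary representation, then wrap it in a tokenizer.
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");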
-
Field Summary
Fields: SPLIT, NO_SPLIT, alphaNumeric (deprecated)
-
Constructor Summary
Constructors
- TokenizerME(TokenizerModel model)
- TokenizerME(TokenizerModel model, Factory factory): Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
-
Method Summary
Modifier and Type, Method, Description
- double[] getTokenProbabilities(): Returns the probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String).
- String[] tokenize(String): Splits a string into its atomic parts.
- Span[] tokenizePos(String d): Tokenizes the string.
- static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams): Trains a model for the TokenizerME.
- boolean useAlphaNumericOptimization(): Returns the value of the alpha-numeric optimization flag.
-
Field Details
-
SPLIT
Constant indicating a token split.
-
NO_SPLIT
Constant indicating no token split.
-
alphaNumeric
Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String).
Alpha-numeric pattern.
-
-
Constructor Details
-
TokenizerME
-
TokenizerME
Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
-
-
Method Details
-
getTokenProbabilities
public double[] getTokenProbabilities()
Returns the probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String).
- Returns:
- The probability for each token returned by the most recent call to tokenize. If not applicable, an empty array is returned.
-
tokenizePos
Tokenizes the string.
- Parameters:
d - The string to be tokenized.
- Returns:
- A span array containing individual tokens as elements.
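To illustrate how a span array relates to token strings, here is a minimal sketch that needs only the JDK. Offsets is a hypothetical stand-in for Span, assumed to hold a start offset and an exclusive end offset into the input, and the whitespace scanner is only a placeholder for the model-driven split decisions made by TokenizerME.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanOffsetsSketch {

    // Hypothetical stand-in for Span: the half-open character range [start, end).
    static final class Offsets {
        final int start;
        final int end;
        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Placeholder for tokenizePos: returns one range per run of
    // non-whitespace characters. This is NOT the statistical algorithm.
    static List<Offsets> tokenizePos(String text) {
        List<Offsets> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && start < 0) {
                start = i;                        // a token begins here
            } else if (whitespace && start >= 0) {
                spans.add(new Offsets(start, i)); // the token ended before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new Offsets(start, text.length()));
        }
        return spans;
    }

    // Recovers the token strings from the offsets, which is what a
    // String[]-returning tokenize method would hand back.
    static String[] coveredText(String text, List<Offsets> spans) {
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            Offsets span = spans.get(i);
            tokens[i] = text.substring(span.start, span.end);
        }
        return tokens;
    }
}
```

Keeping offsets rather than strings preserves each token's position in the original input, which is useful when annotations have to be mapped back onto the source text.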
-
train
public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
Trains a model for the TokenizerME.
- Parameters:
samples - The samples used for the training.
factory - A TokenizerFactory to get resources from.
mlParams - The machine learning training parameters.
- Returns:
- The trained TokenizerModel.
- Throws:
IOException - If an IOException is thrown during IO operations on a temp file created during training, or if reading from the ObjectStream fails.
-
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
Returns the value of the alpha-numeric optimization flag.
- Returns:
- true if the tokenizer should use alpha-numeric optimization, false otherwise.
-
tokenize
Description copied from interface: Tokenizer
Splits a string into its atomic parts.