opennlp.tools.tokenize.TokenizerME

All Implemented Interfaces: Tokenizer

public class TokenizerME extends Object

A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage: http://www.cis.upenn.edu/~jcreynar.

This tokenizer needs a statistical model to tokenize text. The model reproduces the tokenization observed in the training data used to create it. The TokenizerModel class encapsulates the model and provides methods to create it from its binary representation.

A tokenizer instance is not thread safe. Each thread must instantiate its own tokenizer; all of them can share one TokenizerModel instance to save memory.
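The one-tokenizer-per-thread rule can be sketched with a ThreadLocal. The snippet below is a minimal, library-independent illustration, not part of OpenNLP: the class name PerThread and its type parameters are hypothetical, and in main a String and a StringBuilder merely stand in for the shared model and the per-thread tokenizer.

```java
import java.util.function.Function;

/**
 * Holds one shared, read-only model and hands each thread its own worker,
 * created lazily from that model on the thread's first call to get().
 * Hypothetical helper for illustration; not an OpenNLP class.
 */
final class PerThread<M, T> {
    private final ThreadLocal<T> local;

    PerThread(M sharedModel, Function<M, T> workerFactory) {
        // The factory runs once per thread, always with the same shared model.
        this.local = ThreadLocal.withInitial(() -> workerFactory.apply(sharedModel));
    }

    /** Returns the calling thread's own worker instance. */
    T get() {
        return local.get();
    }

    public static void main(String[] args) {
        // Stand-ins: a String plays the model, a StringBuilder plays the tokenizer.
        PerThread<String, StringBuilder> holder =
                new PerThread<String, StringBuilder>("shared-model", s -> new StringBuilder(s));
        // Repeated calls on one thread return the same worker instance.
        System.out.println(holder.get() == holder.get());
    }
}
```

With OpenNLP on the classpath, the same holder could presumably be created as new PerThread<TokenizerModel, TokenizerME>(model, m -> new TokenizerME(m)), so that every thread calls get().tokenize(...) on its own instance while the model is loaded only once.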

To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method can be used.

Sample usage:

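    import java.io.InputStream;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    InputStream modelIn;
    ...
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");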

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final Pattern | alphaNumeric | Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String) |
| static final String | NO_SPLIT | Constant indicates no token split. |
| static final String | SPLIT | Constant indicates a token split. |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| TokenizerME(TokenizerModel model) | |
| TokenizerME(TokenizerModel model, Factory factory) | Deprecated. Use TokenizerFactory to extend the Tokenizer functionality. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| double[] | getTokenProbabilities() | Returns the probabilities associated with the most recent calls to Tokenizer.tokenize(String) or tokenizePos(String). |
| String[] | tokenize(String s) | Splits a string into its atomic parts |
| Span[] | tokenizePos(String d) | Tokenizes the string. |
| static TokenizerModel | train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) | Trains a model for the TokenizerME. |
| boolean | useAlphaNumericOptimization() | Returns the value of the alpha-numeric optimization flag. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- SPLIT
  
  public static final String SPLIT
  
  Constant indicates a token split.
  See Also:
  
  Constant Field Values
- NO_SPLIT
  
  public static final String NO_SPLIT
  
  Constant indicates no token split.
  See Also:
  
  Constant Field Values
- alphaNumeric
  
  @Deprecated public static final Pattern alphaNumeric
  
  Deprecated.
  As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
  
  Alpha-Numeric Pattern
Constructor Details
- TokenizerME
  
  public TokenizerME(TokenizerModel model)
- TokenizerME
  
  public TokenizerME(TokenizerModel model, Factory factory)
  
  Deprecated.
  use TokenizerFactory to extend the Tokenizer functionality
Method Details
- getTokenProbabilities
  
  public double[] getTokenProbabilities()
  
  Returns the probabilities associated with the most recent calls to Tokenizer.tokenize(String) or tokenizePos(String).
  
  Returns:
  
  The probability for each token returned by the most recent call to tokenize. If not applicable, an empty array is returned.
- tokenizePos
  
  public Span[] tokenizePos(String d)
  
  Tokenizes the string.
  
  Parameters:
  
  d - The string to be tokenized.
  
  Returns:
  
  A span array containing individual tokens as elements.
- train
  
  public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
  
  Trains a model for the TokenizerME.
  
  Parameters:
  
  samples - the samples used for the training.
  
  factory - a TokenizerFactory to get resources from
  
  mlParams - the machine learning train parameters
  
  Returns:
  
  the trained TokenizerModel
  
  Throws:
  
  IOException - if an I/O operation on a temporary file created during training fails, or if reading from the ObjectStream fails.
- useAlphaNumericOptimization
  
  public boolean useAlphaNumericOptimization()
  
  Returns the value of the alpha-numeric optimization flag.
  
  Returns:
  
  true if the tokenizer should use alpha-numeric optimization, false otherwise.
- tokenize
  
  public String[] tokenize(String s)
  
  Description copied from interface: Tokenizer
  
  Splits a string into its atomic parts
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Parameters:
  
  s - The string to be tokenized.
  
  Returns:
  
  The String[] with the individual tokens as the array elements.
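The relationship between tokenizePos, tokenize and getTokenProbabilities described above can be illustrated without the library. The sketch below makes several assumptions: Offsets is a hypothetical stand-in for a token span (start inclusive, end exclusive), the class and method names are invented for illustration, and the spans and probabilities in main are hand-written sample values, not output of a real model. It shows how span offsets map back to token strings and how a parallel probability array could be used to flag uncertain tokens.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenReport {
    /** Hypothetical stand-in for a token span: start inclusive, end exclusive. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Cuts the covered text out of the original string, one entry per span. */
    static String[] coveredText(String text, Offsets[] spans) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = text.substring(spans[i].start, spans[i].end);
        }
        return tokens;
    }

    /** Returns the tokens whose probability lies below the given threshold. */
    static List<String> lowConfidence(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("tokens and probs must have the same length");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String text = "A sentence to be tokenized.";
        // Hand-written offsets and probabilities, for illustration only.
        Offsets[] spans = {
            new Offsets(0, 1), new Offsets(2, 10), new Offsets(11, 13),
            new Offsets(14, 16), new Offsets(17, 26), new Offsets(26, 27)
        };
        double[] probs = {1.0, 1.0, 1.0, 1.0, 1.0, 0.62};

        String[] tokens = coveredText(text, spans);
        System.out.println(String.join("|", tokens));
        System.out.println(lowConfidence(tokens, probs, 0.9));
    }
}
```

With OpenNLP itself, the stand-in should not be needed: the Span objects returned by tokenizePos carry the start and end offsets, and the array from getTokenProbabilities would take the place of the sample values.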
