SimpleTokenizer (linguistics 8.221.29 API)

java.lang.Object

com.yahoo.language.simple.SimpleTokenizer

All Implemented Interfaces: Tokenizer

public class SimpleTokenizer extends Object implements Tokenizer

A tokenizer that splits on whitespace, normalizes and transforms tokens using the given implementations, and stems using the kstem algorithm.

This class is not thread-safe.
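Since an instance must not be shared between threads, one common way to use such a class from concurrent code is to keep one instance per thread. The following is a minimal sketch of that pattern using only the JDK. `StandInTokenizer`, `PerThreadTokenizerSketch` and `current()` are hypothetical names introduced for illustration; in real code the `ThreadLocal` would presumably hold a `SimpleTokenizer` instead of the stand-in.

```java
import java.util.ArrayList;
import java.util.List;

class PerThreadTokenizerSketch {

    /** Hypothetical stand-in for a tokenizer that keeps mutable state and must not be shared. */
    static final class StandInTokenizer {
        private final List<String> buffer = new ArrayList<>(); // mutable scratch state

        List<String> tokenize(String input) {
            buffer.clear();
            for (String piece : input.trim().split("\\s+")) {
                if (!piece.isEmpty()) buffer.add(piece);
            }
            return new ArrayList<>(buffer); // copy, so callers never hold the scratch buffer
        }
    }

    // One instance per thread, created lazily on first use in that thread.
    private static final ThreadLocal<StandInTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StandInTokenizer::new);

    /** Returns the instance owned by the calling thread. */
    static StandInTokenizer current() {
        return TOKENIZER.get();
    }

    public static void main(String[] args) throws InterruptedException {
        StandInTokenizer mine = current();
        StandInTokenizer[] other = new StandInTokenizer[1];
        Thread worker = new Thread(() -> other[0] = current());
        worker.start();
        worker.join();
        System.out.println(mine == current());  // same thread reuses its instance
        System.out.println(mine != other[0]);   // another thread gets its own instance
    }
}
```

Creating a fresh instance per request is an equally valid choice when construction is cheap; the per-thread variant only avoids repeated construction.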

Author: Mathias Mølster Lidal, bratseth

Constructor Summary

- SimpleTokenizer()
- SimpleTokenizer(Normalizer normalizer)
- SimpleTokenizer(Normalizer normalizer, Transformer transformer)
- SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)

Method Summary

- Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
  
  Tokenize the input, applying this tokenizer's transform to each token string.
- Iterable<Token> tokenize(String input, Function<String,String> tokenProcessor)
  
  Tokenize the input and apply the given transform to each token string.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- SimpleTokenizer
  
  public SimpleTokenizer()
- SimpleTokenizer
  
  public SimpleTokenizer(Normalizer normalizer)
- SimpleTokenizer
  
  public SimpleTokenizer(Normalizer normalizer, Transformer transformer)
- SimpleTokenizer
  
  public SimpleTokenizer(Normalizer normalizer, Transformer transformer, SpecialTokenRegistry specialTokenRegistry)
Method Details
- tokenize
  
  public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
  
  Tokenize the input, applying this tokenizer's transform to each token string.
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Parameters:
  
  input - the string to tokenize. May be arbitrarily large.
  
  language - the language of the input string.
  
  stemMode - the stem mode applied to the returned tokens
  
  removeAccents - if true, accents and similar are removed from the returned tokens
  
  Returns:
  
  the tokens of the input String.
- tokenize
  
  public Iterable<Token> tokenize(String input, Function<String,String> tokenProcessor)
  
  Tokenize the input and apply the given transform to each token string.
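To make the contract of the `tokenProcessor` overload concrete, here is a minimal JDK-only sketch of "split on whitespace, then apply a function to each token string". It is not the library's implementation: `TokenProcessorSketch`, its `tokenize` helper and `stripAccents` are hypothetical names, the result is a plain `List<String>` rather than `Iterable<Token>`, and the accent removal shown (NFD decomposition followed by dropping combining marks) is only one possible reading of what `removeAccents` does.

```java
import java.text.Normalizer; // the JDK class, unrelated to the linguistics Normalizer interface
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

class TokenProcessorSketch {

    /** Splits the input on whitespace and applies tokenProcessor to each token string. */
    static List<String> tokenize(String input, Function<String, String> tokenProcessor) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (piece.isEmpty()) continue; // empty or all-whitespace input yields no tokens
            tokens.add(tokenProcessor.apply(piece));
        }
        return tokens;
    }

    /** Illustrative accent removal: decompose to NFD, then drop combining marks. */
    static String stripAccents(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        Function<String, String> lowercase = s -> s.toLowerCase(Locale.ROOT);

        // prints [hello, tokenizer, world]
        System.out.println(tokenize("  Hello  Tokenizer World ", lowercase));

        // Processors compose: lowercase first, then strip accents. Prints [cafe, creme]
        System.out.println(tokenize("Caf\u00e9 cr\u00e8me",
                lowercase.andThen(TokenProcessorSketch::stripAccents)));
    }
}
```

Passing the per-token transform as a `Function<String,String>` keeps the splitting logic independent of the processing applied to each token, so callers can chain steps with `andThen` instead of adding more boolean flags.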