Class LetterTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
ArabicLetterTokenizer, LowerCaseTokenizer

public class LetterTokenizer extends CharTokenizer
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate.
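
The splitting rule described here can be illustrated without Lucene. The following is a minimal sketch in plain Java, not the library's implementation: the class LetterSplitSketch and the method letterTokens are hypothetical names, and the sketch ignores offsets, attributes, and any token length limit the real tokenizer may apply.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {
    // Hypothetical helper, not part of Lucene: collects maximal runs of
    // adjacent letters, using Character.isLetter(int) as the predicate.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                // Still inside a run of letters: extend the current token.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation, whitespace, and digits all act as separators.
        System.out.println(letterTokens("Hello, world! It's 2024."));
        // prints [Hello, world, It, s]
    }
}
```

Because digits and apostrophes are not letters, "It's" splits into two tokens and "2024" yields none. Under the same rule, a run of letters with no separators, such as a Japanese sentence written without spaces, comes out as a single token.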

Note: this works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.

You must specify the required Version compatibility when creating LetterTokenizer:

  • As of 3.1, CharTokenizer uses an int based API to normalize and detect token characters. See CharTokenizer.isTokenChar(int) and CharTokenizer.normalize(int) for details.
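
To show why an int based API matters, the sketch below compares char based and code point based checks on a supplementary character, which occupies two char values in a Java String. It uses only java.lang.Character. The class CodePointSketch and its methods isTokenChar and normalize are hypothetical stand-ins that mirror the method names above; they are not the CharTokenizer implementation, and the lowercasing in normalize is only an assumption about what a lowercasing subclass might do.

```java
public class CodePointSketch {
    // Hypothetical stand-in for an int based token character check.
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Hypothetical stand-in for an int based normalization step
    // (lowercasing is chosen here purely for illustration).
    static int normalize(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I lies outside the Basic
        // Multilingual Plane, so it is stored as a surrogate pair.
        String s = new String(Character.toChars(0x10400));

        // Char based view: each surrogate on its own is not a letter.
        System.out.println(Character.isLetter(s.charAt(0))); // prints false
        System.out.println(Character.isLetter(s.charAt(1))); // prints false

        // Int based view: the whole code point is a letter.
        int cp = s.codePointAt(0);
        System.out.println(isTokenChar(cp)); // prints true

        // Normalization also has to work on the whole code point.
        System.out.println(Integer.toHexString(normalize(cp))); // prints 10428
    }
}
```

A char based predicate would treat the two surrogates as separators and drop the character entirely, whereas the int based check keeps it inside a token.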

  • Constructor Details

    • LetterTokenizer

      public LetterTokenizer(Version matchVersion, Reader in)
      Construct a new LetterTokenizer.
      Parameters:
      matchVersion - Lucene version to match. See above.
      in - the input to split up into tokens
    • LetterTokenizer

      public LetterTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader in)
      Construct a new LetterTokenizer using a given AttributeSource.AttributeFactory.
      Parameters:
      matchVersion - Lucene version to match. See above.
      factory - the attribute factory to use for this Tokenizer
      in - the input to split up into tokens