org.apache.lucene.analysis.pattern
Class PatternTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.pattern.PatternTokenizer
All Implemented Interfaces:
java.io.Closeable

public final class PatternTokenizer
extends org.apache.lucene.analysis.Tokenizer

This tokenizer uses regular-expression matching to break the input stream into tokens. It takes two arguments: "pattern", the regular expression to apply, and "group", which selects the part of each match that becomes a token.

group=-1 (the default) is equivalent to "split": the pattern is treated as a delimiter, and the tokens are the same as the output of String.split(java.lang.String), except that empty tokens are dropped.
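
The split behaviour described above can be approximated with java.util.regex alone. The following is a minimal sketch, not the tokenizer's implementation: the class SplitModeSketch and the helper splitTokens are hypothetical names introduced for illustration, and the comma pattern and sample input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitModeSketch {

    // Hypothetical helper: imitates the documented group = -1 behaviour
    // (split on the pattern, then drop empty pieces). It does not use Lucene.
    static List<String> splitTokens(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<String>();
        // A limit of -1 keeps trailing empty strings, so the filter below
        // is the only place where empty pieces are discarded.
        for (String piece : pattern.split(input, -1)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Consecutive and trailing delimiters produce empty pieces, which are dropped.
        System.out.println(splitTokens(Pattern.compile(","), "a,,b,c,")); // prints [a, b, c]
    }
}
```

Only the token text is modelled here; offsets and other attributes that a TokenStream carries are outside the scope of this sketch.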

Using group >= 0 selects the matching group as the token. For example, if you have:

  pattern = \'([^\']+)\'
  group = 0
  input = aaa 'bbb' 'ccc'
 
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but group=1, the output will be bbb and ccc (no ' marks).
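
The group selection in this example can be reproduced with a plain java.util.regex.Matcher. This is a sketch under stated assumptions rather than the tokenizer's own code: GroupModeSketch and groupTokens are hypothetical names, and skipping empty or non-participating groups is this sketch's reading of the zero-length rule documented for this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupModeSketch {

    // Hypothetical helper: collects the text of the chosen group from every match.
    // group = 0 is the whole match; group >= 1 is a capturing group.
    static List<String> groupTokens(Pattern pattern, int group, String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String text = matcher.group(group);
            // Assumption: a group that did not participate (null) or matched
            // nothing is skipped, so no zero-length token is produced.
            if (text != null && !text.isEmpty()) {
                tokens.add(text);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same pattern as the example above, written as a Java string literal.
        Pattern pattern = Pattern.compile("\\'([^\\']+)\\'");
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(pattern, 0, input)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(pattern, 1, input)); // prints [bbb, ccc]
    }
}
```

Note that the unquoted text aaa never appears in either result, because with group >= 0 only text inside a match can become a token.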

NOTE: This Tokenizer does not output tokens that are of zero length.

See Also:
Pattern

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
PatternTokenizer(java.io.Reader input, java.util.regex.Pattern pattern, int group)
          Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).
 
Method Summary
 void end()
           
 boolean incrementToken()
           
 void reset(java.io.Reader input)
           
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
reset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

PatternTokenizer

public PatternTokenizer(java.io.Reader input,
                        java.util.regex.Pattern pattern,
                        int group)
                 throws java.io.IOException
Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).

Throws:
java.io.IOException

Method Detail

incrementToken

public boolean incrementToken()
                       throws java.io.IOException
Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
java.io.IOException

end

public void end()
         throws java.io.IOException
Overrides:
end in class org.apache.lucene.analysis.TokenStream
Throws:
java.io.IOException

reset

public void reset(java.io.Reader input)
           throws java.io.IOException
Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
Throws:
java.io.IOException