org.apache.solr.analysis
Class PatternTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.solr.analysis.PatternTokenizer
All Implemented Interfaces:
Closeable

public final class PatternTokenizer
extends Tokenizer

This tokenizer uses regular-expression pattern matching to break the input stream into distinct tokens. It takes two arguments: "pattern" and "group".

group=-1 (the default) is equivalent to "split". In this case, the tokens are the same as the output of String.split(java.lang.String), with empty tokens omitted.
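The split behavior described above can be approximated with java.util.regex alone. The following is a minimal sketch, not the tokenizer's actual implementation; the class SplitModeSketch and the helper splitTokens are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; it does not use PatternTokenizer itself.
class SplitModeSketch {

    // Approximates group = -1: split on the pattern, then drop empty pieces.
    static List<String> splitTokens(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : pattern.split(input)) {
            // Pattern.split already drops trailing empty strings;
            // this check removes the remaining zero-length pieces.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern commas = Pattern.compile(",");
        // prints [aaa, bbb, ccc]
        System.out.println(splitTokens(commas, "aaa,,bbb,ccc,"));
    }
}
```

The doubled comma and the trailing comma would each yield an empty string under a plain split; filtering them out mirrors the "without empty tokens" rule.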

Using group >= 0 selects that capture group of each match as the token (group 0 is the entire match). For example, if you have:

  pattern = \'([^\']+)\'
  group = 0
  input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be bbb and ccc (no ' marks).
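The group example above can be reproduced with a plain java.util.regex.Matcher. This is a sketch under the assumption that each successful match contributes one token. The class GroupModeSketch and the helper groupTokens are hypothetical names, and the pattern is written without the backslashes because ' needs no escaping in a Java regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; it does not use PatternTokenizer itself.
class GroupModeSketch {

    // Approximates group >= 0: one token per match, taken from the given group.
    static List<String> groupTokens(Pattern pattern, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            // Skip groups that did not participate and zero-length matches.
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        String input = "aaa 'bbb' 'ccc'";
        // prints ['bbb', 'ccc']
        System.out.println(groupTokens(quoted, 0, input));
        // prints [bbb, ccc]
        System.out.println(groupTokens(quoted, 1, input));
    }
}
```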

NOTE: This Tokenizer does not output tokens that are of zero length.

Version:
$Id: PatternTokenizer.java 940806 2010-05-04 11:18:46Z uschindler $
See Also:
Pattern

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
PatternTokenizer(Reader input, Pattern pattern, int group)
          Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).
 
Method Summary
 void end()
           
 boolean incrementToken()
           
 void reset(Reader input)
           
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
reset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

PatternTokenizer

public PatternTokenizer(Reader input,
                        Pattern pattern,
                        int group)
                 throws IOException
Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).

Throws:
IOException
Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException
Specified by:
incrementToken in class TokenStream
Throws:
IOException

end

public void end()
         throws IOException
Overrides:
end in class TokenStream
Throws:
IOException

reset

public void reset(Reader input)
           throws IOException
Overrides:
reset in class Tokenizer
Throws:
IOException