org.apache.lucene.analysis.pattern
Class PatternTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.pattern.PatternTokenizer
- All Implemented Interfaces:
- java.io.Closeable
public final class PatternTokenizer
- extends org.apache.lucene.analysis.Tokenizer
This tokenizer uses regular-expression pattern matching to break the input stream
into distinct tokens. It takes two arguments, "pattern" and "group":
- "pattern" is the regular expression.
- "group" says which group to extract into tokens.
group=-1 (the default) is equivalent to "split". In this case, the tokens are
the same as the output of the following call, except that empty tokens are omitted:
String.split(java.lang.String)
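As a rough illustration of the split behaviour described above, the sketch below uses only java.util.regex and does not call Lucene. The class name SplitModeSketch and the helper splitTokens are hypothetical names chosen for this example; the code approximates the documented group=-1 semantics (split on the pattern, drop empty pieces) and is not the tokenizer's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class SplitModeSketch {

    // Approximates group = -1: split the input on the pattern
    // and discard any zero-length pieces.
    static List<String> splitTokens(String input, Pattern pattern) {
        List<String> tokens = new ArrayList<>();
        for (String piece : pattern.split(input)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The leading comma and the doubled comma both yield empty pieces,
        // which are filtered out.
        System.out.println(splitTokens(",aaa,,bbb,ccc", Pattern.compile(",")));
        // prints [aaa, bbb, ccc]
    }
}
```

Filtering after the split keeps the sketch close to the wording of the description: the result is what split would return, minus the empty strings.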
Using group >= 0 emits the text of that group from each match as a token; group 0 is the entire match. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input
but using group=1, the output will be: bbb and ccc (no ' marks).
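The group example above can be reproduced with plain java.util.regex, as in the following sketch. It is a stand-in for the documented behaviour, not the Lucene code path: GroupModeSketch and groupTokens are hypothetical names, and the pattern is written as '([^']+)', which matches the same text as the escaped form shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class GroupModeSketch {

    // Approximates group >= 0: for every match, take the text of the
    // requested group and skip groups that are absent or zero-length.
    static List<String> groupTokens(String input, Pattern pattern, int group) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(input, quoted, 0)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(input, quoted, 1)); // prints [bbb, ccc]
    }
}
```

Note that the unquoted text aaa never appears in the output, because in group mode only matched regions produce tokens.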
NOTE: This Tokenizer does not output tokens that are of zero length.
- See Also:
Pattern
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Constructor Summary
PatternTokenizer(java.io.Reader input,
java.util.regex.Pattern pattern,
int group)
Creates a new PatternTokenizer returning tokens from group (-1 for split functionality)
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
Methods inherited from class org.apache.lucene.analysis.TokenStream
reset
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
PatternTokenizer
public PatternTokenizer(java.io.Reader input,
java.util.regex.Pattern pattern,
int group)
throws java.io.IOException
- Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).
- Throws:
java.io.IOException
incrementToken
public boolean incrementToken()
throws java.io.IOException
- Specified by:
incrementToken
in class org.apache.lucene.analysis.TokenStream
- Throws:
java.io.IOException
end
public void end()
throws java.io.IOException
- Overrides:
end
in class org.apache.lucene.analysis.TokenStream
- Throws:
java.io.IOException
reset
public void reset(java.io.Reader input)
throws java.io.IOException
- Overrides:
reset
in class org.apache.lucene.analysis.Tokenizer
- Throws:
java.io.IOException