Package org.apache.lucene.analysis.ngram
Class EdgeNGramTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.ngram.NGramTokenizer
org.apache.lucene.analysis.ngram.EdgeNGramTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
Tokenizes the input from an edge into n-grams of given size(s).
This Tokenizer creates n-grams from the beginning edge or ending edge of an input token.
As of Lucene 4.4, this tokenizer
- can handle maxGram larger than 1024 chars, but beware that this will result in increased memory usage,
- doesn't trim the input,
- sets position increments equal to 1 instead of 1 for the first token and 0 for all other ones,
- doesn't support backward n-grams anymore,
- supports pre-tokenization,
- correctly handles supplementary characters.
Although highly discouraged, it is still possible to use the old behavior through Lucene43EdgeNGramTokenizer.
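For illustration, a minimal usage sketch, assuming Lucene 4.4 is on the classpath; the match version Version.LUCENE_44, the sample text "apple", and the gram sizes 1..3 are illustrative choices, not part of this class's contract:

import java.io.StringReader;

import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class EdgeNGramDemo {
  public static void main(String[] args) throws Exception {
    // Front-edge n-grams of sizes 1 to 3 over the word "apple".
    EdgeNGramTokenizer tokenizer =
        new EdgeNGramTokenizer(Version.LUCENE_44, new StringReader("apple"), 1, 3);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // expected output: a, ap, app
    }
    tokenizer.end();
    tokenizer.close();
  }
}

Each n-gram is emitted with a position increment of 1, per the behavior change listed above.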
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
Field Summary
Fields
Modifier and Type    Field
static final int     DEFAULT_MAX_GRAM_SIZE
static final int     DEFAULT_MIN_GRAM_SIZE
Fields inherited from class org.apache.lucene.analysis.ngram.NGramTokenizer
DEFAULT_MAX_NGRAM_SIZE, DEFAULT_MIN_NGRAM_SIZE
Constructor Summary
Constructors
EdgeNGramTokenizer(Version version, Reader input, int minGram, int maxGram)
    Creates EdgeNGramTokenizer that can generate n-grams in the sizes of the given range
EdgeNGramTokenizer(Version version, AttributeSource.AttributeFactory factory, Reader input, int minGram, int maxGram)
    Creates EdgeNGramTokenizer that can generate n-grams in the sizes of the given range
Method Summary
Methods inherited from class org.apache.lucene.analysis.ngram.NGramTokenizer
end, incrementToken, reset
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Field Details
DEFAULT_MAX_GRAM_SIZE
public static final int DEFAULT_MAX_GRAM_SIZE
See Also:
Constant Field Values
DEFAULT_MIN_GRAM_SIZE
public static final int DEFAULT_MIN_GRAM_SIZE
See Also:
Constant Field Values
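A short sketch of constructing the tokenizer with these default gram sizes; the Version.LUCENE_44 match version and the Reader argument are assumptions for illustration:

import java.io.Reader;

import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.util.Version;

public class DefaultSizesExample {
  public static EdgeNGramTokenizer withDefaults(Reader input) {
    // Use the class's own default gram sizes instead of hard-coded numbers.
    return new EdgeNGramTokenizer(
        Version.LUCENE_44,
        input,
        EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE,
        EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE);
  }
}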
Constructor Details
EdgeNGramTokenizer
public EdgeNGramTokenizer(Version version, Reader input, int minGram, int maxGram)
Creates EdgeNGramTokenizer that can generate n-grams in the sizes of the given range
Parameters:
version - the Lucene match version
input - Reader holding the input to be tokenized
minGram - the smallest n-gram to generate
maxGram - the largest n-gram to generate
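Because the tokenizer implements AutoCloseable (see the interface list above), it can be managed with try-with-resources; a sketch with illustrative arguments (the sample text "tokenize" and sizes 2..4 are assumptions):

import java.io.StringReader;

import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ConstructorExample {
  public static void main(String[] args) throws Exception {
    // Front-edge n-grams of sizes 2 to 4; the tokenizer is closed automatically.
    try (EdgeNGramTokenizer tokenizer =
             new EdgeNGramTokenizer(Version.LUCENE_44, new StringReader("tokenize"), 2, 4)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term.toString()); // expected output: to, tok, toke
      }
      tokenizer.end();
    }
  }
}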
EdgeNGramTokenizer
public EdgeNGramTokenizer(Version version, AttributeSource.AttributeFactory factory, Reader input, int minGram, int maxGram)
Creates EdgeNGramTokenizer that can generate n-grams in the sizes of the given range
Parameters:
version - the Lucene match version
factory - AttributeSource.AttributeFactory to use
input - Reader holding the input to be tokenized
minGram - the smallest n-gram to generate
maxGram - the largest n-gram to generate
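A sketch of the factory-taking variant, here simply passing the default attribute factory; the sample text and gram sizes are illustrative assumptions:

import java.io.StringReader;

import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.Version;

public class FactoryConstructorExample {
  public static EdgeNGramTokenizer create() {
    // The AttributeSource.AttributeFactory controls how attribute instances
    // (e.g. CharTermAttribute) are instantiated for this token stream.
    return new EdgeNGramTokenizer(
        Version.LUCENE_44,
        AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
        new StringReader("lucene"),
        1, 2);
  }
}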