public class DictionaryCompoundWordTokenFilter
extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a brute-force algorithm to achieve this.

You must specify the required Version compatibility when creating CompoundWordTokenFilterBase.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource:
AttributeSource.AttributeFactory, AttributeSource.State

Field Summary

Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase:
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE
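The decomposition described above can be sketched as follows. This is a minimal example, assuming a Lucene 4.x classpath (lucene-core plus lucene-analyzers-common); the `Version.LUCENE_47` constant and the class name `CompoundDemo` are illustrative choices, not part of this API page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CompoundDemo {

    // Decompose the given text against a dictionary of known word parts.
    static List<String> decompose(String text, CharArraySet dictionary) throws Exception {
        List<String> tokens = new ArrayList<>();
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_47, new StringReader(text));
        ts = new DictionaryCompoundWordTokenFilter(Version.LUCENE_47, ts, dictionary);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                         // mandatory before consuming the stream
        while (ts.incrementToken()) {
            tokens.add(term.toString());
        }
        ts.end();
        ts.close();
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        // Dictionary of word parts; ignoreCase = true so "Donau" also matches "donau".
        CharArraySet dict = new CharArraySet(Version.LUCENE_47,
                Arrays.asList("Donau", "Dampf", "Schiff"), true);
        System.out.println(decompose("Donaudampfschiff", dict));
    }
}
```

The filter keeps the original token and adds the dictionary subwords at the same position, so the stream should contain Donaudampfschiff plus Donau, dampf, and schiff; this is what lets a search for "schiff" find the compound.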
Constructor Summary

DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary)
    Creates a new DictionaryCompoundWordTokenFilter.

DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
    Creates a new DictionaryCompoundWordTokenFilter.
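The seven-argument constructor above can be exercised as in the sketch below, which passes the size limits explicitly and turns on onlyLongestMatch. This assumes a Lucene 4.x classpath; the `Version.LUCENE_47` constant, the sample dictionary, and the class name `CompoundConfigDemo` are illustrative, and the explicit limit values simply reuse the inherited DEFAULT_* constants.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CompoundConfigDemo {

    // Decompose text with explicit size limits and an onlyLongestMatch flag.
    static List<String> decompose(String text, CharArraySet dict,
                                  int minWord, int minSub, int maxSub,
                                  boolean onlyLongest) throws Exception {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_47, new StringReader(text));
        ts = new DictionaryCompoundWordTokenFilter(Version.LUCENE_47, ts, dict,
                minWord, minSub, maxSub, onlyLongest);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        List<String> tokens = new ArrayList<>();
        ts.reset();
        while (ts.incrementToken()) {
            tokens.add(term.toString());
        }
        ts.end();
        ts.close();
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        CharArraySet dict = new CharArraySet(Version.LUCENE_47,
                Arrays.asList("schiff", "fahrt"), true);
        List<String> tokens = decompose("Schifffahrt", dict,
                CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,    // words shorter than this pass through unsplit
                CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, // discard very short fragments
                CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, // discard overly long fragments
                true);                                                // keep only the longest match per start offset
        System.out.println(tokens);
    }
}
```

Using the DEFAULT_* constants instead of bare numbers keeps the call in sync with the three-argument constructor's behavior while still letting you flip onlyLongestMatch.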
Method Summary

Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase:
incrementToken, reset

Methods inherited from class org.apache.lucene.analysis.TokenFilter:
close, end

Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Constructor Detail

public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary)

Creates a new DictionaryCompoundWordTokenFilter.

Parameters:
    matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
    input - the TokenStream to process
    dictionary - the word dictionary to match against

public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)

Creates a new DictionaryCompoundWordTokenFilter.

Parameters:
    matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
    input - the TokenStream to process
    dictionary - the word dictionary to match against
    minWordSize - only words longer than this get processed
    minSubwordSize - only subwords longer than this get to the output stream
    maxSubwordSize - only subwords shorter than this get to the output stream
    onlyLongestMatch - Add only the longest matching subword to the stream

Copyright © 2010 - 2020 Adobe. All Rights Reserved