Module org.elasticsearch.server
Class DeDuplicatingTokenFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.FilteringTokenFilter
org.elasticsearch.lucene.analysis.miscellaneous.DeDuplicatingTokenFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.lucene.util.Unwrappable<org.apache.lucene.analysis.TokenStream>
public class DeDuplicatingTokenFilter
extends org.apache.lucene.analysis.FilteringTokenFilter
Inspects token streams for duplicate sequences of tokens. Token sequences
have a minimum length - 6 is a good heuristic as it avoids filtering common
idioms/phrases but detects longer sections that are typical of cut+paste
copies of text.
Internally each token is hashed/moduloed into a single byte (so 256 possible
values for each token) and then recorded in a trie of seen byte sequences
using a DuplicateByteSequenceSpotter. This trie is passed into the
TokenFilter constructor so a single object can be reused across multiple
documents.
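The hashing and trie bookkeeping described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the DuplicateByteSequenceSpotter implementation: the class ByteSequenceTrieSketch, its methods, and the use of String.hashCode as the hash are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy trie of byte sequences; a hypothetical stand-in, not DuplicateByteSequenceSpotter. */
final class ByteSequenceTrieSketch {

    /** One trie node: child nodes keyed by byte value, plus a completed-sequence counter. */
    private static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        int timesSeen;
    }

    private final Node root = new Node();

    /** Reduces a token to one of 256 values (String.hashCode is an assumed hash choice). */
    static byte hashToByte(String token) {
        return (byte) Math.floorMod(token.hashCode(), 256);
    }

    /** Records one fixed-length run of tokens and returns how often it was seen before. */
    int recordSequence(String... tokens) {
        Node node = root;
        for (String token : tokens) {
            // Walk (or create) the path for this token's byte value.
            node = node.children.computeIfAbsent(hashToByte(token), b -> new Node());
        }
        return node.timesSeen++;
    }

    public static void main(String[] args) {
        // One trie instance is shared across both "documents".
        ByteSequenceTrieSketch trie = new ByteSequenceTrieSketch();
        String[] phrase = {"the", "quick", "brown", "fox", "jumps", "over"};
        System.out.println(trie.recordSequence(phrase)); // first document: prints 0
        System.out.println(trie.recordSequence(phrase)); // second document: prints 1
    }
}
```

Because each token collapses to a single byte, unrelated tokens can share a value; requiring a run of several tokens (six in the example) makes an accidental match on the whole path unlikely.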
The emitDuplicates setting controls whether duplicate tokens are filtered from
results or are output (the DuplicateSequenceAttribute attribute can
be used to inspect the number of prior sightings when emitDuplicates is true).
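The effect of emitDuplicates can be illustrated with a small self-contained model. This is a sketch under assumptions, not the filter's actual code: EmitDuplicatesSketch, TaggedToken, and the rule "keep a token when emitDuplicates is true or its prior-sightings count is below 1" are hypothetical stand-ins, and whole token windows are stored directly instead of hashed bytes in a trie.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the emitDuplicates switch; all names here are hypothetical. */
final class EmitDuplicatesSketch {

    /** A term plus its prior-sightings count (stand-in for DuplicateSequenceAttribute). */
    static final class TaggedToken {
        final String term;
        final int priorSightings;

        TaggedToken(String term, int priorSightings) {
            this.term = term;
            this.priorSightings = priorSightings;
        }
    }

    // Shared across calls, so sequences from earlier documents are remembered.
    private final Map<List<String>, Integer> seenWindows = new HashMap<>();
    private final int windowLength;

    EmitDuplicatesSketch(int windowLength) {
        this.windowLength = windowLength;
    }

    List<TaggedToken> process(List<String> tokens, boolean emitDuplicates) {
        int[] prior = new int[tokens.size()];
        for (int start = 0; start + windowLength <= tokens.size(); start++) {
            List<String> window = new ArrayList<>(tokens.subList(start, start + windowLength));
            // merge returns the updated count; subtract one to get sightings before this one.
            int seenBefore = seenWindows.merge(window, 1, Integer::sum) - 1;
            for (int i = start; i < start + windowLength; i++) {
                prior[i] = Math.max(prior[i], seenBefore);
            }
        }
        List<TaggedToken> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            // Assumed acceptance rule: drop previously seen tokens unless emitDuplicates is set.
            if (emitDuplicates || prior[i] < 1) {
                out.add(new TaggedToken(tokens.get(i), prior[i]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        EmitDuplicatesSketch sketch = new EmitDuplicatesSketch(3);
        List<String> doc = Arrays.asList("copy", "and", "paste", "text");
        System.out.println(sketch.process(doc, false).size()); // nothing seen yet, all tokens kept
        System.out.println(sketch.process(doc, false).size()); // repeated text is dropped
        System.out.println(sketch.process(doc, true).size());  // repeated text is kept and tagged
    }
}
```

With emitDuplicates set to true, a consumer can still tell repeated text apart from new text by reading the count carried on each token.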
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor Description
DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.FilteringTokenFilter
end, incrementToken, reset
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
DeDuplicatingTokenFilter
public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
-
DeDuplicatingTokenFilter
public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
- Parameters:
in - The input token stream
byteStreamDuplicateSpotter - object which retains trie of token sequences
emitDuplicates - true if duplicate tokens are to be emitted (use the DuplicateSequenceAttribute attribute to inspect number of prior sightings of tokens as part of a sequence).
-
-
Method Details
-
accept
- Specified by:
accept in class org.apache.lucene.analysis.FilteringTokenFilter
- Throws:
IOException
-