NGramTokenizer (The Adobe Experience Manager SDK 2020.5.3528.20200529T053942Z-200507)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.ngram.NGramTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable

Direct Known Subclasses:

EdgeNGramTokenizer
```
public class NGramTokenizer
extends Tokenizer
```
Tokenizes the input into n-grams of the given size(s).
Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, with minGram=2 and maxGram=3, "abcde" would be tokenized as:

Term	ab	abc	bc	bcd	cd	cde	de
Position increment	1	1	1	1	1	1	1
Position length	1	1	1	1	1	1	1
Offsets	[0,2[	[0,3[	[1,3[	[1,4[	[2,4[	[2,5[	[3,5[
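The example above can be reproduced with a short, Lucene-free sketch. It is not the library's implementation: the class `NGramOffsetsSketch`, the `Gram` holder, and the argument check are hypothetical names and choices made for illustration, and the sketch counts Java chars rather than doing anything smarter with the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class NGramOffsetsSketch {

    /** One emitted gram plus its half-open offsets [start,end[ in the input. */
    static final class Gram {
        final String term;
        final int start;
        final int end;

        Gram(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " [" + start + "," + end + "[";
        }
    }

    /** Emits grams by increasing start offset, then by increasing length. */
    static List<Gram> ngrams(String input, int minGram, int maxGram) {
        // Assumption of this sketch: reject sizes that cannot produce grams.
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= input.length(); len++) {
                // The offsets cover exactly the characters that form the term.
                out.add(new Gram(input.substring(start, start + len), start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Gram g : ngrams("abcde", 2, 3)) {
            System.out.println(g);
        }
    }
}
```

Running `main` prints the seven terms in the order of the example, from `ab [0,2[` through `de [3,5[`.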

This tokenizer changed a lot in Lucene 4.4 in order to:
- give the ability to pre-tokenize the stream before computing n-grams.
Additionally, this class doesn't trim trailing whitespace and emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (which prevented support for large input streams).
Although highly discouraged, it is still possible to use the old behavior through Lucene43NGramTokenizer.
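To make the ordering change concrete, the following plain-Java sketch contrasts the two emission orders. It is an illustration only and does not reproduce the code of either tokenizer; `NGramOrderSketch`, `byStartOffset`, and `byLength` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class NGramOrderSketch {

    /** Current order: increasing start offset, so each position needs only a short lookahead. */
    static List<String> byStartOffset(String input, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= input.length(); len++) {
                out.add(input.substring(start, start + len));
            }
        }
        return out;
    }

    /** Old order: increasing length, so every gram size walks the whole input again. */
    static List<String> byLength(String input, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int len = minGram; len <= maxGram; len++) {
            for (int start = 0; start + len <= input.length(); start++) {
                out.add(input.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(byStartOffset("abcde", 2, 3)); // [ab, abc, bc, bcd, cd, cde, de]
        System.out.println(byLength("abcde", 2, 3));      // [ab, bc, cd, de, abc, bcd, cde]
    }
}
```

In the length-first loop the outer pass restarts from the beginning of the input for each gram size, which suggests why that order does not suit a streaming tokenizer.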

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.AttributeFactory, AttributeSource.State

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_MAX_NGRAM_SIZE`
`static int`	`DEFAULT_MIN_NGRAM_SIZE`

Constructor Summary

Constructors
Constructor and Description
`NGramTokenizer(Version version, AttributeSource.AttributeFactory factory, Reader input, int minGram, int maxGram)` Creates NGramTokenizer with given min and max n-grams.
`NGramTokenizer(Version version, Reader input)` Creates NGramTokenizer with default min and max n-grams.
`NGramTokenizer(Version version, Reader input, int minGram, int maxGram)` Creates NGramTokenizer with given min and max n-grams.

Method Summary

Modifier and Type	Method and Description
`void`	`end()` This method is called by the consumer after the last token has been consumed, after `TokenStream.incrementToken()` returned `false` (using the new `TokenStream` API).
`boolean`	`incrementToken()` Consumers (e.g., `IndexWriter`) use this method to advance the stream to the next token.
`void`	`reset()` This method is called by a consumer before it begins consumption using `TokenStream.incrementToken()`.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - DEFAULT_MIN_NGRAM_SIZE
```
public static final int DEFAULT_MIN_NGRAM_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_NGRAM_SIZE
```
public static final int DEFAULT_MAX_NGRAM_SIZE
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - NGramTokenizer
```
public NGramTokenizer(Version version,
                      Reader input,
                      int minGram,
                      int maxGram)
```
    Creates NGramTokenizer with given min and max n-grams.
    
    Parameters:
    
    version - the lucene compatibility version
    
    input - Reader holding the input to be tokenized
    
    minGram - the smallest n-gram to generate
    
    maxGram - the largest n-gram to generate
  - NGramTokenizer
```
public NGramTokenizer(Version version,
                      AttributeSource.AttributeFactory factory,
                      Reader input,
                      int minGram,
                      int maxGram)
```
    Creates NGramTokenizer with given min and max n-grams.
    
    Parameters:
    
    version - the lucene compatibility version
    
    factory - AttributeSource.AttributeFactory to use
    
    input - Reader holding the input to be tokenized
    
    minGram - the smallest n-gram to generate
    
    maxGram - the largest n-gram to generate
  - NGramTokenizer
```
public NGramTokenizer(Version version,
                      Reader input)
```
    Creates NGramTokenizer with default min and max n-grams.
    
    Parameters:
    
    version - the lucene compatibility version
    
    input - Reader holding the input to be tokenized
- Method Detail
  - incrementToken
```
public final boolean incrementToken()
                             throws IOException
```
    Description copied from class: TokenStream
    
    Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
    The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.
    This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.
    To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().
    
    Specified by:
    
    incrementToken in class TokenStream
    
    Returns:
    
    false for end of stream; true otherwise
    
    Throws:
    
    IOException
  - end
```
public final void end()
               throws IOException
```
    Description copied from class: TokenStream
    
    This method is called by the consumer after the last token has been consumed, after TokenStream.incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.
    This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g., when one or more whitespace characters follow the last token and a WhitespaceTokenizer was used.
    Additionally, any skipped positions (such as those removed by a StopFilter) can be applied to the position increment, as can any adjustment of other attributes where the end-of-stream value may be important.
    If you override this method, always call super.end().
    
    Overrides:
    
    end in class TokenStream
    
    Throws:
    
    IOException - If an I/O error occurs
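    The difference between the last token's end offset and the final offset can be seen in a small plain-Java sketch. It only mimics whitespace tokenization and makes no claim about how any Lucene class computes these values; `FinalOffsetSketch` and both method names are hypothetical.

```java
// Hypothetical illustration class; not part of Lucene.
final class FinalOffsetSketch {

    /** End offset (exclusive) of the last whitespace-delimited token, or 0 if there is none. */
    static int lastTokenEndOffset(String input) {
        int end = input.length();
        while (end > 0 && Character.isWhitespace(input.charAt(end - 1))) {
            end--;
        }
        return end;
    }

    /** Final offset an end-of-stream step would report in this sketch: the full input length. */
    static int finalOffset(String input) {
        return input.length();
    }

    public static void main(String[] args) {
        String input = "hello world  "; // two trailing spaces
        System.out.println(lastTokenEndOffset(input)); // 11
        System.out.println(finalOffset(input));        // 13
    }
}
```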
  - reset
```
public final void reset()
                 throws IOException
```
    Description copied from class: TokenStream
    
    This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
    Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
    If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
    
    Overrides:
    
    reset in class Tokenizer
    
    Throws:
    
    IOException
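    The consumer workflow described for incrementToken(), end(), and reset() can be sketched without Lucene on the classpath. `SketchStream` below is a hypothetical stand-in that only imitates the contract (reset first, loop until incrementToken() returns false, then end); it is not a Lucene class and omits attributes and I/O.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class LifecycleSketch {

    /** Stand-in stream that hands out a fixed list of terms. */
    static final class SketchStream {
        private final String[] tokens;
        private int next = -1; // -1 means reset() has not been called yet
        private String current;

        SketchStream(String... tokens) {
            this.tokens = tokens;
        }

        /** Returns the stream to a clean state so it can be consumed again. */
        void reset() {
            next = 0;
            current = null;
        }

        /** Advances to the next token; false signals end of stream. */
        boolean incrementToken() {
            if (next < 0) {
                throw new IllegalStateException("reset() must be called before incrementToken()");
            }
            if (next >= tokens.length) {
                return false;
            }
            current = tokens[next++];
            return true;
        }

        /** End-of-stream bookkeeping (for example a final offset) would go here. */
        void end() {
        }

        String term() {
            return current;
        }
    }

    /** Consumer side: reset(), loop on incrementToken(), then end(). */
    static List<String> consume(SketchStream stream) {
        List<String> seen = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            seen.add(stream.term());
        }
        stream.end();
        return seen;
    }

    public static void main(String[] args) {
        SketchStream stream = new SketchStream("ab", "abc", "bc");
        System.out.println(consume(stream)); // [ab, abc, bc]
        System.out.println(consume(stream)); // same output: reset() makes the stream reusable
    }
}
```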
