WikipediaTokenizer (The Adobe Experience Manager SDK 2020.6.3717.20200611T200904Z-200604)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.wikipedia.WikipediaTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable
```
public final class WikipediaTokenizer
extends Tokenizer
```
A tokenizer that extends the behavior of StandardTokenizer with awareness of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not cover all Wikipedia syntax.

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.AttributeFactory, AttributeSource.State

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`ACRONYM_ID`
`static int`	`ALPHANUM_ID`
`static int`	`APOSTROPHE_ID`
`static String`	`BOLD`
`static int`	`BOLD_ID`
`static String`	`BOLD_ITALICS`
`static int`	`BOLD_ITALICS_ID`
`static int`	`BOTH` Output both the untokenized token and the splits
`static String`	`CATEGORY`
`static int`	`CATEGORY_ID`
`static String`	`CITATION`
`static int`	`CITATION_ID`
`static int`	`CJ_ID`
`static int`	`COMPANY_ID`
`static int`	`EMAIL_ID`
`static String`	`EXTERNAL_LINK`
`static int`	`EXTERNAL_LINK_ID`
`static String`	`EXTERNAL_LINK_URL`
`static int`	`EXTERNAL_LINK_URL_ID`
`static String`	`HEADING`
`static int`	`HEADING_ID`
`static int`	`HOST_ID`
`static String`	`INTERNAL_LINK`
`static int`	`INTERNAL_LINK_ID`
`static String`	`ITALICS`
`static int`	`ITALICS_ID`
`static int`	`NUM_ID`
`static String`	`SUB_HEADING`
`static int`	`SUB_HEADING_ID`
`static String[]`	`TOKEN_TYPES` String token types that correspond to token type int constants
`static int`	`TOKENS_ONLY` Only output tokens
`static int`	`UNTOKENIZED_ONLY` Only output untokenized tokens, which are tokens that would normally be split into several tokens
`static int`	`UNTOKENIZED_TOKEN_FLAG` This flag indicates that the produced "Token" would, if `TOKENS_ONLY` were used, produce multiple tokens.

Constructor Summary

Constructors
Constructor and Description
`WikipediaTokenizer(AttributeSource.AttributeFactory factory, Reader input, int tokenOutput, Set<String> untokenizedTypes)` Creates a new instance of the `WikipediaTokenizer`.
`WikipediaTokenizer(Reader input)` Creates a new instance of the `WikipediaTokenizer`.
`WikipediaTokenizer(Reader input, int tokenOutput, Set<String> untokenizedTypes)` Creates a new instance of the `WikipediaTokenizer`.

Method Summary

Modifier and Type	Method and Description
`void`	`close()` Releases resources associated with this stream.
`void`	`end()` This method is called by the consumer after the last token has been consumed, after `TokenStream.incrementToken()` returned `false` (using the new `TokenStream` API).
`boolean`	`incrementToken()` Consumers (e.g., `IndexWriter`) use this method to advance the stream to the next token.
`void`	`reset()` This method is called by a consumer before it begins consumption using `TokenStream.incrementToken()`.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - INTERNAL_LINK
```
public static final String INTERNAL_LINK
```
    See Also:
    
    Constant Field Values
  - EXTERNAL_LINK
```
public static final String EXTERNAL_LINK
```
    See Also:
    
    Constant Field Values
  - EXTERNAL_LINK_URL
```
public static final String EXTERNAL_LINK_URL
```
    See Also:
    
    Constant Field Values
  - CITATION
```
public static final String CITATION
```
    See Also:
    
    Constant Field Values
  - CATEGORY
```
public static final String CATEGORY
```
    See Also:
    
    Constant Field Values
  - BOLD
```
public static final String BOLD
```
    See Also:
    
    Constant Field Values
  - ITALICS
```
public static final String ITALICS
```
    See Also:
    
    Constant Field Values
  - BOLD_ITALICS
```
public static final String BOLD_ITALICS
```
    See Also:
    
    Constant Field Values
  - HEADING
```
public static final String HEADING
```
    See Also:
    
    Constant Field Values
  - SUB_HEADING
```
public static final String SUB_HEADING
```
    See Also:
    
    Constant Field Values
  - ALPHANUM_ID
```
public static final int ALPHANUM_ID
```
    See Also:
    
    Constant Field Values
  - APOSTROPHE_ID
```
public static final int APOSTROPHE_ID
```
    See Also:
    
    Constant Field Values
  - ACRONYM_ID
```
public static final int ACRONYM_ID
```
    See Also:
    
    Constant Field Values
  - COMPANY_ID
```
public static final int COMPANY_ID
```
    See Also:
    
    Constant Field Values
  - EMAIL_ID
```
public static final int EMAIL_ID
```
    See Also:
    
    Constant Field Values
  - HOST_ID
```
public static final int HOST_ID
```
    See Also:
    
    Constant Field Values
  - NUM_ID
```
public static final int NUM_ID
```
    See Also:
    
    Constant Field Values
  - CJ_ID
```
public static final int CJ_ID
```
    See Also:
    
    Constant Field Values
  - INTERNAL_LINK_ID
```
public static final int INTERNAL_LINK_ID
```
    See Also:
    
    Constant Field Values
  - EXTERNAL_LINK_ID
```
public static final int EXTERNAL_LINK_ID
```
    See Also:
    
    Constant Field Values
  - CITATION_ID
```
public static final int CITATION_ID
```
    See Also:
    
    Constant Field Values
  - CATEGORY_ID
```
public static final int CATEGORY_ID
```
    See Also:
    
    Constant Field Values
  - BOLD_ID
```
public static final int BOLD_ID
```
    See Also:
    
    Constant Field Values
  - ITALICS_ID
```
public static final int ITALICS_ID
```
    See Also:
    
    Constant Field Values
  - BOLD_ITALICS_ID
```
public static final int BOLD_ITALICS_ID
```
    See Also:
    
    Constant Field Values
  - HEADING_ID
```
public static final int HEADING_ID
```
    See Also:
    
    Constant Field Values
  - SUB_HEADING_ID
```
public static final int SUB_HEADING_ID
```
    See Also:
    
    Constant Field Values
  - EXTERNAL_LINK_URL_ID
```
public static final int EXTERNAL_LINK_URL_ID
```
    See Also:
    
    Constant Field Values
  - TOKEN_TYPES
```
public static final String[] TOKEN_TYPES
```
    String token types that correspond to token type int constants
  - TOKENS_ONLY
```
public static final int TOKENS_ONLY
```
    Only output tokens
    
    See Also:
    
    Constant Field Values
  - UNTOKENIZED_ONLY
```
public static final int UNTOKENIZED_ONLY
```
    Only output untokenized tokens, which are tokens that would normally be split into several tokens
    
    See Also:
    
    Constant Field Values
  - BOTH
```
public static final int BOTH
```
    Output both the untokenized token and the splits
    
    See Also:
    
    Constant Field Values
  - UNTOKENIZED_TOKEN_FLAG
```
public static final int UNTOKENIZED_TOKEN_FLAG
```
    This flag indicates that the produced "Token" would, if TOKENS_ONLY were used, produce multiple tokens.
    
    See Also:
    
    Constant Field Values
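
The three output modes and the flag are easier to compare side by side. The following is a minimal, self-contained sketch and not Lucene's implementation: the class `OutputModeSketch`, the `Tok` type, the `emit` method, and the numeric values given to the constants are all assumptions made for illustration (the real values are listed under Constant Field Values). The order in which `BOTH` emits the untokenized token relative to its splits is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class OutputModeSketch {
    // Assumed stand-in values; real code should reference the WikipediaTokenizer constants.
    static final int TOKENS_ONLY = 0;
    static final int UNTOKENIZED_ONLY = 1;
    static final int BOTH = 2;
    static final int UNTOKENIZED_TOKEN_FLAG = 1;

    /** Hypothetical token: its text plus an int bit set of flags. */
    static final class Tok {
        final String text;
        final int flags;

        Tok(String text, int flags) {
            this.text = text;
            this.flags = flags;
        }

        /** Tests the flag bit rather than comparing for equality, so other bits may be set. */
        boolean isUntokenized() {
            return (flags & UNTOKENIZED_TOKEN_FLAG) != 0;
        }
    }

    /** Models what each mode would emit for one multi-word element, such as a category. */
    static List<Tok> emit(String element, int tokenOutput) {
        List<Tok> out = new ArrayList<>();
        if (tokenOutput == UNTOKENIZED_ONLY || tokenOutput == BOTH) {
            // The whole element as a single token, marked with the flag.
            out.add(new Tok(element, UNTOKENIZED_TOKEN_FLAG));
        }
        if (tokenOutput == TOKENS_ONLY || tokenOutput == BOTH) {
            // The individual splits, unflagged.
            for (String part : element.trim().split("\\s+")) {
                out.add(new Tok(part, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : emit("Apache Lucene", BOTH)) {
            System.out.println(t.text + " untokenized=" + t.isUntokenized());
        }
    }
}
```

In this model a consumer that wants only the splits under `BOTH` can skip every token for which `isUntokenized()` is true.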
- Constructor Detail
  - WikipediaTokenizer
```
public WikipediaTokenizer(Reader input)
```
    Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.
    
    Parameters:
    
    input - The Input Reader
  - WikipediaTokenizer
```
public WikipediaTokenizer(Reader input,
                          int tokenOutput,
                          Set<String> untokenizedTypes)
```
    Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner.
    
    Parameters:
    
    input - The input
    
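    tokenOutput - One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH
    
    untokenizedTypes - The token types (for example CATEGORY) whose elements are output untokenized when tokenOutput is UNTOKENIZED_ONLY or BOTH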
  - WikipediaTokenizer
```
public WikipediaTokenizer(AttributeSource.AttributeFactory factory,
                          Reader input,
                          int tokenOutput,
                          Set<String> untokenizedTypes)
```
    Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner. Uses the given AttributeSource.AttributeFactory.
    
    Parameters:
    
    input - The input
    
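    tokenOutput - One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH
    
    untokenizedTypes - The token types (for example CATEGORY) whose elements are output untokenized when tokenOutput is UNTOKENIZED_ONLY or BOTH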
- Method Detail
  - incrementToken
```
public final boolean incrementToken()
                             throws IOException
```
    Description copied from class: TokenStream
    
    Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
    The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.
    This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.
    To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().
    
    Specified by:
    
    incrementToken in class TokenStream
    
    Returns:
    
    false for end of stream; true otherwise
    
    Throws:
    
    IOException
  - close
```
public void close()
           throws IOException
```
    Description copied from class: Tokenizer
    
    Releases resources associated with this stream.
    If you override this method, always call super.close(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on reuse).
    NOTE: The default implementation closes the input Reader, so be sure to call super.close() when overriding this method.
    
    Specified by:
    
    close in interface Closeable
    
    Specified by:
    
    close in interface AutoCloseable
    
    Overrides:
    
    close in class Tokenizer
    
    Throws:
    
    IOException
  - reset
```
public void reset()
           throws IOException
```
    Description copied from class: TokenStream
    
    This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
    Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
    If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
    
    Overrides:
    
    reset in class Tokenizer
    
    Throws:
    
    IOException
  - end
```
public void end()
         throws IOException
```
    Description copied from class: TokenStream
    
    This method is called by the consumer after the last token has been consumed, after TokenStream.incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.
    This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, for example when one or more whitespace characters follow the last token and a WhitespaceTokenizer was used.
    Additionally, any skipped positions (such as those removed by a stop filter) can be applied to the position increment, as can any adjustment of other attributes where the end-of-stream value may be important.
    If you override this method, always call super.end().
    
    Overrides:
    
    end in class TokenStream
    
    Throws:
    
    IOException - If an I/O error occurs
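
The call order described for reset(), incrementToken(), end(), and close() can be sketched as follows. This is a minimal sketch with no Lucene dependency: `LifecycleSketch`, `SketchStream`, and `consume` are hypothetical stand-ins that only mimic the contract, and the whitespace splitting is an assumption chosen to keep the example short. It also shows why the final offset set in end() can differ from the end offset of the last token when trailing whitespace follows it.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    /** Hypothetical whitespace-splitting stream that mimics reset/incrementToken/end/close. */
    static final class SketchStream implements Closeable {
        private final String input;
        private int pos;
        private boolean ready;
        String term;      // text of the current token
        int startOffset;  // start offset of the current token
        int endOffset;    // end offset of the current token, or the final offset after end()

        SketchStream(String input) {
            this.input = input;
        }

        /** Returns the stream to a clean state; must be called before consumption. */
        void reset() {
            pos = 0;
            ready = true;
        }

        /** Advances to the next token; returns false at end of stream. */
        boolean incrementToken() {
            if (!ready) {
                throw new IllegalStateException("reset() must be called before incrementToken()");
            }
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos == input.length()) {
                return false;
            }
            startOffset = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            endOffset = pos;
            term = input.substring(startOffset, endOffset);
            return true;
        }

        /** End-of-stream operation: records the final offset, including trailing whitespace. */
        void end() {
            startOffset = input.length();
            endOffset = input.length();
        }

        @Override
        public void close() {
            ready = false;
        }
    }

    /** Consumes the stream in the documented order; returns {last token end, final offset}. */
    static int[] consume(String text, List<String> sink) {
        int lastTokenEnd = 0;
        try (SketchStream stream = new SketchStream(text)) {
            stream.reset();
            while (stream.incrementToken()) {
                sink.add(stream.term);
                lastTokenEnd = stream.endOffset;
            }
            stream.end();
            return new int[] { lastTokenEnd, stream.endOffset };
        }
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        int[] offsets = consume("hello world   ", terms);
        System.out.println(terms + " lastTokenEnd=" + offsets[0] + " finalOffset=" + offsets[1]);
    }
}
```

With a real WikipediaTokenizer the same call order applies, but term text and offsets would be read from attributes obtained once up front rather than from fields as in this stand-in.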
