Package org.apache.lucene.analysis.standard
Fast, general-purpose grammar-based tokenizers.
The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex:
StandardTokenizer: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, it does not keep URLs and email addresses as single tokens; it splits them into tokens according to the UAX#29 word break rules. StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, ClassicFilter, LowerCaseFilter and StopFilter.
UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
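The analyzers described above share one shape: a tokenizer followed by a lowercasing step and a stop-word filter. The sketch below imitates that chain using only the JDK, so it can run without Lucene on the classpath. It is not Lucene's API or implementation: java.text.BreakIterator stands in for the JFlex-generated tokenizer, and the class name AnalyzerChainSketch, the analyze method and the five-word stop list are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerChainSketch {
    // Tiny illustrative stop list (an assumption); not Lucene's English stop set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Tokenize on word boundaries, lowercase each token, then drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Skip segments made only of whitespace or punctuation.
            if (segment.codePoints().noneMatch(Character::isLetterOrDigit)) {
                continue;
            }
            String lower = segment.toLowerCase(Locale.ROOT); // lowercasing step
            if (!STOP_WORDS.contains(lower)) {               // stop-word step
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```

The order of the steps matters: lowercasing before the stop-word check lets "The" and "the" both match the lowercase stop list, which mirrors placing LowerCaseFilter ahead of StopFilter in the chains listed above.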
Interface Summary
Interface                      Description
StandardTokenizerInterface     Internal interface for supporting versioned grammars.

Class Summary
Class                          Description
ClassicAnalyzer                Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter                  Normalizes tokens extracted with ClassicTokenizer.
ClassicFilterFactory           Factory for ClassicFilter.
ClassicTokenizer               A grammar-based tokenizer constructed with JFlex.
ClassicTokenizerFactory        Factory for ClassicTokenizer.
StandardAnalyzer               Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
StandardFilter                 Normalizes tokens extracted with StandardTokenizer.
StandardFilterFactory          Factory for StandardFilter.
StandardTokenizer              A grammar-based tokenizer constructed with JFlex.
StandardTokenizerFactory       Factory for StandardTokenizer.
StandardTokenizerImpl          This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
UAX29URLEmailAnalyzer          Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer         This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerFactory  Factory for UAX29URLEmailTokenizer.
UAX29URLEmailTokenizerImpl     This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
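To make the distinction between plain word-break tokenization and URL/email-aware tokenization concrete, here is a rough JDK-only sketch. It is not how UAX29URLEmailTokenizer is implemented (that class is generated from a JFlex grammar); the class name UrlEmailAwareSketch, both method names, and the deliberately loose regular expression are assumptions made for illustration, and the pattern is far simpler than the RFC rules the real tokenizer follows.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlEmailAwareSketch {
    // Simplified pattern (an assumption); real URL and email syntax is much richer.
    private static final Pattern EMAIL_OR_URL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+|https?://\\S+");

    // Plain word-boundary tokenization, keeping segments with a letter or digit.
    static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Emit each URL or email match as one token; word-tokenize the text between matches.
    static List<String> urlEmailAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = EMAIL_OR_URL.matcher(text);
        int pos = 0;
        while (m.find()) {
            tokens.addAll(wordTokens(text.substring(pos, m.start())));
            tokens.add(m.group());
            pos = m.end();
        }
        tokens.addAll(wordTokens(text.substring(pos)));
        return tokens;
    }

    public static void main(String[] args) {
        String text = "mail bob@example.com now";
        System.out.println(wordTokens(text));          // word-break rules only
        System.out.println(urlEmailAwareTokens(text)); // address kept as one token
    }
}
```

Which behavior is preferable depends on the use case: keeping an address whole supports exact lookups of that address, while splitting it lets a search on one of its parts match.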