Class PatternAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
org.apache.lucene.analysis.miscellaneous.PatternAnalyzer
- All Implemented Interfaces:
Closeable, AutoCloseable
Deprecated. (4.0) Use the pattern-based analysis in the analysis/pattern package instead.
Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a Reader, that can flexibly separate text into terms via a regular expression Pattern (with behaviour identical to String.split(String)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.
If you are unsure what a regular expression should look like, consider prototyping by trying various expressions on some test texts via String.split(String). Once you are satisfied, pass that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
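The prototyping step described above can be sketched with the JDK alone. This is only an illustration: the class name `SplitPrototype`, the helper `tryRegex`, and the sample sentences are hypothetical and not part of Lucene.

```java
import java.util.Arrays;

// Prototype candidate delimiter regexes with plain String.split before
// handing the chosen expression to PatternAnalyzer.
// SplitPrototype and tryRegex are illustrative names, not Lucene API.
class SplitPrototype {

    // Splits the sample text on the candidate regex via String.split.
    static String[] tryRegex(String regex, String sample) {
        return sample.split(regex);
    }

    public static void main(String[] args) {
        // Candidate 1: split on runs of whitespace.
        System.out.println(Arrays.toString(
                tryRegex("\\s+", "James is running round in the woods")));
        // Candidate 2: split on runs of non-word characters; note that the
        // apostrophe and punctuation act as delimiters here.
        System.out.println(Arrays.toString(
                tryRegex("\\W+", "James's dog; running, fast")));
    }
}
```

Comparing the printed arrays for a few representative inputs is usually enough to decide which expression to use.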
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, as in this stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.tokenStream("content", "James is running round in the woods"),
    "English");
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.GlobalReuseStrategy, Analyzer.PerFieldReuseStrategy, Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
-
Field Summary
Fields
Modifier and Type | Field | Description
static final PatternAnalyzer | DEFAULT_ANALYZER | Deprecated. A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
static final PatternAnalyzer | EXTENDED_ANALYZER | Deprecated. A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader.
static final Pattern | NON_WORD_PATTERN | Deprecated. "\\W+"; divides text at non-letters (NOT Character.isLetter(c))
static final Pattern | WHITESPACE_PATTERN | Deprecated. "\\s+"; divides text at whitespace (Character.isWhitespace(c))
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
Constructor Summary
Constructors
Constructor | Description
PatternAnalyzer(Version matchVersion, Pattern pattern, boolean toLowerCase, CharArraySet stopWords) | Deprecated. Constructs a new instance with the given parameters.
-
Method Summary
Modifier and Type | Method | Description
Analyzer.TokenStreamComponents | createComponents(String fieldName, Reader reader) | Deprecated. Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, Reader, String) and is less efficient than tokenStream(String, Reader, String).
Analyzer.TokenStreamComponents | createComponents(String fieldName, Reader reader, String text) | Deprecated. Creates a token stream that tokenizes the given string into token terms (aka words).
boolean | equals(Object) | Deprecated. Indicates whether some other object is "equal to" this one.
int | hashCode() | Deprecated. Returns a hash code value for the object.
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, getReuseStrategy, tokenStream, tokenStream
-
Field Details
-
NON_WORD_PATTERN
Deprecated. "\\W+"; divides text at non-letters (NOT Character.isLetter(c))
-
WHITESPACE_PATTERN
Deprecated. "\\s+"; divides text at whitespace (Character.isWhitespace(c))
-
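To make the difference between the two regex strings concrete, the sketch below compiles them locally with `java.util.regex.Pattern` and splits the same input with each. It assumes the Lucene constants are compiled from these strings without extra flags; the class `PatternConstantsDemo` and its helper methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Compares the two documented regex strings on the same input.
// PatternConstantsDemo is an illustrative class, not part of Lucene.
class PatternConstantsDemo {
    // Same regex strings as the documented constants, compiled locally so the
    // sketch does not need Lucene on the classpath (assumption: no flags).
    static final Pattern NON_WORD = Pattern.compile("\\W+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String[] splitNonWord(String text) {
        return NON_WORD.split(text);
    }

    static String[] splitWhitespace(String text) {
        return WHITESPACE.split(text);
    }

    public static void main(String[] args) {
        String text = "item_42 costs 3.50, really!";
        // With default flags, \W excludes ASCII letters, digits and '_',
        // so "item_42" stays whole while "3.50" breaks at the dot.
        System.out.println(Arrays.toString(splitNonWord(text)));
        // \s+ only breaks at whitespace, so punctuation stays attached.
        System.out.println(Arrays.toString(splitWhitespace(text)));
    }
}
```

Note that in `java.util.regex`, `\W` with default flags treats ASCII digits and the underscore as word characters, so "non-letters" in the description above is an approximation.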
DEFAULT_ANALYZER
Deprecated. A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
-
EXTENDED_ANALYZER
Deprecated. A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html; see http://thomas.loc.gov/home/all.about.inquery.html
-
-
Constructor Details
-
PatternAnalyzer
public PatternAnalyzer(Version matchVersion, Pattern pattern, boolean toLowerCase, CharArraySet stopWords)
Deprecated. Constructs a new instance with the given parameters.
- Parameters:
matchVersion - currently does nothing
pattern - a regular expression delimiting tokens
toLowerCase - if true, returns tokens after applying String.toLowerCase()
stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter.makeStopSet(Version, String[]) and/or WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or other stop word lists.
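The order in which the parameters take effect (split on the pattern, optionally lower-case, then drop stop words) can be sketched in plain Java. This is not the Lucene implementation: `PatternPipelineSketch` and `analyze` are hypothetical names, a `java.util.Set<String>` stands in for `CharArraySet`, and the use of `Locale.ROOT` for lower-casing is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Plain-Java sketch of the documented parameter semantics.
// PatternPipelineSketch.analyze is illustrative, not Lucene API.
class PatternPipelineSketch {

    static List<String> analyze(String text, Pattern pattern,
                                boolean toLowerCase, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : pattern.split(text)) {
            if (token.isEmpty()) {
                continue; // a leading delimiter yields an empty first element
            }
            // Lower-casing happens before the stop-word check, as documented.
            // Locale.ROOT is an assumption made for reproducibility.
            String term = toLowerCase ? token.toLowerCase(Locale.ROOT) : token;
            if (stopWords != null && stopWords.contains(term)) {
                continue; // stop words are dropped only if a set was supplied
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "in"));
        System.out.println(analyze("The quick brown fox in the woods",
                Pattern.compile("\\W+"), true, stop));
    }
}
```

Because lower-casing is applied first, a stop set of lower-case words also removes capitalised occurrences such as "The" when `toLowerCase` is true.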
-
-
Method Details
-
createComponents
public Analyzer.TokenStreamComponents createComponents(String fieldName, Reader reader, String text)
Deprecated. Creates a token stream that tokenizes the given string into token terms (aka words).
- Parameters:
fieldName - the name of the field to tokenize (currently ignored).
reader - reader (e.g. a CharFilter) of the original text; can be null.
text - the string to tokenize
- Returns:
a new token stream
-
createComponents
Deprecated. Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, Reader, String) and is less efficient than tokenStream(String, Reader, String).
- Parameters:
fieldName - the name of the field to tokenize (currently ignored).
reader - the reader delivering the text
- Returns:
a new token stream
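One plausible reason the Reader variant is less efficient is that forwarding to a String-based method requires the whole Reader to be drained into memory first. The sketch below illustrates that extra step with JDK classes only; `ReaderToStringSketch`, `readAll`, and `tokenize` are hypothetical names, and the buffering strategy is an assumption rather than Lucene's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Sketch of the extra work implied by a Reader-based entry point.
// ReaderToStringSketch is illustrative, not Lucene API.
class ReaderToStringSketch {

    // Drains the reader into a String; this full copy is the overhead
    // that a caller avoids by passing a String directly.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Reads everything first, then splits the resulting String.
    static String[] tokenize(Reader reader, String regex) {
        return readAll(reader).split(regex);
    }

    public static void main(String[] args) {
        Reader reader = new StringReader("James is running round in the woods");
        System.out.println(Arrays.toString(tokenize(reader, "\\s+")));
    }
}
```

When the text is already available as a String, calling the String-based variant skips this copy.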
-
equals
Deprecated. Indicates whether some other object is "equal to" this one.
-
hashCode
public int hashCode()
Deprecated. Returns a hash code value for the object.
-