Package com.yahoo.language
Interface Linguistics
All Known Implementing Classes: OpenNlpLinguistics, SimpleLinguistics
public interface Linguistics
Factory of linguistic processors. For technical reasons, this interface allows separate components to be supplied for different operations, which is more flexibility than many cases need. In particular, the tokenizer should typically stem, transform and normalize using the same operations as those provided directly by this factory. A set of adaptors is provided that makes this easy to achieve. Refer to the com.yahoo.language.simple.SimpleLinguistics implementation to see how to set this up.
Thread safety: Instances of this factory type must be thread safe, but the processors returned by the factory methods need not be. Clients should request separate processor instances for each thread.
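The per-thread rule above can be sketched as follows. This is a minimal illustration and not part of the API: Processor and ProcessorFactory are hypothetical stand-ins for a processor type such as Stemmer and for Linguistics itself, and ThreadLocal is only one possible way to give each thread its own instance.

```java
class PerThreadProcessors {

    // Hypothetical stand-in for a thread-unsafe processor such as Stemmer.
    public interface Processor {
        String process(String input);
    }

    // Hypothetical stand-in for a thread-safe factory such as Linguistics.
    public interface ProcessorFactory {
        Processor getProcessor();
    }

    // Wraps the shared factory so that each thread lazily requests,
    // and then reuses, its own processor instance.
    public static ThreadLocal<Processor> perThread(ProcessorFactory factory) {
        return ThreadLocal.withInitial(factory::getProcessor);
    }

    public static void main(String[] args) throws InterruptedException {
        ProcessorFactory factory = () -> input -> input.toLowerCase();
        ThreadLocal<Processor> processors = perThread(factory);

        // Each worker thread calls get() and never shares the result.
        Runnable work = () -> System.out.println(processors.get().process("Hello"));
        Thread first = new Thread(work);
        Thread second = new Thread(work);
        first.start();
        second.start();
        first.join();
        second.join();
    }
}
```

With a real Linguistics instance, the same pattern would presumably wrap a factory method reference such as linguistics::getStemmer, so that the shared factory is the only object crossing thread boundaries.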
Author: Mathias Mølster Lidal, Simon Thoresen Hult, bratseth
Nested Class Summary
static class Linguistics.Component
-
Method Summary
boolean equals(Linguistics other)
Check if another instance is equivalent to this one.
CharacterClasses getCharacterClasses()
Returns a thread-unsafe character classes instance.
Detector getDetector()
Returns a thread-unsafe detector.
GramSplitter getGramSplitter()
Returns a thread-unsafe gram splitter.
Normalizer getNormalizer()
Returns a thread-unsafe normalizer.
Segmenter getSegmenter()
Returns a thread-unsafe segmenter.
Stemmer getStemmer()
Returns a thread-unsafe stemmer or lemmatizer.
Tokenizer getTokenizer()
Returns a thread-unsafe tokenizer.
Transformer getTransformer()
Returns a thread-unsafe transformer.
Method Detail
-
getStemmer
Stemmer getStemmer()
Returns a thread-unsafe stemmer or lemmatizer. This is used at query time to stem search terms that are matched against indexes containing text tokenized with stemming turned on.
-
getTokenizer
Tokenizer getTokenizer()
Returns a thread-unsafe tokenizer. This is used at indexing time to produce an optionally stemmed and transformed (accent normalized) stream of indexable tokens.
-
getNormalizer
Normalizer getNormalizer()
Returns a thread-unsafe normalizer. This is used at query time to CJK-normalize query text.
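As a rough illustration of what CJK normalization of query text can involve, the sketch below applies Unicode NFKC compatibility normalization, which folds full-width Latin letters and half-width katakana into their regular-width forms. Whether a given Linguistics implementation uses exactly this form is an assumption, and the class name CjkNormalizeSketch is hypothetical. Note that java.text.Normalizer from the JDK is unrelated to the Normalizer type returned by this method.

```java
import java.text.Normalizer;

class CjkNormalizeSketch {

    // Applies NFKC so that compatibility variants (full-width, half-width)
    // compare equal to their regular-width counterparts.
    public static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // \uFF21\uFF22\uFF23 are the full-width letters A, B and C.
        System.out.println(normalize("\uFF21\uFF22\uFF23"));
    }
}
```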
-
getTransformer
Transformer getTransformer()
Returns a thread-unsafe transformer. This is used at query time to accent-normalize search terms that are matched against indexes containing text tokenized with accent normalization turned on.
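Accent normalization of a term can be sketched with the JDK alone. The example below is an assumption about the general technique, not the behavior of any particular Transformer implementation: it decomposes characters with NFD and then removes the combining marks. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class AccentDropSketch {

    // Decomposes each character into base letter plus combining marks,
    // then strips the marks (Unicode category M).
    public static String dropAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00E9 is "e" with an acute accent.
        System.out.println(dropAccents("caf\u00E9"));
    }
}
```

Letters that have no canonical decomposition, such as "ø", pass through this sketch unchanged, so a production transformer may need additional mappings. Query terms and indexed tokens must go through the same transformation for matching to work.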
-
getSegmenter
Segmenter getSegmenter()
Returns a thread-unsafe segmenter. This is used at query time to find the individual semantic components of search terms that are matched against indexes tokenized with segmentation.
-
getDetector
Detector getDetector()
Returns a thread-unsafe detector. The language of the text is a parameter to other linguistic operations. This is used to determine the language of a query or document field when not specified explicitly.
-
getGramSplitter
GramSplitter getGramSplitter()
Returns a thread-unsafe gram splitter. This is used to split query or document text into fixed-length grams, which allows matching without needing or using segmented tokens.
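The idea of fixed-length grams can be sketched as overlapping substrings of length n. This is a simplified illustration rather than the GramSplitter API: it ignores word boundaries and surrogate pairs, and returning a shorter input whole is a choice made for this sketch. The names GramSketch and grams are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class GramSketch {

    // Returns every overlapping substring of length n, in order.
    // Input no longer than n is returned as a single gram.
    public static List<String> grams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<>();
        if (text.isEmpty()) {
            return result;
        }
        if (text.length() <= n) {
            result.add(text);
            return result;
        }
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(grams("hello", 2)); // prints [he, el, ll, lo]
    }
}
```

Because both the query and the document are split the same way, a query gram such as "ll" can match inside "hello" without any segmentation of the text.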
-
getCharacterClasses
CharacterClasses getCharacterClasses()
Returns a thread-unsafe character classes instance.
-
equals
boolean equals(Linguistics other)
Checks whether another instance is equivalent to this one.