All Classes and Interfaces
Class
Description
Determines the class of a given character.
CharacterUtils provides a unified interface to Character-related
operations to implement backwards compatible character operations.
A simple IO buffer to use with CharacterUtils.fill(CharacterBuffer, Reader).
A simple class that stores key Strings as char[]'s in a
hash table.
A simple class that stores Strings as char[]'s in a
hash table.
Exception that is thrown when detection fails.
Abstract superclass of all Detectors used for language and encoding detection.
An embedder converts a text string to a tensor
Runtime that is injectable through Embedder constructor.
A class which splits consecutive word character sequences into overlapping character n-grams.
An immutable start index and length pair
A hint that can be given to a Detector.
A stemmer implementing the Kstem algorithm by Bob Krovetz.
Factory of linguistic processors.
This class provides a case normalization operation to be used e.g.
This interface provides NFKC normalization of Strings through the underlying linguistics library.
A StringBuilder that allows one to access the array.
Exception class indicating that a fatal error occurred during linguistic processing.
Interface providing segmentation, i.e.
Includes functionality for determining the langCode from a sample or from the encoding.
Factory of simple linguistic processor implementations.
A tokenizer which splits on whitespace, normalizes and transforms using the given implementations
and stems using the Kstem algorithm.
Converts all accented characters into their de-accented counterparts followed by their combining diacritics, then
strips off the diacritics using a regex.
Immutable named lists of "special tokens" - strings which should override the normal tokenizer semantics
and be tokenized into a single token.
An immutable list of special tokens - strings which should override the normal tokenizer semantics
and be tokenized into a single token.
An immutable special token
A list of strings which does not allow for duplicate elements.
Interface providing stemming of single words.
An enum of the stemming modes which can be requested.
A single token produced by the tokenizer.
Language-sensitive tokenization of a text string.
List of token scripts (e.g.
An enumeration of token types.
Interface for providers of text transformations such as accent removal.
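
Two of the descriptions above name concrete techniques: accent removal by decomposing characters and stripping the combining diacritics with a regex, and splitting word character sequences into overlapping character n-grams. The following is a minimal sketch of both using only the Java standard library. The class name LinguisticsSketch and its method names are hypothetical and are not part of the API listed here. Treating letters and digits as word characters, and emitting words shorter than n unchanged, are assumptions made for illustration; the actual classes may behave differently.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the documented API.
class LinguisticsSketch {

    // Matches the combining diacritical marks that NFD decomposition separates out.
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decomposes accented characters into base character plus combining marks,
    // then strips the marks with a regex.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    // Splits each run of word characters into overlapping n-grams of length n.
    // Assumption: "word character" means letter or digit, and a run of at most
    // n characters is emitted as a single gram. Works on chars, so characters
    // outside the Basic Multilingual Plane are not handled specially.
    static List<String> charNgrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators between word runs
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String word = text.substring(start, i);
            if (word.length() <= n) {
                grams.add(word);
            } else {
                for (int j = 0; j + n <= word.length(); j++) {
                    grams.add(word.substring(j, j + n));
                }
            }
        }
        return grams;
    }
}
```

In this sketch, n-grams never span a separator, so each word run is processed independently.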