All Classes

AbstractLanguageModel
AbstractNGramDictionary<T extends AbstractNGramTrieNode<T>> - Represents an ngram dictionary in an abstract way: the dictionary can be static or dynamic. Each type of dictionary may or may not support operations such as saving the dictionary or updating probabilities. The dictionary has an AbstractNGramDictionary.maxOrder that represents the max order of gram that can be found in the dictionary.
AbstractNGramTrieNode<T extends AbstractNGramTrieNode<?>> - Represents a node in a trie structure used to represent ngrams.
AbstractPredictionToCompute
AbstractRecursiveMatcher
AbstractTokenTrainingDocument
AbstractTrainingDocument
AbstractWord
AcronymMatcher
ApostropheMatcher
BaseWordDictionary - A language-specific dictionary: contains lower-case words and their unigram frequencies.
BiIntegerKey
CachedPrecomputedCorrectionRule - Cached version of a CorrectionRule: this rule is meant to be used directly in WordCorrectionGenerator. It only contains information and should not be modified once generated from a CorrectionRule.
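The trie layout described for AbstractNGramDictionary and AbstractNGramTrieNode above can be illustrated with a short sketch. The class and method names below (SketchTrieNode, increment, count) are hypothetical and are not part of the library's API; a minimal sketch, assuming nodes only keep raw occurrence counts:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical ngram trie sketch: each node maps a word ID to a child node,
// and stores how many times the ngram ending at this node was observed.
class SketchTrieNode {
    final Map<Integer, SketchTrieNode> children = new HashMap<>();
    int count;

    // Walks (creating nodes as needed) the path for the given ngram of word IDs
    // and increments the count at the final node.
    void increment(int[] ngram, int index) {
        if (index == ngram.length) {
            count++;
            return;
        }
        children.computeIfAbsent(ngram[index], k -> new SketchTrieNode())
                .increment(ngram, index + 1);
    }

    // Returns the count stored for the given ngram, or 0 if it was never seen.
    int count(int[] ngram, int index) {
        if (index == ngram.length) return count;
        SketchTrieNode child = children.get(ngram[index]);
        return child == null ? 0 : child.count(ngram, index + 1);
    }
}
```

A real node would also have to carry the frequency and backoff information needed for probability computation; the sketch only keeps counts to show the trie shape.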
CoOccurrenceKey
CorrectionRule - This is the most convenient way to create correction rules, as it allows direct modification and has helper methods. The WordCorrectionGenerator will then generate CachedPrecomputedCorrectionRule instances to use this rule. Note that a single builder instance can result in multiple correction rules: the cached rules should never be configured directly by the user, as this rule form is more understandable. Correction rules work as follows: you define errors, which are the parts replaced, and replacements, which are the parts correcting the errors.
CorrectionRuleNode - The way to represent correction rules used in WordPredictor via WordCorrectionGenerator. Correction rules are represented as a tree where you can enable/disable whole parts of it (e.g. disabling a parent node also disables its children). Nodes are typed with CorrectionRuleNode.getType(), so they can be CorrectionRuleNodeType.NODE or CorrectionRuleNodeType.LEAF. Every node can technically contain a CorrectionRuleNode.getCorrectionRule(), but be aware that only CorrectionRuleNodeType.LEAF nodes are taken into account by WordCorrectionGenerator.
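The enable/disable semantics of the correction rule tree described above can be sketched as follows. SketchRuleNode and its methods are hypothetical names, not the library's API; the point is only that disabling a parent implicitly disables its whole subtree:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a correction rule tree: a node is effectively active
// only if the node itself and all of its ancestors are enabled.
class SketchRuleNode {
    private final List<SketchRuleNode> children = new ArrayList<>();
    private SketchRuleNode parent;
    private boolean enabled = true;

    void add(SketchRuleNode child) {
        child.parent = this;
        children.add(child);
    }

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    // Walks up the tree: disabling a parent disables all of its children.
    boolean isEffectivelyEnabled() {
        return enabled && (parent == null || parent.isEffectivelyEnabled());
    }
}
```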
CorrectionRuleNodeType - Represents the type of a CorrectionRuleNode.
DaemonThreadFactory
DataTrainer - Creates prediction data to be used with a word predictor.
DataTrainerResult
DataTrainerResult.Builder - Builder to build a DataTrainerResult.
DateDayMonthMatcher
DateFullDigitMatcher
DateFullTextMatcher
DateMonthYearMatcher
DateWeekDayMatcher
DoublePredictionToCompute - Represents the prediction for two words in a row. It could have been generic (more than two words), but for computing performance, combinations are limited to two words only.
DynamicNGramDictionary - Represents a TrainingNGramDictionary that can also be opened to be trained again. This type of dictionary is useful when using a dynamic user model: the dynamic user dictionary is loaded and trained during each session, then saved to be used in the next sessions.
DynamicNGramTrieNode - Represents a dynamic trie node structure: this trie node is useful when the ngram count has to be retrieved. Dynamic trie node children are always fully loaded (they are not loaded on demand) and their frequencies can change. Because dynamic trie nodes can be saved and later loaded as either StaticNGramTrieNode or DynamicNGramTrieNode, they contain two write methods: DynamicNGramTrieNode.writeStaticNode(FileChannel, int) if they are saved to be loaded as StaticNGramTrieNode, and DynamicNGramTrieNode.writeDynamicNode(FileChannel, int) if they are saved to be loaded as DynamicNGramTrieNode. One saves static information about the node (frequency, bow); the other saves only dynamic information (count), because frequencies are dynamically computed.
EquivalenceClass - Represents an equivalence class type that can be used when training a language model. Useful to group the same kind of element in a corpus under a single concept instead of textual data. These are especially used in semantic data.
EquivalenceClassToken
EquivalenceClassWord
FifoSet<T> - A set maintaining at most FifoSet.maxSize elements and keeping their insertion order, to always delete the first inserted element when the set is full.
FrenchBaseWordDictionary - French dictionary based on Lexique.org.
FrenchDefaultCorrectionRuleGenerator - Generates base correction rules for the French language.
Keeps every possible rule in FrenchDefaultCorrectionRuleGenerator.CorrectionRuleType with a translated name, description and example.
FrenchDefaultCorrectionRuleGenerator.CorrectionRuleType
FrenchDefaultCorrectionRuleGenerator.TranslationProvider
FrenchLanguageModel
FrenchLanguageUtils - Utility methods for the French language.
FrenchStopWordDictionary
GeneratingCorrectionI
HyphenMatcher - Term matcher to match word sequences with a hyphen between each word. The sequence should not start or end with a hyphen. Examples: "a-t": valid; "a-t-elle": valid; "a-t-elle-": not valid; "-test-": not valid.
LanguageModel - Represents a model specific to the input language. This model is useful to perform better on NLP tasks by using language-specific parameters.
LoggingProgressIndicator
NextWord
NGramDebugger - This interface can be used to check an ngram dictionary while training models.
NGramDictionaryGenerator - Use this generator to train an ngram model. It loads texts from a TrainingCorpus and generates an ngram file that can later be opened with a StaticNGramTrieDictionary.
NGramKey
NGramPruningMethod
NGramTrainingDocument
NGramWordPredictorUtils - Utility class useful when predicting words with an ngram dictionary.
NoOpProgressIndicator
NumberDecimalMatcher
NumberIntMatcher
Pair<K,T>
ParserTrainingDocument
PatternMatched
PercentMatcher
Predict4AllInfo - Retrieves information about the library (version and build date).
This should mostly be used to ensure consistency of saved data (i.e. save and load data with the same versions).
Predict4AllUtils - Contains different utility methods used in NLP tasks.
PredictionParameter - Contains parameters to configure how WordPredictor works. Changes made to an instance of PredictionParameter while the predictor is running may not be reflected, as some values are cached internally.
ProgressIndicator
ProperNameMatcher
SemanticDictionary - Represents a semantic dictionary to be used to predict next words. WARNING: THIS IS A WIP.
SemanticDictionaryConfiguration
SemanticDictionaryGenerator - Generates a SemanticDictionary from an input corpus. This creates a term x term matrix and then reduces it with SVD (via an optimized R script; "Rscript" should be available in the path).
SemanticTrainingDocument
Separator - Represents the chars between words. This is preferred to regex patterns because separators are fully controlled. If you add any new separator, watch the last used ID.
SeparatorToken
SimpleGeneratingCorrection
SimpleWord
SingleThreadDoubleAdder - Similar to DoubleAdder but for single-threaded usage. Just a simple double reference without any overhead.
SpecialWordMatcher
StaticNGramTrieDictionary - Represents a static ngram dictionary where trie nodes are loaded "on demand" while browsing through the nodes. This dictionary is read-only and cannot be updated or saved: methods like StaticNGramTrieDictionary.updateProbabilities(double[]) and StaticNGramTrieDictionary.putAndIncrementBy(int[], int) are not supported by this dictionary.
StaticNGramTrieNode - Represents a static ngram trie node: nodes are used only to retrieve information and compute probabilities, and children are never updated.
This node is particular because child nodes are loaded on demand from a FileChannel. This node is produced in a read-only version: to create such nodes, DynamicNGramTrieNode and TrainingNGramDictionary should be used.
StopWordDictionary - A language-specific dictionary: contains every stop word for a language.
StringProducer
Tag - Represents a specific value in a corpus. Useful to tag a specific part of the corpus without any semantic information. START: represents a sentence start; UNKNOWN: represents a word/expression out of vocabulary.
TagToken
TagWord
TermMatcherUtils
Token - Represents the lowest unit when parsing a text.
TokenAppender
TokenConverter - This token converter converts an input token list to another token list, using matched TokenMatcher patterns.
TokenConverterTrainingDocument
TokenFileInputStream
TokenFileOutputStream
Tokenizer - Takes a raw text and creates tokens from it.
TokenListAppender
TokenListProvider
TokenMatcher - Represents a matcher that tries to detect whether a given token matches a specific pattern. If so, the PatternMatched contains the normalized representation of the matched tokens and possibly an EquivalenceClass.
TokenProvider
TokenRegexMatcher
TokenRegexMatcher.TokenRegexMatcherBuilder
TokenRegexResult
TrainerTask
TrainingConfiguration
TrainingCorpus
TrainingNGramDictionary - Represents a training dictionary: an ngram dictionary used while training an ngram model. This dictionary is useful because it supports dynamic insertion and probability computation.
TrainingStep - Represents the possible training steps. This allows training to be stopped and started again at a specific step: e.g. going up to the converted tokens step, and then running WORDS_DICTIONARY multiple times.
TrieNodeMap<V> - Custom implementation copied from TIntObjectHashMap but with fewer attributes, to reduce the heap size in the trie. The source is copied from the class hierarchy (with methods manually merged): THash, TPrimitiveHash, TIntHash, TIntObjectHashMap. The implementation is modified to keep the minimum attribute count on this map, because TrieNodeMap instances are created many times!
TrieNodeMapConstant
Triple<K,T,V>
UniquePredictionToCompute
UserWord
Word - Represents a word stored in a WordDictionary
- words are stored with an int ID to optimize memory usage.
WordCorrectionGenerator - Generates possible corrections from an input text and tokens. Corrections are based on rules (CorrectionRule), and generation is done using a thread pool. The resulting correction can be a unique word or a double word (for example, the error might be a merged word).
WordDictionary - Represents a word dictionary. This dictionary identifies each sequence of chars as a unique "word" and keeps information for this word. Each word is identified by a single int ID to save memory and space. The dictionary itself is identified with a UUID to verify consistency when using a user dictionary. Note that a Word added to a WordDictionary cannot be removed: word IDs should stay consistent, and the words could have been used in an AbstractNGramDictionary. However, you can disable a word with Word.setForceInvalid(boolean, boolean).
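The int-ID storage and the "disable instead of remove" behaviour described for Word and WordDictionary can be sketched as below. SketchWordDictionary and its methods are hypothetical illustrations, not the library's implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical word dictionary sketch: each distinct string gets a stable
// int ID; words are never removed, only flagged invalid.
class SketchWordDictionary {
    private final Map<String, Integer> idsByWord = new HashMap<>();
    private final List<String> wordsById = new ArrayList<>();
    private final List<Boolean> invalid = new ArrayList<>();

    // Returns the existing ID for a word, or assigns the next free ID.
    int getOrPut(String word) {
        Integer id = idsByWord.get(word);
        if (id != null) return id;
        int newId = wordsById.size();
        idsByWord.put(word, newId);
        wordsById.add(word);
        invalid.add(false);
        return newId;
    }

    String getWord(int id) {
        return wordsById.get(id);
    }

    // Words keep their ID forever (ngram data may reference them);
    // they can only be marked invalid.
    void setInvalid(int id, boolean value) {
        invalid.set(id, value);
    }

    boolean isValid(int id) {
        return !invalid.get(id);
    }
}
```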
WordDictionaryGenerator - Generates a word dictionary from a TrainingCorpus: it detects the different words in the training corpus and tries to filter them: matching lower/upper-case words, filtering on a BaseWordDictionary, excluding low-count words, etc.
WordDictionaryMatchingException - This exception is mainly thrown when a user dictionary is loaded but was saved from a previous dictionary.
WordDictionaryTrainingDocument
WordFileInputStream
WordFileOutputStream
WordPrediction - Represents a prediction from WordPredictor.
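One of the filtering steps described for WordDictionaryGenerator (case-insensitive counting, then excluding low-count words) can be sketched as follows; SketchWordFilter, its method and its threshold parameter are hypothetical names, not the library's API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a word-dictionary filtering step: count word
// occurrences in a corpus (lower-cased) and drop words seen fewer than
// minCount times.
class SketchWordFilter {
    static Map<String, Integer> countAndFilter(String[] corpusTokens, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : corpusTokens) {
            counts.merge(token.toLowerCase(), 1, Integer::sum);
        }
        // Exclude low-count words, as the generator does with its own threshold.
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }
}
```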
WordPredictionResult - Contains the result from WordPredictor.
WordPredictor - Main entry point of the PREDICT4ALL API. An instance of WordPredictor can predict next words, current word endings, and even current corrections. The predictor mainly relies on two items, the ngram dictionary and the word dictionary, to search for words and existing sequences. Additionally, a dynamic model can be provided to combine static ngrams originating from an already learned generic model with a dynamic model specific to a user, profile, application, etc. The predictor configuration is located in PredictionParameter: the instance provided at WordPredictor creation can be modified later.
WordPrefixDetected - Contains information about a started word (found in the dictionary).
WordPrefixDetector - Useful to detect whether an existing word is started in a token list. It's important to detect if a word is already started when predicting the next word, because the prediction result should always take care of giving results that start like the already started word. Because words are allowed to contain word separators (hyphen, etc.), started word detection is much more complicated than just checking whether the token list ends with a token separator.
WordToken
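The prefix constraint described for WordPrefixDetector (prediction results must start like the already started word) can be sketched with a simple filter. SketchPrefixFilter is a hypothetical illustration; it ignores the in-word separator subtleties mentioned above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of prefix-aware prediction filtering: when a word is
// already started, only predictions starting with that prefix are kept.
class SketchPrefixFilter {
    static List<String> filterByPrefix(List<String> predictions, String startedPrefix) {
        String prefix = startedPrefix.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String candidate : predictions) {
            if (candidate.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

A real detector also has to decide where the started word begins, since separators such as hyphens may legitimately occur inside a word; the sketch assumes the prefix is already known.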