All Classes
Class - Description
AbstractLanguageModel
AbstractNGramDictionary<T extends AbstractNGramTrieNode<T>> - Represents an ngram dictionary in an abstract way: the dictionary can be static or dynamic.
Each type of dictionary may or may not support certain operations, such as saving the dictionary or updating probabilities.
The dictionary has an AbstractNGramDictionary.maxOrder that represents the maximum ngram order that can be found in the dictionary.
AbstractNGramTrieNode<T extends AbstractNGramTrieNode<?>> - Represents a node in a trie structure used to represent ngrams.
AbstractPredictionToCompute
AbstractRecursiveMatcher
AbstractTokenTrainingDocument
AbstractTrainingDocument
AbstractWord
AcronymMatcher
ApostropheMatcher
BaseWordDictionary - A language-specific dictionary: contains lower case words and their unigram frequencies.
BiIntegerKey
CoOccurrenceKey
CorrectionRule
CorrectionRuleBuilder
CorrectionRuleNode
CorrectionRuleNode.CorrectionRuleNodeType
DaemonThreadFactory
DataTrainer - Class to create prediction data to be used with a word predictor.
DataTrainerResult
DataTrainerResult.Builder - Builder to build DataTrainerResult.
DateDayMonthMatcher
DateFullDigitMatcher
DateFullTextMatcher
DateMonthYearMatcher
DateWeekDayMatcher
DefaultCorrectionRuleGenerator
DefaultCorrectionRuleGenerator.CorrectionRuleType
DefaultCorrectionRuleGenerator.TranslationProvider
DoublePredictionToCompute - Represents the prediction for two words in a row.
It could have been generic (more than two words), but for computing performance the combination is limited to two words only.
DynamicNGramDictionary - Represents a TrainingNGramDictionary that can also be opened to be trained again.
This type of dictionary is useful when using a dynamic user model: the dynamic user dictionary is loaded and trained during each session, then saved to be used in the next sessions.
DynamicNGramTrieNode - Represents a dynamic trie node structure: this trie node is useful when the ngram count has to be retrieved.
Dynamic trie node children are always fully loaded (they are not loaded on demand) and their frequencies can change.
Because dynamic trie nodes can be saved and then loaded as either StaticNGramTrieNode or DynamicNGramTrieNode, they contain two write methods: DynamicNGramTrieNode.writeStaticNode(FileChannel, int) if they are saved to be loaded as StaticNGramTrieNode, and DynamicNGramTrieNode.writeDynamicNode(FileChannel, int) if they are saved to be loaded as DynamicNGramTrieNode. The first saves static information about the node (frequency, bow); the second saves only dynamic information (count), because frequencies are dynamically computed.
EquivalenceClass - Represents an equivalence class type that can be used when training a language model.
Useful to group elements of the same kind in a corpus under a single concept instead of their textual data.
These are especially used in semantic data.
EquivalenceClassToken
EquivalenceClassWord
FifoSet<T> - A set holding at most FifoSet.maxSize elements while keeping their insertion order, so that the first inserted element is always the one deleted when the set is full.
FrenchBaseWordDictionary - French dictionary based on Lexique.org
FrenchLanguageModel
FrenchLanguageUtils - Utility methods for the French language.
FrenchStopWordDictionary
GeneratingCorrection
GeneratingCorrectionI
HyphenMatcher - Term matcher that matches a word sequence with a hyphen between each word.
The sequence must not start or end with a hyphen. Examples: a-t (valid), a-t-elle (valid), a-t-elle- (not valid), -test- (not valid).
LanguageDataModelTrainer
LanguageDataModelTrainerArgs
LanguageModel - Represents a model specific to the input language.
This model helps to perform better on NLP tasks by using parameters specific to a language.
E.G.
LoggingProgressIndicator
NextWord
NGramDebugger - This interface can be used to check an ngram dictionary while training models.
NGramDictionaryGenerator - Use this generator to train an ngram model.
It loads texts from a TrainingCorpus and generates an ngram file that can later be opened with a StaticNGramTrieDictionary.
NGramKey
NGramPruningMethod
NGramTrainingDocument
NGramWordPredictorUtils - Utility class useful when predicting words with an ngram dictionary.
NoOpProgressIndicator
NumberDecimalMatcher
NumberIntMatcher
Pair<K,T>
ParserTrainingDocument
PatternMatched
PercentMatcher
Predict4AllInfo - Retrieves information about the library (version and build date).
This should mostly be used to ensure consistency of saved data (i.e. saving and loading data with the same version).
Predict4AllUtils - Contains various utility methods that are used in NLP tasks.
PredictionParameter
ProgressIndicator
ProperNameMatcher
SemanticDictionary - Represents a semantic dictionary to be used to predict next words. WARNING: THIS IS A WIP.
SemanticDictionaryConfiguration
SemanticDictionaryGenerator - Generates a SemanticDictionary from an input corpus.
This creates a term x term matrix and then reduces it with SVD (via an optimized R script; "Rscript" must be available in the path).
SemanticTrainingDocument
Separator - Represents the chars between words.
This is preferred to a regex pattern because separators are fully controlled.
If you add a new separator, check the last used id.
SeparatorToken
SimpleGeneratingCorrection
SimpleWord
SingleThreadDoubleAdder - Similar to DoubleAdder, but for single-threaded usage. Just a simple double reference without any overhead.
SpecialWordMatcher
StaticNGramTrieDictionary - Represents a static ngram dictionary where trie nodes are loaded "on demand" while browsing through the nodes.
This dictionary is read-only and cannot be updated or saved: methods like StaticNGramTrieDictionary.updateProbabilities(double[]) and StaticNGramTrieDictionary.putAndIncrementBy(int[], int) are not supported by this dictionary.
StaticNGramTrieNode - Represents a static ngram trie node, for cases where nodes are used only to retrieve information and compute probabilities and their children are never updated.
This node is particular because its child nodes are loaded on demand from a FileChannel.
This node is produced in a read-only version: to create this node, DynamicNGramTrieNode and TrainingNGramDictionary should be used.
StopWordDictionary - A language-specific dictionary: contains every stop word for a language.
StringProducer
Tag - Represents a specific value in a corpus.
Useful to tag specific parts of the corpus without any semantic information.
START represents a sentence start; UNKNOWN represents a word/expression out of vocabulary.
TagToken
TagWord
TermMatcherUtils
Token - Represents the lowest unit when parsing a text.
TokenAppender
TokenConverter - Converts an input token list to another token list, with matched TokenMatcher patterns.
TokenConverterTrainingDocument
TokenFileInputStream
TokenFileOutputStream
Tokenizer - Takes a raw text and creates tokens from it.
TokenListAppender
TokenListProvider
TokenMatcher - Represents a matcher that tries to detect whether a given token matches a specific pattern.
If so, the PatternMatched contains the normalized representation of the matched tokens and, possibly, an EquivalenceClass.
TokenProvider
TokenRegexMatcher
TokenRegexMatcher.TokenRegexMatcherBuilder
TokenRegexResult
TrainerTask
TrainingConfiguration
TrainingCorpus
TrainingNGramDictionary - Represents a training dictionary: an ngram dictionary used while training an ngram model.
This dictionary is useful because it supports dynamic insertion and probability computation...
TrainingStep - Represents the possible training steps.
This allows training to be stopped and started again at a specific step, e.g. going up to the converted tokens and then running WORDS_DICTIONARY multiple times.
TrieNodeMap<V> - Custom implementation copied from TIntObjectHashMap, but with fewer attributes to reduce the heap size in the trie.
The source is copied from the class hierarchy (with methods merged manually): THash, TPrimitiveHash, TIntHash, TIntObjectHashMap.
The implementation is modified to keep the attribute count of this map to a minimum, because TrieNodeMap instances are created a great many times.
TrieNodeMapConstant
Triple<K,T,V>
UniquePredictionToCompute
UserWord
Word - Represents a word stored in a WordDictionary. Words are stored with an int ID to optimize memory usage.
WordCorrectionGenerator - Ideas: inversion at a distance of 2 = "renuméré". Handling of inversions.
WordDictionary - Represents a word dictionary.
This dictionary identifies each sequence of chars as a unique "word" and keeps information for this word (frequency, etc.).
WordDictionaryGenerator - Generates a word dictionary from a TrainingCorpus: it detects the different words in the training corpus and tries to filter them: match lower/upper case words, filter on a BaseWordDictionary, exclude low count words, etc.
WordDictionaryMatchingException - This exception is mainly thrown if a user dictionary is loaded but it was saved from a previous dictionary.
WordDictionaryTrainingDocument
WordFileInputStream
WordFileOutputStream
WordPrediction
WordPredictionResult
WordPredictor
WordPrefixDetected - Contains information about a started word (found in the dictionary).
WordPrefixDetector - Useful to detect whether an existing word is started in a token list.
It is important to detect whether a word is already started when predicting the next word, because the prediction results should always start like the already started word.
Because words are allowed to contain word separators (hyphen, etc.), started word detection is much more complicated than just checking whether the token list ends with a separator token.
WordToken
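
The WordPrefixDetector entry notes that started word detection cannot rely only on whether the token list ends with a separator. The sketch below illustrates why, under stated assumptions: tokens are plain strings, the dictionary is a sorted set of words, and a word is "started" when the joined trailing tokens are the beginning of some dictionary word. The names StartedWordSketch, naiveIsWordStarted and detectStartedWord are hypothetical and are not part of the library; this is not the library's actual algorithm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only: not the WordPrefixDetector implementation.
public class StartedWordSketch {

    /** Naive check: a word is considered started only if the last token is not a separator. */
    static boolean naiveIsWordStarted(List<String> tokens) {
        if (tokens.isEmpty()) {
            return false;
        }
        String last = tokens.get(tokens.size() - 1);
        return !last.equals(" ") && !last.equals("-");
    }

    /**
     * Returns the longest run of trailing tokens that, once joined, is the beginning
     * of a dictionary word, or an empty string if no word is started.
     */
    static String detectStartedWord(List<String> tokens, NavigableSet<String> dictionary) {
        // Try the longest suffix of the token list first, then shorter ones.
        for (int start = 0; start < tokens.size(); start++) {
            String candidate = String.join("", tokens.subList(start, tokens.size()));
            // The smallest word >= candidate starts with candidate if any word does.
            String ceiling = dictionary.ceiling(candidate);
            if (ceiling != null && ceiling.startsWith(candidate)) {
                return candidate;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary = new TreeSet<>(Arrays.asList("arc", "arc-en-ciel", "un"));
        List<String> tokens = Arrays.asList("un", " ", "arc", "-", "en", "-");
        System.out.println(naiveIsWordStarted(tokens));            // prints false
        System.out.println(detectStartedWord(tokens, dictionary)); // prints arc-en-
    }
}
```

With the token list ending in a hyphen, the naive check reports that no word is started, while the suffix-based check finds "arc-en-" as the beginning of "arc-en-ciel", so predictions could be restricted to words starting that way.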
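
The behavior described in the FifoSet entry (a bounded set that deletes the first inserted element when full) can be sketched with a LinkedHashSet. This is a minimal illustration, not the library's FifoSet: the class name BoundedFifoSetSketch and its methods are hypothetical, and leaving the order unchanged when an existing element is added again is an assumption.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only: not the library's FifoSet implementation.
public class BoundedFifoSetSketch<T> {
    private final int maxSize;
    // LinkedHashSet iterates in insertion order, so its first element is the oldest.
    private final Set<T> elements = new LinkedHashSet<>();

    public BoundedFifoSetSketch(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Adds an element; if the set is full, the first inserted element is removed first. */
    public boolean add(T element) {
        if (elements.contains(element)) {
            return false; // assumption: re-adding an element does not change its position
        }
        if (elements.size() >= maxSize) {
            Iterator<T> oldest = elements.iterator();
            oldest.next();
            oldest.remove();
        }
        return elements.add(element);
    }

    public boolean contains(T element) {
        return elements.contains(element);
    }

    public int size() {
        return elements.size();
    }

    @Override
    public String toString() {
        return elements.toString();
    }

    public static void main(String[] args) {
        BoundedFifoSetSketch<String> set = new BoundedFifoSetSketch<>(2);
        set.add("a");
        set.add("b");
        set.add("c"); // the set is full, so "a" is removed
        System.out.println(set); // prints [b, c]
    }
}
```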