Class AbstractNGramDictionary<T extends AbstractNGramTrieNode<T>>
java.lang.Object
  org.predict4all.nlp.ngram.dictionary.AbstractNGramDictionary<T>
Type Parameters:
T - type of trie node stored in this dictionary.
All Implemented Interfaces:
java.lang.AutoCloseable
Direct Known Subclasses:
StaticNGramTrieDictionary, TrainingNGramDictionary
public abstract class AbstractNGramDictionary<T extends AbstractNGramTrieNode<T>> extends java.lang.Object implements java.lang.AutoCloseable
Represents an ngram dictionary in an abstract way: a dictionary can be static or dynamic.
Each type of dictionary may or may not support a given operation, such as saving the dictionary or updating probabilities.
The dictionary has a maxOrder that represents the highest ngram order that can be found in the dictionary. The order of an ngram corresponds to its rank: 1 = unigram, 2 = bigram, etc. The order in a dictionary is not bounded by a maximum value, but in practice it is never more than 5.
Dictionaries are represented as tries, and different kinds of trie are available. Each type of dictionary is associated with a different type of AbstractNGramTrieNode (e.g. a dynamic dictionary is associated with a dynamic trie node).
-
-
Field Summary
Fields (modifier and type, then field, then description)
protected static int DICTIONARY_INFORMATION_BYTE_COUNT
  Byte count needed to save general information about this dictionary.
protected int maxOrder
  Max order possible to store in this dictionary. Could be retrieved by opening the dictionary, or set by user as a limit.
protected T rootNode
  Root node of this dictionary (this node contains as children the whole vocabulary)
-
Constructor Summary
Constructors (constructor, then description)
AbstractNGramDictionary(T rootNode, int maxOrderP)
  Construct a dictionary with a given root node and a max possible order.
-
Method Summary
Methods (modifier and type, then method, then description)
abstract boolean checkChildrenLoading(T node)
  Checks that the children of a given node are loaded into memory (and can be used).
void compact()
  Compact the nodes in this dictionary (this will call AbstractNGramTrieNode.compact() on root).
abstract double[] computeD(TrainingConfiguration configuration)
  Compute the optimal value for d (absolute discounting parameter). Usually d is computed with the formula: D = C1 / (C1 + 2 * C2), where C1 = number of ngrams with count == 1, and C2 = number of ngrams with count == 2.
int getMaxOrder()
gnu.trove.set.hash.TIntHashSet getNextWord(int[] prefix)
  Return the immediate next words for a given prefix (without any filter).
abstract T getNodeForPrefix(int[] prefix, int index)
  Retrieves the node for a given prefix. For example, prefix = [1,2] will return the trie node corresponding to {2}. The children of the returned node may not have been loaded.
double getProbability(int[] prefix, int index, int length, int wordId)
  Return the probability of a word for a given prefix. Given index = 0 and length = prefix.length, this will return the maximum order probability (e.g. with prefix.length = 3, it will return the probability for order 3).
double getRawProbability(int[] prefix, int index, int length, int wordId)
T getRoot()
void listNextWords(int[] prefix, WordDictionary wordDictionary, PredictionParameter predictionParameter, gnu.trove.set.hash.TIntHashSet wordsToExclude, java.util.Map<BiIntegerKey,NextWord> resultSet, int wantedCount, boolean unigramLevel)
  Goes through each ngram dictionary order to find the next possible words for a given prefix. It first goes through the highest order for the given prefix (e.g. prefix length == 3 means order 4) and, if wantedCount is not reached, goes to the lower orders to find new possible next words.
protected abstract void openDictionary(java.io.File dictionaryFile)
  Open a dictionary from a file. To use the dictionary, the same WordDictionary that was used when saving it should be used.
abstract void putAndIncrementBy(int[] ngram, int increment)
  Adds a given ngram to the dictionary and increments its count. If the ngram is already in the dictionary, only its count is incremented. This will call putAndIncrementBy(int[], int, int) with index = 0.
abstract void putAndIncrementBy(int[] ngram, int index, int increment)
  Adds a given ngram to the dictionary and increments its count. If the ngram is already in the dictionary, only its count is incremented.
protected void readDictionaryInformation(java.nio.ByteBuffer byteBuffer)
  Read the general information for this dictionary from a given buffer (doesn't do any check).
abstract void saveDictionary(java.io.File dictionaryFile)
  Save this dictionary to a file. The dictionary is saved with word ids only, which means that the same word dictionary should be loaded if this dictionary is opened later.
abstract void updateProbabilities(double[] d)
  Updates all the probabilities in this dictionary. This can take a while if there are a lot of nodes in the dictionary.
abstract void updateProbabilities(int[] prefix, int prefixIndex, double[] d)
  Updates the probabilities in this dictionary for a specific ngram prefix: this will update the probabilities of the prefix children, and update the backoff weight of the parent node. This is much more efficient than updateProbabilities(double[]).
protected void writeDictionaryInfo(java.nio.ByteBuffer buffWrite)
  Write the general information for this dictionary to a given buffer.
-
-
-
Field Detail
-
DICTIONARY_INFORMATION_BYTE_COUNT
protected static final int DICTIONARY_INFORMATION_BYTE_COUNT
Byte count needed to save general information about this dictionary (e.g. max order).
See Also:
Constant Field Values
-
maxOrder
protected int maxOrder
Max order possible to store in this dictionary.
Could be retrieved by opening the dictionary, or set by user as a limit.
-
rootNode
protected final T extends AbstractNGramTrieNode<T> rootNode
Root node of this dictionary (this node contains as children the whole vocabulary)
-
-
Constructor Detail
-
AbstractNGramDictionary
public AbstractNGramDictionary(T rootNode, int maxOrderP)
Construct a dictionary with a given root node and a max possible order.
Parameters:
rootNode - the root node to use for this dictionary
maxOrderP - max possible order for this dictionary.
-
-
Method Detail
-
getRoot
public T getRoot()
Returns:
the root for this dictionary
-
getMaxOrder
public int getMaxOrder()
Returns:
the max possible order for this dictionary
-
compact
public void compact()
Compact the nodes in this dictionary (this will call AbstractNGramTrieNode.compact() on root).
-
getNodeForPrefix
public abstract T getNodeForPrefix(int[] prefix, int index)
Retrieves the node for a given prefix.
For example, prefix = [1,2] will return the trie node corresponding to {2}.
The children of the returned node may not have been loaded.
Parameters:
prefix - the node prefix
index - index of the first word to use in the prefix (to take the full prefix, index should be 0)
Returns:
the node found for the given prefix, or null if there is no existing node for such prefix
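The lookup described above can be illustrated with a minimal sketch. It assumes a simplified, fully in-memory trie: `LookupNode` and `PrefixLookupSketch` are hypothetical names introduced for illustration and are not part of the library, whose concrete nodes may load children lazily.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical simplified trie node (not the library's AbstractNGramTrieNode). */
final class LookupNode {
    // Children keyed by word id.
    final Map<Integer, LookupNode> children = new HashMap<>();
}

final class PrefixLookupSketch {
    /**
     * Walks the trie from root along prefix[index..] and returns the node
     * reached by the last word, or null if the path does not exist.
     */
    static LookupNode getNodeForPrefix(LookupNode root, int[] prefix, int index) {
        LookupNode current = root;
        for (int i = index; i < prefix.length; i++) {
            current = current.children.get(prefix[i]);
            if (current == null) {
                return null; // no node for this prefix
            }
        }
        return current;
    }

    public static void main(String[] args) {
        LookupNode root = new LookupNode();
        LookupNode one = new LookupNode();
        LookupNode two = new LookupNode();
        root.children.put(1, one);
        one.children.put(2, two);
        // Full prefix [1,2]: the node for word 2 under word 1.
        System.out.println(getNodeForPrefix(root, new int[] {1, 2}, 0) == two);
        // index = 1 skips word 1; this sample trie has no top-level node for 2.
        System.out.println(getNodeForPrefix(root, new int[] {1, 2}, 1) == null);
    }
}
```

In this sketch an empty remaining prefix returns the root itself; whether the library behaves the same way is not stated in this page.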
-
checkChildrenLoading
public abstract boolean checkChildrenLoading(T node)
Checks that the children of a given node are loaded into memory (and can be used).
Parameters:
node - the node to check children loading on
Returns:
true if this node has children and these children are loaded.
-
putAndIncrementBy
public abstract void putAndIncrementBy(int[] ngram, int index, int increment)
Adds a given ngram to the dictionary and increments its count.
If the ngram is already in the dictionary, only its count is incremented.
Parameters:
ngram - the ngram to put in the dictionary
index - index of the ngram start (the index at which the ngram becomes valid: for example, to skip the first ngram word, set index = 1)
increment - the increment value
-
putAndIncrementBy
public abstract void putAndIncrementBy(int[] ngram, int increment)
Adds a given ngram to the dictionary and increments its count.
If the ngram is already in the dictionary, only its count is incremented.
This will call putAndIncrementBy(int[], int, int) with index = 0.
Parameters:
ngram - the ngram to put in the dictionary
increment - the increment value
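As a rough illustration of the two overloads, the sketch below uses a hypothetical in-memory counting trie (`CountingTrieSketch` is not a library class). It assumes that only the node of the last word in the ngram receives the increment; the concrete subclasses may handle counts differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counting trie, used only to illustrate the two overloads. */
final class CountingTrieSketch {
    /** Simplified node: a count plus children keyed by word id. */
    static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
        int count;
    }

    final Node root = new Node();

    /** Creates the path for ngram[index..] if needed, then adds increment to its last node. */
    void putAndIncrementBy(int[] ngram, int index, int increment) {
        if (index >= ngram.length) {
            return; // nothing left to insert
        }
        Node current = root;
        for (int i = index; i < ngram.length; i++) {
            current = current.children.computeIfAbsent(ngram[i], id -> new Node());
        }
        current.count += increment;
    }

    /** Convenience overload: same as the method above with index = 0. */
    void putAndIncrementBy(int[] ngram, int increment) {
        putAndIncrementBy(ngram, 0, increment);
    }

    public static void main(String[] args) {
        CountingTrieSketch trie = new CountingTrieSketch();
        trie.putAndIncrementBy(new int[] {1, 2}, 1);       // new ngram, count becomes 1
        trie.putAndIncrementBy(new int[] {7, 1, 2}, 1, 1); // skips word 7, same ngram [1,2]
        System.out.println(trie.root.children.get(1).children.get(2).count);
    }
}
```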
-
saveDictionary
public abstract void saveDictionary(java.io.File dictionaryFile) throws java.io.IOException
Save this dictionary to a file.
The dictionary is saved with word ids only, which means that the same word dictionary should be loaded if this dictionary is opened later.
Parameters:
dictionaryFile - the file where the dictionary should be saved.
Throws:
java.io.IOException - if the dictionary can't be saved
-
openDictionary
protected abstract void openDictionary(java.io.File dictionaryFile) throws java.io.IOException
Open a dictionary from a file.
To use the dictionary, the same WordDictionary that was used when saving it should be used.
Parameters:
dictionaryFile - the file containing a dictionary.
Throws:
java.io.IOException - if the dictionary can't be opened
-
updateProbabilities
public abstract void updateProbabilities(double[] d)
Updates all the probabilities in this dictionary.
This can take a while if there are a lot of nodes in the dictionary.
Parameters:
d - the d parameter for the absolute discounting algorithm.
-
updateProbabilities
public abstract void updateProbabilities(int[] prefix, int prefixIndex, double[] d)
Updates the probabilities in this dictionary for a specific ngram prefix: this will update the probabilities of the prefix children, and update the backoff weight of the parent node.
This is much more efficient than updateProbabilities(double[]).
Parameters:
prefix - prefix of the node that should be updated
prefixIndex - prefix start index (0 = full prefix, 1 = skip the first word in prefix, etc.)
d - the d parameter for the absolute discounting algorithm.
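To make the "children probabilities plus parent backoff weight" idea concrete, here is a minimal sketch of textbook absolute discounting for a single node. It is an assumption that the library follows this exact formulation; `AbsoluteDiscountSketch` and its methods are hypothetical names, and the sketch takes a single d rather than the per-order double[] used by the real method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper showing textbook absolute discounting for one trie node. */
final class AbsoluteDiscountSketch {
    /** Sum of the children counts (assumed to be >= 1 for every child). */
    static int sumCounts(Map<Integer, Integer> childCounts) {
        int sum = 0;
        for (int count : childCounts.values()) {
            sum += count;
        }
        return sum;
    }

    /** Discounted probability of each child: max(count - d, 0) / total. */
    static Map<Integer, Double> childProbabilities(Map<Integer, Integer> childCounts, double d) {
        int total = sumCounts(childCounts);
        Map<Integer, Double> probabilities = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> entry : childCounts.entrySet()) {
            probabilities.put(entry.getKey(), Math.max(entry.getValue() - d, 0.0) / total);
        }
        return probabilities;
    }

    /** Mass removed by discounting and handed to the lower order: d * distinctChildren / total. */
    static double backoffWeight(Map<Integer, Integer> childCounts, double d) {
        int total = sumCounts(childCounts);
        // With no observed children, all the mass goes to the lower order.
        return total == 0 ? 1.0 : d * childCounts.size() / total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = Map.of(10, 3, 11, 1);
        System.out.println(childProbabilities(counts, 0.5));
        System.out.println(backoffWeight(counts, 0.5));
    }
}
```

With 0 <= d <= 1 and all counts at least 1, the child probabilities and the backoff weight sum to 1, which is why updating one prefix only touches its children and the weight stored on the parent.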
-
computeD
public abstract double[] computeD(TrainingConfiguration configuration)
Compute the optimal value for d (absolute discounting parameter).
Usually d is computed with the formula:
D = C1 / (C1 + 2 * C2)
where C1 = number of ngrams with count == 1, and C2 = number of ngrams with count == 2. These values are computed for each order (index 0 = unigram, index 1 = bigram, etc.)
Parameters:
configuration - configuration to use to compute D (can set min/max values and a D value)
Returns:
computed d values for this dictionary (one per order)
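The formula above can be worked through in a short sketch. The per-order arrays, the clamping to a min/max range, and the fallback value are assumptions suggested by the parameter description; `DiscountSketch` is a hypothetical class and does not use the library's TrainingConfiguration.

```java
import java.util.Arrays;

/** Hypothetical helper computing D = C1 / (C1 + 2 * C2) for each order. */
final class DiscountSketch {
    /** Returns C1 / (C1 + 2 * C2), or fallback when no ngram has count 1 or 2. */
    static double discount(long c1, long c2, double fallback) {
        long denominator = c1 + 2 * c2;
        return denominator == 0 ? fallback : (double) c1 / denominator;
    }

    /**
     * One d per order (index 0 = unigram, index 1 = bigram, ...),
     * clamped to [min, max]. Both arrays are assumed to have the same length.
     */
    static double[] computeD(long[] countOfOnes, long[] countOfTwos,
            double min, double max, double fallback) {
        double[] d = new double[countOfOnes.length];
        for (int order = 0; order < d.length; order++) {
            double raw = discount(countOfOnes[order], countOfTwos[order], fallback);
            d[order] = Math.min(max, Math.max(min, raw));
        }
        return d;
    }

    public static void main(String[] args) {
        // Unigrams: C1 = 3, C2 = 1, so D = 3 / (3 + 2) = 0.6.
        // Bigrams: no ngram with count 1 or 2, so the fallback is used.
        double[] d = computeD(new long[] {3, 0}, new long[] {1, 0}, 0.1, 0.9, 0.5);
        System.out.println(Arrays.toString(d));
    }
}
```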
-
listNextWords
public void listNextWords(int[] prefix, WordDictionary wordDictionary, PredictionParameter predictionParameter, gnu.trove.set.hash.TIntHashSet wordsToExclude, java.util.Map<BiIntegerKey,NextWord> resultSet, int wantedCount, boolean unigramLevel)
Goes through each ngram dictionary order to find the next possible words for a given prefix.
It first goes through the highest order for the given prefix (e.g. prefix length == 3 means order 4) and, if wantedCount is not reached, goes to the lower orders to find new possible next words.
Parameters:
prefix - the prefix to detect words after (word ids, representing an ngram prefix)
wordDictionary - word dictionary (useful only if prefixDetected is not null)
predictionParameter - prediction parameter (can be used to validate words)
wordsToExclude - a set of words that shouldn't be included in the result set
resultSet - result set that will contain every next word found
wantedCount - wanted next word count (a higher count will take more time)
unigramLevel - if true, this will go to unigram level (whole vocabulary) if there are not enough words found; this can be time consuming as unigram level contains the whole word dictionary
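The highest-order-first traversal described above might look like the following sketch. It is deliberately simplified: contexts are looked up in a plain map instead of a trie, results are word ids in a set rather than NextWord entries, and `NextWordBackoffSketch` and the `followers` map are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of collecting next words from the highest order down. */
final class NextWordBackoffSketch {
    /**
     * followers maps a context (list of word ids) to the word ids seen after it.
     * The empty context stands for the unigram level (whole vocabulary).
     */
    static Set<Integer> listNextWords(Map<List<Integer>, List<Integer>> followers, int[] prefix,
            Set<Integer> wordsToExclude, int wantedCount, boolean unigramLevel) {
        Set<Integer> result = new LinkedHashSet<>();
        // start = 0 uses the full prefix (highest order); each step drops the oldest word.
        for (int start = 0; start <= prefix.length && result.size() < wantedCount; start++) {
            if (start == prefix.length && !unigramLevel) {
                break; // the unigram level is only visited on request
            }
            List<Integer> context = new ArrayList<>();
            for (int i = start; i < prefix.length; i++) {
                context.add(prefix[i]);
            }
            for (int wordId : followers.getOrDefault(context, List.of())) {
                if (result.size() >= wantedCount) {
                    break;
                }
                if (!wordsToExclude.contains(wordId)) {
                    result.add(wordId); // already-found words are ignored by the set
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<List<Integer>, List<Integer>> followers = Map.of(
                List.of(1, 2), List.of(5),
                List.of(2), List.of(5, 6, 9),
                List.<Integer>of(), List.of(7, 8));
        System.out.println(listNextWords(followers, new int[] {1, 2}, Set.of(9), 3, true));
    }
}
```

Stopping as soon as wantedCount is reached is what keeps the lower (and much larger) orders from being scanned unnecessarily.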
-
getNextWord
public gnu.trove.set.hash.TIntHashSet getNextWord(int[] prefix) throws java.io.IOException
Return the immediate next words for a given prefix (without any filter).
Parameters:
prefix - the prefix (previous N words)
Returns:
a set containing the next words for the given prefix, or null if there is no existing ngram in the dictionary for this prefix
Throws:
java.io.IOException - if children can't be read
-
getProbability
public double getProbability(int[] prefix, int index, int length, int wordId)
Return the probability of a word for a given prefix.
Given index = 0 and length = prefix.length, this will return the maximum order probability (e.g. with prefix.length = 3, it will return the probability for order 3).
Parameters:
prefix - the words before the given word (prefix)
index - the start index in the given prefix (changes the order of the result)
length - the given prefix length (changes the order of the result).
wordId - the word we want the probability for
Returns:
the probability for the given word (0.0 - 1.0)
-
getRawProbability
public double getRawProbability(int[] prefix, int index, int length, int wordId)
-
readDictionaryInformation
protected void readDictionaryInformation(java.nio.ByteBuffer byteBuffer)
Read the general information for this dictionary from a given buffer (doesn't do any check).
Parameters:
byteBuffer - the byte buffer from which the dictionary information is read
-
writeDictionaryInfo
protected void writeDictionaryInfo(java.nio.ByteBuffer buffWrite)
Write the general information for this dictionary to a given buffer.
Parameters:
buffWrite - the byte buffer to which the information is written
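The read/write pair can be pictured with a small round-trip sketch. The layout used here, a single int holding the max order, is an assumption based on the "e.g. max order" note on DICTIONARY_INFORMATION_BYTE_COUNT; the actual header may contain more fields, and `DictionaryInfoSketch` is a hypothetical class.

```java
import java.nio.ByteBuffer;

/** Hypothetical header round trip; the real layout may differ. */
final class DictionaryInfoSketch {
    /** Assumed header size: one int for the max order. */
    static final int INFO_BYTE_COUNT = Integer.BYTES;

    /** Writes the general information (here only maxOrder) at the buffer's position. */
    static void writeInfo(ByteBuffer buffer, int maxOrder) {
        buffer.putInt(maxOrder);
    }

    /** Reads the general information back, without any validation. */
    static int readInfo(ByteBuffer buffer) {
        return buffer.getInt();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(INFO_BYTE_COUNT);
        writeInfo(buffer, 4);
        buffer.flip(); // switch from writing to reading
        System.out.println(readInfo(buffer));
    }
}
```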
-
-