Class TrainingNGramDictionary
- java.lang.Object
-
- org.predict4all.nlp.ngram.dictionary.AbstractNGramDictionary<DynamicNGramTrieNode>
-
- org.predict4all.nlp.ngram.dictionary.TrainingNGramDictionary
-
- All Implemented Interfaces:
AutoCloseable
- Direct Known Subclasses:
DynamicNGramDictionary
public class TrainingNGramDictionary extends AbstractNGramDictionary<DynamicNGramTrieNode>
Represents a training dictionary: an ngram dictionary used while training an ngram model.
This dictionary is useful because it supports dynamic insertion and probability computation. It always uses DynamicNGramTrieNode.
The default training dictionary is not meant to be opened: it saves the trie structure into a file that is then loaded as a StaticNGramTrieDictionary. However, DynamicNGramDictionary implements a dynamic dictionary that can be saved/opened with dynamic nodes.
-
-
Field Summary
Fields
static DecimalFormat NGRAM_COUNT_FORMAT
-
Fields inherited from class org.predict4all.nlp.ngram.dictionary.AbstractNGramDictionary
DICTIONARY_INFORMATION_BYTE_COUNT, maxOrder, rootNode
-
-
Constructor Summary
Constructors
protected TrainingNGramDictionary(int maxOrderP)
protected TrainingNGramDictionary(DynamicNGramTrieNode root, int maxOrderP)
-
Method Summary
boolean checkChildrenLoading(DynamicNGramTrieNode node)
Check that the children of a given node are loaded into memory (and can be used).
void close()
double[] computeD(TrainingConfiguration configuration)
Compute the optimal value for d (absolute discounting parameter). Usually d is computed with the formula D = C1 / (C1 + 2 * C2), where C1 = the number of ngrams with count == 1 and C2 = the number of ngrams with count == 2.
Map<Integer,Pair<Integer,Integer>> countNGrams()
static TrainingNGramDictionary create(int maxOrder)
Create an empty training ngram trie dictionary.
protected void executeWriteLevelOnRoot(FileChannel fileChannel, int level)
Call the correct node method to save a trie level to file.
DynamicNGramTrieNode getNodeForPrefix(int[] prefix, int index)
Use to retrieve a node for a given prefix. For example, prefix = [1,2] will return the trie node corresponding to {2}. The children of the given node may not have been loaded.
protected long getRootBlockSize()
protected void openDictionary(File dictionaryFile)
Open a dictionary from a file. To use the dictionary, the same WordDictionary used to save it should be used.
void pruneNGramsCount(int countThreshold, TrainingConfiguration configuration)
void pruneNGramsOrderCount(int[] counts, TrainingConfiguration configuration)
void pruneNGramsWeightedDifference(double thresholdPruning, TrainingConfiguration configuration, NGramPruningMethod pruningMethod)
Execute a pruning on the dictionary. Pruning is implemented with a "weighted difference" algorithm: the difference between a high order model and a lower order model is computed (e.g. 4-gram vs 3-gram, then 3-gram vs 2-gram), and if the difference is below a certain threshold, the high order entry is deleted. Difference pruning is executed from max order down to the bigram level, and probabilities are computed again after the pruning.
void putAndIncrementBy(int[] ngram, int increment)
Add a given ngram to the dictionary and increment its count. If the ngram is already in the dictionary, will just increment its count. This will call AbstractNGramDictionary.putAndIncrementBy(int[], int, int) with index = 0.
void putAndIncrementBy(int[] ngram, int index, int increment)
Add a given ngram to the dictionary and increment its count. If the ngram is already in the dictionary, will just increment its count.
void saveDictionary(File dictionaryFile)
Save this dictionary to a file. The dictionary is saved with word ids only, so the same word dictionary should be loaded if this dictionary is opened later.
void updateProbabilities(double[] d)
Update all the probabilities in this dictionary. Can take a while if there are a lot of nodes in the dictionary.
void updateProbabilities(int[] prefix, int prefixIndex, double[] d)
Update probabilities in this dictionary for a specific ngram prefix: this will update the probabilities of the prefix children and update the backoff weight of the parent node. This is much more optimized than AbstractNGramDictionary.updateProbabilities(double[]).
-
Methods inherited from class org.predict4all.nlp.ngram.dictionary.AbstractNGramDictionary
compact, getMaxOrder, getNextWord, getProbability, getRawProbability, getRoot, listNextWords, readDictionaryInformation, writeDictionaryInfo
-
-
-
-
Field Detail
-
NGRAM_COUNT_FORMAT
public static final DecimalFormat NGRAM_COUNT_FORMAT
-
-
Constructor Detail
-
TrainingNGramDictionary
protected TrainingNGramDictionary(int maxOrderP)
-
TrainingNGramDictionary
protected TrainingNGramDictionary(DynamicNGramTrieNode root, int maxOrderP)
-
-
Method Detail
-
getNodeForPrefix
public DynamicNGramTrieNode getNodeForPrefix(int[] prefix, int index)
Description copied from class: AbstractNGramDictionary
Use to retrieve a node for a given prefix.
For example, prefix = [1,2] will return the trie node corresponding to {2}.
The children of the given node may not have been loaded.
- Specified by:
getNodeForPrefix in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
prefix - the node prefix
index - index of the first word in the prefix (to take the full prefix, index should be 0)
- Returns:
the node found for the given prefix, or null if there is no existing node for such a prefix
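To illustrate the lookup semantics described above (the node is reached by walking the prefix words from position index, and a missing step yields null), here is a minimal self-contained sketch with a hypothetical ToyNode class, not the predict4all DynamicNGramTrieNode implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixLookupSketch {
    // Hypothetical toy trie node, standing in for DynamicNGramTrieNode.
    static class ToyNode {
        final Map<Integer, ToyNode> children = new HashMap<>();
    }

    // Insert an ngram path into the trie, creating nodes as needed.
    static void put(ToyNode root, int... ngram) {
        ToyNode current = root;
        for (int wordId : ngram) {
            current = current.children.computeIfAbsent(wordId, id -> new ToyNode());
        }
    }

    // Mirrors getNodeForPrefix(int[] prefix, int index): walk the trie from
    // prefix[index] to the end, returning null when a step is missing.
    static ToyNode getNodeForPrefix(ToyNode root, int[] prefix, int index) {
        ToyNode current = root;
        for (int i = index; i < prefix.length && current != null; i++) {
            current = current.children.get(prefix[i]);
        }
        return current;
    }

    public static void main(String[] args) {
        ToyNode root = new ToyNode();
        put(root, 1, 2);
        // Full prefix [1, 2] (index = 0): the node reached through 1 then 2 exists.
        System.out.println(getNodeForPrefix(root, new int[]{1, 2}, 0) != null); // true
        // index = 1 skips the first word, so only {2} is looked up under the root.
        System.out.println(getNodeForPrefix(root, new int[]{1, 2}, 1) != null); // false
        // No node exists for prefix [3]: null is returned.
        System.out.println(getNodeForPrefix(root, new int[]{3}, 0) == null); // true
    }
}
```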
-
checkChildrenLoading
public boolean checkChildrenLoading(DynamicNGramTrieNode node)
Description copied from class: AbstractNGramDictionary
Check that the children of a given node are loaded into memory (and can be used).
- Specified by:
checkChildrenLoading in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
node - the node to check children loading on
- Returns:
true if there are children for this node and these children are loaded
-
putAndIncrementBy
public void putAndIncrementBy(int[] ngram, int increment)
Description copied from class: AbstractNGramDictionary
Add a given ngram to the dictionary and increment its count.
If the ngram is already in the dictionary, will just increment its count.
This will call AbstractNGramDictionary.putAndIncrementBy(int[], int, int) with index = 0.
- Specified by:
putAndIncrementBy in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
ngram - the ngram to put in the dictionary
increment - the increment value
-
putAndIncrementBy
public void putAndIncrementBy(int[] ngram, int index, int increment)
Description copied from class: AbstractNGramDictionary
Add a given ngram to the dictionary and increment its count.
If the ngram is already in the dictionary, will just increment its count.
- Specified by:
putAndIncrementBy in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
ngram - the ngram to put in the dictionary
index - index where the ngram starts (the index from which the ngram becomes valid: for example, to skip the first ngram word, just set index = 1)
increment - the increment value
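The index parameter only affects which suffix of the ngram array is inserted. A minimal self-contained sketch of these semantics, with a hypothetical ToyCountNode rather than the library's node class:

```java
import java.util.HashMap;
import java.util.Map;

public class PutIncrementSketch {
    // Hypothetical counting trie node, standing in for DynamicNGramTrieNode.
    static class ToyCountNode {
        final Map<Integer, ToyCountNode> children = new HashMap<>();
        int count;
    }

    // Mirrors putAndIncrementBy(int[] ngram, int index, int increment):
    // only the words from `index` on are inserted, then the count is incremented.
    static void putAndIncrementBy(ToyCountNode root, int[] ngram, int index, int increment) {
        ToyCountNode current = root;
        for (int i = index; i < ngram.length; i++) {
            current = current.children.computeIfAbsent(ngram[i], id -> new ToyCountNode());
        }
        current.count += increment;
    }

    // Read back the count stored at the end of an ngram path (0 if absent).
    static int countOf(ToyCountNode root, int... ngram) {
        ToyCountNode current = root;
        for (int wordId : ngram) {
            current = current.children.get(wordId);
            if (current == null) return 0;
        }
        return current.count;
    }

    public static void main(String[] args) {
        ToyCountNode root = new ToyCountNode();
        int[] ngram = {7, 8, 9};
        putAndIncrementBy(root, ngram, 0, 1); // inserts [7, 8, 9] with count 1
        putAndIncrementBy(root, ngram, 0, 2); // already present: count becomes 3
        putAndIncrementBy(root, ngram, 1, 1); // index = 1 skips word 7: inserts [8, 9]
        System.out.println(countOf(root, 7, 8, 9)); // 3
        System.out.println(countOf(root, 8, 9));    // 1
    }
}
```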
-
saveDictionary
public void saveDictionary(File dictionaryFile) throws IOException
Description copied from class: AbstractNGramDictionary
Save this dictionary to a file.
The dictionary is saved with word ids only, which means that the same word dictionary should be loaded if this dictionary is opened later.
- Specified by:
saveDictionary in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
dictionaryFile - the file where the dictionary should be saved
- Throws:
IOException - if the dictionary can't be saved
-
executeWriteLevelOnRoot
protected void executeWriteLevelOnRoot(FileChannel fileChannel, int level) throws IOException
Call the correct node method to save a trie level to file.
- Parameters:
fileChannel - the file channel where the trie is saved
level - the level to save
- Throws:
IOException - if writing fails
-
getRootBlockSize
protected long getRootBlockSize()
- Returns:
the byte count needed to save the root block (useful to shift data in the file so that the root can be saved in the first position)
-
updateProbabilities
public void updateProbabilities(double[] d)
Description copied from class: AbstractNGramDictionary
Update all the probabilities in this dictionary.
Can take a while if there are a lot of nodes in the dictionary.
- Specified by:
updateProbabilities in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
d - the d parameter for the absolute discounting algorithm
-
updateProbabilities
public void updateProbabilities(int[] prefix, int prefixIndex, double[] d)
Description copied from class: AbstractNGramDictionary
Update probabilities in this dictionary for a specific ngram prefix: this will update the probabilities of the prefix children and update the backoff weight of the parent node.
This is much more optimized than AbstractNGramDictionary.updateProbabilities(double[]).
- Specified by:
updateProbabilities in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
prefix - prefix of the node that should be updated
prefixIndex - prefix start index (0 = full prefix, 1 = skip the first word in the prefix, etc.)
d - the d parameter for the absolute discounting algorithm
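The d parameter feeds the standard absolute discounting estimate: each child count is discounted by d, and the freed probability mass becomes the backoff weight toward the lower order distribution. The sketch below shows this textbook computation over one node's children; it is a generic illustration, not the library's exact code, and the counts are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DiscountSketch {
    // Discounted child probabilities: max(count - d, 0) / total, per child.
    static Map<Integer, Double> discount(Map<Integer, Integer> childCounts, double d) {
        int total = childCounts.values().stream().mapToInt(Integer::intValue).sum();
        Map<Integer, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : childCounts.entrySet()) {
            probs.put(e.getKey(), Math.max(e.getValue() - d, 0.0) / total);
        }
        return probs;
    }

    // Backoff weight: the mass removed by discounting, d per distinct child.
    static double backoffWeight(Map<Integer, Integer> childCounts, double d) {
        int total = childCounts.values().stream().mapToInt(Integer::intValue).sum();
        return d * childCounts.size() / total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(1, 3); // word id 1 seen 3 times after this prefix
        counts.put(2, 1); // word id 2 seen once
        double d = 0.5;
        // Discounted probabilities plus the backoff weight sum to 1.
        System.out.println(discount(counts, d));      // {1=0.625, 2=0.125}
        System.out.println(backoffWeight(counts, d)); // 0.25
    }
}
```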
-
computeD
public double[] computeD(TrainingConfiguration configuration)
Description copied from class: AbstractNGramDictionary
Compute the optimal value for d (absolute discounting parameter).
Usually d is computed with the formula:
D = C1 / (C1 + 2 * C2)
where C1 = the number of ngrams with count == 1 and C2 = the number of ngrams with count == 2. These values are computed for each order (index 0 = unigram, index 1 = bigram, etc.).
- Specified by:
computeD in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
configuration - configuration used to compute D (can set min/max values and a fixed D value)
- Returns:
the computed d values for this dictionary
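The formula is direct to apply per order. A minimal sketch with made-up C1/C2 counts (not values from the library):

```java
public class ComputeDSketch {
    // c1[i] = number of ngrams of order i+1 with count == 1,
    // c2[i] = number of ngrams of order i+1 with count == 2.
    static double[] computeD(int[] c1, int[] c2) {
        double[] d = new double[c1.length];
        for (int order = 0; order < c1.length; order++) {
            // D = C1 / (C1 + 2 * C2), computed per order.
            d[order] = (double) c1[order] / (c1[order] + 2.0 * c2[order]);
        }
        return d;
    }

    public static void main(String[] args) {
        // index 0 = unigram, index 1 = bigram, as in the documentation above.
        int[] c1 = {500, 4000};
        int[] c2 = {250, 1000};
        double[] d = computeD(c1, c2);
        System.out.println(d[0]); // 0.5   = 500 / (500 + 2 * 250)
        System.out.println(d[1]); // ~0.667 = 4000 / (4000 + 2 * 1000)
    }
}
```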
-
pruneNGramsWeightedDifference
public void pruneNGramsWeightedDifference(double thresholdPruning, TrainingConfiguration configuration, NGramPruningMethod pruningMethod)
Execute a pruning on the dictionary.
Pruning is implemented with a "weighted difference" algorithm: the difference between a high order model and a lower order model is computed (e.g. between the 4-gram and 3-gram models, then the 3-gram and 2-gram models), and if the difference is below a certain threshold, the high order entry is deleted.
Difference pruning is executed from max order down to the bigram level, and probabilities are computed again after the pruning.
- Parameters:
thresholdPruning - pruning threshold (every ngram with a probability difference below this threshold is deleted)
configuration - training configuration (the computeD(TrainingConfiguration) configuration)
pruningMethod - pruning method to use
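The decision rule can be sketched as follows. Note the exact weighting is an assumption for illustration (here, count times the absolute probability difference); the library's NGramPruningMethod may weight the difference differently:

```java
public class WeightedDiffPruneSketch {
    // A high order ngram is pruned when its weighted difference from the
    // lower order estimate falls below the threshold. The weighting
    // (count * |pHigh - pLow|) is a hypothetical choice for illustration.
    static boolean shouldPrune(double pHighOrder, double pLowerOrder, int count, double threshold) {
        double weightedDifference = count * Math.abs(pHighOrder - pLowerOrder);
        return weightedDifference < threshold;
    }

    public static void main(String[] args) {
        // A 4-gram whose probability barely differs from the 3-gram estimate: pruned,
        // since the lower order model already predicts it well.
        System.out.println(shouldPrune(0.101, 0.100, 2, 0.01)); // true
        // A frequent ngram with a large difference carries real information: kept.
        System.out.println(shouldPrune(0.30, 0.10, 50, 0.01)); // false
    }
}
```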
-
pruneNGramsCount
public void pruneNGramsCount(int countThreshold, TrainingConfiguration configuration)
-
pruneNGramsOrderCount
public void pruneNGramsOrderCount(int[] counts, TrainingConfiguration configuration)
-
openDictionary
protected void openDictionary(File dictionaryFile) throws IOException
Description copied from class: AbstractNGramDictionary
Open a dictionary from a file.
To use the dictionary, the same WordDictionary used to save it should be used.
- Specified by:
openDictionary in class AbstractNGramDictionary<DynamicNGramTrieNode>
- Parameters:
dictionaryFile - the file containing a dictionary
- Throws:
IOException - if the dictionary can't be opened
-
create
public static TrainingNGramDictionary create(int maxOrder)
Create an empty training ngram trie dictionary.
- Parameters:
maxOrder - the max possible order for the dictionary
- Returns:
a new empty dictionary
-
-