Class TrainingNGramDictionary

  • All Implemented Interfaces:
    java.lang.AutoCloseable
    Direct Known Subclasses:
    DynamicNGramDictionary

    public class TrainingNGramDictionary
    extends AbstractNGramDictionary<DynamicNGramTrieNode>
    Represents a training dictionary: an ngram dictionary used while training an ngram model.
    This dictionary supports dynamic insertion and probability computation. It always uses DynamicNGramTrieNode.

    The default training dictionary is not meant to be opened: it saves the trie structure to a file that is later loaded as a StaticNGramTrieDictionary. However, DynamicNGramDictionary implements a dynamic dictionary that can be saved and opened with dynamic nodes.

    • Field Detail

      • NGRAM_COUNT_FORMAT

        public static final java.text.DecimalFormat NGRAM_COUNT_FORMAT
    • Constructor Detail

      • TrainingNGramDictionary

        protected TrainingNGramDictionary​(int maxOrderP)
      • TrainingNGramDictionary

        protected TrainingNGramDictionary​(DynamicNGramTrieNode root,
                                          int maxOrderP)
    • Method Detail

      • getNodeForPrefix

        public DynamicNGramTrieNode getNodeForPrefix​(int[] prefix,
                                                     int index)
        Description copied from class: AbstractNGramDictionary
        Used to retrieve a node for a given prefix.
        For example, prefix = [1,2] will return the trie node corresponding to {2}.
        The children of the returned node may not have been loaded.
        Specified by:
        getNodeForPrefix in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        prefix - the node prefix
        index - index of the first word in the prefix (to use the full prefix, index should be 0)
        Returns:
        the node found for the given prefix, or null if there is no existing node for such prefix
      • putAndIncrementBy

        public void putAndIncrementBy​(int[] ngram,
                                      int index,
                                      int increment)
        Description copied from class: AbstractNGramDictionary
        Add a given ngram to the dictionary and increment its count.
        If the ngram is already in the dictionary, only its count is incremented.
        Specified by:
        putAndIncrementBy in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        ngram - the ngram to put in dictionary
        index - index of the ngram start (the index from which the ngram becomes valid: for example, to skip the first ngram word, set index = 1)
        increment - the increment value
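
The insertion and lookup behavior described above can be illustrated with a minimal sketch. This is not the library's actual implementation, only an assumed model of how a dynamic count trie can support putAndIncrementBy and getNodeForPrefix, including the index parameter used to skip leading words:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a dynamic count trie node (assumed, for illustration only).
class SketchTrieNode {
    final Map<Integer, SketchTrieNode> children = new HashMap<>();
    int count;

    // Walks the ngram from the given start index, creating missing nodes,
    // then increments the count on the node reached for the last word.
    void putAndIncrementBy(int[] ngram, int index, int increment) {
        if (index >= ngram.length) {
            this.count += increment;
            return;
        }
        children.computeIfAbsent(ngram[index], k -> new SketchTrieNode())
                .putAndIncrementBy(ngram, index + 1, increment);
    }

    // Follows the prefix from the given start index and returns the node
    // reached, or null if no node exists for this prefix.
    SketchTrieNode getNodeForPrefix(int[] prefix, int index) {
        if (index >= prefix.length) {
            return this;
        }
        SketchTrieNode child = children.get(prefix[index]);
        return child == null ? null : child.getNodeForPrefix(prefix, index + 1);
    }
}
```

For instance, inserting the ngram [1,2] twice and then calling getNodeForPrefix([1,2], 0) reaches the node for {2} under {1}, while an index of 1 would skip the first word and look up {2} directly under the root.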
      • saveDictionary

        public void saveDictionary​(java.io.File dictionaryFile)
                            throws java.io.IOException
        Description copied from class: AbstractNGramDictionary
        Save this dictionary to a file.
        The dictionary is saved with word ids only, which means the same word dictionary should be loaded if this dictionary is opened later.
        Specified by:
        saveDictionary in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        dictionaryFile - the file where dictionary should be saved.
        Throws:
        java.io.IOException - if dictionary can't be saved
      • executeWriteLevelOnRoot

        protected void executeWriteLevelOnRoot​(java.nio.channels.FileChannel fileChannel,
                                               int level)
                                        throws java.io.IOException
        Call the correct node method to save a trie level to file.
        Parameters:
        fileChannel - the file channel where trie is saved
        level - the level to save
        Throws:
        java.io.IOException - if writing fails
      • getRootBlockSize

        protected long getRootBlockSize()
        Returns:
        the byte count needed to save the root block (useful to shift data in the file so that the root is saved in the first position)
      • updateProbabilities

        public void updateProbabilities​(int[] prefix,
                                        int prefixIndex,
                                        double[] d)
        Description copied from class: AbstractNGramDictionary
        Update probabilities in this dictionary for a specific ngram prefix: this updates the probabilities of the prefix's children and the backoff weight of the parent node.
        This is much more efficient than AbstractNGramDictionary.updateProbabilities(double[])
        Specified by:
        updateProbabilities in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        prefix - prefix of the node that should be updated
        prefixIndex - prefix start index (0 = full prefix, 1 = skip the first word in prefix, etc.)
        d - the d parameter for the absolute discounting algorithm
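
The absolute discounting update can be sketched as follows. This is an assumed model of the computation, not the library's exact code: each child's probability is its relative frequency minus the discount d, and the removed mass becomes the parent node's backoff weight over the lower order model:

```java
// Illustrative sketch of absolute discounting for one prefix node
// (assumed semantics, not the library's actual implementation).
class AbsoluteDiscountSketch {

    // Discounted probability of each child given the raw child counts.
    static double[] childProbabilities(int[] childCounts, double d) {
        int total = 0;
        for (int c : childCounts) {
            total += c;
        }
        double[] probs = new double[childCounts.length];
        for (int i = 0; i < childCounts.length; i++) {
            probs[i] = Math.max(childCounts[i] - d, 0.0) / total;
        }
        return probs;
    }

    // Mass removed by the discount, redistributed as the backoff weight.
    static double backoffWeight(int[] childCounts, double d) {
        int total = 0;
        for (int c : childCounts) {
            total += c;
        }
        return d * childCounts.length / total;
    }
}
```

With counts {3, 1} and d = 0.5, the child probabilities are 2.5/4 and 0.5/4, and the backoff weight 0.5 * 2/4 = 0.25, so the distribution still sums to 1 once the backed-off mass is accounted for.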
      • computeD

        public double[] computeD​(TrainingConfiguration configuration)
        Description copied from class: AbstractNGramDictionary
        Compute the optimal value for d (absolute discounting parameter).
        Usually d is computed with the formula:
        D = C1 / (C1 + 2 * C2)
        where C1 = the number of ngrams with count == 1, and C2 = the number of ngrams with count == 2. These values are computed for each order (index 0 = unigram, index 1 = bigram, etc.)
        Specified by:
        computeD in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        configuration - configuration to use to compute D (can set min/max values and a D value)
        Returns:
        computed d value for this dictionary
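
The formula above can be sketched directly. The clamping to a [min, max] range is an assumption suggested by the TrainingConfiguration description, not confirmed library behavior:

```java
// Sketch of the documented formula D = C1 / (C1 + 2 * C2) for one order,
// clamped to a configured [min, max] range (the clamping is an assumption).
class ComputeDSketch {
    static double computeD(int c1, int c2, double min, double max) {
        double d = (double) c1 / (c1 + 2.0 * c2);
        return Math.min(max, Math.max(min, d));
    }
}
```

For example, with C1 = 10 and C2 = 5, D = 10 / (10 + 10) = 0.5; with very few singleton ngrams the raw value can fall below the configured minimum and would be clamped up to it.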
      • pruneNGramsWeightedDifference

        public void pruneNGramsWeightedDifference​(double thresholdPruning,
                                                  TrainingConfiguration configuration,
                                                  NGramPruningMethod pruningMethod)
        Execute a pruning on the dictionary.
        Pruning is implemented with a "weighted difference" algorithm: the difference is computed between a high order model and a lower order model (e.g. between 4-gram and 3-gram, then between 3-gram and 2-gram), and if the difference is below a certain threshold, the high order ngram is deleted.
        Difference pruning is executed from the max order down to the bigram level; probabilities are computed again after pruning.
        Parameters:
        thresholdPruning - pruning threshold (every ngram with a probability difference below this threshold is deleted)
        configuration - training configuration (used by computeD(TrainingConfiguration))
        pruningMethod - pruning method to use
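
The pruning decision described above can be sketched as follows. The exact criterion is an assumption inspired by the classic Seymore-Rosenfeld weighted-difference method, so the names and weighting here are illustrative, not the library's confirmed formula:

```java
// Sketch of a weighted difference pruning decision (assumed semantics):
// a high order ngram is pruned when its probability gain over the lower
// order estimate, weighted by how often the ngram was observed, falls
// below the threshold.
class WeightedDifferencePruningSketch {
    static boolean shouldPrune(int ngramCount, double highOrderProb,
                               double lowerOrderProb, double threshold) {
        double weightedDifference =
                ngramCount * (Math.log(highOrderProb) - Math.log(lowerOrderProb));
        return weightedDifference < threshold;
    }
}
```

Intuitively, a rarely seen 4-gram whose probability barely improves on the backed-off 3-gram estimate contributes little and is deleted, while a frequent ngram with a large gain is kept.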
      • pruneNGramsCount

        public void pruneNGramsCount​(int countThreshold,
                                     TrainingConfiguration configuration)
      • pruneNGramsOrderCount

        public void pruneNGramsOrderCount​(int[] counts,
                                          TrainingConfiguration configuration)
      • close

        public void close()
                   throws java.lang.Exception
        Throws:
        java.lang.Exception
      • openDictionary

        protected void openDictionary​(java.io.File dictionaryFile)
                               throws java.io.IOException
        Description copied from class: AbstractNGramDictionary
        Open a dictionary from a file.
        To use the dictionary, it should be loaded with the same WordDictionary that was used to save it.
        Specified by:
        openDictionary in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        dictionaryFile - the file containing a dictionary.
        Throws:
        java.io.IOException - if dictionary can't be opened
      • create

        public static TrainingNGramDictionary create​(int maxOrder)
        Create an empty training ngram trie dictionary
        Parameters:
        maxOrder - the max possible order for the dictionary
        Returns:
        a new empty dictionary
      • countNGrams

        public java.util.Map<java.lang.Integer,​Pair<java.lang.Integer,​java.lang.Integer>> countNGrams()