Class TrainingNGramDictionary

    • Field Detail

      • NGRAM_COUNT_FORMAT

        public static final DecimalFormat NGRAM_COUNT_FORMAT
    • Constructor Detail

      • TrainingNGramDictionary

        protected TrainingNGramDictionary​(int maxOrderP)
      • TrainingNGramDictionary

        protected TrainingNGramDictionary​(DynamicNGramTrieNode root,
                                          int maxOrderP)
    • Method Detail

      • getNodeForPrefix

        public DynamicNGramTrieNode getNodeForPrefix​(int[] prefix,
                                                     int index)
        Description copied from class: AbstractNGramDictionary
        Retrieves the node for a given prefix.
        For example, prefix = [1,2] returns the trie node corresponding to {2}.
        The children of the returned node may not have been loaded yet.
        Specified by:
        getNodeForPrefix in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        prefix - the node prefix
        index - index of the first word of the prefix to use (0 takes the full prefix)
        Returns:
        the node found for the given prefix, or null if no node exists for that prefix
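
        The lookup described above can be illustrated with a minimal sketch. It assumes a toy trie whose nodes hold their children in a map; the Node class and the static getNodeForPrefix helper below are hypothetical stand-ins, not the library's DynamicNGramTrieNode or its actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy trie used only for illustration (not the library's node type).
final class PrefixLookupSketch {
    static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
    }

    // Walks prefix[index..] starting at the root.
    // Returns null as soon as one word of the prefix has no matching child.
    static Node getNodeForPrefix(Node root, int[] prefix, int index) {
        Node current = root;
        for (int i = index; i < prefix.length && current != null; i++) {
            current = current.children.get(prefix[i]);
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = new Node();
        Node one = new Node();
        Node two = new Node();
        root.children.put(1, one);
        one.children.put(2, two);

        // Full prefix [1,2]: reaches the node for word 2 below word 1.
        System.out.println(getNodeForPrefix(root, new int[] {1, 2}, 0) == two); // true
        // index = 1 skips word 1, and the root has no child for word 2.
        System.out.println(getNodeForPrefix(root, new int[] {1, 2}, 1) == null); // true
    }
}
```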
      • putAndIncrementBy

        public void putAndIncrementBy​(int[] ngram,
                                      int index,
                                      int increment)
        Description copied from class: AbstractNGramDictionary
        Adds the given ngram to the dictionary and increments its count.
        If the ngram is already in the dictionary, only its count is incremented.
        Specified by:
        putAndIncrementBy in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        ngram - the ngram to put in dictionary
        index - start index of the ngram (the position from which the ngram is considered valid: for example, to skip the first word of the ngram, set index = 1)
        increment - the increment value
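
        A minimal sketch of the put-and-increment behavior, assuming a toy map-based trie. The Node class, the countOf helper, and the choice to increment only the final node are assumptions made for illustration; the real method may update the trie differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy trie used only for illustration (not the library's node type).
final class PutAndIncrementSketch {
    static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
        int count;
    }

    // Creates any missing node along ngram[index..], then adds increment to the last node.
    static void putAndIncrementBy(Node root, int[] ngram, int index, int increment) {
        if (index < 0 || index >= ngram.length) {
            throw new IllegalArgumentException("index must point inside the ngram");
        }
        Node current = root;
        for (int i = index; i < ngram.length; i++) {
            current = current.children.computeIfAbsent(ngram[i], word -> new Node());
        }
        current.count += increment;
    }

    // Returns the stored count for the given ngram, or 0 when it is absent.
    static int countOf(Node root, int... ngram) {
        Node current = root;
        for (int word : ngram) {
            current = current.children.get(word);
            if (current == null) {
                return 0;
            }
        }
        return current.count;
    }

    public static void main(String[] args) {
        Node root = new Node();
        putAndIncrementBy(root, new int[] {5, 6, 7}, 0, 1); // new ngram, count = 1
        putAndIncrementBy(root, new int[] {5, 6, 7}, 0, 2); // existing ngram, count = 3
        putAndIncrementBy(root, new int[] {5, 6, 7}, 1, 1); // index = 1 stores [6,7]
        System.out.println(countOf(root, 5, 6, 7)); // 3
        System.out.println(countOf(root, 6, 7));    // 1
    }
}
```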
      • executeWriteLevelOnRoot

        protected void executeWriteLevelOnRoot​(FileChannel fileChannel,
                                               int level)
                                        throws IOException
        Calls the appropriate node method to save a trie level to file.
        Parameters:
        fileChannel - the file channel where the trie is saved
        level - the level to save
        Throws:
        IOException - if writing fails
      • getRootBlockSize

        protected long getRootBlockSize()
        Returns:
        the byte count needed to save the root block (used to shift data in the file so that the root is saved in first position)
      • updateProbabilities

        public void updateProbabilities​(int[] prefix,
                                        int prefixIndex,
                                        double[] d)
        Description copied from class: AbstractNGramDictionary
        Updates probabilities in this dictionary for a specific ngram prefix: this updates the probabilities of the prefix children and the backoff weight of the parent node.
        This is much more efficient than AbstractNGramDictionary.updateProbabilities(double[]).
        Specified by:
        updateProbabilities in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        prefix - prefix of the node that should be updated
        prefixIndex - prefix start index (0 = full prefix, 1 = skip the first word in prefix, etc.)
        d - the d parameter of the absolute discounting algorithm
      • computeD

        public double[] computeD​(TrainingConfiguration configuration)
        Description copied from class: AbstractNGramDictionary
        Computes the optimal value for d (the absolute discounting parameter).
        Usually d is computed with the formula:
        D = C1 / (C1 + 2 * C2)
        where C1 = number of ngrams with count == 1, and C2 = number of ngrams with count == 2. These values are computed for each order (index 0 = unigram, index 1 = bigram, etc.).
        Specified by:
        computeD in class AbstractNGramDictionary<DynamicNGramTrieNode>
        Parameters:
        configuration - configuration to use to compute D (can set min/max values and a D value)
        Returns:
        computed d value for this dictionary
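
        As a worked illustration of the formula above, the sketch below computes D per order from hypothetical C1 and C2 counts. The static computeD helper, its min/max clamping parameters, and the sample counts are assumptions for this example and do not reflect TrainingConfiguration's actual fields.

```java
import java.util.Arrays;

// Illustrative only: applies D = C1 / (C1 + 2 * C2) for each order.
final class AbsoluteDiscountSketch {

    // c1[i] and c2[i] are the numbers of ngrams seen once and twice at order index i.
    // min and max are hypothetical bounds used to clamp the computed value.
    static double[] computeD(long[] c1, long[] c2, double min, double max) {
        if (c1.length != c2.length) {
            throw new IllegalArgumentException("c1 and c2 must cover the same orders");
        }
        double[] d = new double[c1.length];
        for (int order = 0; order < c1.length; order++) {
            long denominator = c1[order] + 2 * c2[order];
            if (denominator == 0) {
                throw new IllegalArgumentException("no ngram with count 1 or 2 at order index " + order);
            }
            double raw = (double) c1[order] / denominator;
            d[order] = Math.max(min, Math.min(max, raw));
        }
        return d;
    }

    public static void main(String[] args) {
        // Unigrams: C1 = 100, C2 = 50, so D = 100 / 200 = 0.5
        // Bigrams:  C1 = 400, C2 = 100, so D = 400 / 600, about 0.667
        double[] d = computeD(new long[] {100, 400}, new long[] {50, 100}, 0.0, 1.0);
        System.out.println(Arrays.toString(d));
    }
}
```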
      • pruneNGramsWeightedDifference

        public void pruneNGramsWeightedDifference​(double thresholdPruning,
                                                  TrainingConfiguration configuration,
                                                  NGramPruningMethod pruningMethod)
        Executes a pruning on the dictionary.
        Pruning is implemented with a "weighted difference" algorithm: the difference is computed between a high-order model and a lower-order model (e.g. 4-gram vs. 3-gram, then 3-gram vs. 2-gram), and if the difference is below a given threshold, the high-order ngram is deleted.
        Difference pruning is executed from the max order down to the bigram level; probabilities are computed again after the pruning.
        Parameters:
        thresholdPruning - pruning threshold (every ngram whose probability difference is below this threshold is deleted)
        configuration - training configuration (the configuration used by computeD(TrainingConfiguration))
        pruningMethod - pruning method to use
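
        The pruning decision can be sketched as follows. This assumes one common formulation of weighted difference pruning, count * (log p_high - log p_backoff); the helper names and sample values are hypothetical, and the library may compute the difference differently depending on the NGramPruningMethod in use.

```java
// Illustrative only: one common form of the weighted difference criterion.
final class WeightedDifferenceSketch {

    // count:    how often the high-order ngram was seen
    // pHigh:    probability given by the high-order model
    // pBackoff: probability the lower-order model would give if the ngram were removed
    static double weightedDifference(int count, double pHigh, double pBackoff) {
        return count * (Math.log(pHigh) - Math.log(pBackoff));
    }

    // The high-order ngram is dropped when the lower-order estimate is close enough.
    static boolean shouldPrune(int count, double pHigh, double pBackoff, double threshold) {
        return weightedDifference(count, pHigh, pBackoff) < threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.1;
        // Rare ngram, nearly identical estimates: small difference, pruned.
        System.out.println(shouldPrune(1, 0.20, 0.19, threshold));  // true
        // Frequent ngram, very different estimates: large difference, kept.
        System.out.println(shouldPrune(10, 0.20, 0.05, threshold)); // false
    }
}
```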
      • pruneNGramsCount

        public void pruneNGramsCount​(int countThreshold,
                                     TrainingConfiguration configuration)
      • pruneNGramsOrderCount

        public void pruneNGramsOrderCount​(int[] counts,
                                          TrainingConfiguration configuration)
      • create

        public static TrainingNGramDictionary create​(int maxOrder)
        Creates an empty training ngram trie dictionary.
        Parameters:
        maxOrder - the max possible order for the dictionary
        Returns:
        a new empty dictionary