Class AbstractNGramDictionary<T extends AbstractNGramTrieNode<T>>

  • Type Parameters:
    T - type of trie node stored in this dictionary.
    All Implemented Interfaces:
    AutoCloseable
    Direct Known Subclasses:
    StaticNGramTrieDictionary, TrainingNGramDictionary

    public abstract class AbstractNGramDictionary<T extends AbstractNGramTrieNode<T>>
    extends Object
    implements AutoCloseable
    Represents an ngram dictionary in an abstract way: the dictionary can be static or dynamic.
    Each type of dictionary may or may not support a given operation, such as saving the dictionary or updating probabilities.

    The dictionary has a maxOrder that represents the highest ngram order that can be found in the dictionary. The order of an ngram corresponds to its rank: 1 = unigram, 2 = bigram, etc. The order is not bounded by a maximum value, but in practice it is never more than 5.

    Dictionaries are represented as tries, and different kinds of trie are available. Each type of dictionary is associated with a different type of AbstractNGramTrieNode (e.g. a dynamic dictionary is associated with a dynamic trie node).
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static int DICTIONARY_INFORMATION_BYTE_COUNT
      Byte count needed to save general information about this dictionary.
      protected int maxOrder
      Maximum order that can be stored in this dictionary.
      Can be read when opening the dictionary, or set by the user as a limit.
      protected T rootNode
      Root node of this dictionary (its children contain the whole vocabulary)
    • Constructor Summary

      Constructors 
      Constructor Description
      AbstractNGramDictionary​(T rootNode, int maxOrderP)
      Construct a dictionary with a given root node and a max possible order.
    • Method Summary

      All Methods Instance Methods Abstract Methods Concrete Methods 
      Modifier and Type Method Description
      abstract boolean checkChildrenLoading​(T node)
      Checks that the children of a given node are loaded into memory (and can be used)
      void compact()
      Compact the nodes in this dictionary (this will call AbstractNGramTrieNode.compact() on root)
      abstract double[] computeD​(TrainingConfiguration configuration)
      Computes the optimal value for d (absolute discounting parameter).
      Usually d is computed with the formula:
      D = C1 / (C1 + 2 * C2)
      where C1 = number of ngrams with count == 1, and C2 = number of ngrams with count == 2.
      int getMaxOrder()  
      TIntHashSet getNextWord​(int[] prefix)
      Return the immediate next words for a given prefix (without any filter)
      abstract T getNodeForPrefix​(int[] prefix, int index)
      Retrieves the node for a given prefix.
      For example, prefix = [1,2] returns the trie node corresponding to {2}.
      The children of the returned node may not have been loaded.
      double getProbability​(int[] prefix, int index, int length, int wordId)
      Return the probability of a word for a given prefix.
      With index = 0 and length = prefix.length, returns the maximum order probability (e.g. with prefix.length = 3, returns the probability for order 3)
      double getRawProbability​(int[] prefix, int index, int length, int wordId)  
      T getRoot()  
      void listNextWords​(int[] prefix, WordDictionary wordDictionary, PredictionParameter predictionParameter, Set<Integer> wordsToExclude, Map<BiIntegerKey,​NextWord> resultSet, int wantedCount, boolean unigramLevel)
      Goes through each ngram order of the dictionary to find the possible next words for a given prefix.
      Starts with the highest order for the given prefix (e.g. prefix length == 3 means order 4) and, if wantedCount is not reached, moves to lower orders to find more possible next words.
      protected abstract void openDictionary​(File dictionaryFile)
      Open a dictionary from a file.
      To use the dictionary, the same WordDictionary used to save it should be used.
      abstract void putAndIncrementBy​(int[] ngram, int increment)
      Adds a given ngram to the dictionary and increments its count.
      If the ngram is already in the dictionary, only its count is incremented.
      This will call putAndIncrementBy(int[], int, int) with index = 0
      abstract void putAndIncrementBy​(int[] ngram, int index, int increment)
      Adds a given ngram to the dictionary and increments its count.
      If the ngram is already in the dictionary, only its count is incremented.
      protected void readDictionaryInformation​(ByteBuffer byteBuffer)
      Read the general information for this dictionary from a given buffer (doesn't do any check)
      abstract void saveDictionary​(File dictionaryFile)
      Save this dictionary to a file.
      The dictionary is saved using word ids only, which means that the same word dictionary should be loaded if this dictionary is opened later.
      abstract void updateProbabilities​(double[] d)
      Updates all the probabilities in this dictionary.
      Can take a while if there are a lot of nodes in the dictionary.
      abstract void updateProbabilities​(int[] prefix, int prefixIndex, double[] d)
      Updates probabilities in this dictionary for a specific ngram prefix: this updates the probabilities of the prefix children and the backoff weight of the parent node.
      This is much more efficient than updateProbabilities(double[])
      protected void writeDictionaryInfo​(ByteBuffer buffWrite)
      Write the general information for this dictionary to a given buffer
    • Field Detail

      • DICTIONARY_INFORMATION_BYTE_COUNT

        protected static final int DICTIONARY_INFORMATION_BYTE_COUNT
        Byte count needed to save general information about this dictionary. (e.g. max order)
        See Also:
        Constant Field Values
      • maxOrder

        protected int maxOrder
        Maximum order that can be stored in this dictionary.
        Can be read when opening the dictionary, or set by the user as a limit.
      • rootNode

        protected final T extends AbstractNGramTrieNode<T> rootNode
        Root node of this dictionary (its children contain the whole vocabulary)
    • Constructor Detail

      • AbstractNGramDictionary

        public AbstractNGramDictionary​(T rootNode,
                                       int maxOrderP)
        Construct a dictionary with a given root node and a max possible order.
        Parameters:
        rootNode - the root node to use for this dictionary
        maxOrderP - max possible order for this dictionary.
    • Method Detail

      • getRoot

        public T getRoot()
        Returns:
        the root for this dictionary
      • getMaxOrder

        public int getMaxOrder()
        Returns:
        the max possible order for this dictionary
      • getNodeForPrefix

        public abstract T getNodeForPrefix​(int[] prefix,
                                           int index)
        Retrieves the node for a given prefix.
        For example, prefix = [1,2] returns the trie node corresponding to {2}.
        The children of the returned node may not have been loaded.
        Parameters:
        prefix - the node prefix
        index - index of the first word to use in prefix (to take the full prefix, index should be 0)
        Returns:
        the node found for the given prefix, or null if there is no existing node for such prefix
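        The lookup described above can be illustrated with a minimal sketch. It assumes a simplified in-memory trie: the PrefixLookupSketch class, its Node type and the nodeForPrefix helper are hypothetical names used only for illustration, and they ignore the lazy loading of children that real implementations may perform.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not the library's trie node or dictionary.
final class PrefixLookupSketch {

    /** Simplified trie node: a word id plus children indexed by word id. */
    static final class Node {
        final int wordId;
        final Map<Integer, Node> children = new HashMap<>();

        Node(int wordId) {
            this.wordId = wordId;
        }

        /** Returns the child for the given id, creating it if needed. */
        Node child(int id) {
            return children.computeIfAbsent(id, k -> new Node(k));
        }
    }

    /**
     * Walks the trie from the root, following prefix[index], prefix[index + 1], ...
     * Returns the last node reached, or null if any step is missing.
     */
    static Node nodeForPrefix(Node root, int[] prefix, int index) {
        Node current = root;
        for (int i = index; i < prefix.length && current != null; i++) {
            current = current.children.get(prefix[i]);
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = new Node(-1);
        root.child(1).child(2);
        // prefix = [1,2] leads to the node for word 2
        System.out.println(nodeForPrefix(root, new int[] {1, 2}, 0).wordId);
    }
}
```

        In this sketch, a non-zero index simply skips the first words of the prefix before the walk starts.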
      • checkChildrenLoading

        public abstract boolean checkChildrenLoading​(T node)
        Checks that the children of a given node are loaded into memory (and can be used)
        Parameters:
        node - the node to check children loading on
        Returns:
        true if this node has children and these children are loaded.
      • putAndIncrementBy

        public abstract void putAndIncrementBy​(int[] ngram,
                                               int index,
                                               int increment)
        Adds a given ngram to the dictionary and increments its count.
        If the ngram is already in the dictionary, only its count is incremented.
        Parameters:
        ngram - the ngram to put in dictionary
        index - index of the ngram start (the index at which the ngram becomes valid: for example, to skip the first ngram word, set index = 1)
        increment - the increment value
      • putAndIncrementBy

        public abstract void putAndIncrementBy​(int[] ngram,
                                               int increment)
        Adds a given ngram to the dictionary and increments its count.
        If the ngram is already in the dictionary, only its count is incremented.
        This will call putAndIncrementBy(int[], int, int) with index = 0
        Parameters:
        ngram - the ngram to put in dictionary
        increment - the increment value
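        As a rough illustration of the two overloads above, the following sketch stores counts in a simplified in-memory trie. NGramCountSketch, its Node type and countOf are hypothetical names, and the sketch assumes that only the node ending the ngram has its count incremented; the actual subclasses may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not one of the library's dictionary classes.
final class NGramCountSketch {

    /** Simplified trie node holding a count and children indexed by word id. */
    static final class Node {
        int count;
        final Map<Integer, Node> children = new HashMap<>();
    }

    final Node root = new Node();

    /** Creates the path ngram[index..] if needed, then adds increment to its last node. */
    void putAndIncrementBy(int[] ngram, int index, int increment) {
        if (index >= ngram.length) {
            return; // nothing left to insert
        }
        Node current = root;
        for (int i = index; i < ngram.length; i++) {
            current = current.children.computeIfAbsent(ngram[i], id -> new Node());
        }
        current.count += increment;
    }

    /** Same as above, starting from the first word of the ngram. */
    void putAndIncrementBy(int[] ngram, int increment) {
        putAndIncrementBy(ngram, 0, increment);
    }

    /** Returns the stored count for an ngram, or 0 if it is absent. */
    int countOf(int... ngram) {
        Node current = root;
        for (int id : ngram) {
            current = current.children.get(id);
            if (current == null) {
                return 0;
            }
        }
        return current.count;
    }

    public static void main(String[] args) {
        NGramCountSketch dictionary = new NGramCountSketch();
        dictionary.putAndIncrementBy(new int[] {1, 2, 3}, 1);
        dictionary.putAndIncrementBy(new int[] {1, 2, 3}, 1);
        System.out.println(dictionary.countOf(1, 2, 3));
    }
}
```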
      • saveDictionary

        public abstract void saveDictionary​(File dictionaryFile)
                                     throws IOException
        Save this dictionary to a file.
        The dictionary is saved using word ids only, which means that the same word dictionary should be loaded if this dictionary is opened later.
        Parameters:
        dictionaryFile - the file where dictionary should be saved.
        Throws:
        IOException - if dictionary can't be saved
      • openDictionary

        protected abstract void openDictionary​(File dictionaryFile)
                                        throws IOException
        Open a dictionary from a file.
        To use the dictionary, the same WordDictionary used to save it should be used.
        Parameters:
        dictionaryFile - the file containing a dictionary.
        Throws:
        IOException - if dictionary can't be opened
      • updateProbabilities

        public abstract void updateProbabilities​(double[] d)
        Updates all the probabilities in this dictionary.
        Can take a while if there are a lot of nodes in the dictionary.
        Parameters:
        d - the d parameter for absolute discounting algorithm.
      • updateProbabilities

        public abstract void updateProbabilities​(int[] prefix,
                                                 int prefixIndex,
                                                 double[] d)
        Updates probabilities in this dictionary for a specific ngram prefix: this updates the probabilities of the prefix children and the backoff weight of the parent node.
        This is much more efficient than updateProbabilities(double[])
        Parameters:
        prefix - prefix of the node that should be updated
        prefixIndex - prefix start index (0 = full prefix, 1 = skip the first word in prefix, etc...)
        d - the d parameter for absolute discounting algorithm.
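        To make the role of d more concrete, here is a minimal sketch of textbook absolute discounting for the children of a single prefix node. It is not taken from the library: AbsoluteDiscountSketch and its methods are hypothetical, and the exact formula used by the concrete dictionaries is an assumption here. Each child count is reduced by d, and the probability mass freed this way becomes the backoff weight of the parent.

```java
// Hypothetical illustration only: a textbook form of absolute discounting.
final class AbsoluteDiscountSketch {

    /** Returns max(count - d, 0) / total for each child count. */
    static double[] discountedProbabilities(int[] childCounts, double d) {
        long total = 0;
        for (int count : childCounts) {
            total += count;
        }
        double[] probabilities = new double[childCounts.length];
        if (total == 0) {
            return probabilities; // no observation: every probability stays 0
        }
        for (int i = 0; i < childCounts.length; i++) {
            probabilities[i] = Math.max(childCounts[i] - d, 0.0) / total;
        }
        return probabilities;
    }

    /** Mass left for lower orders: 1 minus the sum of discounted probabilities. */
    static double backoffWeight(int[] childCounts, double d) {
        double sum = 0.0;
        for (double p : discountedProbabilities(childCounts, d)) {
            sum += p;
        }
        return 1.0 - sum;
    }

    public static void main(String[] args) {
        int[] counts = {3, 1};
        double[] p = discountedProbabilities(counts, 0.5);
        System.out.println(p[0] + " " + p[1] + " backoff=" + backoffWeight(counts, 0.5));
    }
}
```

        Because this computation only needs the counts of one node's children, updating a single prefix touches far fewer nodes than recomputing the whole trie.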
      • computeD

        public abstract double[] computeD​(TrainingConfiguration configuration)
        Computes the optimal value for d (absolute discounting parameter).
        Usually d is computed with the formula:
        D = C1 / (C1 + 2 * C2)
        where C1 = number of ngrams with count == 1, and C2 = number of ngrams with count == 2. These values are computed for each order (index 0 = unigram, index 1 = bigram, etc.)
        Parameters:
        configuration - configuration to use to compute D (can set min/max values and a D value)
        Returns:
        computed d value for this dictionary
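        The formula above can be sketched as follows. DiscountSketch and its methods are hypothetical helpers, the input is assumed to be a plain array of ngram counts per order, and the fallback value used when no ngram has a count of 1 or 2 is also an assumption (the real method takes its bounds from TrainingConfiguration).

```java
// Hypothetical illustration only: not the library's computeD implementation.
final class DiscountSketch {

    /** D = C1 / (C1 + 2 * C2), or fallback when the denominator is zero. */
    static double discount(long c1, long c2, double fallback) {
        long denominator = c1 + 2 * c2;
        return denominator == 0 ? fallback : (double) c1 / denominator;
    }

    /**
     * countsPerOrder[0] holds the counts of all unigrams, countsPerOrder[1]
     * the counts of all bigrams, and so on. Returns one D value per order.
     */
    static double[] discounts(int[][] countsPerOrder, double fallback) {
        double[] d = new double[countsPerOrder.length];
        for (int order = 0; order < countsPerOrder.length; order++) {
            long c1 = 0;
            long c2 = 0;
            for (int count : countsPerOrder[order]) {
                if (count == 1) {
                    c1++;
                } else if (count == 2) {
                    c2++;
                }
            }
            d[order] = discount(c1, c2, fallback);
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] counts = { {1, 1, 2, 5}, {1, 2, 2, 3} };
        double[] d = discounts(counts, 0.5);
        System.out.println("unigram D = " + d[0] + ", bigram D = " + d[1]);
    }
}
```

        With the sample counts, the unigram level has C1 = 2 and C2 = 1, so D = 2 / 4 = 0.5; the bigram level has C1 = 1 and C2 = 2, so D = 1 / 5 = 0.2.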
      • listNextWords

        public void listNextWords​(int[] prefix,
                                  WordDictionary wordDictionary,
                                  PredictionParameter predictionParameter,
                                  Set<Integer> wordsToExclude,
                                  Map<BiIntegerKey,​NextWord> resultSet,
                                  int wantedCount,
                                  boolean unigramLevel)
        Goes through each ngram order of the dictionary to find the possible next words for a given prefix.
        Starts with the highest order for the given prefix (e.g. prefix length == 3 means order 4) and, if wantedCount is not reached, moves to lower orders to find more possible next words.
        Parameters:
        prefix - the prefix to detect word after (words ids, represent a ngram prefix)
        wordDictionary - word dictionary (useful only if prefixDetected is not null)
        predictionParameter - prediction parameter (can be used to validate words)
        wordsToExclude - a list of words that shouldn't be included in the result set
        resultSet - result set that will contain every next word found
        wantedCount - wanted next word count (a higher count will take more time)
        unigramLevel - if true, this will go down to the unigram level (whole vocabulary) if there are not enough results; this can be time consuming as the unigram level contains the whole word dictionary
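        The order-by-order search described above might look like the following sketch. It is a simplification under stated assumptions: BackoffListingSketch is a hypothetical class, the dictionary is replaced by a plain map from prefix to candidate next words, and the WordDictionary and PredictionParameter filtering is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: a plain map stands in for the ngram trie.
final class BackoffListingSketch {

    /**
     * Tries the full prefix first, then shorter and shorter suffixes of it,
     * until wantedCount words are collected. The empty prefix (unigram level)
     * is only consulted when unigramLevel is true.
     */
    static List<Integer> listNextWords(Map<List<Integer>, List<Integer>> nextByPrefix,
                                       int[] prefix,
                                       Set<Integer> wordsToExclude,
                                       int wantedCount,
                                       boolean unigramLevel) {
        Set<Integer> result = new LinkedHashSet<>();
        int lastStart = unigramLevel ? prefix.length : prefix.length - 1;
        for (int start = 0; start <= lastStart && result.size() < wantedCount; start++) {
            // Build the key for the words prefix[start..]
            List<Integer> key = new ArrayList<>();
            for (int i = start; i < prefix.length; i++) {
                key.add(prefix[i]);
            }
            for (int word : nextByPrefix.getOrDefault(key, Collections.emptyList())) {
                if (result.size() >= wantedCount) {
                    break;
                }
                if (!wordsToExclude.contains(word)) {
                    result.add(word); // words already found at a higher order are not duplicated
                }
            }
        }
        return new ArrayList<>(result);
    }
}
```

        Words found at a higher order keep their position in the result, so lower orders only fill the remaining slots.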
      • getNextWord

        public TIntHashSet getNextWord​(int[] prefix)
                                throws IOException
        Return the immediate next words for a given prefix (without any filter)
        Parameters:
        prefix - the prefix (previous N words)
        Returns:
        a set containing the next words for the given prefix, or null if there is no existing ngram in the dictionary for this prefix
        Throws:
        IOException - if children can't be read
      • getProbability

        public double getProbability​(int[] prefix,
                                     int index,
                                     int length,
                                     int wordId)
        Return the probability of a word for a given prefix.
        Given index = 0 and length = prefix.length will return the maximum order probability (e.g. prefix.length = 3, will return probability for order 3)
        Parameters:
        prefix - the words before the given word (prefix)
        index - the index in the given prefix (will change the result order)
        length - the given prefix length (will change the result order).
        wordId - the word we want the probability for
        Returns:
        the probability for the given word (0.0 - 1.0)
      • getRawProbability

        public double getRawProbability​(int[] prefix,
                                        int index,
                                        int length,
                                        int wordId)
      • readDictionaryInformation

        protected void readDictionaryInformation​(ByteBuffer byteBuffer)
        Read the general information for this dictionary from a given buffer (doesn't do any check)
        Parameters:
        byteBuffer - the byte buffer from which dictionary information is read
      • writeDictionaryInfo

        protected void writeDictionaryInfo​(ByteBuffer buffWrite)
        Write the general information for this dictionary to a given buffer
        Parameters:
        buffWrite - the byte buffer to which information is written
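        As an illustration of how readDictionaryInformation and writeDictionaryInfo could pair up, the sketch below writes and reads a single int through a ByteBuffer. It assumes that the general information consists only of the max order; the actual layout and the value of DICTIONARY_INFORMATION_BYTE_COUNT are not given here, and DictionaryInfoSketch with its members is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: the real header layout is not documented here.
final class DictionaryInfoSketch {

    /** Assumption: the header only stores the max order as one int (4 bytes). */
    static final int INFORMATION_BYTE_COUNT = Integer.BYTES;

    /** Writes the general information at the buffer's current position. */
    static void writeInfo(ByteBuffer buffer, int maxOrder) {
        buffer.putInt(maxOrder);
    }

    /** Reads the general information back, without any validation. */
    static int readInfo(ByteBuffer buffer) {
        return buffer.getInt();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(INFORMATION_BYTE_COUNT);
        writeInfo(buffer, 5);
        buffer.flip(); // switch the buffer from writing to reading
        System.out.println("maxOrder = " + readInfo(buffer));
    }
}
```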