Class CharSequenceSuffixTree

  • All Implemented Interfaces:
    SuffixTree

    public class CharSequenceSuffixTree
    extends AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
    A suffix tree for a sequence of characters.

    This class builds a suffix tree from a sequence of characters in O(N) time following Ukkonen's algorithm. The first version of this implementation was written by closely following Mark Nelson's explanation and C++ implementation presented in Fast String Searching With Suffix Trees.

    Notes:

    • This implementation uses exclusive end positions rather than inclusive end positions. Exclusive end positions are more intuitive, make calculations easier, and interact nicely with the Java API.
    • The original implementation needlessly tied the edge splitting logic to a particular suffix. While this doesn't affect functionality, it doesn't logically isolate the process of splitting an edge at a particular location, which is completely independent of any suffix. All that needs to be known is the point along the edge at which the split should occur. This implementation also splits an edge by creating two new edges rather than merely modifying one.
    • The original article used a "suffix" class with a node and character indexes. This implementation reverts to the more general "state" terminology used by Ukkonen. Furthermore, the end character index has been changed to exclusive, allowing a state "length" property to be more natural. It also makes the "explicit" state more readily apparent: it is the state in which start==end. Finally, these modifications reduce the state canonization logic to simply "consume edges until the next edge is not small enough to consume or the state is explicit".
    • The original implementation kept a record of the current last character being added. With every iteration the suffix/state had its endpoint incremented, making a separate last-character variable redundant.
    • The algorithm here checks to see when the state start has gone past the end. This signals that the current iteration is finished, and there is no need to loop around and check for an edge emanating from the current active node: the state was explicit in the previous iteration, so an edge must already have been created.
    • The traditional algorithm has been modified slightly to construct an explicit suffix tree on the last round without the need of a unique "dummy character" appended to the string. This may result in some edges that are empty, as well as an extra, empty edge emanating from the root node representing the empty string suffix.
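
    The notes on exclusive end positions and state canonization can be illustrated with a minimal sketch. The StateSketch, Edge, Node, and State types below are hypothetical stand-ins written for this illustration, not the classes of this library, and the sketch assumes every edge is non-empty. It shows how an exclusive end makes length equal to end - start, makes the explicit state the one where start == end, and reduces canonization to "consume edges until the next edge is not small enough to consume or the state is explicit".

```java
import java.util.HashMap;
import java.util.Map;

public class StateSketch {

    /** Hypothetical edge covering text[start, end) and leading to a child node. */
    static final class Edge {
        final int start;
        final int end; // exclusive
        final Node child;

        Edge(int start, int end, Node child) {
            this.start = start;
            this.end = end;
            this.child = child;
        }

        int length() {
            return end - start;
        }
    }

    /** Hypothetical node whose outgoing edges are keyed by their first character. */
    static final class Node {
        final Map<Character, Edge> edges = new HashMap<>();

        // Assumes a non-empty edge, so that text.charAt(edge.start) is valid.
        void addEdge(CharSequence text, Edge edge) {
            edges.put(text.charAt(edge.start), edge);
        }
    }

    /** Hypothetical state: a node plus the pending characters text[start, end). */
    static final class State {
        Node node;
        int start;
        int end; // exclusive

        State(Node node, int start, int end) {
            this.node = node;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }

        boolean isExplicit() {
            return start == end;
        }

        /** Consumes whole edges until the state is explicit or the next edge is too long. */
        void canonize(CharSequence text) {
            while (!isExplicit()) {
                Edge edge = node.edges.get(text.charAt(start));
                if (edge == null || edge.length() > length()) {
                    break; // the next edge is not small enough to consume
                }
                node = edge.child;
                start += edge.length();
            }
        }
    }
}
```

    For the text "abcab" with a root edge covering [0, 2) that leads to an inner node, canonizing the state (root, 0, 3) moves it to (inner, 2, 3), while canonizing (root, 0, 2) leaves an explicit state at the inner node.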
    Author:
    Garret Wilson
    See Also:
    Mark Nelson: Fast String Searching With Suffix Trees, Mark Nelson: Liberal Code Use Policy
    • Method Detail

      • getCharSequence

        public java.lang.CharSequence getCharSequence()
        Returns:
        The character sequence represented by the suffix tree.
      • getEdges

        public java.util.Collection<? extends SuffixTree.Edge> getEdges()
        Returns:
        A read-only iterable of edges in the tree.
      • startsWith

        public boolean startsWith​(java.lang.CharSequence charSequence,
                                  int start,
                                  int end)
        Compares part of a character sequence with the characters starting at the root node and continuing along child edges.
        Parameters:
        charSequence - The character sequence to compare.
        start - The start of the character sequence to compare, inclusive.
        end - The end of the character sequence to compare, exclusive.
        Returns:
        true if there is a path matching the given character sequence starting at the root node and continuing along child edges.
        Throws:
        java.lang.StringIndexOutOfBoundsException - if start or end are negative or greater than length(), or start is greater than end.
        See Also:
        CharSequenceSuffixTree.CharSequenceNode.startsWith(CharSequence, int, int), CharSequenceSuffixTree.CharSequenceEdge.startsWith(CharSequence, int, int)
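
        The documented contract can be approximated by a naive reference check, useful for reasoning about expected results or for writing tests. The sketch below does not use the suffix tree at all: it relies on the general property that a path from the root of a suffix tree spells out a substring of the underlying text. The class and method names are hypothetical, and it assumes that length() in the Throws clause refers to the length of the charSequence argument.

```java
public class StartsWithSketch {

    /**
     * Naive stand-in for the tree lookup: returns true if candidate[start, end)
     * occurs somewhere in text. A real suffix tree answers this by walking
     * edges from the root instead of scanning the text.
     */
    static boolean startsWith(CharSequence text, CharSequence candidate, int start, int end) {
        // Assumed bounds rule: 0 <= start <= end <= candidate.length().
        if (start < 0 || end > candidate.length() || start > end) {
            throw new StringIndexOutOfBoundsException(
                    "start " + start + ", end " + end + ", length " + candidate.length());
        }
        String needle = candidate.subSequence(start, end).toString();
        return text.toString().contains(needle);
    }
}
```

        With the text "banana", the range [2, 5) of "xxnanyy" selects "nan" and matches, while the same range of "xxnabyy" selects "nab" and does not. An empty range always matches.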
      • create

        public static CharSequenceSuffixTree create​(java.lang.CharSequence charSequence)
        Suffix tree builder factory method which creates a new, explicit suffix tree for a given character sequence. The created suffix tree will have one more leaf node than the number of characters in the sequence, because there will exist an empty edge from the root indicating the empty string.
        Parameters:
        charSequence - The character sequence for which a suffix tree should be built.
        Returns:
        The new suffix tree for the given character sequence.
        Throws:
        java.lang.NullPointerException - if the given character sequence is null.
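
        The leaf count described above follows from counting suffixes. The sketch below does not build a tree; it only enumerates the suffixes that the leaves of an explicit tree would represent, using half-open ranges [start, length). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixCountSketch {

    /** Lists every suffix of text, including the empty suffix at start == length. */
    static List<String> suffixes(CharSequence text) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start <= text.length(); start++) {
            result.add(text.subSequence(start, text.length()).toString());
        }
        return result;
    }
}
```

        For a sequence of N characters the loop runs N + 1 times, and the final entry is the empty string, which corresponds to the empty edge from the root.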
      • create

        protected static CharSequenceSuffixTree create​(java.lang.CharSequence charSequence,
                                                       boolean explicit)
        Suffix tree builder factory method which creates a new suffix tree for a given character sequence. If an explicit suffix tree is requested, the created suffix tree will have one more leaf node than the number of characters in the sequence, because there will exist an empty edge from the root indicating the empty string.
        Parameters:
        charSequence - The character sequence for which a suffix tree should be built.
        explicit - Whether an explicit suffix tree should be constructed.
        Returns:
        The new suffix tree for the given character sequence.
        Throws:
        java.lang.NullPointerException - if the given character sequence is null.