Class CharSequenceSuffixTree

All Implemented Interfaces:
SuffixTree

public class CharSequenceSuffixTree extends AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
A suffix tree for a sequence of characters.

This class builds a suffix tree from a sequence of characters in O(N) time following Ukkonen's algorithm. The first version of this implementation closely followed Mark Nelson's explanation and C++ code presented in Fast String Searching With Suffix Trees.

Notes:

  • This implementation uses exclusive end positions rather than inclusive end positions. Exclusive end positions are more intuitive, make calculations easier, and interact nicely with the Java API.
  • The original implementation needlessly tied the edge splitting logic to a particular suffix. While this doesn't affect functionality, it fails to isolate the process of splitting an edge at a particular location, which is completely independent of any suffix. The split only needs to know at what point along the edge it should occur. This implementation also splits an edge by creating two new edges rather than merely modifying one.
  • The original article used a "suffix" class with a node and character indexes. This implementation reverts to the more general "state" terminology used by Ukkonen. Furthermore, the end character index has been changed to exclusive, which makes a state "length" property more natural. It also makes the "explicit" state more readily apparent: it is the state in which start==end. Finally, these modifications reduce the state canonization logic to simply "consume edges until the next edge is not small enough to consume or the state is explicit".
  • The original implementation kept a record of the current last character being added. With every iteration the suffix/state had its endpoint incremented, making a separate last-character variable redundant.
  • The algorithm here checks to see when the state start has gone past the end; this signals that the current iteration is finished. There is then no need to loop around and check for an edge emanating from the current active node, because the state was explicit in the previous iteration, so an edge must already have been created.
  • The traditional algorithm has been modified slightly to construct an explicit suffix tree on the last round without the need of a unique "dummy character" appended to the string. This may result in some edges that are empty, as well as an extra, empty edge emanating from the root node representing the empty string suffix.
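
The notes on exclusive end positions, explicit states, and edge splitting can be illustrated with a minimal sketch. The class and method names below (ExclusiveEndSketch, State, Edge, extended, textIn, splitAt) are hypothetical and chosen for illustration only; they are not this class's actual API, and the node bookkeeping of a real suffix tree is omitted.

```java
/** Illustrative sketch only; every name here is hypothetical, not the library's API. */
final class ExclusiveEndSketch {

    /** A half-open character range [start, end) over the indexed sequence. */
    static final class State {
        final int start; // inclusive
        final int end;   // exclusive

        State(int start, int end) {
            if (start < 0 || end < start) {
                throw new IllegalArgumentException("invalid range");
            }
            this.start = start;
            this.end = end;
        }

        /** With an exclusive end, the length needs no "+ 1" correction. */
        int length() {
            return end - start;
        }

        /** The state is explicit exactly when it covers no characters. */
        boolean isExplicit() {
            return start == end;
        }

        /** Returns a state whose end has been advanced by one character. */
        State extended() {
            return new State(start, end + 1);
        }

        /** The same indexes can be passed straight to CharSequence.subSequence(). */
        String textIn(CharSequence sequence) {
            return sequence.subSequence(start, end).toString();
        }
    }

    /** An edge labeled by the half-open range [start, end); node links omitted. */
    static final class Edge {
        final int start;
        final int end;

        Edge(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }

        /**
         * Splits this edge at the given offset from its start, creating two
         * new edges and leaving this one unmodified. No suffix is involved;
         * only the split point along the edge is needed.
         */
        Edge[] splitAt(int offset) {
            if (offset <= 0 || offset >= length()) {
                throw new IllegalArgumentException("offset must fall strictly inside the edge");
            }
            int middle = start + offset;
            return new Edge[] { new Edge(start, middle), new Edge(middle, end) };
        }
    }
}
```

For the sequence "banana", for example, new ExclusiveEndSketch.State(1, 4) has a length of 3 and covers the characters "ana", while a state with start and end both equal to 2 is explicit.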
Author:
Garret Wilson
See Also: