Package com.globalmentor.collections
Class CharSequenceSuffixTree
- java.lang.Object
  - com.globalmentor.collections.AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
    - com.globalmentor.collections.CharSequenceSuffixTree
- All Implemented Interfaces:
- SuffixTree
public class CharSequenceSuffixTree extends AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
A suffix tree for a sequence of characters. This class builds a suffix tree from a sequence of characters in O(N) time following Ukkonen's algorithm. The first version of this implementation closely followed Mark Nelson's explanation and C++ algorithm presented in Fast String Searching With Suffix Trees.
Notes:
- This implementation uses exclusive rather than inclusive end positions. Exclusive end positions are more intuitive, make calculations easier, and interact nicely with the Java API.
- The original implementation needlessly tied the edge splitting logic to a particular suffix. While this doesn't affect functionality, it doesn't logically isolate the process of splitting an edge at a particular location, which is completely independent from a suffix. It merely needs to be known at what point along the edge the split should occur. This implementation also splits an edge by creating two new edges rather than merely modifying one.
- The original article used a "suffix" class with a node and character indexes. This implementation reverts to the more general "state" terminology used by Ukkonen. Furthermore, the end character index has been changed to exclusive, allowing a state "length" property to be more natural. It also makes the "explicit" state more readily apparent: it is the state in which start==end. Finally, these modifications reduce the state canonization logic to simply "consume edges until the next edge is not small enough to consume or the state is explicit".
- The original implementation kept a record of the current last character being added. With every iteration the suffix/state had its endpoint incremented, making a separate last-character variable redundant.
- The algorithm here checks whether the state start has gone past the end. This signals that the current iteration is finished, and there is no need to loop around and check for an edge emanating from the current active node, because the state was explicit in the previous iteration and so an edge must already have been created.
- The traditional algorithm has been modified slightly to construct an explicit suffix tree on the last round without the need of a unique "dummy character" appended to the string. This may result in some edges that are empty, as well as an extra, empty edge emanating from the root node representing the empty string suffix.
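The state bookkeeping described in the notes above can be sketched as follows. This is a minimal illustration, not the library's code: `StateSketch`, its nested `Edge` class, and every field and method name are hypothetical, and the sketch assumes that each non-explicit state has a matching, non-empty edge leaving its node.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a "state" with an exclusive end index; not the library's API. */
final class StateSketch {

    /** An edge labeled by text[start, end), leading from a parent node to a child node. */
    static final class Edge {
        final int parent;
        final int child;
        final int start; // inclusive
        final int end;   // exclusive

        Edge(int parent, int child, int start, int end) {
            this.parent = parent;
            this.child = child;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    final CharSequence text;
    /** Edges indexed by parent node, then by the first character of the edge label. */
    final Map<Integer, Map<Character, Edge>> edges = new HashMap<>();
    int node;  // active node
    int start; // inclusive
    int end;   // exclusive

    StateSketch(CharSequence text) {
        this.text = text;
    }

    void addEdge(Edge edge) {
        edges.computeIfAbsent(edge.parent, k -> new HashMap<>())
                .put(text.charAt(edge.start), edge);
    }

    /** With an exclusive end, the length needs no "+ 1" correction. */
    int length() {
        return end - start;
    }

    /** The state is explicit when it sits exactly on a node. */
    boolean isExplicit() {
        return start == end;
    }

    /** Consumes edges until the next edge is too long to consume or the state is explicit. */
    void canonize() {
        while (!isExplicit()) {
            Edge edge = edges.get(node).get(text.charAt(start)); // assumed to exist
            if (edge.length() > length()) {
                break; // the next edge is not small enough to consume
            }
            start += edge.length();
            node = edge.child;
        }
    }
}
```

For the text "banana" with an edge [0, 2) from node 0 to node 1 and an edge [2, 6) from node 1 to node 2, canonizing the state (node 0, start 0, end 3) consumes the first edge and stops at (node 1, start 2, end 3), because the next edge has length 4 while only one character remains.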
- Author:
- Garret Wilson
- See Also:
- Mark Nelson: Fast String Searching With Suffix Trees, Mark Nelson: Liberal Code Use Policy
-
-
Nested Class Summary
Nested Classes
- protected static class CharSequenceSuffixTree.AbstractEdgeKey: An abstract base class that implements hashing and equality for an edge key.
- protected class CharSequenceSuffixTree.CharSequenceEdge: Represents an edge between a parent node and a child node in a suffix tree.
- protected class CharSequenceSuffixTree.CharSequenceNode: Represents a node in a suffix tree.
- protected static interface CharSequenceSuffixTree.EdgeKey: A key identifying an edge of a node, uniquely identified by its parent node and first character (as no node in a suffix tree has more than one edge starting with the same character).
-
Nested classes/interfaces inherited from class com.globalmentor.collections.AbstractSuffixTree
AbstractSuffixTree.AbstractNode
-
Nested classes/interfaces inherited from interface com.globalmentor.collections.SuffixTree
SuffixTree.Edge, SuffixTree.Node
-
-
Method Summary
Methods
- protected void addEdge(CharSequenceSuffixTree.CharSequenceEdge edge): Adds an edge to the tree.
- static CharSequenceSuffixTree create(java.lang.CharSequence charSequence): Suffix tree builder factory method which creates a new, explicit suffix tree for a given character sequence.
- protected static CharSequenceSuffixTree create(java.lang.CharSequence charSequence, boolean explicit): Suffix tree builder factory method which creates a new suffix tree for a given character sequence.
- protected CharSequenceSuffixTree.CharSequenceEdge createEdge(SuffixTree.Node parentNode, SuffixTree.Node childNode, int start, int end): Creates a new edge.
- protected AbstractSuffixTree.AbstractNode createNode(int index): Creates a new node.
- java.lang.CharSequence getCharSequence()
- java.util.Collection<? extends SuffixTree.Edge> getEdges()
- CharSequenceSuffixTree.CharSequenceNode getNode(int nodeIndex): Retrieves the identified node.
- CharSequenceSuffixTree.CharSequenceNode getRootNode(): Retrieves the root node of the tree.
- protected void removeEdge(CharSequenceSuffixTree.CharSequenceEdge edge): Removes an edge from the tree.
- protected CharSequenceSuffixTree.CharSequenceNode splitEdge(CharSequenceSuffixTree.CharSequenceEdge edge, int length): Splits an edge into two.
- boolean startsWith(java.lang.CharSequence charSequence): Compares a character sequence with the characters starting at the root node and continuing along child edges.
- boolean startsWith(java.lang.CharSequence charSequence, int start, int end): Compares part of a character sequence with the characters starting at the root node and continuing along child edges.
-
Methods inherited from class com.globalmentor.collections.AbstractSuffixTree
addEdge, addNode, getNodeCount, getNodes, isExplicit
-
-
-
-
Method Detail
-
getCharSequence
public java.lang.CharSequence getCharSequence()
- Returns:
- The character sequence represented by the suffix tree.
-
getRootNode
public CharSequenceSuffixTree.CharSequenceNode getRootNode()
Description copied from interface: SuffixTree
Retrieves the root node of the tree. This is a convenience method to retrieve the node with index zero.
- Specified by:
- getRootNode in interface SuffixTree
- Overrides:
- getRootNode in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Returns:
- The identified node.
-
getNode
public CharSequenceSuffixTree.CharSequenceNode getNode(int nodeIndex)
Description copied from interface: SuffixTree
Retrieves the identified node.
- Specified by:
- getNode in interface SuffixTree
- Overrides:
- getNode in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Parameters:
- nodeIndex - The index of the node to retrieve.
- Returns:
- The identified node.
-
createNode
protected AbstractSuffixTree.AbstractNode createNode(int index)
Description copied from class: AbstractSuffixTree
Creates a new node. By default a node is considered a leaf node.
- Specified by:
- createNode in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Parameters:
- index - The index of the node to create.
- Returns:
- The newly created node.
-
getEdges
public java.util.Collection<? extends SuffixTree.Edge> getEdges()
- Returns:
- A read-only iterable of edges in the tree.
-
createEdge
protected CharSequenceSuffixTree.CharSequenceEdge createEdge(SuffixTree.Node parentNode, SuffixTree.Node childNode, int start, int end)
Creates a new edge.
- Specified by:
- createEdge in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Parameters:
- parentNode - The parent node representing the root end of the edge.
- childNode - The child node representing the leaf end of the edge.
- start - The position of the start element, inclusive.
- end - The position of the end element, exclusive.
- Returns:
- The newly created edge.
- Throws:
- java.lang.ClassCastException - if the given parent node and/or child node is not an instance of CharSequenceSuffixTree.CharSequenceNode.
-
addEdge
protected void addEdge(CharSequenceSuffixTree.CharSequenceEdge edge)
Adds an edge to the tree.
- Specified by:
- addEdge in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Parameters:
- edge - The edge to add.
- Throws:
- java.lang.NullPointerException - if the given edge is null.
- java.lang.IllegalStateException - if there already exists an edge with the same parent node and first character.
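The uniqueness rule behind the IllegalStateException can be illustrated with a small, self-contained sketch. It is not the library's implementation: `EdgeKeySketch`, `Key`, and the use of plain label strings in place of edge objects are assumptions made for illustration, and labels are assumed to be non-empty.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch: edges keyed by parent node and first character. */
final class EdgeKeySketch {

    /** A key combining the parent node index with the first character of the edge label. */
    static final class Key {
        final int parentNode;
        final char firstChar;

        Key(int parentNode, char firstChar) {
            this.parentNode = parentNode;
            this.firstChar = firstChar;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key that = (Key) other;
            return parentNode == that.parentNode && firstChar == that.firstChar;
        }

        @Override
        public int hashCode() {
            return 31 * parentNode + firstChar;
        }
    }

    private final Map<Key, String> edges = new HashMap<>();

    /** Adds an edge label under a parent node; rejects a second edge with the same first character. */
    void addEdge(int parentNode, String label) {
        Objects.requireNonNull(label, "label");
        Key key = new Key(parentNode, label.charAt(0));
        if (edges.putIfAbsent(key, label) != null) {
            throw new IllegalStateException("Edge already exists for this parent node and first character.");
        }
    }

    /** Removes an edge label; does nothing if no such edge exists. */
    void removeEdge(int parentNode, String label) {
        Objects.requireNonNull(label, "label");
        edges.remove(new Key(parentNode, label.charAt(0)));
    }

    int size() {
        return edges.size();
    }
}
```

In this sketch, adding "ban" under node 0 and again under node 1 succeeds, while adding "bx" under node 0 afterwards throws, because node 0 already has an edge starting with 'b'.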
-
removeEdge
protected void removeEdge(CharSequenceSuffixTree.CharSequenceEdge edge)
Removes an edge from the tree. If the edge does not exist, no action occurs.
- Specified by:
- removeEdge in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Parameters:
- edge - The edge to remove.
- Throws:
- java.lang.NullPointerException - if the given edge is null.
-
splitEdge
protected CharSequenceSuffixTree.CharSequenceNode splitEdge(CharSequenceSuffixTree.CharSequenceEdge edge, int length)
Description copied from class: AbstractSuffixTree
Splits an edge into two. The first, near edge will be of the given length; the second, far edge will be of the remaining length (that is, the length of the original edge minus the given length). A new node will be created as the mid-point between the original edge nodes, becoming the child node of the first edge and the parent node of the second edge.
- Overrides:
- splitEdge in class AbstractSuffixTree<CharSequenceSuffixTree.CharSequenceEdge>
- Parameters:
- edge - The edge to split.
- length - The position at which to split the edge.
- Returns:
- The created node splitting the edge.
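The index arithmetic of such a split can be sketched in isolation. This is a hedged illustration only: `EdgeSplitSketch`, its `Edge` class, and the `newNode` parameter are hypothetical names, and the sketch assumes the split length lies strictly inside the edge, which the library itself may not require.

```java
/** Hypothetical sketch of splitting an edge [start, end) into a near and a far edge. */
final class EdgeSplitSketch {

    /** An edge labeled by text[start, end), leading from a parent node to a child node. */
    static final class Edge {
        final int parent;
        final int child;
        final int start; // inclusive
        final int end;   // exclusive

        Edge(int parent, int child, int start, int end) {
            this.parent = parent;
            this.child = child;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    /**
     * Returns two new edges: the near edge of the given length, ending at the new node,
     * and the far edge of the remaining length, starting at the new node.
     */
    static Edge[] split(Edge edge, int length, int newNode) {
        if (length <= 0 || length >= edge.length()) {
            // Assumption of this sketch: the split point is strictly inside the edge.
            throw new IllegalArgumentException("Split length must be inside the edge.");
        }
        int mid = edge.start + length; // exclusive end of near edge, inclusive start of far edge
        Edge near = new Edge(edge.parent, newNode, edge.start, mid);
        Edge far = new Edge(newNode, edge.child, mid, edge.end);
        return new Edge[] {near, far};
    }
}
```

Because the end positions are exclusive, the two resulting lengths sum to the original length with no off-by-one correction, and the split needs nothing but the edge and a length.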
-
startsWith
public boolean startsWith(java.lang.CharSequence charSequence)
Compares a character sequence with the characters starting at the root node and continuing along child edges.
- Parameters:
- charSequence - The character sequence to compare.
- Returns:
- true if there is a path matching the given character sequence starting at the root node and continuing along child edges.
- See Also:
- CharSequenceSuffixTree.CharSequenceNode.startsWith(CharSequence), CharSequenceSuffixTree.CharSequenceEdge.startsWith(CharSequence)
-
startsWith
public boolean startsWith(java.lang.CharSequence charSequence, int start, int end)
Compares part of a character sequence with the characters starting at the root node and continuing along child edges.
- Parameters:
- charSequence - The character sequence to compare.
- start - The start of the character sequence to compare, inclusive.
- end - The end of the character sequence to compare, exclusive.
- Returns:
- true if there is a path matching the given character sequence starting at the root node and continuing along child edges.
- Throws:
- java.lang.StringIndexOutOfBoundsException - if start or end are negative or greater than length(), or start is greater than end.
- See Also:
- CharSequenceSuffixTree.CharSequenceNode.startsWith(CharSequence, int, int), CharSequenceSuffixTree.CharSequenceEdge.startsWith(CharSequence, int, int)
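What a root-anchored path match means can be shown with a deliberately naive stand-in. The sketch below is not Ukkonen's algorithm and not this class: `NaiveSuffixTrie` is a hypothetical, uncompressed suffix trie built in O(N²) time, used only to illustrate that a path from the root matches exactly when the compared characters occur somewhere in the original sequence.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, uncompressed suffix trie; a naive O(N^2) stand-in for illustration. */
final class NaiveSuffixTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    /** Inserts every suffix of the text, one character per trie edge. */
    NaiveSuffixTrie(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.computeIfAbsent(text.charAt(j), c -> new Node());
            }
        }
    }

    boolean startsWith(CharSequence charSequence) {
        return startsWith(charSequence, 0, charSequence.length());
    }

    /** Walks from the root, matching charSequence[start, end) one character at a time. */
    boolean startsWith(CharSequence charSequence, int start, int end) {
        if (start < 0 || end > charSequence.length() || start > end) {
            throw new StringIndexOutOfBoundsException("Invalid range: " + start + ", " + end);
        }
        Node node = root;
        for (int i = start; i < end; i++) {
            node = node.children.get(charSequence.charAt(i));
            if (node == null) {
                return false; // no path continues with this character
            }
        }
        return true;
    }
}
```

For the text "banana", this sketch matches "nan" and the range [2, 4) of "xxanxx", but not "nab". A real suffix tree answers the same question while storing edges as index ranges rather than single characters.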
-
create
public static CharSequenceSuffixTree create(java.lang.CharSequence charSequence)
Suffix tree builder factory method which creates a new, explicit suffix tree for a given character sequence. The created suffix tree will have one more leaf node than the number of characters in the sequence, because there will exist an empty edge from the root indicating the empty string.
- Parameters:
- charSequence - The character sequence for which a suffix tree should be built.
- Returns:
- The new suffix tree for the given character sequence.
- Throws:
- java.lang.NullPointerException - if the given character sequence is null.
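The "one more leaf than characters" count follows from counting suffixes, which a short sketch can make concrete. `SuffixCountSketch` is a hypothetical helper, not part of the library; it only enumerates the suffixes that an explicit tree would end in leaves, including the empty one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists every suffix of a sequence, including the empty suffix. */
final class SuffixCountSketch {

    /** Returns the N + 1 suffixes of a sequence of N characters, longest first. */
    static List<String> suffixes(CharSequence text) {
        List<String> result = new ArrayList<>();
        // start == text.length() yields the empty suffix, hence "<=" rather than "<".
        for (int start = 0; start <= text.length(); start++) {
            result.add(text.subSequence(start, text.length()).toString());
        }
        return result;
    }
}
```

For "abc" the suffixes are "abc", "bc", "c", and the empty string: four suffixes for three characters, matching the extra empty edge from the root described above.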
-
create
protected static CharSequenceSuffixTree create(java.lang.CharSequence charSequence, boolean explicit)
Suffix tree builder factory method which creates a new suffix tree for a given character sequence. If an explicit suffix tree is requested, the created suffix tree will have one more leaf node than the number of characters in the sequence, because there will exist an empty edge from the root indicating the empty string.
- Parameters:
- charSequence - The character sequence for which a suffix tree should be built.
- explicit - Whether an explicit suffix tree should be constructed.
- Returns:
- The new suffix tree for the given character sequence.
- Throws:
- java.lang.NullPointerException - if the given character sequence is null.
-
-