Class DuplicateByteSequenceSpotter
public class DuplicateByteSequenceSpotter
extends java.lang.Object
The minimum required length for a duplicate sequence detected is 6 bytes.
The design goals are to maximize speed of lookup while minimizing the space required to do so. This has led to a hybrid solution for representing the bytes that make up a sequence in the trie.
If we have 6 bytes in sequence e.g. abcdef then they are represented as object nodes in the tree as follows:
(a)-(b)-(c)-(def as an int)
DuplicateByteSequenceSpotter.RootTreeNode
objects are used for the first two levels of the tree
(representing bytes a and b in the example sequence). The combinations of
objects at these 2 levels are few so internally these objects allocate an
array of 256 child node objects to quickly address children by indexing
directly into the densely packed array using a byte value. The third level in
the tree holds DuplicateByteSequenceSpotter.LightweightTreeNode
nodes that have few children
(typically much less than 256) and so use a dynamically-grown array to hold
child nodes as simple int primitives. These ints represent the final 3 bytes
of a sequence and also hold a count of the number of times the entire sequence
path has been visited (count is a single byte).
The Trie grows indefinitely as more content is added and while theoretically it could be massive (a 6-depth tree could produce 256^6 nodes) non-random content e.g English text contains fewer variations.
In future we may look at using one of these strategies when memory is tight:
- auto-pruning methods to remove less-visited parts of the tree
- auto-reset to wipe the whole tree and restart when a memory threshold is reached
- halting any growth of the tree
-
Field Summary
Fields Modifier and Type Field Description static int
MAX_HIT_COUNT
static int
TREE_DEPTH
-
Constructor Summary
Constructors Constructor Description DuplicateByteSequenceSpotter()
-
Method Summary
Modifier and Type Method Description short
addByte(byte b)
Add a byte to the sequence.long
getEstimatedSizeInBytes()
int[]
getNodesAllocatedByDepth()
int
getNodesResizedByDepth()
void
startNewSequence()
Reset the sequence detection logic to avoid any continuation of the immediately previous bytes.Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Field Details
-
TREE_DEPTH
public static final int TREE_DEPTH- See Also:
- Constant Field Values
-
MAX_HIT_COUNT
public static final int MAX_HIT_COUNT- See Also:
- Constant Field Values
-
-
Constructor Details
-
DuplicateByteSequenceSpotter
public DuplicateByteSequenceSpotter()
-
-
Method Details
-
startNewSequence
public void startNewSequence()Reset the sequence detection logic to avoid any continuation of the immediately previous bytes. A minimum of dupSequenceSize bytes need to be added before any new duplicate sequences will be reported. Hit counts are not reset by calling this method. -
addByte
public short addByte(byte b)Add a byte to the sequence.- Parameters:
b
- the next byte in a sequence- Returns:
- number of times this byte and the preceding 6 bytes have been seen before as a sequence (only counts up to 255)
-
getEstimatedSizeInBytes
public final long getEstimatedSizeInBytes() -
getNodesAllocatedByDepth
public int[] getNodesAllocatedByDepth()- Returns:
- Performance info - the number of nodes allocated at each depth
-
getNodesResizedByDepth
public int getNodesResizedByDepth()- Returns:
- Performance info - the number of resizing of children arrays, at each depth
-