public final class ByteQuadsCanonicalizer extends Object
Replacement for BytesToNameCanonicalizer which aims at more localized
memory access due to flattening of name quad data.
The performance improvement is modest for simple JSON document data binding (maybe 3%),
but it should help more with larger symbol tables, or with binary formats like Smile.
Hash area is divided into 4 sections:
Primary area (1/2 of total size), direct match from hash (LSB)
Secondary area (1/4 of total size), match from hash (LSB) >> 1
Tertiary area (1/8 of total size), match from hash (LSB) >> 2
Spill-over area (remaining 1/8) with linear scan, insertion order
Within every area, entries are 4 ints: 1 to 3 ints contain 1 to 12
UTF-8 encoded bytes of the name (null-padded), and the last int is an offset into
_names, which contains the actual name Strings.
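The division described above can be sketched as plain offset arithmetic. This is a minimal illustration, not the library's code: the class HashAreaLayout and its members are hypothetical names, it assumes the number of primary slots is a power of two, and it assumes "hash (LSB)" means the low bits of the hash masked to the primary slot count.

```java
// Sketch only: mirrors the fractions in the description, not the library source.
final class HashAreaLayout {
    // Per the description, every entry occupies 4 ints.
    static final int INTS_PER_ENTRY = 4;

    final int hashSize;       // number of primary slots (assumed power of two)
    final int totalInts;      // length of the fixed-size hash area, in ints
    final int secondaryStart; // first int of the secondary area
    final int tertiaryStart;  // first int of the tertiary area
    final int spilloverStart; // first int of the spill-over area

    HashAreaLayout(int hashSize) {
        if (hashSize <= 0 || Integer.bitCount(hashSize) != 1) {
            throw new IllegalArgumentException("hashSize must be a power of two");
        }
        this.hashSize = hashSize;
        // The fixed area holds 2 * hashSize entries; primary takes the first half.
        this.totalInts = 2 * hashSize * INTS_PER_ENTRY;
        this.secondaryStart = totalInts / 2;                 // after primary (1/2)
        this.tertiaryStart = secondaryStart + totalInts / 4; // after secondary (1/4)
        this.spilloverStart = tertiaryStart + totalInts / 8; // after tertiary (1/8)
    }

    // Direct match from the low bits of the hash.
    int primaryOffset(int hash) {
        return (hash & (hashSize - 1)) * INTS_PER_ENTRY;
    }

    // Secondary area has half as many slots, so the primary slot index is halved.
    int secondaryOffset(int hash) {
        int slot = (hash & (hashSize - 1)) >> 1;
        return secondaryStart + slot * INTS_PER_ENTRY;
    }

    public static void main(String[] args) {
        HashAreaLayout layout = new HashAreaLayout(64);
        System.out.println("secondary starts at int " + layout.secondaryStart);
        System.out.println("tertiary starts at int " + layout.tertiaryStart);
        System.out.println("spill-over starts at int " + layout.spilloverStart);
    }
}
```

With 64 primary slots the fixed area is 512 ints long, and the remaining 1/8 after the tertiary area is what the spill-over scan walks linearly.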
Let's only share reasonably sized symbol tables. Max size set to 3/4 of 8k;
this corresponds to 256k main hash index. This should allow for enough distinct
names for almost any case, while preventing ballooning for cases where names
are unique (or nearly so).
Member that is only used by the root table instance: the root
passes immutable state info to child instances, and children
may return new state if they add entries to the table.
Child tables do NOT use the reference.
_seed
protected final int _seed
Seed value we use as the base to make hash codes non-static between
different runs, but still stable for lifetime of a single symbol table
instance.
This is done for security reasons, to avoid potential DoS attack via
hash collisions.
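The idea of a per-instance seed can be sketched as follows. The mixing function here is an arbitrary illustration and is not the hash the library uses; SeededQuadHash, hash and newSeed are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: an illustrative seeded hash, not the library's actual function.
final class SeededQuadHash {
    // Mixes one 4-byte quad with the seed. Every step is reversible, so for a
    // fixed quad, two different seeds always produce two different hash values.
    static int hash(int seed, int quad) {
        int h = quad ^ seed;
        h ^= (h >>> 15);
        h *= 31;
        h ^= (h >>> 16);
        return h;
    }

    // Hypothetical way to pick a seed that differs between runs.
    static int newSeed() {
        return ThreadLocalRandom.current().nextInt();
    }

    public static void main(String[] args) {
        int seed = newSeed();
        int quad = 0x6E616D65; // the bytes of "name" packed into one int
        // Stable for the lifetime of one table (same seed), unpredictable across runs.
        System.out.println(hash(seed, quad) == hash(seed, quad));
    }
}
```

Because an attacker cannot know the seed in advance, precomputing a set of names that all land in the same slot becomes much harder.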
_intern
protected final boolean _intern
Whether canonical symbol Strings are to be intern()ed before being added
to the table or not.
NOTE: non-final to allow disabling intern()ing in case of excessive
collisions.
_failOnDoS
protected final boolean _failOnDoS
Flag that indicates whether we should throw an exception if enough
hash collisions are detected (true), or just work around them (false).
Since:
2.4
_hashArea
protected int[] _hashArea
Primary hash information area: consists of 2 * _hashSize
entries of 16 bytes (4 ints), arranged in a cascading lookup
structure (details of which may be tweaked depending on expected rates
of collisions).
_hashSize
protected int _hashSize
Number of slots for primary entries within _hashArea; which is
at most 1/8 of actual size of the underlying array (4-int slots,
primary covers only half of the area; plus, additional area for longer
symbols after hash area).
_secondaryStart
protected int _secondaryStart
Offset within _hashArea where secondary entries start
_tertiaryStart
protected int _tertiaryStart
Offset within _hashArea where tertiary entries start
_tertiaryShift
protected int _tertiaryShift
Constant that determines size of buckets for tertiary entries:
1 << _tertiaryShift is the size, and shift value
is also used for translating from primary offset into
tertiary bucket (shift right by 4 + _tertiaryShift).
Default value is 2, for buckets of 4 slots; grows bigger with
bigger table sizes.
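The bucket arithmetic can be illustrated with a small sketch that follows the prose above literally; the actual implementation may compute offsets differently. TertiaryBuckets and its methods are hypothetical names, and the example values assume 64 primary slots with the tertiary area starting at int offset 384.

```java
// Sketch only: follows the description of _tertiaryShift, not the library source.
final class TertiaryBuckets {
    static final int INTS_PER_SLOT = 4; // each slot is 4 ints (16 bytes)

    // Bucket size in slots: 1 << shift, so shift 2 means 4 slots per bucket.
    static int slotsPerBucket(int tertiaryShift) {
        return 1 << tertiaryShift;
    }

    // Translate a primary offset (in ints) into a tertiary bucket index by
    // shifting right by 4 + shift, as the description states.
    static int bucketIndex(int primaryOffset, int tertiaryShift) {
        return primaryOffset >> (4 + tertiaryShift);
    }

    // First int of the bucket that a given primary offset maps to.
    static int bucketStart(int tertiaryStart, int primaryOffset, int tertiaryShift) {
        int intsPerBucket = slotsPerBucket(tertiaryShift) * INTS_PER_SLOT;
        return tertiaryStart + bucketIndex(primaryOffset, tertiaryShift) * intsPerBucket;
    }

    public static void main(String[] args) {
        // 64 primary slots: primary offsets run from 0 to 252 in steps of 4.
        System.out.println(bucketStart(384, 156, 2));
    }
}
```

With a shift of 2 and 64 primary slots, the 256 ints of primary offsets map onto 4 buckets of 16 ints each, which exactly fills a tertiary area of 64 ints.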
_count
protected int _count
Total number of Strings in the symbol table; only used for child tables.
_names
protected String[] _names
Array that contains String instances matching
entries in _hashArea.
Contains nulls for unused entries. Note that this size is twice
that of _hashArea.
_spilloverEnd
protected int _spilloverEnd
Pointer to the offset within spill-over area where there is room
for more spilled over entries (if any).
Spill over area is within fixed-size portion of _hashArea.
_longNameOffset
protected int _longNameOffset
Offset within _hashArea that follows main slots and contains
quads for longer names (13 bytes or longer), and points to the
first available int that may be used for appending quads of the next
long name.
Note that long name area follows immediately after the fixed-size
main hash area (_hashArea).
_hashShared
protected boolean _hashShared
Flag that indicates whether underlying data structures for
the main hash area are shared or not. If they are, then they
need to be handled in copy-on-write way, i.e. if they need
to be modified, a copy needs to be made first; at this point
it will not be shared any more, and can be modified.
This flag needs to be checked both when adding new main entries,
and when adding new collision list queues (i.e. creating a new
collision list head entry)
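The copy-on-write handling described for this flag can be sketched with a minimal stand-in. CopyOnWriteArea is a hypothetical class, not part of the library; it only shows the "copy before first modification, then stop sharing" pattern.

```java
import java.util.Arrays;

// Sketch only: a toy holder illustrating the copy-on-write pattern.
final class CopyOnWriteArea {
    private int[] area;     // may initially be the same array the parent uses
    private boolean shared; // true while 'area' is still the parent's array

    CopyOnWriteArea(int[] sharedArea) {
        this.area = sharedArea;
        this.shared = true;
    }

    int get(int index) {
        return area[index]; // reads never need a copy
    }

    boolean isShared() {
        return shared;
    }

    void set(int index, int value) {
        if (shared) {
            // First write: take a private copy so the parent's data stays intact.
            area = Arrays.copyOf(area, area.length);
            shared = false;
        }
        area[index] = value;
    }

    public static void main(String[] args) {
        int[] parent = {1, 2, 3, 4};
        CopyOnWriteArea child = new CopyOnWriteArea(parent);
        child.set(0, 99);
        System.out.println(parent[0] + " " + child.get(0));
    }
}
```

Every code path that writes must perform the shared check first, which is why the flag has to be consulted both for new main entries and for new collision entries.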
Actual canonicalizer instance that can be used by a parser if (and only if)
canonicalization is enabled; otherwise a non-null "placeholder" instance.
Since:
2.13
release
public void release()
Method called by the using code to indicate it is done with this instance.
This lets instance merge accumulated changes into parent (if need be),
safely and efficiently, and without calling code having to know about parent
information.
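The root/child life cycle implied here can be sketched with a toy table. ToySymbolTable and all of its members are hypothetical; the sketch assumes the root keeps its shared state in an atomic reference and that a child merges back only when it has added entries, which is one plausible reading of the description rather than the library's exact logic.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: a toy root/child table showing merge-on-release.
final class ToySymbolTable {
    private final AtomicReference<Set<String>> rootState; // shared with all children
    private Set<String> names;                            // this instance's view
    private boolean dirty;                                // true once this instance added names

    private ToySymbolTable(AtomicReference<Set<String>> rootState, Set<String> names) {
        this.rootState = rootState;
        this.names = names;
    }

    static ToySymbolTable createRoot() {
        Set<String> empty = Collections.emptySet();
        return new ToySymbolTable(new AtomicReference<>(empty), empty);
    }

    // A child starts from the root's current immutable snapshot.
    ToySymbolTable makeChild() {
        return new ToySymbolTable(rootState, rootState.get());
    }

    void add(String name) {
        if (names.contains(name)) {
            return;
        }
        if (!dirty) {
            names = new HashSet<>(names); // private copy on first modification
            dirty = true;
        }
        names.add(name);
    }

    // Caller signals it is done; the child decides whether to merge into the root.
    void release() {
        if (!dirty) {
            return; // nothing was added, nothing to merge
        }
        Set<String> snapshot = Collections.unmodifiableSet(new HashSet<>(names));
        Set<String> current = rootState.get();
        if (snapshot.size() > current.size()) {
            // If another child merged first, this update is simply dropped.
            rootState.compareAndSet(current, snapshot);
        }
        names = snapshot;
        dirty = false;
    }

    int rootSize() {
        return rootState.get().size();
    }

    public static void main(String[] args) {
        ToySymbolTable root = createRoot();
        ToySymbolTable child = root.makeChild();
        child.add("id");
        child.release();
        System.out.println(root.rootSize());
    }
}
```

The calling code only ever calls release() on the child; it never needs a reference to the root to get its additions shared.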
size
public int size()
Returns:
Number of symbol entries contained by this canonicalizer instance
bucketCount
public int bucketCount()
Returns:
number of primary slots table has currently
maybeDirty
public boolean maybeDirty()
Method called to quickly check whether a child symbol table
may have gotten additional entries. Used to decide whether
a child table should be merged into the shared table.
Returns:
Whether main hash area has been modified
hashSeed
public int hashSeed()
isCanonicalizing
public boolean isCanonicalizing()
Returns:
True for "real", canonicalizing child tables; false for
root table as well as placeholder "child" tables.
Since:
2.13
primaryCount
public int primaryCount()
Method mostly needed by unit tests; calculates number of
entries that are in the primary slot set. These are
"perfect" entries, accessible with a single lookup
Returns:
Number of entries in the primary hash area
secondaryCount
public int secondaryCount()
Method mostly needed by unit tests; calculates number of entries
in secondary buckets
Returns:
Number of entries in the secondary hash area
tertiaryCount
public int tertiaryCount()
Method mostly needed by unit tests; calculates number of entries
in tertiary buckets
Returns:
Number of entries in the tertiary hash area
spilloverCount
public int spilloverCount()
Method mostly needed by unit tests; calculates number of entries
in shared spill-over area