java.lang.Object
edu.internet2.middleware.grouperClientExt.com.fasterxml.jackson.core.sym.CharsToNameCanonicalizer

public final class CharsToNameCanonicalizer extends Object
This class is a specialized, type-safe Map from char array to String value. Specialization means type safety plus specific access patterns: the key is a char array, the value is an optionally interned String, and values are added on access if necessary. Instances are meant to be used concurrently, but only through well-defined mechanisms for obtaining such concurrently usable instances. The main use for the class is to store symbol table information for things like compilers and parsers, especially when the number of symbols (keywords) is limited.

For optimal performance, matches should be very common (especially after "warm-up") and, as with most hash-based maps and sets, hash codes should be uniformly distributed. Collisions are slightly more expensive than with HashMap or HashSet, since hash codes are not used in resolving collisions; that is, an equals() comparison is done against every symbol in the same bucket index.
Finally, rehashing is also more expensive: hash codes are not stored, so rehashing requires every entry's hash code to be recalculated. Hash codes are left out to reduce memory usage, in the hope of better memory locality.

The usual usage pattern is to create a single "master" instance, and either use that instance sequentially or create derived "child" instances which, after use, are asked to return any symbol additions to the master instance. In either case the benefit is that the symbol table gets initialized so that further uses are more efficient, as eventually all symbols needed will already be in the symbol table. At that point no more symbol String allocations are needed, nor any changes to the symbol table itself.

Note that while individual symbol table instances are NOT thread-safe (much like generic collection classes), separate "child" instances can be used concurrently without synchronization. However, the master table can only be used concurrently with child instances if access to the master instance is read-only (i.e. no modifications are made).
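
The lookup-then-add access pattern described above can be sketched with a small self-contained class. This is not the library's implementation: `TinySymbolTable` and its members are hypothetical names, the slot count is fixed, collisions are kept in plain lists, and seeding, resizing, and parent/child sharing are all left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the CharsToNameCanonicalizer implementation.
final class TinySymbolTable {
    private final List<List<String>> slots = new ArrayList<>();
    private int size;

    TinySymbolTable(int slotCount) {
        for (int i = 0; i < slotCount; i++) {
            slots.add(new ArrayList<>());
        }
    }

    // Returns the canonical String for buffer[start, start + len).
    // A new String is allocated only when no matching symbol exists yet.
    String findSymbol(char[] buffer, int start, int len) {
        int hash = 0;
        for (int i = start, end = start + len; i < end; i++) {
            hash = hash * 31 + buffer[i];
        }
        List<String> slot = slots.get((hash & 0x7fffffff) % slots.size());
        for (String candidate : slot) {
            if (matches(candidate, buffer, start, len)) {
                return candidate; // hit: the existing instance is reused
            }
        }
        String added = new String(buffer, start, len);
        slot.add(added);
        size++;
        return added;
    }

    int size() {
        return size;
    }

    // Compares a stored symbol against a char range without building a String.
    private static boolean matches(String s, char[] buffer, int start, int len) {
        if (s.length() != len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (s.charAt(i) != buffer[start + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Once every name in the input has been seen, each call is a hit and returns an already existing String, which is the "warm-up" effect the description refers to.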

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected edu.internet2.middleware.grouperClientExt.com.fasterxml.jackson.core.sym.CharsToNameCanonicalizer.Bucket[]
    _buckets
    Overflow buckets; if primary doesn't match, lookup is done from here.
    protected boolean
    _canonicalize
    Whether any canonicalization should be attempted (whether using intern or not).
    protected final int
    _flags
    protected boolean
    _hashShared
    Flag that indicates whether underlying data structures for the main hash area are shared or not.
    protected int
    _indexMask
    Mask used to get index from hash values; equal to _buckets.length - 1, when _buckets.length is a power of two.
    protected int
    _longestCollisionList
    We need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.
    protected BitSet
    _overflows
    Lazily constructed structure that is used to keep track of collision buckets that have overflowed once: this is used to detect likely attempts at denial-of-service attacks that use hash collisions.
    protected final CharsToNameCanonicalizer
    _parent
    Sharing of learnt symbols is done by optional linking of symbol table instances with their parents.
    protected final int
    _seed
    Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance.
    protected int
    _size
    Current size (number of entries); needed to know if and when to rehash.
    protected int
    _sizeThreshold
    Limit that indicates maximum size this instance can hold before it needs to be expanded and rehashed.
    protected String[]
    _symbols
    Primary matching symbols; most matches are expected to occur here.
    protected final AtomicReference<edu.internet2.middleware.grouperClientExt.com.fasterxml.jackson.core.sym.CharsToNameCanonicalizer.TableInfo>
    _tableInfo
    Member that is only used by the root table instance: root passes immutable state info to child instances, and children may return new state if they add entries to the table.
    static final int
    HASH_MULT
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    _hashToIndex(int rawHash)
    Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.
    protected void
    _reportTooManyCollisions(int maxLen)
    int
    bucketCount()
    Method for checking number of primary hash buckets this symbol table uses.
    int
    calcHash(char[] buffer, int start, int len)
    Implementation of a hashing method for variable length Strings.
    int
    calcHash(String key)
    int
    collisionCount()
    Method mostly needed by unit tests; calculates number of entries that are in collision list.
    static CharsToNameCanonicalizer
    createRoot()
    Method called to create root canonicalizer for a JsonFactory instance.
    protected static CharsToNameCanonicalizer
    createRoot(int seed)
    String
    findSymbol(char[] buffer, int start, int len, int h)
    int
    hashSeed()
    CharsToNameCanonicalizer
    makeChild(int flags)
    "Factory" method; will create a new child instance of this symbol table.
    int
    maxCollisionLength()
    Method mostly needed by unit tests; calculates length of the longest collision chain.
    boolean
    maybeDirty()
    void
    release()
    Method called by the using code to indicate it is done with this instance.
    int
    size()
    protected void
    verifyInternalConsistency()
    Diagnostics method that will verify that internal data structures are consistent; not meant as user-facing method but only for test suites and possible troubleshooting.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • HASH_MULT

      public static final int HASH_MULT
    • _parent

      protected final CharsToNameCanonicalizer _parent
      Sharing of learnt symbols is done by optional linking of symbol table instances with their parents. When parent linkage is defined, and child instance is released (call to release), parent's shared tables may be updated from the child instance.
    • _tableInfo

      protected final AtomicReference<edu.internet2.middleware.grouperClientExt.com.fasterxml.jackson.core.sym.CharsToNameCanonicalizer.TableInfo> _tableInfo
      Member that is only used by the root table instance: root passes immutable state info to child instances, and children may return new state if they add entries to the table. Child tables do NOT use the reference.
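
The hand-off between a root and its children can be sketched as follows. This is a simplified illustration under stated assumptions, not the class's actual code: `SharedSymbols` is a hypothetical name, the shared state is an immutable `Set<String>` rather than the real `TableInfo`, and the merge rule ("publish only if the child knows more symbols") is an assumption made for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of root/child state sharing; names and merge rule are assumptions.
final class SharedSymbols {
    private final SharedSymbols parent;                    // null for the root
    private final AtomicReference<Set<String>> rootState;  // non-null for the root only
    private Set<String> symbols;                           // this instance's working view
    private boolean dirty;                                 // true once a private copy was modified

    private SharedSymbols(SharedSymbols parent, Set<String> initial) {
        this.parent = parent;
        if (parent == null) {
            this.rootState = new AtomicReference<>(initial);
        } else {
            this.rootState = null;
        }
        this.symbols = initial;
    }

    static SharedSymbols createRoot() {
        return new SharedSymbols(null, Collections.<String>emptySet());
    }

    // The child starts from the root's current immutable snapshot.
    SharedSymbols makeChild() {
        if (parent != null) {
            throw new IllegalStateException("only the root creates children");
        }
        return new SharedSymbols(this, rootState.get());
    }

    void add(String symbol) {
        if (symbols.contains(symbol)) {
            return;
        }
        if (!dirty) {
            symbols = new HashSet<>(symbols); // copy before the first write
            dirty = true;
        }
        symbols.add(symbol);
    }

    // Offers the child's additions back to the root; the root's state is replaced atomically.
    void release() {
        if (parent == null || !dirty) {
            return;
        }
        Set<String> snapshot = Collections.unmodifiableSet(new HashSet<>(symbols));
        parent.rootState.updateAndGet(cur -> snapshot.size() > cur.size() ? snapshot : cur);
        dirty = false;
    }

    // Meaningful on the root only.
    int sharedSize() {
        return rootState.get().size();
    }
}
```

Because the root only ever swaps one immutable snapshot for another, children created earlier keep reading the snapshot they started with and need no locking.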
    • _seed

      protected final int _seed
      Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance. This is done for security reasons, to avoid potential DoS attack via hash collisions.
      Since:
      2.1
    • _flags

      protected final int _flags
    • _canonicalize

      protected boolean _canonicalize
      Whether any canonicalization should be attempted (whether using intern or not).

      NOTE: non-final since we may need to disable this with overflow.

    • _symbols

      protected String[] _symbols
      Primary matching symbols; most matches are expected to occur here.
    • _buckets

      protected edu.internet2.middleware.grouperClientExt.com.fasterxml.jackson.core.sym.CharsToNameCanonicalizer.Bucket[] _buckets
      Overflow buckets; if primary doesn't match, lookup is done from here.

      Note: Number of buckets is half of number of symbol entries, on assumption there's less need for buckets.

    • _size

      protected int _size
      Current size (number of entries); needed to know if and when to rehash.
    • _sizeThreshold

      protected int _sizeThreshold
      Limit that indicates maximum size this instance can hold before it needs to be expanded and rehashed. Calculated using fill factor passed in to constructor.
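
The interplay of size, threshold, and rehashing without stored hash codes can be sketched as below. Everything here is an assumption made for illustration: `GrowableSymbols` is a hypothetical class, the fill factor is fixed at 0.75, the table doubles on growth, and collisions use linear probing instead of the overflow buckets this class describes.

```java
// Hypothetical sketch; fill factor 0.75, doubling growth, and linear probing are assumptions.
final class GrowableSymbols {
    private String[] symbols = new String[8]; // length stays a power of two
    private int size;
    private int sizeThreshold = 6;            // 8 * 0.75

    boolean add(String s) {
        if (contains(s)) {
            return false;
        }
        if (size >= sizeThreshold) {
            rehash();
        }
        insert(symbols, s);
        size++;
        return true;
    }

    boolean contains(String s) {
        int mask = symbols.length - 1;
        // The table is never full, so probing always reaches an empty slot.
        for (int i = s.hashCode() & mask; symbols[i] != null; i = (i + 1) & mask) {
            if (symbols[i].equals(s)) {
                return true;
            }
        }
        return false;
    }

    // Doubles the table and places every entry again.
    private void rehash() {
        String[] old = symbols;
        symbols = new String[old.length * 2];
        sizeThreshold = (int) (symbols.length * 0.75);
        for (String s : old) {
            if (s != null) {
                // The table stores no per-entry hash, so the index is derived from the symbol again.
                insert(symbols, s);
            }
        }
    }

    private static void insert(String[] table, String s) {
        int mask = table.length - 1;
        int i = s.hashCode() & mask;
        while (table[i] != null) {
            i = (i + 1) & mask;
        }
        table[i] = s;
    }

    int size() {
        return size;
    }

    int capacity() {
        return symbols.length;
    }
}
```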
    • _indexMask

      protected int _indexMask
      Mask used to get index from hash values; equal to _buckets.length - 1, when _buckets.length is a power of two.
    • _longestCollisionList

      protected int _longestCollisionList
      We need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.
      Since:
      2.1
    • _hashShared

      protected boolean _hashShared
      Flag that indicates whether underlying data structures for the main hash area are shared or not. If they are, then they need to be handled in copy-on-write way, i.e. if they need to be modified, a copy needs to be made first; at this point it will not be shared any more, and can be modified.

      This flag needs to be checked both when adding new main entries and when adding new collision list queues (i.e. creating a new collision list head entry).

    • _overflows

      protected BitSet _overflows
      Lazily constructed structure that is used to keep track of collision buckets that have overflowed once: this is used to detect likely attempts at denial-of-service attacks that use hash collisions.
      Since:
      2.4
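
A lazily created BitSet that remembers which buckets have overflowed might look like the sketch below. `OverflowTracker` is a hypothetical name, and treating a second overflow of the same bucket as the suspicious case is an assumption based on the "overflowed once" wording above; what the caller does with that signal is not shown.

```java
import java.util.BitSet;

// Hypothetical sketch; the "second overflow of the same bucket is suspicious" rule is an assumption.
final class OverflowTracker {
    private BitSet overflows; // created on first use

    // Records an overflow for the bucket and returns true if that bucket had overflowed before.
    boolean recordOverflow(int bucketIndex) {
        if (overflows == null) {
            overflows = new BitSet();
        }
        if (overflows.get(bucketIndex)) {
            return true;
        }
        overflows.set(bucketIndex);
        return false;
    }
}
```

Keeping the BitSet null until the first overflow means well-behaved input pays no memory cost for the check.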
  • Method Details

    • createRoot

      public static CharsToNameCanonicalizer createRoot()
      Method called to create root canonicalizer for a JsonFactory instance. Root instance is never used directly; its main use is for storing and sharing underlying symbol arrays as needed.
      Returns:
      Root instance to use for constructing new child instances
    • createRoot

      protected static CharsToNameCanonicalizer createRoot(int seed)
    • makeChild

      public CharsToNameCanonicalizer makeChild(int flags)
      "Factory" method; will create a new child instance of this symbol table. It will be a copy-on-write instance, i.e. it will only use a read-only copy of the parent's data, but when changes are needed, a copy will be created.

      Note: while this method is synchronized, it is generally not safe to both use makeChild/mergeChild, AND to use instance actively. Instead, a separate 'root' instance should be used on which only makeChild/mergeChild are called, but instance itself is not used as a symbol table.

      Parameters:
      flags - Bit flags of active JsonFactory.Features enabled.
      Returns:
      Actual canonicalizer instance that can be used by a parser
    • release

      public void release()
      Method called by the using code to indicate it is done with this instance. This lets instance merge accumulated changes into parent (if need be), safely and efficiently, and without calling code having to know about parent information.
    • size

      public int size()
      Returns:
      Number of symbol entries contained by this canonicalizer instance
    • bucketCount

      public int bucketCount()
      Method for checking number of primary hash buckets this symbol table uses.
      Returns:
      number of primary slots table has currently
    • maybeDirty

      public boolean maybeDirty()
    • hashSeed

      public int hashSeed()
    • collisionCount

      public int collisionCount()
      Method mostly needed by unit tests; calculates number of entries that are in collision list. Value can be at most (size() - 1), but should usually be much lower, ideally 0.
      Returns:
      Number of collisions in the primary hash area
      Since:
      2.1
    • maxCollisionLength

      public int maxCollisionLength()
      Method mostly needed by unit tests; calculates length of the longest collision chain. This should typically be a low number, but may be up to size() - 1 in the pathological case.
      Returns:
      Length of the collision chain
      Since:
      2.1
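
Both diagnostics, collisionCount and maxCollisionLength, amount to walking the overflow chains. The sketch below shows one way to compute them; `CollisionStats` and `Node` are hypothetical stand-ins for the class's internal bucket type, whose actual shape is not documented on this page.

```java
// Hypothetical sketch; Node is a stand-in for the class's internal bucket type.
final class CollisionStats {
    static final class Node {
        final String symbol;
        final Node next;

        Node(String symbol, Node next) {
            this.symbol = symbol;
            this.next = next;
        }
    }

    // Total number of entries that live in collision chains.
    static int collisionCount(Node[] buckets) {
        int count = 0;
        for (Node head : buckets) {
            for (Node n = head; n != null; n = n.next) {
                count++;
            }
        }
        return count;
    }

    // Length of the longest single chain.
    static int maxCollisionLength(Node[] buckets) {
        int max = 0;
        for (Node head : buckets) {
            int len = 0;
            for (Node n = head; n != null; n = n.next) {
                len++;
            }
            max = Math.max(max, len);
        }
        return max;
    }
}
```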
    • findSymbol

      public String findSymbol(char[] buffer, int start, int len, int h)
    • _hashToIndex

      public int _hashToIndex(int rawHash)
      Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.
      Parameters:
      rawHash - Raw hash value to use for calculating index
      Returns:
      Index value calculated
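
"Shuffle, then truncate" can be illustrated with a few shift-and-mix steps followed by a power-of-two mask. The shift amounts below are assumptions chosen for the example, and `HashIndex` is a hypothetical name; the page does not state which mixing steps the method performs.

```java
// Hypothetical sketch: the particular shift amounts are assumptions, not taken from this page.
final class HashIndex {
    // Mixes high bits into low bits, then truncates with a power-of-two mask.
    static int hashToIndex(int rawHash, int tableLength) {
        int indexMask = tableLength - 1; // valid only when tableLength is a power of two
        rawHash += (rawHash >>> 15);
        rawHash ^= (rawHash << 7);
        rawHash += (rawHash >>> 3);
        return rawHash & indexMask;
    }
}
```

Mixing before masking matters because the mask keeps only the low bits; without it, hash values that differ only in their high bits would all land in the same slot.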
    • calcHash

      public int calcHash(char[] buffer, int start, int len)
      Implementation of a hashing method for variable length Strings. Most of the time the intention is that the caller does this calculation during parsing, not here; however, sometimes it needs to be done for an already parsed String too.
      Parameters:
      buffer - Input buffer that contains name to decode
      start - Pointer to the first character of the name
      len - Length of String; has to be at least 1 (caller guarantees)
      Returns:
      Hash code calculated
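
A seeded multiply-and-add hash over a char range is one plausible reading of this method together with the _seed and HASH_MULT fields. The sketch below is an assumption-laden illustration: `SeededHash` is a hypothetical class, the multiplier value 33 is assumed (this page does not give the constant's value), and the seed is passed in explicitly rather than read from an instance field.

```java
// Hypothetical sketch of a seeded multiplicative hash; the multiplier 33 is an assumed value.
final class SeededHash {
    static final int MULT = 33;

    // Hashes buffer[start, start + len), starting from the given seed.
    static int calcHash(int seed, char[] buffer, int start, int len) {
        int hash = seed;
        for (int i = start, end = start + len; i < end; i++) {
            hash = hash * MULT + buffer[i];
        }
        return hash;
    }

    // Same computation for a String, so both overloads agree on equal content.
    static int calcHash(int seed, String key) {
        int hash = seed;
        for (int i = 0; i < key.length(); i++) {
            hash = hash * MULT + key.charAt(i);
        }
        return hash;
    }
}
```

Starting from a per-instance seed rather than zero means the same name can hash differently in different runs, which makes it harder to precompute colliding keys.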
    • calcHash

      public int calcHash(String key)
    • _reportTooManyCollisions

      protected void _reportTooManyCollisions(int maxLen)
      Parameters:
      maxLen - Maximum allowed length of collision chain
      Since:
      2.1
    • verifyInternalConsistency

      protected void verifyInternalConsistency()
      Diagnostics method that will verify that internal data structures are consistent; not meant as user-facing method but only for test suites and possible troubleshooting.
      Since:
      2.10