Class BytesRefHash


  • public final class BytesRefHash
    extends Object
    BytesRefHash is a special-purpose, hash-map-like data structure optimized for BytesRef instances. It maintains mappings from byte arrays to ids (Map<BytesRef,int>), storing the hashed bytes efficiently in contiguous storage. The mapping to the id is encapsulated inside BytesRefHash, and the id is guaranteed to increase for each added BytesRef.

    Note: A BytesRef instance passed to add(BytesRef) must not be longer than ByteBlockPool.BYTE_BLOCK_SIZE-2 bytes. The internal storage is limited to 2GB of total byte storage.

    • Method Detail

      • get

        public BytesRef get(int bytesID,
                            BytesRef ref)
        Populates and returns a BytesRef with the bytes for the given bytesID.

        Note: the given bytesID must be a non-negative integer less than the current size (size()).

        Parameters:
        bytesID - the id
        ref - the BytesRef to populate
        Returns:
        the given BytesRef instance populated with the bytes for the given bytesID
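        The populate-and-return pattern of get, together with the idea of keeping all bytes in contiguous storage, can be illustrated with a small stand-alone sketch. ToyRef and ToyPool are hypothetical stand-ins invented for this example and are not the Lucene classes; the single growable array and the one-byte length prefix are simplifying assumptions of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; these are not the Lucene classes.
final class ToyRef {
    byte[] bytes;
    int offset;
    int length;

    String utf8() {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}

final class ToyPool {
    private byte[] pool = new byte[64]; // every value, stored back to back
    private int used = 0;               // bytes of the pool currently in use
    private int[] starts = new int[8];  // id -> offset of the value's length prefix
    private int count = 0;              // number of ids handed out so far

    /** Appends a value and returns its id; ids are handed out as 0, 1, 2, ... */
    int append(byte[] value) {
        if (value.length > 127) {
            throw new IllegalArgumentException("toy limit: at most 127 bytes");
        }
        int needed = used + 1 + value.length;
        if (needed > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, needed));
        }
        if (count == starts.length) {
            starts = Arrays.copyOf(starts, count * 2);
        }
        starts[count] = used;
        pool[used++] = (byte) value.length; // one-byte length prefix (sketch assumption)
        System.arraycopy(value, 0, pool, used, value.length);
        used += value.length;
        return count++;
    }

    /** Start offset of the entry for the given id inside the pool. */
    int byteStart(int id) {
        if (id < 0 || id >= count) {
            throw new IndexOutOfBoundsException("id " + id + " not in [0, " + count + ")");
        }
        return starts[id];
    }

    /** Points the caller-supplied ref at the stored bytes and returns that same ref. */
    ToyRef get(int id, ToyRef ref) {
        int start = byteStart(id);
        ref.bytes = pool;          // no copy: the ref is a view into the pool
        ref.offset = start + 1;    // skip the length prefix
        ref.length = pool[start];
        return ref;
    }
}
```

        Because get fills in a caller-supplied object, one ToyRef can be reused across many lookups without allocating, which is presumably the motivation for the two-argument signature above.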
      • sort

        public int[] sort(Comparator<BytesRef> comp)
        Returns the values array sorted by the referenced byte values.

        Note: This is a destructive operation. clear() must be called in order to reuse this BytesRefHash instance.

        Parameters:
        comp - the Comparator used for sorting
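        As a rough illustration of what "sorted by the referenced byte values" means, the sketch below orders ids by comparing the byte arrays they refer to. SortIdsSketch, sortedIds and compareUnsigned are hypothetical names for this example only. Unlike the method above, the sketch is not destructive, and unsigned lexicographic order is just one possible comparator.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper for illustration only; not part of Lucene.
final class SortIdsSketch {

    /** Returns the ids 0..values.length-1, ordered by the byte values they refer to. */
    static int[] sortedIds(byte[][] values, Comparator<byte[]> comp) {
        Integer[] ids = new Integer[values.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Compare ids indirectly, through the bytes each id refers to.
        Arrays.sort(ids, (x, y) -> comp.compare(values[x], values[y]));
        int[] out = new int[ids.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = ids[i];
        }
        return out;
    }

    /** Unsigned lexicographic byte order; a shorter prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

        For values "pear", "apple" and "fig" stored under ids 0, 1 and 2, sortedIds with compareUnsigned yields the ids in the order 1, 2, 0.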
      • clear

        public void clear(boolean resetPool)
        Clears this BytesRefHash, removing all mappings. If resetPool is true, the internally used ByteBlockPool is reset as well.
      • clear

        public void clear()
      • close

        public void close()
        Closes the BytesRefHash and releases all internally used memory.
      • add

        public int add(BytesRef bytes,
                       int code)
        Adds a new BytesRef with a pre-calculated hash code.
        Parameters:
        bytes - the bytes to hash
        code - the bytes hash code

        The hash code is defined over the bytes in [offset, offset + length) as:

         int hash = 0;
         for (int i = offset; i < offset + length; i++) {
           hash = 31 * hash + bytes[i];
         }
         
        Returns:
        the id assigned to the given bytes if there was no mapping for them, otherwise (-(id)-1), where id is the id they are already mapped to. The return value is therefore always >= 0 if the given bytes have not been hashed before, and negative otherwise.
        Throws:
        BytesRefHash.MaxBytesLengthExceededException - if the length of the given bytes is greater than ByteBlockPool.BYTE_BLOCK_SIZE - 2
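        The hash code definition and the (-(id)-1) return convention can be tried out with a small stand-alone model. ToyBytesHash is a hypothetical stand-in written for this example, not the Lucene class: it takes plain byte[] values instead of BytesRef, and it keeps ids in ordinary java.util collections rather than in a ByteBlockPool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is not the Lucene class.
final class ToyBytesHash {
    private final List<byte[]> values = new ArrayList<>();                // index == id
    private final Map<Integer, List<Integer>> buckets = new HashMap<>();  // hash code -> ids

    /** The hash code as defined above, over bytes[offset, offset + length). */
    static int hash(byte[] bytes, int offset, int length) {
        int hash = 0;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + bytes[i];
        }
        return hash;
    }

    /** Returns the id of the given bytes, or -1 if they have not been added. */
    int find(byte[] bytes, int code) {
        for (int id : buckets.getOrDefault(code, Collections.emptyList())) {
            if (Arrays.equals(values.get(id), bytes)) {
                return id;
            }
        }
        return -1;
    }

    /** Returns the new id (>= 0), or -(id)-1 if the bytes were already present. */
    int add(byte[] bytes, int code) {
        int existing = find(bytes, code);
        if (existing >= 0) {
            return -existing - 1;
        }
        int id = values.size();
        values.add(bytes.clone());
        buckets.computeIfAbsent(code, k -> new ArrayList<>()).add(id);
        return id;
    }

    /** Recovers the id from either form of add's return value. */
    static int idOf(int addResult) {
        return addResult >= 0 ? addResult : -addResult - 1;
    }
}
```

        The negative encoding lets a caller learn from one int both the id and whether the value was new: a result r < 0 means "already present under id -r - 1". Offsetting by one is what keeps id 0 distinguishable, since -0 would collide with a fresh id 0.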
      • find

        public int find(BytesRef bytes,
                        int code)
        Returns the id of the given BytesRef with a pre-calculated hash code.
        Parameters:
        bytes - the bytes to look for
        code - the bytes hash code
        Returns:
        the id of the given bytes, or -1 if there is no mapping for the given bytes.
      • addByPoolOffset

        public int addByPoolOffset(int offset)
        Adds an "arbitrary" int offset instead of a BytesRef term. This is used in the indexer to hold the hash for term vectors, because they do not redundantly store the byte[] term directly and instead reference the byte[] term already stored by the postings BytesRefHash. See add(int textStart) in TermsHashPerField.
      • reinit

        public void reinit()
        Reinitializes the BytesRefHash after a previous clear() call. If clear() has not been called previously, this method has no effect.
      • byteStart

        public int byteStart(int bytesID)
        Returns the bytesStart offset into the internally used ByteBlockPool for the given bytesID
        Parameters:
        bytesID - the id to look up
        Returns:
        the bytesStart offset into the internally used ByteBlockPool for the given id