Class TextIndexBunchedSerializer

  • All Implemented Interfaces:
    BunchedSerializer<Tuple, List<Integer>>

    @API(EXPERIMENTAL)
    public class TextIndexBunchedSerializer
    extends Object
    implements BunchedSerializer<Tuple, List<Integer>>
    Serializer used by the TextIndexMaintainer to write entries into a BunchedMap. This is specifically designed for writing out the mapping from document ID to position list. As a result, it requires that each list it serializes consist of monotonically increasing non-negative integers (which is true for position lists). This allows it to delta compress the integers in each list, which can save a significant amount of space.

    Keys are serialized using default Tuple packing. Bunches are serialized as follows: each bunch begins with a prefix (which can be used to version the serialization format). Each entry in the bunch is then serialized by writing the length of the serialized key (using a base-128 variable-length integer encoding), the serialized key bytes, the length in bytes of the serialized position list, and finally the delta-compressed entries of that position list. The key of the first entry in the bunch is omitted, as it can be determined from the sign-post key within the BunchedMap.

    For example, suppose one serializes two entries into a single bunch, one with key (1066,) and value [1, 3, 5, 8] and another with key (1415,) and value [0, 600, 605]. The tuple (1066,) serializes to 16 04 2A (in hex), and the tuple (1415,) serializes to 16 05 87. Most of the deltas are small enough to fit in one byte, but 600 is encoded by taking its binary representation, 1001011000, splitting it into groups of 7 bits starting from the low-order end (100 and 1011000), placing each group in its own byte, and setting the most significant bit of the first byte as a continuation flag, so it becomes 10000100 01011000 = 84 58. So, the full serialized bunch is (with 20 as the prefix):

    
         20 (04 (01 02 02 03)) (03 (16 05 87) 04 (00 (84 58) 05))
     

    The parentheses are added for clarity and separate each entry as well as grouping variable length integers together. Note that to add a new entry to the end of a serialized list, one can take the serialized entry and append it to the end of that list rather than deserializing the entry list, appending the new entry, and then serializing the new list.
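
    The position-list encoding described above can be sketched as follows. This is an illustrative re-implementation, not the serializer's actual code: the class and method names are hypothetical, and it assumes that every byte of a variable-length integer except the last carries the continuation flag and that repeated positions are tolerated (the documentation only shows strictly increasing lists).

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

final class PositionListSketch {
    // Writes a non-negative int as 7-bit groups, most significant group first.
    // Assumption: every byte except the last has its high bit set.
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        // Locate the highest non-empty 7-bit group (an int has at most five).
        int shift = 0;
        while (shift < 28 && (value >>> (shift + 7)) != 0) {
            shift += 7;
        }
        for (; shift > 0; shift -= 7) {
            out.write(0x80 | ((value >>> shift) & 0x7F));
        }
        out.write(value & 0x7F);
    }

    // Delta compresses a position list: each element is written as the
    // difference from its predecessor (the first is relative to zero).
    static byte[] encodePositions(List<Integer> positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            if (position < previous) {
                throw new IllegalArgumentException("list is not monotonically increasing");
            }
            writeVarInt(out, position - previous);
            previous = position;
        }
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodePositions(List.of(1, 3, 5, 8))));   // 01 02 02 03
        System.out.println(toHex(encodePositions(List.of(0, 600, 605)))); // 00 84 58 05
    }
}
```

    Both outputs match the position-list bytes in the worked example; in the serialized bunch each is preceded by its byte length (04 in both cases).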

    See Also:
    TextIndexMaintainer, BunchedMap
    • Method Detail

      • instance

        public static TextIndexBunchedSerializer instance()
        Get the serializer singleton. This serializer maintains no state between serializing different values, so it is safe to maintain as a singleton.
        Returns:
        the TextIndexBunchedSerializer singleton
      • serializeEntry

        @Nonnull
        public byte[] serializeEntry(@Nonnull
                                     Tuple key,
                                     @Nonnull
                                     List<Integer> value)
        Packs a key and value into a byte array. This will write out the tuple and position list in a way consistent with the way each entry is serialized by serializeEntries(List). Because this serializer supports appending, one can take the output of this function and append it to the end of an already serialized entry list to produce the serialized form of that list with this entry appended to the end.
        Specified by:
        serializeEntry in interface BunchedSerializer<Tuple, List<Integer>>
        Parameters:
        key - the key of the map entry
        value - the value of the map entry
        Returns:
        the serialized map entry
        Throws:
        BunchedSerializationException - if the value is not a list of monotonically increasing non-negative integers or if packing the tuple fails
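
        The append property can be illustrated with the bytes from the class-level example. This is a minimal sketch using plain byte arrays rather than the serializer itself: the class name and the append helper are hypothetical, and the assumption that serializeEntry produces exactly the length-prefixed form shown here is drawn from that example rather than confirmed against the implementation.

```java
import java.util.Arrays;

final class AppendSketch {
    // Concatenates an already serialized entry list with one serialized entry.
    static byte[] append(byte[] serializedList, byte[] serializedEntry) {
        byte[] result = Arrays.copyOf(serializedList, serializedList.length + serializedEntry.length);
        System.arraycopy(serializedEntry, 0, result, serializedList.length, serializedEntry.length);
        return result;
    }

    public static void main(String[] args) {
        // Prefix 20, then the first entry (key omitted): [1, 3, 5, 8]
        byte[] existing = {0x20, 0x04, 0x01, 0x02, 0x02, 0x03};
        // Assumed serialized form of key (1415,) with value [0, 600, 605]
        byte[] entry = {0x03, 0x16, 0x05, (byte) 0x87, 0x04, 0x00, (byte) 0x84, 0x58, 0x05};
        byte[] combined = append(existing, entry);
        System.out.println(combined.length); // 15
    }
}
```

        The combined array is byte-for-byte the two-entry bunch from the class-level example, with no deserialization of the existing list required.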
      • serializeEntries

        @Nonnull
        public byte[] serializeEntries(@Nonnull
                                       List<Map.Entry<Tuple, List<Integer>>> entries)
        Packs an entry list into a single byte array. This does so by combining the serialized forms of each key and value in the entry list with their lengths. There is a more in-depth explanation of the serialization format in the class-level documentation.
        Specified by:
        serializeEntries in interface BunchedSerializer<Tuple, List<Integer>>
        Parameters:
        entries - the list of entries to serialize
        Returns:
        the serialized entry list
        Throws:
        BunchedSerializationException - if the entries are invalid, for example if the entry list is empty or contains a position list that is not monotonically increasing
      • deserializeKeys

        @Nonnull
        public List<Tuple> deserializeKeys(@Nonnull
                                           Tuple key,
                                           @Nonnull
                                           byte[] data)
        Deserializes the keys from a serialized entry list. Because the serialization format contains markers with the length of each position list, this method can skip over the position lists while reading through the data, so it is more efficient (in terms of both time and memory) to call this method than deserializeEntries() if one only needs to know the keys.
        Specified by:
        deserializeKeys in interface BunchedSerializer<Tuple, List<Integer>>
        Parameters:
        key - key under which the serialized entry list was stored
        data - source data to deserialize
        Returns:
        the list of keys in the serialized data array
        Throws:
        BunchedSerializationException - if the byte array is malformed
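
        The skipping behavior can be sketched as a scan over the format described in the class-level documentation. This is an illustrative reader, not the actual implementation: the class and method names are hypothetical, the prefix is assumed to be a single byte (as in the example, where it is 20), and the sketch returns only the raw packed bytes of the keys stored in the data, leaving out the first key, which the real method takes from its key parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeyScanSketch {
    private final byte[] data;
    private int pos;

    private KeyScanSketch(byte[] data) {
        this.data = data;
    }

    // Reads a base-128 integer, most significant group first; a set high bit
    // means another byte follows.
    private int readVarInt() {
        int result = 0;
        while (true) {
            if (pos >= data.length) {
                throw new IllegalArgumentException("truncated variable-length integer");
            }
            int b = data[pos++] & 0xFF;
            result = (result << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                return result;
            }
        }
    }

    // Checks that a length-prefixed region fits in the remaining data.
    private void requireAvailable(int length) {
        if (length < 0 || length > data.length - pos) {
            throw new IllegalArgumentException("length runs past end of data");
        }
    }

    // Returns the packed key bytes of every entry after the first,
    // never decoding any position list.
    static List<byte[]> scanPackedKeys(byte[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty data");
        }
        KeyScanSketch scan = new KeyScanSketch(data);
        scan.pos = 1; // assumption: one-byte prefix
        List<byte[]> keys = new ArrayList<>();

        // First entry: its key is omitted, so only a position list is present.
        int listLength = scan.readVarInt();
        scan.requireAvailable(listLength);
        scan.pos += listLength;

        while (scan.pos < data.length) {
            int keyLength = scan.readVarInt();
            scan.requireAvailable(keyLength);
            keys.add(Arrays.copyOfRange(data, scan.pos, scan.pos + keyLength));
            scan.pos += keyLength;

            listLength = scan.readVarInt();
            scan.requireAvailable(listLength);
            scan.pos += listLength; // skip the position list without decoding it
        }
        return keys;
    }
}
```

        Applied to the bunch from the class-level example, the scan yields a single packed key, 16 05 87, which is the tuple (1415,); the key (1066,) would come from the sign-post key.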