@API(value=EXPERIMENTAL) public class TextIndexBunchedSerializer extends Object implements BunchedSerializer<Tuple,List<Integer>>
Serializer used by the TextIndexMaintainer to write entries into a BunchedMap. This is specifically designed for writing out the mapping from document ID to position list. As a result, it requires that the lists it serializes be monotonically increasing non-negative integers (which is true for position lists). This allows it to delta compress the integers in each list, which can save a significant amount of space.
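The delta compression described above can be sketched as follows. This is only an illustration, not the serializer's actual code: the class PositionDeltas and its methods are hypothetical names, and the sketch assumes that repeated positions (a delta of zero) are acceptable, which the description does not state either way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Record Layer API. */
final class PositionDeltas {
    /** Replaces each position with its distance from the previous one (the first is measured from 0). */
    static List<Integer> toDeltas(List<Integer> positions) {
        List<Integer> deltas = new ArrayList<>(positions.size());
        int previous = 0;
        for (int position : positions) {
            // A negative first element or a decreasing step would produce a negative delta.
            if (position < previous) {
                throw new IllegalArgumentException("positions must be non-negative and non-decreasing");
            }
            deltas.add(position - previous);
            previous = position;
        }
        return deltas;
    }

    /** Inverse of toDeltas: a running sum restores the original positions. */
    static List<Integer> fromDeltas(List<Integer> deltas) {
        List<Integer> positions = new ArrayList<>(deltas.size());
        int current = 0;
        for (int delta : deltas) {
            current += delta;
            positions.add(current);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(toDeltas(Arrays.asList(1, 3, 5, 8)));    // [1, 2, 2, 3]
        System.out.println(toDeltas(Arrays.asList(0, 600, 605)));   // [0, 600, 5]
        System.out.println(fromDeltas(Arrays.asList(0, 600, 5)));   // [0, 600, 605]
    }
}
```

Because positions within a document tend to be close together, the deltas are usually much smaller than the positions themselves, which is what makes a variable length integer encoding pay off.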
Keys are serialized using default Tuple packing. Bunches are serialized as follows: each bunch begins with a prefix (that can be used to version the serialization format), and then each entry in the bunch is serialized by writing the length of the key (using a base-128 variable length integer encoding), followed by the serialized key bytes, followed by the length in bytes of the serialized position list, followed by the (delta compressed) entries of that position list. Additionally, the key of the first entry in the bunch is omitted, as it can be determined from the sign-post key within the BunchedMap.
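A minimal sketch of a base-128 variable length integer codec follows. It assumes the byte order implied by the 600 to 84 58 example in this class description (most significant 7-bit group first, with the high bit set on every byte except the last). The class VarIntSketch and its methods are hypothetical names and not the serializer's own code.

```java
/** Hypothetical illustration of a base-128 variable length integer; not the library's implementation. */
final class VarIntSketch {
    /** Encodes a non-negative int, most significant 7-bit group first. */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        // Count how many 7-bit groups are needed (at least one, so that 0 encodes as 00).
        int groups = 1;
        for (int rest = value >>> 7; rest != 0; rest >>>= 7) {
            groups++;
        }
        byte[] out = new byte[groups];
        // Fill from the last byte backwards so the low-order group ends up last.
        for (int i = groups - 1; i >= 0; i--) {
            int group = value & 0x7F;
            // Every byte except the final one carries the continuation flag.
            out[i] = (byte) (i == groups - 1 ? group : (group | 0x80));
            value >>>= 7;
        }
        return out;
    }

    /** Decodes the integer that starts at the given offset. */
    static int decode(byte[] data, int offset) {
        int result = 0;
        int index = offset;
        while (true) {
            int b = data[index++] & 0xFF;
            result = (result << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                return result; // continuation flag clear: this was the last byte
            }
        }
    }

    public static void main(String[] args) {
        byte[] encoded = encode(600);
        System.out.printf("%02X %02X%n", encoded[0] & 0xFF, encoded[1] & 0xFF); // 84 58
        System.out.println(decode(encoded, 0));                                  // 600
    }
}
```

Values below 128 occupy a single byte, so small keys and small deltas cost one length or delta byte each.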
For example, suppose one attempts to serialize two entries into a single bunch, one with key (1066,) and value [1, 3, 5, 8] and another with key (1415,) and value [0, 600, 605]. The tuple (1066,) serializes to 16 04 2A (in hex), and the tuple (1415,) serializes to 16 05 87. The position lists delta compress to [1, 2, 2, 3] and [0, 600, 5]. Most of the deltas are small, but 600 is encoded by taking its binary representation, 1001011000, separating it into groups of 7 bits starting from the low-order end, placing each group in its own byte, and using the most significant bit of each byte as a continuation flag, so it becomes 10000100 01011000 = 84 58.
So, the full serialized bunch is (with 20 as the prefix):
20 (04 (01 02 02 03)) (03 (16 05 87) 04 (00 (84 58) 05))
The parentheses are added for clarity: they separate the entries and group variable length integers together. Because (1066,) is the key of the first entry, its bytes do not appear in the output.
Note that to add a new entry to the end of a serialized list, one can take the serialized entry and append it to the end of that list rather than deserializing the entry list, appending the new entry, and then serializing the new list.
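The worked example can be reproduced with a small standalone sketch. It does not call the real serializer: the class BunchLayoutSketch and its methods are hypothetical, the packed key bytes are hard-coded from the example rather than produced by Tuple packing, and the key 16 06 00 used for the appended entry is an assumed packing of (1536,).

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

/** Hypothetical re-creation of the bunch layout described above; not the library's code. */
final class BunchLayoutSketch {
    static final int PREFIX = 0x20;

    /** Writes a non-negative int as a base-128 varint, most significant group first. */
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        int shift = 0;
        // Find the highest non-empty 7-bit group (an int needs at most five groups).
        while (shift < 28 && (value >>> (shift + 7)) != 0) {
            shift += 7;
        }
        for (; shift > 0; shift -= 7) {
            out.write(0x80 | ((value >>> shift) & 0x7F)); // continuation flag set
        }
        out.write(value & 0x7F);
    }

    /** Serializes one entry; pass null for packedKey to omit the key (first entry in a bunch). */
    static byte[] entry(byte[] packedKey, List<Integer> positions) {
        ByteArrayOutputStream deltas = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            writeVarInt(deltas, position - previous);
            previous = position;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (packedKey != null) {
            writeVarInt(out, packedKey.length);
            out.write(packedKey, 0, packedKey.length);
        }
        writeVarInt(out, deltas.size()); // length of the position list in bytes
        out.write(deltas.toByteArray(), 0, deltas.size());
        return out.toByteArray();
    }

    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    /** Builds the two-entry bunch from the example: (1066,) then (1415,). */
    static byte[] exampleBunch() {
        byte[] first = entry(null, Arrays.asList(1, 3, 5, 8));
        byte[] second = entry(new byte[] {0x16, 0x05, (byte) 0x87}, Arrays.asList(0, 600, 605));
        return concat(concat(new byte[] {(byte) PREFIX}, first), second);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bunch = exampleBunch();
        System.out.println(hex(bunch)); // 20 04 01 02 02 03 03 16 05 87 04 00 84 58 05

        // Appending: a separately serialized entry is simply concatenated onto the bunch.
        byte[] extra = entry(new byte[] {0x16, 0x06, 0x00}, Arrays.asList(2));
        System.out.println(hex(concat(bunch, extra)));
    }
}
```

The second part of main mirrors the append note: no existing bytes are rewritten, which is why the format can support appends without a deserialize and reserialize round trip.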
See Also: TextIndexMaintainer, BunchedMap
Modifier and Type | Method and Description |
---|---|
boolean | canAppend(): Returns true as this serialization format supports appending. |
List<Map.Entry<Tuple,List<Integer>>> | deserializeEntries(Tuple key, byte[] data): Deserializes an entry list from bytes. |
Tuple | deserializeKey(byte[] data, int offset, int length): Deserializes a key using standard Tuple unpacking. |
List<Tuple> | deserializeKeys(Tuple key, byte[] data): Deserializes the keys from a serialized entry list. |
static TextIndexBunchedSerializer | instance(): Get the serializer singleton. |
byte[] | serializeEntries(List<Map.Entry<Tuple,List<Integer>>> entries): Packs an entry list into a single byte array. |
byte[] | serializeEntry(Tuple key, List<Integer> value): Packs a key and value into a byte array. |
byte[] | serializeKey(Tuple key): Packs a key using standard Tuple encoding. |
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface BunchedSerializer: deserializeKey, deserializeKey, serializeEntry
public static TextIndexBunchedSerializer instance()
Get the serializer singleton.
Returns: the TextIndexBunchedSerializer singleton

@Nonnull public byte[] serializeKey(@Nonnull Tuple key)
Packs a key using standard Tuple encoding. Note that Tuples pack in a way that preserves order, which is a requirement of BunchedSerializers.
Specified by: serializeKey in interface BunchedSerializer<Tuple,List<Integer>>
Parameters:
key - key to serialize to bytes
Throws:
BunchedSerializationException - if packing the tuple fails

@Nonnull public byte[] serializeEntry(@Nonnull Tuple key, @Nonnull List<Integer> value)
Packs a key and value into a byte array, in a form compatible with the entries written by serializeEntries(List). Because this serializer supports appending, one can take the output of this function and append it to the end of an already serialized entry list to produce the serialized form of that list with this entry appended to the end.
Specified by: serializeEntry in interface BunchedSerializer<Tuple,List<Integer>>
Parameters:
key - the key of the map entry
value - the value of the map entry
Throws:
BunchedSerializationException - if the value is not a list of monotonically increasing non-negative integers or if packing the tuple fails

@Nonnull public byte[] serializeEntries(@Nonnull List<Map.Entry<Tuple,List<Integer>>> entries)
Packs an entry list into a single byte array.
Specified by: serializeEntries in interface BunchedSerializer<Tuple,List<Integer>>
Parameters:
entries - the list of entries to serialize
Throws:
BunchedSerializationException - if the entries are invalid, such as if the list is empty or contains a list that is not monotonically increasing

@Nonnull public Tuple deserializeKey(@Nonnull byte[] data, int offset, int length)
Deserializes a key using standard Tuple unpacking.
Specified by: deserializeKey in interface BunchedSerializer<Tuple,List<Integer>>
Parameters:
data - source data to deserialize
offset - beginning offset of serialized key (indexed from 0)
length - length of serialized key
Throws:
BunchedSerializationException - if the byte array is malformed

@Nonnull public List<Map.Entry<Tuple,List<Integer>>> deserializeEntries(@Nonnull Tuple key, @Nonnull byte[] data)
Deserializes an entry list from bytes.
Specified by: deserializeEntries in interface BunchedSerializer<Tuple,List<Integer>>
Parameters:
key - key under which the serialized entry list was stored
data - source list to deserialize
Throws:
BunchedSerializationException - if the byte array is malformed

@Nonnull public List<Tuple> deserializeKeys(@Nonnull Tuple key, @Nonnull byte[] data)
Deserializes the keys from a serialized entry list. This can be used instead of deserializeEntries() if one only needs to know the keys.
Specified by: deserializeKeys in interface BunchedSerializer<Tuple,List<Integer>>
Parameters:
key - key under which the serialized entry list was stored
data - source data to deserialize
Throws:
BunchedSerializationException - if the byte array is malformed

public boolean canAppend()
Returns true as this serialization format supports appending.
Specified by: canAppend in interface BunchedSerializer<Tuple,List<Integer>>
Returns: true