A serializer for Shark/Hive-specific serialization used in Spark shuffle. Since this is only
used in shuffle operations, only serializeStream and deserializeStream are implemented.
The serialization process is very simple:
- Shark operators use Hive serializers to serialize the data structures into byte arrays
(each wrapped in a BytesWritable object).
- Shark operators wrap each key (BytesWritable) in a ReduceKeyMapSide object. The values remain
unchanged as BytesWritable.
- ShuffleSerializationStream writes the underlying byte arrays for each key and value into the
serialization stream. Each byte array is preceded by its length so the deserializer knows
how many bytes to read.
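The length-prefixed layout described above can be sketched as follows. This is a minimal
illustration, not Shark's actual ShuffleSerializationStream: the object and method names are
hypothetical, and the length is assumed to be a fixed 4-byte big-endian int, since the text
does not specify how the length is encoded.

```scala
import java.io.DataOutputStream

// Hypothetical helper illustrating length-prefixed framing; not part of Shark.
object LengthPrefixedWriter {

  /** Writes the array length (assumed: 4-byte big-endian int), then the raw bytes. */
  def writeBytes(out: DataOutputStream, bytes: Array[Byte]): Unit = {
    out.writeInt(bytes.length)
    out.write(bytes, 0, bytes.length)
  }

  /** Writes one key/value pair: the framed key followed by the framed value. */
  def writePair(out: DataOutputStream, key: Array[Byte], value: Array[Byte]): Unit = {
    writeBytes(out, key)
    writeBytes(out, value)
  }
}
```

Under these assumptions, a 3-byte key and a 1-byte value occupy 4 + 3 + 4 + 1 = 12 bytes on
the stream.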
The deserialization process simply reverses the above, with a few caveats:
- The data type for the keys becomes ReduceKeyReduceSide, wrapping around a byte array (rather
than a BytesWritable).
- The data type for the values becomes a byte array, rather than a BytesWritable.
The reason is that during aggregations and joins (post shuffle), the key-value pairs are inserted
into a hash table, and we want to keep that table small. Keeping the BytesWritable wrapper
would increase the size of the hash table by another 16 bytes per key-value pair.
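The reduce side can be sketched in the same spirit: each length-prefixed record is read back
into a plain byte array, and the key is wrapped in a thin object that hashes and compares by
array content so it can be used directly in a hash table. RawKey and LengthPrefixedReader are
hypothetical names for illustration (RawKey is not ReduceKeyReduceSide), and the 4-byte int
length prefix is again an assumption.

```scala
import java.io.{ByteArrayInputStream, DataInputStream}
import java.util.Arrays
import scala.collection.mutable

// Hypothetical reduce-side key: holds the raw array directly, with no extra wrapper object.
final class RawKey(val bytes: Array[Byte]) {
  // Arrays use identity equality by default, so compare and hash by content.
  override def hashCode: Int = Arrays.hashCode(bytes)
  override def equals(other: Any): Boolean = other match {
    case that: RawKey => Arrays.equals(bytes, that.bytes)
    case _            => false
  }
}

// Hypothetical helper; not part of Shark.
object LengthPrefixedReader {

  /** Reads the length (assumed: 4-byte big-endian int), then exactly that many bytes. */
  def readBytes(in: DataInputStream): Array[Byte] = {
    val bytes = new Array[Byte](in.readInt())
    in.readFully(bytes)
    bytes
  }

  /** Reads one pair: the key becomes a RawKey, the value stays a bare byte array. */
  def readPair(in: DataInputStream): (RawKey, Array[Byte]) = {
    val key = new RawKey(readBytes(in))
    val value = readBytes(in)
    (key, value)
  }

  /** Reads every pair in `data` and groups the values by key in a hash table. */
  def groupAll(data: Array[Byte]): mutable.HashMap[RawKey, mutable.ArrayBuffer[Array[Byte]]] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val table = mutable.HashMap.empty[RawKey, mutable.ArrayBuffer[Array[Byte]]]
    while (in.available() > 0) {
      val (key, value) = readPair(in)
      table.getOrElseUpdate(key, mutable.ArrayBuffer.empty[Array[Byte]]) += value
    }
    table
  }
}
```

Because the table stores bare arrays for values and a single thin wrapper for keys, no
per-entry BytesWritable object is retained after the shuffle.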