usage:
CellWriter effectively stores a list of byte[] payloads that can be retrieved randomly by index. All of the
data is block compressed. For reading, see CellReader. Example usage:
StagedSerde<Fuu> fuuSerDe = new ...
// note that cellWriter.close() *must* be called before writeTo() in order to finalize the index
try (CellWriter cellWriter = new CellWriter.Builder(segmentWriteOutMedium).build()) {
fuuList.stream().map(fuuSerDe::serialize).forEach(cellWriter::write);
}
// at this point cellWriter contains the index and compressed data
// writeTo() transfers the index and compressed data in the format specified below. It is idempotent and
// copies the data each time it is called.
cellWriter.writeTo(writableChannel, fileSmoosher); // 2nd argument currently unused, may be null
Note that for use with CellReader, the contents written to the writableChannel must be available as a ByteBuffer.
Internal Storage Details
serialized data is of the form:
[cell index]
[payload storage]
each of these items is stored in compressed streams of blocks with a block index.
A BlockCompressedPayloadWriter stores byte[] payloads. These may be accessed by creating a
BlockCompressedPayloadReader over the produced ByteBuffer. Reads may be done by giving a location in the
uncompressed stream and a size.
NOTE: BlockCompressedPayloadBuffer does not store nulls on write(). However, the cellIndex stores an entry
with a size of 0 for nulls, and CellReader will return null for any null written.
[blockIndexSize:int]
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| block index
| compressed block # -> block start in compressed stream position (relative to data start)
|
| 0: [block position: int]
| 1: [block position: int]
| ...
| i: [block position: int]
| ...
| n: [block position: int]
| n+1: [total compressed size] // stored to simplify the invariant position(n+1) - position(n) = length(n)
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
[dataSize:int]
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| [compressed payload block 1]
| [compressed payload block 2]
| ...
| [compressed payload block n]
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
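The block index arithmetic above can be sketched as follows. This is only an illustration, not the library's implementation: BlockIndexSketch and its methods are hypothetical names, and it assumes that blockIndexSize counts bytes and that the ByteBuffer already uses the byte order the data was written with.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of the library: locates compressed blocks via the block index.
final class BlockIndexSketch {
    // Reads [blockIndexSize:int] followed by the int positions drawn in the diagram.
    // Assumption: blockIndexSize is a byte count, so it holds blockIndexSize / 4 entries.
    static int[] readBlockIndex(ByteBuffer buf) {
        int sizeInBytes = buf.getInt();
        int[] positions = new int[sizeInBytes / Integer.BYTES];
        for (int i = 0; i < positions.length; i++) {
            positions[i] = buf.getInt();
        }
        return positions;
    }

    // Start of a block, relative to the start of the compressed data.
    static int blockStart(int[] positions, int block) {
        return positions[block];
    }

    // The trailing "total compressed size" entry makes this valid for the last block too.
    static int blockLength(int[] positions, int block) {
        return positions[block + 1] - positions[block];
    }
}
```

With positions {0, 100, 250}, there are two blocks: block 0 covers 100 bytes starting at 0, and block 1 covers 150 bytes starting at 100.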
the CellIndexWriter stores an array of longs using the BlockCompressedPayloadWriter
logically this is an array of longs
| 0: start_0 : long
| 1: start_1 : long
| ...
| n: start_n : long
| n+1: start_n + length_n : long // i.e., the next position that would have been written to
| // used again for the invariant length_i = start_(i+1) - start_i
|
| but this will be stored as block compressed. Reads are done by addressing it as a long array of bytes
|
| [block index size]
| [block index]
|
| [data stream size]
| [block compressed payload stream]
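A minimal sketch of how a cell could be resolved from the logical long array, assuming the array has already been decompressed into a long[]. CellIndexSketch and its method names are hypothetical and only illustrate the invariant; they are not the library's API.

```java
// Hypothetical helper: resolves cell i to a (start, length) pair in the uncompressed payload stream.
final class CellIndexSketch {
    // Byte offset of entry i when the index is addressed as a flat array of 8-byte longs.
    static long entryOffset(int cell) {
        return (long) cell * Long.BYTES;
    }

    static long cellStart(long[] starts, int cell) {
        return starts[cell];
    }

    // The extra n+1 entry lets the last cell use the same subtraction as every other cell.
    static long cellLength(long[] starts, int cell) {
        return starts[cell + 1] - starts[cell];
    }

    // Per the note on nulls: a null write is recorded as an entry of size 0.
    static boolean isNullEntry(long[] starts, int cell) {
        return cellLength(starts, cell) == 0;
    }
}
```

For example, with starts {0, 5, 5, 12}, cell 0 is 5 bytes long, cell 1 has size 0 (written as null), and cell 2 is 7 bytes starting at position 5.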
resulting in
| [cell index size]
| ----cell index------------------------
| [block index size]
| [block index]
| [data stream size]
| [block compressed payload stream]
| -------------------------------------
| [data stream size]
| ----data stream------------------------
| [block index size]
| [block index]
| [data stream size]
| [block compressed payload stream]
| -------------------------------------
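The overall layout can be split into its two regions with plain ByteBuffer operations. The sketch below is an assumption-laden illustration, not the CellReader implementation: CellLayoutSketch is a hypothetical name, and it assumes both size prefixes are 4-byte ints that count bytes and that the buffer's byte order matches the writer's.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: splits the serialized form into the cell index and data stream regions.
final class CellLayoutSketch {
    // Returns {cellIndex, dataStream}; each region still starts with its own [block index size].
    static ByteBuffer[] split(ByteBuffer serialized) {
        // duplicate() and slice() reset the byte order to BIG_ENDIAN, so carry the order over explicitly.
        ByteBuffer buf = serialized.duplicate().order(serialized.order());

        int cellIndexSize = buf.getInt();            // assumed: int, in bytes
        ByteBuffer cellIndex = buf.slice().order(serialized.order());
        cellIndex.limit(cellIndexSize);
        buf.position(buf.position() + cellIndexSize);

        int dataStreamSize = buf.getInt();           // assumed: int, in bytes
        ByteBuffer dataStream = buf.slice().order(serialized.order());
        dataStream.limit(dataStreamSize);

        return new ByteBuffer[] {cellIndex, dataStream};
    }
}
```

Working on a duplicate leaves the caller's position and limit untouched, so the same ByteBuffer can be handed to several readers.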