Class

org.apache.spark.sql.execution.columnar.encoding

ColumnDeltaEncoder

final class ColumnDeltaEncoder extends ColumnEncoder

Encodes a delta for a ColumnFormatValue, produced by an update operation that changes one or more values. Updates are applied in optimized batches as far as possible.

The delta encoding format is straightforward: in addition to the normal column encoding, it stores the positions of the updated values in the full column. The layout is:

 .----------------------- Base encoding scheme (4 bytes)
|    .------------------- Null bitset size as number of longs N (4 bytes)
|   |
|   |   .---------------- Null bitset longs (8 x N bytes,
|   |   |                                    empty if null count is zero)
|   |   |  .------------- Positions in full column value
|   |   |  |
|   |   |  |    .-------- Encoded non-null elements
|   |   |  |    |
V   V   V  V    V
+---+---+--+----+--------------+
|   |   |  |    |   ...   ...  |
+---+---+--+----+--------------+
 \-----/ \--------------------/
  header           body
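
As a rough illustration of the header portion of this layout, the following Scala sketch writes the 4-byte encoding scheme id, the 4-byte null bitset word count N and the N bitset longs into a ByteBuffer. The names DeltaLayoutSketch, nullWords and writeHeader are hypothetical and not part of this class, and the little-endian byte order is an assumption. The positions and encoded elements that follow the header are not specified in enough detail here, so they are left out.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical sketch of the delta header; not the actual ColumnDeltaEncoder code.
object DeltaLayoutSketch {
  // 4 bytes for the base encoding scheme + 4 bytes for the null bitset size N.
  val HeaderFixedBytes = 8

  // Number of 64-bit words for a bitset over numBits bits.
  // The bitset is empty when there are no nulls, as in the layout above.
  def nullWords(numBits: Int, nullCount: Int): Int =
    if (nullCount == 0) 0 else (numBits + 63) / 64

  // Writes scheme id, N and the N bitset longs; the returned buffer is
  // positioned where the positions section would begin.
  def writeHeader(typeId: Int, nulls: Array[Long]): ByteBuffer = {
    val buf = ByteBuffer
      .allocate(HeaderFixedBytes + 8 * nulls.length)
      .order(ByteOrder.LITTLE_ENDIAN) // assumed byte order
    buf.putInt(typeId)
    buf.putInt(nulls.length)
    nulls.foreach(w => buf.putLong(w))
    buf
  }
}
```

With one bitset word the header occupies 4 + 4 + 8 = 16 bytes, so the body would start at offset 16.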

Whether a value is a delta is determined by the "deltaHierarchy" field in ColumnFormatValue and by a negative columnIndex in ColumnFormatKey. The encoding typeId itself stores nothing separate to mark a delta.

An alternative would be to store the position before each encoded element, but that does not work for schemes such as run-length encoding, which write nothing for an element that falls within the current run.
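
The run-length problem can be seen in a small Scala sketch. RunLengthSketch and encode are hypothetical names and this is a generic run-length encoder, not the one used by the column store. It only shows that the number of encoded entries can be smaller than the number of updated positions, so there is no per-element slot to put a position in front of.

```scala
// Generic run-length encoder used only for illustration.
object RunLengthSketch {
  // Encodes values as (value, runLength) pairs; a value equal to the
  // previous one only increments the last run and emits no new entry.
  def encode(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (runs, x) =>
      if (runs.nonEmpty && runs.last._1 == x)
        runs.init :+ ((x, runs.last._2 + 1))
      else
        runs :+ ((x, 1))
    }
}
```

For example, four updated values 5, 5, 5, 9 encode to only two runs, Vector((5,3), (9,1)), while there are four positions to record. Keeping the positions in their own section avoids this mismatch.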

A set of newly updated column values is merged with the existing encoded values held in the current delta with the smallest hierarchy depth (i.e. the one that has a maximum size of 100). Each delta can grow up to a limit, after which it is subsumed into a larger delta, creating a hierarchy of deltas. So the base delta holds up to 100 entries or so, the next level up holds up to, say, 1000 entries, and so on until the full ColumnFormatValue size is reached. This design tries to minimize disk writes at the cost of some scan overhead for columns that see a large number of updates. The hierarchy is expected to stay small, no more than 3-4 levels, to strike a good balance between write overhead and scan overhead.
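
The growth of the hierarchy can be sketched as follows. This is only an assumption-based illustration: the base limit of 100 and the factor of 10 are taken from the approximate figures in the paragraph above, DeltaHierarchySketch, maxSize and levels are hypothetical names, and the actual getMaxSizeForHierarchy may compute its limits differently.

```scala
// Hypothetical model of the delta hierarchy sizes described above.
object DeltaHierarchySketch {
  val BaseLimit = 100   // assumed size limit of the smallest delta
  val GrowthFactor = 10 // assumed growth per hierarchy level

  // Maximum number of entries at the given depth, capped at the column size.
  def maxSize(depth: Int, numColumnRows: Int): Int = {
    var limit = BaseLimit.toLong
    var d = 0
    while (d < depth && limit < numColumnRows) {
      limit *= GrowthFactor
      d += 1
    }
    math.min(limit, numColumnRows.toLong).toInt
  }

  // Number of levels needed until a delta can cover the full column.
  def levels(numColumnRows: Int): Int = {
    var depth = 0
    while (maxSize(depth, numColumnRows) < numColumnRows) depth += 1
    depth + 1
  }
}
```

Under these assumptions a column of 10000 rows gets limits of 100, 1000 and 10000, that is three levels, which is in line with the 3-4 levels mentioned above.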

Linear Supertypes

ColumnEncoder, ColumnEncoding, AnyRef, Any

Instance Constructors

  1. new ColumnDeltaEncoder(hierarchyDepth: Int)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final var _lowerDecimal: Decimal

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  5. final var _lowerDouble: Double

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  6. final var _lowerLong: Long

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  7. final var _lowerStr: UTF8String

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  8. final var _upperDecimal: Decimal

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  9. final var _upperDouble: Double

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  10. final var _upperLong: Long

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  11. final var _upperStr: UTF8String

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  12. final var allocator: BufferAllocator

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  13. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  14. final var baseDataOffset: Long

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  15. final def baseOffset: Long

    Definition Classes
    ColumnEncoder
  16. final var baseTypeOffset: Long

    Temporary offset results to be read by generated code immediately after initializeComplexType, so not an issue for nested types.

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  17. final def buffer: AnyRef

    Definition Classes
    ColumnEncoder
  18. final def clearSource(newSize: Int, releaseData: Boolean): Unit

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  19. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. def close(): Unit

    Close and relinquish all resources of this encoder. The encoder may no longer be usable after this call.

    Definition Classes
    ColumnEncoder
  21. final var columnBeginPosition: Long

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  22. final var columnBytes: AnyRef

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  23. final var columnData: ByteBuffer

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  24. final var columnEndPosition: Long

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  25. final def copyTo(dest: ByteBuffer, srcOffset: Int, endOffset: Int): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  26. def defaultSize(dataType: DataType): Int

    Definition Classes
    ColumnEncoder
  27. def encodedSize(cursor: Long, dataBeginPosition: Long): Long

    The final size of the encoder column (excluding header and nulls), which should match the size occupied after finish but is computed without writing anything.

    Definition Classes
    ColumnEncoder
  28. final def ensureCapacity(cursor: Long, required: Int): Long

    Definition Classes
    ColumnEncoder
  29. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  30. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  31. final def expand(cursor: Long, required: Int): Long

    Expand the underlying bytes if required and return the new cursor.

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  32. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  33. def finish(encoderCursor: Long, numBaseRows: Int): ByteBuffer
  34. def finish(encoderCursor: Long): ByteBuffer

    Finish encoding the current column and return the data as a ByteBuffer. The encoder can be reused for new column data of same type again.

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  35. def flushWithoutFinish(cursor: Long): Long

    Flush any pending data when finish is not being invoked explicitly.

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  36. final var forComplexType: Boolean

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  37. final def getBaseDataOffset: Long

    Definition Classes
    ColumnEncoder
  38. final def getBaseTypeOffset: Long

    Definition Classes
    ColumnEncoder
  39. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  40. def getMaxSizeForHierarchy(numColumnRows: Int): Int
  41. def getNumNullWords: Int

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  42. def getRealEncoder: ColumnEncoder
  43. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  44. val hierarchyDepth: Int
  45. def initSizeInBytes(dataType: DataType, initSize: Long, defSize: Int): Long

    Definition Classes
    ColumnEncoder
  46. def initialize(dataType: DataType, nullable: Boolean, initSize: Int, withHeader: Boolean, allocator: BufferAllocator, minBufferSize: Int = 1): Long

    Initialize this ColumnEncoder.

    dataType

    DataType of the field to be written

    nullable

    True if the field is nullable, false otherwise

    initSize

    Initial estimated number of elements to be written

    withHeader

    True if header is to be written to data (typeId etc)

    allocator

    the BufferAllocator to use for the data

    minBufferSize

    the minimum size of initial buffer to use (ignored if <= 0)

    returns

    initial position of the cursor that caller must use to write

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  47. final def initialize(dataType: DataType, nullable: Boolean, initSize: Int, withHeader: Boolean, allocator: BufferAllocator): Long

    Initialize this ColumnEncoder.

    dataType

    DataType of the field to be written

    nullable

    True if the field is nullable, false otherwise

    initSize

    Initial estimated number of elements to be written

    withHeader

    True if header is to be written to data (typeId etc)

    allocator

    the BufferAllocator to use for the data

    returns

    initial position of the cursor that caller must use to write

    Definition Classes
    ColumnEncoder
  48. final def initialize(field: StructField, initSize: Int, withHeader: Boolean, allocator: BufferAllocator): Long

    Definition Classes
    ColumnEncoder
  49. final def initialize(field: StructField, initSize: Int, withHeader: Boolean): Long

    Definition Classes
    ColumnEncoder
  50. final def initializeComplexType(cursor: Long, numElements: Int, skipBytes: Int, writeNumElements: Boolean): Long

    Complex types are written similarly to UnsafeRows but independently of platform endianness (the format is always little endian), which makes them appropriate for storage. There are also other minor differences related to size writing and interval type handling. The general layout is:

    .--------------------------- Optional total size including itself (4 bytes)
    |   .----------------------- Optional number of elements (4 bytes)
    |   |   .------------------- Null bitset longs (8 x (N / 8) bytes)
    |   |   |
    |   |   |     .------------- Offsets+Sizes of elements (8 x N bytes)
    |   |   |     |     .------- Variable length elements
    V   V   V     V     V
    +---+---+-----+-------------+
    |   |   | ... | ... ... ... |
    +---+---+-----+-------------+
     \-----/ \-----------------/
      header      body

    The above generic layout is used for ARRAY and STRUCT types.

    The total size of the data is written for top-level complex types. Nested complex objects write their sizes in the "Offsets+Sizes" portion in the respective parent object.

    ARRAY types also write the number of elements in the array in the header while STRUCT types skip it since it is fixed in the meta-data.

    The null bitset follows the header. To keep reads aligned at 8-byte boundaries while saving space, the implementation combines the header and the null bitset, then pads them together to an 8-byte boundary (in particular, it treats the header as some additional empty fields in the null bitset itself).

    After this comes the "Offsets+Sizes" portion, which holds the offset and size of each variable-length element. Fixed-length elements of up to 8 bytes are written directly in this portion. Variable-length elements have their offset (from the start of this array) and size encoded here as a long (4 bytes each for offset and size). Fixed-width elements larger than 8 bytes are encoded like variable-length elements. CalendarInterval is currently the only such type: its "months" portion is encoded into the size, while its "microseconds" portion is written into the variable-length part.

    MAP types are written as an ARRAY of keys followed by ARRAY of values like in Spark. To keep things simpler both ARRAYs always have the optional size header at their respective starts which together determine the total size of the encoded MAP object. For nested MAP types, the total size is skipped from the "Offsets+Sizes" portion and only the offset is written (which is the start of key ARRAY).
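
The packing of an offset and a size into a single long in the "Offsets+Sizes" portion can be sketched in Scala as below. OffsetAndSizeSketch and its methods are hypothetical names, and placing the offset in the upper 32 bits follows the convention used by Spark's UnsafeRow; whether this encoder uses the same bit order is an assumption.

```scala
// Hypothetical helper showing 4 bytes of offset plus 4 bytes of size in one long.
object OffsetAndSizeSketch {
  // Offset in the upper 32 bits, size in the lower 32 bits (assumed order).
  def pack(offset: Int, size: Int): Long =
    (offset.toLong << 32) | (size.toLong & 0xffffffffL)

  // Recovers the offset from the upper half.
  def offsetOf(packed: Long): Int = (packed >>> 32).toInt

  // Recovers the size from the lower half.
  def sizeOf(packed: Long): Int = packed.toInt
}
```

For instance, an element at offset 16 with size 5 packs to 0x0000001000000005L and unpacks to the same pair.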

    Definition Classes
    ColumnEncoder
  51. def initializeLimits(): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  52. def initializeNulls(initSize: Int): Int

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  53. final def isAllocatorFinal: Boolean

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  54. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  55. def isNullable: Boolean

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  56. final def lowerDecimal: Decimal

    Definition Classes
    ColumnEncoder
  57. final def lowerDouble: Double

    Definition Classes
    ColumnEncoder
  58. final def lowerLong: Long

    Definition Classes
    ColumnEncoder
  59. final def lowerString: UTF8String

    Definition Classes
    ColumnEncoder
  60. def merge(newValue: ByteBuffer, existingValue: ByteBuffer, existingIsDelta: Boolean, field: StructField): ByteBuffer
  61. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  62. final def notify(): Unit

    Definition Classes
    AnyRef
  63. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  64. def nullCount: Int

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  65. def offset(cursor: Long): Long

    Definition Classes
    ColumnEncoder
  66. final def releaseForReuse(newSize: Int): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  67. final var reuseUsedSize: Int

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  68. final def setAllocator(allocator: BufferAllocator): Unit

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  69. final def setOffsetAndSize(cursor: Long, fieldOffset: Long, baseOffset: Long, size: Int): Unit

    Definition Classes
    ColumnEncoder
    Annotations
    @inline()
  70. final def setSource(buffer: ByteBuffer, releaseOld: Boolean): Unit

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnEncoder
  71. def setUpdatePosition(position: Int): Unit
  72. def sizeInBytes(cursor: Long): Long

    Definition Classes
    ColumnEncoder
  73. final def storageAllocator: BufferAllocator

    Get the allocator for the final data to be sent for storage. For now it is on-heap in embedded mode and off-heap in connector mode, to minimize copying in both cases. This should be changed to use the allocator matching the storage used by the column store in embedded mode.

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  74. def supports(dataType: DataType): Boolean

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoding
  75. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  76. def toString(): String

    Definition Classes
    AnyRef → Any
  77. def typeId: Int

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoding
  78. final def updateDecimalStats(value: Decimal): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  79. final def updateDoubleStats(value: Double): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  80. final def updateLongStats(value: Long): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  81. final def updateStringStats(value: UTF8String): Unit

    Attributes
    protected
    Definition Classes
    ColumnEncoder
  82. final def upperDecimal: Decimal

    Definition Classes
    ColumnEncoder
  83. final def upperDouble: Double

    Definition Classes
    ColumnEncoder
  84. final def upperLong: Long

    Definition Classes
    ColumnEncoder
  85. final def upperString: UTF8String

    Definition Classes
    ColumnEncoder
  86. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  87. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  88. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  89. def writeBinary(cursor: Long, value: Array[Byte]): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  90. def writeBoolean(cursor: Long, value: Boolean): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  91. def writeBooleanUnchecked(cursor: Long, value: Boolean): Long

    Definition Classes
    ColumnEncoder
  92. def writeByte(cursor: Long, value: Byte): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  93. def writeByteUnchecked(cursor: Long, value: Byte): Long

    Definition Classes
    ColumnEncoder
  94. def writeDate(cursor: Long, value: Int): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  95. def writeDecimal(cursor: Long, value: Decimal, position: Int, precision: Int, scale: Int): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  96. def writeDouble(cursor: Long, value: Double): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  97. def writeDoubleUnchecked(cursor: Long, value: Double): Long

    Definition Classes
    ColumnEncoder
  98. def writeFloat(cursor: Long, value: Float): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  99. def writeFloatUnchecked(cursor: Long, value: Float): Long

    Definition Classes
    ColumnEncoder
  100. def writeInt(cursor: Long, value: Int): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  101. def writeIntUnchecked(cursor: Long, value: Int): Long

    Definition Classes
    ColumnEncoder
  102. def writeInternals(columnBytes: AnyRef, cursor: Long): Long

    Write any internal structures (e.g. dictionary) of the encoder that would normally be written by finish after the header and null bit mask.

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  103. def writeInterval(cursor: Long, value: CalendarInterval): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  104. def writeIsNull(position: Int): Unit

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  105. def writeLong(cursor: Long, value: Long): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  106. def writeLongDecimal(cursor: Long, value: Decimal, position: Int, precision: Int, scale: Int): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  107. def writeLongUnchecked(cursor: Long, value: Long): Long

    Definition Classes
    ColumnEncoder
  108. def writeNulls(columnBytes: AnyRef, cursor: Long, numWords: Int): Long

    Attributes
    protected[org.apache.spark.sql]
    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  109. def writeShort(cursor: Long, value: Short): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  110. def writeShortUnchecked(cursor: Long, value: Short): Long

    Definition Classes
    ColumnEncoder
  111. final def writeStructBinary(cursor: Long, value: Array[Byte], fieldOffset: Long, baseOffset: Long): Long

    Definition Classes
    ColumnEncoder
  112. final def writeStructDecimal(cursor: Long, value: Decimal, fieldOffset: Long, baseOffset: Long): Long

    Definition Classes
    ColumnEncoder
  113. final def writeStructInterval(cursor: Long, value: CalendarInterval, fieldOffset: Long, baseOffset: Long): Long

    Definition Classes
    ColumnEncoder
  114. final def writeStructUTF8String(cursor: Long, value: UTF8String, fieldOffset: Long, baseOffset: Long): Long

    Definition Classes
    ColumnEncoder
  115. def writeTimestamp(cursor: Long, value: Long): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  116. def writeUTF8String(cursor: Long, value: UTF8String): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
  117. def writeUnsafeData(cursor: Long, baseObject: AnyRef, baseOffset: Long, numBytes: Int): Long

    Definition Classes
    ColumnDeltaEncoder → ColumnEncoder
