Main abstraction for an audit table that a client application must use to store records with a timestamp.
Implementation of the AuditTable backed by append-only block storage such as HDFS.
Created by Alexei Perelighin on 2018/03/03
Static information about the table, persisted when the audit table is initialised.
name of the table
list of columns that make up the primary key; these are used for snapshot generation and record deduplication
application/custom metadata that is not used by this library.
whether to retain history for this table. If set to false, the table will be deduplicated on every compaction
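The four parameters above could be carried by a small immutable value. The following is only a sketch: the type name TableInfoSketch and all of its field names are hypothetical and chosen for illustration, not taken from the library.

```scala
// Hypothetical shape of the static table information.
// All names here are illustrative assumptions, not the library's API.
final case class TableInfoSketch(
  tableName: String,          // name of the table
  primaryKeys: Seq[String],   // columns used for snapshots and deduplication
  meta: Map[String, String],  // opaque application metadata, ignored by the library
  retainHistory: Boolean      // false => deduplicate on every compaction
)

object TableInfoSketchExample {
  // Example value for an imaginary "orders" table.
  val orders: TableInfoSketch = TableInfoSketch(
    tableName = "orders",
    primaryKeys = Seq("order_id"),
    meta = Map("owner" -> "billing"),
    retainHistory = true
  )
}
```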
name of the table
cold or hot: appended regions are added to hot and move to cold after compaction. Cold regions can also be compacted
id of the region; for simplicity, at least for now, it is a GUID
timestamp when region was created as a result of an append or compact operation
true if the region's data was compacted into another region, false if it was not
number of records in the region, can be used for optimisation and compaction decisions
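As an illustration of how the region attributes above might fit together, here is a minimal sketch. RegionSketch, StoreType, newHotRegion and compactionCandidates are hypothetical names, and the selection rule (hot regions that have not yet been compacted) is an assumption made for the example, not the library's compaction policy.

```scala
import java.time.Instant
import java.util.UUID

// Hypothetical store types: appends land in Hot, compaction output lands in Cold.
sealed trait StoreType
case object Hot extends StoreType
case object Cold extends StoreType

// Hypothetical region descriptor mirroring the attributes described above.
final case class RegionSketch(
  tableName: String,
  storeType: StoreType,
  regionId: String,      // a GUID rendered as a string
  createdOn: Instant,    // when the append or compaction produced this region
  isDeprecated: Boolean, // true once its data was compacted into another region
  recordCount: Long      // usable for optimisation and compaction decisions
)

object RegionSketchExample {
  // A new region produced by an append: hot, not yet compacted, fresh GUID.
  def newHotRegion(table: String, count: Long, now: Instant): RegionSketch =
    RegionSketch(table, Hot, UUID.randomUUID().toString, now, isDeprecated = false, count)

  // Assumed rule for the example: only live hot regions are compaction candidates.
  def compactionCandidates(regions: Seq[RegionSketch]): Seq[RegionSketch] =
    regions.filter(r => r.storeType == Hot && !r.isDeprecated)
}
```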
all records in the audit table will contain a column that shows the last updated time; this will be used to generate ingestion queries
Contains operations that interact with physical storage. Will also handle commit to the file system.
Created by Alexei Perelighin on 2018/03/05
Implementation around FileSystem and SparkSession with temporary and trash folders.
Thrown by the storage layer.
Created by Alexei Perelighin on 2018/03/04
Contains methods to create and open tables.
Created by Alexei Perelighin on 2018/04/11
Created by Vicky Avison on 11/05/18.
A compaction partitioner that partitions on the approximate maximum number of bytes to be in each partition file
A compaction partitioner that partitions on the approximate maximum number of cells (numRows * numColumns) to be in each partition file
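Both partitioners reduce to the same arithmetic: divide an approximate total size by the maximum allowed per partition file and round up. The sketch below shows only that calculation. PartitionerSketch, byBytes and byCells are hypothetical names, and how the total byte size or row count is obtained is left out as an assumption.

```scala
object PartitionerSketch {
  // Ceiling division, never returning fewer than one partition.
  // Assumes total is non-negative.
  private def ceilDiv(total: Long, maxPerPartition: Long): Int = {
    require(maxPerPartition > 0, "maxPerPartition must be positive")
    math.max(1L, (total + maxPerPartition - 1) / maxPerPartition).toInt
  }

  // Partition count from an approximate total size in bytes.
  def byBytes(approxTotalBytes: Long, maxBytesPerPartition: Long): Int =
    ceilDiv(approxTotalBytes, maxBytesPerPartition)

  // Partition count from the number of cells, where cells = numRows * numColumns.
  def byCells(numRows: Long, numColumns: Int, maxCellsPerPartition: Long): Int =
    ceilDiv(numRows * numColumns, maxCellsPerPartition)
}
```

For example, 1000 bytes with a 256-byte limit gives 4 partitions, and 101 rows of 10 columns with a 1000-cell limit gives 2.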
Main abstraction for an audit table that a client application must use to store records with a timestamp. It hides all details of the physical storage, so that client apps can use various file systems (e.g. HDFS, ADLS, S3, local) or key-value stores (e.g. HBase).
This abstraction can also produce a snapshot of the data, de-duplicated on the primary key and accurate as of a specified moment in time.
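The snapshot semantics can be illustrated on plain collections: drop records newer than the requested moment, then keep the latest remaining version per primary key. This is a sketch under assumptions, not the storage-backed implementation. SnapshotSketch and Rec are hypothetical, a single string stands in for the primary key, and treating the cut-off as inclusive is an assumption.

```scala
object SnapshotSketch {
  // Hypothetical record: one key column, one payload column, and the last-updated time.
  final case class Rec(key: String, value: String, lastUpdated: Long)

  // Latest version of each key among records with lastUpdated <= asOf.
  // The inclusive comparison is an assumption made for this example.
  def snapshot(records: Seq[Rec], asOf: Long): Map[String, Rec] =
    records
      .filter(_.lastUpdated <= asOf)
      .groupBy(_.key)
      .map { case (k, versions) => k -> versions.maxBy(_.lastUpdated) }
}
```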
It also surfaces custom attributes initialised during table creation, so that client applications do not need to store the relevant metadata in a separate storage. This also simplifies backup, restore and sharing of data between environments.
Some storage layers can be quite inefficient at storing many appends in multiple files, and storage optimisation, aka compaction, should not interfere with normal operation of the application. Therefore the application should be able to control when compaction takes place.
An instance of AuditTable represents a functional state: once an operation has modified the data, do not use that instance again.
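One common way to model such a functional state is to have every modifying operation return a new instance that the caller continues with. The sketch below shows that pattern only. TableState and its append method are hypothetical and much simpler than the real table, which also changes physical storage, which is why the earlier instance must be treated as stale.

```scala
object FunctionalStateSketch {
  // Hypothetical immutable state: just the ids of the regions known to the table.
  final case class TableState(regions: Vector[String]) {
    // A modifying operation returns a new state instead of mutating this one.
    def append(regionId: String): TableState = copy(regions = regions :+ regionId)
  }

  // Callers thread the returned state through each step and drop the old reference.
  def appendAll(start: TableState, regionIds: Seq[String]): TableState =
    regionIds.foldLeft(start)((state, id) => state.append(id))
}
```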
There are 2 types of operations on the table:
Created by Alexei Perelighin on 2018/03/03