During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe: moving the old data out can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in Dataset data and will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
/data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14
Final state:
/data/db/tbl1/type=cold/region=15 /data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14
name of the table
the data set with data from fromSubFolders, already repartitioned; it will be saved into newDataPath
path into which the combined and repartitioned data from the dataset will be committed
list of sub-folders to remove once the writing and committing of the combined data is successful
Timestamp of the compaction/append. Used to date the Trash folders.
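The fail-safe ordering described above (write and commit the new version first, only then move the old sub-folders into the trash) can be sketched roughly as follows. The function name and the trash location are illustrative assumptions taken from the example paths above, not the actual API:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical sketch of the fail-safe compaction commit; not the actual API.
def compactAndCommit(fs: FileSystem,
                     tableName: String,
                     data: Dataset[Row],
                     newDataPath: Path,
                     fromBase: Path,
                     cleanUpFolders: Seq[String],
                     appendTimestamp: Long): Unit = {
  // Step 1: fully write and commit the new, coalesced version
  data.write.parquet(newDataPath.toString)
  // Step 2: only after a successful write, move the old sub-folders into the
  // table's dated trash folder (e.g. /data/db/.Trash/tbl1/$appendTimestamp),
  // so a failure before this point never loses data
  val trashDir = new Path(s"/data/db/.Trash/$tableName/$appendTimestamp")
  fs.mkdirs(trashDir)
  cleanUpFolders.foreach { sub =>
    fs.rename(new Path(fromBase, sub), new Path(trashDir, sub))
  }
}
```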
Delete a given path
File or directory to delete
Recurse into directories
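Assuming the Hadoop FileSystem API, this is essentially a delegation to FileSystem.delete (hypothetical sketch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: FileSystem.delete handles both files and directories;
// the recursive flag controls whether directory contents are deleted too.
def deletePath(fs: FileSystem, path: Path, recursive: Boolean): Boolean =
  fs.delete(path, recursive)
```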
Glob a list of table paths with partitions, and apply a partial function to collect (filter + map) the results, transforming each FileStatus to any type A
return type of final sequence
parent folder which contains folders with table names
list of table names to search under
list of partition columns to include in the path
a partial function to transform FileStatus to any type A
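A minimal sketch of such a glob-and-collect helper, assuming the Hadoop FileSystem API; the signature and the `column=*` glob pattern for partition columns are illustrative assumptions:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical sketch; signature and glob pattern are assumptions.
def globTablePaths[A](fs: FileSystem,
                      basePath: Path,
                      tableNames: Seq[String],
                      tablePartitions: Seq[String],
                      parFun: PartialFunction[FileStatus, A]): Seq[A] =
  tableNames.flatMap { table =>
    // e.g. basePath/table/region=* for a single partition column "region"
    val rel = tablePartitions.map(_ + "=*").mkString("/")
    val glob =
      if (rel.isEmpty) new Path(basePath, table)
      else new Path(new Path(basePath, table), rel)
    // collect applies the partial function only where it is defined (filter + map)
    fs.globStatus(glob).toSeq.collect(parFun)
  }
```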
Lists tables in the basePath. It will ignore any folder/table that starts with '.'
parent folder which contains folders with table names
Creates folders on the physical storage.
path to create
true if the folder exists or was created without problems, false if there were problems creating all folders in the path
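Assuming the Hadoop FileSystem API, this could be as thin as a wrapper over mkdirs, which already returns true when the folder exists or was created (hypothetical sketch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// Hypothetical sketch: mkdirs creates all missing folders in the path and
// returns true if they exist afterwards; exceptions are mapped to false.
def createFolders(fs: FileSystem, path: Path): Boolean =
  Try(fs.mkdirs(path)).getOrElse(false)
```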
Opens a parquet file from the path, which can be a folder or a file. If partitioned sub-folders contain files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.
path to open
Some with the dataset if there is data, None if the path does not exist or cannot be opened
Exception
in case of connectivity problems
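A sketch of this behaviour with Spark's reader, where the mergeSchema option covers the schema-evolution case. Note this simplified version maps every failure to None, whereas the documented contract rethrows connectivity errors; distinguishing the two is left out here:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// Simplified, hypothetical sketch: any read failure becomes None, whereas a
// real implementation would rethrow connectivity exceptions.
def openParquet(spark: SparkSession, path: String): Option[DataFrame] =
  Try(spark.read.option("mergeSchema", "true").parquet(path)).toOption
```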
Checks if the path exists in the physical storage.
true if path exists in the storage layer
Purge the trash folder for a given table. All region folders that have been in the trash for longer than the given maximum age will be deleted.
Name of the table to purge the trash for
Timestamp of the current compaction/append. All ages will be compared relative to this timestamp
Maximum age of trashed regions to keep relative to the above timestamp
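Given the trash layout from the example above (.Trash/$table/$timestamp/region=*), a purge can compare the timestamp encoded in each trash sub-folder name against the append timestamp. This sketch assumes folder names are millisecond timestamps, which is an illustrative assumption:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch; assumes trash sub-folders are named by their
// trash timestamp in milliseconds.
def purgeTrash(fs: FileSystem,
               trashRoot: Path,
               tableName: String,
               appendTimestamp: Long,
               maxAgeMillis: Long): Unit = {
  val tableTrash = new Path(trashRoot, tableName)
  if (fs.exists(tableTrash)) {
    fs.listStatus(tableTrash)
      // keep folders younger than maxAge relative to the append timestamp
      .filter(s => appendTimestamp - s.getPath.getName.toLong > maxAgeMillis)
      .foreach(s => fs.delete(s.getPath, true))
  }
}
```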
Reads the table info back.
parent folder which contains folders with table names
name of the table to read the info for
Writes out static data about the audit table into basePath/table_name/.table_info file.
parent folder which contains folders with table names
static information about table, that will not change during table's existence
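Reading and writing the .table_info file could look roughly like this; the byte-array representation of the table info is an assumption, as the actual serialisation format is not specified here:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.ByteArrayOutputStream

// Hypothetical sketches; the serialised form of the table info is assumed
// to be an opaque byte array.
def writeTableInfo(fs: FileSystem, basePath: Path, tableName: String, info: Array[Byte]): Unit = {
  // basePath/table_name/.table_info, overwriting any previous version
  val out = fs.create(new Path(new Path(basePath, tableName), ".table_info"), true)
  try out.write(info) finally out.close()
}

def readTableInfo(fs: FileSystem, basePath: Path, tableName: String): Array[Byte] = {
  val in = fs.open(new Path(new Path(basePath, tableName), ".table_info"))
  val buf = new ByteArrayOutputStream()
  try IOUtils.copyBytes(in, buf, 4096, false) finally in.close()
  buf.toByteArray
}
```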
Commits a data set into the full path. The path is the full destination path into which the parquet will be placed after it has been fully written into the temp folder.
name of the table; will only be used to write into tmp
full destination path
dataset to write out; no partitioning will be performed on it
whether to overwrite the existing data in path. If false, folder contents will be merged
an optional subfolder used for writing temporary data, used like $temp/$tableName/$tempSubFolder. If not given, then the path becomes: $temp/$tableName/${path.getName}
Exception
can be thrown due to access permissions, connectivity, spark UDFs (as datasets are lazily executed)
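The write-to-temp-then-move commit described above might be sketched as follows; everything except the temp-path convention quoted in the parameter docs is an illustrative assumption:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, Row, SaveMode}

// Hypothetical sketch of commit via a temp folder; not the actual API.
def commitToPath(fs: FileSystem,
                 tempRoot: Path,
                 tableName: String,
                 path: Path,
                 ds: Dataset[Row],
                 overwrite: Boolean,
                 tempSubFolder: Option[String]): Unit = {
  // $temp/$tableName/$tempSubFolder, defaulting to $temp/$tableName/${path.getName}
  val tempPath = new Path(new Path(tempRoot, tableName), tempSubFolder.getOrElse(path.getName))
  // Fully materialise the parquet in the temp folder first; Spark UDF errors
  // surface here, as datasets are lazily executed
  ds.write.mode(SaveMode.Overwrite).parquet(tempPath.toString)
  if (overwrite) {
    // Replace the destination: remove old data, then move the new folder in
    fs.delete(path, true)
    fs.mkdirs(path.getParent)
    fs.rename(tempPath, path)
  } else {
    // Merge: move the freshly written files into the existing folder
    fs.mkdirs(path)
    fs.listStatus(tempPath).foreach(s => fs.rename(s.getPath, new Path(path, s.getPath.getName)))
  }
}
```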
During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe: moving the old data out can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in Dataset data and will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
/data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14
Final state:
/data/db/tbl1/type=cold/region=15 /data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14
name of the table
the data set with data from fromSubFolders, already repartitioned; it will be saved into newDataPath
path into which the combined and repartitioned data from the dataset will be committed
parent folder from which to remove the cleanUpFolders
list of sub-folders to remove once the writing and committing of the combined data is successful
Timestamp of the compaction/append. Used to date the Trash folders.
Contains operations that interact with physical storage. It also handles commits to the file system.
Created by Alexei Perelighin on 2018/03/05