During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe: moving the old data out can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in Dataset data and will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
/data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14
Final state:
/data/db/tbl1/type=cold/region=15 /data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14
name of the table
the data set with data from fromSubFolders, already repartitioned; it will be saved into newDataPath
path into which the combined and repartitioned data from the dataset will be committed
list of sub-folders to remove once the writing and committing of the combined data is successful
Timestamp of the compaction/append. Used to date the Trash folders.
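The fail-safe ordering described above (write and commit the new version first, only then move the old sub-folders into the trash) can be sketched roughly as follows. The function name and the trash location are illustrative assumptions taken from the example paths above, not the actual API:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical sketch of the fail-safe compaction commit; not the actual API.
def compactAndCommit(fs: FileSystem,
                     tableName: String,
                     data: Dataset[Row],
                     newDataPath: Path,
                     fromBase: Path,
                     cleanUpFolders: Seq[String],
                     appendTimestamp: Long): Unit = {
  // Step 1: fully write and commit the new, coalesced version
  data.write.parquet(newDataPath.toString)
  // Step 2: only after a successful write, move the old sub-folders into the
  // table's dated trash folder (e.g. /data/db/.Trash/tbl1/$appendTimestamp),
  // so a failure before this point never loses data
  val trashDir = new Path(s"/data/db/.Trash/$tableName/$appendTimestamp")
  fs.mkdirs(trashDir)
  cleanUpFolders.foreach { sub =>
    fs.rename(new Path(fromBase, sub), new Path(trashDir, sub))
  }
}
```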
Delete a given path
File or directory to delete
Recurse into directories
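Assuming the Hadoop FileSystem API, this is essentially a delegation to FileSystem.delete (hypothetical sketch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: FileSystem.delete handles both files and directories;
// the recursive flag controls whether directory contents are deleted too.
def deletePath(fs: FileSystem, path: Path, recursive: Boolean): Boolean =
  fs.delete(path, recursive)
```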
Glob a list of table paths with partitions, and apply a partial function to collect (filter + map) the results, transforming each FileStatus to any type A
return type of final sequence
parent folder which contains folders with table names
list of table names to search under
list of partition columns to include in the path
a partial function to transform FileStatus to any type A
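A minimal sketch of such a glob-and-collect helper, assuming the Hadoop FileSystem API; the signature and the `column=*` glob pattern for partition columns are illustrative assumptions:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical sketch; signature and glob pattern are assumptions.
def globTablePaths[A](fs: FileSystem,
                      basePath: Path,
                      tableNames: Seq[String],
                      tablePartitions: Seq[String],
                      parFun: PartialFunction[FileStatus, A]): Seq[A] =
  tableNames.flatMap { table =>
    // e.g. basePath/table/region=* for a single partition column "region"
    val rel = tablePartitions.map(_ + "=*").mkString("/")
    val glob =
      if (rel.isEmpty) new Path(basePath, table)
      else new Path(new Path(basePath, table), rel)
    // collect applies the partial function only where it is defined (filter + map)
    fs.globStatus(glob).toSeq.collect(parFun)
  }
```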
Lists tables in the basePath. It will ignore any folder/table that starts with '.'
parent folder which contains folders with table names
Creates folders on the physical storage.
path to create
true if the folder exists or was created without problems, false if there were problems creating all folders in the path
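Assuming the Hadoop FileSystem API, this could be as thin as a wrapper over mkdirs, which already returns true when the folder exists or was created (hypothetical sketch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// Hypothetical sketch: mkdirs creates all missing folders in the path and
// returns true if they exist afterwards; exceptions are mapped to false.
def createFolders(fs: FileSystem, path: Path): Boolean =
  Try(fs.mkdirs(path)).getOrElse(false)
```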
Opens a parquet file from the path, which can be a folder or a file. If partitioned sub-folders contain files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.
path to open
Some with the dataset if there is data, None if the path does not exist or cannot be opened
Exception
in case of connectivity problems
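A sketch of this behaviour with Spark's reader, where the mergeSchema option covers the schema-evolution case. Note this simplified version maps every failure to None, whereas the documented contract rethrows connectivity errors; distinguishing the two is left out here:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// Simplified, hypothetical sketch: any read failure becomes None, whereas a
// real implementation would rethrow connectivity exceptions.
def openParquet(spark: SparkSession, path: String): Option[DataFrame] =
  Try(spark.read.option("mergeSchema", "true").parquet(path)).toOption
```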
Checks if the path exists in the physical storage.
true if path exists in the storage layer
Purge the trash folder for a given table. All region folders that have been in the trash for longer than the given maximum age will be deleted.
Name of the table to purge the trash for
Timestamp of the current compaction/append. All ages will be compared relative to this timestamp
Maximum age of trashed regions to keep relative to the above timestamp
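Given the trash layout from the example above (.Trash/$table/$timestamp/region=*), a purge can compare the timestamp encoded in each trash sub-folder name against the append timestamp. This sketch assumes folder names are millisecond timestamps, which is an illustrative assumption:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch; assumes trash sub-folders are named by their
// trash timestamp in milliseconds.
def purgeTrash(fs: FileSystem,
               trashRoot: Path,
               tableName: String,
               appendTimestamp: Long,
               maxAgeMillis: Long): Unit = {
  val tableTrash = new Path(trashRoot, tableName)
  if (fs.exists(tableTrash)) {
    fs.listStatus(tableTrash)
      // keep folders younger than maxAge relative to the append timestamp
      .filter(s => appendTimestamp - s.getPath.getName.toLong > maxAgeMillis)
      .foreach(s => fs.delete(s.getPath, true))
  }
}
```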
Reads the table info back.
parent folder which contains folders with table names
name of the table to read the info for
Writes out static data about the audit table into basePath/table_name/.table_info file.
parent folder which contains folders with table names
static information about table, that will not change during table's existence
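Reading and writing the .table_info file could look roughly like this; the byte-array representation of the table info is an assumption, as the actual serialisation format is not specified here:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.ByteArrayOutputStream

// Hypothetical sketches; the serialised form of the table info is assumed
// to be an opaque byte array.
def writeTableInfo(fs: FileSystem, basePath: Path, tableName: String, info: Array[Byte]): Unit = {
  // basePath/table_name/.table_info, overwriting any previous version
  val out = fs.create(new Path(new Path(basePath, tableName), ".table_info"), true)
  try out.write(info) finally out.close()
}

def readTableInfo(fs: FileSystem, basePath: Path, tableName: String): Array[Byte] = {
  val in = fs.open(new Path(new Path(basePath, tableName), ".table_info"))
  val buf = new ByteArrayOutputStream()
  try IOUtils.copyBytes(in, buf, 4096, false) finally in.close()
  buf.toByteArray
}
```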
Commits a data set into the full path. The path is the full destination path into which the parquet will be placed after it has been fully written into the temp folder.
name of the table; will only be used to write into tmp
full destination path
dataset to write out; no partitioning will be performed on it
whether to overwrite the existing data in path. If false, folder contents will be merged
an optional subfolder used for writing temporary data, used like $temp/$tableName/$tempSubFolder. If not given, then the path becomes: $temp/$tableName/${path.getName}
Exception
can be thrown due to access permissions, connectivity, spark UDFs (as datasets are lazily executed)
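The write-to-temp-then-move commit described above might be sketched as follows; everything except the temp-path convention quoted in the parameter docs is an illustrative assumption:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, Row, SaveMode}

// Hypothetical sketch of commit via a temp folder; not the actual API.
def commitToPath(fs: FileSystem,
                 tempRoot: Path,
                 tableName: String,
                 path: Path,
                 ds: Dataset[Row],
                 overwrite: Boolean,
                 tempSubFolder: Option[String]): Unit = {
  // $temp/$tableName/$tempSubFolder, defaulting to $temp/$tableName/${path.getName}
  val tempPath = new Path(new Path(tempRoot, tableName), tempSubFolder.getOrElse(path.getName))
  // Fully materialise the parquet in the temp folder first; Spark UDF errors
  // surface here, as datasets are lazily executed
  ds.write.mode(SaveMode.Overwrite).parquet(tempPath.toString)
  if (overwrite) {
    // Replace the destination: remove old data, then move the new folder in
    fs.delete(path, true)
    fs.mkdirs(path.getParent)
    fs.rename(tempPath, path)
  } else {
    // Merge: move the freshly written files into the existing folder
    fs.mkdirs(path)
    fs.listStatus(tempPath).foreach(s => fs.rename(s.getPath, new Path(path, s.getPath.getName)))
  }
}
```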
During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe: moving the old data out can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in Dataset data and will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
/data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14
Final state:
/data/db/tbl1/type=cold/region=15 /data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14
name of the table
the data set with data from fromSubFolders, already repartitioned; it will be saved into newDataPath
path into which the combined and repartitioned data from the dataset will be committed
parent folder from which to remove the cleanUpFolders
list of sub-folders to remove once the writing and committing of the combined data is successful
Timestamp of the compaction/append. Used to date the Trash folders.
Contains operations that interact with physical storage. It also handles commits to the file system.
Created by Alexei Perelighin on 2018/03/05