Trait

com.coxautodata.waimak.storage

FileStorageOps

trait FileStorageOps extends AnyRef

Contains operations that interact with physical storage. It also handles committing data to the file system.

Created by Alexei Perelighin on 2018/03/05

Linear Supertypes
AnyRef, Any

Abstract Value Members

  1. abstract def atomicWriteAndCleanup(tableName: String, compactedData: Dataset[_], newDataPath: Path, cleanUpPaths: Seq[Path], appendTimestamp: Timestamp): Unit

    During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation must be fail-safe: old data can only be moved out after the new version is fully written and committed.

    E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset, and will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.

    Starting state:

      /data/db/tbl1/type=hot/region=11
      .../region=12
      .../region=13
      .../region=14

    Final state:

      /data/db/tbl1/type=cold/region=15
      /data/db/.Trash/tbl1/${appendTimestamp}/region=11
      .../region=12
      .../region=13
      .../region=14

    tableName
      name of the table

    compactedData
      the dataset with data from fromSubFolders, already repartitioned; it will be saved into newDataPath

    newDataPath
      path into which the combined and repartitioned data will be committed

    cleanUpPaths
      list of sub-folders to remove once the writing and committing of the combined data has succeeded

    appendTimestamp
      timestamp of the compaction/append, used to date the Trash folders
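
    The call pattern for the compaction example above can be sketched as follows. This is a hypothetical sketch, not part of the API documentation: `ops` stands for some concrete FileStorageOps implementation and `compacted` for the already-repartitioned dataset, both assumed in scope.

    ```scala
    import java.sql.Timestamp
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.Dataset
    import com.coxautodata.waimak.storage.FileStorageOps

    // Hypothetical compaction step matching the example paths above.
    def compactHotRegions(ops: FileStorageOps, compacted: Dataset[_]): Unit = {
      val newDataPath = new Path("/data/db/tbl1/type=cold/region=15")
      val cleanUpPaths = Seq("region=11", "region=12", "region=13", "region=14")
        .map(r => new Path(s"/data/db/tbl1/type=hot/$r"))
      // The old region folders are moved to trash only after region=15
      // is fully written and committed
      ops.atomicWriteAndCleanup("tbl1", compacted, newDataPath, cleanUpPaths,
        new Timestamp(System.currentTimeMillis()))
    }
    ```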

  2. abstract def deletePath(path: Path, recursive: Boolean): Unit

    Delete a given path.

    path
      file or directory to delete

    recursive
      recurse into directories

  3. abstract def globTablePaths[A](basePath: Path, tableNames: Seq[String], tablePartitions: Seq[String], parFun: PartialFunction[FileStatus, A]): Seq[A]

    Glob a list of table paths with partitions, and apply a partial function to collect (filter + map) the results, transforming each FileStatus into any type A.

    A
      return type of the final sequence

    basePath
      parent folder which contains folders with table names

    tableNames
      list of table names to search under

    tablePartitions
      list of partition columns to include in the path

    parFun
      a partial function to transform a FileStatus into any type A
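
    The partial function acts as a combined filter and map. A minimal hypothetical usage, assuming an in-scope FileStorageOps implementation `ops`, that collects the paths of all partition directories for two tables:

    ```scala
    import org.apache.hadoop.fs.{FileStatus, Path}
    import com.coxautodata.waimak.storage.FileStorageOps

    // Hypothetical: collect paths of all region directories under two tables.
    def regionPaths(ops: FileStorageOps): Seq[Path] =
      ops.globTablePaths(
        new Path("/data/db"),
        Seq("tbl1", "tbl2"),
        Seq("type=*", "region=*"),
        // keep only directories, and map each FileStatus to its Path
        { case status: FileStatus if status.isDirectory => status.getPath }
      )
    ```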

  4. abstract def listTables(basePath: Path): Seq[String]

    Lists tables in the basePath. It will ignore any folder/table whose name starts with '.'.

    basePath
      parent folder which contains folders with table names

  5. abstract def mkdirs(path: Path): Boolean

    Creates folders on the physical storage.

    path
      path to create

    returns
      true if the folder exists or was created without problems, false if there were problems creating all folders in the path

  6. abstract def openParquet(path: Path, paths: Path*): Option[Dataset[_]]

    Opens a parquet file from the path, which can be a folder or a file. If there are partitioned sub-folders containing files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.

    path
      path to open

    returns
      Some with the dataset if there is data, None if the path does not exist or cannot be opened

    Exceptions thrown
      Exception in cases of connectivity problems
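
    Because openParquet returns an Option, a missing table can be handled without pre-checking the path. A hedged sketch, assuming an in-scope `ops: FileStorageOps`:

    ```scala
    import org.apache.hadoop.fs.Path
    import com.coxautodata.waimak.storage.FileStorageOps

    // Hypothetical: count rows of tbl1, treating an absent path as an empty table.
    def rowCount(ops: FileStorageOps): Long =
      ops.openParquet(new Path("/data/db/tbl1")) match {
        case Some(ds) => ds.count()
        case None     => 0L // path does not exist or could not be opened
      }
    ```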

  7. abstract def pathExists(path: Path): Boolean

    Checks if the path exists in the physical storage.

    returns
      true if the path exists in the storage layer

  8. abstract def purgeTrash(tableName: String, appendTimestamp: Timestamp, trashMaxAge: Duration): Unit

    Purge the trash folder for a given table. All region folders that were placed into the trash earlier than the given maximum age allows will be deleted.

    tableName
      name of the table to purge the trash for

    appendTimestamp
      timestamp of the current compaction/append; all ages will be compared relative to this timestamp

    trashMaxAge
      maximum age of trashed regions to keep, relative to the above timestamp
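
    A retention policy on top of purgeTrash might look like the sketch below. It assumes `ops: FileStorageOps` is in scope and that the Duration parameter is java.time.Duration; both are assumptions, so check the actual imports in the signature.

    ```scala
    import java.sql.Timestamp
    import java.time.Duration
    import com.coxautodata.waimak.storage.FileStorageOps

    // Hypothetical: after a compaction stamped with appendTs, delete trashed
    // regions of tbl1 that are older than 7 days relative to that timestamp.
    def applyTrashRetention(ops: FileStorageOps, appendTs: Timestamp): Unit =
      ops.purgeTrash("tbl1", appendTs, Duration.ofDays(7))
    ```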

  9. abstract def readAuditTableInfo(basePath: Path, tableName: String): Try[AuditTableInfo]

    Reads the table info back.

    basePath
      parent folder which contains folders with table names

    tableName
      name of the table to read for

  10. abstract def sparkSession: SparkSession

  11. abstract def writeAuditTableInfo(basePath: Path, info: AuditTableInfo): Try[AuditTableInfo]

    Writes out static data about the audit table into the basePath/table_name/.table_info file.

    basePath
      parent folder which contains folders with table names

    info
      static information about the table that will not change during the table's existence
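
    Since both audit-info methods return a Try, a write can be chained with a read-back verification. A hypothetical sketch; the `ops`, `tableName`, and `info` values are assumptions, not part of this API's documentation:

    ```scala
    import org.apache.hadoop.fs.Path
    import scala.util.{Failure, Success}
    import com.coxautodata.waimak.storage.{AuditTableInfo, FileStorageOps}

    // Hypothetical: persist table metadata, then read it back to confirm the commit.
    def persistAndVerify(ops: FileStorageOps, tableName: String, info: AuditTableInfo): Unit =
      ops.writeAuditTableInfo(new Path("/data/db"), info)
        .flatMap(_ => ops.readAuditTableInfo(new Path("/data/db"), tableName)) match {
          case Success(readBack) => println(s"committed table info: $readBack")
          case Failure(e)        => throw e // surface write or read-back failures
        }
    ```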

  12. abstract def writeParquet(tableName: String, path: Path, ds: Dataset[_], overwrite: Boolean = true, tempSubfolder: Option[String] = None): Unit

    Commits a dataset into the full path. The path is the final destination into which the parquet will be placed after it has been fully written into the temp folder.

    tableName
      name of the table; it will only be used to construct the temp folder path

    path
      full destination path

    ds
      dataset to write out; no partitioning will be performed on it

    overwrite
      whether to overwrite the existing data in path; if false, folder contents will be merged

    tempSubfolder
      an optional subfolder used for writing temporary data, used like $temp/$tableName/$tempSubFolder. If not given, the temp path becomes $temp/$tableName/${path.getName}

    Exceptions thrown
      Exception can be thrown due to access permissions, connectivity, or Spark UDFs (as datasets are lazily executed)
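
    An append (rather than overwrite) into an existing table folder could be sketched as below, assuming an in-scope `ops: FileStorageOps` and dataset `ds`; the region name is illustrative only:

    ```scala
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.Dataset
    import com.coxautodata.waimak.storage.FileStorageOps

    // Hypothetical: stage under $temp/tbl1/region=16, then commit to the
    // destination, merging with any existing contents (overwrite = false).
    def appendRegion(ops: FileStorageOps, ds: Dataset[_]): Unit =
      ops.writeParquet(
        tableName = "tbl1",
        path = new Path("/data/db/tbl1/type=hot/region=16"),
        ds = ds,
        overwrite = false,
        tempSubfolder = Some("region=16")
      )
    ```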

Concrete Value Members

  1. final def atomicWriteAndCleanup(tableName: String, compactedData: Dataset[_], newDataPath: Path, cleanUpBase: Path, cleanUpFolders: Seq[String], appendTimestamp: Timestamp): Unit

    During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation must be fail-safe: old data can only be moved out after the new version is fully written and committed.

    E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset, and will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.

    Starting state:

      /data/db/tbl1/type=hot/region=11
      .../region=12
      .../region=13
      .../region=14

    Final state:

      /data/db/tbl1/type=cold/region=15
      /data/db/.Trash/tbl1/${appendTimestamp}/region=11
      .../region=12
      .../region=13
      .../region=14

    tableName
      name of the table

    compactedData
      the dataset with data from fromSubFolders, already repartitioned; it will be saved into newDataPath

    newDataPath
      path into which the combined and repartitioned data will be committed

    cleanUpBase
      parent folder from which to remove the cleanUpFolders

    cleanUpFolders
      list of sub-folders to remove once the writing and committing of the combined data has succeeded

    appendTimestamp
      timestamp of the compaction/append, used to date the Trash folders
