Object io.smartdatalake.util.misc.CompactionUtil

object CompactionUtil extends SmartDataLakeLogger

Linear Supertypes
SmartDataLakeLogger, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. def compactHadoopStandardPartitions(dataObject: DataObject with CanHandlePartitions with CanCreateDataFrame with CanWriteDataFrame with HasHadoopStandardFilestore, partitionValues: Seq[PartitionValues])(implicit session: SparkSession, actionPipelineContext: ActionPipelineContext): Seq[PartitionValues]


Compacting Hadoop partitions is not supported out-of-the-box by Hadoop, as the files need to be read with the correct format and written again. The following steps are used to compact partitions with Spark:

    1. Check if a compaction is already in progress by looking for a special file "_SDL_COMPACTING" in the data object's root Hadoop path. If it exists and is not older than 12h, abort the compaction with an exception; otherwise create/update the special file "_SDL_COMPACTING". If the file is older than 12h, the previous compaction process is assumed to have crashed.
    2. As step 5 is not atomic (delete and move are two operations), check for possibly incomplete compactions of previous crashed runs and fix them. Incomplete compactions are marked with a special file "_SDL_MOVING" in the temporary path. Incompletely compacted partitions must be moved from the temporary path to the Hadoop path and marked as compacted (see step 5).
    3. Filter already compacted partitions from the given partitions by looking for the "_SDL_COMPACTED" file created in step 5.
    4. Rewrite the data of the partitions to be compacted into a temporary path under the data object's Hadoop path (an illustrative Spark sketch follows below).
    5. Delete each partition to be compacted from the Hadoop path and move it from the temporary path to the Hadoop path. This is done one-by-one to reduce the risk of data loss. To recover from an unexpected abort between delete and move, a special file "_SDL_MOVING" is created in the temporary path before deleting the Hadoop path; after the move, this file is deleted again. Finally, mark the compacted partitions by creating a special file "_SDL_COMPACTED" (the marker-file protocol is sketched below).
    6. Delete the "_SDL_COMPACTING" file created in step 1.
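
    The rewrite in step 4 is ordinary Spark work: read the partition with its correct format and write it back as fewer, larger files. A minimal sketch, assuming Parquet data and hypothetical paths; the actual method derives format, paths and partition layout from the given DataObject:

    import org.apache.spark.sql.SparkSession

    object RewritePartitionSketch {
      def main(args: Array[String]): Unit = {
        val session = SparkSession.builder().appName("compact-step4").master("local[*]").getOrCreate()
        // Hypothetical partition paths; not taken from CompactionUtil itself.
        val partitionPath = "hdfs:///data/myDataObject/dt=20230101"
        val tempPath      = "hdfs:///data/myDataObject/_SDL_temp/dt=20230101"
        // Rewrite the partition's many small files as a single file in the temporary path.
        session.read.parquet(partitionPath)
          .coalesce(1)
          .write.parquet(tempPath)
        session.stop()
      }
    }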
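
    The crash safety of steps 1, 2 and 5 rests entirely on the marker files. The sketch below illustrates the idea with plain org.apache.hadoop.fs.FileSystem calls; names, path layout and error handling are simplified assumptions, not CompactionUtil's actual implementation:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch of the marker-file protocol, NOT CompactionUtil's actual code.
    object CompactionProtocolSketch {

      // Step 1: refuse to start if another compaction is in progress,
      // i.e. the "_SDL_COMPACTING" marker exists and is younger than 12h.
      def acquireCompactingMarker(fs: FileSystem, rootPath: Path): Unit = {
        val marker = new Path(rootPath, "_SDL_COMPACTING")
        if (fs.exists(marker)) {
          val ageMs = System.currentTimeMillis() - fs.getFileStatus(marker).getModificationTime
          require(ageMs > 12 * 60 * 60 * 1000L, s"compaction already in progress: $marker")
        }
        fs.create(marker, true).close() // create or update the marker
      }

      // Step 5: swap one compacted partition into place, guarded by "_SDL_MOVING".
      def swapPartition(fs: FileSystem, tempPartitionPath: Path, finalPartitionPath: Path): Unit = {
        // Marker signalling that a non-atomic delete+move is in flight. If the
        // process crashes after the delete below, a later run finds this marker
        // and finishes the move (step 2).
        fs.createNewFile(new Path(tempPartitionPath, "_SDL_MOVING"))
        fs.delete(finalPartitionPath, true)              // delete old partition data
        fs.rename(tempPartitionPath, finalPartitionPath) // move compacted data into place
        // Swap done: remove the marker (it travelled with the renamed directory)
        // and mark the partition as compacted for step 3 of later runs.
        fs.delete(new Path(finalPartitionPath, "_SDL_MOVING"), false)
        fs.createNewFile(new Path(finalPartitionPath, "_SDL_COMPACTED"))
      }
    }

    The one-by-one swap keeps the window in which a partition exists only in the temporary path as small as possible, and the "_SDL_MOVING" marker makes that window recoverable after a crash.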

  7. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  8. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  9. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  11. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  12. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  13. lazy val logger: Logger

    Attributes
    protected
    Definition Classes
    SmartDataLakeLogger
  14. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. final def notify(): Unit

    Definition Classes
    AnyRef
  16. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  17. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  18. def toString(): String

    Definition Classes
    AnyRef → Any
  19. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
