AbstractS3ACommitter (Apache Hadoop Amazon Web Services support 3.3.1 API)

java.lang.Object
- org.apache.hadoop.mapreduce.OutputCommitter
- - org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
  - - org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitter

All Implemented Interfaces:

org.apache.hadoop.fs.statistics.IOStatisticsSource

Direct Known Subclasses:

MagicS3GuardCommitter, StagingCommitter
```
public abstract class AbstractS3ACommitter
extends org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
implements org.apache.hadoop.fs.statistics.IOStatisticsSource
```
Abstract base class for S3A committers; allows for any commonality between different architectures. Although the committer APIs allow for a committer to be created without an output path, this is not supported in this class or its subclasses: a destination must be supplied. It is left to the committer factory to handle the creation of a committer when the destination is unknown. Requiring an output directory simplifies coding and testing.
The original implementation loaded all .pendingset files before attempting any commit/abort operations. While straightforward, and guaranteeing that no changes were made to the destination until all files had successfully been loaded, it didn't scale: the list grew until it exceeded heap size.
The second iteration builds up an AbstractS3ACommitter.ActiveCommit class with the list of .pendingset files to load and then commit; that can be done incrementally and in parallel. As a side effect of this change, unless/until changed, the commit/abort/revert of all files uploaded by a single task will be serialized. This may slow down these operations if there are many files created by a few tasks and the HTTP connection pool in the S3A committer is large enough for all the parallel POST requests.

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`AbstractS3ACommitter.ActiveCommit` State of the active commit operation.
`static class`	`AbstractS3ACommitter.JobUUIDSource` Enumeration of Job UUID source.

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`E_SELF_GENERATED_JOB_UUID` Error string when task setup fails.
`static String`	`THREAD_PREFIX`

Constructor Summary

Constructors
Modifier	Constructor and Description
`protected`	`AbstractS3ACommitter(org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.mapreduce.TaskAttemptContext context)` Create a committer.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`abortJob(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.mapreduce.JobStatus.State state)`
`protected void`	`abortJobInternal(org.apache.hadoop.mapreduce.JobContext context, boolean suppressExceptions)` The internal job abort operation; can be overridden in tests.
`protected void`	`abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending, boolean suppressExceptions, boolean deleteRemoteFiles)` Abort all pending uploads in the list.
`protected void`	`abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context, List<SinglePendingCommit> pending, boolean suppressExceptions)` Abort all pending uploads in the list.
`protected void`	`abortPendingUploadsInCleanup(boolean suppressExceptions)` Abort all pending uploads to the destination directory during job cleanup operations.
`static org.apache.commons.lang3.tuple.Pair<String,AbstractS3ACommitter.JobUUIDSource>`	`buildJobUUID(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapreduce.JobID jobId)` Build the job UUID.
`protected Tasks.Submitter`	`buildSubmitter(org.apache.hadoop.mapreduce.JobContext context)` Returns a `Tasks.Submitter` for parallel tasks.
`protected void`	`cleanup(org.apache.hadoop.mapreduce.JobContext context, boolean suppressExceptions)` Cleanup the job context, including aborting anything pending and destroying the thread pool.
`void`	`cleanupJob(org.apache.hadoop.mapreduce.JobContext context)`
`abstract void`	`cleanupStagingDirs()` Clean up any staging directories.
`void`	`commitJob(org.apache.hadoop.mapreduce.JobContext context)` Commit work.
`protected void`	`commitJobInternal(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending)` Internal Job commit operation: where the S3 requests are made (potentially in parallel).
`protected void`	`commitPendingUploads(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending)` Commit all the pending uploads.
`protected void`	`deleteTaskAttemptPathQuietly(org.apache.hadoop.mapreduce.TaskAttemptContext context)` Delete the task attempt path without raising any errors.
`protected void`	`destroyThreadPool()` Destroy any thread pools; wait for that to finish, but don't overreact if it doesn't finish in time.
`protected abstract org.apache.hadoop.fs.Path`	`getBaseTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)` Compute the base path where the output of a task attempt is written.
`protected CommitOperations`	`getCommitOperations()` Get the commit actions instance.
`org.apache.hadoop.conf.Configuration`	`getConf()`
`org.apache.hadoop.fs.FileSystem`	`getDestFS()` Get the destination FS, creating it on demand if needed.
`protected org.apache.hadoop.fs.FileSystem`	`getDestinationFS(org.apache.hadoop.fs.Path out, org.apache.hadoop.conf.Configuration config)` Get the destination filesystem from the output path and the configuration.
`S3AFileSystem`	`getDestS3AFS()` Get the destination as an S3A Filesystem; casting it.
`org.apache.hadoop.fs.statistics.IOStatistics`	`getIOStatistics()`
`protected abstract org.apache.hadoop.fs.Path`	`getJobAttemptPath(int appAttemptId)` Compute the path where the output of a given job attempt will be placed.
`org.apache.hadoop.fs.Path`	`getJobAttemptPath(org.apache.hadoop.mapreduce.JobContext context)` Compute the path where the output of a given job attempt will be placed.
`org.apache.hadoop.mapreduce.JobContext`	`getJobContext()` Get the job/task context this committer was instantiated with.
`abstract String`	`getName()` Get the name of this committer.
`org.apache.hadoop.fs.Path`	`getOutputPath()` Final path of output, in the destination FS.
`protected String`	`getRole()` Used in logging and reporting to help disentangle messages.
`protected org.apache.hadoop.fs.FileSystem`	`getTaskAttemptFilesystem(org.apache.hadoop.mapreduce.TaskAttemptContext context)` Get the task attempt path filesystem.
`org.apache.hadoop.fs.Path`	`getTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)` Compute the path where the output of a task attempt is stored until that task is committed.
`abstract org.apache.hadoop.fs.Path`	`getTempTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)` Get a temporary directory for data.
`String`	`getUUID()` The Job UUID, as passed in or generated.
`AbstractS3ACommitter.JobUUIDSource`	`getUUIDSource()` Source of the UUID.
`org.apache.hadoop.fs.Path`	`getWorkPath()` This is the critical method for `FileOutputFormat`; it declares the path for work.
`boolean`	`hasThreadPool()` Does this committer have a thread pool?
`protected CommitOperations.CommitContext`	`initiateCommitOperation()` Start the final commit/abort commit operations.
`protected void`	`initOutput(org.apache.hadoop.fs.Path out)` Init the output filesystem and path.
`protected void`	`jobCompleted(boolean success)` Job completion outcome; this may be subclassed in tests.
`protected abstract AbstractS3ACommitter.ActiveCommit`	`listPendingUploadsToCommit(org.apache.hadoop.mapreduce.JobContext context)` Get the list of pending uploads for this job attempt.
`protected void`	`maybeCreateSuccessMarker(org.apache.hadoop.mapreduce.JobContext context, List<String> filenames, org.apache.hadoop.fs.statistics.IOStatisticsSnapshot ioStatistics)` If the job requires a success marker on a successful job, create the file `CommitConstants._SUCCESS`.
`protected void`	`maybeCreateSuccessMarkerFromCommits(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending)` If the job requires a success marker on a successful job, create the file `CommitConstants._SUCCESS`.
`protected void`	`maybeIgnore(boolean suppress, String action, Invoker.VoidOperation operation)` Execute an operation; maybe suppress any raised IOException.
`protected void`	`maybeIgnore(boolean suppress, String action, IOException ex)` Log or rethrow a caught IOException.
`protected void`	`precommitCheckPendingFiles(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending)` Run a precommit check that all files are loadable.
`void`	`preCommitJob(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending)` Subclass-specific pre-Job-commit actions.
`void`	`recoverTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)` Task recovery considered unsupported: Warn and fail.
`protected boolean`	`requiresDelayedCommitOutputInFileSystem()` Flag to indicate whether or not the destination filesystem needs to be configured to support magic paths where the output isn't immediately visible.
`protected void`	`setConf(org.apache.hadoop.conf.Configuration conf)`
`protected void`	`setDestFS(org.apache.hadoop.fs.FileSystem destFS)` Set the destination FS: the FS of the final output.
`protected void`	`setOutputPath(org.apache.hadoop.fs.Path outputPath)` Set the output path.
`void`	`setupJob(org.apache.hadoop.mapreduce.JobContext context)` Base job setup (optionally) deletes the success marker and always creates the destination directory.
`void`	`setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext context)` Task setup.
`protected void`	`setWorkPath(org.apache.hadoop.fs.Path workPath)` Set the work path for this committer.
`protected Tasks.Submitter`	`singleThreadSubmitter()` Get the thread pool for executing the single file commit/revert within the commit of all uploads of a single task.
`String`	`toString()`
`protected void`	`warnOnActiveUploads(org.apache.hadoop.fs.Path path)` Scan for active uploads and list them along with a warning message.

Methods inherited from class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
hasOutputPath

Methods inherited from class org.apache.hadoop.mapreduce.OutputCommitter
abortTask, commitTask, isCommitJobRepeatable, isRecoverySupported, isRecoverySupported, needsTaskCommit

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - THREAD_PREFIX
```
public static final String THREAD_PREFIX
```
    See Also:
    
    Constant Field Values
  - E_SELF_GENERATED_JOB_UUID
```
public static final String E_SELF_GENERATED_JOB_UUID
```
    Error string when task setup fails.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - AbstractS3ACommitter
```
protected AbstractS3ACommitter(org.apache.hadoop.fs.Path outputPath,
                               org.apache.hadoop.mapreduce.TaskAttemptContext context)
                        throws IOException
```
    Create a committer. This constructor binds the destination directory and configuration, but does not update the work path: that must be calculated by the implementation. It is omitted here to avoid subclass methods being called too early.
    
    Parameters:
    
    outputPath - the job's output path: MUST NOT be null.
    
    context - the task's context
    
    Throws:
    
    IOException - on a failure
- Method Detail
  - initOutput
```
protected void initOutput(org.apache.hadoop.fs.Path out)
                   throws IOException
```
    Init the output filesystem and path. TESTING ONLY; allows mock FS to cheat.
    
    Parameters:
    
    out - output path
    
    Throws:
    
    IOException - failure to create the FS.
  - getJobContext
```
public final org.apache.hadoop.mapreduce.JobContext getJobContext()
```
    Get the job/task context this committer was instantiated with.
    
    Returns:
    
    the context.
  - getOutputPath
```
public final org.apache.hadoop.fs.Path getOutputPath()
```
    Final path of output, in the destination FS.
    
    Specified by:
    
    getOutputPath in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
    
    Returns:
    
    the path
  - setOutputPath
```
protected final void setOutputPath(org.apache.hadoop.fs.Path outputPath)
```
    Set the output path.
    
    Parameters:
    
    outputPath - new value
  - getWorkPath
```
public final org.apache.hadoop.fs.Path getWorkPath()
```
    This is the critical method for FileOutputFormat; it declares the path for work.
    
    Specified by:
    
    getWorkPath in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
    
    Returns:
    
    the working path.
  - setWorkPath
```
protected final void setWorkPath(org.apache.hadoop.fs.Path workPath)
```
    Set the work path for this committer.
    
    Parameters:
    
    workPath - the work path to use.
  - getConf
```
public final org.apache.hadoop.conf.Configuration getConf()
```
  - setConf
```
protected final void setConf(org.apache.hadoop.conf.Configuration conf)
```
  - getDestFS
```
public org.apache.hadoop.fs.FileSystem getDestFS()
                                          throws IOException
```
    Get the destination FS, creating it on demand if needed.
    
    Returns:
    
    the filesystem; requires the output path to be set up
    
    Throws:
    
    IOException - if the FS cannot be instantiated.
  - getDestS3AFS
```
public S3AFileSystem getDestS3AFS()
                           throws IOException
```
    Get the destination as an S3A Filesystem; casting it.
    
    Returns:
    
    the dest S3A FS.
    
    Throws:
    
    IOException - if the FS cannot be instantiated.
  - setDestFS
```
protected void setDestFS(org.apache.hadoop.fs.FileSystem destFS)
```
    Set the destination FS: the FS of the final output.
    
    Parameters:
    
    destFS - destination FS.
  - getJobAttemptPath
```
public org.apache.hadoop.fs.Path getJobAttemptPath(org.apache.hadoop.mapreduce.JobContext context)
```
    Compute the path where the output of a given job attempt will be placed.
    
    Parameters:
    
    context - the context of the job. This is used to get the application attempt ID.
    
    Returns:
    
    the path to store job attempt data.
  - getJobAttemptPath
```
protected abstract org.apache.hadoop.fs.Path getJobAttemptPath(int appAttemptId)
```
    Compute the path where the output of a given job attempt will be placed.
    
    Parameters:
    
    appAttemptId - the ID of the application attempt for this job.
    
    Returns:
    
    the path to store job attempt data.
  - getTaskAttemptPath
```
public org.apache.hadoop.fs.Path getTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
```
    Compute the path where the output of a task attempt is stored until that task is committed. This may be the normal Task attempt path or it may be a subdirectory. The default implementation returns the value of getBaseTaskAttemptPath(TaskAttemptContext); subclasses may return different values.
    
    Parameters:
    
    context - the context of the task attempt.
    
    Returns:
    
    the path where a task attempt should be stored.
  - getBaseTaskAttemptPath
```
protected abstract org.apache.hadoop.fs.Path getBaseTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
```
    Compute the base path where the output of a task attempt is written. This is the path which will be deleted when a task is cleaned up and aborted.
    
    Parameters:
    
    context - the context of the task attempt.
    
    Returns:
    
    the path where a task attempt should be stored.
  - getTempTaskAttemptPath
```
public abstract org.apache.hadoop.fs.Path getTempTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
```
    Get a temporary directory for data. When a task is aborted/cleaned up, the contents of this directory are all deleted.
    
    Parameters:
    
    context - task context
    
    Returns:
    
    a path for temporary data.
  - getName
```
public abstract String getName()
```
    Get the name of this committer.
    
    Returns:
    
    the committer name.
  - getUUID
```
public final String getUUID()
```
    The Job UUID, as passed in or generated.
    
    Returns:
    
    the UUID for the job.
  - getUUIDSource
```
public final AbstractS3ACommitter.JobUUIDSource getUUIDSource()
```
    Source of the UUID.
    
    Returns:
    
    how the job UUID was retrieved/generated.
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
  - getDestinationFS
```
protected org.apache.hadoop.fs.FileSystem getDestinationFS(org.apache.hadoop.fs.Path out,
                                                           org.apache.hadoop.conf.Configuration config)
                                                    throws IOException
```
    Get the destination filesystem from the output path and the configuration.
    
    Parameters:
    
    out - output path
    
    config - job/task config
    
    Returns:
    
    the associated FS
    
    Throws:
    
    PathCommitException - output path isn't to an S3A FS instance.
    
    IOException - failure to instantiate the FS.
  - requiresDelayedCommitOutputInFileSystem
```
protected boolean requiresDelayedCommitOutputInFileSystem()
```
    Flag to indicate whether or not the destination filesystem needs to be configured to support magic paths where the output isn't immediately visible. If the committer returns true, then committer setup will fail if the FS doesn't have the capability. Base implementation returns false.
    
    Returns:
    
    what the requirements of the committer are of the filesystem.
  - recoverTask
```
public void recoverTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
                 throws IOException
```
    Task recovery considered unsupported: Warn and fail.
    
    Overrides:
    
    recoverTask in class org.apache.hadoop.mapreduce.OutputCommitter
    
    Parameters:
    
    taskContext - Context of the task whose output is being recovered
    
    Throws:
    
    IOException - always.
  - maybeCreateSuccessMarkerFromCommits
```
protected void maybeCreateSuccessMarkerFromCommits(org.apache.hadoop.mapreduce.JobContext context,
                                                   AbstractS3ACommitter.ActiveCommit pending)
                                            throws IOException
```
    If the job requires a success marker on a successful job, create the file CommitConstants._SUCCESS. While the classic committers create a 0-byte file, the S3Guard committers PUT up the contents of a SuccessData file.
    
    Parameters:
    
    context - job context
    
    pending - the pending commits
    
    Throws:
    
    IOException - IO failure
  - maybeCreateSuccessMarker
```
protected void maybeCreateSuccessMarker(org.apache.hadoop.mapreduce.JobContext context,
                                        List<String> filenames,
                                        org.apache.hadoop.fs.statistics.IOStatisticsSnapshot ioStatistics)
                                 throws IOException
```
    If the job requires a success marker on a successful job, create the file CommitConstants._SUCCESS. While the classic committers create a 0-byte file, the S3Guard committers PUT up the contents of a SuccessData file.
    
    Parameters:
    
    context - job context
    
    filenames - list of filenames.
    
    ioStatistics - any IO Statistics to include
    
    Throws:
    
    IOException - IO failure
  - setupJob
```
public void setupJob(org.apache.hadoop.mapreduce.JobContext context)
              throws IOException
```
    Base job setup (optionally) deletes the success marker and always creates the destination directory. When objects are committed that dest dir marker will inevitably be deleted; creating it now ensures there is something at the end while the job is in progress, and that it is still there if nothing is created.
    The option InternalCommitterConstants.FS_S3A_COMMITTER_UUID is set to the job UUID; if generated locally InternalCommitterConstants.SPARK_WRITE_UUID is also patched. The field jobSetup is set to true to note that this specific committer instance was used to set up a job.
    
    Specified by:
    
    setupJob in class org.apache.hadoop.mapreduce.OutputCommitter
    
    Parameters:
    
    context - context
    
    Throws:
    
    IOException - IO failure
  - setupTask
```
public void setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext context)
               throws IOException
```
    Task setup. Fails if the UUID was generated locally, and the same committer wasn't used for job setup.
    
    Specified by:
    
    setupTask in class org.apache.hadoop.mapreduce.OutputCommitter
    
    Throws:
    
    PathCommitException - if the task UUID options are unsatisfied.
    
    IOException
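
    The rule above can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop classes: `UuidSource`, `checkTaskSetup` and the `jobSetupByThisInstance` flag are hypothetical names standing in for `AbstractS3ACommitter.JobUUIDSource`, the check made during task setup, and the `jobSetup` field. The sketch throws a plain `IOException` where the documented method raises `PathCommitException`.

```java
import java.io.IOException;

/** Illustrative only: hypothetical names, not the Hadoop implementation. */
class TaskSetupCheckSketch {

  /** Simplified stand-in for AbstractS3ACommitter.JobUUIDSource (assumed values). */
  enum UuidSource { SUPPLIED, GENERATED_LOCALLY }

  /**
   * Reject task setup when the job UUID was generated locally and this
   * committer instance was not the one used for job setup.
   */
  static void checkTaskSetup(UuidSource source, boolean jobSetupByThisInstance)
      throws IOException {
    if (source == UuidSource.GENERATED_LOCALLY && !jobSetupByThisInstance) {
      throw new IOException("Job UUID was generated locally, but this committer"
          + " instance did not set up the job");
    }
  }

  public static void main(String[] args) throws IOException {
    checkTaskSetup(UuidSource.SUPPLIED, false);          // accepted
    checkTaskSetup(UuidSource.GENERATED_LOCALLY, true);  // accepted
    try {
      checkTaskSetup(UuidSource.GENERATED_LOCALLY, false);
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```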
  - getTaskAttemptFilesystem
```
protected org.apache.hadoop.fs.FileSystem getTaskAttemptFilesystem(org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                            throws IOException
```
    Get the task attempt path filesystem. This may not be the same as the final destination FS, and so may not be an S3A FS.
    
    Parameters:
    
    context - task attempt
    
    Returns:
    
    the filesystem
    
    Throws:
    
    IOException - failure to instantiate
  - commitPendingUploads
```
protected void commitPendingUploads(org.apache.hadoop.mapreduce.JobContext context,
                                    AbstractS3ACommitter.ActiveCommit pending)
                             throws IOException
```
    Commit all the pending uploads. Each file listed in the ActiveCommit instance is queued for processing in a separate thread; its contents are loaded and then (sequentially) committed. On a failure or abort of a single file's commit, all its uploads are aborted. The revert operation lists the files already committed and deletes them.
    
    Parameters:
    
    context - job context
    
    pending - pending uploads
    
    Throws:
    
    IOException - on any failure
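
    The pattern described above (pendingset files processed in parallel, the uploads inside each one committed sequentially, with a revert on failure) can be sketched without any Hadoop or S3 dependency. Everything here is an assumption made for illustration: uploads are plain strings, the destination is an in-memory set, and `commitOne`, `commitAll` and the `failOn` parameter are hypothetical. The sketch only reverts the pendingset that failed; it does not model how the committer treats the other pendingsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: hypothetical names, not the Hadoop implementation. */
class PendingsetCommitSketch {

  /**
   * Commit the uploads of one pendingset in order. If one fails, delete
   * ("revert") the uploads of this pendingset that were already committed.
   */
  static void commitOne(List<String> uploads, Set<String> destination, String failOn) {
    List<String> committed = new ArrayList<>();
    try {
      for (String upload : uploads) {
        if (upload.equals(failOn)) {
          throw new IllegalStateException("commit failed: " + upload);
        }
        destination.add(upload);
        committed.add(upload);
      }
    } catch (RuntimeException e) {
      destination.removeAll(committed);
      throw e;
    }
  }

  /** Submit each pendingset as a separate task; rethrow the first failure seen. */
  static void commitAll(Map<String, List<String>> pendingsets,
      Set<String> destination, int threads, String failOn) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (List<String> uploads : pendingsets.values()) {
        futures.add(pool.submit(() -> commitOne(uploads, destination, failOn)));
      }
      RuntimeException failure = null;
      for (Future<?> future : futures) {
        try {
          future.get();
        } catch (ExecutionException e) {
          if (failure == null) {
            failure = new IllegalStateException(e.getCause());
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        }
      }
      if (failure != null) {
        throw failure;
      }
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    Set<String> destination = ConcurrentHashMap.newKeySet();
    Map<String, List<String>> pendingsets = Map.of(
        "task-a.pendingset", List.of("a1", "a2"),
        "task-b.pendingset", List.of("b1", "b2"));
    commitAll(pendingsets, destination, 2, null);
    System.out.println(destination.size() + " files committed");
  }
}
```

    Because each pendingset is handled by a single worker, a job in which a few tasks wrote most of the files gains little from extra threads, which is the serialization trade-off noted in the class description.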
  - precommitCheckPendingFiles
```
protected void precommitCheckPendingFiles(org.apache.hadoop.mapreduce.JobContext context,
                                          AbstractS3ACommitter.ActiveCommit pending)
                                   throws IOException
```
    Run a precommit check that all files are loadable. This check avoids the situation where the inability to read a file only surfaces partway through the job commit, so results in the destination being tainted.
    
    Parameters:
    
    context - job context
    
    pending - the pending operations
    
    Throws:
    
    IOException - any failure
  - initiateCommitOperation
```
protected CommitOperations.CommitContext initiateCommitOperation()
                                                          throws IOException
```
    Start the final commit/abort commit operations.
    
    Returns:
    
    a commit context through which the operations can be invoked.
    
    Throws:
    
    IOException - failure.
  - commitJobInternal
```
protected void commitJobInternal(org.apache.hadoop.mapreduce.JobContext context,
                                 AbstractS3ACommitter.ActiveCommit pending)
                          throws IOException
```
    Internal Job commit operation: where the S3 requests are made (potentially in parallel).
    
    Parameters:
    
    context - job context
    
    pending - pending commits
    
    Throws:
    
    IOException - any failure
  - abortJob
```
public void abortJob(org.apache.hadoop.mapreduce.JobContext context,
                     org.apache.hadoop.mapreduce.JobStatus.State state)
              throws IOException
```
    Overrides:
    
    abortJob in class org.apache.hadoop.mapreduce.OutputCommitter
    
    Throws:
    
    IOException
  - abortJobInternal
```
protected void abortJobInternal(org.apache.hadoop.mapreduce.JobContext context,
                                boolean suppressExceptions)
                         throws IOException
```
    The internal job abort operation; can be overridden in tests. This must clean up operations; it is called when a commit fails, as well as in an abortJob(JobContext, JobStatus.State) call. The base implementation calls cleanup(JobContext, boolean) so cleans up the filesystems and destroys the thread pool. Subclasses must always invoke this superclass method after their own operations.
    
    Parameters:
    
    context - job context
    
    suppressExceptions - should exceptions be suppressed?
    
    Throws:
    
    IOException - any IO problem raised when suppressExceptions is false.
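
    One way for a subclass to honour the "invoke the superclass method after their own operations" requirement is a try/finally block, so the base cleanup runs even if the subclass step throws. The sketch below is an assumption-laden illustration: `BaseCommitter` and `StagingLikeCommitter` are hypothetical stand-ins, the method takes only the `suppressExceptions` flag, and the log list exists purely to show the call order.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical classes, not the Hadoop implementation. */
class AbortOrderSketch {

  /** Stand-in for the base committer; records the order of cleanup steps. */
  static class BaseCommitter {
    final List<String> log = new ArrayList<>();

    protected void abortJobInternal(boolean suppressExceptions) throws IOException {
      log.add("base cleanup");
    }
  }

  /** Stand-in for a subclass with extra state to clean up. */
  static class StagingLikeCommitter extends BaseCommitter {
    @Override
    protected void abortJobInternal(boolean suppressExceptions) throws IOException {
      try {
        log.add("subclass cleanup");
      } finally {
        // Always runs, even if the subclass-specific step above throws.
        super.abortJobInternal(suppressExceptions);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    StagingLikeCommitter committer = new StagingLikeCommitter();
    committer.abortJobInternal(true);
    System.out.println(committer.log);
  }
}
```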
  - abortPendingUploadsInCleanup
```
protected void abortPendingUploadsInCleanup(boolean suppressExceptions)
                                     throws IOException
```
    Abort all pending uploads to the destination directory during job cleanup operations. Note: this instantiates the thread pool if required, so destroyThreadPool() must be called after this.
    
    Parameters:
    
    suppressExceptions - should exceptions be suppressed
    
    Throws:
    
    IOException - IO problem
  - preCommitJob
```
public void preCommitJob(org.apache.hadoop.mapreduce.JobContext context,
                         AbstractS3ACommitter.ActiveCommit pending)
                  throws IOException
```
    Subclass-specific pre-Job-commit actions. The staging committers all load the pending files to verify that they can be loaded. The Magic committer does not, because the overhead of reading files from S3 makes it too expensive.
    
    Parameters:
    
    context - job context
    
    pending - the pending operations
    
    Throws:
    
    IOException - any failure
  - commitJob
```
public void commitJob(org.apache.hadoop.mapreduce.JobContext context)
               throws IOException
```
    Commit work. This consists of two stages: precommit and commit.
    Precommit: identify pending uploads, then allow subclasses to validate the state of the destination and the pending uploads. Any failure here triggers an abort of all pending uploads.
    Commit internal: do the final commit sequence.
    The final commit action is to build the _SUCCESS file entry.
    
    Overrides:
    
    commitJob in class org.apache.hadoop.mapreduce.OutputCommitter
    
    Parameters:
    
    context - job context
    
    Throws:
    
    IOException - any failure
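
    The stages listed above can be summarised as a small control-flow sketch. It is not the Hadoop code: the class, its hook methods (`listPending`, `preCommit`, `commitInternal`, `abort`, `createSuccessMarker`), the `failPrecommit` switch and the file names are all hypothetical, and failures during the commit stage itself are not modelled.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical hooks, not the Hadoop implementation. */
class CommitFlowSketch {
  final List<String> log = new ArrayList<>();
  boolean failPrecommit;

  List<String> listPending() {
    log.add("list");
    return List.of("task1.pendingset", "task2.pendingset");
  }

  void preCommit(List<String> pending) throws IOException {
    log.add("precommit");
    if (failPrecommit) {
      throw new IOException("precommit check failed");
    }
  }

  void commitInternal(List<String> pending) throws IOException {
    log.add("commit");
  }

  void abort(List<String> pending) {
    log.add("abort");
  }

  void createSuccessMarker() {
    log.add("_SUCCESS");
  }

  /** Precommit, then commit; a precommit failure aborts the pending uploads. */
  void commitJob() throws IOException {
    List<String> pending = listPending();
    try {
      preCommit(pending);
    } catch (IOException e) {
      abort(pending);
      throw e;
    }
    commitInternal(pending);
    createSuccessMarker();  // the final action of a successful commit
  }

  public static void main(String[] args) throws IOException {
    CommitFlowSketch sketch = new CommitFlowSketch();
    sketch.commitJob();
    System.out.println(sketch.log);
  }
}
```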
  - jobCompleted
```
protected void jobCompleted(boolean success)
```
    Job completion outcome; this may be subclassed in tests.
    
    Parameters:
    
    success - did the job succeed.
  - cleanupStagingDirs
```
public abstract void cleanupStagingDirs()
```
    Clean up any staging directories. IOEs must be caught and swallowed.
  - listPendingUploadsToCommit
```
protected abstract AbstractS3ACommitter.ActiveCommit listPendingUploadsToCommit(org.apache.hadoop.mapreduce.JobContext context)
                                                                         throws IOException
```
    Get the list of pending uploads for this job attempt.
    
    Parameters:
    
    context - job context
    
    Returns:
    
    a list of pending uploads.
    
    Throws:
    
    IOException - Any IO failure
  - cleanup
```
protected void cleanup(org.apache.hadoop.mapreduce.JobContext context,
                       boolean suppressExceptions)
                throws IOException
```
    Cleanup the job context, including aborting anything pending and destroying the thread pool.
    
    Parameters:
    
    context - job context
    
    suppressExceptions - should exceptions be suppressed?
    
    Throws:
    
    IOException - any failure if exceptions were not suppressed.
  - cleanupJob
```
public void cleanupJob(org.apache.hadoop.mapreduce.JobContext context)
                throws IOException
```
    Overrides:
    
    cleanupJob in class org.apache.hadoop.mapreduce.OutputCommitter
    
    Throws:
    
    IOException
  - maybeIgnore
```
protected void maybeIgnore(boolean suppress,
                           String action,
                           Invoker.VoidOperation operation)
                    throws IOException
```
    Execute an operation; maybe suppress any raised IOException.
    
    Parameters:
    
    suppress - should raised IOEs be suppressed?
    
    action - action (for logging when the IOE is suppressed).
    
    operation - operation
    
    Throws:
    
    IOException - if operation raised an IOE and suppress == false
  - maybeIgnore
```
protected void maybeIgnore(boolean suppress,
                           String action,
                           IOException ex)
                    throws IOException
```
    Log or rethrow a caught IOException.
    
    Parameters:
    
    suppress - should raised IOEs be suppressed?
    
    action - action (for logging when the IOE is suppressed).
    
    ex - exception
    
    Throws:
    
    IOException - if suppress == false
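
    The two maybeIgnore overloads follow a common "suppress or rethrow" pattern, which the following minimal sketch reproduces in isolation. The nested `VoidOperation` interface is a local, hypothetical stand-in for `Invoker.VoidOperation`, and logging to `System.err` is an assumption; the documented methods are instance methods and may log differently.

```java
import java.io.IOException;

/** Illustrative only: a standalone version of the suppress-or-rethrow pattern. */
class MaybeIgnoreSketch {

  /** Hypothetical stand-in for Invoker.VoidOperation. */
  @FunctionalInterface
  interface VoidOperation {
    void execute() throws IOException;
  }

  /** Run the operation; hand any IOException to the second overload. */
  static void maybeIgnore(boolean suppress, String action, VoidOperation operation)
      throws IOException {
    try {
      operation.execute();
    } catch (IOException e) {
      maybeIgnore(suppress, action, e);
    }
  }

  /** Log the exception when suppressing; otherwise rethrow it. */
  static void maybeIgnore(boolean suppress, String action, IOException ex)
      throws IOException {
    if (suppress) {
      System.err.println("Ignoring failure of " + action + ": " + ex);
    } else {
      throw ex;
    }
  }

  public static void main(String[] args) throws IOException {
    // Suppressed: the failure is logged and execution continues.
    maybeIgnore(true, "abort upload", () -> {
      throw new IOException("simulated failure");
    });
    System.out.println("continued after suppressed failure");
  }
}
```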
  - getCommitOperations
```
protected CommitOperations getCommitOperations()
```
    Get the commit actions instance. Subclasses may provide a mock version of this.
    
    Returns:
    
    the commit actions instance to use for operations.
  - getRole
```
protected String getRole()
```
    Used in logging and reporting to help disentangle messages.
    
    Returns:
    
    the committer's role.
  - buildSubmitter
```
protected Tasks.Submitter buildSubmitter(org.apache.hadoop.mapreduce.JobContext context)
```
    Returns a Tasks.Submitter for parallel tasks. The number of threads in the thread-pool is set by fs.s3a.committer.threads. If num-threads is 0, this will return null; this is used in Tasks as a cue to switch to single-threaded execution.
    
    Parameters:
    
    context - the JobContext for this commit
    
    Returns:
    
    a submitter or null
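
    The "null means single-threaded" convention can be sketched with the standard `java.util.concurrent` classes. This is an illustration under stated assumptions, not the `Tasks.Submitter` API: `buildPool` and `runAll` are hypothetical helpers, an `ExecutorService` stands in for the submitter, and rejecting negative thread counts is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: hypothetical helpers, not the Tasks.Submitter API. */
class SubmitterSketch {

  /** Return a pool for a positive thread count, or null for a count of 0. */
  static ExecutorService buildPool(int threads) {
    if (threads < 0) {
      // Assumption of this sketch: negative values are simply rejected.
      throw new IllegalArgumentException("negative thread count: " + threads);
    }
    return threads == 0 ? null : Executors.newFixedThreadPool(threads);
  }

  /** Run the work on the pool if there is one, else in the calling thread. */
  static void runAll(ExecutorService pool, List<Runnable> work) {
    if (pool == null) {
      work.forEach(Runnable::run);
      return;
    }
    List<Future<?>> futures = new ArrayList<>();
    for (Runnable task : work) {
      futures.add(pool.submit(task));
    }
    for (Future<?> future : futures) {
      try {
        future.get();
      } catch (ExecutionException e) {
        throw new IllegalStateException(e.getCause());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
  }

  public static void main(String[] args) {
    List<Runnable> work = List.of(
        () -> System.out.println("commit file 1"),
        () -> System.out.println("commit file 2"));
    runAll(buildPool(0), work);  // pool is null: runs in the calling thread
    ExecutorService pool = buildPool(2);
    runAll(pool, work);          // runs on the pool's worker threads
    pool.shutdown();
  }
}
```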
  - destroyThreadPool
```
protected void destroyThreadPool()
```
    Destroy any thread pools; wait for that to finish, but don't overreact if it doesn't finish in time.
  - singleThreadSubmitter
```
protected final Tasks.Submitter singleThreadSubmitter()
```
    Get the thread pool for executing the single file commit/revert within the commit of all uploads of a single task. This is currently null; it is here to allow the Tasks class to provide the logic for execute/revert.
    
    Returns:
    
    null, always.
  - hasThreadPool
```
public boolean hasThreadPool()
```
    Does this committer have a thread pool?
    
    Returns:
    
    true if a thread pool exists.
  - deleteTaskAttemptPathQuietly
```
protected void deleteTaskAttemptPathQuietly(org.apache.hadoop.mapreduce.TaskAttemptContext context)
```
    Delete the task attempt path without raising any errors.
    
    Parameters:
    
    context - task context
  - abortPendingUploads
```
protected void abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context,
                                   List<SinglePendingCommit> pending,
                                   boolean suppressExceptions)
                            throws IOException
```
    Abort all pending uploads in the list. This operation is used by the magic committer as part of its rollback after a failure during task commit.
    
    Parameters:
    
    context - job context
    
    pending - pending uploads
    
    suppressExceptions - should exceptions be suppressed
    
    Throws:
    
    IOException - any exception raised
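
    The effect of suppressExceptions can be sketched as follows. This is an illustration under assumptions, not the committer's code: AbortSketch, Aborter and abortAll are hypothetical names, and continuing past a failed abort before rethrowing the first failure is an assumed policy.

```java
import java.io.IOException;
import java.util.List;

final class AbortSketch {

  /** Hypothetical abort action for a single pending upload. */
  interface Aborter<T> {
    void abort(T upload) throws IOException;
  }

  /**
   * Tries to abort every upload, continuing after failures.
   * Returns the number of uploads aborted successfully.
   */
  static <T> int abortAll(List<T> pending, Aborter<T> aborter,
      boolean suppressExceptions) throws IOException {
    IOException first = null;
    int aborted = 0;
    for (T upload : pending) {
      try {
        aborter.abort(upload);
        aborted++;
      } catch (IOException e) {
        if (first == null) {
          first = e;               // remember the first failure
        } else {
          first.addSuppressed(e);  // attach later failures to it
        }
      }
    }
    if (first != null && !suppressExceptions) {
      throw first; // only surfaces when the caller asked for it
    }
    return aborted;
  }
}
```

    Suppression suits rollback paths, where another failure is already being reported and a second exception would only mask it.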
  - abortPendingUploads
```
protected void abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context,
                                   AbstractS3ACommitter.ActiveCommit pending,
                                   boolean suppressExceptions,
                                   boolean deleteRemoteFiles)
                            throws IOException
```
    Abort all pending uploads in the list.
    
    Parameters:
    
    context - job context
    
    pending - pending uploads
    
    suppressExceptions - should exceptions be suppressed?
    
    deleteRemoteFiles - should remote files be deleted?
    
    Throws:
    
    IOException - any exception raised
  - getIOStatistics
```
public org.apache.hadoop.fs.statistics.IOStatistics getIOStatistics()
```
    Specified by:
    
    getIOStatistics in interface org.apache.hadoop.fs.statistics.IOStatisticsSource
  - warnOnActiveUploads
```
protected void warnOnActiveUploads(org.apache.hadoop.fs.Path path)
```
    Scan for active uploads and list them along with a warning message. Errors are ignored.
    
    Parameters:
    
    path - output path of job.
  - buildJobUUID
```
public static org.apache.commons.lang3.tuple.Pair<String,AbstractS3ACommitter.JobUUIDSource> buildJobUUID(org.apache.hadoop.conf.Configuration conf,
                                                                                                          org.apache.hadoop.mapreduce.JobID jobId)
                                                                                                   throws PathCommitException
```
    Build the job UUID.
    In MapReduce jobs, the application ID is issued by YARN, and unique across all jobs.
    
    Spark uses a fake app ID based on the current time, which can lead to collisions on busy clusters unless the Spark release in use has SPARK-33402 applied. That patch appends a random long value to the timestamp, making the ID unique enough that the risk of collision is almost nonexistent.
    
    The order of selection of a uuid is
    1. Value of InternalCommitterConstants.FS_S3A_COMMITTER_UUID.
    2. Value of InternalCommitterConstants.SPARK_WRITE_UUID.
    3. If enabled through CommitConstants.FS_S3A_COMMITTER_GENERATE_UUID: Self-generated uuid.
    4. If CommitConstants.FS_S3A_COMMITTER_REQUIRE_UUID is not set: Application ID
    The UUID binding takes place during construction; the staging committers use it to set up their wrapped committer to a path in the cluster FS which is unique to the job.
    In setupJob(JobContext) the job context's configuration will be patched to be valid in all sequences where the job has been set up for the configuration passed in.
    If the option CommitConstants.FS_S3A_COMMITTER_REQUIRE_UUID is set, then an external UUID MUST be passed in. This can be used to verify that the Spark engine is reliably setting unique IDs for staging.
    Parameters:
    
    conf - job/task configuration
    
    jobId - job ID from YARN or Spark.
    
    Returns:
    
    Job UUID and source of it.
    
    Throws:
    
    PathCommitException - no UUID was found and it was required
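
    The selection order can be sketched over a plain map. Everything named here is hypothetical: JobUuidSketch, its Source enum and the example.* keys are placeholders for the Hadoop constants, not their real values; IllegalStateException stands in for PathCommitException; and checking the require flag before self-generation is one reading of "an external UUID MUST be passed in".

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.UUID;

final class JobUuidSketch {

  /** Where the UUID came from; names are illustrative. */
  enum Source { COMMITTER_UUID_PROPERTY, SPARK_WRITE_UUID, GENERATED_LOCALLY, JOB_ID }

  // Placeholder keys standing in for the Hadoop constants; not their real values.
  static final String KEY_COMMITTER_UUID = "example.committer.uuid";
  static final String KEY_SPARK_WRITE_UUID = "example.spark.write.uuid";
  static final String KEY_GENERATE_UUID = "example.committer.generate.uuid";
  static final String KEY_REQUIRE_UUID = "example.committer.require.uuid";

  /** Picks the job UUID and records which rule supplied it. */
  static Map.Entry<String, Source> buildJobUuid(Map<String, String> conf, String jobId) {
    String uuid = trimmed(conf, KEY_COMMITTER_UUID);
    if (!uuid.isEmpty()) {
      return entry(uuid, Source.COMMITTER_UUID_PROPERTY); // 1. explicit committer UUID
    }
    uuid = trimmed(conf, KEY_SPARK_WRITE_UUID);
    if (!uuid.isEmpty()) {
      return entry(uuid, Source.SPARK_WRITE_UUID);         // 2. UUID supplied by Spark
    }
    if (Boolean.parseBoolean(trimmed(conf, KEY_REQUIRE_UUID))) {
      // Assumption: a required UUID must be external, so fail before generating one.
      throw new IllegalStateException("job UUID required but none was supplied");
    }
    if (Boolean.parseBoolean(trimmed(conf, KEY_GENERATE_UUID))) {
      return entry(UUID.randomUUID().toString(), Source.GENERATED_LOCALLY); // 3. self-generated
    }
    return entry(jobId, Source.JOB_ID);                    // 4. fall back to the job ID
  }

  private static String trimmed(Map<String, String> conf, String key) {
    String value = conf.get(key);
    return value == null ? "" : value.trim();
  }

  private static Map.Entry<String, Source> entry(String uuid, Source source) {
    return new AbstractMap.SimpleImmutableEntry<>(uuid, source);
  }
}
```

    Returning the source alongside the value mirrors the Pair in the real signature: callers can log or assert on how the UUID was chosen, not only on what it is.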
