public abstract class AbstractS3ACommitter
extends org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
implements org.apache.hadoop.fs.statistics.IOStatisticsSource
AbstractS3ACommitter.ActiveCommit
class with the
list of .pendingset files to load and then commit; that can be done
incrementally and in parallel.
As a side effect of this change, unless/until changed,
the commit/abort/revert of all files uploaded by a single task will be
serialized. This may slow down these operations if there are many files
created by a few tasks, and the HTTP connection pool in the S3A
committer was large enough for all the parallel POST requests.
Modifier and Type | Class and Description |
---|---|
static class |
AbstractS3ACommitter.ActiveCommit
State of the active commit operation.
|
static class |
AbstractS3ACommitter.JobUUIDSource
Enumeration of Job UUID source.
|
Modifier and Type | Field and Description |
---|---|
static String |
E_SELF_GENERATED_JOB_UUID
Error string when task setup fails.
|
static String |
THREAD_PREFIX |
Modifier | Constructor and Description |
---|---|
protected |
AbstractS3ACommitter(org.apache.hadoop.fs.Path outputPath,
org.apache.hadoop.mapreduce.TaskAttemptContext context)
Create a committer.
|
Modifier and Type | Method and Description |
---|---|
void |
abortJob(org.apache.hadoop.mapreduce.JobContext context,
org.apache.hadoop.mapreduce.JobStatus.State state) |
protected void |
abortJobInternal(org.apache.hadoop.mapreduce.JobContext context,
boolean suppressExceptions)
The internal job abort operation; can be overridden in tests.
|
protected void |
abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context,
AbstractS3ACommitter.ActiveCommit pending,
boolean suppressExceptions,
boolean deleteRemoteFiles)
Abort all pending uploads in the list.
|
protected void |
abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context,
List<SinglePendingCommit> pending,
boolean suppressExceptions)
Abort all pending uploads in the list.
|
protected void |
abortPendingUploadsInCleanup(boolean suppressExceptions)
Abort all pending uploads to the destination directory during
job cleanup operations.
|
static org.apache.commons.lang3.tuple.Pair<String,AbstractS3ACommitter.JobUUIDSource> |
buildJobUUID(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.mapreduce.JobID jobId)
Build the job UUID.
|
protected Tasks.Submitter |
buildSubmitter(org.apache.hadoop.mapreduce.JobContext context)
Returns a Tasks.Submitter for parallel tasks. |
protected void |
cleanup(org.apache.hadoop.mapreduce.JobContext context,
boolean suppressExceptions)
Cleanup the job context, including aborting anything pending
and destroying the thread pool.
|
void |
cleanupJob(org.apache.hadoop.mapreduce.JobContext context) |
abstract void |
cleanupStagingDirs()
Clean up any staging directories.
|
void |
commitJob(org.apache.hadoop.mapreduce.JobContext context)
Commit work.
|
protected void |
commitJobInternal(org.apache.hadoop.mapreduce.JobContext context,
AbstractS3ACommitter.ActiveCommit pending)
Internal Job commit operation: where the S3 requests are made
(potentially in parallel).
|
protected void |
commitPendingUploads(org.apache.hadoop.mapreduce.JobContext context,
AbstractS3ACommitter.ActiveCommit pending)
Commit all the pending uploads.
|
protected void |
deleteTaskAttemptPathQuietly(org.apache.hadoop.mapreduce.TaskAttemptContext context)
Delete the task attempt path without raising any errors.
|
protected void |
destroyThreadPool()
Destroy any thread pools; wait for that to finish,
but don't overreact if it doesn't finish in time.
|
protected abstract org.apache.hadoop.fs.Path |
getBaseTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
Compute the base path where the output of a task attempt is written.
|
protected CommitOperations |
getCommitOperations()
Get the commit actions instance.
|
org.apache.hadoop.conf.Configuration |
getConf() |
org.apache.hadoop.fs.FileSystem |
getDestFS()
Get the destination FS, creating it on demand if needed.
|
protected org.apache.hadoop.fs.FileSystem |
getDestinationFS(org.apache.hadoop.fs.Path out,
org.apache.hadoop.conf.Configuration config)
Get the destination filesystem from the output path and the configuration.
|
S3AFileSystem |
getDestS3AFS()
Get the destination as an S3A Filesystem; casting it.
|
org.apache.hadoop.fs.statistics.IOStatistics |
getIOStatistics() |
protected abstract org.apache.hadoop.fs.Path |
getJobAttemptPath(int appAttemptId)
Compute the path where the output of a given job attempt will be placed.
|
org.apache.hadoop.fs.Path |
getJobAttemptPath(org.apache.hadoop.mapreduce.JobContext context)
Compute the path where the output of a given job attempt will be placed.
|
org.apache.hadoop.mapreduce.JobContext |
getJobContext()
Get the job/task context this committer was instantiated with.
|
abstract String |
getName()
Get the name of this committer.
|
org.apache.hadoop.fs.Path |
getOutputPath()
Final path of output, in the destination FS.
|
protected String |
getRole()
Used in logging and reporting to help disentangle messages.
|
protected org.apache.hadoop.fs.FileSystem |
getTaskAttemptFilesystem(org.apache.hadoop.mapreduce.TaskAttemptContext context)
Get the task attempt path filesystem.
|
org.apache.hadoop.fs.Path |
getTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
Compute the path where the output of a task attempt is stored until
that task is committed.
|
abstract org.apache.hadoop.fs.Path |
getTempTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
Get a temporary directory for data.
|
String |
getUUID()
The Job UUID, as passed in or generated.
|
AbstractS3ACommitter.JobUUIDSource |
getUUIDSource()
Source of the UUID.
|
org.apache.hadoop.fs.Path |
getWorkPath()
This is the critical method for FileOutputFormat; it declares the path for work. |
boolean |
hasThreadPool()
Does this committer have a thread pool?
|
protected CommitOperations.CommitContext |
initiateCommitOperation()
Start the final commit/abort commit operations.
|
protected void |
initOutput(org.apache.hadoop.fs.Path out)
Init the output filesystem and path.
|
protected void |
jobCompleted(boolean success)
Job completion outcome; this may be subclassed in tests.
|
protected abstract AbstractS3ACommitter.ActiveCommit |
listPendingUploadsToCommit(org.apache.hadoop.mapreduce.JobContext context)
Get the list of pending uploads for this job attempt.
|
protected void |
maybeCreateSuccessMarker(org.apache.hadoop.mapreduce.JobContext context,
List<String> filenames,
org.apache.hadoop.fs.statistics.IOStatisticsSnapshot ioStatistics)
If the job requires a success marker on a successful job,
create the file CommitConstants._SUCCESS. |
protected void |
maybeCreateSuccessMarkerFromCommits(org.apache.hadoop.mapreduce.JobContext context,
AbstractS3ACommitter.ActiveCommit pending)
If the job requires a success marker on a successful job,
create the file CommitConstants._SUCCESS. |
protected void |
maybeIgnore(boolean suppress,
String action,
Invoker.VoidOperation operation)
Execute an operation; maybe suppress any raised IOException.
|
protected void |
maybeIgnore(boolean suppress,
String action,
IOException ex)
Log or rethrow a caught IOException.
|
protected void |
precommitCheckPendingFiles(org.apache.hadoop.mapreduce.JobContext context,
AbstractS3ACommitter.ActiveCommit pending)
Run a precommit check that all files are loadable.
|
void |
preCommitJob(org.apache.hadoop.mapreduce.JobContext context,
AbstractS3ACommitter.ActiveCommit pending)
Subclass-specific pre-Job-commit actions.
|
void |
recoverTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
Task recovery considered unsupported: Warn and fail.
|
protected boolean |
requiresDelayedCommitOutputInFileSystem()
Flag to indicate whether or not the destination filesystem needs
to be configured to support magic paths where the output isn't immediately
visible.
|
protected void |
setConf(org.apache.hadoop.conf.Configuration conf) |
protected void |
setDestFS(org.apache.hadoop.fs.FileSystem destFS)
Set the destination FS: the FS of the final output.
|
protected void |
setOutputPath(org.apache.hadoop.fs.Path outputPath)
Set the output path.
|
void |
setupJob(org.apache.hadoop.mapreduce.JobContext context)
Base job setup (optionally) deletes the success marker and
always creates the destination directory.
|
void |
setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext context)
Task setup.
|
protected void |
setWorkPath(org.apache.hadoop.fs.Path workPath)
Set the work path for this committer.
|
protected Tasks.Submitter |
singleThreadSubmitter()
Get the thread pool for executing the single file commit/revert
within the commit of all uploads of a single task.
|
String |
toString() |
protected void |
warnOnActiveUploads(org.apache.hadoop.fs.Path path)
Scan for active uploads and list them along with a warning message.
|
hasOutputPath
public static final String THREAD_PREFIX
public static final String E_SELF_GENERATED_JOB_UUID
protected AbstractS3ACommitter(org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
outputPath - the job's output path: MUST NOT be null.
context - the task's context
IOException - on a failure
protected void initOutput(org.apache.hadoop.fs.Path out) throws IOException
out - output path
IOException - failure to create the FS.
public final org.apache.hadoop.mapreduce.JobContext getJobContext()
public final org.apache.hadoop.fs.Path getOutputPath()
getOutputPath in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
protected final void setOutputPath(org.apache.hadoop.fs.Path outputPath)
outputPath - new value
public final org.apache.hadoop.fs.Path getWorkPath()
This is the critical method for FileOutputFormat; it declares the path for work.
getWorkPath in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
protected final void setWorkPath(org.apache.hadoop.fs.Path workPath)
workPath - the work path to use.
public final org.apache.hadoop.conf.Configuration getConf()
protected final void setConf(org.apache.hadoop.conf.Configuration conf)
public org.apache.hadoop.fs.FileSystem getDestFS() throws IOException
IOException - if the FS cannot be instantiated.
public S3AFileSystem getDestS3AFS() throws IOException
IOException - if the FS cannot be instantiated.
protected void setDestFS(org.apache.hadoop.fs.FileSystem destFS)
destFS - destination FS.
public org.apache.hadoop.fs.Path getJobAttemptPath(org.apache.hadoop.mapreduce.JobContext context)
context - the context of the job. This is used to get the application attempt ID.
protected abstract org.apache.hadoop.fs.Path getJobAttemptPath(int appAttemptId)
appAttemptId - the ID of the application attempt for this job.
public org.apache.hadoop.fs.Path getTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
The default implementation returns the value of getBaseTaskAttemptPath(TaskAttemptContext); subclasses may return different values.
context - the context of the task attempt.
protected abstract org.apache.hadoop.fs.Path getBaseTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
context - the context of the task attempt.
public abstract org.apache.hadoop.fs.Path getTempTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
context - task context
public abstract String getName()
public final String getUUID()
public final AbstractS3ACommitter.JobUUIDSource getUUIDSource()
public String toString()
toString in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
protected org.apache.hadoop.fs.FileSystem getDestinationFS(org.apache.hadoop.fs.Path out, org.apache.hadoop.conf.Configuration config) throws IOException
out - output path
config - job/task config
PathCommitException - output path isn't to an S3A FS instance.
IOException - failure to instantiate the FS.
protected boolean requiresDelayedCommitOutputInFileSystem()
public void recoverTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext) throws IOException
recoverTask in class org.apache.hadoop.mapreduce.OutputCommitter
taskContext - Context of the task whose output is being recovered
IOException - always.
protected void maybeCreateSuccessMarkerFromCommits(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending) throws IOException
If the job requires a success marker on a successful job, create the file CommitConstants._SUCCESS.
While the classic committers create a 0-byte file, the S3Guard committers PUT up the contents of a SuccessData file.
context - job context
pending - the pending commits
IOException - IO failure
protected void maybeCreateSuccessMarker(org.apache.hadoop.mapreduce.JobContext context, List<String> filenames, org.apache.hadoop.fs.statistics.IOStatisticsSnapshot ioStatistics) throws IOException
If the job requires a success marker on a successful job, create the file CommitConstants._SUCCESS.
While the classic committers create a 0-byte file, the S3Guard committers PUT up the contents of a SuccessData file.
context - job context
filenames - list of filenames.
ioStatistics - any IO Statistics to include
IOException - IO failure
public void setupJob(org.apache.hadoop.mapreduce.JobContext context) throws IOException
The option InternalCommitterConstants.FS_S3A_COMMITTER_UUID is set to the job UUID; if the UUID was generated locally, InternalCommitterConstants.SPARK_WRITE_UUID is also patched.
The field jobSetup is set to true to note that this specific committer instance was used to set up a job.
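As a rough illustration of the option patching described above, the following self-contained sketch uses a plain Map in place of a Hadoop Configuration. The class name, the method parameters, and the two key strings are hypothetical placeholders for the constants named in the text; it is not the committer's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: mimics the option patching described for setupJob. */
final class SetupJobSketch {

  // Placeholder keys (assumptions); the real option names are defined by
  // InternalCommitterConstants.FS_S3A_COMMITTER_UUID and SPARK_WRITE_UUID.
  static final String COMMITTER_UUID_KEY = "example.committer.uuid";
  static final String SPARK_WRITE_UUID_KEY = "example.spark.write.uuid";

  /** True once this instance has been used to set up a job. */
  private boolean jobSetup;

  void setupJob(Map<String, String> jobConf, String uuid, boolean generatedLocally) {
    // The committer UUID option is always set to the job UUID.
    jobConf.put(COMMITTER_UUID_KEY, uuid);
    // A locally generated UUID is also published under the Spark option.
    if (generatedLocally) {
      jobConf.put(SPARK_WRITE_UUID_KEY, uuid);
    }
    jobSetup = true;
  }

  boolean isJobSetup() {
    return jobSetup;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    SetupJobSketch committer = new SetupJobSketch();
    committer.setupJob(conf, "job-0001", true);
    System.out.println(conf + " jobSetup=" + committer.isJobSetup());
  }
}
```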
setupJob in class org.apache.hadoop.mapreduce.OutputCommitter
context - context
IOException - IO failure
public void setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
setupTask in class org.apache.hadoop.mapreduce.OutputCommitter
PathCommitException - if the task UUID options are unsatisfied.
IOException
protected org.apache.hadoop.fs.FileSystem getTaskAttemptFilesystem(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
context - task attempt
IOException - failure to instantiate
protected void commitPendingUploads(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending) throws IOException
context - job context
pending - pending uploads
IOException - on any failure
protected void precommitCheckPendingFiles(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending) throws IOException
context - job context
pending - the pending operations
IOException - any failure
protected CommitOperations.CommitContext initiateCommitOperation() throws IOException
IOException - failure.
protected void commitJobInternal(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending) throws IOException
context - job context
pending - pending commits
IOException - any failure
public void abortJob(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.mapreduce.JobStatus.State state) throws IOException
abortJob in class org.apache.hadoop.mapreduce.OutputCommitter
IOException
protected void abortJobInternal(org.apache.hadoop.mapreduce.JobContext context, boolean suppressExceptions) throws IOException
abortJob(JobContext, JobStatus.State)
call.
The base implementation calls cleanup(JobContext, boolean) so cleans up the filesystems and destroys the thread pool.
Subclasses must always invoke this superclass method after their own operations.
context - job context
suppressExceptions - should exceptions be suppressed?
IOException - any IO problem raised when suppressExceptions is false.
protected void abortPendingUploadsInCleanup(boolean suppressExceptions) throws IOException
destroyThreadPool() must be called after this.
suppressExceptions - should exceptions be suppressed
IOException - IO problem
public void preCommitJob(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending) throws IOException
context - job context
pending - the pending operations
IOException - any failure
public void commitJob(org.apache.hadoop.mapreduce.JobContext context) throws IOException
Precommit: identify pending uploads, then allow subclasses to validate the state of the destination and the pending uploads. Any failure here triggers an abort of all pending uploads.
Commit internal: do the final commit sequence.
The final commit action is to build the _SUCCESS file entry.
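The sequence above can be sketched in plain Java as follows. This is a simplified model and not the Hadoop implementation: the Phase interface, the trace list, and the step labels are hypothetical stand-ins for the real listing, validation, commit and marker-writing methods.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: the order of phases in a job commit. */
final class CommitJobSketch {

  /** One phase of the commit; a hypothetical stand-in for the real methods. */
  interface Phase {
    void run() throws IOException;
  }

  static void commitJob(Phase precommit, Phase commitInternal, List<String> trace)
      throws IOException {
    try {
      // Precommit: identify pending uploads and validate them.
      trace.add("precommit");
      precommit.run();
    } catch (IOException e) {
      // Any precommit failure triggers an abort of all pending uploads.
      trace.add("abort pending uploads");
      throw e;
    }
    // Commit internal: the final commit sequence.
    trace.add("commit internal");
    commitInternal.run();
    // The last action records the outcome in the _SUCCESS marker.
    trace.add("create _SUCCESS marker");
  }

  public static void main(String[] args) throws IOException {
    List<String> trace = new ArrayList<>();
    commitJob(() -> { }, () -> { }, trace);
    System.out.println(trace);
  }
}
```

In this model a failing precommit phase leaves the trace at "abort pending uploads" and rethrows the exception, so no commit or marker step is recorded.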
commitJob in class org.apache.hadoop.mapreduce.OutputCommitter
context - job context
IOException - any failure
protected void jobCompleted(boolean success)
success - did the job succeed.
public abstract void cleanupStagingDirs()
protected abstract AbstractS3ACommitter.ActiveCommit listPendingUploadsToCommit(org.apache.hadoop.mapreduce.JobContext context) throws IOException
context - job context
IOException - Any IO failure
protected void cleanup(org.apache.hadoop.mapreduce.JobContext context, boolean suppressExceptions) throws IOException
context - job context
suppressExceptions - should exceptions be suppressed?
IOException - any failure if exceptions were not suppressed.
public void cleanupJob(org.apache.hadoop.mapreduce.JobContext context) throws IOException
cleanupJob in class org.apache.hadoop.mapreduce.OutputCommitter
IOException
protected void maybeIgnore(boolean suppress, String action, Invoker.VoidOperation operation) throws IOException
suppress - should raised IOEs be suppressed?
action - action (for logging when the IOE is suppressed).
operation - operation
IOException - if operation raised an IOE and suppress == false
protected void maybeIgnore(boolean suppress, String action, IOException ex) throws IOException
suppress - should raised IOEs be suppressed?
action - action (for logging when the IOE is suppressed).
ex - exception
IOException - if suppress == false
protected CommitOperations getCommitOperations()
protected String getRole()
protected Tasks.Submitter buildSubmitter(org.apache.hadoop.mapreduce.JobContext context)
Returns a Tasks.Submitter for parallel tasks. The number of threads in the thread-pool is set by fs.s3a.committer.threads.
If num-threads is 0, this will return null; this is used in Tasks as a cue to switch to single-threaded execution.
context - the JobContext for this commit
protected void destroyThreadPool()
protected final Tasks.Submitter singleThreadSubmitter()
public boolean hasThreadPool()
protected void deleteTaskAttemptPathQuietly(org.apache.hadoop.mapreduce.TaskAttemptContext context)
context - task context
protected void abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context, List<SinglePendingCommit> pending, boolean suppressExceptions) throws IOException
context - job context
pending - pending uploads
suppressExceptions - should exceptions be suppressed
IOException - any exception raised
protected void abortPendingUploads(org.apache.hadoop.mapreduce.JobContext context, AbstractS3ACommitter.ActiveCommit pending, boolean suppressExceptions, boolean deleteRemoteFiles) throws IOException
context - job context
pending - pending uploads
suppressExceptions - should exceptions be suppressed?
deleteRemoteFiles - should remote files be deleted?
IOException - any exception raised
public org.apache.hadoop.fs.statistics.IOStatistics getIOStatistics()
getIOStatistics in interface org.apache.hadoop.fs.statistics.IOStatisticsSource
protected void warnOnActiveUploads(org.apache.hadoop.fs.Path path)
path - output path of job.
public static org.apache.commons.lang3.tuple.Pair<String,AbstractS3ACommitter.JobUUIDSource> buildJobUUID(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapreduce.JobID jobId) throws PathCommitException
In MapReduce jobs, the application ID is issued by YARN, and unique across all jobs.
Spark will use a fake app ID based on the current time. This can lead to collisions on busy clusters unless the specific spark release has SPARK-33402 applied. This appends a random long value to the timestamp, so is unique enough that the risk of collision is almost nonexistent.
The order of selection of a uuid is:
1. InternalCommitterConstants.FS_S3A_COMMITTER_UUID.
2. InternalCommitterConstants.SPARK_WRITE_UUID.
3. If CommitConstants.FS_S3A_COMMITTER_GENERATE_UUID is enabled: self-generated uuid.
4. If CommitConstants.FS_S3A_COMMITTER_REQUIRE_UUID is not set: Application ID.
In MapReduce jobs, the application ID is issued by YARN, and unique across all jobs.
In setupJob(JobContext) the job context's configuration will be patched
to be valid in all sequences where the job has been set up for the
configuration passed in.
If the option CommitConstants.FS_S3A_COMMITTER_REQUIRE_UUID
is set, then an external UUID MUST be passed in.
This can be used to verify that the spark engine is reliably setting
unique IDs for staging.
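The selection order and the require-UUID check described above can be modelled with a plain Map, as in the sketch below. It is an approximation for illustration only: the four key strings, the Source enum names, the Choice holder, and the use of IllegalStateException in place of PathCommitException are all assumptions, not the real Hadoop definitions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Illustrative only: the documented order for choosing a job UUID. */
final class JobUuidSketch {

  /** Where the UUID came from (names are hypothetical). */
  enum Source { COMMITTER_OPTION, SPARK_WRITE_OPTION, GENERATED_LOCALLY, APPLICATION_ID }

  /** A chosen UUID together with its origin. */
  static final class Choice {
    final String uuid;
    final Source source;

    Choice(String uuid, Source source) {
      this.uuid = uuid;
      this.source = source;
    }
  }

  // Placeholder keys (assumptions) standing in for the constants in the text.
  static final String COMMITTER_UUID_KEY = "example.committer.uuid";
  static final String SPARK_WRITE_UUID_KEY = "example.spark.write.uuid";
  static final String GENERATE_UUID_KEY = "example.committer.generate.uuid";
  static final String REQUIRE_UUID_KEY = "example.committer.require.uuid";

  static Choice chooseJobUuid(Map<String, String> conf, String applicationId) {
    // 1. An explicitly configured committer UUID wins.
    String uuid = conf.getOrDefault(COMMITTER_UUID_KEY, "").trim();
    if (!uuid.isEmpty()) {
      return new Choice(uuid, Source.COMMITTER_OPTION);
    }
    // 2. Otherwise use the UUID supplied by Spark, if any.
    uuid = conf.getOrDefault(SPARK_WRITE_UUID_KEY, "").trim();
    if (!uuid.isEmpty()) {
      return new Choice(uuid, Source.SPARK_WRITE_OPTION);
    }
    // 3. Self-generate one when that option is enabled.
    if (Boolean.parseBoolean(conf.get(GENERATE_UUID_KEY))) {
      return new Choice(UUID.randomUUID().toString(), Source.GENERATED_LOCALLY);
    }
    // If an external UUID is required, falling back is an error.
    if (Boolean.parseBoolean(conf.get(REQUIRE_UUID_KEY))) {
      throw new IllegalStateException("no UUID was found and it was required");
    }
    // 4. Fall back to the application ID.
    return new Choice(applicationId, Source.APPLICATION_ID);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(SPARK_WRITE_UUID_KEY, "spark-write-0001");
    Choice choice = chooseJobUuid(conf, "application_0001");
    System.out.println(choice.uuid + " from " + choice.source);
  }
}
```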
conf - job/task configuration
jobId - job ID from YARN or spark.
PathCommitException - no UUID was found and it was required
Copyright © 2008–2021 Apache Software Foundation. All rights reserved.