public abstract class AbstractS3ACommitter
extends org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
implements org.apache.hadoop.fs.statistics.IOStatisticsSource
AbstractS3ACommitter.ActiveCommit class with the
list of .pendingset files to load and then commit; that can be done
incrementally and in parallel.
As a side effect of this change, unless/until changed,
the commit/abort/revert of all files uploaded by a single task will be
serialized. This may slow down these operations if there are many files
created by a few tasks and the HTTP connection pool in the S3A
committer was large enough for all the parallel POST requests.
Modifier and Type | Class and Description |
---|---|
static class | AbstractS3ACommitter.ActiveCommit: State of the active commit operation. |
static class | AbstractS3ACommitter.JobUUIDSource: Enumeration of Job UUID source. |
Modifier and Type | Field and Description |
---|---|
static String | E_SELF_GENERATED_JOB_UUID: Error string when task setup fails. |
static String | THREAD_PREFIX |
Modifier | Constructor and Description |
---|---|
protected | AbstractS3ACommitter(org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.mapreduce.TaskAttemptContext context): Create a committer. |
Modifier and Type | Method and Description |
---|---|
void | abortJob(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.mapreduce.JobStatus.State state) |
protected void | abortJobInternal(CommitContext commitContext, boolean suppressExceptions): The internal job abort operation; can be overridden in tests. |
protected void | abortPendingUploads(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending, boolean suppressExceptions, boolean deleteRemoteFiles): Abort all pending uploads in the list. |
protected void | abortPendingUploads(CommitContext commitContext, List<SinglePendingCommit> pending, boolean suppressExceptions): Abort all pending uploads in the list. |
protected void | abortPendingUploadsInCleanup(boolean suppressExceptions, CommitContext commitContext): Abort all pending uploads to the destination directory during job cleanup operations. |
static org.apache.commons.lang3.tuple.Pair<String,AbstractS3ACommitter.JobUUIDSource> | buildJobUUID(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapreduce.JobID jobId): Build the job UUID. |
protected void | cleanup(CommitContext commitContext, boolean suppressExceptions): Clean up the job context, including aborting anything pending and destroying the thread pool. |
void | cleanupJob(org.apache.hadoop.mapreduce.JobContext context) |
abstract void | cleanupStagingDirs(): Clean up any staging directories. |
void | commitJob(org.apache.hadoop.mapreduce.JobContext context): Commit work. |
protected void | commitJobInternal(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending): Internal Job commit operation: where the S3 requests are made (potentially in parallel). |
protected void | commitPendingUploads(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending): Commit all the pending uploads. |
protected void | deleteTaskAttemptPathQuietly(org.apache.hadoop.mapreduce.TaskAttemptContext context): Delete the task attempt path without raising any errors. |
protected org.apache.hadoop.fs.store.audit.AuditSpanSource | getAuditSpanSource() |
protected abstract org.apache.hadoop.fs.Path | getBaseTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context): Compute the base path where the output of a task attempt is written. |
protected CommitOperations | getCommitOperations(): Get the commit actions instance. |
org.apache.hadoop.conf.Configuration | getConf() |
org.apache.hadoop.fs.FileSystem | getDestFS(): Get the destination FS, creating it on demand if needed. |
protected org.apache.hadoop.fs.FileSystem | getDestinationFS(org.apache.hadoop.fs.Path out, org.apache.hadoop.conf.Configuration config): Get the destination filesystem from the output path and the configuration. |
S3AFileSystem | getDestS3AFS(): Get the destination as an S3A Filesystem; casting it. |
org.apache.hadoop.fs.statistics.IOStatistics | getIOStatistics() |
protected abstract org.apache.hadoop.fs.Path | getJobAttemptPath(int appAttemptId): Compute the path where the output of a given job attempt will be placed. |
org.apache.hadoop.fs.Path | getJobAttemptPath(org.apache.hadoop.mapreduce.JobContext context): Compute the path where the output of a given job attempt will be placed. |
org.apache.hadoop.mapreduce.JobContext | getJobContext(): Get the job/task context this committer was instantiated with. |
protected abstract org.apache.hadoop.fs.Path | getJobPath(): Compute the path under which all job attempts will be placed. |
abstract String | getName(): Get the name of this committer. |
org.apache.hadoop.fs.Path | getOutputPath(): Final path of output, in the destination FS. |
protected String | getRole(): Used in logging and reporting to help disentangle messages. |
protected org.apache.hadoop.fs.FileSystem | getTaskAttemptFilesystem(org.apache.hadoop.mapreduce.TaskAttemptContext context): Get the task attempt path filesystem. |
org.apache.hadoop.fs.Path | getTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context): Compute the path where the output of a task attempt is stored until that task is committed. |
abstract org.apache.hadoop.fs.Path | getTempTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context): Get a temporary directory for data. |
String | getUUID(): The Job UUID, as passed in or generated. |
AbstractS3ACommitter.JobUUIDSource | getUUIDSource(): Source of the UUID. |
org.apache.hadoop.fs.Path | getWorkPath(): This is the critical method for FileOutputFormat; it declares the path for work. |
protected CommitContext | initiateJobOperation(org.apache.hadoop.mapreduce.JobContext context): Start the final job commit/abort commit operations. |
protected CommitContext | initiateTaskOperation(org.apache.hadoop.mapreduce.JobContext context): Start a task commit/abort commit operations. |
protected void | initOutput(org.apache.hadoop.fs.Path out): Init the output filesystem and path. |
protected void | jobCompleted(boolean success): Job completion outcome; this may be subclassed in tests. |
protected abstract AbstractS3ACommitter.ActiveCommit | listPendingUploadsToCommit(CommitContext commitContext): Get the list of pending uploads for this job attempt. |
protected SuccessData | maybeCreateSuccessMarker(org.apache.hadoop.mapreduce.JobContext context, List<String> filenames, org.apache.hadoop.fs.statistics.IOStatisticsSnapshot ioStatistics): If the job requires a success marker on a successful job, create the _SUCCESS file. |
protected SuccessData | maybeCreateSuccessMarkerFromCommits(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending): If the job requires a success marker on a successful job, create the file CommitConstants._SUCCESS. |
protected void | maybeIgnore(boolean suppress, String action, org.apache.hadoop.util.functional.InvocationRaisingIOE operation): Execute an operation; maybe suppress any raised IOException. |
protected void | maybeIgnore(boolean suppress, String action, IOException ex): Log or rethrow a caught IOException. |
protected void | precommitCheckPendingFiles(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending): Run a precommit check that all files are loadable. |
void | preCommitJob(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending): Subclass-specific pre-Job-commit actions. |
void | recoverTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext): Task recovery considered Unsupported: Warn and fail. |
protected boolean | requiresDelayedCommitOutputInFileSystem(): Flag to indicate whether or not the destination filesystem needs to be configured to support magic paths where the output isn't immediately visible. |
protected void | setConf(org.apache.hadoop.conf.Configuration conf) |
protected void | setDestFS(org.apache.hadoop.fs.FileSystem destFS): Set the destination FS: the FS of the final output. |
protected void | setOutputPath(org.apache.hadoop.fs.Path outputPath): Set the output path. |
void | setupJob(org.apache.hadoop.mapreduce.JobContext context): Base job setup (optionally) deletes the success marker and always creates the destination directory. |
void | setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext context): Task setup. |
protected void | setWorkPath(org.apache.hadoop.fs.Path workPath): Set the work path for this committer. |
protected org.apache.hadoop.fs.store.audit.AuditSpan | startOperation(String name, String path1, String path2): Start an operation; retrieve an audit span. |
String | toString() |
protected void | updateCommonContext(): Add jobID to current context. |
protected void | warnOnActiveUploads(org.apache.hadoop.fs.Path path): Scan for active uploads and list them along with a warning message. |
hasOutputPath
public static final String THREAD_PREFIX
@VisibleForTesting public static final String E_SELF_GENERATED_JOB_UUID
protected AbstractS3ACommitter(org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
outputPath - the job's output path: MUST NOT be null.
context - the task's context
IOException - on a failure
@VisibleForTesting protected void initOutput(org.apache.hadoop.fs.Path out) throws IOException
out - output path
IOException - failure to create the FS.
public final org.apache.hadoop.mapreduce.JobContext getJobContext()
public final org.apache.hadoop.fs.Path getOutputPath()
getOutputPath in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
protected final void setOutputPath(org.apache.hadoop.fs.Path outputPath)
outputPath - new value
public final org.apache.hadoop.fs.Path getWorkPath()
This is the critical method for FileOutputFormat; it declares the path for work.
getWorkPath in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
protected final void setWorkPath(org.apache.hadoop.fs.Path workPath)
workPath - the work path to use.
public final org.apache.hadoop.conf.Configuration getConf()
protected final void setConf(org.apache.hadoop.conf.Configuration conf)
public org.apache.hadoop.fs.FileSystem getDestFS() throws IOException
IOException - if the FS cannot be instantiated.
public S3AFileSystem getDestS3AFS() throws IOException
IOException - if the FS cannot be instantiated.
protected void setDestFS(org.apache.hadoop.fs.FileSystem destFS)
destFS - destination FS.
public org.apache.hadoop.fs.Path getJobAttemptPath(org.apache.hadoop.mapreduce.JobContext context)
context - the context of the job. This is used to get the application attempt ID.
protected abstract org.apache.hadoop.fs.Path getJobPath()
protected abstract org.apache.hadoop.fs.Path getJobAttemptPath(int appAttemptId)
appAttemptId - the ID of the application attempt for this job.
public org.apache.hadoop.fs.Path getTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
The default implementation returns the value of getBaseTaskAttemptPath(TaskAttemptContext); subclasses may return different values.
context - the context of the task attempt.
protected abstract org.apache.hadoop.fs.Path getBaseTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
context - the context of the task attempt.
public abstract org.apache.hadoop.fs.Path getTempTaskAttemptPath(org.apache.hadoop.mapreduce.TaskAttemptContext context)
context - task context
public abstract String getName()
@VisibleForTesting public final String getUUID()
@VisibleForTesting public final AbstractS3ACommitter.JobUUIDSource getUUIDSource()
public String toString()
toString in class org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
protected org.apache.hadoop.fs.FileSystem getDestinationFS(org.apache.hadoop.fs.Path out, org.apache.hadoop.conf.Configuration config) throws IOException
out - output path
config - job/task config
PathCommitException - output path isn't to an S3A FS instance.
IOException - failure to instantiate the FS.
protected boolean requiresDelayedCommitOutputInFileSystem()
public void recoverTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext) throws IOException
recoverTask in class org.apache.hadoop.mapreduce.OutputCommitter
taskContext - Context of the task whose output is being recovered
IOException - always.
protected SuccessData maybeCreateSuccessMarkerFromCommits(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending) throws IOException
If the job requires a success marker on a successful job, create the file CommitConstants._SUCCESS.
While the classic committers create a 0-byte file, the S3A committers PUT up the contents of a SuccessData file.
commitContext - commit context
pending - the pending commits
IOException - IO failure
protected SuccessData maybeCreateSuccessMarker(org.apache.hadoop.mapreduce.JobContext context, List<String> filenames, org.apache.hadoop.fs.statistics.IOStatisticsSnapshot ioStatistics) throws IOException
If the job requires a success marker on a successful job, create the _SUCCESS file.
While the classic committers create a 0-byte file, the S3A committers PUT up the contents of a SuccessData file.
The file is returned, even if no marker is created. This is so it can be saved to a report directory.
context - job context
filenames - list of filenames.
ioStatistics - any IO Statistics to include
IOException - IO failure
public void setupJob(org.apache.hadoop.mapreduce.JobContext context) throws IOException
The option InternalCommitterConstants.FS_S3A_COMMITTER_UUID is set to the job UUID; if the UUID was generated locally, InternalCommitterConstants.SPARK_WRITE_UUID is also patched.
The field jobSetup is set to true to note that this specific committer instance was used to set up a job.
setupJob in class org.apache.hadoop.mapreduce.OutputCommitter
context - context
IOException - IO failure
public void setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
setupTask in class org.apache.hadoop.mapreduce.OutputCommitter
PathCommitException - if the task UUID options are unsatisfied.
IOException
protected org.apache.hadoop.fs.FileSystem getTaskAttemptFilesystem(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
context - task attempt
IOException - failure to instantiate
protected void commitPendingUploads(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending) throws IOException
commitContext - commit context
pending - pending uploads
IOException - on any failure
protected void precommitCheckPendingFiles(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending) throws IOException
commitContext - commit context
pending - the pending operations
IOException - any failure
protected CommitContext initiateJobOperation(org.apache.hadoop.mapreduce.JobContext context) throws IOException
context - job context
IOException - failure.
protected CommitContext initiateTaskOperation(org.apache.hadoop.mapreduce.JobContext context) throws IOException
context - job or task context
IOException - failure.
protected void commitJobInternal(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending) throws IOException
commitContext - commit context
pending - pending commits
IOException - any failure
public void abortJob(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.mapreduce.JobStatus.State state) throws IOException
abortJob in class org.apache.hadoop.mapreduce.OutputCommitter
IOException
protected void abortJobInternal(CommitContext commitContext, boolean suppressExceptions) throws IOException
abortJob(JobContext, JobStatus.State) call.
The base implementation calls cleanup(CommitContext, boolean) and so cleans up the filesystems and destroys the thread pool. Subclasses must always invoke this superclass method after their own operations. Creates and closes its own commit context.
commitContext - commit context
suppressExceptions - should exceptions be suppressed?
IOException - any IO problem raised when suppressExceptions is false.
protected void abortPendingUploadsInCleanup(boolean suppressExceptions, CommitContext commitContext) throws IOException
suppressExceptions - should exceptions be suppressed
commitContext - commit context
IOException - IO problem
@VisibleForTesting public void preCommitJob(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending) throws IOException
commitContext - commit context
pending - the pending operations
IOException - any failure
public void commitJob(org.apache.hadoop.mapreduce.JobContext context) throws IOException
Precommit: identify pending uploads, then allow subclasses to validate the state of the destination and the pending uploads. Any failure here triggers an abort of all pending uploads.
Commit internal: do the final commit sequence.
The final commit action is to build the _SUCCESS file entry.
commitJob in class org.apache.hadoop.mapreduce.OutputCommitter
context - job context
IOException - any failure
protected void jobCompleted(boolean success)
success - did the job succeed.
public abstract void cleanupStagingDirs()
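The commitJob sequence described above (precommit, abort on a precommit failure, commit, then the success marker) can be sketched as follows. This is a minimal illustration and not the Hadoop implementation: the class CommitJobFlowSketch, its method names and the event strings are all hypothetical, and plain strings stand in for pending uploads.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the commitJob ordering; not the S3A committer code. */
abstract class CommitJobFlowSketch {

  /** Records what happened, standing in for real S3 requests. */
  final List<String> events = new ArrayList<>();

  /** Stand-in for listing the pending uploads of the job attempt. */
  abstract List<String> listPendingUploads() throws IOException;

  /** Stand-in for the subclass-specific validation step. */
  abstract void preCommit(List<String> pending) throws IOException;

  void commitJob() throws IOException {
    List<String> pending;
    try {
      // Precommit: identify pending uploads, then let the subclass validate.
      pending = listPendingUploads();
      preCommit(pending);
    } catch (IOException e) {
      // Any precommit failure aborts all pending uploads, then fails the job.
      events.add("abort-all-pending");
      throw e;
    }
    // Commit internal: complete every pending upload.
    for (String upload : pending) {
      events.add("commit:" + upload);
    }
    // Final action: build the success marker.
    events.add("success-marker");
  }
}
```

In this sketch the abort path runs only for precommit failures; how the real committer handles failures during the commit phase itself is not shown here.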
protected abstract AbstractS3ACommitter.ActiveCommit listPendingUploadsToCommit(CommitContext commitContext) throws IOException
commitContext - commit context
IOException - Any IO failure
protected void cleanup(CommitContext commitContext, boolean suppressExceptions) throws IOException
commitContext - commit context
suppressExceptions - should exceptions be suppressed?
IOException - any failure if exceptions were not suppressed.
public void cleanupJob(org.apache.hadoop.mapreduce.JobContext context) throws IOException
cleanupJob in class org.apache.hadoop.mapreduce.OutputCommitter
IOException
protected void maybeIgnore(boolean suppress, String action, org.apache.hadoop.util.functional.InvocationRaisingIOE operation) throws IOException
suppress - should raised IOEs be suppressed?
action - action (for logging when the IOE is suppressed).
operation - operation
IOException - if operation raised an IOE and suppress == false
protected void maybeIgnore(boolean suppress, String action, IOException ex) throws IOException
suppress - should raised IOEs be suppressed?
action - action (for logging when the IOE is suppressed).
ex - exception
IOException - if suppress == false
protected CommitOperations getCommitOperations()
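The two maybeIgnore overloads documented above follow a common "suppress or rethrow" pattern. The sketch below shows one way that pattern can be written; it is an assumption-based illustration, not the committer's source. MaybeIgnoreSketch, IOAction (a stand-in for org.apache.hadoop.util.functional.InvocationRaisingIOE) and the suppressedCount field are hypothetical names.

```java
import java.io.IOException;

/** Hypothetical sketch of the suppress-or-rethrow pattern. */
final class MaybeIgnoreSketch {

  /** Stand-in for a functional interface whose call may raise an IOException. */
  @FunctionalInterface
  interface IOAction {
    void apply() throws IOException;
  }

  /** Counts suppressed failures; a real implementation would only log them. */
  static int suppressedCount = 0;

  /** Execute the operation; hand any IOException to the overload below. */
  static void maybeIgnore(boolean suppress, String action, IOAction operation)
      throws IOException {
    try {
      operation.apply();
    } catch (IOException e) {
      maybeIgnore(suppress, action, e);
    }
  }

  /** Log the exception when suppressing, otherwise rethrow it. */
  static void maybeIgnore(boolean suppress, String action, IOException ex)
      throws IOException {
    if (suppress) {
      System.err.println("Ignoring failure in " + action + ": " + ex);
      suppressedCount++;
    } else {
      throw ex;
    }
  }
}
```

A cleanup path would typically pass suppress = true so that one failed abort does not mask the original failure, while a commit path would pass false.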
protected String getRole()
protected void deleteTaskAttemptPathQuietly(org.apache.hadoop.mapreduce.TaskAttemptContext context)
context - task context
protected void abortPendingUploads(CommitContext commitContext, List<SinglePendingCommit> pending, boolean suppressExceptions) throws IOException
commitContext - commit context
pending - pending uploads
suppressExceptions - should exceptions be suppressed
IOException - any exception raised
protected void abortPendingUploads(CommitContext commitContext, AbstractS3ACommitter.ActiveCommit pending, boolean suppressExceptions, boolean deleteRemoteFiles) throws IOException
commitContext - commit context
pending - pending uploads
suppressExceptions - should exceptions be suppressed?
deleteRemoteFiles - should remote files be deleted?
IOException - any exception raised
public org.apache.hadoop.fs.statistics.IOStatistics getIOStatistics()
getIOStatistics in interface org.apache.hadoop.fs.statistics.IOStatisticsSource
protected void warnOnActiveUploads(org.apache.hadoop.fs.Path path)
path - output path of job.
public static org.apache.commons.lang3.tuple.Pair<String,AbstractS3ACommitter.JobUUIDSource> buildJobUUID(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapreduce.JobID jobId) throws PathCommitException
In MapReduce jobs, the application ID is issued by YARN, and unique across all jobs.
Spark will use a fake app ID based on the current time. This can lead to collisions on busy clusters unless the specific Spark release has SPARK-33402 applied. That change appends a random long value to the timestamp, so the ID is unique enough that the risk of collision is almost nonexistent.
The order of selection of a UUID is:
1. The value of InternalCommitterConstants.FS_S3A_COMMITTER_UUID.
2. The value of InternalCommitterConstants.SPARK_WRITE_UUID.
3. If enabled through CommitConstants.FS_S3A_COMMITTER_GENERATE_UUID: a self-generated UUID.
4. If CommitConstants.FS_S3A_COMMITTER_REQUIRE_UUID is not set: the application ID.
In setupJob(JobContext) the job context's configuration will be patched so as to be valid in all sequences where the job has been set up for the configuration passed in.
If the option CommitConstants.FS_S3A_COMMITTER_REQUIRE_UUID is set, then an external UUID MUST be passed in. This can be used to verify that the Spark engine is reliably setting unique IDs for staging.
conf - job/task configuration
jobId - job ID from YARN or Spark.
PathCommitException - no UUID was found and it was required
protected final void updateCommonContext()
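The UUID selection order documented for buildJobUUID can be sketched as a chain of fallbacks. The code below is a simplified illustration under stated assumptions, not the Hadoop source: a plain Map stands in for Configuration, the KEY_* strings are placeholder keys rather than the real option names, the Source enum is only modeled on JobUUIDSource, and IllegalStateException stands in for PathCommitException.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of the job UUID selection order. */
final class JobUuidSelectionSketch {

  /** Where the chosen UUID came from; illustrative names only. */
  enum Source { COMMITTER_UUID_OPTION, SPARK_WRITE_UUID_OPTION, GENERATED_LOCALLY, JOB_ID }

  // Placeholder keys: the real keys are the values of the constants named in the text.
  static final String KEY_COMMITTER_UUID = "sketch.committer.uuid";
  static final String KEY_SPARK_WRITE_UUID = "sketch.spark.write.uuid";
  static final String KEY_GENERATE_UUID = "sketch.committer.generate.uuid";
  static final String KEY_REQUIRE_UUID = "sketch.committer.require.uuid";

  static Map.Entry<String, Source> chooseUuid(Map<String, String> conf, String jobId) {
    // 1. An explicitly configured committer UUID wins.
    String uuid = conf.getOrDefault(KEY_COMMITTER_UUID, "");
    if (!uuid.isEmpty()) {
      return new SimpleImmutableEntry<>(uuid, Source.COMMITTER_UUID_OPTION);
    }
    // 2. Otherwise use the UUID supplied by Spark, if any.
    uuid = conf.getOrDefault(KEY_SPARK_WRITE_UUID, "");
    if (!uuid.isEmpty()) {
      return new SimpleImmutableEntry<>(uuid, Source.SPARK_WRITE_UUID_OPTION);
    }
    // 3. Otherwise self-generate one, if that is enabled.
    if (Boolean.parseBoolean(conf.get(KEY_GENERATE_UUID))) {
      return new SimpleImmutableEntry<>(UUID.randomUUID().toString(), Source.GENERATED_LOCALLY);
    }
    // An external UUID was required but none was found: fail.
    if (Boolean.parseBoolean(conf.get(KEY_REQUIRE_UUID))) {
      throw new IllegalStateException("no UUID was found and it was required");
    }
    // 4. Fall back to the job/application ID.
    return new SimpleImmutableEntry<>(jobId, Source.JOB_ID);
  }
}
```

Returning the source alongside the UUID mirrors the Pair<String, JobUUIDSource> return type in the signature, which lets callers report how the ID was obtained.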
protected org.apache.hadoop.fs.store.audit.AuditSpanSource getAuditSpanSource()
protected org.apache.hadoop.fs.store.audit.AuditSpan startOperation(String name, @Nullable String path1, @Nullable String path2) throws IOException
All operation names SHOULD come from StoreStatisticNames or StreamStatisticNames.
name - operation name.
path1 - first path of operation
path2 - second path of operation
IOException - failure
Copyright © 2008–2024 Apache Software Foundation. All rights reserved.