public class DeleteOperation extends ExecutingStoreOperation<Boolean>
How S3Guard/Store inconsistency is handled: when objects are found in the S3 listing which the S3Guard table did not know about, the extraFilesDeleted counter will be incremented here.
Performance tuning:
The operation to POST a bulk delete request (or issue many individual DELETE calls) and then update the S3Guard table is performed asynchronously, so that it can overlap with the LIST calls for data. However, only one such operation is queued at a time.
Executing more than one batch delete is possible; it just adds complexity in terms of error handling and in the data structures used to track outstanding operations. If this is done, it may be worth experimenting with different page sizes. The default value is InternalConstants.MAX_ENTRIES_TO_DELETE, the maximum a single POST permits.
1. Smaller pages executed in parallel may have different performance characteristics when deleting very large directories, because it will be the DynamoDB calls which will come to dominate. Any exploration of options here MUST be done with performance measurements taken from test runs in EC2 against local DDB and S3 stores, so as to ensure network latencies do not skew the results.
2. Note that as the DDB thread/connection pools will be shared across all active delete operations, speedups will be minimal unless those pools are large enough to cope with the extra load.
There are also some opportunities to explore in DynamoDBMetadataStore with batching delete requests in the DDB APIs.
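The single-in-flight batch pattern described above can be sketched with standard-library primitives. This is a minimal illustration, not the actual S3A implementation: `PagedDeleter`, `queueBatch`, and `deleteBatch` are hypothetical names, and the page size of 1000 stands in for `InternalConstants.MAX_ENTRIES_TO_DELETE`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PagedDeleter {
    // Illustrative stand-in for InternalConstants.MAX_ENTRIES_TO_DELETE.
    static final int PAGE_SIZE = 1000;

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private Future<Integer> pendingDelete;   // at most one outstanding batch
    private int filesDeleted;

    /** Queue one batch for async deletion, blocking on the previous batch first. */
    void queueBatch(List<String> keys) throws Exception {
        awaitPending();                      // enforce a single operation in flight
        List<String> batch = new ArrayList<>(keys);
        pendingDelete = executor.submit(() -> deleteBatch(batch));
    }

    /** Block until the outstanding batch, if any, completes or fails. */
    void awaitPending() throws Exception {
        if (pendingDelete != null) {
            filesDeleted += pendingDelete.get();
            pendingDelete = null;
        }
    }

    /** Stand-in for the bulk delete POST; just reports the batch size. */
    private int deleteBatch(List<String> keys) {
        return keys.size();
    }

    /** Page through all keys, overlapping deletion with further iteration. */
    int run(List<String> allKeys) throws Exception {
        for (int i = 0; i < allKeys.size(); i += PAGE_SIZE) {
            queueBatch(allKeys.subList(i, Math.min(i + PAGE_SIZE, allKeys.size())));
        }
        awaitPending();
        executor.shutdown();
        return filesDeleted;
    }
}
```

The key design point is that `queueBatch` blocks on the previous `Future` before submitting, so error handling only ever has one outstanding operation to reason about, as the note above explains.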
| Constructor and Description |
|---|
| `DeleteOperation(StoreContext context, S3AFileStatus status, boolean recursive, OperationCallbacks callbacks, int pageSize)` Constructor. |
| Modifier and Type | Method and Description |
|---|---|
| `protected void` | `deleteDirectoryTree(org.apache.hadoop.fs.Path path, String dirKey)` Delete a directory tree. |
| `Boolean` | `execute()` Delete a file or directory tree. |
| `long` | `getExtraFilesDeleted()` |
| `long` | `getFilesDeleted()` |
Inherited methods: executeOnlyOnce, getStoreContext
public DeleteOperation(StoreContext context, S3AFileStatus status, boolean recursive, OperationCallbacks callbacks, int pageSize)

Parameters:
- context - store context
- status - pre-fetched source status
- recursive - recursive delete?
- callbacks - callback provider
- pageSize - size of delete pages

public long getFilesDeleted()

public long getExtraFilesDeleted()
@Retries.RetryTranslated public Boolean execute() throws IOException
This call does not create any fake parent directory; that is left to the caller. The actual delete call is done in a separate thread. Only one delete at a time is submitted, however, to reduce the complexity of recovering from failures.
The DynamoDB store deletes paths in parallel itself, so that potentially slow part of the process is somewhat speeded up. The extra parallelization here is to list files from the store/DDB while that delete operation is in progress.
execute in class ExecutingStoreOperation<Boolean>

Throws:
- org.apache.hadoop.fs.PathIsNotEmptyDirectoryException - if the path is a dir and this is not a recursive delete.
- IOException - list failures or an inability to delete a file.

@Retries.RetryTranslated protected void deleteDirectoryTree(org.apache.hadoop.fs.Path path, String dirKey) throws IOException
This is done by asking the filesystem for a list of all objects under the directory path, without using any S3Guard tombstone markers to hide objects which may be returned in S3 listings but which are considered deleted.
Once the first pageSize worth of objects has been listed, a batch delete is queued for execution in a separate thread; each subsequent batch blocks until the previous delete has completed or failed before it is itself submitted for deletion in that thread.
After all listed objects are queued for deletion, if the path is considered authoritative in the client, a final scan of S3 without S3Guard is executed, so as to find and delete any out-of-band objects in the tree.
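The bookkeeping behind that final scan can be illustrated with a small sketch. The set-difference logic and the names `FinalScan`, `extraFiles`, `knownToS3Guard`, and `rawS3Listing` are assumptions for illustration, not the actual S3A code; the idea is simply that objects present in the raw S3 listing but unknown to S3Guard are the out-of-band files whose count `getExtraFilesDeleted()` reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class FinalScan {
    /**
     * Given the keys known to the S3Guard table and the keys found by a raw
     * S3 listing (no tombstone filtering), return the out-of-band objects
     * which still need deletion.
     */
    static List<String> extraFiles(Set<String> knownToS3Guard, List<String> rawS3Listing) {
        List<String> extras = new ArrayList<>();
        for (String key : rawS3Listing) {
            if (!knownToS3Guard.contains(key)) {
                extras.add(key);    // present in S3 but unknown to S3Guard
            }
        }
        return extras;
    }
}
```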
Parameters:
- path - directory path
- dirKey - directory key

Throws:
- IOException - failure

Copyright © 2008–2020 Apache Software Foundation. All rights reserved.