Class FileBasedSink<UserT,DestinationT,OutputT>
- java.lang.Object
-
- org.apache.beam.sdk.io.FileBasedSink<UserT,DestinationT,OutputT>
-
- Type Parameters:
OutputT
- the type of values written to the sink.
- All Implemented Interfaces:
java.io.Serializable
,HasDisplayData
- Direct Known Subclasses:
AvroSink
@Experimental(FILESYSTEM) public abstract class FileBasedSink<UserT,DestinationT,OutputT> extends java.lang.Object implements java.io.Serializable, HasDisplayData
Abstract class for file-based output. An implementation of FileBasedSink writes file-based output and defines the format of output files (how values are written, headers/footers, MIME type, etc.).At pipeline construction time, the methods of FileBasedSink are called to validate the sink and to create a
FileBasedSink.WriteOperation
that manages the process of writing to the sink.The process of writing to file-based sink is as follows:
- An optional subclass-defined initialization,
- a parallel write of bundles to temporary files, and finally,
- these temporary files are renamed with final output filenames.
In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the event of failure/retry or for redundancy). However, exactly one of these executions will have its result passed to the finalize method. Each call to
FileBasedSink.Writer.open(java.lang.String)
is passed a unique bundle id when it is called by the WriteFiles transform, so even redundant or retried bundles will have a unique way of identifying their output.The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness guarantee is important; if a bundle is to be output to a file, for example, the name of the file will encode the unique bundle id to avoid conflicts with other writers.
FileBasedSink
can take a customFileBasedSink.FilenamePolicy
object to determine output filenames, and this policy object can be used to write windowed or triggered PCollections into separate files per window pane. This allows file output from unbounded PCollections, and also works for bounded PCollecctions.Supported file systems are those registered with
FileSystems
.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
FileBasedSink.CompressionType
Deprecated.useCompression
.static class
FileBasedSink.DynamicDestinations<UserT,DestinationT,OutputT>
A class that allows value-dependent writes inFileBasedSink
.static class
FileBasedSink.FilenamePolicy
A naming policy for output files.static class
FileBasedSink.FileResult<DestinationT>
Result of a single bundle write.static class
FileBasedSink.FileResultCoder<DestinationT>
A coder forFileBasedSink.FileResult
objects.static interface
FileBasedSink.OutputFileHints
Provides hints about how to generate output files, such as a suggested filename suffix (e.g.static interface
FileBasedSink.WritableByteChannelFactory
Implementations create instances ofWritableByteChannel
used byFileBasedSink
and related classes to allow decorating, or otherwise transforming, the raw data that would normally be written directly to theWritableByteChannel
passed intoFileBasedSink.WritableByteChannelFactory.create(WritableByteChannel)
.static class
FileBasedSink.WriteOperation<DestinationT,OutputT>
Abstract operation that manages the process of writing toFileBasedSink
.static class
FileBasedSink.Writer<DestinationT,OutputT>
Abstract writer that writes a bundle to aFileBasedSink
.
-
Constructor Summary
Constructors Constructor Description FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?,DestinationT,OutputT> dynamicDestinations)
Construct aFileBasedSink
with the given temp directory, producing uncompressed files.FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?,DestinationT,OutputT> dynamicDestinations, Compression compression)
Construct aFileBasedSink
with the given temp directory and output channel type.FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?,DestinationT,OutputT> dynamicDestinations, FileBasedSink.WritableByteChannelFactory writableByteChannelFactory)
Construct aFileBasedSink
with the given temp directory and output channel type.
-
Method Summary
-
-
-
Constructor Detail
-
FileBasedSink
@Experimental(FILESYSTEM) public FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?,DestinationT,OutputT> dynamicDestinations)
Construct aFileBasedSink
with the given temp directory, producing uncompressed files.
-
FileBasedSink
@Experimental(FILESYSTEM) public FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?,DestinationT,OutputT> dynamicDestinations, FileBasedSink.WritableByteChannelFactory writableByteChannelFactory)
Construct aFileBasedSink
with the given temp directory and output channel type.
-
FileBasedSink
@Experimental(FILESYSTEM) public FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?,DestinationT,OutputT> dynamicDestinations, Compression compression)
Construct aFileBasedSink
with the given temp directory and output channel type.
-
-
Method Detail
-
convertToFileResourceIfPossible
@Experimental(FILESYSTEM) public static ResourceId convertToFileResourceIfPossible(java.lang.String outputPrefix)
This is a helper function for turning a user-provided output filename prefix and converting it into aResourceId
for writing output files. SeeTextIO.Write.to(String)
for an example use case.Typically, the input prefix will be something like
/tmp/foo/bar
, and the user would like output files to be named as/tmp/foo/bar-0-of-3.txt
. Thus, this function tries to interpret the provided string as a fileResourceId
path.However, this may fail, for example if the user gives a prefix that is a directory. E.g.,
/
,gs://my-bucket
, orc://
. In that case, interpreting the string as a file will fail and this function will return a directoryResourceId
instead.
-
getDynamicDestinations
public FileBasedSink.DynamicDestinations<UserT,DestinationT,OutputT> getDynamicDestinations()
Return theFileBasedSink.DynamicDestinations
used.
-
getTempDirectoryProvider
@Experimental(FILESYSTEM) public ValueProvider<ResourceId> getTempDirectoryProvider()
Returns the directory inside which temporary files will be written according to the configuredFileBasedSink.FilenamePolicy
.
-
validate
public void validate(PipelineOptions options)
-
createWriteOperation
public abstract FileBasedSink.WriteOperation<DestinationT,OutputT> createWriteOperation()
Return a subclass ofFileBasedSink.WriteOperation
that will manage the write to the sink.
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Description copied from interface:HasDisplayData
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
getWritableByteChannelFactory
protected final FileBasedSink.WritableByteChannelFactory getWritableByteChannelFactory()
Returns theFileBasedSink.WritableByteChannelFactory
used.
-
-