Class FileBasedSink<UserT,DestinationT,OutputT>

  • Type Parameters:
    OutputT - the type of values written to the sink.
    All Implemented Interfaces:
    java.io.Serializable, HasDisplayData
    Direct Known Subclasses:
    AvroSink

    @Experimental(FILESYSTEM)
    public abstract class FileBasedSink<UserT,DestinationT,OutputT>
    extends java.lang.Object
    implements java.io.Serializable, HasDisplayData
    Abstract class for file-based output. An implementation of FileBasedSink writes file-based output and defines the format of output files (how values are written, headers/footers, MIME type, etc.).

    At pipeline construction time, the methods of FileBasedSink are called to validate the sink and to create a FileBasedSink.WriteOperation that manages the process of writing to the sink.

    The process of writing to a file-based sink is as follows:

    1. An optional subclass-defined initialization,
    2. a parallel write of bundles to temporary files, and finally,
    3. a rename of these temporary files to their final output filenames.
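
    The three phases can be illustrated with a minimal sketch that uses only java.nio.file. Everything here is hypothetical and for illustration: the class TempThenRenameSketch, its method names, the "temp" directory, and the "part-NNNNN-of-NNNNN.txt" naming are assumptions, not the Beam API. Bundles are written sequentially for brevity, whereas a runner would write them in parallel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of the three phases; not a Beam class. */
class TempThenRenameSketch {

  /** Phase 1: optional initialization (here: create a temporary directory). */
  static Path initialize(Path outputDir) {
    try {
      return Files.createDirectories(outputDir.resolve("temp"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Phase 2: write one bundle's records to its own temporary file. */
  static Path writeBundle(Path tempDir, String bundleId, List<String> records) {
    try {
      return Files.write(tempDir.resolve(bundleId + ".tmp"), records);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Phase 3: rename every temporary file to a final, sharded output name. */
  static List<String> finalizeOutput(Path outputDir, List<Path> tempFiles) {
    List<String> finalNames = new ArrayList<>();
    try {
      for (int i = 0; i < tempFiles.size(); i++) {
        String name =
            String.format(Locale.ROOT, "part-%05d-of-%05d.txt", i, tempFiles.size());
        Files.move(tempFiles.get(i), outputDir.resolve(name));
        finalNames.add(name);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return finalNames;
  }

  /** Runs all three phases against a fresh scratch directory. */
  static List<String> runDemo() {
    try {
      Path outputDir = Files.createTempDirectory("sink-sketch");
      Path tempDir = initialize(outputDir);
      List<Path> tempFiles = new ArrayList<>();
      // Sequential here; a real runner writes bundles in parallel.
      tempFiles.add(writeBundle(tempDir, "bundle-a", Arrays.asList("x", "y")));
      tempFiles.add(writeBundle(tempDir, "bundle-b", Arrays.asList("z")));
      return finalizeOutput(outputDir, tempFiles);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

    Writing to temporary files first means that a failed or abandoned write never leaves a partial file under a final output name.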

    To ensure fault tolerance, a bundle may be executed multiple times (e.g., in the event of failure/retry or for redundancy). However, exactly one of these executions will have its result passed to the finalize method. Each call to FileBasedSink.Writer.open(java.lang.String) made by the WriteFiles transform is passed a unique bundle id, so even redundant or retried bundles have a unique way of identifying their output.

    The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness guarantee is important; if a bundle is to be output to a file, for example, the name of the file will encode the unique bundle id to avoid conflicts with other writers.
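
    As a rough sketch of why the bundle id matters, the following simulates several attempts of the same logical bundle. The class BundleIdSketch, the "temp-<id>.tmp" naming scheme, and the use of a random UUID as the bundle id are all assumptions made for illustration; how WriteFiles actually generates ids is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical illustration of per-attempt bundle ids; not a Beam class. */
class BundleIdSketch {

  /** Builds a temporary filename that encodes the bundle id (assumed scheme). */
  static String tempFileName(String bundleId) {
    return "temp-" + bundleId + ".tmp";
  }

  /** Simulates retried or redundant attempts, each given a fresh bundle id. */
  static List<String> simulateAttempts(int attempts) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < attempts; i++) {
      // A random UUID stands in for the unique bundle id (an assumption).
      names.add(tempFileName(UUID.randomUUID().toString()));
    }
    return names;
  }

  /** Exactly one attempt is finalized; the others would be discarded. */
  static String chooseAttemptToFinalize(List<String> attemptFiles) {
    return attemptFiles.get(0);
  }

  public static void main(String[] args) {
    List<String> attempts = simulateAttempts(3);
    System.out.println("attempt files: " + attempts);
    System.out.println("finalized:     " + chooseAttemptToFinalize(attempts));
  }
}
```

    Because each attempt writes under a distinct name, a retried bundle cannot overwrite or interleave with the output of an earlier attempt.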

    FileBasedSink can take a custom FileBasedSink.FilenamePolicy object to determine output filenames, and this policy object can be used to write windowed or triggered PCollections into separate files per window pane. This allows file output from unbounded PCollections, and also works for bounded PCollections.
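
    To make the idea of per-window-pane filenames concrete, here is a standalone sketch of a naming function. It is not the FileBasedSink.FilenamePolicy interface: the class WindowedNameSketch, the method signature, and the filename layout are hypothetical, chosen only to show how the window bounds, pane index, and shard number can each contribute to a distinct name.

```java
import java.time.Instant;
import java.util.Locale;

/** Hypothetical naming helper; does not implement the Beam FilenamePolicy. */
class WindowedNameSketch {

  /**
   * Combines window bounds (as epoch seconds), pane index, and shard number
   * into one filename, so each window pane gets its own set of files.
   */
  static String windowedFilename(
      String prefix,
      Instant windowStart,
      Instant windowEnd,
      int paneIndex,
      int shard,
      int numShards) {
    return String.format(
        Locale.ROOT,
        "%s-%d-%d-pane-%d-%05d-of-%05d",
        prefix,
        windowStart.getEpochSecond(),
        windowEnd.getEpochSecond(),
        paneIndex,
        shard,
        numShards);
  }

  public static void main(String[] args) {
    System.out.println(
        windowedFilename(
            "out", Instant.ofEpochSecond(0), Instant.ofEpochSecond(60), 0, 1, 3));
  }
}
```

    With a scheme like this, two panes of the same window, or the same pane index in two different windows, never map to the same file.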

    Supported file systems are those registered with FileSystems.

    See Also:
    Serialized Form