Type Parameters:
IN - Type of the elements emitted by this sink
@PublicEvolving
public class StreamingFileSink<IN>
extends RichSinkFunction<IN>
implements CheckpointedFunction, org.apache.flink.runtime.state.CheckpointListener, ProcessingTimeCallback
Sink that emits its input elements to FileSystem files within buckets. The sink is integrated with the checkpointing mechanism to provide exactly-once semantics.
When creating the sink, a basePath must be specified. The base directory contains one directory for every bucket. The bucket directories themselves contain several part files, with at least one for each parallel subtask of the sink that is writing data to that bucket. These part files contain the actual output data.
The sink uses a BucketAssigner to determine the bucket directory, inside the base directory, to which each element is written. The BucketAssigner can, for example, use time or a property of the element to determine the bucket directory. The default BucketAssigner is a DateTimeBucketAssigner, which creates one new bucket every hour. You can specify a custom BucketAssigner using the withBucketAssigner(bucketAssigner) method of the builder, after calling forRowFormat(Path, Encoder) or forBulkFormat(Path, BulkWriter.Factory).
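To make the idea of an hourly bucket concrete, the following is a minimal, standard-library-only sketch of how a time-based bucket directory name could be derived from a timestamp. It does not use or reproduce Flink's DateTimeBucketAssigner; the class name HourlyBucketSketch, the helper bucketIdFor, and the pattern yyyy-MM-dd--HH are assumptions chosen for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; not part of the Flink API.
class HourlyBucketSketch {
    // Assumed pattern: one bucket directory per calendar hour.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH");

    // Maps a timestamp (epoch milliseconds) to a bucket directory name in the given zone.
    static String bucketIdFor(long epochMillis, ZoneId zone) {
        return HOURLY.withZone(zone).format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Two timestamps within the same hour share a bucket.
        System.out.println(bucketIdFor(0L, ZoneOffset.UTC));         // 1970-01-01--00
        System.out.println(bucketIdFor(3_599_999L, ZoneOffset.UTC)); // 1970-01-01--00
        // The next hour starts a new bucket.
        System.out.println(bucketIdFor(3_600_000L, ZoneOffset.UTC)); // 1970-01-01--01
    }
}
```

A bucket assigner based on a property of the element would follow the same shape, returning a directory name computed from the element instead of from the clock.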
The filenames of the part files consist of the part prefix "part-", the parallel subtask index of the sink, and a rolling counter. For example, the file "part-1-17" contains data from subtask 1 of the sink and carries the rolling counter value 17 for that subtask.
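The naming scheme described above can be sketched with plain string handling. This is an illustrative helper, not Flink code; the class PartFileNameSketch and its methods are hypothetical names.

```java
// Hypothetical illustration only; not part of the Flink API.
class PartFileNameSketch {
    private static final String PREFIX = "part-";

    // Builds a name of the form part-<subtaskIndex>-<counter>.
    static String partFileName(int subtaskIndex, long counter) {
        return PREFIX + subtaskIndex + "-" + counter;
    }

    // Splits a part file name back into {subtaskIndex, counter}.
    static long[] parse(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a part file: " + fileName);
        }
        String[] fields = fileName.substring(PREFIX.length()).split("-");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Unexpected part file name: " + fileName);
        }
        return new long[] {Long.parseLong(fields[0]), Long.parseLong(fields[1])};
    }

    public static void main(String[] args) {
        System.out.println(partFileName(1, 17)); // part-1-17
        long[] parsed = parse("part-1-17");
        System.out.println("subtask=" + parsed[0] + ", counter=" + parsed[1]);
    }
}
```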
Part files roll based on the user-specified RollingPolicy. By default, a DefaultRollingPolicy is used.
In some scenarios, the open buckets are required to change based on time. In these cases, the user can specify a bucketCheckInterval (by default 1 minute), and the sink will periodically check and roll the part file if the specified rolling policy says so.
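The interplay between a rolling policy and the periodic check can be sketched as follows. This is a toy model under stated assumptions, not the DefaultRollingPolicy implementation: the class RollingPolicySketch, its three thresholds (maximum part size, rollover interval, inactivity interval), and the values used in main are all hypothetical.

```java
// Hypothetical illustration only; not part of the Flink API.
class RollingPolicySketch {
    private final long maxPartSizeBytes;
    private final long rolloverIntervalMs;
    private final long inactivityIntervalMs;

    RollingPolicySketch(long maxPartSizeBytes, long rolloverIntervalMs, long inactivityIntervalMs) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.rolloverIntervalMs = rolloverIntervalMs;
        this.inactivityIntervalMs = inactivityIntervalMs;
    }

    // Consulted when an element is written: roll once the part file exceeds the size limit.
    boolean shouldRollOnEvent(long currentPartSizeBytes) {
        return currentPartSizeBytes > maxPartSizeBytes;
    }

    // Consulted by the periodic check: roll when the file is too old or has been idle too long.
    boolean shouldRollOnProcessingTime(long creationTimeMs, long lastUpdateTimeMs, long nowMs) {
        return nowMs - creationTimeMs >= rolloverIntervalMs
                || nowMs - lastUpdateTimeMs >= inactivityIntervalMs;
    }

    public static void main(String[] args) {
        RollingPolicySketch policy = new RollingPolicySketch(1024L, 60_000L, 30_000L);
        long created = 0L;
        long lastWrite = 45_000L;
        // Simulate the timer firing once per check interval (60 s here, an assumed value).
        for (long now = 60_000L; now <= 120_000L; now += 60_000L) {
            System.out.println("t=" + now + " roll="
                    + policy.shouldRollOnProcessingTime(created, lastWrite, now));
        }
    }
}
```

In this model, a shorter check interval makes time-based rolling more precise at the cost of more frequent checks.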
Part files can be in one of three states: in-progress, pending, or finished. The reason for this is how the sink works together with the checkpointing mechanism to provide exactly-once semantics and fault tolerance. The part file that is currently being written to is in-progress. Once a part file is closed for writing, it becomes pending. When a checkpoint is successful, the currently pending files are moved to finished.
In case of a failure, and in order to guarantee exactly-once semantics, the sink rolls back to the state it had when the last successful checkpoint occurred. To this end, when restoring, the restored files in pending state are transferred into the finished state, while any in-progress files are rolled back so that they do not contain data that arrived after the checkpoint from which we restore.
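The state transitions and the restore behavior described above can be modeled in a few lines. The sketch below is a deliberately simplified, in-memory model and makes several assumptions: it keeps a single snapshot instead of tracking pending files per checkpoint, it represents files only by name, state, and length, and it forgets files first written after the snapshot. The class PartFileLifecycleSketch and all of its methods are hypothetical and do not correspond to Flink classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Flink API.
class PartFileLifecycleSketch {
    enum State { IN_PROGRESS, PENDING, FINISHED }

    // Immutable view of a part file: its state and the number of bytes written.
    static final class PartFile {
        final State state;
        final long length;
        PartFile(State state, long length) { this.state = state; this.length = length; }
    }

    private Map<String, PartFile> live = new LinkedHashMap<>();
    private Map<String, PartFile> snapshot = new LinkedHashMap<>();

    // Appends bytes to an in-progress file, creating it on first write.
    void write(String name, long bytes) {
        PartFile f = live.getOrDefault(name, new PartFile(State.IN_PROGRESS, 0L));
        if (f.state != State.IN_PROGRESS) {
            throw new IllegalStateException(name + " is closed for writing");
        }
        live.put(name, new PartFile(State.IN_PROGRESS, f.length + bytes));
    }

    // Closing a file for writing moves it from in-progress to pending.
    void close(String name) {
        PartFile f = live.get(name);
        if (f == null) {
            throw new IllegalArgumentException("Unknown part file: " + name);
        }
        live.put(name, new PartFile(State.PENDING, f.length));
    }

    // Records the current states and lengths as the checkpointed view.
    void snapshot() {
        snapshot = new LinkedHashMap<>(live);
    }

    // A successful checkpoint moves files that were pending in the snapshot to finished.
    void checkpointComplete() {
        for (Map.Entry<String, PartFile> e : snapshot.entrySet()) {
            if (e.getValue().state == State.PENDING) {
                live.put(e.getKey(), new PartFile(State.FINISHED, e.getValue().length));
            }
        }
    }

    // Restore: pending files become finished; in-progress files are cut back
    // to the length they had at the snapshot.
    void restore() {
        Map<String, PartFile> restored = new LinkedHashMap<>();
        for (Map.Entry<String, PartFile> e : snapshot.entrySet()) {
            PartFile f = e.getValue();
            State s = (f.state == State.PENDING) ? State.FINISHED : f.state;
            restored.put(e.getKey(), new PartFile(s, f.length));
        }
        live = restored;
    }

    State stateOf(String name) { return live.get(name).state; }
    long lengthOf(String name) { return live.get(name).length; }

    public static void main(String[] args) {
        PartFileLifecycleSketch sink = new PartFileLifecycleSketch();
        sink.write("part-0-0", 10);
        sink.close("part-0-0");      // part-0-0 is now pending
        sink.write("part-0-1", 5);
        sink.snapshot();             // checkpoint taken here
        sink.write("part-0-1", 7);   // data arriving after the checkpoint
        sink.restore();              // simulated failure and recovery
        System.out.println("part-0-0: " + sink.stateOf("part-0-0"));
        System.out.println("part-0-1: " + sink.stateOf("part-0-1")
                + ", length=" + sink.lengthOf("part-0-1"));
    }
}
```

In the example run, the 7 bytes written after the snapshot are discarded on restore, which is the property that keeps replayed records from being written twice.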
Modifier and Type | Class and Description |
---|---|
protected static class | StreamingFileSink.BucketsBuilder<IN,BucketID> The base abstract class for the StreamingFileSink.RowFormatBuilder and StreamingFileSink.BulkFormatBuilder. |
static class | StreamingFileSink.BulkFormatBuilder<IN,BucketID> A builder for configuring the sink for bulk-encoding formats. |
static class | StreamingFileSink.RowFormatBuilder<IN,BucketID> A builder for configuring the sink for row-wise encoding formats. |
Nested classes/interfaces inherited from interface SinkFunction: SinkFunction.Context<T>
Modifier | Constructor and Description |
---|---|
protected | StreamingFileSink(StreamingFileSink.BucketsBuilder<IN,?> bucketsBuilder, long bucketCheckInterval) Creates a new StreamingFileSink that writes files to the given base directory. |
Modifier and Type | Method and Description |
---|---|
void | close() |
static <IN> StreamingFileSink.BulkFormatBuilder<IN,String> | forBulkFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.BulkWriter.Factory<IN> writerFactory) Creates the builder for a StreamingFileSink with bulk-encoding format. |
static <IN> StreamingFileSink.RowFormatBuilder<IN,String> | forRowFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.Encoder<IN> encoder) Creates the builder for a StreamingFileSink with row-encoding format. |
void | initializeState(org.apache.flink.runtime.state.FunctionInitializationContext context) This method is called when the parallel function instance is created during distributed execution. |
void | invoke(IN value, SinkFunction.Context context) Writes the given value to the sink. |
void | notifyCheckpointComplete(long checkpointId) |
void | onProcessingTime(long timestamp) This method is invoked with the timestamp for which the trigger was scheduled. |
void | open(org.apache.flink.configuration.Configuration parameters) |
void | snapshotState(org.apache.flink.runtime.state.FunctionSnapshotContext context) This method is called when a snapshot for a checkpoint is requested. |
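Methods inherited from class org.apache.flink.api.common.functions.AbstractRichFunction: getIterationRuntimeContext, getRuntimeContext, setRuntimeContext
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface SinkFunction: invoke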
protected StreamingFileSink(StreamingFileSink.BucketsBuilder<IN,?> bucketsBuilder, long bucketCheckInterval)
Creates a new StreamingFileSink that writes files to the given base directory.
public static <IN> StreamingFileSink.RowFormatBuilder<IN,String> forRowFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.Encoder<IN> encoder)
Creates the builder for a StreamingFileSink with row-encoding format.
Type Parameters:
IN - the type of incoming elements
Parameters:
basePath - the base path where all the buckets are going to be created as sub-directories.
encoder - the Encoder to be used when writing elements in the buckets.
Returns:
A builder for the sink. To instantiate the sink, call StreamingFileSink.RowFormatBuilder.build() after specifying the desired parameters.
public static <IN> StreamingFileSink.BulkFormatBuilder<IN,String> forBulkFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.BulkWriter.Factory<IN> writerFactory)
Creates the builder for a StreamingFileSink with bulk-encoding format.
Type Parameters:
IN - the type of incoming elements
Parameters:
basePath - the base path where all the buckets are going to be created as sub-directories.
writerFactory - the BulkWriter.Factory to be used when writing elements in the buckets.
Returns:
A builder for the sink. To instantiate the sink, call StreamingFileSink.BulkFormatBuilder.build() after specifying the desired parameters.
public void initializeState(org.apache.flink.runtime.state.FunctionInitializationContext context) throws Exception
Description copied from interface: CheckpointedFunction
This method is called when the parallel function instance is created during distributed execution.
Specified by: initializeState in interface CheckpointedFunction
Parameters:
context - the context for initializing the operator
Throws:
Exception
public void notifyCheckpointComplete(long checkpointId) throws Exception
Specified by: notifyCheckpointComplete in interface org.apache.flink.runtime.state.CheckpointListener
Throws:
Exception
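public void snapshotState(org.apache.flink.runtime.state.FunctionSnapshotContext context) throws Exception
Description copied from interface: CheckpointedFunction
This method is called when a snapshot for a checkpoint is requested. It acts as a hook for the function to ensure that all state is exposed by means previously offered through FunctionInitializationContext when the Function was initialized, or offered now by FunctionSnapshotContext itself.
Specified by: snapshotState in interface CheckpointedFunction
Parameters:
context - the context for drawing a snapshot of the operator
Throws:
Exception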
public void open(org.apache.flink.configuration.Configuration parameters) throws Exception
Specified by: open in interface org.apache.flink.api.common.functions.RichFunction
Overrides: open in class org.apache.flink.api.common.functions.AbstractRichFunction
Throws:
Exception
public void onProcessingTime(long timestamp) throws Exception
Description copied from interface: ProcessingTimeCallback
This method is invoked with the timestamp for which the trigger was scheduled.
If the triggering is delayed for whatever reason (trigger timer was blocked, JVM stalled due to a garbage collection), the timestamp supplied to this function will still be the original timestamp for which the trigger was scheduled.
Specified by: onProcessingTime in interface ProcessingTimeCallback
Parameters:
timestamp - The timestamp for which the trigger event was scheduled.
Throws:
Exception
public void invoke(IN value, SinkFunction.Context context) throws Exception
Description copied from interface: SinkFunction
Writes the given value to the sink.
You have to override this method when implementing a SinkFunction; this is a default method for backward compatibility with the old-style method only.
Specified by: invoke in interface SinkFunction<IN>
Parameters:
value - The input record.
context - Additional context about the input record.
Throws:
Exception - This method may throw exceptions. Throwing an exception will cause the operation to fail and may trigger recovery.
Copyright © 2014–2019 The Apache Software Foundation. All rights reserved.