Class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>
- java.lang.Object
-
- org.apache.beam.sdk.io.Source<OutputT>
-
- org.apache.beam.sdk.io.UnboundedSource<OutputT,CheckpointMarkT>
-
- Type Parameters:
OutputT
- Type of records output by this source.CheckpointMarkT
- Type of checkpoint marks used by the readers of this source.
- All Implemented Interfaces:
java.io.Serializable
,HasDisplayData
public abstract class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark> extends Source<OutputT>
ASource
that reads an unbounded amount of input and, because of that, supports some additional operations such as checkpointing, watermarks, and record ids.- Checkpointing allows sources to not re-read the same data again in the case of failures.
- Watermarks allow for downstream parts of the pipeline to know up to what point in time the data is complete.
- Record ids allow for efficient deduplication of input records; many streaming sources do not guarantee that a given record will only be read a single time.
See
Window
andTrigger
for more information on timestamps and watermarks.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static interface
UnboundedSource.CheckpointMark
A marker representing the progress and state of anUnboundedSource.UnboundedReader
.static class
UnboundedSource.UnboundedReader<OutputT>
AReader
that reads an unbounded amount of input.-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
-
Constructor Summary
Constructors Constructor Description UnboundedSource()
-
Method Summary
All Methods Instance Methods Abstract Methods Concrete Methods Modifier and Type Method Description abstract UnboundedSource.UnboundedReader<OutputT>
createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark)
Create a newUnboundedSource.UnboundedReader
to read from this source, resuming from the given checkpoint if present.abstract Coder<CheckpointMarkT>
getCheckpointMarkCoder()
Returns aCoder
for encoding and decoding the checkpoints for this source.boolean
requiresDeduping()
Returns whether this source requires explicit deduping.abstract java.util.List<? extends UnboundedSource<OutputT,CheckpointMarkT>>
split(int desiredNumSplits, PipelineOptions options)
Returns a list ofUnboundedSource
objects representing the instances of this source that should be used when executing the workflow.-
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder, populateDisplayData, validate
-
-
-
-
Method Detail
-
split
public abstract java.util.List<? extends UnboundedSource<OutputT,CheckpointMarkT>> split(int desiredNumSplits, PipelineOptions options) throws java.lang.Exception
Returns a list ofUnboundedSource
objects representing the instances of this source that should be used when executing the workflow. Each split should return a separate partition of the input data.For example, for a source reading from a growing directory of files, each split could correspond to a prefix of file names.
Some sources are not splittable, such as reading from a single TCP stream. In that case, only a single split should be returned.
Some data sources automatically partition their data among readers. For these types of inputs,
n
identical replicas of the top-level source can be returned.The size of the returned list should be as close to
desiredNumSplits
as possible, but does not have to match exactly. A low number of splits will limit the amount of parallelism in the source.- Throws:
java.lang.Exception
-
createReader
public abstract UnboundedSource.UnboundedReader<OutputT> createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark) throws java.io.IOException
Create a newUnboundedSource.UnboundedReader
to read from this source, resuming from the given checkpoint if present.- Throws:
java.io.IOException
-
getCheckpointMarkCoder
public abstract Coder<CheckpointMarkT> getCheckpointMarkCoder()
Returns aCoder
for encoding and decoding the checkpoints for this source.
-
requiresDeduping
public boolean requiresDeduping()
Returns whether this source requires explicit deduping.This is needed if the underlying data source can return the same record multiple times, such a queuing system with a pull-ack model. Sources where the records read are uniquely identified by the persisted state in the CheckpointMark do not need this.
Generally, if
UnboundedSource.CheckpointMark.finalizeCheckpoint()
is overridden, this method should return true. Checkpoint finalization is best-effort, and readers can be resumed from a checkpoint that has not been finalized.
-
-