Type Parameters:
T - Type of records represented by the source.
public abstract class OffsetBasedSource<T> extends BoundedSource<T>
A Source that uses offsets to define starting and ending positions.
Extend this class to implement your own offset-based custom source.
FileBasedSource, a subclass of this class, adds functionality useful for
custom sources that are based on files. If possible, implementors should start from
FileBasedSource instead of OffsetBasedSource.
This is a common base class for all sources that use an offset range. It stores the range and implements splitting into bundles. This should be used for sources that can be cheaply read starting at any given offset.
Consult RangeTracker for important semantics common to all sources defined by a range
of positions of a certain type, including the semantics of split points
(OffsetBasedSource.OffsetBasedReader.isAtSplitPoint()).
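To make the idea of splitting an offset range into bundles concrete, here is a minimal, dependency-free sketch. It is not the SDK's implementation: the class `OffsetSplitSketch`, its `split` method, and the policy of folding a short trailing remainder into the previous bundle are all assumptions made for illustration. It also assumes `bytesPerOffset` is positive.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only (hypothetical class); not the SDK's OffsetBasedSource. */
final class OffsetSplitSketch {

  /** Splits [start, end) into consecutive half-open ranges, each returned as {from, to}. */
  static List<long[]> split(long start, long end, long minBundleSize,
                            long bytesPerOffset, long desiredBundleSizeBytes) {
    // Convert the byte-based request into offset units, never below 1 or minBundleSize.
    long bundleOffsets =
        Math.max(Math.max(1L, desiredBundleSizeBytes / bytesPerOffset), minBundleSize);

    List<long[]> bundles = new ArrayList<>();
    long from = start;
    while (from < end) {
      // Written this way to avoid overflow when end is close to Long.MAX_VALUE.
      long to = (end - from <= bundleOffsets) ? end : from + bundleOffsets;
      // Assumed policy: fold a trailing remainder shorter than minBundleSize into this bundle.
      if (end - to < minBundleSize) {
        to = end;
      }
      bundles.add(new long[] {from, to});
      from = to;
    }
    return bundles;
  }
}
```

Under these assumptions, a range shorter than `minBundleSize` still yields a single bundle covering the whole range, which matches the documented note that `minBundleSize` is not respected when the total range is smaller than it.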
| Modifier and Type | Class and Description |
|---|---|
| static class | OffsetBasedSource.OffsetBasedReader<T>: A Source.Reader that implements code common to readers of all OffsetBasedSources. |
Nested classes inherited from BoundedSource: BoundedSource.BoundedReader<T>
Nested classes inherited from Source: Source.Reader<T>
| Constructor and Description |
|---|
| OffsetBasedSource(long startOffset, long endOffset, long minBundleSize) |
| Modifier and Type | Method and Description |
|---|---|
| abstract OffsetBasedSource<T> | createSourceForSubrange(long start, long end): Returns an OffsetBasedSource for a subrange of the current source. |
| long | getBytesPerOffset(): Returns approximately how many bytes of data correspond to a single offset in this source. |
| long | getEndOffset(): Returns the specified ending offset of the source. |
| long | getEstimatedSizeBytes(PipelineOptions options): An estimate of the total size (in bytes) of the data that would be read from this source. |
| abstract long | getMaxEndOffset(PipelineOptions options): Returns the exact ending offset of the current source. |
| long | getMinBundleSize(): Returns the minimum bundle size that should be used when splitting the source into sub-sources. |
| long | getStartOffset(): Returns the starting offset of the source. |
| List<? extends OffsetBasedSource<T>> | splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options): Splits the source into bundles of approximately given size (in bytes). |
| String | toString() |
| void | validate(): Checks that this source is valid, before it can be used in a pipeline. |
Methods inherited from BoundedSource: createReader, producesSortedKeys
Methods inherited from Source: getDefaultOutputCoder
public OffsetBasedSource(long startOffset,
long endOffset,
long minBundleSize)
Parameters:
startOffset - starting offset (inclusive) of the source. Must be non-negative.
endOffset - ending offset (exclusive) of the source. Any offset >= getMaxEndOffset(), e.g., Long.MAX_VALUE, means the same as getMaxEndOffset(). Must be >= startOffset.
minBundleSize - minimum bundle size in offset units that should be used when splitting the source into sub-sources. This will not be respected if the total range of the source is smaller than the specified minBundleSize. Must be non-negative.
public long getStartOffset()
Returns the starting offset of the source.
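public long getEndOffset()
Returns the specified ending offset of the source. If this is >= getMaxEndOffset(), e.g. Long.MAX_VALUE, this implies getMaxEndOffset().
public long getMinBundleSize()
Returns the minimum bundle size that should be used when splitting the source into sub-sources. This is not respected if the total range of the source is smaller than the specified minBundleSize.
public long getEstimatedSizeBytes(PipelineOptions options) throws Exception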
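Description copied from class: BoundedSource
An estimate of the total size (in bytes) of the data that would be read from this source.
Specified by: getEstimatedSizeBytes in class BoundedSource<T>
Throws: Exception
public List<? extends OffsetBasedSource<T>> splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options) throws Exception
Description copied from class: BoundedSource
Splits the source into bundles of approximately given size (in bytes).
Specified by: splitIntoBundles in class BoundedSource<T>
Throws: Exception
public void validate()
Description copied from class: Source
Checks that this source is valid, before it can be used in a pipeline.
It is recommended to use Preconditions for implementing this method.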
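As an illustration of precondition-style validation, the sketch below checks the three constraints stated for the constructor arguments. It is only a sketch: `OffsetRangeChecks` and its `require` helper are hypothetical names, and the helper stands in for a Preconditions-style `checkArgument` call so that the example has no external dependencies.

```java
/** Illustrative stand-in for a source's stored range (hypothetical class); not an SDK class. */
final class OffsetRangeChecks {

  /** Throws IllegalArgumentException if any documented constraint is violated. */
  static void validate(long startOffset, long endOffset, long minBundleSize) {
    require(startOffset >= 0,
        "startOffset must be non-negative, got " + startOffset);
    require(endOffset >= startOffset,
        "endOffset (" + endOffset + ") must be >= startOffset (" + startOffset + ")");
    require(minBundleSize >= 0,
        "minBundleSize must be non-negative, got " + minBundleSize);
  }

  // Local helper used in place of a library precondition check.
  private static void require(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

Failing fast with a descriptive message keeps an invalid range from surfacing later as a confusing error during splitting or reading.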
public long getBytesPerOffset()
Returns approximately how many bytes of data correspond to a single offset in this source. This value relates offsets to the byte-based arguments and results of getEstimatedSizeBytes(PipelineOptions) and splitIntoBundles(long, PipelineOptions).
public abstract long getMaxEndOffset(PipelineOptions options) throws Exception
Returns the exact ending offset of the current source. This is the value that applies when the source was constructed with an endOffset of Long.MAX_VALUE.
Throws: Exception
public abstract OffsetBasedSource<T> createSourceForSubrange(long start, long end)
Returns an OffsetBasedSource for a subrange of the current source. [start, end) will be within the range [startOffset, endOffset) of the current source.
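The relationships among the end offset, the maximum end offset, bytes per offset, and subranges can be sketched with plain arithmetic. This is a hedged model, not the SDK's code: `OffsetRangeMath` and its methods are hypothetical, the maximum end offset is passed in as a number rather than computed from PipelineOptions, and clamping the size estimate at zero is an assumption.

```java
/** Illustrative arithmetic on offset ranges (hypothetical class); not an SDK class. */
final class OffsetRangeMath {

  /** Any endOffset at or beyond maxEndOffset (e.g. Long.MAX_VALUE) is treated as maxEndOffset. */
  static long effectiveEnd(long endOffset, long maxEndOffset) {
    return Math.min(endOffset, maxEndOffset);
  }

  /** Byte estimate for [startOffset, effective end), scaled by bytesPerOffset. */
  static long estimatedSizeBytes(long startOffset, long endOffset,
                                 long maxEndOffset, long bytesPerOffset) {
    long offsets = effectiveEnd(endOffset, maxEndOffset) - startOffset;
    // Assumption: an empty or inverted range is reported as zero bytes.
    return Math.max(0L, offsets) * bytesPerOffset;
  }

  /** True when [start, end) lies inside [startOffset, endOffset). */
  static boolean isValidSubrange(long start, long end, long startOffset, long endOffset) {
    return startOffset <= start && start <= end && end <= endOffset;
  }
}
```

For example, a source constructed with an end offset of `Long.MAX_VALUE` whose maximum end offset is 500 would, in this model, behave as if its range ended at 500.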