Type Parameters: T - Type of records represented by the source.
public abstract class ByteOffsetBasedSource<T> extends Source<T>
FileBasedSource, which is a subclass of this class, adds functionality useful for custom
sources that are based on files. Where possible, implementors should start from
FileBasedSource instead of ByteOffsetBasedSource.
This is a common base class for all sources that use a byte offset range. It stores the range and implements splitting into bundles. This should be used for sources which can be cheaply read starting at any given byte offset.
The byte offset range of the source is between startOffset (inclusive) and endOffset
(exclusive), i.e. [startOffset, endOffset). The source may include a record if
its offset is in the range [startOffset, endOffset), even if the record extends
past the range. The source does not include any record at an offset before this range, even if
the record extends into this range, because the source for the previous range will include it.
A source may choose to include records at offsets after this range. For example, a source may
choose to set offset boundaries based on blocks of records, in which case certain records may
start after endOffset. But for any given source type, the combined set of data read by two
sources for ranges [A, B) and [B, C) must be the same as the records read by a single source of
the same type for the range [A, C).
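
The inclusion rule above can be illustrated with a small toy model. This is only a sketch: `OffsetRangeDemo`, its `read` helper, and the record offsets are hypothetical names and values made up for the illustration, and the model covers only the simplest case in which a record belongs to the range that contains its starting offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the inclusion rule described above; not part of any SDK. */
class OffsetRangeDemo {

    /**
     * Returns the start offsets of the records that a source for [start, end)
     * would read: exactly those records that begin inside the range.
     */
    static List<Long> read(long[] recordStarts, long start, long end) {
        List<Long> included = new ArrayList<>();
        for (long offset : recordStarts) {
            if (offset >= start && offset < end) {
                included.add(offset);
            }
        }
        return included;
    }

    public static void main(String[] args) {
        // Hypothetical records beginning at bytes 0, 10, 25 and 30. The record at
        // byte 10 is imagined to be 15 bytes long, so it runs past the boundary at 20.
        long[] recordStarts = {0L, 10L, 25L, 30L};

        List<Long> left = read(recordStarts, 0, 20);   // keeps the record at 10
        List<Long> right = read(recordStarts, 20, 50); // does not repeat it

        List<Long> combined = new ArrayList<>(left);
        combined.addAll(right);

        // The two adjacent ranges together match the single range [0, 50).
        System.out.println(combined.equals(read(recordStarts, 0, 50))); // prints "true"
    }
}
```

Because ownership depends only on where a record starts, no record is read twice and none is dropped at the boundary between [A, B) and [B, C).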
| Modifier and Type | Class and Description |
|---|---|
| static class | ByteOffsetBasedSource.ByteOffsetBasedReader<T> A reader that implements code common to readers of all ByteOffsetBasedSources. |

Nested classes inherited from class Source: Source.Reader<T>

| Constructor and Description |
|---|
| ByteOffsetBasedSource(long startOffset, long endOffset, long minBundleSize) |

| Modifier and Type | Method and Description |
|---|---|
| abstract ByteOffsetBasedSource<T> | createSourceForSubrange(long start, long end) Returns a ByteOffsetBasedSource for a subrange of the current source. |
| long | getEndOffset() Returns the specified ending offset of the source. |
| abstract long | getMaxEndOffset(PipelineOptions options) Returns the exact ending offset of the current source. |
| long | getMinBundleSize() Returns the minimum bundle size that should be used when splitting the source into sub-sources. |
| long | getStartOffset() Returns the starting offset of the source. |
| java.util.List<? extends ByteOffsetBasedSource<T>> | splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options) Splits the source into bundles. |
| java.lang.String | toString() |
| void | validate() Checks that this source is valid before it can be used in a pipeline. |

Methods inherited from class Source: createBasicReader, createWindowedReader, getDefaultOutputCoder, getEstimatedSizeBytes, producesSortedKeys

public ByteOffsetBasedSource(long startOffset, long endOffset, long minBundleSize)
Parameters:
startOffset - starting byte offset (inclusive) of the source. Must be non-negative.
endOffset - ending byte offset (exclusive) of the source. Any offset >= getMaxEndOffset(), e.g., Long.MAX_VALUE, means the same as getMaxEndOffset(). Must be >= startOffset.
minBundleSize - minimum bundle size in bytes that should be used when splitting the source into sub-sources. This will not be respected if the total range of the source is smaller than the specified minBundleSize. Must be non-negative.

public long getStartOffset()

public long getEndOffset()
Returns the specified ending offset of the source. If this is >= getMaxEndOffset(), e.g. Long.MAX_VALUE, this implies getMaxEndOffset().

public long getMinBundleSize()
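Returns the minimum bundle size that should be used when splitting the source into sub-sources. This will not be respected if the total range of the source is smaller than the specified minBundleSize.

public java.util.List<? extends ByteOffsetBasedSource<T>> splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception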
Description copied from class Source: Splits the source into bundles. PipelineOptions can be used to get information such as credentials for accessing external storage.
Overrides: splitIntoBundles in class Source<T>
Throws: java.lang.Exception

public void validate()
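Description copied from class Source: Checks that this source is valid before it can be used in a pipeline. It is recommended to use Preconditions for implementing this method.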
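
As a rough illustration of precondition-style validation, the sketch below checks the three constraints stated for the constructor parameters (non-negative startOffset, endOffset >= startOffset, non-negative minBundleSize). `OffsetRangeSpec` is a hypothetical stand-in, not the SDK class, and the local `checkArgument` helper is an assumption used to keep the example self-contained; the Preconditions class mentioned above is presumably Guava's, whose `checkArgument` has a similar shape.

```java
/** Hypothetical stand-in for the three constructor arguments; not the SDK class. */
class OffsetRangeSpec {
    final long startOffset;
    final long endOffset;
    final long minBundleSize;

    OffsetRangeSpec(long startOffset, long endOffset, long minBundleSize) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.minBundleSize = minBundleSize;
    }

    /** Local helper (an assumption of this sketch) that fails fast with a message. */
    private static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    /** Checks the constraints documented for the constructor parameters. */
    void validate() {
        checkArgument(startOffset >= 0,
            "startOffset must be non-negative, got " + startOffset);
        checkArgument(endOffset >= startOffset,
            "endOffset (" + endOffset + ") must be >= startOffset (" + startOffset + ")");
        checkArgument(minBundleSize >= 0,
            "minBundleSize must be non-negative, got " + minBundleSize);
    }

    public static void main(String[] args) {
        // Long.MAX_VALUE is an accepted endOffset meaning "up to the maximum end offset".
        new OffsetRangeSpec(0, Long.MAX_VALUE, 1024).validate();

        try {
            new OffsetRangeSpec(10, 5, 0).validate(); // endOffset < startOffset
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing with a descriptive message at validation time surfaces a bad range before any bundle is read.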
public java.lang.String toString()
Overrides: toString in class java.lang.Object

public abstract long getMaxEndOffset(PipelineOptions options) throws java.lang.Exception
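Returns the exact ending offset of the current source. This is the value that applies when the source was given an endOffset >= getMaxEndOffset(), e.g. Long.MAX_VALUE.
Throws: java.lang.Exception

public abstract ByteOffsetBasedSource<T> createSourceForSubrange(long start, long end)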
Returns a ByteOffsetBasedSource for a subrange of the current source. [start, end) will
be within the range [startOffset, endOffset] of the current source.
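
To show how subranges might be derived before being handed to createSourceForSubrange, here is one plausible splitting scheme. It is an assumption made for illustration, not the splitting logic this class actually uses: `SubrangeSplitDemo` and `split` are hypothetical names, `maxEndOffset` stands in for the result of getMaxEndOffset(options), and the choice to fold a short tail into the last range is just one way to honor minBundleSize.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative splitting scheme; not the SDK's actual algorithm. */
class SubrangeSplitDemo {

    /** Returns half-open {start, end} pairs that cover the effective range. */
    static List<long[]> split(long startOffset, long endOffset, long maxEndOffset,
                              long minBundleSize, long desiredBundleSizeBytes) {
        // An endOffset >= maxEndOffset (e.g. Long.MAX_VALUE) means maxEndOffset.
        long end = Math.min(endOffset, maxEndOffset);
        // Never use a step smaller than minBundleSize, and never smaller than 1.
        long step = Math.max(1L, Math.max(desiredBundleSizeBytes, minBundleSize));

        List<long[]> ranges = new ArrayList<>();
        long start = startOffset;
        while (start < end) {
            // Compare against the remaining length so start + step cannot overflow.
            long rangeEnd = (end - start <= step) ? end : start + step;
            // Fold a tail shorter than minBundleSize into the current range.
            if (end - rangeEnd < minBundleSize) {
                rangeEnd = end;
            }
            ranges.add(new long[] {start, rangeEnd});
            start = rangeEnd;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Source declared as [0, Long.MAX_VALUE) whose data actually ends at byte 100.
        for (long[] r : split(0, Long.MAX_VALUE, 100, 16, 45)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Each resulting pair lies within the current source's range and adjacent pairs share a boundary, so together they cover the range exactly once.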