Class OffsetBasedSource<T>
- java.lang.Object
-
- org.apache.beam.sdk.io.Source<T>
-
- org.apache.beam.sdk.io.BoundedSource<T>
-
- org.apache.beam.sdk.io.OffsetBasedSource<T>
-
- Type Parameters:
T
- Type of records represented by the source.
- All Implemented Interfaces:
java.io.Serializable
,HasDisplayData
- Direct Known Subclasses:
FileBasedSource
public abstract class OffsetBasedSource<T> extends BoundedSource<T>
ABoundedSource
that uses offsets to define starting and ending positions.OffsetBasedSource
is a common base class for all bounded sources where the input can be represented as a single range, and an input can be efficiently processed in parallel by splitting the range into a set of disjoint ranges whose union is the original range. This class should be used for sources that can be cheaply read starting at any given offset.OffsetBasedSource
stores the range and implements splitting into bundles.Extend
OffsetBasedSource
to implement your own offset-based custom source.FileBasedSource
, which is a subclass of this, adds additional functionality useful for custom sources that are based on files. If possible implementors should start fromFileBasedSource
instead ofOffsetBasedSource
.Consult
RangeTracker
for important semantics common to all sources defined by a range of positions of a certain type, including the semantics of split points (OffsetBasedSource.OffsetBasedReader.isAtSplitPoint()
).- See Also:
BoundedSource
,FileBasedSource
,RangeTracker
, Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
OffsetBasedSource.OffsetBasedReader<T>
ASource.Reader
that implements code common to readers of allOffsetBasedSource
s.-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
-
Constructor Summary
Constructors Constructor Description OffsetBasedSource(long startOffset, long endOffset, long minBundleSize)
-
Method Summary
All Methods Instance Methods Abstract Methods Concrete Methods Modifier and Type Method Description abstract OffsetBasedSource<T>
createSourceForSubrange(long start, long end)
Returns anOffsetBasedSource
for a subrange of the current source.long
getBytesPerOffset()
Returns approximately how many bytes of data correspond to a single offset in this source.long
getEndOffset()
Returns the specified ending offset of the source.long
getEstimatedSizeBytes(PipelineOptions options)
An estimate of the total size (in bytes) of the data that would be read from this source.abstract long
getMaxEndOffset(PipelineOptions options)
Returns the actual ending offset of the current source.long
getMinBundleSize()
Returns the minimum bundle size that should be used when splitting the source into sub-sources.long
getStartOffset()
Returns the starting offset of the source.void
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.java.util.List<? extends OffsetBasedSource<T>>
split(long desiredBundleSizeBytes, PipelineOptions options)
Splits the source into bundles of approximatelydesiredBundleSizeBytes
.java.lang.String
toString()
void
validate()
Checks that this source is valid, before it can be used in a pipeline.-
Methods inherited from class org.apache.beam.sdk.io.BoundedSource
createReader
-
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder
-
-
-
-
Constructor Detail
-
OffsetBasedSource
public OffsetBasedSource(long startOffset, long endOffset, long minBundleSize)
- Parameters:
startOffset
- starting offset (inclusive) of the source. Must be non-negative.endOffset
- ending offset (exclusive) of the source. UseLong.MAX_VALUE
to indicate that the entire source afterstartOffset
should be read. Must be> startOffset
.minBundleSize
- minimum bundle size in offset units that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specifiedminBundleSize
. Must be non-negative.
-
-
Method Detail
-
getStartOffset
public long getStartOffset()
Returns the starting offset of the source.
-
getEndOffset
public long getEndOffset()
Returns the specified ending offset of the source. Any returned value greater than or equal togetMaxEndOffset(PipelineOptions)
should be treated asgetMaxEndOffset(PipelineOptions)
.
-
getMinBundleSize
public long getMinBundleSize()
Returns the minimum bundle size that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specifiedminBundleSize
.
-
getEstimatedSizeBytes
public long getEstimatedSizeBytes(PipelineOptions options) throws java.lang.Exception
Description copied from class:BoundedSource
An estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.If there is no way to estimate the size of the source implementations MAY return 0L.
- Specified by:
getEstimatedSizeBytes
in classBoundedSource<T>
- Throws:
java.lang.Exception
-
split
public java.util.List<? extends OffsetBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception
Description copied from class:BoundedSource
Splits the source into bundles of approximatelydesiredBundleSizeBytes
.- Specified by:
split
in classBoundedSource<T>
- Throws:
java.lang.Exception
-
validate
public void validate()
Description copied from class:Source
Checks that this source is valid, before it can be used in a pipeline.It is recommended to use
Preconditions
for implementing this method.
-
toString
public java.lang.String toString()
- Overrides:
toString
in classjava.lang.Object
-
getBytesPerOffset
public long getBytesPerOffset()
Returns approximately how many bytes of data correspond to a single offset in this source. Used for translation between this source's range and methods defined in terms of bytes, such asgetEstimatedSizeBytes(org.apache.beam.sdk.options.PipelineOptions)
andsplit(long, org.apache.beam.sdk.options.PipelineOptions)
.Defaults to
1
byte, which is the common case for, e.g., file sources.
-
getMaxEndOffset
public abstract long getMaxEndOffset(PipelineOptions options) throws java.lang.Exception
Returns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range[startOffset, endOffset)
such that the range used is[startOffset, min(endOffset, maxEndOffset))
.As an example in which
OffsetBasedSource
is used to implement a file source, suppose that this source was constructed with anendOffset
ofLong.MAX_VALUE
to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.- Throws:
java.lang.Exception
-
createSourceForSubrange
public abstract OffsetBasedSource<T> createSourceForSubrange(long start, long end)
Returns anOffsetBasedSource
for a subrange of the current source. The subrange[start, end)
must be within the range[startOffset, endOffset)
of the current source, i.e.startOffset <= start < end <= endOffset
.
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class:Source
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Overrides:
populateDisplayData
in classSource<T>
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
-