Class BlockBasedSource<T>
- java.lang.Object
-
- org.apache.beam.sdk.io.Source<T>
-
- org.apache.beam.sdk.io.BoundedSource<T>
-
- org.apache.beam.sdk.io.OffsetBasedSource<T>
-
- org.apache.beam.sdk.io.FileBasedSource<T>
-
- org.apache.beam.sdk.io.BlockBasedSource<T>
-
- Type Parameters:
T
- The type of records to be read from the source.
- All Implemented Interfaces:
java.io.Serializable
,HasDisplayData
- Direct Known Subclasses:
AvroSource
@Experimental(SOURCE_SINK) public abstract class BlockBasedSource<T> extends FileBasedSource<T>
ABlockBasedSource
is aFileBasedSource
where a file consists of blocks of records.BlockBasedSource
should be derived from when a file format does not support efficient seeking to a record in the file, but can support efficient seeking to a block. Alternatively, records in the file cannot be offset-addressed, but blocks can (it is not possible to say that record {code i} starts at offsetm
, but it is possible to say that blockj
starts at offsetn
).The records that will be read from a
BlockBasedSource
that corresponds to a subrange of a file[startOffset, endOffset)
are those records such that the record is contained in a block that starts at offseti
, wherei >= startOffset
andi < endOffset
. In other words, a record will be read from the source if its first byte is contained in a block that begins within the range described by the source.This entails that it is possible to determine the start offsets of all blocks in a file.
Progress reporting for reading from a
BlockBasedSource
is inaccurate. ABlockBasedSource.BlockBasedReader
reports its current offset as(offset of current block) + (current block size) * (fraction of block consumed)
. However, only the offset of the current block is required to be accurately reported by subclass implementations. As such, in the worst case, the current offset is only updated at block boundaries.BlockBasedSource
supports dynamic splitting. However, because records in aBlockBasedSource
are not required to have offsets and progress reporting is inaccurate,BlockBasedReader
only supports splitting at block boundaries. In other words,BlockBasedSource.BlockBasedReader.atSplitPoint
returns true iff the current record is the first record in a block. SeeFileBasedSource.FileBasedReader
for discussion about split points.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description protected static class
BlockBasedSource.Block<T>
ABlock
represents a block of records that can be read.protected static class
BlockBasedSource.BlockBasedReader<T>
AReader
that reads records from aBlockBasedSource
.-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
-
Constructor Summary
Constructors Constructor Description BlockBasedSource(java.lang.String fileOrPatternSpec, long minBundleSize)
LikeBlockBasedSource(String, EmptyMatchTreatment, long)
but with a defaultEmptyMatchTreatment
ofEmptyMatchTreatment.DISALLOW
.BlockBasedSource(java.lang.String fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
Creates aBlockBasedSource
based on a file name or pattern.BlockBasedSource(MatchResult.Metadata metadata, long minBundleSize, long startOffset, long endOffset)
Creates aBlockBasedSource
for a single file.BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, long minBundleSize)
BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
-
Method Summary
All Methods Instance Methods Abstract Methods Modifier and Type Method Description protected abstract BlockBasedSource<T>
createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
Creates aBlockBasedSource
for the specified range in a single file.protected abstract BlockBasedSource.BlockBasedReader<T>
createSingleFileReader(PipelineOptions options)
Creates aBlockBasedReader
.-
Methods inherited from class org.apache.beam.sdk.io.FileBasedSource
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString, validate
-
Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
-
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder
-
-
-
-
Constructor Detail
-
BlockBasedSource
public BlockBasedSource(java.lang.String fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
Creates aBlockBasedSource
based on a file name or pattern. Subclasses must call this constructor when creating aBlockBasedSource
for a file pattern. SeeFileBasedSource
for more information.
-
BlockBasedSource
public BlockBasedSource(java.lang.String fileOrPatternSpec, long minBundleSize)
LikeBlockBasedSource(String, EmptyMatchTreatment, long)
but with a defaultEmptyMatchTreatment
ofEmptyMatchTreatment.DISALLOW
.
-
BlockBasedSource
public BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, long minBundleSize)
-
BlockBasedSource
public BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
-
BlockBasedSource
public BlockBasedSource(MatchResult.Metadata metadata, long minBundleSize, long startOffset, long endOffset)
Creates aBlockBasedSource
for a single file. Subclasses must call this constructor when implementingcreateForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
. See documentation inFileBasedSource
.
-
-
Method Detail
-
createForSubrangeOfFile
protected abstract BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
Creates aBlockBasedSource
for the specified range in a single file.- Specified by:
createForSubrangeOfFile
in classFileBasedSource<T>
- Parameters:
metadata
- file backing the newFileBasedSource
.start
- starting byte offset of the newFileBasedSource
.end
- ending byte offset of the newFileBasedSource
. May be Long.MAX_VALUE, in which case it will be inferred usingFileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.
-
createSingleFileReader
protected abstract BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
Creates aBlockBasedReader
.- Specified by:
createSingleFileReader
in classFileBasedSource<T>
-
-