Class AvroSource.AvroReader<T>
- java.lang.Object
-
- org.apache.beam.sdk.io.Source.Reader<T>
-
- org.apache.beam.sdk.io.BoundedSource.BoundedReader<T>
-
- org.apache.beam.sdk.io.OffsetBasedSource.OffsetBasedReader<T>
-
- org.apache.beam.sdk.io.FileBasedSource.FileBasedReader<T>
-
- org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T>
-
- org.apache.beam.sdk.io.AvroSource.AvroReader<T>
-
- Type Parameters:
T
- The type of records contained in the block.
- All Implemented Interfaces:
java.lang.AutoCloseable
- Enclosing class:
- AvroSource<T>
@Experimental(SOURCE_SINK) public static class AvroSource.AvroReader<T> extends BlockBasedSource.BlockBasedReader<T>
ABlockBasedSource.BlockBasedReader
for reading blocks from Avro files.An Avro Object Container File consists of a header followed by a 16-bit sync marker and then a sequence of blocks, where each block begins with two encoded longs representing the total number of records in the block and the block's size in bytes, followed by the block's (optionally-encoded) records. Each block is terminated by a 16-bit sync marker.
-
-
Field Summary
-
Fields inherited from class org.apache.beam.sdk.io.BoundedSource.BoundedReader
SPLIT_POINTS_UNKNOWN
-
-
Constructor Summary
Constructors Constructor Description AvroReader(AvroSource<T> source)
Reads Avro records of typeT
from the specified source.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description org.apache.beam.sdk.io.AvroSource.AvroBlock<T>
getCurrentBlock()
Returns the current block (the block that was read by the last successful call toBlockBasedSource.BlockBasedReader.readNextBlock()
).long
getCurrentBlockOffset()
Returns the largest offset such that starting to read from that offset includes the current block.long
getCurrentBlockSize()
Returns the size of the current block in bytes as it is represented in the underlying file, if possible.AvroSource<T>
getCurrentSource()
Returns aSource
describing the same input that thisReader
currently reads (including items already read).long
getSplitPointsRemaining()
Returns the total amount of parallelism in the unprocessed part of this reader's currentBoundedSource
(as would be returned byBoundedSource.BoundedReader.getCurrentSource()
).boolean
readNextBlock()
Read the next block from the input.protected void
startReading(java.nio.channels.ReadableByteChannel channel)
Performs any initialization of the subclass ofFileBasedReader
that involves IO operations.-
Methods inherited from class org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader
getCurrent, getCurrentOffset, getFractionConsumed, isAtSplitPoint, readNextRecord
-
Methods inherited from class org.apache.beam.sdk.io.FileBasedSource.FileBasedReader
advanceImpl, allowsDynamicSplitting, close, startImpl
-
Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource.OffsetBasedReader
advance, getSplitPointsConsumed, isDone, isStarted, splitAtFraction, start
-
Methods inherited from class org.apache.beam.sdk.io.BoundedSource.BoundedReader
getCurrentTimestamp
-
-
-
-
Constructor Detail
-
AvroReader
public AvroReader(AvroSource<T> source)
Reads Avro records of typeT
from the specified source.
-
-
Method Detail
-
getCurrentSource
public AvroSource<T> getCurrentSource()
Description copied from class:BoundedSource.BoundedReader
Returns aSource
describing the same input that thisReader
currently reads (including items already read).Usage
Reader subclasses can use this method for convenience to access unchanging properties of the source being read. Alternatively, they can cache these properties in the constructor.
The framework will call this method in the course of dynamic work rebalancing, e.g. after a successful
BoundedSource.BoundedReader.splitAtFraction(double)
call.Mutability and thread safety
Remember that
Source
objects must always be immutable. However, the return value of this function may be affected by dynamic work rebalancing, happening asynchronously viaBoundedSource.BoundedReader.splitAtFraction(double)
, meaning it can return a differentSource
object. However, the returned object itself will still itself be immutable. Callers must take care not to rely on properties of the returned source that may be asynchronously changed as a result of this process (e.g. do not cache an end offset when reading a file).Implementation
For convenience, subclasses should usually return the most concrete subclass of
Source
possible. In practice, the implementation of this method should nearly always be one of the following:- Source that inherits from a base class that already implements
BoundedSource.BoundedReader.getCurrentSource()
: delegate to base class. In this case, it is almost always an error for the subclass to maintain its own copy of the source.public FooReader(FooSource<T> source) { super(source); } public FooSource<T> getCurrentSource() { return (FooSource<T>)super.getCurrentSource(); }
- Source that does not support dynamic work rebalancing: return a private final variable.
private final FooSource<T> source; public FooReader(FooSource<T> source) { this.source = source; } public FooSource<T> getCurrentSource() { return source; }
BoundedSource.BoundedReader
that explicitly supports dynamic work rebalancing: maintain a variable pointing to an immutable source object, and protect it with synchronization.private FooSource<T> source; public FooReader(FooSource<T> source) { this.source = source; } public synchronized FooSource<T> getCurrentSource() { return source; } public synchronized FooSource<T> splitAtFraction(double fraction) { ... FooSource<T> primary = ...; FooSource<T> residual = ...; this.source = primary; return residual; }
- Overrides:
getCurrentSource
in classFileBasedSource.FileBasedReader<T>
- Source that inherits from a base class that already implements
-
readNextBlock
public boolean readNextBlock()
Description copied from class:BlockBasedSource.BlockBasedReader
Read the next block from the input.- Specified by:
readNextBlock
in classBlockBasedSource.BlockBasedReader<T>
-
getCurrentBlock
public org.apache.beam.sdk.io.AvroSource.AvroBlock<T> getCurrentBlock()
Description copied from class:BlockBasedSource.BlockBasedReader
Returns the current block (the block that was read by the last successful call toBlockBasedSource.BlockBasedReader.readNextBlock()
). May return null initially, or if no block has been successfully read.- Specified by:
getCurrentBlock
in classBlockBasedSource.BlockBasedReader<T>
-
getCurrentBlockOffset
public long getCurrentBlockOffset()
Description copied from class:BlockBasedSource.BlockBasedReader
Returns the largest offset such that starting to read from that offset includes the current block.- Specified by:
getCurrentBlockOffset
in classBlockBasedSource.BlockBasedReader<T>
-
getCurrentBlockSize
public long getCurrentBlockSize()
Description copied from class:BlockBasedSource.BlockBasedReader
Returns the size of the current block in bytes as it is represented in the underlying file, if possible. This method may return0
if the size of the current block is unknown.The size returned by this method must be such that for two successive blocks A and B,
offset(A) + size(A) <= offset(B)
. If this is not satisfied, the progress reported by theBlockBasedReader
will be non-monotonic and will interfere with the quality (but not correctness) of dynamic work rebalancing.This method and
BlockBasedSource.Block.getFractionOfBlockConsumed()
are used to provide an estimate of progress within a block (getCurrentBlock().getFractionOfBlockConsumed() * getCurrentBlockSize()
). It is acceptable for the result of this computation to be0
, but progress estimation will be inaccurate.- Specified by:
getCurrentBlockSize
in classBlockBasedSource.BlockBasedReader<T>
-
getSplitPointsRemaining
public long getSplitPointsRemaining()
Description copied from class:BoundedSource.BoundedReader
Returns the total amount of parallelism in the unprocessed part of this reader's currentBoundedSource
(as would be returned byBoundedSource.BoundedReader.getCurrentSource()
). This corresponds to all unprocessed split point records (seeRangeTracker
), including the last split point returned, in the remainder part of the source.This function should be implemented only in addition to
BoundedSource.BoundedReader.getSplitPointsConsumed()
and only if an exact value can be returned.Consider the following examples: (1) An input that can be read in parallel down to the individual records, such as
CountingSource.upTo(long)
, is called "perfectly splittable". (2) a "block-compressed" file format such asAvroIO
, in which a block of records has to be read as a whole, but different blocks can be read in parallel. (3) An "unsplittable" input such as a cursor in a database.Assume for examples (1) and (2) that the number of records or blocks remaining is known:
- Any
reader
for which the last call toSource.Reader.start()
orSource.Reader.advance()
has returned true should should not return 0, because this reader itself represents parallelism at least 1. This condition holds independent of whether the input is splittable. - A finished reader (for which
Source.Reader.start()
orSource.Reader.advance()
) has returned false should return a value of 0. This condition holds independent of whether the input is splittable. - For example 1: After returning record #30 (starting at 1) out of 50 in a perfectly splittable 50-record input, this value should be 21 (20 remaining + 1 current) if the total number of records is known.
- For example 2: After returning a record in block 3 in a block-compressed file consisting of 5 blocks, this value should be 3 (since blocks 4 and 5 can be processed in parallel by new readers produced via dynamic work rebalancing, while the current reader continues processing block 3) if the total number of blocks is known.
- For example (3): a reader for any non-empty unsplittable input, should return 1 until it is finished, at which point it should return 0.
- For any reader: After returning the last split point in a file (e.g., the last record in example (1), the first record in the last block for example (2), or the first record in the file for example (3), this value should be 1: apart from the current task, no additional remainder can be split off.
Defaults to
BoundedSource.BoundedReader.SPLIT_POINTS_UNKNOWN
. Any value less than 0 will be interpreted as unknown.Thread safety
See the javadoc onBoundedSource.BoundedReader
for information about thread safety.- Overrides:
getSplitPointsRemaining
in classOffsetBasedSource.OffsetBasedReader<T>
- See Also:
BoundedSource.BoundedReader.getSplitPointsConsumed()
- Any
-
startReading
protected void startReading(java.nio.channels.ReadableByteChannel channel) throws java.io.IOException
Description copied from class:FileBasedSource.FileBasedReader
Performs any initialization of the subclass ofFileBasedReader
that involves IO operations. Will only be invoked once and before that invocation the base class will seek the channel to the source's starting offset.Provided
ReadableByteChannel
is for the file represented by the source of this reader. Subclass may use thechannel
to build a higher level IO abstraction, e.g., a BufferedReader or an XML parser.If the corresponding source is for a subrange of a file,
channel
is guaranteed to be an instance of the typeSeekableByteChannel
.After this method is invoked the base class will not be reading data from the channel or adjusting the position of the channel. But the base class is responsible for properly closing the channel.
- Specified by:
startReading
in classFileBasedSource.FileBasedReader<T>
- Parameters:
channel
- a byte channel representing the file backing the reader.- Throws:
java.io.IOException
-
-