Type Parameters:
T - The type of records contained in the block.
@Experimental(value=SOURCE_SINK)
public static class AvroSource.AvroReader<T> extends BlockBasedSource.BlockBasedReader<T>
A BlockBasedSource.BlockBasedReader for reading blocks from Avro files.
An Avro Object Container File consists of a header followed by a 16-byte sync marker and then a sequence of blocks, where each block begins with two encoded longs representing the total number of records in the block and the block's size in bytes, followed by the block's (optionally-encoded) records. Each block is terminated by a 16-byte sync marker.
Here, we consider the sync marker that precedes a block to be its offset, as this allows a reader that begins reading at that offset to detect the sync marker and the beginning of the block.
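The block layout described above can be sketched as a small parser. This is a minimal illustration, not the implementation used by AvroReader: the class AvroBlockLayoutSketch and all of its helpers are hypothetical names, the record payload is treated as opaque bytes, and the sketch assumes Avro's zig-zag variable-length encoding for the two longs and a 16-byte sync marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical sketch: parses one block of the layout described above. */
final class AvroBlockLayoutSketch {
  static final int SYNC_SIZE = 16; // sync marker length in bytes

  /** Count, payload size, and whether the trailing sync marker matched. */
  static final class BlockInfo {
    final long count;
    final long sizeBytes;
    final boolean syncOk;

    BlockInfo(long count, long sizeBytes, boolean syncOk) {
      this.count = count;
      this.sizeBytes = sizeBytes;
      this.syncOk = syncOk;
    }
  }

  /** Decodes one long stored as base-128 varint bytes with zig-zag sign encoding. */
  static long readLong(InputStream in) throws IOException {
    long raw = 0;
    int shift = 0;
    while (true) {
      int b = in.read();
      if (b < 0) throw new EOFException("truncated varint");
      raw |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) break; // high bit clear: last byte of this long
      shift += 7;
      if (shift > 63) throw new IOException("varint too long");
    }
    return (raw >>> 1) ^ -(raw & 1); // undo zig-zag
  }

  /** Encodes one long; used here only to build demo input. */
  static void writeLong(ByteArrayOutputStream out, long value) {
    long raw = (value << 1) ^ (value >> 63); // zig-zag
    while ((raw & ~0x7FL) != 0) {
      out.write((int) ((raw & 0x7F) | 0x80));
      raw >>>= 7;
    }
    out.write((int) raw);
  }

  /** Reads the two header longs, skips over the payload, and checks the trailing marker. */
  static BlockInfo readBlock(InputStream in, byte[] expectedSync) throws IOException {
    long count = readLong(in);
    long size = readLong(in);
    DataInputStream data = new DataInputStream(in);
    byte[] payload = new byte[(int) size]; // opaque in this sketch
    data.readFully(payload);
    byte[] sync = new byte[SYNC_SIZE];
    data.readFully(sync);
    return new BlockInfo(count, size, Arrays.equals(sync, expectedSync));
  }

  /** Builds a made-up block: 3 records, 5 payload bytes, then the sync marker. */
  static byte[] demoBlock(byte[] sync) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeLong(out, 3);
    writeLong(out, 5);
    out.write(new byte[] {1, 2, 3, 4, 5}, 0, 5);
    out.write(sync, 0, sync.length);
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] sync = new byte[SYNC_SIZE];
    Arrays.fill(sync, (byte) 0x5A);
    BlockInfo info = readBlock(new ByteArrayInputStream(demoBlock(sync)), sync);
    System.out.println(info.count + " records, " + info.sizeBytes + " bytes, sync ok: " + info.syncOk);
  }
}
```

A reader that starts at the sync marker preceding a block would first consume and verify those 16 bytes, then call something like readBlock for what follows.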
| Constructor and Description |
|---|
| AvroReader(AvroSource<T> source): Reads Avro records of type T from the specified source. |
| Modifier and Type | Method and Description |
|---|---|
| com.google.cloud.dataflow.sdk.io.AvroSource.AvroBlock<T> | getCurrentBlock(): Returns the current block (the block that was read by the last successful call to BlockBasedSource.BlockBasedReader.readNextBlock()). |
| long | getCurrentBlockOffset(): Returns the largest offset such that starting to read from that offset includes the current block. |
| long | getCurrentBlockSize(): Returns the size of the current block in bytes as it is represented in the underlying file, if possible. |
| AvroSource<T> | getCurrentSource(): Returns a Source describing the same input that this Reader currently reads (including items already read). |
| boolean | readNextBlock(): Read the next block from the input. |
| protected void | startReading(ReadableByteChannel channel): Starts reading from the provided channel. |
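Methods inherited from class BlockBasedSource.BlockBasedReader: getCurrent, getCurrentOffset, getFractionConsumed, isAtSplitPoint, readNextRecord
Methods inherited from class FileBasedSource.FileBasedReader: advanceImpl, close, startImpl
Methods inherited from class OffsetBasedSource.OffsetBasedReader: advance, splitAtFraction, start
Methods inherited from class BoundedSource.BoundedReader: getCurrentTimestamp
public AvroReader(AvroSource<T> source)
Reads Avro records of type T from the specified source.
public AvroSource<T> getCurrentSource()
Description copied from class: BoundedSource.BoundedReader
Returns a Source describing the same input that this Reader currently reads (including items already read).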
Reader subclasses can use this method for convenience to access unchanging properties of the source being read. Alternatively, they can cache these properties in the constructor.
The framework will call this method in the course of dynamic work rebalancing, e.g. after
a successful BoundedSource.BoundedReader.splitAtFraction(double) call.
Source objects must always be immutable. However, the return value of
this function may be affected by dynamic work rebalancing, happening asynchronously via
BoundedSource.BoundedReader.splitAtFraction(double), meaning it can return a different
Source object. The returned object itself will still be immutable.
Callers must take care not to rely on properties of the returned source that may be
asynchronously changed as a result of this process (e.g. do not cache an end offset when
reading a file).
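The caution about cached end offsets can be illustrated with a toy reader. Everything here is hypothetical and stands in for the real SDK classes: RangeSource is an immutable source over offsets [start, end), RangeReader replaces its source when split, and splitAt takes an absolute offset rather than a fraction to keep the sketch short. The point is only that advance() re-reads the end offset through getCurrentSource() on every call.

```java
/** Hypothetical sketch: consult the current source each time instead of caching its end offset. */
final class EndOffsetSketch {
  /** Immutable source covering offsets [start, end). */
  static final class RangeSource {
    final long start;
    final long end;

    RangeSource(long start, long end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Reader whose source may be swapped for a smaller one by a split. */
  static final class RangeReader {
    private RangeSource source;
    private long next; // next offset to be read

    RangeReader(RangeSource source) {
      this.source = source;
      this.next = source.start;
    }

    synchronized RangeSource getCurrentSource() {
      return source;
    }

    /** Shrinks this reader to [start, splitOffset) and returns [splitOffset, end), or null if refused. */
    synchronized RangeSource splitAt(long splitOffset) {
      if (splitOffset < next || splitOffset >= source.end) {
        return null; // would drop already-read offsets or leave an empty residual
      }
      RangeSource residual = new RangeSource(splitOffset, source.end);
      source = new RangeSource(source.start, splitOffset);
      return residual;
    }

    /** Looks up the end offset on every call, so a split takes effect immediately. */
    synchronized boolean advance() {
      if (next >= getCurrentSource().end) {
        return false;
      }
      next++;
      return true;
    }
  }

  /** Advances until the reader is exhausted and returns how many offsets were read. */
  static int countRemaining(RangeReader reader) {
    int n = 0;
    while (reader.advance()) {
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    RangeReader reader = new RangeReader(new RangeSource(0, 10));
    for (int i = 0; i < 3; i++) {
      reader.advance();
    }
    RangeSource residual = reader.splitAt(6);
    System.out.println("residual starts at " + residual.start);
    System.out.println("offsets left for this reader: " + countRemaining(reader));
  }
}
```

Had the reader copied the end offset into a local field at construction time, it would have kept reading into the range that splitAt handed to the residual.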
Subclasses should usually return the most concrete subclass of Source possible.
In practice, the implementation of this method should nearly always be one of the following:
A reader that inherits from a base class that already implements BoundedSource.BoundedReader.getCurrentSource(): delegate to the base class. In this case, it is almost always
an error for the subclass to maintain its own copy of the source.
    public FooReader(FooSource<T> source) {
      super(source);
    }
    public FooSource<T> getCurrentSource() {
      return (FooSource<T>) super.getCurrentSource();
    }
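A reader that keeps its own reference and does not support dynamic work rebalancing: store the source in a final field and return it.
    private final FooSource<T> source;
    public FooReader(FooSource<T> source) {
      this.source = source;
    }
    public FooSource<T> getCurrentSource() {
      return source;
    }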
BoundedSource.BoundedReader that explicitly supports dynamic work rebalancing:
maintain a variable pointing to an immutable source object, and protect it with
synchronization.
    private FooSource<T> source;
    public FooReader(FooSource<T> source) {
      this.source = source;
    }
    public synchronized FooSource<T> getCurrentSource() {
      return source;
    }
    public synchronized FooSource<T> splitAtFraction(double fraction) {
      ...
      FooSource<T> primary = ...;
      FooSource<T> residual = ...;
      this.source = primary;
      return residual;
    }
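getCurrentSource in class FileBasedSource.FileBasedReader<T>
public boolean readNextBlock() throws IOException
Description copied from class: BlockBasedSource.BlockBasedReader
Read the next block from the input.
readNextBlock in class BlockBasedSource.BlockBasedReader<T>
Throws: IOException
public com.google.cloud.dataflow.sdk.io.AvroSource.AvroBlock<T> getCurrentBlock()
Description copied from class: BlockBasedSource.BlockBasedReader
Returns the current block (the block that was read by the last successful call to BlockBasedSource.BlockBasedReader.readNextBlock()). May return null initially, or if no block has been successfully read.
getCurrentBlock in class BlockBasedSource.BlockBasedReader<T>
public long getCurrentBlockOffset()
Description copied from class: BlockBasedSource.BlockBasedReader
Returns the largest offset such that starting to read from that offset includes the current block.
getCurrentBlockOffset in class BlockBasedSource.BlockBasedReader<T>
public long getCurrentBlockSize()
Description copied from class: BlockBasedSource.BlockBasedReader
Returns the size of the current block in bytes as it is represented in the underlying file, if possible. Returns 0 if the size of the current block is unknown.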
The size returned by this method must be such that for two successive blocks A and B,
offset(A) + size(A) <= offset(B). If this is not satisfied, the progress reported
by the BlockBasedReader will be non-monotonic and will interfere with the quality
(but not correctness) of dynamic work rebalancing.
This method and BlockBasedSource.Block.getFractionOfBlockConsumed() are used to provide an estimate
of progress within a block (getCurrentBlock().getFractionOfBlockConsumed() *
getCurrentBlockSize()). It is acceptable for the result of this computation to be 0,
but progress estimation will be inaccurate.
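The two conditions above can be written out as small helpers. This is a sketch under stated assumptions, not the SDK's code: BlockProgressSketch and its methods are hypothetical names, and adding the block offset to the within-block estimate is an assumption about how an absolute position could be formed; the text itself only gives the within-block product.

```java
/** Hypothetical sketch of the progress arithmetic described for getCurrentBlockSize(). */
final class BlockProgressSketch {
  /** Progress within a block: fraction of the block consumed times the block size. */
  static double progressWithinBlock(double fractionOfBlockConsumed, long blockSize) {
    return fractionOfBlockConsumed * blockSize;
  }

  /** Assumed absolute position: block offset plus the progress within that block. */
  static double estimatedPosition(long blockOffset, long blockSize, double fractionOfBlockConsumed) {
    return blockOffset + progressWithinBlock(fractionOfBlockConsumed, blockSize);
  }

  /** Checks offset(A) + size(A) <= offset(B) for every pair of successive blocks. */
  static boolean sizesKeepProgressMonotonic(long[] offsets, long[] sizes) {
    for (int i = 0; i + 1 < offsets.length; i++) {
      if (offsets[i] + sizes[i] > offsets[i + 1]) {
        return false; // the end of block i would be reported past the start of block i + 1
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long[] offsets = {0, 100, 250};
    long[] sizes = {100, 140, 80};
    System.out.println("monotonic: " + sizesKeepProgressMonotonic(offsets, sizes));
    System.out.println("position halfway through block 1: " + estimatedPosition(100, 140, 0.5));
    // With an unknown size (0), the estimate stays at the block offset for the whole block.
    System.out.println("position with unknown size: " + estimatedPosition(100, 0, 0.5));
  }
}
```

A size of 0 keeps the estimate valid but flat across the block, which matches the note that the result is acceptable while progress estimation becomes inaccurate.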
getCurrentBlockSize in class BlockBasedSource.BlockBasedReader<T>
protected void startReading(ReadableByteChannel channel) throws IOException
Starts reading from the provided channel.
startReading in class FileBasedSource.FileBasedReader<T>
Parameters: channel - a byte channel representing the file backing the reader.
Throws: IOException