@Experimental(value=SOURCE_SINK) public abstract static class BoundedSource.BoundedReader<T> extends Source.Reader<T>
Reader that reads a bounded amount of input and supports some additional
operations, such as progress estimation and dynamic work rebalancing.
Once Source.Reader.start() or Source.Reader.advance() has returned false, neither will be called
again on this object.
splitAtFraction(double),
getFractionConsumed() and getCurrentSource(), which can be called concurrently
from a different thread. There will not be multiple concurrent calls to
splitAtFraction(double) but there can be for getFractionConsumed() if
splitAtFraction(double) is implemented.
If the source does not implement splitAtFraction(double), you do not need to worry about
thread safety. If implemented, it must be safe to call splitAtFraction(double) and
getFractionConsumed() concurrently with other methods.
Additionally, a successful splitAtFraction(double) call must, by definition, cause
getCurrentSource() to start returning a different value.
Callers of getCurrentSource() need to be aware of the possibility that the returned
value can change at any time, and must only access the properties of the source returned by
getCurrentSource() which do not change between splitAtFraction(double) calls.
splitAtFraction(double)splitAtFraction(double)
may be called concurrently with Source.Reader.advance() or Source.Reader.start(). It is critical that
their interaction is implemented in a thread-safe way, otherwise data loss is possible.
Sources which support dynamic work rebalancing should use
RangeTracker to manage the (source-specific)
range of positions that is being split. If your source supports dynamic work rebalancing,
please use that class to implement it if possible; if not possible, please contact the team
at [email protected].
| Constructor and Description |
|---|
BoundedReader() |
| Modifier and Type | Method and Description |
|---|---|
abstract BoundedSource<T> |
getCurrentSource()
Returns a
Source describing the same input that this Reader currently reads
(including items already read). |
Instant |
getCurrentTimestamp()
By default, returns the minimum possible timestamp.
|
Double |
getFractionConsumed()
Returns a value in [0, 1] representing approximately what fraction of the
current source this reader has read so far, or null if such
an estimate is not available. |
BoundedSource<T> |
splitAtFraction(double fraction)
Tells the reader to narrow the range of the input it's going to read and give up
the remainder, so that the new range would contain approximately the given
fraction of the amount of data in the current range.
|
advance, close, getCurrent, startpublic Double getFractionConsumed()
current source this reader has read so far, or null if such
an estimate is not available.
It is recommended that this method should satisfy the following properties:
Source.Reader.start() call.
Source.Reader.start() or Source.Reader.advance() call that returns false.
By default, returns null to indicate that this cannot be estimated.
splitAtFraction(double) is implemented, this method can be called concurrently to other
methods (including itself), and it is therefore critical for it to be implemented
in a thread-safe way.public abstract BoundedSource<T> getCurrentSource()
Source describing the same input that this Reader currently reads
(including items already read).
Reader subclasses can use this method for convenience to access unchanging properties of the source being read. Alternatively, they can cache these properties in the constructor.
The framework will call this method in the course of dynamic work rebalancing, e.g. after
a successful splitAtFraction(double) call.
Source objects must always be immutable. However, the return value of
this function may be affected by dynamic work rebalancing, happening asynchronously via
splitAtFraction(double), meaning it can return a different
Source object. However, the returned object itself will still itself be immutable.
Callers must take care not to rely on properties of the returned source that may be
asynchronously changed as a result of this process (e.g. do not cache an end offset when
reading a file).
Source possible.
In practice, the implementation of this method should nearly always be one of the following:
getCurrentSource(): delegate to base class. In this case, it is almost always
an error for the subclass to maintain its own copy of the source.
public FooReader(FooSource<T> source) {
super(source);
}
public FooSource<T> getCurrentSource() {
return (FooSource<T>)super.getCurrentSource();
}
private final FooSource<T> source;
public FooReader(FooSource<T> source) {
this.source = source;
}
public FooSource<T> getCurrentSource() {
return source;
}
BoundedSource.BoundedReader that explicitly supports dynamic work rebalancing:
maintain a variable pointing to an immutable source object, and protect it with
synchronization.
private FooSource<T> source;
public FooReader(FooSource<T> source) {
this.source = source;
}
public synchronized FooSource<T> getCurrentSource() {
return source;
}
public synchronized FooSource<T> splitAtFraction(double fraction) {
...
FooSource<T> primary = ...;
FooSource<T> residual = ...;
this.source = primary;
return residual;
}
getCurrentSource in class Source.Reader<T>public BoundedSource<T> splitAtFraction(double fraction)
Returns a BoundedSource representing the remainder.
BoundedSource<T> initial = reader.getCurrentSource();
BoundedSource<T> residual = reader.splitAtFraction(fraction);
BoundedSource<T> primary = reader.getCurrentSource();
This method should return null if the split cannot be performed for this fraction
while satisfying the semantics above. E.g., a reader that reads a range of offsets
in a file should return null if it is already past the position in its range
corresponding to the given fraction. In this case, the method MUST have no effect
(the reader must behave as if the method hadn't been called at all).
It is also very important that this method always completes quickly. In particular, it should not perform or wait on any blocking operations such as I/O, RPCs etc. Violating this requirement may stall completion of the work item or even cause it to fail.
It is incorrect to make both this method and Source.Reader.start()/Source.Reader.advance()
synchronized, because those methods can perform blocking operations, and then
this method would have to wait for those calls to complete.
RangeTracker makes it easy to implement
this method safely and correctly.
By default, returns null to indicate that splitting is not possible.
public Instant getCurrentTimestamp() throws NoSuchElementException
getCurrentTimestamp in class Source.Reader<T>NoSuchElementException - if the reader is at the beginning of the input and
Source.Reader.start() or Source.Reader.advance() wasn't called, or if the last Source.Reader.start() or
Source.Reader.advance() returned false.