public interface RowBatchReader
RecordReader interface.
Classes that implement this interface must handle all aspects of reading data, creating vectors, handling projection and so on. That is, implementations of this interface are intended to be frameworks.
For most cases, a plugin probably wants to start from a
base implementation, such as the ManagedReader class,
which provides services for handling projection, setting up
the result set loader, handling schema smoothing, sharing
vectors across batches, etc.
Note that this interface reads a batch of rows, not
a single row. (The original RecordReader could be
confusing in this aspect.)
The expected lifecycle is:
1. open(): Allocate resources and set up the schema (if early schema).
2. output() and schemaVersion() to determine the initial version. If no schema is available on open (the reader is late-schema), return null for the output and -1 for the schema version. Else, return non-negative for the version number and an empty batch with a schema.
3. next() to retrieve the next record batch. Return true if a batch is available, false if EOF. There is no requirement to return a batch; the first call to next() can return false if no data is available.
4. output() and schemaVersion() to obtain the batch of records read, and to detect if the version of the schema is different from the previous batch.
5. close() when the reader is no longer needed. This may occur before next() returns false if an error occurs or a limit is reached.

If an error occurs, the reader can throw a RuntimeException from any method. A UserException is preferred to provide detailed information about the source of the problem.
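The lifecycle above can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not Drill's scan operator: `SketchBatchReader`, `ListReader` and `LifecycleSketch` are hypothetical names, a `List<String>` stands in for `VectorContainer` so the code runs without Drill on the classpath, and the initial schema check before the first next() is skipped for brevity. The meaning of a false result from open() is not spelled out here, so the sketch assumes it means there is nothing to read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the interface described above. A List<String>
// replaces VectorContainer so the sketch compiles without Drill.
interface SketchBatchReader {
  boolean open();
  boolean next();
  List<String> output();
  int schemaVersion();
  void close();
}

// Late-schema reader over an in-memory list: no schema until the first batch.
// batchSize is assumed to be positive.
class ListReader implements SketchBatchReader {
  private final List<String> rows;
  private final int batchSize;
  private int position;
  private List<String> batch;   // null until a batch has been read
  private int version = -1;     // -1 means "no schema yet"
  boolean closed;

  ListReader(List<String> rows, int batchSize) {
    this.rows = rows;
    this.batchSize = batchSize;
  }

  @Override public boolean open() { position = 0; return true; }

  @Override public boolean next() {
    if (position >= rows.size()) {
      return false;             // EOF, no batch produced
    }
    int end = Math.min(position + batchSize, rows.size());
    batch = new ArrayList<>(rows.subList(position, end));
    position = end;
    if (version < 0) {
      version = 0;              // schema becomes known with the first batch
    }
    return true;
  }

  @Override public List<String> output() { return batch; }
  @Override public int schemaVersion() { return version; }
  @Override public void close() { closed = true; batch = null; }
}

class LifecycleSketch {
  // Drives one reader through the lifecycle and collects every row it returns.
  static List<String> readAll(SketchBatchReader reader) {
    List<String> result = new ArrayList<>();
    // If open() throws, close() is not called, so open() sits outside the try.
    boolean opened = reader.open();
    try {
      // Assumption: a false result from open() means there is nothing to read.
      while (opened && reader.next()) {
        result.addAll(reader.output());
      }
    } finally {
      reader.close();           // also reached on an error or an early exit
    }
    return result;
  }

  public static void main(String[] args) {
    ListReader reader = new ListReader(Arrays.asList("a", "b", "c"), 2);
    System.out.println(readAll(reader));
  }
}
```

Placing close() in a finally block mirrors the statement that close() may be called before next() returns false when an error occurs or a limit is reached.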
| Modifier and Type | Method and Description |
|---|---|
| void | close(): Release resources. |
| boolean | defineSchema(): Called for the first reader within a scan. |
| String | name(): Name used when reporting errors. |
| boolean | next(): Read the next batch. |
| boolean | open(): Setup the record reader. |
| VectorContainer | output(): Return the container with the reader's output. |
| int | schemaVersion(): Return the version of the schema returned by output(). |
String name()
boolean open()
next(). Allocate resources here, not in the constructor.
Example: open files, allocate buffers, etc.
RuntimeException - for "hard" errors that should terminate
the query. UserException preferred to explain the problem
better than the scan operator can by guessing at the cause
boolean defineSchema()
This step is optional and is purely for performance.
boolean next()
true with an empty batch is valid, and is helpful on
the very first batch (returning schema only.) An empty batch
with a false return code indicates EOF and the batch
will be discarded. A non-empty batch along with a false
return result indicates a final, valid batch, but that EOF was
reached and no more data is available.
This somewhat complex protocol avoids the need to allocate a final batch just to find out that no more data is available; it allows EOF to be returned along with the final batch.
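The four combinations of return value and batch contents described above can be written as a small decision function. This is only a sketch of how a caller might interpret one next() call: `BatchAction` and `NextProtocolSketch` are hypothetical names, and the row count is passed in as a plain int rather than read from a real batch.

```java
// Hypothetical outcome of interpreting a single next() call.
enum BatchAction { EMIT_AND_CONTINUE, EMIT_AND_STOP, DISCARD_AND_STOP }

class NextProtocolSketch {
  // moreData is the value returned by next(); rowCount is the size of the batch.
  static BatchAction interpret(boolean moreData, int rowCount) {
    if (moreData) {
      // true: the batch is valid even when empty (for example, schema only),
      // and next() should be called again.
      return BatchAction.EMIT_AND_CONTINUE;
    }
    // false with rows: a final, valid batch. false without rows: plain EOF,
    // so the empty batch is discarded.
    return rowCount > 0 ? BatchAction.EMIT_AND_STOP : BatchAction.DISCARD_AND_STOP;
  }

  public static void main(String[] args) {
    System.out.println(interpret(true, 0));
    System.out.println(interpret(false, 100));
    System.out.println(interpret(false, 0));
  }
}
```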
true if more data may be available (and so
next() should be called again), false to indicate
that EOF was reached
RuntimeException - (UserException preferred) if an
error occurs that should fail the query.
VectorContainer output()
open(). If the data source
can provide a schema at open time, then the reader should provide an
empty batch with the schema set. The scanner will return this schema
downstream to inform other operators of the schema.
next() to retrieve
the batch produced by that call. (No call is made if next()
returns false.)
open() (optional)
or returns rows read after next() (required)
int schemaVersion()
output(). The schema
is assumed to start at -1 (no schema). The reader is free to use any
numbering system it likes as long as:
next(). Thus, two successive
calls to this method should return the same number if no next()
call lies between.
If the reader can return a schema on open (so-called "early schema"), then this method must return a non-negative version number, even if the schema happens to be empty (such as when reading an empty file).
However, if the reader cannot return a schema on open (so-called "late
schema"), then this method must return -1 (and output() must
return null) to indicate that no schema is available when called before the
first call to next().
No calls will be made to this method before open(), after
close(), or after next() returns false. The implementation
is thus not required to handle these cases.
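Because the reader may use any numbering system, a caller can only compare the current version with the previous one. The following sketch shows one way to detect a schema change between batches; `SchemaChangeTracker` is a hypothetical helper, not part of the interface, and it starts at -1 to match the "no schema" starting point described above.

```java
// Hypothetical helper that remembers the last schema version seen.
class SchemaChangeTracker {
  private int lastVersion = -1;   // the schema is assumed to start at -1

  // Pass the value of schemaVersion() after open() and after each next().
  // Returns true when the version differs from the one seen previously.
  boolean observe(int version) {
    boolean changed = version != lastVersion;
    lastVersion = version;
    return changed;
  }

  public static void main(String[] args) {
    SchemaChangeTracker tracker = new SchemaChangeTracker();
    System.out.println(tracker.observe(-1)); // late schema: still no schema
    System.out.println(tracker.observe(0));  // first schema arrives
    System.out.println(tracker.observe(0));  // same schema as before
  }
}
```

Comparing for inequality rather than for a specific increment keeps the caller independent of whatever numbering scheme the reader chooses.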
void close()
next() returns EOF. Release
all resources and close files. Guaranteed to be called if
open() returns normally; will not be called if open()
throws an exception.
RuntimeException - (UserException preferred) if an
error occurs that should fail the query.
Copyright © 2022 The Apache Software Foundation. All rights reserved.