public class AvroIO extends Object

PTransforms for reading and writing Avro files.
To read a PCollection from one or more Avro files, use AvroIO.Read, calling AvroIO.Read.from(java.lang.String) to specify the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>"), and optionally AvroIO.Read.named(java.lang.String) to specify the name of the pipeline step.

It is required to specify AvroIO.Read.withSchema(java.lang.Class<T>). To read specific records, such as Avro-generated classes, provide an Avro-generated class type. To read GenericRecords, provide either a Schema object or an Avro schema in a JSON-encoded string form. An exception will be thrown if a record doesn't match the specified schema.
For example:

 Pipeline p = ...;

 // A simple Read of a local file (only runs locally):
 PCollection<AvroAutoGenClass> records =
     p.apply(AvroIO.Read.from("/path/to/file.avro")
                        .withSchema(AvroAutoGenClass.class));

 // A Read from a GCS file (runs locally and via the Google Cloud
 // Dataflow service):
 Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
 PCollection<GenericRecord> records =
     p.apply(AvroIO.Read.named("ReadFromAvro")
                        .from("gs://my_bucket/path/to/records-*.avro")
                        .withSchema(schema));
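As noted above, withSchema also accepts an Avro schema given as a JSON-encoded string. A minimal sketch of that variant follows; the record name, field names, and file path here are illustrative assumptions, not taken from the original documentation:

```java
// Hypothetical JSON-encoded Avro schema. The "Purchase" record and its
// fields are made-up examples for illustration only.
String jsonSchema =
    "{\"type\": \"record\", \"name\": \"Purchase\", \"fields\": ["
    + "{\"name\": \"item\", \"type\": \"string\"},"
    + "{\"name\": \"amount\", \"type\": \"long\"}]}";

// Read GenericRecords using the JSON schema string instead of a
// Schema object (path is an assumed example).
PCollection<GenericRecord> purchases =
    p.apply(AvroIO.Read.named("ReadPurchases")
                       .from("gs://my_bucket/path/to/purchases-*.avro")
                       .withSchema(jsonSchema));
```

Records that do not conform to the given schema will cause an exception at read time, exactly as with the Schema-object form.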
To write a PCollection to one or more Avro files, use AvroIO.Write, calling AvroIO.Write.to(java.lang.String) to specify the path of the file to write to (e.g., a local filename or sharded filename pattern if running locally, or a Google Cloud Storage filename or sharded filename pattern of the form "gs://<bucket>/<filepath>"), and optionally AvroIO.Write.named(java.lang.String) to specify the name of the pipeline step.

It is required to specify AvroIO.Write.withSchema(java.lang.Class<T>). To write specific records, such as Avro-generated classes, provide an Avro-generated class type. To write GenericRecords, provide either a Schema object or a schema in a JSON-encoded string form. An exception will be thrown if a record doesn't match the specified schema.
For example:

 // A simple Write to a local file (only runs locally):
 PCollection<AvroAutoGenClass> records = ...;
 records.apply(AvroIO.Write.to("/path/to/file.avro")
                           .withSchema(AvroAutoGenClass.class));

 // A Write to a sharded GCS file (runs locally and via the Google Cloud
 // Dataflow service):
 Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
 PCollection<GenericRecord> records = ...;
 records.apply(AvroIO.Write.named("WriteToAvro")
                           .to("gs://my_bucket/path/to/numbers")
                           .withSchema(schema)
                           .withSuffix(".avro"));
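Read and Write compose into a complete pipeline in the usual way. The sketch below strings the two transforms together end to end; it assumes an Avro-generated class AvroAutoGenClass on the classpath and uses illustrative bucket paths, and the creation of the pipeline options is elided:

```java
// A minimal end-to-end sketch: read Avro records, then write them back
// out sharded. Paths and step names are assumed examples.
Pipeline p = Pipeline.create(options);  // options creation elided

p.apply(AvroIO.Read.named("ReadRecords")
                   .from("gs://my_bucket/input/records-*.avro")
                   .withSchema(AvroAutoGenClass.class))
 .apply(AvroIO.Write.named("WriteRecords")
                    .to("gs://my_bucket/output/records")
                    .withSchema(AvroAutoGenClass.class)
                    .withSuffix(".avro"));

p.run();
```

Because the same schema class is supplied on both sides, any record read successfully is guaranteed to conform when written back out.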
Permission requirements depend on the PipelineRunner that is used to execute the Dataflow job. Please refer to the documentation of corresponding PipelineRunners for more details.

Modifier and Type | Class and Description
---|---
static class | AvroIO.Read: A root PTransform that reads from an Avro file (or multiple Avro files matching a pattern) and returns a PCollection containing the decoding of each record.
static class | AvroIO.Write: A root PTransform that writes a PCollection to an Avro file (or multiple Avro files matching a sharding pattern).