public class TextIO extends Object
PTransforms for reading and writing text files.
To read a PCollection from one or more text files, use TextIO.Read. You can instantiate a
transform using TextIO.Read.from(String) to specify the path of the file(s) to read from (e.g.,
a local filename or filename pattern if running locally, or a Google Cloud Storage filename or
filename pattern of the form "gs://<bucket>/<filepath>"). You may optionally call
TextIO.Read.named(String) to specify the name of the pipeline step.
By default, TextIO.Read returns a PCollection of Strings, each corresponding to one line of an
input UTF-8 text file. To convert directly from the raw bytes (split into lines delimited by
'\n', '\r', or '\r\n') to another object of type T, supply a Coder<T> using
TextIO.Read.withCoder(Coder).
See the following examples:
Pipeline p = ...;

// A simple Read of a local file (only runs locally):
PCollection<String> lines =
    p.apply(TextIO.Read.from("/local/path/to/file.txt"));

// A fully-specified Read from a GCS file (runs locally and via the
// Google Cloud Dataflow service):
PCollection<Integer> numbers =
    p.apply(TextIO.Read.named("ReadNumbers")
                       .from("gs://my_bucket/path/to/numbers-*.txt")
                       .withCoder(TextualIntegerCoder.of()));
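
The line-delimiting rule described above ('\n', '\r', or '\r\n') can be illustrated without the SDK. The following is a minimal plain-Java sketch, not the SDK's implementation: the class LineSplitSketch and its methods are hypothetical names introduced here, and the integer decoding only approximates what a textual integer coder might do with each line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the SDK: imitates the documented
// line-delimiting rule and a per-line textual decode step.
public class LineSplitSketch {

    // Splits text into lines delimited by '\n', '\r', or '\r\n'.
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        // "\r\n" is listed first so it is consumed as a single delimiter
        // rather than as two separate line breaks.
        for (String line : text.split("\r\n|\r|\n")) {
            lines.add(line);
        }
        return lines;
    }

    // Decodes each line as a decimal integer (assumed coder behavior).
    static List<Integer> decodeIntegers(String text) {
        List<Integer> values = new ArrayList<>();
        for (String line : splitLines(text)) {
            values.add(Integer.parseInt(line));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("alpha\nbeta\r\ngamma\rdelta"));
        System.out.println(decodeIntegers("1\n2\r\n3"));
    }
}
```

Under these assumptions, a file that mixes the three delimiters still yields one element per line, which is the shape of the PCollection that TextIO.Read is described as returning.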
To write a PCollection to one or more text files, use TextIO.Write, calling
TextIO.Write.to(String) to specify the path of the file to write to (e.g., a local filename or
sharded filename pattern if running locally, or a Google Cloud Storage filename or sharded
filename pattern of the form "gs://<bucket>/<filepath>"). You can optionally name the resulting
transform using TextIO.Write.named(String), and you can use TextIO.Write.withCoder(Coder) to
specify the Coder to use to encode the Java values into text lines.
Any existing files with the same names as generated output files will be overwritten.
For example:
// A simple Write to a local file (only runs locally):
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("/path/to/file.txt"));

// A fully-specified Write to a sharded GCS file (runs locally and via the
// Google Cloud Dataflow service):
PCollection<Integer> numbers = ...;
numbers.apply(TextIO.Write.named("WriteNumbers")
                          .to("gs://my_bucket/path/to/numbers")
                          .withSuffix(".txt")
                          .withCoder(TextualIntegerCoder.of()));
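
To make the idea of a sharded output concrete, the sketch below generates the kind of file names a sharded write might produce. It is plain Java and does not call the SDK. It assumes a shard naming scheme of the form -SSSSS-of-NNNNN inserted between the prefix and the suffix, which this page does not state. The class ShardNameSketch and the shard count are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the SDK: shows what sharded output
// names could look like under an assumed "-SSSSS-of-NNNNN" scheme.
public class ShardNameSketch {

    // Builds names of the form <prefix>-<shard>-of-<total><suffix>,
    // with the shard index and total zero-padded to five digits.
    static List<String> shardNames(String prefix, String suffix, int numShards) {
        List<String> names = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            names.add(String.format(Locale.ROOT, "%s-%05d-of-%05d%s",
                    prefix, shard, numShards, suffix));
        }
        return names;
    }

    public static void main(String[] args) {
        // The shard count of 3 is an arbitrary value chosen for illustration.
        for (String name : shardNames("gs://my_bucket/path/to/numbers", ".txt", 3)) {
            System.out.println(name);
        }
    }
}
```

Because existing files with the same names as generated output files are overwritten, it may be worth checking which names a given prefix and suffix would map to before re-running a pipeline against the same location.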
When run using the DirectPipelineRunner, your pipeline can read and write text files on your
local drive and remote text files on Google Cloud Storage that you have access to using your
gcloud credentials. When running in the Dataflow service using DataflowPipelineRunner, the
pipeline can only read and write files from GCS. For more information about permissions, see the
Cloud Dataflow documentation on Security and Permissions.
Modifier and Type | Class and Description |
---|---|
static class | TextIO.CompressionType: Possible text file compression types. |
static class | TextIO.Read: A PTransform that reads from a text file (or multiple text files matching a pattern) and returns a PCollection containing the decoding of each of the lines of the text file(s). |
static class | TextIO.Write: A PTransform that writes a PCollection to a text file (or multiple text files matching a sharding pattern), with each element of the input collection encoded into its own line. |

Modifier and Type | Field and Description |
---|---|
static Coder<String> | DEFAULT_TEXT_CODER: The default coder, which returns each line of the input file as a string. |