public class TextIO extends java.lang.Object
PTransforms for reading and writing text files.
To read a
PCollection from one or more text files, use
TextIO.read() to instantiate a transform and use
TextIO.Read.from(String) to specify the path of the
file(s) to be read. Alternatively, if the filenames to be read are themselves in a
PCollection you can use
FileIO to match them and
readFiles() to read them.
read() returns a
PCollection of
Strings, each corresponding to
one line of an input UTF-8 text file (split into lines delimited by '\n', '\r', or '\r\n', or by a
specified delimiter; see TextIO.Read.withDelimiter(byte[])).
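The default line-splitting rule described above can be illustrated with a small standalone sketch. This is not Beam's implementation; the class name LineSplitSketch and the method splitLines are hypothetical, and the sketch works on an in-memory String rather than a file. It treats '\n', '\r', and '\r\n' each as a single delimiter and does not include the delimiter in the resulting line.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitSketch {
  /** Splits text on "\n", "\r", or "\r\n"; delimiters are not part of the returned lines. */
  public static List<String> splitLines(String text) {
    List<String> lines = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '\n' || c == '\r') {
        // Treat "\r\n" as one delimiter by skipping the '\n' that follows '\r'.
        if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
          i++;
        }
        lines.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    // A final line without a trailing delimiter still counts as a line.
    if (current.length() > 0) {
      lines.add(current.toString());
    }
    return lines;
  }

  public static void main(String[] args) {
    // Mixed delimiters in one input.
    System.out.println(splitLines("a\nb\r\nc\rd")); // prints [a, b, c, d]
  }
}
```

Each returned String corresponds to what would be one element of the output PCollection under the default delimiters.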
By default, the filepatterns are expanded only once.
TextIO.Read.watchForNewFiles(org.joda.time.Duration, org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition<java.lang.String, ?>) or the
combination of FileIO.Match#continuously(Duration, TerminationCondition) and
readFiles() allow streaming of new files matching the filepattern(s).
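By default, read() prohibits filepatterns that match no files, and
readFiles() allows them in case the filepattern contains a glob wildcard character. Use
TextIO.Read.withEmptyMatchTreatment(EmptyMatchTreatment) or
FileIO.Match.withEmptyMatchTreatment(EmptyMatchTreatment) plus
readFiles() to configure this behavior.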
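The rule for filepatterns that match no files can be sketched as a small decision function. This is a standalone illustration, not Beam code: the class EmptyMatchSketch and the method acceptsEmptyMatch are hypothetical, the enum constants are named after the values of Beam's EmptyMatchTreatment, and treating only '*' as the glob wildcard is an assumption made for brevity.

```java
public class EmptyMatchSketch {
  /** Hypothetical stand-in for the ways an empty match can be treated. */
  public enum Treatment { DISALLOW, ALLOW, ALLOW_IF_WILDCARD }

  /** Returns true if a filepattern that matched zero files should be accepted. */
  public static boolean acceptsEmptyMatch(String filepattern, Treatment treatment) {
    switch (treatment) {
      case ALLOW:
        return true;
      case ALLOW_IF_WILDCARD:
        // Assumption: only '*' is considered a wildcard in this sketch.
        return filepattern.contains("*");
      default:
        // DISALLOW: an empty match is always an error.
        return false;
    }
  }

  public static void main(String[] args) {
    // The behavior the text describes for read(): empty matches are rejected.
    System.out.println(acceptsEmptyMatch("/data/*.txt", Treatment.DISALLOW));
    // The behavior the text describes for readFiles(): allowed only with a wildcard.
    System.out.println(acceptsEmptyMatch("/data/*.txt", Treatment.ALLOW_IF_WILDCARD));
    System.out.println(acceptsEmptyMatch("/data/file.txt", Treatment.ALLOW_IF_WILDCARD));
  }
}
```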
Example 1: reading a file or filepattern.
Pipeline p = ...;

// A simple Read of a local file (only runs locally):
PCollection<String> lines = p.apply(TextIO.read().from("/local/path/to/file.txt"));
Example 2: reading a PCollection of filenames.
Pipeline p = ...;

// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;

// Read all files in the collection.
PCollection<String> lines =
    filenames
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());
Example 3: streaming new files matching a filepattern.
Pipeline p = ...;

PCollection<String> lines = p.apply(TextIO.read()
    .from("/local/path/to/files/*")
    .watchForNewFiles(
        // Check for new files every minute
        Duration.standardMinutes(1),
        // Stop watching the filepattern if no new files appear within an hour
        afterTimeSinceNewOutput(Duration.standardHours(1))));
If it is known that the filepattern will match a very large number of files (e.g. tens of
thousands or more), use
TextIO.Read.withHintMatchesManyFiles() for better performance and
scalability. Note that it may decrease performance if the filepattern matches only a small number
of files.
To write a PCollection to one or more text files, use TextIO.write(), using
TextIO.Write.to(String) to specify the output prefix of the files to write. For example:
// A simple Write to a local file (only runs locally):
PCollection<String> lines = ...;
lines.apply(TextIO.write().to("/path/to/file.txt"));

// Same as above, only with Gzip compression:
lines.apply(TextIO.write().to("/path/to/file.txt")
    .withSuffix(".txt")
    .withCompression(Compression.GZIP));
Any existing files with the same names as generated output files will be overwritten.
If you want better control over how filenames are generated than the default policy allows, a
custom FileBasedSink.FilenamePolicy can also be set using
TextIO.Write.to(FilenamePolicy).
TextIO supports all features of
FileIO.write() and FileIO.writeDynamic(),
such as writing windowed/unbounded data, writing data to multiple destinations, and so on, by
providing a TextIO.Sink via sink().
For example, to write events of different types to different filenames:
PCollection<Event> events = ...;
events.apply(FileIO.<EventType, Event>writeDynamic()
    .by(Event::getTypeName)
    .via(TextIO.sink(), Event::toString)
    .to(type -> nameFilesUsingWindowPaneAndShard(".../events/" + type + "/data", ".txt")));
|Modifier and Type||Class and Description|
TextIO.Write: This class is used as the default return value of write().
|Modifier and Type||Method and Description|
public static TextIO.Read read()
@Deprecated public static TextIO.ReadAll readAll()
Deprecated. Use FileIO matching plus
readFiles() instead. This is the preferred method to make composition explicit.
TextIO.ReadAll will not receive upgrades and will be removed in a future version of Beam.
A PTransform that works like
read(), but reads each file in a
PCollection of filepatterns.
Can be applied to both bounded and unbounded
PCollections, so this is
suitable for reading a
PCollection of filepatterns arriving as a stream. However, every
filepattern is expanded once at the moment it is processed, rather than watched for new files
matching the filepattern to appear. Likewise, every file is read once, rather than watched for
new entries.
public static TextIO.ReadFiles readFiles()
public static TextIO.Write write()
public static <UserT> TextIO.TypedWrite<UserT,java.lang.Void> writeCustomType()
A PTransform that writes a
PCollection to a text file (or multiple text files matching a sharding pattern), with each element of the input collection encoded into its own line.
This version allows you to apply
TextIO writes to a PCollection of a custom type
UserT. A format mechanism that converts the input type
UserT to the String that
will be written to the file must be specified. If using a custom
FileBasedSink.DynamicDestinations object, this is done using
FileBasedSink.DynamicDestinations.formatRecord(UserT); otherwise,
TextIO.TypedWrite.withFormatFunction(org.apache.beam.sdk.transforms.SerializableFunction<UserT, java.lang.String>) can be used to specify a format function.
The advantage of using a custom type is that it allows a user-provided
FileBasedSink.DynamicDestinations object, set via
Write#to(DynamicDestinations) to examine the
custom type when choosing a destination.
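To make the idea of a format function concrete, the following standalone sketch shows a function that turns one element of a custom type into the single line of text that would be written for it. Everything here is hypothetical: the Event class stands in for UserT, and java.util.function.Function is used in place of Beam's SerializableFunction so that the sketch runs without Beam on the classpath.

```java
import java.util.function.Function;

public class FormatFunctionSketch {
  /** Hypothetical custom element type standing in for UserT. */
  public static final class Event {
    final String type;
    final long timestampMillis;

    public Event(String type, long timestampMillis) {
      this.type = type;
      this.timestampMillis = timestampMillis;
    }
  }

  /** Formats one element as one line of text, without a trailing newline. */
  public static final Function<Event, String> FORMAT =
      e -> e.type + "," + e.timestampMillis;

  public static void main(String[] args) {
    System.out.println(FORMAT.apply(new Event("click", 1000L))); // prints click,1000
  }
}
```

A function of this shape is the kind of value a format function supplies; the custom type itself remains available for choosing a destination.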
public static TextIO.Sink sink()