public class Pipeline extends Object
A Pipeline manages a directed acyclic graph (DAG) of PTransforms and the
PCollections that the PTransforms consume and produce.
After a Pipeline has been constructed, it can be executed,
using a default or an explicit PipelineRunner.
Multiple Pipelines can be constructed and executed independently
and concurrently.
Each Pipeline is self-contained and isolated from any other
Pipeline. The PValues that are inputs and outputs of each of a
Pipeline's PTransforms are also owned by that Pipeline.
A PValue owned by one Pipeline can be read only by PTransforms
also owned by that Pipeline.
Here's a typical example of use:
```java
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();

// Then create the pipeline.
Pipeline p = Pipeline.create(options);

// A root PTransform, like TextIO.Read or Create, gets added
// to the Pipeline by being applied:
PCollection<String> lines =
    p.apply(TextIO.Read.from("gs://bucket/dir/file*.txt"));

// A Pipeline can have multiple root transforms:
PCollection<String> moreLines =
    p.apply(TextIO.Read.from("gs://bucket/other/dir/file*.txt"));
PCollection<String> yetMoreLines =
    p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of()));

// Further PTransforms can be applied, in an arbitrary (acyclic) graph.
// Subsequent PTransforms (and intermediate PCollections etc.) are
// implicitly part of the same Pipeline.
PCollection<String> allLines =
    PCollectionList.of(lines).and(moreLines).and(yetMoreLines)
        .apply(new Flatten<String>());
PCollection<KV<String, Integer>> wordCounts =
    allLines
        .apply(ParDo.of(new ExtractWords()))
        .apply(new Count<String>());
PCollection<String> formattedWordCounts =
    wordCounts.apply(ParDo.of(new FormatCounts()));
formattedWordCounts.apply(TextIO.Write.to("gs://bucket/dir/counts.txt"));

// PTransforms aren't executed when they're applied; they're
// just added to the Pipeline. Once the whole Pipeline of PTransforms
// is constructed, the Pipeline's PTransforms can be run using a
// PipelineRunner. The default PipelineRunner executes the Pipeline
// directly, sequentially, in this one process, which is useful for
// unit tests and simple experiments:
p.run();
```
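The deferred-execution behavior described in the closing comment of the example can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `DeferredPipelineSketch` and its methods are hypothetical names invented for illustration, they use plain Java collections rather than PCollections, and they do not reflect how the SDK is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model only: DeferredPipelineSketch is a hypothetical class,
// not part of the Dataflow SDK.
class DeferredPipelineSketch {
    // Steps recorded by apply(), in the order they were applied.
    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();

    // Records a step without executing it, and returns this for chaining.
    DeferredPipelineSketch apply(UnaryOperator<List<String>> step) {
        steps.add(step);
        return this;
    }

    // Executes every recorded step sequentially in the calling thread.
    List<String> run(List<String> input) {
        List<String> current = new ArrayList<>(input);
        for (UnaryOperator<List<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        DeferredPipelineSketch p = new DeferredPipelineSketch()
            .apply(lines -> {
                log.add("upper-case step executed");
                List<String> out = new ArrayList<>();
                for (String line : lines) {
                    out.add(line.toUpperCase());
                }
                return out;
            });

        // The step has only been recorded so far, so the log is still empty.
        System.out.println("executed before run(): " + !log.isEmpty());

        // Running the sketch is what actually invokes the recorded step.
        System.out.println(p.run(Arrays.asList("yet", "more", "lines")));
    }
}
```

The point of the model is that `apply` only mutates the recorded graph, so construction stays cheap and side-effect free until `run` is called.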
| Modifier and Type | Class and Description |
|---|---|
| static class | Pipeline.PipelineExecutionException: Thrown during pipeline execution, whenever user code within a pipeline throws an exception. |
| static interface | Pipeline.PipelineVisitor: A Pipeline.PipelineVisitor can be passed into traverseTopologically(Pipeline.PipelineVisitor) to be called for each of the transforms and values in the Pipeline. |
| Modifier | Constructor and Description |
|---|---|
| protected | Pipeline(PipelineRunner<?> runner): Deprecated. Replaced by Pipeline(PipelineRunner, PipelineOptions). |
| protected | Pipeline(PipelineRunner<?> runner, PipelineOptions options) |
| Modifier and Type | Method and Description |
|---|---|
| void | addValueInternal(PValue value): Adds the given PValue to this Pipeline. |
| <OutputT extends POutput> OutputT | apply(PTransform<? super PBegin,OutputT> root): Like apply(String, PTransform) but defaulting to the name of the PTransform. |
| <OutputT extends POutput> OutputT | apply(String name, PTransform<? super PBegin,OutputT> root) |
| static <InputT extends PInput,OutputT extends POutput> OutputT | applyTransform(InputT input, PTransform<? super InputT,OutputT> transform): Like applyTransform(String, PInput, PTransform) but defaulting to the name provided by the PTransform. |
| static <InputT extends PInput,OutputT extends POutput> OutputT | applyTransform(String name, InputT input, PTransform<? super InputT,OutputT> transform): Applies the given PTransform to this input InputT and returns its OutputT. |
| PBegin | begin(): Returns a PBegin owned by this Pipeline. |
| static Pipeline | create(PipelineOptions options): Constructs a pipeline from the provided options. |
| CoderRegistry | getCoderRegistry(): Returns the CoderRegistry that this Pipeline uses. |
| String | getFullNameForTesting(PTransform<?,?> transform): Deprecated. |
| PipelineOptions | getOptions(): Returns the configured pipeline options. |
| PipelineRunner<?> | getRunner(): Returns the configured pipeline runner. |
| PipelineResult | run(): Runs the Pipeline. |
| void | setCoderRegistry(CoderRegistry coderRegistry): Sets the CoderRegistry that this Pipeline uses. |
| String | toString() |
| void | traverseTopologically(Pipeline.PipelineVisitor visitor): Invokes the PipelineVisitor's Pipeline.PipelineVisitor.visitTransform(TransformTreeNode) and Pipeline.PipelineVisitor.visitValue(PValue, TransformTreeNode) operations on each of this Pipeline's PTransforms and PValues, in forward topological order. |
@Deprecated protected Pipeline(PipelineRunner<?> runner)
Deprecated. Replaced by Pipeline(PipelineRunner, PipelineOptions).
protected Pipeline(PipelineRunner<?> runner, PipelineOptions options)
public static Pipeline create(PipelineOptions options)
Constructs a pipeline from the provided options.
public PBegin begin()
Returns a PBegin owned by this Pipeline.
public <OutputT extends POutput> OutputT apply(PTransform<? super PBegin,OutputT> root)
Like apply(String, PTransform) but defaulting to the name of the PTransform.
public <OutputT extends POutput> OutputT apply(String name, PTransform<? super PBegin,OutputT> root)
Applies a root PTransform, such as TextIO.Read or Create, to this Pipeline.
This specific call to apply is identified by the provided name.
The name is used in various places, including the monitoring UI and logging,
and to stably identify this application node in the job graph.
Alias for begin().apply(name, root).
public PipelineResult run()
Runs the Pipeline.
public CoderRegistry getCoderRegistry()
Returns the CoderRegistry that this Pipeline uses.
public void setCoderRegistry(CoderRegistry coderRegistry)
Sets the CoderRegistry that this Pipeline uses.
public void traverseTopologically(Pipeline.PipelineVisitor visitor)
Invokes the PipelineVisitor's Pipeline.PipelineVisitor.visitTransform(TransformTreeNode) and
Pipeline.PipelineVisitor.visitValue(PValue, TransformTreeNode) operations on each of this
Pipeline's PTransforms and PValues, in forward topological order.
Traversal of the pipeline causes PTransform and PValue instances to be marked as finished, at which point they may no longer be modified.
Typically invoked by PipelineRunner subclasses.
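Forward topological order means that a node is visited only after every node that produces its inputs. The sketch below shows one common way to obtain such an order (Kahn's algorithm) and to freeze the graph once it has been traversed. It is a toy model under stated assumptions: `TopologicalTraversalSketch`, `addEdge`, and the string node names are hypothetical, and nothing here is taken from the SDK's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model only: TopologicalTraversalSketch is hypothetical and is
// not the SDK's implementation of traverseTopologically.
class TopologicalTraversalSketch {
    // Maps each node to the nodes that consume its output.
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    // Set after a successful traversal; the graph is then read-only.
    private boolean finished = false;

    void addEdge(String producer, String consumer) {
        if (finished) {
            throw new IllegalStateException("graph was already traversed");
        }
        consumers.computeIfAbsent(producer, k -> new ArrayList<>()).add(consumer);
        consumers.computeIfAbsent(consumer, k -> new ArrayList<>());
    }

    // Calls the visitor once per node, producers before consumers,
    // and returns the visit order.
    List<String> traverseTopologically(Consumer<String> visitor) {
        // Count, for every node, how many unvisited producers it still has.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String node : consumers.keySet()) {
            pending.put(node, 0);
        }
        for (List<String> outputs : consumers.values()) {
            for (String consumer : outputs) {
                pending.merge(consumer, 1, Integer::sum);
            }
        }

        // Root nodes have no producers and are ready immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            visitor.accept(node);
            order.add(node);
            for (String consumer : consumers.get(node)) {
                if (pending.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        finished = true;
        return order;
    }

    public static void main(String[] args) {
        TopologicalTraversalSketch graph = new TopologicalTraversalSketch();
        graph.addEdge("read1", "flatten");
        graph.addEdge("read2", "flatten");
        graph.addEdge("flatten", "count");
        graph.addEdge("count", "write");
        // Both reads are printed before flatten, which precedes count and write.
        graph.traverseTopologically(System.out::println);
    }
}
```

In this model, calling `addEdge` after a traversal throws an `IllegalStateException`, which mirrors the idea that traversed nodes may no longer be modified.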
public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(InputT input, PTransform<? super InputT,OutputT> transform)
Like applyTransform(String, PInput, PTransform) but defaulting to the name
provided by the PTransform.
public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(String name, InputT input, PTransform<? super InputT,OutputT> transform)
Applies the given PTransform to this input InputT and returns
its OutputT. This uses name to identify this specific application
of the transform. The name is used in various places, including the monitoring UI
and logging, and to stably identify this application node in the job graph.
Called by PInput subclasses in their apply methods.
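The relationship between the two overloads (one takes an explicit name, the other falls back to the name the transform provides) is an ordinary delegating-overload pattern. The following is a minimal sketch of that pattern only. `NamedApplicationSketch` and `NamedStep` are hypothetical stand-ins invented for illustration, and the model records names without modeling inputs or outputs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are not part of
// the Dataflow SDK.
class NamedApplicationSketch {
    // Minimal stand-in for a transform that knows its own default name.
    interface NamedStep {
        String getName();
    }

    // Names of the recorded applications, in the order they were applied.
    private final List<String> applicationNames = new ArrayList<>();

    // Convenience overload: falls back to the name the step provides.
    void apply(NamedStep step) {
        apply(step.getName(), step);
    }

    // Primary overload: the caller-supplied name identifies this
    // particular application of the step.
    void apply(String name, NamedStep step) {
        applicationNames.add(name);
    }

    // Returns a copy so callers cannot modify the recorded names.
    List<String> getApplicationNames() {
        return new ArrayList<>(applicationNames);
    }

    public static void main(String[] args) {
        NamedStep count = () -> "Count";
        NamedApplicationSketch graph = new NamedApplicationSketch();
        graph.apply(count);               // recorded under the default name
        graph.apply("CountWords", count); // recorded under the explicit name
        System.out.println(graph.getApplicationNames()); // prints [Count, CountWords]
    }
}
```

Giving each application its own name lets the same step appear at several places in a graph while each occurrence stays individually identifiable.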
public PipelineRunner<?> getRunner()
Returns the configured pipeline runner.
public PipelineOptions getOptions()
Returns the configured pipeline options.
@Deprecated public String getFullNameForTesting(PTransform<?,?> transform)
Deprecated.
Throws: IllegalStateException - if the transform has not been applied to the pipeline
or was applied multiple times.
public void addValueInternal(PValue value)
Adds the given PValue to this Pipeline. For internal use only.