Type Parameters:
Input - the type of the input to this PTransform
Output - the type of the output of this PTransform
public abstract class PTransform<Input extends PInput,Output extends POutput>
extends java.lang.Object
implements java.io.Serializable
PTransform<Input, Output> is an operation that takes an
Input (some subtype of PInput) and produces an
Output (some subtype of POutput).
Common PTransforms include root PTransforms like
TextIO.Read,
Create, processing and
conversion operations like ParDo,
GroupByKey,
CoGroupByKey,
Combine, and Count, and outputting
PTransforms like
TextIO.Write. Users also
define their own application-specific composite PTransforms.
Each PTransform<Input, Output> has a single
Input type and a single Output type. Many
PTransforms conceptually transform one input value to one output
value, and in this case Input and Output are
typically instances of
PCollection.
A root
PTransform conceptually has no input; in this case, conventionally
a PBegin object
produced by calling Pipeline.begin() is used as the input.
An outputting PTransform conceptually has no output; in this case,
conventionally PDone
is used as its output type. Some PTransforms conceptually have
multiple inputs and/or outputs; in these cases special "bundling"
classes like
PCollectionList and
PCollectionTuple
are used
to combine multiple values into a single bundle for passing into or
returning from the PTransform.
A PTransform<Input, Output> is invoked by calling
apply() on its Input, returning its Output.
Calls can be chained to concisely create linear pipeline segments.
For example:
PCollection<T1> pc1 = ...;
PCollection<T2> pc2 =
    pc1.apply(ParDo.of(new MyDoFn<T1,KV<K,V>>()))
       .apply(GroupByKey.<K, V>create())
       .apply(Combine.perKey(new MyKeyedCombineFn<K,V>()))
       .apply(ParDo.of(new MyDoFn2<KV<K,V>,T2>()));
PTransform operations have unique names, which are used by the
system when explaining what's going on during optimization and
execution. Each PTransform gets a system-provided default name,
but it's a good practice to specify an explicit name, where
possible, using the named() method offered by some
PTransforms such as ParDo. For example:
...
.apply(ParDo.named("Step1").of(new MyDoFn3()))
...
Each PCollection output produced by a PTransform, either directly or within a "bundling" class, automatically gets its own name derived from the name of its producing PTransform.
Each PCollection output produced by a PTransform
also records a Coder
that specifies how the elements of that PCollection
are to be encoded as a byte string, if necessary. The
PTransform may provide a default Coder for any of its outputs, for
instance by deriving it from the PTransform input's Coder. If the
PTransform does not specify the Coder for an output PCollection,
the system will attempt to infer a Coder for it, based on
what's known at run-time about the Java type of the output's
elements. The enclosing Pipeline's
CoderRegistry
(accessible via Pipeline.getCoderRegistry()) defines the
mapping from Java types to the default Coder to use, for a standard
set of Java types; users can extend this mapping for additional
types, via
CoderRegistry.registerCoder(java.lang.Class<?>, java.lang.Class<?>).
If this inference process fails, either because the Java type was
not known at run-time (e.g., due to Java's "erasure" of generic
types) or there was no default Coder registered, then the Coder
should be specified manually by calling
TypedPValue.setCoder(com.google.cloud.dataflow.sdk.coders.Coder<T>)
on the output PCollection. The Coder of every output
PCollection must be determined one way or another
before that output is used as an input to another PTransform, or
before the enclosing Pipeline is run.
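The erasure failure mode mentioned above can be reproduced in plain Java, without the SDK. The sketch below uses only JDK classes; ErasureDemo and its method names are made up for illustration. It shows that two differently parameterized lists share one runtime class, which is why the element type cannot always be recovered at run-time, and that a subclass which fixes the type argument does record it in class metadata. It says nothing about how the SDK itself performs inference.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class (not part of the SDK).
class ErasureDemo {
    // True when List<String> and List<Integer> instances share one runtime class,
    // i.e. the element type has been erased from the objects themselves.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<String>();
        List<Integer> ints = new ArrayList<Integer>();
        return strings.getClass() == ints.getClass();
    }

    // An anonymous subclass fixes the type argument in its class declaration,
    // so reflection can still read it from the generic superclass.
    static Type elementTypeOfAnonymousSubclass() {
        List<String> strings = new ArrayList<String>() {};
        ParameterizedType superType =
            (ParameterizedType) strings.getClass().getGenericSuperclass();
        return superType.getActualTypeArguments()[0];
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(elementTypeOfAnonymousSubclass());
    }
}
```

When the element type is lost in this way, setting the Coder explicitly on the output PCollection, as described above, is the fallback.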
A small number of PTransforms are implemented natively by the
Google Cloud Dataflow SDK; such PTransforms simply return an
output value as their apply implementation.
The majority of PTransforms are
implemented as composites of other PTransforms. Such a PTransform
subclass typically just implements apply(Input), computing its
Output value from its Input value. User programs are encouraged to
use this mechanism to modularize their own code. Such composite
abstractions get their own name, and navigating through the
composition hierarchy of PTransforms is supported by the monitoring
interface. Examples of composite PTransforms can be found in this
directory and in examples. From the caller's point of view, there
is no distinction between a PTransform implemented natively and one
implemented in terms of other PTransforms; both kinds of PTransform
are invoked in the same way, using apply().
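The composite pattern described above can be sketched with simplified stand-in types. Everything in this block is hypothetical: MiniTransform, SplitWords, CountPerElement, CountWords and CompositeDemo are not SDK classes, and plain Java collections stand in for PCollections. A real composite PTransform would instead call apply() on its input with SDK transforms. The sketch only shows the shape of the idea: a composite's apply wires other transforms together, callers invoke it exactly like a leaf transform, and an explicit name replaces the class-derived default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for illustration only; this is NOT the SDK's PTransform.
abstract class MiniTransform<I, O> {
    private String name; // null until assigned

    // Like the documented default, apply fails unless a subclass overrides it.
    public O apply(I input) {
        throw new UnsupportedOperationException(getName() + " does not implement apply");
    }

    // Sets the name and returns this transform, so calls can be chained.
    public MiniTransform<I, O> withName(String name) {
        this.name = name;
        return this;
    }

    // Falls back to the class's base name when no explicit name was given.
    public String getName() {
        return name != null ? name : getClass().getSimpleName();
    }
}

// Leaf step: splits each line into whitespace-separated words.
class SplitWords extends MiniTransform<List<String>, List<String>> {
    @Override
    public List<String> apply(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}

// Leaf step: counts occurrences of each element.
class CountPerElement extends MiniTransform<List<String>, Map<String, Long>> {
    @Override
    public Map<String, Long> apply(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }
}

// Composite step: its apply only wires other transforms together.
class CountWords extends MiniTransform<List<String>, Map<String, Long>> {
    @Override
    public Map<String, Long> apply(List<String> lines) {
        return new CountPerElement().apply(new SplitWords().apply(lines));
    }
}

class CompositeDemo {
    public static void main(String[] args) {
        MiniTransform<List<String>, Map<String, Long>> step =
            new CountWords().withName("Step1");
        System.out.println(step.getName() + ": " + step.apply(Arrays.asList("a b", "b c b")));
    }
}
```

The caller in main cannot tell whether CountWords is a leaf or a composite, which mirrors the point made above about native and composite PTransforms.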
PTransform doesn't actually support serialization, despite
implementing Serializable.
PTransform is marked Serializable solely
because it is common for an anonymous *Fn instance, such as a DoFn,
to be created within the
apply() method of a composite PTransform.
Each of those *Fns is Serializable, but
unfortunately its instance state will contain a reference to the
enclosing PTransform instance, so serializing the *Fn also attempts
to serialize the PTransform instance, even though the *Fn
instance never references anything about the enclosing
PTransform.
To allow such anonymous *Fns to be written
conveniently, PTransform is marked as Serializable,
and includes dummy writeObject() and readObject()
operations that do not save or restore any state.
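The capture problem and the workaround can be reproduced with the JDK alone. In this sketch Fn, PlainOuter, MarkedOuter and CaptureDemo are hypothetical names standing in for a *Fn type and for enclosing transforms; none of them are SDK classes. It assumes the compiler keeps the hidden enclosing-instance reference in a Serializable anonymous class, which is the behavior the text above describes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a *Fn: any Serializable callback type.
interface Fn extends Serializable {
    String process(String in);
}

// Enclosing class that is NOT Serializable.
class PlainOuter {
    Fn makeFn() {
        // Anonymous class created in an instance method: it carries a hidden
        // reference to the enclosing PlainOuter, although it never uses it.
        return new Fn() {
            @Override
            public String process(String in) {
                return in.toUpperCase();
            }
        };
    }
}

// Enclosing class marked Serializable with hooks that save and restore nothing.
class MarkedOuter implements Serializable {
    private static final long serialVersionUID = 1L;

    private void writeObject(ObjectOutputStream out) throws IOException {
        // Intentionally empty: no state is written.
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        // Intentionally empty: no state is restored.
    }

    Fn makeFn() {
        return new Fn() {
            @Override
            public String process(String in) {
                return in.toUpperCase();
            }
        };
    }
}

class CaptureDemo {
    // Returns false if Java serialization rejects the object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new PlainOuter().makeFn()));
        System.out.println(canSerialize(new MarkedOuter().makeFn()));
    }
}
```

Declaring the callback as a static nested class or a top-level class would avoid the hidden reference altogether; marking the enclosing class Serializable keeps the anonymous-class style convenient instead.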
| Modifier and Type | Field and Description |
|---|---|
| protected java.lang.String | name: The base name of this PTransform, e.g., from ParDo.named(String), or from defaults, or null if not yet assigned. |
| Modifier | Constructor and Description |
|---|---|
| protected | PTransform() |
| protected | PTransform(java.lang.String name) |
| Modifier and Type | Method and Description |
|---|---|
| Output | apply(Input input): Applies this PTransform on the given Input, and returns its Output. |
| void | finishSpecifying(): After building, finalizes this PTransform to make it ready for running. |
| protected CoderRegistry | getCoderRegistry(): Deprecated. Use pipeline.getCoderRegistry() |
| protected java.lang.String | getDefaultName(): Returns the name to use by default for this PTransform (not including the names of any enclosing PTransforms). |
| protected Coder<?> | getDefaultOutputCoder(): Returns the default Coder to use for the output of this single-output PTransform, or null if none can be inferred. |
| <T> Coder<T> | getDefaultOutputCoder(TypedPValue<T> output): Returns the default Coder to use for the given output of this single-output PTransform, or null if none can be inferred. |
| Input | getInput(): Deprecated. Use pipeline.getInput(transform) |
| protected java.lang.String | getKindString(): Returns a string describing what kind of PTransform this is. |
| java.lang.String | getName(): Returns the transform name. |
| Output | getOutput(): Deprecated. |
| Pipeline | getPipeline(): Deprecated. |
| void | setName(java.lang.String name): Sets the base name of this PTransform. |
| void | setPipeline(Pipeline pipeline): Deprecated. |
| java.lang.String | toString() |
| PTransform<Input,Output> | withName(java.lang.String name): Sets the base name of this PTransform and returns itself. |
protected transient java.lang.String name
The base name of this PTransform, e.g., from ParDo.named(String), or from defaults, or null if not yet assigned.
protected PTransform()
protected PTransform(java.lang.String name)
public Output apply(Input input)
Applies this PTransform on the given Input, and returns its Output.
Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
The default implementation throws an exception. A derived class must
either implement apply, or else each runner must supply a custom
implementation via
PipelineRunner.apply(com.google.cloud.dataflow.sdk.transforms.PTransform<Input, Output>, Input).
public void setName(java.lang.String name)
Sets the base name of this PTransform.
public PTransform<Input,Output> withName(java.lang.String name)
Sets the base name of this PTransform and returns itself.
This is a shortcut for calling setName(java.lang.String), which allows method chaining.
public java.lang.String getName()
Returns the transform name.
This name is provided by the transform creator and is not required to be unique.
@Deprecated public Pipeline getPipeline()
Deprecated.
Returns the Pipeline of this PTransform.
Throws: java.lang.IllegalStateException - if the owning Pipeline hasn't been set yet
@Deprecated public Input getInput()
Deprecated. Use pipeline.getInput(transform)
Throws: java.lang.IllegalStateException - if this PTransform hasn't been applied yet
@Deprecated public Output getOutput()
Deprecated. Use pipeline.getOutput(transform)
Throws: java.lang.IllegalStateException - if this PTransform hasn't been applied yet
@Deprecated protected CoderRegistry getCoderRegistry()
Deprecated. Use pipeline.getCoderRegistry()
Returns the CoderRegistry, useful for inferring Coders.
Throws: java.lang.IllegalStateException - if the owning Pipeline hasn't been set yet
@Deprecated public void setPipeline(Pipeline pipeline)
Deprecated.
Associates this PTransform with the given Pipeline.
For internal use only.
Throws: java.lang.IllegalArgumentException - if this transform has already been associated with a pipeline
public java.lang.String toString()
Overrides: toString in class java.lang.Object
protected java.lang.String getDefaultName()
Returns the name to use by default for this PTransform (not including the names of any enclosing PTransforms).
By default, returns getKindString().
The caller is responsible for ensuring that names of applied PTransforms are unique, e.g., by adding a uniquifying suffix when needed.
protected java.lang.String getKindString()
Returns a string describing what kind of PTransform this is.
By default, returns the base name of this PTransform's class.
public void finishSpecifying()
After building, finalizes this PTransform to make it ready for running. Called automatically when its output(s) are finished.
Not normally called by user code.
protected Coder<?> getDefaultOutputCoder()
Returns the default Coder to use for the output of this single-output PTransform, or null if none can be inferred.
By default, returns null.
public <T> Coder<T> getDefaultOutputCoder(TypedPValue<T> output)
Returns the default Coder to use for the given output of this single-output PTransform, or null if none can be inferred.