Class DoFn<InputT extends @Nullable java.lang.Object,OutputT extends @Nullable java.lang.Object>
- java.lang.Object
  - org.apache.beam.sdk.transforms.DoFn<InputT,OutputT>
- Type Parameters:
InputT
- the type of the (main) input elements
OutputT
- the type of the (main) output elements
- All Implemented Interfaces:
java.io.Serializable, HasDisplayData
- Direct Known Subclasses:
JsonToRow.JsonToRowWithErrFn.ParseWithError, Reshuffle.AssignShardFn, ValueWithRecordId.StripIdsDoFn, View.ToListViewDoFn, Watch.WatchGrowthFn
public abstract class DoFn<InputT extends @Nullable java.lang.Object,OutputT extends @Nullable java.lang.Object> extends java.lang.Object implements java.io.Serializable, HasDisplayData
The argument to ParDo providing the code to use to process elements of the input PCollection.

See ParDo for more explanation, examples of use, and discussion of constraints on DoFns, including their serializability, lack of access to global shared mutable state, requirements for failure tolerance, and benefits of optimization.

DoFns can be tested by using TestPipeline. You can verify their functional correctness in a local test using the DirectRunner as well as running integration tests with your production runner of choice. Typically, you can generate the input data using Create.of(java.lang.Iterable<T>) or other transforms. However, if you need to test the behavior of DoFn.StartBundle and DoFn.FinishBundle with particular bundle boundaries, you can use TestStream.

Implementations must define a method annotated with DoFn.ProcessElement that satisfies the requirements described there. See DoFn.ProcessElement for details.

Example usage:

    PCollection<String> lines = ... ;
    PCollection<String> words =
        lines.apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String element, BoundedWindow window) {
            ...
          }}));
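
The serializability constraint mentioned above can be illustrated without Beam. The following is a minimal sketch using only the JDK; UpperCaseFn, setup(), and SerializationDemo are hypothetical names for illustration and are not Beam APIs. It shows that a transient field does not survive a Java serialization round trip, which is one reason per-instance resources are usually created in a method such as the one annotated with DoFn.Setup rather than in the constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical function object standing in for a user-defined DoFn (not a Beam class). */
class UpperCaseFn implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String prefix;                 // serialized with the instance
  private transient Map<String, String> cache; // skipped by serialization; rebuilt in setup()

  UpperCaseFn(String prefix) {
    this.prefix = prefix;
  }

  /** Plays the role of a setup hook: (re)creates per-instance resources. */
  void setup() {
    cache = new HashMap<>();
  }

  boolean isSetUp() {
    return cache != null;
  }

  /** Requires setup() to have been called on this instance. */
  String process(String element) {
    return cache.computeIfAbsent(element, e -> prefix + e.toUpperCase(Locale.ROOT));
  }
}

class SerializationDemo {
  /** Serializes and deserializes a value with standard Java serialization. */
  static <T extends Serializable> T roundTrip(T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        @SuppressWarnings("unchecked")
        T copy = (T) in.readObject();
        return copy;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Serialization round trip failed", e);
    }
  }

  public static void main(String[] args) {
    UpperCaseFn original = new UpperCaseFn("word:");
    original.setup();

    UpperCaseFn copy = roundTrip(original);
    System.out.println(copy.isSetUp());       // false: the transient cache was not serialized

    copy.setup();
    System.out.println(copy.process("beam")); // word:BEAM
  }
}
```

The non-transient prefix field travels with the copy, while the cache has to be rebuilt on the receiving side before the first element is processed.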
- See Also:
- Serialized Form
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
static interface DoFn.AlwaysFetched
- Annotation for declaring that a state parameter is always fetched.
static interface DoFn.BoundedPerElement
- Annotation on a splittable DoFn specifying that the DoFn performs a bounded amount of work per input element, so applying it to a bounded PCollection will also produce a bounded PCollection.
static interface DoFn.BundleFinalizer
- A parameter that is accessible during @StartBundle, @ProcessElement and @FinishBundle that allows the caller to register a callback that will be invoked after the bundle has been successfully completed and the runner has committed the output.
static interface DoFn.Element
- Parameter annotation for the input element for DoFn.ProcessElement, DoFn.GetInitialRestriction, DoFn.GetSize, DoFn.SplitRestriction, DoFn.GetInitialWatermarkEstimatorState, DoFn.NewWatermarkEstimator, and DoFn.NewTracker methods.
static interface DoFn.FieldAccess
- Annotation for specifying specific fields that are accessed in a Schema PCollection.
static interface DoFn.FinishBundle
- Annotation for the method to use to finish processing a batch of elements.
class DoFn.FinishBundleContext
- Information accessible while within the DoFn.FinishBundle method.
static interface DoFn.GetInitialRestriction
- Annotation for the method that maps an element to an initial restriction for a splittable DoFn.
static interface DoFn.GetInitialWatermarkEstimatorState
- Annotation for the method that maps an element and restriction to initial watermark estimator state for a splittable DoFn.
static interface DoFn.GetRestrictionCoder
- Annotation for the method that returns the coder to use for the restriction of a splittable DoFn.
static interface DoFn.GetSize
- Annotation for the method that returns the corresponding size for an element and restriction pair.
static interface DoFn.GetWatermarkEstimatorStateCoder
- Annotation for the method that returns the coder to use for the watermark estimator state of a splittable DoFn.
static interface DoFn.Key
- Parameter annotation for dereferencing the input element key in a KV pair.
static interface DoFn.MultiOutputReceiver
- Receives tagged output for a multi-output function.
static interface DoFn.NewTracker
- Annotation for the method that creates a new RestrictionTracker for the restriction of a splittable DoFn.
static interface DoFn.NewWatermarkEstimator
- Annotation for the method that creates a new WatermarkEstimator for the watermark state of a splittable DoFn.
static interface DoFn.OnTimer
- Annotation for registering a callback for a timer.
class DoFn.OnTimerContext
- Information accessible when running a DoFn.OnTimer method.
static interface DoFn.OnTimerFamily
- Annotation for registering a callback for a timerFamily.
static interface DoFn.OnWindowExpiration
- Annotation for the method to use for performing actions on window expiration.
class DoFn.OnWindowExpirationContext
static interface DoFn.OutputReceiver<T>
- Receives values of the given type.
class DoFn.ProcessContext
- Information accessible when running a DoFn.ProcessElement method.
static class DoFn.ProcessContinuation
- When used as a return value of DoFn.ProcessElement, indicates whether there is more work to be done for the current element.
static interface DoFn.ProcessElement
- Annotation for the method to use for processing elements.
static interface DoFn.RequiresStableInput
- Experimental - no backwards compatibility guarantees.
static interface DoFn.RequiresTimeSortedInput
- Experimental - no backwards compatibility guarantees.
static interface DoFn.Restriction
- Parameter annotation for the restriction for DoFn.GetSize, DoFn.SplitRestriction, DoFn.GetInitialWatermarkEstimatorState, DoFn.NewWatermarkEstimator, and DoFn.NewTracker methods.
static interface DoFn.Setup
- Annotation for the method to use to prepare an instance for processing bundles of elements.
static interface DoFn.SideInput
- Parameter annotation for the SideInput for a DoFn.ProcessElement method.
static interface DoFn.SplitRestriction
- Annotation for the method that splits the restriction of a splittable DoFn into multiple parts to be processed in parallel.
static interface DoFn.StartBundle
- Annotation for the method to use to prepare an instance for processing a batch of elements.
class DoFn.StartBundleContext
- Information accessible while within the DoFn.StartBundle method.
static interface DoFn.StateId
- Annotation for declaring and dereferencing state cells.
static interface DoFn.Teardown
- Annotation for the method to use to clean up this instance before it is discarded.
static interface DoFn.TimerFamily
- Parameter annotation for the TimerMap for a DoFn.ProcessElement method.
static interface DoFn.TimerId
- Annotation for declaring and dereferencing timers.
static interface DoFn.Timestamp
- Parameter annotation for the input element timestamp for DoFn.ProcessElement, DoFn.GetInitialRestriction, DoFn.GetSize, DoFn.SplitRestriction, DoFn.GetInitialWatermarkEstimatorState, DoFn.NewWatermarkEstimator, and DoFn.NewTracker methods.
static interface DoFn.TruncateRestriction
- Annotation for the method that truncates the restriction of a splittable DoFn into a bounded one.
static interface DoFn.UnboundedPerElement
- Annotation on a splittable DoFn specifying that the DoFn performs an unbounded amount of work per input element, so applying it to a bounded PCollection will produce an unbounded PCollection.
static interface DoFn.WatermarkEstimatorState
- Parameter annotation for the watermark estimator state for the DoFn.NewWatermarkEstimator method.
class DoFn.WindowedContext
- Information accessible to all methods in this DoFn where the context is in some window.
-
Constructor Summary
Constructors (constructor, description):
DoFn()
-
Method Summary
Methods (modifier and type, method, description):
org.joda.time.Duration getAllowedTimestampSkew()
- Deprecated. This method permits a DoFn to emit elements behind the watermark.
TypeDescriptor<InputT> getInputTypeDescriptor()
- Returns a TypeDescriptor capturing what is known statically about the input type of this DoFn instance's most-derived class.
TypeDescriptor<OutputT> getOutputTypeDescriptor()
- Returns a TypeDescriptor capturing what is known statically about the output type of this DoFn instance's most-derived class.
void populateDisplayData(DisplayData.Builder builder)
- Register display data for the given transform or component.
void prepareForProcessing()
- Deprecated. Use DoFn.Setup or DoFn.StartBundle instead.
-
Method Detail
-
getAllowedTimestampSkew
@Deprecated public org.joda.time.Duration getAllowedTimestampSkew()
Deprecated. This method permits a DoFn to emit elements behind the watermark. These elements are considered late, and may be silently dropped if they fall behind the allowed lateness of a downstream PCollection. See https://github.com/apache/beam/issues/18065 for details on a replacement.

Returns the allowed timestamp skew duration, which is the maximum duration that timestamps can be shifted backward in DoFn.WindowedContext.outputWithTimestamp(OutputT, org.joda.time.Instant).

The default value is Duration.ZERO, in which case timestamps can only be shifted forward into the future. For infinite skew, return Duration.millis(Long.MAX_VALUE).
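
The skew rule described above amounts to a lower bound on the output timestamp: it may not be earlier than the element timestamp minus the allowed skew. The following is a minimal sketch of that rule only, not Beam's implementation. It assumes java.time in place of Joda-Time so that it runs on a plain JDK, assumes a non-negative skew, and uses the hypothetical names SkewCheck and isWithinSkew.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper illustrating the allowed-timestamp-skew rule (not a Beam class). */
class SkewCheck {
  /**
   * Returns true if outputTs is no earlier than elementTs minus allowedSkew.
   * Assumes allowedSkew is non-negative.
   */
  static boolean isWithinSkew(Instant elementTs, Instant outputTs, Duration allowedSkew) {
    long earliestMillis;
    try {
      earliestMillis = Math.subtractExact(elementTs.toEpochMilli(), allowedSkew.toMillis());
    } catch (ArithmeticException overflow) {
      // Saturate instead of wrapping around when the skew is effectively infinite.
      earliestMillis = Long.MIN_VALUE;
    }
    return outputTs.toEpochMilli() >= earliestMillis;
  }

  public static void main(String[] args) {
    Instant element = Instant.ofEpochMilli(10_000);
    Instant oneSecondEarlier = Instant.ofEpochMilli(9_000);

    // Zero skew: timestamps may only stay the same or move forward.
    System.out.println(isWithinSkew(element, oneSecondEarlier, Duration.ZERO));         // false
    // Five seconds of skew: shifting back by one second is allowed.
    System.out.println(isWithinSkew(element, oneSecondEarlier, Duration.ofSeconds(5))); // true
    // "Infinite" skew, mirroring Duration.millis(Long.MAX_VALUE) in the text above.
    System.out.println(isWithinSkew(
        element,
        Instant.EPOCH.minusSeconds(1_000_000_000L),
        Duration.ofMillis(Long.MAX_VALUE)));                                            // true
  }
}
```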
-
getInputTypeDescriptor
public TypeDescriptor<InputT> getInputTypeDescriptor()
Returns a TypeDescriptor capturing what is known statically about the input type of this DoFn instance's most-derived class.
See getOutputTypeDescriptor() for more discussion.
-
getOutputTypeDescriptor
public TypeDescriptor<OutputT> getOutputTypeDescriptor()
Returns a TypeDescriptor capturing what is known statically about the output type of this DoFn instance's most-derived class.
In the normal case of a concrete DoFn subclass with no generic type parameters of its own (including anonymous inner classes), this will be a complete non-generic type, which is good for choosing a default output Coder<O> for the output PCollection<O>.
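
The distinction drawn above, between a concrete subclass and one that keeps a type parameter of its own, can be seen with plain Java reflection. The following sketch is an assumption-laden illustration of the underlying language mechanism, not of how TypeDescriptor is implemented. MiniFn, GenericFn, and TypeCaptureDemo are hypothetical names, and the lookup only inspects the direct superclass.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

/** Hypothetical generic base class standing in for DoFn (not a Beam class). */
abstract class MiniFn<InputT, OutputT> {
  /** Returns the OutputT type argument as written by the direct subclass. */
  Type outputType() {
    Type superType = getClass().getGenericSuperclass();
    if (superType instanceof ParameterizedType) {
      // Type arguments appear in declaration order: [InputT, OutputT].
      return ((ParameterizedType) superType).getActualTypeArguments()[1];
    }
    throw new IllegalStateException("No type arguments recorded for " + getClass());
  }
}

/** A subclass that still has a generic type parameter of its own. */
class GenericFn<T> extends MiniFn<String, T> {}

class TypeCaptureDemo {
  /** Anonymous subclass with concrete type arguments: the output type is fully known. */
  static Type concreteOutput() {
    return new MiniFn<String, Integer>() {}.outputType();
  }

  /** Generic subclass: only the type variable T is known statically. */
  static Type genericOutput() {
    return new GenericFn<Integer>().outputType();
  }

  public static void main(String[] args) {
    System.out.println(concreteOutput()); // class java.lang.Integer
    System.out.println(genericOutput());  // T
  }
}
```

In the second case the Integer argument supplied at the call site is erased, so only the unresolved type variable is visible at run time. No concrete class is available to pick a default from.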
-
prepareForProcessing
@Deprecated public final void prepareForProcessing()
Deprecated. Use DoFn.Setup or DoFn.StartBundle instead. This method will be removed in a future release.
Finalize the DoFn construction to prepare for processing. This method should be called by runners before any processing methods.
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData in interface HasDisplayData
- Parameters:
builder
- The builder to populate with display data.
- See Also:
HasDisplayData
-