Annotation Type DoFn.GetSize


  • @Documented
    @Retention(RUNTIME)
    @Target(METHOD)
    public static @interface DoFn.GetSize
    Annotation for the method that returns the corresponding size for an element and restriction pair.

    Signature: double getSize(<arguments>);

    This method must satisfy the following constraints:

    • If one of its arguments is tagged with the DoFn.Element annotation, then it will be passed the current element being processed; the argument must be of type InputT. Note that automatic conversion of Rows and DoFn.FieldAccess parameters are currently unsupported.
    • If one of its arguments is tagged with the DoFn.Restriction annotation, then it will be passed the current restriction being processed; the argument must be of type RestrictionT.
    • If one of its arguments is tagged with the DoFn.Timestamp annotation, then it will be passed the timestamp of the current element being processed; the argument must be of type Instant.
    • If one of its arguments is a RestrictionTracker, then it will be passed a tracker that is initialized for the current DoFn.Restriction. The argument must be of the exact type RestrictionTracker<RestrictionT, PositionT>.
    • If one of its arguments is a subtype of BoundedWindow, then it will be passed the window of the current element. When applied by ParDo the subtype of BoundedWindow must match the type of windows on the input PCollection. If the window is not accessed a runner may perform additional optimizations.
    • If one of its arguments is of type PaneInfo, then it will be passed information about the current triggering pane.
    • If one of the parameters is of type PipelineOptions, then it will be passed the options for the current pipeline.

    Returns a non-negative double representing the size of the current element and restriction.

    Splittable DoFns should only provide this method if the default RestrictionTracker.HasProgress implementation within the RestrictionTracker is an inaccurate representation of known work.

    It is up to each splittable to convert between their natural representation of outstanding work and this representation. For example:

    • Block based file source (e.g. Avro): The number of bytes that will be read from the file.
    • Pull based queue based source (e.g. Pubsub): The local/global size available in number of messages or number of message bytes that have not been processed.
    • Key range based source (e.g. Shuffle, Bigtable, ...): Typically 1.0 unless additional details such as the number of bytes for keys and values is known for the key range.

    Must be thread safe. Will be invoked concurrently during bundle processing due to runner initiated splitting and progress estimation.