The ScioContext associated with this PCollection.
The PCollection being wrapped internally.
Return the union of this SCollection and another one. Any identical elements will appear multiple times (use distinct to eliminate them).
Aggregate with Aggregator. First each item T is mapped to A, then we reduce with a Semigroup of A, then finally we present the results as U. This could be more powerful and better optimized in some cases.
Aggregate the elements using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this SCollection, T. Thus, we need one operation for merging a T into a U and one operation for merging two U's. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
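The seqOp/combOp contract can be sketched with plain Scala collections (no Scio required); the two-partition split below is illustrative only:

```scala
// Illustrative sketch (plain Scala, not Scio) of aggregate(zeroValue)(seqOp, combOp).
// seqOp merges a T (Int) into the accumulator U ((sum, count));
// combOp merges two Us produced by different partitions.
val data = Seq(1, 2, 3, 4, 5)
val zero = (0, 0) // (sum, count)

def seqOp(acc: (Int, Int), x: Int): (Int, Int) = (acc._1 + x, acc._2 + 1)
def combOp(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)

// Each simulated partition folds with seqOp; partial results merge with combOp.
val (left, right) = data.splitAt(2)
val result = combOp(left.foldLeft(zero)(seqOp), right.foldLeft(zero)(seqOp))
// result == (15, 5): sum 15 over 5 elements
```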
Apply a PTransform and wrap the output in an SCollection. This is a special case of applyTransform for transforms with KV output.
Apply a PTransform and wrap the output in an SCollection.
Convert this SCollection to a SideInput, mapping each window to an Iterable, to be used with withSideInputs.
The values of the Iterable for a window are not required to fit in memory, but they may also not be effectively cached. If it is known that every window fits in memory, and stronger caching is desired, use asListSideInput.
Convert this SCollection to a SideInput, mapping each window to a Seq, to be used with withSideInputs.
The resulting Seq is required to fit in memory.
Convert this SCollection to a SideInput, mapping each window to a Set[T], to be used with withSideInputs.
The resulting SideInput is a one-element singleton which is a Set of all elements in the SCollection for the given window. The complete Set must fit in memory of the worker.
Convert this SCollection of a single value per window to a SideInput with a default value, to be used with withSideInputs.
Convert this SCollection of a single value per window to a SideInput, to be used with withSideInputs.
Filter the elements for which the given PartialFunction is defined, and then map.
Generic function to combine the elements using a custom set of aggregation functions. Turns an SCollection[T] into a result of type SCollection[C], for a "combined type" C. Note that T and C can be different -- for example, one might combine an SCollection of type Int into an SCollection of type Seq[Int]. Users provide three functions:
- createCombiner, which turns a T into a C (e.g., creates a one-element list)
- mergeValue, to merge a T into a C (e.g., adds it to the end of a list)
- mergeCombiners, to combine two C's into a single one.
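A minimal sketch of the three functions on plain Scala data, combining Ints into a List[Int]; the two-partition split is illustrative:

```scala
// Illustrative sketch (plain Scala, not Scio) of combine's three functions.
val xs = Seq(1, 2, 3, 4)

def createCombiner(x: Int): List[Int] = List(x)                    // T => C
def mergeValue(c: List[Int], x: Int): List[Int] = c :+ x           // (C, T) => C
def mergeCombiners(a: List[Int], b: List[Int]): List[Int] = a ++ b // (C, C) => C

// Simulate two partitions, each combined locally, then merged.
val (p1, p2) = xs.splitAt(2)
def combineLocal(p: Seq[Int]): List[Int] =
  p.tail.foldLeft(createCombiner(p.head))(mergeValue)
val combined = mergeCombiners(combineLocal(p1), combineLocal(p2))
// combined == List(1, 2, 3, 4)
```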
Count the number of elements in the SCollection.
a new SCollection with the count
Count approximate number of distinct elements in the SCollection.
the maximum estimation error, which should be in the range [0.01, 0.5]
Count approximate number of distinct elements in the SCollection.
the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
Count of each unique value in this SCollection as an SCollection of (value, count) pairs.
Return the cross product with another SCollection by replicating that to all workers. The right side should be tiny and fit in memory.
Print content of an SCollection to out().
where to write the debug information. Default: stdout
prefix for each logged entry. Default: empty string
if debugging is enabled or not. Default: true. It can be useful to set this to sc.isTest to avoid debugging when running in production.
Return a new SCollection containing the distinct elements in this SCollection.
Returns a new SCollection with distinct elements, using the given function to obtain a representative value for each input element.
The type of representative values used to dedup.
The function to use to get representative values.
Return a new SCollection containing only the elements that satisfy a predicate.
Return a new SCollection by first applying a function to all elements of this SCollection, and then flattening the results.
Return a new SCollection[U] by flattening each element of an SCollection[Traversable[U]].
Fold with Monoid, which defines the associative function and "zero value" for T. This could be more powerful and better optimized in some cases.
Aggregate the elements using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
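The associativity requirement can be sketched locally: partitions fold independently from the zero value, and the partial results are folded again, so op must be associative and the zero value neutral. A plain-Scala analogy, not Scio code:

```scala
// Sketch of fold(zeroValue)(op) semantics: with zero = 0 and op = +,
// fold behaves like a distributed sum.
val xs = Seq(1, 2, 3, 4)
def op(t1: Int, t2: Int): Int = t1 + t2

// Simulated partitions fold independently from zero, then partial results
// fold again -- which is why op must be associative and zero must be neutral.
val (a, b) = xs.splitAt(2)
val folded = op(a.foldLeft(0)(op), b.foldLeft(0)(op)) // == xs.sum
```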
Return an SCollection of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.
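The performance note can be illustrated locally: a groupByKey-style aggregation materializes every group before summing, while a reduceByKey-style one keeps only a running value per key. A plain-Scala analogy:

```scala
// Plain-Scala analogy (not Scio): two ways to sum values per key.
val pairs = Seq(("a", 1), ("a", 2), ("b", 3))

// groupByKey-style: collect all values per key, then sum (expensive at scale,
// because every group must be materialized).
val viaGroup = pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

// reduceByKey-style: fold each value into a running total per key.
val viaReduce = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc.updated(k, acc.getOrElse(k, 0) + v)
}
// Both yield Map("a" -> 3, "b" -> 3)
```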
Return a new SCollection containing only the elements that also exist in the SideInput.
Look up values in an SCollection[(T, V)] for each element T in this SCollection by replicating that to all workers. The right side should be tiny and fit in memory.
Partition this SCollection using Object.hashCode() into n partitions.
number of output partitions
partitioned SCollections in a Seq
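A local sketch of the hashCode-based routing (plain Scala; the bucket helper is illustrative, not Scio API):

```scala
// Illustrative sketch: route each element to bucket hashCode mod n,
// yielding n partitions whose union is the original data.
val xs = Seq("a", "b", "c", "d", "e")
val n = 3

// Normalize the remainder so the index is always in [0, n - 1].
def bucket(x: Any): Int = ((x.hashCode % n) + n) % n

val partitions: Seq[Seq[String]] =
  (0 until n).map(i => xs.filter(bucket(_) == i))
// Every element lands in exactly one partition.
```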
Return the intersection of this SCollection and another one. The output will not contain any duplicate elements, even if the input SCollections did.
Note that this method performs a shuffle internally.
Create tuples of the elements in this SCollection by applying f.
Return a new SCollection by applying a function to all elements of this SCollection.
Extract data from this SCollection as a closed Tap. The Tap will be available once the pipeline completes successfully. .materialize() must be called before the ScioContext is run, as its implementation modifies the current pipeline graph.

val closedTap = sc.parallelize(1 to 10).materialize
sc.run().waitUntilDone().tap(closedTap)
Return the max of this SCollection as defined by the implicit Ordering[T].
a new SCollection with the maximum element
Return the mean of this SCollection as defined by the implicit Numeric[T].
a new SCollection with the mean of elements
Return the min of this SCollection as defined by the implicit Ordering[T].
a new SCollection with the minimum element
A friendly name for this SCollection.
Partition this SCollection into a pair of SCollections according to a predicate.
predicate on which to partition
a pair of SCollections: the first SCollection consists of all elements that satisfy the predicate p and the second consists of all elements that do not.
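The semantics match Scala's own partition on local collections; a quick analogy:

```scala
// Local analogy: Scala's partition splits by a predicate in the same way.
val xs = Seq(1, 2, 3, 4, 5)
val p: Int => Boolean = _ % 2 == 0
val (evens, odds) = xs.partition(p) // (satisfy p, do not satisfy p)
```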
Partition this SCollection with the provided function.
number of output partitions
function that assigns an output partition to each element, should be in the range [0, numPartitions - 1]
partitioned SCollections in a Seq
Partition this SCollection into a map from possible key values to an SCollection of corresponding elements, based on the provided function.
The keys for the output partitions
function that assigns an output partition to each element, should be in the range of partitionKeys
partitioned SCollections in a Map
Compute the SCollection's data distribution using approximate N-tiles.
a new SCollection whose single value is an Iterable of the approximate N-tiles of the elements
Randomly splits this SCollection into three parts.
Note: 0 < weightA + weightB < 1
weight for first SCollection, should be in the range (0, 1)
weight for second SCollection, should be in the range (0, 1)
split SCollections in a Tuple3
Randomly splits this SCollection into two parts.
weight for left hand side SCollection, should be in the range (0, 1)
split SCollections in a Tuple2
Randomly splits this SCollection with the provided weights.
weights for splits, will be normalized if they don't sum to 1
split SCollections in an array
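How weight normalization might work can be sketched locally; the splitIndex helper is hypothetical, for illustration only, not Scio API:

```scala
// Illustrative only: normalize weights to sum to 1, then map a uniform draw
// in [0, 1) to a split index via cumulative boundaries.
val weights = Seq(2.0, 1.0, 1.0)
val norm = weights.map(_ / weights.sum)    // 0.5, 0.25, 0.25
val boundaries = norm.scanLeft(0.0)(_ + _) // 0.0, 0.5, 0.75, 1.0

// Hypothetical helper: pick the split whose interval contains r.
def splitIndex(r: Double): Int =
  boundaries.lastIndexWhere(_ <= r).min(weights.size - 1)
```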
Reads each file, represented as a pattern, in this SCollection.
Controls how to handle directories in the input.
Reads files using the given org.apache.beam.sdk.io.Compression.
Reads each file, represented as a pattern, in this SCollection.
Controls how to handle directories in the input.
Reads files using the given org.apache.beam.sdk.io.Compression.
Reads each file, represented as a pattern, in this SCollection.
Reads each file, represented as a pattern, in this SCollection.
each line of the input files.
Reads each file, represented as a pattern, in this SCollection.
each file fully read as Array[Byte].
Reads each file, represented as a pattern, in this SCollection.
each file fully read as String.
Reduce the elements of this SCollection using the specified commutative and associative binary operator.
Return a sampled subset of this SCollection.
Return a sampled subset of this SCollection.
a new SCollection whose single value is an Iterable of the samples
Save this SCollection as raw bytes. Note that elements must be of type Array[Byte].
Save this SCollection with a custom output transform. The transform should have a unique name.
Save this SCollection as a Datastore dataset. Note that elements must be of type Entity.
Save this SCollection as a Pub/Sub topic.
Save this SCollection as a Pub/Sub topic using the given map as message attributes.
Save this SCollection as a text file. Note that elements must be of type String.
Assign a Coder to this SCollection.
Return an SCollection with the elements from this that are not in other.
Reduce with Semigroup. This could be more powerful and better optimized than reduce in some cases.
Return a sampled subset of any num elements of the SCollection.
Applies f to each element of this SCollection, and returns the original value.
Assign timestamps to values, with an optional skew.
Go from an SCollection of type T to an SCollection of U given the Schemas of both types T and U.
There are two constructors for To:
- Type safe (Schema compatibility is verified during compilation): SCollection[T]#to(To.safe[T, U])
- Unsafe (Schema compatibility is not checked at compile time): SCollection[T]#to[U](To.unsafe)
Convert this SCollection to a WindowedSCollection.
Return the top k (largest) elements from this SCollection as defined by the specified implicit Ordering[T].
a new SCollection whose single value is an Iterable of the top k
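The result of top can be sketched on a local collection as sorting descending under the Ordering and taking k:

```scala
// Local analogy for top(k): the k largest elements under the implicit Ordering.
val xs = Seq(5, 1, 4, 2, 3)
val k = 2
val topK: Seq[Int] = xs.sorted(Ordering[Int].reverse).take(k)
// topK == Seq(5, 4)
```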
Apply a transform.
Return the union of this SCollection and another one. Any identical elements will appear multiple times (use distinct to eliminate them).
Window values into windows by days.
Window values into windows by months.
Window values into windows by weeks.
Window values into windows by years.
Convert this SCollection to an SCollectionWithFanout that uses an intermediate node to combine parts of the data to reduce load on the final global combine step.
the number of intermediate keys that will be used
Window values into fixed windows.
Group values into a single global window.
Set a custom name for the next transform to be applied.
Convert values into pairs of (value, window).
Window values based on sessions.
Convert this SCollection to an SCollectionWithSideInput with one or more SideInputs, similar to Spark broadcast variables. Call SCollectionWithSideInput.toSCollection when done with side inputs.
val s1: SCollection[Int] = // ...
val s2: SCollection[String] = // ...
val s3: SCollection[(String, Double)] = // ...

// Prepare side inputs
val side1 = s1.asSingletonSideInput
val side2 = s2.asIterableSideInput
val side3 = s3.asMapSideInput
val side4 = s4.asMultiMapSideInput

val p: SCollection[MyRecord] = // ...
p.withSideInputs(side1, side2, side3).map { (x, s) =>
  // Extract side inputs from context
  val s1: Int = s(side1)
  val s2: Iterable[String] = s(side2)
  val s3: Map[String, Double] = s(side3)
  val s4: Map[String, Iterable[Double]] = s(side4)
  // ...
}
Convert this SCollection to an SCollectionWithSideOutput with one or more SideOutputs, so that a single transform can write to multiple destinations.
// Prepare side outputs
val side1 = SideOutput[String]()
val side2 = SideOutput[Int]()

val p: SCollection[MyRecord] = // ...
p.withSideOutputs(side1, side2).map { (x, s) =>
  // Write to side outputs via context
  s.output(side1, "word").output(side2, 1)
  // ...
}
Window values into sliding windows.
Convert values into pairs of (value, timestamp).
Convert values into pairs of (value, window).
window type, must be BoundedWindow or one of its subtypes, e.g. GlobalWindow if this SCollection is not windowed or IntervalWindow if it is windowed.
Window values with the given function.
Generic write method for all ScioIO[T] implementations. If this is a test pipeline, this will evaluate the pre-registered output IO implementation that matches the passed ScioIO[T] implementation; if not, this will invoke the com.spotify.scio.io.ScioIO[T]#write method along with the write configurations passed in.
an implementation of the ScioIO[T] trait
configurations needed to perform the underlying write implementation
Return a new SCollection containing only the elements that also exist in the SideSet.
(Since version 0.8.0) use SCollection[T]#hashFilter(right.asSetSingletonSideInput) instead
Read files represented by elements of this SCollection as file patterns.
sc.parallelize("a.txt").readAll(TextIO.readAll())
(Since version 0.8.1) Use readFiles instead
Read files as byte arrays represented by elements of this SCollection as file patterns.
(Since version 0.8.1) Use readAllAsBytes instead
(Since version 0.8.0) Use SCollection[T]#asSetSingletonSideInput instead
A Scala wrapper for PCollection. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all SCollections, such as map, filter, and sum. In addition, PairSCollectionFunctions contains operations available only on SCollections of key-value pairs, such as groupByKey and join; DoubleSCollectionFunctions contains operations available only on SCollections of Doubles.