Utilities for Scala collection library.
Adds a top method to Array[T] and Iterable[T], and a topByKey method to Array[(K, V)] and Iterable[(K, V)].
import com.spotify.scio.extra.Collections._

val xs: Array[(String, Int)] = // ...
xs.top(5)(Ordering.by(_._2))
xs.topByKey(5)
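To make the intent of top and topByKey concrete, here is a minimal plain-Scala sketch of the semantics they suggest. It is not the library's implementation: the object name TopSketch, the descending result order, and the Map[K, Seq[V]] return type are assumptions made for illustration.

```scala
// Plain-Scala sketch of the assumed semantics; not the scio implementation.
object TopSketch {
  // Largest `n` elements according to `ord`, largest first (order is an assumption).
  def top[T](xs: Iterable[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.toSeq.sorted(ord.reverse).take(n)

  // For each key, the largest `n` values, largest first.
  def topByKey[K, V](xs: Iterable[(K, V)], n: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
    xs.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).toSeq.sorted(ord.reverse).take(n)
    }
}
```

For example, with xs = Seq("a" -> 3, "b" -> 1, "a" -> 7), TopSketch.top(xs, 2)(Ordering.by[(String, Int), Int](_._2)) keeps the two pairs with the largest values, and TopSketch.topByKey(xs, 1) keeps the single largest value per key.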
Utilities for Scala iterators.
Adds a timeSeries method to Iterator[T] so that it can be windowed with different logic.
import com.spotify.scio.extra.Iterators._

case class Event(user: String, action: String, timestamp: Long)
val i: Iterator[Event] = // ...

// 60 minutes fixed windows offset by 30 minutes
// E.g. minute [30, 90), [90, 150), [150, 210), [210, 270) ...
i.timeSeries(_.timestamp).fixed(3600000, 1800000)

// session windows with 60 minute gaps between windows
i.timeSeries(_.timestamp).session(3600000)

// 60 minutes sliding windows, one every 10 minutes, offset by 5 minutes
// E.g. minute [5, 65), [15, 75), [25, 85), [35, 95) ...
i.timeSeries(_.timestamp).sliding(3600000, 600000, 300000)
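The boundary arithmetic behind fixed and sliding windows can be sketched in plain Scala. This is an illustration only, assuming windows are aligned to time zero plus the offset. The object and function names are hypothetical and are not part of the timeSeries API.

```scala
// Sketch of window-boundary arithmetic; names are hypothetical, not scio API.
object WindowSketch {
  // Start of the fixed window of length `size`, shifted by `offset`, that contains `ts`.
  def fixedWindowStart(ts: Long, size: Long, offset: Long): Long =
    ts - Math.floorMod(ts - offset, size)

  // Starts (ascending) of every sliding window of length `size`,
  // one every `period`, shifted by `offset`, that contains `ts`.
  def slidingWindowStarts(ts: Long, size: Long, period: Long, offset: Long): Seq[Long] = {
    // Latest window start that is not after `ts`.
    val lastStart = ts - Math.floorMod(ts - offset, period)
    // Walk backwards while the window [start, start + size) still covers `ts`.
    Iterator.iterate(lastStart)(_ - period).takeWhile(_ > ts - size).toList.reverse
  }
}
```

Working in minutes for readability, fixedWindowStart(100, 60, 30) gives 90, so minute 100 falls in [90, 150). slidingWindowStarts(40, 60, 10, 5) gives six starts from -15 to 35, because each timestamp belongs to size / period overlapping windows.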
Main package for checkpoint API. Import all.
import com.spotify.scio.extra.checkpoint._
Main package for JSON APIs. Import all.
This package uses Circe for JSON handling under the hood.
import com.spotify.scio.extra.json._

// define a type-safe JSON schema
case class Record(i: Int, d: Double, s: String)

// read JSON as case classes
sc.jsonFile[Record]("input.json")

// write case classes as JSON
sc.parallelize((1 to 10).map(x => Record(x, x.toDouble, x.toString)))
  .saveAsJsonFile("output")
Main package for reading the LibSVM format.
import com.spotify.scio.extra.libsvm._

// Read a LibSVM file as (Label, SparseVector)
sc.libSVMFile("input.svm")
Main package for Sparkey side input APIs. Import all.
import com.spotify.scio.extra.sparkey._
To save an SCollection[(String, String)] to a Sparkey file:
val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))

// temporary location
val s1: SCollection[SparkeyUri] = s.asSparkey

// specific location
val s2: SCollection[SparkeyUri] = s.asSparkey("gs://<bucket>/<path>/<sparkey-prefix>")
The resulting SCollection[SparkeyUri] can be converted to a side input:
val s: SCollection[SparkeyUri] =
  sc.parallelize(Seq("a" -> "one", "b" -> "two")).asSparkey
val side: SideInput[SparkeyReader] = s.asSparkeySideInput
These two steps can be combined using syntactic sugar:
val side: SideInput[SparkeyReader] = sc
  .parallelize(Seq("a" -> "one", "b" -> "two"))
  .asSparkeySideInput
An existing Sparkey file can also be converted to a side input directly:
sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>")
SparkeyReader can be used like a lookup table in a side input operation:
val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
val side: SideInput[SparkeyReader] = sc
  .parallelize(Seq("a" -> "one", "b" -> "two"))
  .asSparkeySideInput
main
  .withSideInputs(side)
  .map { (x, s) => s(side).getOrElse(x, "unknown") }
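To show what the lookup yields for each element, the per-element logic can be reproduced with an in-memory Map standing in for the SparkeyReader. This is a sketch only. SideInputSketch and lookupAll are hypothetical names, and no Sparkey file or pipeline is involved.

```scala
// Plain-Scala stand-in for the side-input lookup; not scio or Sparkey code.
object SideInputSketch {
  // Stand-in for the SparkeyReader: an in-memory lookup table.
  val table: Map[String, String] = Map("a" -> "one", "b" -> "two")

  // Same per-element logic as the map step: look up the key, fall back to "unknown".
  def lookupAll(main: Seq[String]): Seq[String] =
    main.map(x => table.getOrElse(x, "unknown"))
}
```

SideInputSketch.lookupAll(Seq("a", "b", "c")) returns Seq("one", "two", "unknown"): the keys "a" and "b" resolve through the table, and "c" falls back to the default.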
Main package for transforms APIs. Import all.
Utilities for Breeze.
Includes Semigroups for Breeze data types such as DenseVector and DenseMatrix.
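As a rough illustration of what a Semigroup over a vector type provides, the sketch below defines a hypothetical minimal Semigroup trait and an instance for Array[Double]. The trait, the element-wise addition, and the sumAll helper are all assumptions. The package's own instances target Breeze types and a library-provided Semigroup type class.

```scala
// Hypothetical minimal Semigroup; the real type class comes from a library.
trait Semigroup[T] { def plus(a: T, b: T): T }

object Semigroup {
  // Assumed behaviour: combine two equal-length vectors element by element.
  implicit val arraySum: Semigroup[Array[Double]] = new Semigroup[Array[Double]] {
    def plus(a: Array[Double], b: Array[Double]): Array[Double] = {
      require(a.length == b.length, "vectors must have the same length")
      a.zip(b).map { case (x, y) => x + y }
    }
  }

  // Reduce a non-empty collection with the Semigroup found in implicit scope.
  def sumAll[T](xs: Seq[T])(implicit sg: Semigroup[T]): T = xs.reduce(sg.plus)
}
```

Semigroup.sumAll(Seq(Array(1.0, 2.0), Array(3.0, 4.0))) produces an array holding 4.0 and 6.0. The same associative combine step is what lets distributed aggregations sum vectors in any grouping order.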