spark.RDD

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Each RDD is characterized by five main properties:

- A list of splits (partitions)
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for HDFS)

All the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself.
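The five properties can be pictured with a small plain-Scala model. This is only a sketch: MiniSplit, MiniRDD, ParallelMini and MappedMini are hypothetical names invented for illustration, not Spark classes, and the model runs everything in one process with no scheduler.

```scala
object RddModelSketch {
  // Hypothetical stand-in for a split: here it is just an index.
  final case class MiniSplit(index: Int)

  // The five properties from the description, modeled as class members.
  abstract class MiniRDD[T] {
    def splits: Array[MiniSplit]                                 // 1. list of splits
    def compute(split: MiniSplit): Iterator[T]                   // 2. how to compute one split
    def dependencies: List[MiniRDD[_]]                           // 3. parent datasets
    def partitioner: Option[Any => Int] = None                   // 4. optional partitioner
    def preferredLocations(split: MiniSplit): Seq[String] = Nil  // 5. optional locality hints

    // A transformation only wraps its parent; nothing is computed yet.
    def map[U](f: T => U): MiniRDD[U] = new MappedMini(this, f)

    // An action walks the splits and computes each one.
    def collect(): List[T] = splits.toList.flatMap(s => compute(s).toList)
  }

  // A source dataset that slices an in-memory sequence into numSplits chunks.
  class ParallelMini[T](data: Seq[T], numSplits: Int) extends MiniRDD[T] {
    def splits: Array[MiniSplit] = Array.tabulate(numSplits)(i => MiniSplit(i))
    def compute(split: MiniSplit): Iterator[T] = {
      val start = split.index * data.length / numSplits
      val end = (split.index + 1) * data.length / numSplits
      data.slice(start, end).iterator
    }
    def dependencies: List[MiniRDD[_]] = Nil
  }

  // A derived dataset: same splits as the parent, each element passed through f.
  class MappedMini[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def splits: Array[MiniSplit] = parent.splits
    def compute(split: MiniSplit): Iterator[U] = parent.compute(split).map(f)
    def dependencies: List[MiniRDD[_]] = List(parent)
  }

  val source = new ParallelMini(Seq(1, 2, 3, 4, 5), 2)
  val doubled: List[Int] = source.map(_ * 2).collect()
}
```

In this model each subclass decides how its own splits are computed, which is the point the paragraph above makes about RDD subclasses.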

This class also contains transformation methods available on all RDDs (e.g. map and filter). In addition, PairRDDFunctions contains extra methods available on RDDs of key-value pairs, and SequenceFileRDDFunctions contains extra methods for saving RDDs to Hadoop SequenceFiles.

Linear Supertypes

Serializable, Serializable, AnyRef, Any

Known Subclasses

CartesianRDD, CoGroupedRDD, FilteredRDD, FlatMappedRDD, FlatMappedValuesRDD, GlommedRDD, HadoopRDD, MapPartitionsRDD, MappedRDD, MappedValuesRDD, NewHadoopRDD, ParallelCollection, PipedRDD, SampledRDD, ShuffledRDD, SortedRDD, UnionRDD

Instance Constructors

new RDD(sc: SparkContext)(implicit arg0: ClassManifest[T])

Abstract Value Members

abstract def compute(split: Split): Iterator[T]
abstract val dependencies: List[spark.Dependency[_]]
abstract def splits: Array[Split]

Concrete Value Members

final def !=(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def !=(arg0: Any): Boolean

Definition Classes
Any
final def ##(): Int

Definition Classes
AnyRef → Any
def ++(other: RDD[T]): RDD[T]
final def ==(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def ==(arg0: Any): Boolean

Definition Classes
Any
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassManifest[U]): U

Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U (seqOp) and one operation for merging two U's (combOp), as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U, to avoid memory allocation.
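The two-level shape of aggregate can be modeled with ordinary Scala collections. The sketch below is not the Spark implementation: it assumes a nested Seq stands in for the partitions, and the helper name and its by-name zeroValue (so each partition starts from a fresh zero) are choices made for illustration. It computes a sum and a count in one pass, so U is a pair while T is Int.

```scala
object AggregateSketch {
  // Plain-Scala model: fold each partition with seqOp, then merge the
  // per-partition results with combOp. zeroValue is by-name so that every
  // fold starts from its own copy of the zero.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zeroValue: => U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    val partials = partitions.map(_.foldLeft(zeroValue)(seqOp))
    partials.foldLeft(zeroValue)(combOp)
  }

  // Three partitions, one of them empty.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5), Seq.empty)

  // U = (sum, count), T = Int.
  val (sum, count) = aggregate(partitions)((0, 0))(
    (acc, x) => (acc._1 + x, acc._2 + 1),   // merge one T into a U
    (a, b) => (a._1 + b._1, a._2 + b._2))   // merge two U's

  val mean: Double = sum.toDouble / count
}
```

The empty partition contributes only the zero value, which is why the zero must be neutral for combOp.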
final def asInstanceOf[T0]: T0

Definition Classes
Any
def cache(): RDD[T]
def cartesian[U](other: RDD[U])(implicit arg0: ClassManifest[U]): RDD[(T, U)]
def clone(): AnyRef

Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws()
def collect(): Array[T]
def context: SparkContext
def count(): Long
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def filter(f: (T) ⇒ Boolean): RDD[T]
def finalize(): Unit

Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws()
def first(): T
def flatMap[U](f: (T) ⇒ TraversableOnce[U])(implicit arg0: ClassManifest[U]): RDD[U]
def fold(zeroValue: T)(op: (T, T) ⇒ T): T

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
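Why the zero value has to be neutral can be seen in a small plain-Scala model. This is a sketch under an assumption, not the Spark code: it assumes the zero value is applied once per partition and once more when the partition results are merged, and it uses a nested Seq in place of real partitions.

```scala
object FoldSketch {
  // Model of fold: the same op and zero are used inside each partition
  // and again when merging the per-partition results.
  def fold[T](partitions: Seq[Seq[T]])(zeroValue: T)(op: (T, T) => T): T = {
    val partials = partitions.map(_.foldLeft(zeroValue)(op))
    partials.foldLeft(zeroValue)(op)
  }

  // The same four elements, partitioned two different ways.
  val twoParts = Seq(Seq(1, 2), Seq(3, 4))
  val onePart = Seq(Seq(1, 2, 3, 4))

  // With a neutral zero (0 for +) the partitioning does not matter.
  val sumTwo = fold(twoParts)(0)(_ + _)
  val sumOne = fold(onePart)(0)(_ + _)

  // With a non-neutral zero the result depends on the number of partitions.
  val skewedTwo = fold(twoParts)(1)(_ + _)
  val skewedOne = fold(onePart)(1)(_ + _)
}
```

In this model sumTwo and sumOne agree, while skewedTwo and skewedOne differ because the extra 1 is added once per partition and once in the final merge.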
def foreach(f: (T) ⇒ Unit): Unit
final def getClass(): java.lang.Class[_]

Definition Classes
AnyRef → Any
def glom(): RDD[Array[T]]
def groupBy[K](f: (T) ⇒ K)(implicit arg0: ClassManifest[K]): RDD[(K, Seq[T])]
def groupBy[K](f: (T) ⇒ K, numSplits: Int)(implicit arg0: ClassManifest[K]): RDD[(K, Seq[T])]
def hashCode(): Int

Definition Classes
AnyRef → Any
val id: Int
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def iterator(split: Split): Iterator[T]
def map[U](f: (T) ⇒ U)(implicit arg0: ClassManifest[U]): RDD[U]
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: ClassManifest[U]): RDD[U]
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
val partitioner: Option[Partitioner]
def pipe(command: Seq[String], env: Map[String, String]): RDD[String]
def pipe(command: Seq[String]): RDD[String]
def pipe(command: String): RDD[String]
def preferredLocations(split: Split): Seq[String]
def reduce(f: (T, T) ⇒ T): T
def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T]
def saveAsObjectFile(path: String): Unit
def saveAsTextFile(path: String): Unit
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def take(num: Int): Array[T]

Take the first num elements of the RDD. This currently scans the partitions one by one, so it will be slow if many partitions are needed. In that case, use collect() to get the whole RDD instead.
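The partition-by-partition scan can be pictured with a plain-Scala model. This is a sketch, not the Spark implementation: each partition is assumed to be a function that produces an iterator when evaluated, and the helper also returns how many partitions it had to evaluate, a counter added here purely for illustration.

```scala
import scala.collection.mutable.ListBuffer

object TakeSketch {
  // Evaluate partitions in order and stop as soon as num elements are
  // collected. Returns the elements and the number of partitions evaluated.
  def take[T](partitions: Seq[() => Iterator[T]], num: Int): (List[T], Int) = {
    val buf = ListBuffer.empty[T]
    var scanned = 0
    val remaining = partitions.iterator
    while (buf.size < num && remaining.hasNext) {
      val part = remaining.next()()   // evaluate the next partition
      scanned += 1
      while (buf.size < num && part.hasNext) buf += part.next()
    }
    (buf.toList, scanned)
  }

  val parts: Seq[() => Iterator[Int]] =
    Seq(() => Iterator(1, 2), () => Iterator(3, 4), () => Iterator(5, 6))

  val (firstThree, scannedForThree) = take(parts, 3)
  val (allSix, scannedForSix) = take(parts, 6)
}
```

Asking for three elements touches two of the three partitions in this model, while asking for all six touches every partition in sequence, which is the case where collecting everything at once is the better fit.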
def takeSample(withReplacement: Boolean, num: Int, seed: Int): Array[T]
def toArray(): Array[T]
def toString(): String

Definition Classes
AnyRef → Any
def union(other: RDD[T]): RDD[T]
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws()
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws()
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws()

RDD

abstract class RDD[T] extends Serializable
