spark

RDD

abstract class RDD[T] extends Serializable

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Each RDD is characterized by five main properties:

  - A list of splits (partitions)
  - A function for computing each split
  - A list of dependencies on other RDDs
  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for HDFS)

All the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself.

This class also contains transformation methods available on all RDDs (e.g. map and filter). In addition, PairRDDFunctions contains extra methods available on RDDs of key-value pairs, and SequenceFileRDDFunctions contains extra methods for saving RDDs to Hadoop SequenceFiles.
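The transformation methods mentioned above mirror the Scala collections API. A minimal plain-Scala sketch of the per-element semantics of map and filter (on an RDD the same calls would be distributed across partitions; no SparkContext is assumed here):

```scala
// Plain-Scala sketch: map and filter have the same per-element meaning
// on an RDD, but run in parallel over its partitions.
val nums = (1 to 10).toList
val evens = nums.filter(_ % 2 == 0).map(_ * 10)
println(evens) // List(20, 40, 60, 80, 100)
```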

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new RDD(sc: SparkContext)(implicit arg0: ClassManifest[T])

Abstract Value Members

  1. abstract def compute(split: Split): Iterator[T]

  2. abstract val dependencies: List[spark.Dependency[_]]

  3. abstract def splits: Array[Split]

Concrete Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. def ++(other: RDD[T]): RDD[T]

  5. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  6. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  7. def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassManifest[U]): U

    Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U, to avoid memory allocation.
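A semantics sketch in plain Scala (no cluster): the elements of an emulated two-partition RDD[Int] are aggregated into a (sum, count) pair, i.e. T = Int and U = (Int, Int). The partition layout is invented for illustration.

```scala
// Emulate aggregate over a two-partition RDD[Int].
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5))
val zero = (0, 0)
// seqOp merges a T into a U; combOp merges two U's.
val seqOp  = (u: (Int, Int), t: Int) => (u._1 + t, u._2 + 1)
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
// One U per partition, then the per-partition results are combined.
val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
val result = perPartition.foldLeft(zero)(combOp)
println(result) // (15,5): sum 15 over 5 elements
```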

  8. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  9. def cache(): RDD[T]

  10. def cartesian[U](other: RDD[U])(implicit arg0: ClassManifest[U]): RDD[(T, U)]

  11. def clone(): AnyRef

    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws()
  12. def collect(): Array[T]

  13. def context: SparkContext

  14. def count(): Long

  15. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  17. def filter(f: (T) ⇒ Boolean): RDD[T]

  18. def finalize(): Unit

    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws()
  19. def first(): T

  20. def flatMap[U](f: (T) ⇒ TraversableOnce[U])(implicit arg0: ClassManifest[U]): RDD[U]

  21. def fold(zeroValue: T)(op: (T, T) ⇒ T): T

    Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
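A plain-Scala semantics sketch: fold applies the same associative op within each emulated partition, then across the per-partition results. The partition layout is invented for illustration.

```scala
// Emulate fold over a two-partition RDD[Int] with op = addition.
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5))
val op = (a: Int, b: Int) => a + b
// Fold within each partition, then fold the partial results.
val total = partitions.map(_.foldLeft(0)(op)).foldLeft(0)(op)
println(total) // 15
```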

  22. def foreach(f: (T) ⇒ Unit): Unit

  23. final def getClass(): java.lang.Class[_]

    Definition Classes
    AnyRef → Any
  24. def glom(): RDD[Array[T]]

  25. def groupBy[K](f: (T) ⇒ K)(implicit arg0: ClassManifest[K]): RDD[(K, Seq[T])]

  26. def groupBy[K](f: (T) ⇒ K, numSplits: Int)(implicit arg0: ClassManifest[K]): RDD[(K, Seq[T])]

  27. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  28. val id: Int

  29. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  30. final def iterator(split: Split): Iterator[T]

  31. def map[U](f: (T) ⇒ U)(implicit arg0: ClassManifest[U]): RDD[U]

  32. def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: ClassManifest[U]): RDD[U]

  33. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  34. final def notify(): Unit

    Definition Classes
    AnyRef
  35. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  36. val partitioner: Option[Partitioner]

  37. def pipe(command: Seq[String], env: Map[String, String]): RDD[String]

  38. def pipe(command: Seq[String]): RDD[String]

  39. def pipe(command: String): RDD[String]

  40. def preferredLocations(split: Split): Seq[String]

  41. def reduce(f: (T, T) ⇒ T): T

  42. def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T]

  43. def saveAsObjectFile(path: String): Unit

  44. def saveAsTextFile(path: String): Unit

  45. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  46. def take(num: Int): Array[T]

    Take the first num elements of the RDD. This currently scans the partitions *one by one*, so it will be slow if many partitions are required. In that case, use collect() to get the whole RDD instead.
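The partition-by-partition scan described above can be sketched in plain Scala: partitions are consumed in order, and the scan stops as soon as num elements have been collected, so later partitions are never touched. The partition layout is invented for illustration.

```scala
// Emulate take(num): scan partitions one by one, stopping early.
val partitions = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6))
val num = 3
var buf = Vector.empty[Int]
val it = partitions.iterator
while (buf.size < num && it.hasNext)
  buf = buf ++ it.next().take(num - buf.size)
println(buf) // Vector(1, 2, 3) -- the third partition is never scanned
```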

  47. def takeSample(withReplacement: Boolean, num: Int, seed: Int): Array[T]

  48. def toArray(): Array[T]

  49. def toString(): String

    Definition Classes
    AnyRef → Any
  50. def union(other: RDD[T]): RDD[T]

  51. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws()
  52. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws()
  53. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws()
