io.eels.datastream

DataStream

trait DataStream extends Logging

A DataStream is similar to a table of data: it has fields (like columns) and rows. Each row has an entry for every field, which may be null depending on the field definition.

It is a lazily evaluated data structure. Each operation on a stream will create a new derived stream, but those operations will only occur when a final action is performed.
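
The lazy behaviour described above can be illustrated with a plain Scala collection view. This is only an analogy and does not use the eel API: `log`, `pipeline`, and `result` are hypothetical names, and `toVector` stands in for a final action. It is written in script style.

```scala
import scala.collection.mutable.ListBuffer

// Records each operation at the moment it actually runs.
val log = ListBuffer.empty[String]

// Building the pipeline runs nothing: a view only describes the work,
// much as each stream operation only derives a new stream.
val pipeline = Vector(1, 2, 3, 4).view
  .filter { n => log += s"filter $n"; n % 2 == 0 }
  .map { n => log += s"map $n"; n * 10 }

// Nothing has been logged yet, because no action has been performed.
val ranBeforeAction = log.size

// Forcing the view plays the role of the final action.
val result = pipeline.toVector
```

Only after `toVector` is called does `log` contain any entries.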

You can create a DataStream from an IO source, such as a Parquet file or a Hive table, or you may create a fully evaluated one from an in-memory structure. In the former case, the data is loaded only on demand, when an action is performed.

A DataStream is split into one or more partitions. Each partition can operate independently of the others. For example, if you filter a stream, each partition will be filtered separately, which allows the work to be parallelized. If you write out a stream, each partition can be written out to its own file, again allowing parallelization.
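
As a rough sketch of why partitions parallelize well, the model below treats a stream as a sequence of partitions. It is not the eel implementation: `partitions`, `filterPartition`, and the `part-N` output names are all made up for illustration, and the work is run sequentially here for simplicity.

```scala
// Hypothetical model: a stream as a sequence of partitions of rows.
val partitions: Seq[Vector[Int]] =
  Seq(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8))

// Filtering one partition depends only on that partition's rows,
// so each call below could run on a separate worker.
def filterPartition(rows: Vector[Int]): Vector[Int] = rows.filter(_ % 2 == 0)

val filtered: Seq[Vector[Int]] = partitions.map(filterPartition)

// Writing out: one output per partition (the names are invented).
val outputs: Seq[(String, Vector[Int])] =
  filtered.zipWithIndex.map { case (rows, i) => (s"part-$i", rows) }
```

Because no partition reads another partition's rows, the `map` over partitions could be replaced by any parallel scheduling strategy without changing the result.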

Self Type
DataStream
Linear Supertypes
Logging, AnyRef, Any

Abstract Value Members

  1. abstract def schema: StructType

    Returns the schema for this stream. This call does not cause a full evaluation; only the operations required to retrieve the schema are performed. For example, on a stream backed by a JDBC source, an empty result set is obtained in order to query the metadata for the database columns.

Concrete Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. def ++(other: DataStream): DataStream

    Joins two streams together, such that the rows of the given stream are appended to the end of this stream. This operation is the same as a concat operation. The resulting stream has numPartitions(a) + numPartitions(b) partitions.
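
    The partition arithmetic can be sketched with a simplified model in which a stream is just a sequence of partitions. `Partitions` and `concat` are hypothetical names used for illustration; this is not how eel represents streams internally.

    ```scala
    // Hypothetical model: a stream is a sequence of partitions, each a Vector of rows.
    type Partitions = Seq[Vector[String]]

    // Concatenation keeps every partition of both inputs, in order.
    def concat(a: Partitions, b: Partitions): Partitions = a ++ b

    val a: Partitions = Seq(Vector("a1", "a2"), Vector("a3"))
    val b: Partitions = Seq(Vector("b1"))
    val joined: Partitions = concat(a, b)
    ```

    Here `joined` has three partitions (two from `a` plus one from `b`), and the rows of `b` follow the rows of `a`.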

  4. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  5. def addField(name: String, defaultValue: String): DataStream

    Returns a new DataStream with the new field of type String added at the end. The value of this field for each Row is specified by the default value.

  6. def addField(field: Field, defaultValue: Any): DataStream

    Returns a new DataStream with the given field added at the end. The value of this field for each Row is specified by the default value, which must be compatible with the field definition. E.g., an error will occur if the field has type Int and the default value is 1.3.

  7. def addFieldIfNotExists(field: Field, defaultValue: Any): DataStream

  8. def addFieldIfNotExists(name: String, defaultValue: Any): DataStream

  9. def aggregated(): GroupedDataStream

  10. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  11. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  12. def collect: Vector[Row]


    Action which results in all the rows being returned in memory as a Vector.

  13. def count: Long

  14. def drop(k: Int): DataStream

    Returns a new DataStream from which k rows have been dropped. This operation requires a reshuffle.

  15. def dropNullRows(): DataStream

  16. def dropWhile(fieldName: String, pred: (Any) ⇒ Boolean): DataStream

  17. def dropWhile(p: (Row) ⇒ Boolean): DataStream

  18. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  20. implicit val executor: ExecutionContextExecutor

  21. def exists(p: (Row) ⇒ Boolean): Boolean

  22. def explode(fn: (Row) ⇒ Seq[Row]): DataStream

  23. def filter(fieldName: String, p: (Any) ⇒ Boolean): DataStream

    Retains only the rows where the value of the given field matches the given predicate.

  24. def filter(p: (Row) ⇒ Boolean): DataStream


    For each row in the stream, filter drops any rows which do not match the predicate.

  25. def filterNot(p: (Row) ⇒ Boolean): DataStream

  26. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  27. def find(p: (Row) ⇒ Boolean): Option[Row]

  28. def fold[A](initial: A)(fn: (A, Row) ⇒ A): A

  29. def forall(p: (Row) ⇒ Boolean): Boolean

  30. def foreach[U](fn: (Row) ⇒ U): DataStream

    Executes a side-effecting function for every row in the stream, returning the same row.

  31. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  32. def groupBy(fn: (Row) ⇒ Any): GroupedDataStream

  33. def groupBy(fields: Iterable[String]): GroupedDataStream

  34. def groupBy(first: String, rest: String*): GroupedDataStream

  35. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  36. def head: Row

  37. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  38. def iterator: Iterator[Row]


    Action which returns a scala.collection.CloseIterator, which will result in the lazy evaluation of the stream, element by element.

  39. def join(other: DataStream): DataStream

    Combines two streams together such that the fields from this stream are joined with the fields of the given stream. E.g., if this stream has fields A,B and the given stream has fields C,D then the result will have fields A,B,C,D.

    Because each stream may be partitioned differently, the streams need to be re-partitioned to ensure an even distribution.
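
    The field layout of the result can be sketched with a toy model. `Table` and `joinFields` are hypothetical and not part of eel, and pairing rows by position is an assumption made only for this illustration; the description above specifies just the resulting field order.

    ```scala
    // Hypothetical model: a schema is a list of field names, a row a list of values.
    final case class Table(fields: Vector[String], rows: Vector[Vector[Any]])

    // Assumption: rows are paired by position, so row i of the result is
    // row i of `left` followed by row i of `right`.
    def joinFields(left: Table, right: Table): Table =
      Table(
        left.fields ++ right.fields,
        left.rows.zip(right.rows).map { case (l, r) => l ++ r }
      )

    val ab = Table(Vector("A", "B"), Vector(Vector(1, 2), Vector(3, 4)))
    val cd = Table(Vector("C", "D"), Vector(Vector("x", "y"), Vector("z", "w")))
    val abcd = joinFields(ab, cd)
    ```

    In this model `abcd.fields` is A, B, C, D, matching the example in the description.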

  40. val logger: Logger

    Attributes
    protected
    Definition Classes
    Logging
  41. def map(f: (Row) ⇒ Row): DataStream

  42. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  43. final def notify(): Unit

    Definition Classes
    AnyRef
  44. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  45. def projection(fields: Seq[String]): DataStream


    Returns a new DataStream which contains the given list of fields from the existing stream.

  46. def projection(first: String, rest: String*): DataStream

  47. def projectionExpression(expr: String): DataStream

  48. def removeField(fieldName: String, caseSensitive: Boolean = true): DataStream

  49. def renameField(nameFrom: String, nameTo: String): DataStream

  50. def replace(from: String, target: Any): DataStream

    For each row, any values that match "from" will be replaced with "target". This operation applies to all values for all rows.

  51. def replace(fieldName: String, from: String, target: Any): DataStream

    Replaces any values that match "from" with the value "target". This operation only applies to the field name specified.

  52. def replace(fieldName: String, fn: (Any) ⇒ Any): DataStream

    For each row, the function is applied to the value corresponding to the given fieldName. The result of the function is the new value for that cell.

  53. def replaceFieldType(from: DataType, toType: DataType): DataStream

  54. def replaceNullValues(defaultValue: String): DataStream

  55. def sample(k: Int): DataStream

    Returns a new DataStream where only every k-th row is retained. I.e., if k is 2 then, on average, every other row will be returned; if k is 10 then only 10% of rows will be returned. When running concurrently, the rows that are sampled will vary depending on the order in which the workers pull through the rows. Each partition uses its own counter.
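
    A minimal sketch of counter-based sampling, assuming a simplified model in which each partition is a Vector. `sampleEvery` is a hypothetical helper, not the eel implementation, and the choice to keep rows k, 2k, 3k, ... (rather than starting from the first row) is an assumption.

    ```scala
    // Keeps every k-th row of one partition using a simple counter.
    // Assumption: the counter starts at zero, so rows k, 2k, 3k, ... are kept.
    def sampleEvery[A](partition: Vector[A], k: Int): Vector[A] = {
      require(k > 0, "k must be positive")
      var counter = 0L
      partition.filter { _ =>
        counter += 1
        counter % k == 0
      }
    }

    // Each partition is sampled with its own counter.
    val sampled: Seq[Vector[Int]] =
      Seq(Vector(1, 2, 3, 4, 5), Vector(6, 7, 8)).map(sampleEvery(_, 2))
    ```

    Because the counters are independent, the retained rows depend on how rows are distributed across partitions, which is why the sampled rows can vary between runs of a concurrent pipeline.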

  56. def size: Long

  57. def stripCharsFromFieldNames(chars: Seq[Char]): DataStream

    Returns a new DataStream with the same data as this stream, but where the field names have been sanitized by removing any occurrences of the given characters.

  58. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  59. def take(k: Int): DataStream

  60. def takeWhile(pred: (Row) ⇒ Boolean): DataStream

  61. def takeWhile(fieldName: String, pred: (Any) ⇒ Boolean): DataStream

  62. def to(sink: Sink, listener: Listener): Long

  63. def to(sink: Sink): Long

  64. def toSet: Set[Row]

  65. def toString(): String

    Definition Classes
    AnyRef → Any
  66. def toVector: Vector[Row]

    Action which results in all the rows being returned in memory as a Vector. Alias for 'collect'.

  67. def union(other: DataStream): DataStream

  68. def updateField(name: String, field: Field): DataStream

  69. def updateFieldType(fieldName: String, datatype: DataType): DataStream

    Returns the same data but with an updated schema. The field that matches the given name will have its datatype set to the given datatype.

  70. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  71. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  72. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  73. def withLowerCaseSchema(): DataStream

