com.twitter.scalding

RichPipe

class RichPipe extends Serializable with JoinAlgorithms

This is an enrichment-pattern class for cascading.pipe.Pipe. The rule is to never use this class directly in input or return types, but only to add methods to Pipe.

Linear Supertypes
JoinAlgorithms, Serializable, AnyRef, Any
Ordering
  1. Alphabetic
  2. By inheritance
Inherited
  1. RichPipe
  2. JoinAlgorithms
  3. Serializable
  4. AnyRef
  5. Any
  1. Hide All
  2. Show all
Learn more about member selection
Visibility
  1. Public
  2. All

Instance Constructors

  1. new RichPipe(pipe: Pipe)

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. def ++(that: Pipe): Pipe

    Merge or Concatenate several pipes together with this one:
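
    A minimal sketch of ++ inside a Job (the Tsv sources and field names are hypothetical; both pipes are assumed to declare the same fields):

    val clicks2019 = Tsv("clicks-2019.tsv", ('userId, 'url)).read
    val clicks2020 = Tsv("clicks-2020.tsv", ('userId, 'url)).read
    val allClicks  = clicks2019 ++ clicks2020   // plain concatenation, no deduplication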

  5. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  6. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  7. def addTrap(trapsource: Source)(implicit flowDef: FlowDef, mode: Mode): Pipe

    Adds a trap to the current pipe, which will capture all exceptions that occur in this pipe and save them to the trapsource given

    Traps do not include the original fields in a tuple, only the fields seen in an operation. Traps also do not include any exception information.

    There can only be at most one trap for each pipe.
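
    A minimal sketch of addTrap inside a Job, where the implicit FlowDef and Mode are in scope (sources and field names are hypothetical); rows whose parse throws are diverted to the trap instead of failing the job:

    Tsv("input.tsv", 'rawAge).read
      .addTrap(Tsv("bad-rows"))                        // at most one trap per pipe
      .map('rawAge -> 'age) { s: String => s.toInt }   // a NumberFormatException sends the row to the trap
      .write(Tsv("output"))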

  8. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  9. def blockJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, rightReplication: Int = 1, leftReplication: Int = 1, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Performs a block join, otherwise known as a replicate fragment join (RF join). The input params leftReplication and rightReplication control the replication of the left and right pipes respectively.

    This is useful in cases where the data has extreme skew. A symptom of this is that we may see a job stuck for a very long time on a small number of reducers.

    A block join is a way to get around this: we add a random integer field and a replica field to every tuple in the left and right pipes. We then join on the original keys and on these new dummy fields. These dummy fields make it less likely that the skewed keys will be hashed to the same reducer.

    The final data size is right * rightReplication + left * leftReplication but because of the fragmentation, we are guaranteed the same number of hits as the original join.

    If the right pipe is really small then you are probably better off with a joinWithTiny. If however the right pipe is medium sized, then you are better off with a blockJoinWithSmaller, and a good rule of thumb is to set rightReplication = left.size / right.size and leftReplication = 1

    Finally, if both pipes are of similar size, e.g. in case of a self join with a high data skew, then it makes sense to set leftReplication and rightReplication to be approximately equal.

    Note

    You can only use an InnerJoin or a LeftJoin with a leftReplication of 1 (or a RightJoin with a rightReplication of 1) when doing a blockJoin.

    Definition Classes
    JoinAlgorithms
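
    A minimal sketch of blockJoinWithSmaller under the rule of thumb above (pipes and field names are hypothetical; the right pipe is assumed to be roughly one tenth the size of the skewed left pipe):

    val joined = skewedEvents.blockJoinWithSmaller(
      'userId -> 'id,           // join keys: left field -> right field
      userProfiles,
      rightReplication = 10,    // roughly left.size / right.size
      leftReplication = 1       // must stay 1 for the default InnerJoin (or a LeftJoin)
    )
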
  10. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  11. def coGroupBy(f: Fields, j: JoinMode = InnerJoinMode)(builder: (CoGroupBuilder) ⇒ GroupBuilder): Pipe

    This method is used internally to implement all joins. You can use this directly if you want to implement something like a star join, e.g., when joining a single pipe to multiple other pipes. Make sure that you call this method on the larger pipe to make the grouping as efficient as possible.

    If you are only joining two pipes, then you are better off using joinWithSmaller/joinWithLarger/joinWithTiny/leftJoinWithTiny.

    Definition Classes
    JoinAlgorithms
  12. def collect[A, T](fs: (Fields, Fields))(fn: PartialFunction[A, T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Filters all data for which this partial function is defined, and then applies that function.
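
    A minimal sketch of collect (field names are hypothetical): rows where the partial function is not defined are dropped, matching rows are transformed.

    pipe.collect[(String, Int), String](('name, 'age) -> 'adultName) {
      case (name, age) if age >= 18 => name
    }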

  13. def collectTo[A, T](fs: (Fields, Fields))(fn: PartialFunction[A, T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

  14. def crossWithSmaller(p: Pipe, replication: Int = 20): Pipe

    Does a cross-product by doing a blockJoin. Useful when doing a large cross, if your cluster can take it. Prefer crossWithTiny.

    Definition Classes
    JoinAlgorithms
  15. def crossWithTiny(tiny: Pipe): Pipe

    WARNING

    Doing a cross product with even a moderately sized pipe can create ENORMOUS output. The use-case here is attaching a constant (e.g. a number, a dictionary, or a set) to each row in another pipe. A common use-case comes from a groupAll reduction down to one row, whose result you then want to send back out to every element in a pipe; a sketch follows below.

    This uses joinWithTiny, so the tiny pipe is replicated to all mappers. If it is large, this will blow up.

    Use at your own risk.

    Definition Classes
    JoinAlgorithms
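
    A minimal sketch of the groupAll-then-broadcast pattern described above (field names are hypothetical):

    val totalRows = pipe.groupAll { _.size('total) }   // a single one-row pipe
    val annotated = pipe.crossWithTiny(totalRows)      // every row now also carries 'total
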
  16. def debug(dbg: PipeDebug): Pipe

    Print the tuples that pass, with the options configured in the given debugger. For instance:

    debug(PipeDebug().toStdOut.printTuplesEvery(100))
  17. def debug: Pipe

    Print all the tuples that pass to stderr

  18. def discard(f: Fields): Pipe

    Discard the given fields, and keep the rest. Kind of the opposite of the project method.

  19. def distinct(f: Fields): Pipe

    Returns the set of distinct tuples containing the specified fields

  20. def each(fs: (Fields, Fields))(fn: (Fields) ⇒ Function[_]): Each

    Convenience method for integrating with existing cascading Functions

  21. def eachTo(fs: (Fields, Fields))(fn: (Fields) ⇒ Function[_]): Each

    Same as above, but only keep the results field.

  22. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  23. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  24. def filter[A](f: Fields)(fn: (A) ⇒ Boolean)(implicit conv: TupleConverter[A]): Pipe

    Keep only items that satisfy this predicate.

  25. def filterNot[A](f: Fields)(fn: (A) ⇒ Boolean)(implicit conv: TupleConverter[A]): Pipe

    Keep only items that don't satisfy this predicate. filterNot is equivalent to negating a filter operation.

    filterNot('name) { name: String => name contains "a" }

    is the same as:

    filter('name) { name: String => !(name contains "a") }
  26. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  27. def flatMap[A, T](fs: (Fields, Fields))(fn: (A) ⇒ TraversableOnce[T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe
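
    A minimal sketch of flatMap (field names are hypothetical): each input row yields zero or more output rows, one per element of the returned TraversableOnce.

    pipe.flatMap('line -> 'word) { line: String => line.split("\\s+") }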

  28. def flatMapTo[A, T](fs: (Fields, Fields))(fn: (A) ⇒ TraversableOnce[T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

  29. def flatten[T](fs: (Fields, Fields))(implicit conv: TupleConverter[TraversableOnce[T]], setter: TupleSetter[T]): Pipe

    the same as

    flatMap(fs) { it : TraversableOnce[T] => it }

    Common enough to be useful.

  30. def flattenTo[T](fs: (Fields, Fields))(implicit conv: TupleConverter[TraversableOnce[T]], setter: TupleSetter[T]): Pipe

    the same as

    flatMapTo(fs) { it : TraversableOnce[T] => it }

    Common enough to be useful.

  31. lazy val forceToDisk: Pipe

    Force a materialization to disk in the flow. This is useful before crossWithTiny if you filter just before. Ideally scalding/cascading would see this (and may in future versions), but for now it is here to aid in hand-tuning jobs.

  32. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  33. def groupAll(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Warning

    This kills parallelism. All the work is sent to one reducer.

    Only use this in the case that you truly need all the data on one reducer.

    Just about the only reasonable case of this method is to reduce all values of a column or count all the rows.

  34. def groupAll: Pipe

    Group all tuples down to one reducer (due to a cascading limitation). This is probably only useful just before setting a tail such as a database tail, so that only one reducer talks to the DB. Kind of a hack.

  35. def groupAndShuffleRandomly(reducers: Int, seed: Long)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Like groupAndShuffleRandomly(reducers : Int) but with a fixed seed.

  36. def groupAndShuffleRandomly(reducers: Int)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Like shard, except do some operation in the reducers

  37. def groupBy(f: Fields)(builder: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Group the Pipe based on fields.

    builder is typically a block that modifies the given GroupBuilder. The final output of the block is used to schedule the new pipe. Each method in GroupBuilder returns this, so it is recommended to chain them and use the default input:

    _.size.max('f1) etc...
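
    A slightly fuller sketch of groupBy (field names are hypothetical), chaining GroupBuilder methods as recommended:

    pipe.groupBy('userId) { _.size('numEvents).max('score) }
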
  38. def groupRandomly(n: Int, seed: Long)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Like groupRandomly(n: Int), but with a given seed for the randomization.

  39. def groupRandomly(n: Int)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Like groupAll, but randomly groups data into n reducers.

    you can provide a seed for the random number generator to get reproducible results

  40. def groupRandomlyAux(n: Int, optSeed: Option[Long])(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Attributes
    protected
  41. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  42. def insert[A](fs: Fields, value: A)(implicit setter: TupleSetter[A]): Pipe

    Adds a field with a constant value.

    Usage

    insert('a, 1)
  43. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  44. def joinWithLarger(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Same as reversing the order on joinWithSmaller.

    Definition Classes
    JoinAlgorithms
  45. def joinWithSmaller(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Joins the first set of keys in the first pipe to the second set of keys in the second pipe. All keys must be unique UNLESS it is an inner join, then duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).

    Smaller here means that the argument pipe has fewer values per key than the left (this) pipe.

    Avoid going crazy adding more explicit join modes. Instead, for some other join mode with a larger pipe, do:

    .then { pipe => other.
    joinWithSmaller(('other1, 'other2)->('this1, 'this2), pipe, new FancyJoin)
    }
    Definition Classes
    JoinAlgorithms
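
    A minimal sketch of the common joinWithSmaller case (pipes and field names are hypothetical); it is called on the larger pipe and passed the smaller one:

    val joined = users.joinWithSmaller('countryCode -> 'code, countries)
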
  46. def joinWithTiny(fs: (Fields, Fields), that: Pipe): Pipe

    This does an asymmetric join, using cascading's "HashJoin". This only runs through this pipe once, and keeps the right hand side pipe in memory (but is spillable).

    Choose this when Left > max(mappers, reducers) * Right, or when the left side is three orders of magnitude larger.

    Joins the first set of keys in the first pipe to the second set of keys in the second pipe. Duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).

    Warning

    This does not work with outer joins, or right joins, only inner and left join versions are given.

    Definition Classes
    JoinAlgorithms
  47. def joinerToJoinModes(j: Joiner): (Product with Serializable with JoinMode, Product with Serializable with JoinMode)

    Definition Classes
    JoinAlgorithms
  48. def leftJoinWithLarger(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

    This is joinWithLarger with the joiner parameter fixed to LeftJoin. If the item is absent on the right, null is placed in the keys and values.

    Definition Classes
    JoinAlgorithms
  49. def leftJoinWithSmaller(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

    This is joinWithSmaller with the joiner parameter fixed to LeftJoin. If the item is absent on the right, null is placed in the keys and values.

    Definition Classes
    JoinAlgorithms
  50. def leftJoinWithTiny(fs: (Fields, Fields), that: Pipe): HashJoin

    Definition Classes
    JoinAlgorithms
  51. def limit(n: Long): Pipe

    Keep at most n elements. This is implemented by keeping approximately n/k elements on each of the k mappers or reducers (whichever we wind up being scheduled on).

  52. def map[A, T](fs: (Fields, Fields))(fn: (A) ⇒ T)(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    If you use a map function that does not accept TupleEntry args, which is the common case, an implicit conversion in GeneratedConversions will convert your function into a (TupleEntry => T). The result type T is converted to a cascading Tuple by an implicit TupleSetter[T]. Acceptable T types are primitive types, cascading Tuples of those types, or scala.Tuple(1-22) of those types.

    After the map, the input arguments will be set to the output of the map, so following with filter or map is fine without a new using statement if you mean to operate on the output.

    map('data -> 'stuff)

    * If output equals input, REPLACE is used.
    * If output or input is a subset of the other, SWAP is used.
    * Otherwise we append the new fields (cascading Fields.ALL is used).

    mapTo('data -> 'stuff)

    Only the results (stuff) are kept (cascading Fields.RESULTS).

    Note

    Using mapTo is the same as using map followed by a project for selecting just the output fields
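
    A minimal sketch of both forms (field names are hypothetical):

    pipe.map('heightCm -> 'heightM) { cm: Double => cm / 100.0 }     // keeps 'heightCm and appends 'heightM
    pipe.mapTo('heightCm -> 'heightM) { cm: Double => cm / 100.0 }   // keeps only 'heightM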

  53. def mapTo[A, T](fs: (Fields, Fields))(fn: (A) ⇒ T)(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

  54. def name(s: String): Pipe

    Rename the current pipe

  55. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  56. def normalize(f: Fields, useTiny: Boolean = true): Pipe

    Divides each value of this field by the sum of all values for the field; assumes without checking that division is supported on this type and that the sum is not zero.

    If those assumptions do not hold, this will throw an exception -- consider checking the sum separately and/or using addTrap.

    In some cases crossWithTiny has been broken; the useTiny parameter supports a work-around.

  57. final def notify(): Unit

    Definition Classes
    AnyRef
  58. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  59. def pack[T](fs: (Fields, Fields))(implicit packer: TuplePacker[T], setter: TupleSetter[T]): Pipe

    Maps the input fields into an output field of type T. For example:

    pipe.pack[(Int, Int)] (('field1, 'field2) -> 'field3)

    will pack fields 'field1 and 'field2 to field 'field3, as long as 'field1 and 'field2 can be cast into integers. The output field 'field3 will be of type (Int, Int).

  60. def packTo[T](fs: (Fields, Fields))(implicit packer: TuplePacker[T], setter: TupleSetter[T]): Pipe

    Same as pack but only the to fields are preserved.

  61. def partition[A, R](fs: (Fields, Fields))(fn: (A) ⇒ R)(builder: (GroupBuilder) ⇒ GroupBuilder)(implicit conv: TupleConverter[A], ord: Ordering[R], rset: TupleSetter[R]): Pipe

    Given a function, partitions the pipe into several groups based on the output of the function. Then applies a GroupBuilder function on each of the groups.

    Example:

    pipe
      .mapTo(() -> ('age, 'weight)) { ... }
      .partition('age -> 'isAdult) { _ > 18 } { _.average('weight) }

    pipe now contains the average weights of adults and minors.

  62. val pipe: Pipe

    Definition Classes
    RichPipe → JoinAlgorithms
  63. def project(fields: Fields): Pipe

    Keep only the given fields, and discard the rest. Takes any number of parameters, as long as we can convert them to a Fields object.

  64. def rename(fields: (Fields, Fields)): Pipe

    Rename some set of N fields as another set of N fields

    Usage

    rename('x -> 'z)
    rename(('x,'y) -> ('X,'Y))

    Warning

    rename('x,'y) is interpreted by scala as rename(Tuple2('x,'y)) which then does rename('x -> 'y). This is probably not what is intended but the compiler doesn't resolve the ambiguity. YOU MUST CALL THIS WITH A TUPLE2! If you don't, expect the unexpected.

  65. def sample(fraction: Double, seed: Long): Pipe

  66. def sample(fraction: Double): Pipe

    Sample a fraction of elements. fraction should be between 0.00 (0%) and 1.00 (100%). You can provide a seed to get reproducible results.

  67. def sampleWithReplacement(fraction: Double, seed: Int): Pipe

  68. def sampleWithReplacement(fraction: Double): Pipe

    Sample a fraction of elements with replacement. fraction should be between 0.00 (0%) and 1.00 (100%). You can provide a seed to get reproducible results.

  69. def shard(n: Int, seed: Int): Pipe

    Force a random shuffle of all the data to exactly n reducers, with a given seed if you need repeatability.

  70. def shard(n: Int): Pipe

    Force a random shuffle of all the data to exactly n reducers

  71. def shuffle(shards: Int, seed: Long): Pipe

  72. def shuffle(shards: Int): Pipe

    Put all rows in random order

    you can provide a seed for the random number generator to get reproducible results

  73. def skewJoinWithLarger(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

    Definition Classes
    JoinAlgorithms
  74. def skewJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

    Performs a skewed join, which is useful when the data has extreme skew.

    For example, imagine joining a pipe of Twitter's follow graph against a pipe of user genders, in order to find the gender distribution of the accounts every Twitter user follows. Since celebrities (e.g., Justin Bieber and Lady Gaga) have a much larger follower base than other users, and (under a standard join algorithm) all their followers get sent to the same reducer, the job will likely be stuck on a few reducers for a long time. A skewed join attempts to alleviate this problem.

    This works as follows:

    1. First, we sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
    2. We use these estimated counts to replicate the join keys, according to the given replication strategy.
    3. Finally, we join the replicated pipes together.

    sampleRate

    This controls how often we sample from the left and right pipes when estimating key counts.

    replicator

    Algorithm for determining how much to replicate a join key in the left and right pipes.

    Note: since we do not set the replication counts, only inner joins are allowed. (Otherwise, replicated rows would stay replicated when there is no counterpart in the other pipe.)

    Definition Classes
    JoinAlgorithms
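
    A minimal sketch of the follow-graph example above (pipes and field names are hypothetical; only inner joins are supported):

    val followerGenders = followGraph.skewJoinWithSmaller(
      'followedId -> 'userId,   // join keys: left field -> right field
      userGenders,
      sampleRate = 0.001,
      reducers = 200
    )
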
  75. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  76. def thenDo[T, U](pfn: (T) ⇒ U)(implicit in: (RichPipe) ⇒ T): U

    Insert a function into the pipeline:
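
    A hedged sketch of thenDo inside a Job, where the Pipe/RichPipe conversions are in scope (the args flag and field name are hypothetical); it is handy for conditionally applying a stage while keeping a fluent chain:

    val maybeDeduped = pipe.thenDo { p: Pipe =>
      if (args.boolean("dedupe")) p.unique('userId) else p
    }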

  77. def toString(): String

    Definition Classes
    AnyRef → Any
  78. def unique(f: Fields): Pipe

    Returns the set of unique tuples containing the specified fields. Same as distinct

  79. def unpack[T](fs: (Fields, Fields))(implicit unpacker: TupleUnpacker[T], conv: TupleConverter[T]): Pipe

    The opposite of pack. Unpacks the input field of type T into the output fields. For example:

    pipe.unpack[(Int, Int)] ('field1 -> ('field2, 'field3))

    will unpack 'field1 into 'field2 and 'field3

  80. def unpackTo[T](fs: (Fields, Fields))(implicit unpacker: TupleUnpacker[T], conv: TupleConverter[T]): Pipe

    Same as unpack but only the to fields are preserved.

  81. def unpivot(fieldDef: (Fields, Fields)): Pipe

    This is an analog of the SQL/Excel unpivot function, which converts columns of data into rows of data. Only the columns given as input fields are expanded in this way. For this operation to be reversible, you need to keep some unique key on each row. See GroupBuilder.pivot to reverse this operation, assuming you leave behind a grouping key.

    Example

    pipe.unpivot(('w,'x,'y,'z) -> ('feature, 'value))

    takes rows like:

    key, w, x, y, z
    1, 2, 3, 4, 5
    2, 8, 7, 6, 5

    to:

    key, feature, value
    1, w, 2
    1, x, 3
    1, y, 4

    etc...

  82. def upstreamPipes: Set[Pipe]

    Set of pipes reachable from this pipe (transitive closure of 'Pipe.getPrevious')

  83. def using[C <: AnyRef { def release(): Unit }](bf: ⇒ C): AnyRef { ... /* 3 definitions in type refinement */ }

    Beginning of a block with access to expensive nonserializable state. The state object should contain a function release() for resource management purposes.
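
    A hedged sketch of using (the client class and fields are hypothetical): the state is built on the task rather than serialized with the job, and release() is expected to be called when the operation finishes.

    // a hypothetical non-serializable resource with the required release() method
    class DictionaryClient {
      def lookup(word: String): String = ???   // e.g. an RPC or file-backed lookup
      def release(): Unit = ()                 // cleanup hook required by using
    }

    pipe
      .using(new DictionaryClient)
      .map('word -> 'definition) { (client: DictionaryClient, word: String) =>
        client.lookup(word)
      }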

  84. def verifyTypes[A](f: Fields)(implicit conv: TupleConverter[A]): Pipe

    Text files can have corrupted data. If you use this function and a cascading trap you can filter out corrupted data from your pipe.
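
    A minimal sketch combining verifyTypes with addTrap inside a Job (source and field names are hypothetical): rows whose fields cannot be converted to the given types are routed to the trap.

    pipe
      .addTrap(Tsv("corrupt-rows"))
      .verifyTypes[(Long, String)](('userId, 'name))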

  85. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  86. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  87. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  88. def write(outsource: Source)(implicit flowDef: FlowDef, mode: Mode): Pipe

    Write all the tuples to the given source and return this Pipe
