Merge or concatenate several pipes together with this one:
Adds a trap to the current pipe, which will capture all exceptions that occur in this pipe and save them to the given trapsource.
Traps do not include the original fields in a tuple, only the fields seen in an operation. Traps also do not include any exception information.
There can only be at most one trap for each pipe.
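For illustration, a minimal sketch (the trap path and field names here are hypothetical):

  // tuples whose map throws are diverted to the trap instead of failing the job
  pipe
    .addTrap(Tsv("data/exceptions"))
    .map('line -> 'n) { s: String => s.toInt } // may throw NumberFormatException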
Performs a block join, otherwise known as a replicate fragment join (RF join). The input params leftReplication and rightReplication control the replication of the left and right pipes respectively.
This is useful in cases where the data has extreme skew. A symptom of this is that we may see a job stuck for a very long time on a small number of reducers.
A block join is a way to get around this: we add a random integer field and a replica field to every tuple in the left and right pipes. We then join on the original keys and on these new dummy fields. These dummy fields make it less likely that the skewed keys will be hashed to the same reducer.
The final data size is right * rightReplication + left * leftReplication, but because of the fragmentation, we are guaranteed the same number of hits as the original join.
If the right pipe is really small then you are probably better off with a joinWithTiny. If however the right pipe is medium sized, then you are better off with a blockJoinWithSmaller, and a good rule of thumb is to set rightReplication = left.size / right.size and leftReplication = 1
Finally, if both pipes are of similar size, e.g. in case of a self join with a high data skew, then it makes sense to set leftReplication and rightReplication to be approximately equal.
You can only use an InnerJoin or a LeftJoin with a leftReplication of 1 (or a RightJoin with a rightReplication of 1) when doing a blockJoin.
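For illustration, a hedged sketch (pipe names, fields, and the replication ratio are hypothetical):

  // right pipe is ~10x smaller, so replicate it 10x and leave the left alone
  val joined = big.blockJoinWithSmaller('userId -> 'id, medium,
    rightReplication = 10, leftReplication = 1)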
This method is used internally to implement all joins. You can use this directly if you want to implement something like a star join, e.g., when joining a single pipe to multiple other pipes. Make sure that you call this method on the larger pipe to make the grouping as efficient as possible.
If you are only joining two pipes, then you are better off using joinWithSmaller/joinWithLarger/joinWithTiny/leftJoinWithTiny.
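For illustration, a sketch of a star join via this method (pipe and field names are hypothetical, and the exact CoGroupBuilder calls should be checked against your Scalding version):

  // join the large pipe against two smaller ones in a single grouping;
  // 'gid and 'lid are the (distinctly named) key fields of the other pipes
  val starred = large.coGroupBy('id) {
    _.coGroup('gid, genders)
     .coGroup('lid, locations)
  }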
Keeps only the tuples for which the given partial function is defined, and then applies that function to them.
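For illustration, a small sketch (field names are hypothetical):

  // keep only rows whose 'line is all digits, parsing it into 'num
  val nums = pipe.collect('line -> 'num) {
    case s: String if s.forall(_.isDigit) => s.toInt
  }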
Does a cross-product by doing a blockJoin. Useful when doing a large cross, if your cluster can take it. Prefer crossWithTiny when the right pipe is small.
Doing a cross product with even a moderate sized pipe can create ENORMOUS output. The use-case here is attaching a constant (e.g. a number, a dictionary, or a set) to each row in another pipe. A common use-case comes from a groupAll and reduction to one row; then you want to send the results back out to every element in a pipe.
This uses joinWithTiny, so the tiny pipe is replicated to all Mappers. If it is large, this will blow up. Get it: be foolish here and LOSE IT ALL!
Use at your own risk.
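For illustration, a sketch of the groupAll-then-broadcast pattern mentioned above (field names are hypothetical):

  // reduce to a single row holding the global count, then attach it to every row
  val total = pipe.groupAll { _.size('total) }
  val withTotal = pipe.crossWithTiny(total)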
Print the tuples that pass, with the options configured in the given debugger. For instance:
debug(PipeDebug().toStdOut.printTuplesEvery(100))
Print all the tuples that pass to stderr
Discard the given fields, and keep the rest. This is roughly the opposite of the project method.
Returns the set of distinct tuples containing the specified fields
Convenience method for integrating with existing cascading Functions
Same as above, but only keep the results field.
Keep only items that satisfy this predicate.
Keep only items that don't satisfy this predicate.
filterNot is equal to negating a filter operation.

filterNot('name) { name: String => name contains "a" }

is the same as:

filter('name) { name: String => !(name contains "a") }
The same as:

flatMap(fs) { it : TraversableOnce[T] => it }
Common enough to be useful.
The same as:

flatMapTo(fs) { it : TraversableOnce[T] => it }
Common enough to be useful.
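For illustration, a sketch of flatten (field names are hypothetical):

  // 'words holds a TraversableOnce[String]; emit one row per word
  val oneWordPerRow = pipe.flatten[String]('words -> 'word)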
Force a materialization to disk in the flow. This is useful before crossWithTiny if you filter just before. Ideally scalding/cascading would see this (and may in future versions), but for now it is here to aid in hand-tuning jobs.
This kills parallelism. All the work is sent to one reducer.
Only use this in the case that you truly need all the data on one reducer.
Just about the only reasonable case of this method is to reduce all values of a column or count all the rows.
Group all tuples down to one reducer (due to a cascading limitation). This is probably only useful just before setting a tail such as a database tail, so that only one reducer talks to the DB. Kind of a hack.
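For illustration, a sketch of the row-count case (the 'count field name is hypothetical):

  // count all rows; all data flows through a single reducer
  val counted = pipe.groupAll { _.size('count) }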
Like groupAndShuffleRandomly(reducers : Int) but with a fixed seed.
Like shard, except do some operation in the reducers.
Group the pipe based on fields.

builder is typically a block that modifies the given GroupBuilder. The final output of the block is used to schedule the new pipe. Each method in GroupBuilder returns this, so it is recommended to chain them and use the default input:
_.size.max('f1) etc...
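For illustration, a slightly fuller sketch (field names are hypothetical):

  // per 'userId: number of rows and the maximum 'score
  val stats = pipe.groupBy('userId) { _.size('count).max('score) }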
Like groupRandomly(n : Int), but with a given seed for the randomization.
Like groupAll, but randomly groups data into n reducers. You can provide a seed for the random number generator to get reproducible results.
Adds a field with a constant value.
insert('a, 1)
Same as reversing the order on joinWithSmaller.
Joins the first set of keys in the first pipe to the second set of keys in the second pipe. All keys must be unique UNLESS it is an inner join, then duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).
Smaller here means that the number of values per key is smaller in the right pipe than in the left.
Avoid going crazy adding more explicit join modes. Instead, for some other join mode with a larger pipe, do:

.then { pipe => other.joinWithSmaller(('other1, 'other2) -> ('this1, 'this2), pipe, new FancyJoin) }
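For illustration, a basic inner-join sketch (pipe and field names are hypothetical):

  // clicks is the larger pipe, so call joinWithSmaller on it
  val joined = clicks.joinWithSmaller('userId -> 'id, users)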
This does an asymmetric join, using cascading's HashJoin. This only runs through this pipe once, and keeps the right-hand side pipe in memory (but it is spillable).
Choose this when Left > max(mappers,reducers) * Right, or when the left side is three orders of magnitude larger.
Joins the first set of keys in the first pipe to the second set of keys in the second pipe. Duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).
This does not work with outer joins, or right joins, only inner and left join versions are given.
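For illustration, a sketch (pipe and field names are hypothetical):

  // countryNames is small enough to replicate to every mapper
  val named = events.joinWithTiny('countryCode -> 'code, countryNames)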
This is joinWithLarger with the joiner parameter fixed to LeftJoin. If the item is absent on the right, put null for the keys and values.
This is joinWithSmaller with the joiner parameter fixed to LeftJoin. If the item is absent on the right, put null for the keys and values.
Keep at most n elements. This is implemented by keeping approximately n/k elements on each of the k mappers or reducers (whichever we wind up being scheduled on).
If you use a map function that does not accept TupleEntry args, which is the common case, an implicit conversion in GeneratedConversions will convert your function into a (TupleEntry => T). The result type T is converted to a cascading Tuple by an implicit TupleSetter[T]. Acceptable T types are primitive types, cascading Tuples of those types, or scala.Tuple(1-22) of those types.
After the map, the input arguments will be set to the output of the map, so following with filter or map is fine without a new using statement if you mean to operate on the output.
map('data -> 'stuff)
* if output equals input, REPLACE is used
* if output or input is a subset of the other, SWAP is used
* otherwise we append the new fields (cascading Fields.ALL is used)
mapTo('data -> 'stuff)
Only the results (stuff) are kept (cascading Fields.RESULTS)
Using mapTo is the same as using map followed by a project for selecting just the output fields
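For illustration, a sketch contrasting the two (field names are hypothetical):

  // map: 'speed is kept and 'speedSq is appended
  val withSq = pipe.map('speed -> 'speedSq) { s: Double => s * s }
  // mapTo: only 'speedSq survives
  val onlySq = pipe.mapTo('speed -> 'speedSq) { s: Double => s * s }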
Rename the current pipe
Divides each value of this field by the sum of that field over the entire pipe; assumes without checking that division is supported on this type and that the sum is not zero.
If those assumptions do not hold, it will throw an exception -- consider checking the sum separately and/or using addTrap.
In some cases crossWithTiny has been broken; the implementation supports a work-around.
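For illustration, a sketch (the numeric 'weight field is hypothetical):

  // replace each weight with weight / sum(weight) over the whole pipe
  val normalized = pipe.normalize('weight)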
Maps the input fields into an output field of type T. For example:
pipe.pack[(Int, Int)] (('field1, 'field2) -> 'field3)
will pack fields 'field1 and 'field2 to field 'field3, as long as 'field1 and 'field2 can be cast into integers. The output field 'field3 will be of tuple type (Int, Int).
Same as pack but only the to fields are preserved.
Given a function, partitions the pipe into several groups based on the output of the function. Then applies a GroupBuilder function on each of the groups.
Example:

  pipe
    .mapTo(() -> ('age, 'weight)) { ... }
    .partition('age -> 'isAdult) { _ > 18 } { _.average('weight) }

pipe now contains the average weights of adults and minors.
Keep only the given fields, and discard the rest. Takes any number of parameters as long as we can convert them to a Fields object.
Rename some set of N fields as another set of N fields.
rename('x -> 'z)
rename(('x, 'y) -> ('X, 'Y))
Note that

rename('x, 'y)

is interpreted by Scala as rename(Tuple2('x, 'y)), which then does rename('x -> 'y). This is probably not what is intended, but the compiler doesn't resolve the ambiguity. YOU MUST CALL THIS WITH A TUPLE2! If you don't, expect the unexpected.
Sample a fraction of the elements. percent should be between 0.00 (0%) and 1.00 (100%); you can provide a seed to get reproducible results.
Sample a fraction of the elements, with replacement. percent should be between 0.00 (0%) and 1.00 (100%); you can provide a seed to get reproducible results.
Force a random shuffle of all the data to exactly n reducers, with a given seed if you need repeatability.
Force a random shuffle of all the data to exactly n reducers
Put all rows in random order. You can provide a seed for the random number generator to get reproducible results.
Performs a skewed join, which is useful when the data has extreme skew.
For example, imagine joining a pipe of Twitter's follow graph against a pipe of user genders, in order to find the gender distribution of the accounts every Twitter user follows. Since celebrities (e.g., Justin Bieber and Lady Gaga) have a much larger follower base than other users, and (under a standard join algorithm) all their followers get sent to the same reducer, the job will likely be stuck on a few reducers for a long time. A skewed join attempts to alleviate this problem.
This works as follows:
1. First, we sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
2. We use these estimated counts to replicate the join keys, according to the given replication strategy.
3. Finally, we join the replicated pipes together.
This controls how often we sample from the left and right pipes when estimating key counts.
Algorithm for determining how much to replicate a join key in the left and right pipes. Note: since we do not set the replication counts, only inner joins are allowed. (Otherwise, replicated rows would stay replicated when there is no counterpart in the other pipe.)
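For illustration, a sketch of the follow-graph example above (pipe and field names are hypothetical; the sample rate is just a plausible value):

  // inner join of the skewed follow graph against user genders
  val joined = follows.skewJoinWithSmaller('followedId -> 'userId, genders,
    sampleRate = 0.001)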
Insert a function into the pipeline:
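For illustration, a sketch, assuming the thenDo spelling of this method (pipe and field names are hypothetical):

  // package a reusable transformation without breaking the call chain
  def withFullName(p: Pipe): Pipe =
    p.map(('first, 'last) -> 'full) { x: (String, String) => x._1 + " " + x._2 }
  val out = pipe.thenDo(withFullName _)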
Returns the set of unique tuples containing the specified fields. Same as distinct.
The opposite of pack. Unpacks the input field of type T into the output fields. For example:
pipe.unpack[(Int, Int)] ('field1 -> ('field2, 'field3))
will unpack 'field1 into 'field2 and 'field3.
Same as unpack but only the to fields are preserved.
This is an analog of the SQL/Excel unpivot function, which converts columns of data into rows of data. Only the columns given as input fields are expanded in this way. For this operation to be reversible, you need to keep some unique key on each row. See GroupBuilder.pivot to reverse this operation, assuming you leave behind a grouping key.
pipe.unpivot(('w,'x,'y,'z) -> ('feature, 'value))
takes rows like:
key, w, x, y, z
1, 2, 3, 4, 5
2, 8, 7, 6, 5
to:
key, feature, value
1, w, 2
1, x, 3
1, y, 4
etc...
Set of pipes reachable from this pipe (transitive closure of 'Pipe.getPrevious')
Beginning of a block with access to expensive nonserializable state. The state object should contain a function release() for resource-management purposes.
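For illustration, a sketch (Scorer is a hypothetical expensive, non-serializable resource):

  class Scorer { def score(t: String): Double = t.length.toDouble; def release(): Unit = () }
  // one Scorer per task; release() is called when the task finishes
  val scored = pipe.using(new Scorer)
    .map('text -> 'score) { (s: Scorer, t: String) => s.score(t) }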
Text files can have corrupted data. If you use this function and a cascading trap you can filter out corrupted data from your pipe.
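For illustration, a sketch assuming this refers to RichPipe's verifyTypes (the trap path and field names are hypothetical):

  // rows whose ('id, 'name) cannot be read as (Int, String) go to the trap
  val clean = pipe
    .addTrap(Tsv("data/corrupt"))
    .verifyTypes[(Int, String)](('id, 'name))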
Write all the tuples to the given source and return this Pipe
This is an enrichment-pattern class for cascading.pipe.Pipe. The rule is to never use this class directly in input or return types, but only to add methods to Pipe.