Type T is the type of the input field (input to map: T => X). Type X is the intermediate type, which your reduce function operates on (reduce: (X, X) => X). Type U is the final result type (final map: X => U).
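As a sketch of this contract, a single-machine analogue of the three stages can be written in plain Scala (the helper below is illustrative, not the actual Scalding signature):

```scala
// Illustrative single-machine analogue of the T => X, (X, X) => X, X => U
// staging described above; not the real Scalding method signature.
def mapReduceMap[T, X, U](values: Seq[T])(mapFn: T => X)(reduceFn: (X, X) => X)(map2Fn: X => U): U =
  map2Fn(values.map(mapFn).reduce(reduceFn))

// Example: average of Ints with T = Int, X = (sum, count), U = Double
val avg = mapReduceMap(Seq(1, 2, 3, 4))(i => (i, 1)) {
  (a, b) => (a._1 + b._1, a._2 + b._2)
} { case (sum, cnt) => sum.toDouble / cnt }
// avg == 2.5
```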
The previous output goes into the reduce function on the left, like foldLeft, so be aware if your operation is faster with the accumulator on a particular side.
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
Pretty much a synonym for mapReduceMap with the methods collected into a trait.
Approximate number of unique values. We use about m = (104/errPercent)^2 bytes of memory per key. Uses .toString.getBytes to serialize the data, so you MUST ensure that .toString is an equivalence on your counted fields (i.e. x.toString == y.toString if and only if x == y).
For each key:
10% error ~ 256 bytes
5% error ~ 1kB
2% error ~ 4kB
1% error ~ 16kB
0.5% error ~ 64kB
0.25% error ~ 256kB
Uses a more stable online algorithm which should be suitable for large numbers of records.
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
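The merge step from that article can be sketched in plain Scala as follows (the Moments name and field layout are illustrative assumptions, not Scalding's internal representation):

```scala
// Sketch of the parallel/online variance merge from the linked article:
// each partition carries (count, mean, M2), and two partitions combine
// with the standard pairwise update.
case class Moments(count: Long, mean: Double, m2: Double) {
  def variance: Double = m2 / count  // population variance
}

def merge(a: Moments, b: Moments): Moments = {
  val n = a.count + b.count
  val delta = b.mean - a.mean
  val mean = a.mean + delta * b.count / n
  val m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
  Moments(n, mean, m2)
}

def moments(xs: Seq[Double]): Moments =
  xs.map(x => Moments(1L, x, 0.0)).reduce(merge _)

val all = moments(Seq(1.0, 2.0, 3.0, 4.0))
// all.mean == 2.5, all.variance == 1.25
```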
This is count with a predicate: only counts the tuples for which fn(tuple) is true.
First do "times" on each pair, then "plus" them all together.
groupBy('x) { _.dot('y,'z, 'ydotz) }
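In plain Scala, the per-group computation amounts to the following (the dot helper below is illustrative, not the Scalding method itself):

```scala
// "times" each (y, z) pair, then "plus" the products together.
def dot(pairs: Seq[(Double, Double)]): Double =
  pairs.map { case (y, z) => y * z }.sum

val ydotz = dot(Seq((1.0, 2.0), (3.0, 4.0)))
// 1*2 + 3*4 == 14.0
```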
Return the first value; probably only useful in the sorted case.
Collect all the values into a List[T] and then operate on that list. This fundamentally uses as much memory as it takes to store the list. The list is in the reverse of the order it was encountered (it is built as a stack for efficiency reasons); if you care about order, call .reverse in your fn.
STRONGLY PREFER TO AVOID THIS. Try reduce or plus and an O(1) memory algorithm.
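The reversed ordering comes from the usual stack-building pattern, which a plain-Scala sketch makes concrete:

```scala
// Building a list by prepending (a stack) is O(1) per element,
// but yields elements in reverse encounter order.
val encountered = Seq(1, 2, 3)
val asStack = encountered.foldLeft(List.empty[Int])((acc, x) => x :: acc)
// asStack == List(3, 2, 1); call .reverse if order matters
```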
These will only be called if a tuple is not passed, meaning just one column.
Similar to scala.collection.Iterable.mkString. Takes the source and destination fieldnames, each of which should be a single field. The result will be start, then each item.toString separated by sep, followed by end. For convenience, there are several common variants below.
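The result string therefore matches Scala's own mkString, e.g.:

```scala
// start, then each item.toString separated by sep, then end
val joined = Seq(1, 2, 3).map(_.toString).mkString("[", ",", "]")
// joined == "[1,2,3]"
```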
Opposite of RichPipe.unpivot. See SQL/Excel for more on this function; it converts a row-wise representation into a column-wise one.
pivot(('feature, 'value) -> ('clicks, 'impressions, 'requests))
It will find the feature named "clicks", and put the value in the column with the field named 'clicks.
Absent fields result in null unless a default value is provided. Unnamed output fields are ignored.
Duplicated fields will result in an error.
If you want more precision, first do a
map('value -> 'value) { x : AnyRef => Option(x) }
and you will have non-nulls for all present values, and Nones for values that were present but previously null. All nulls in the final output will be those truly missing. Similarly, if you want to check if there are any items present that shouldn't be:
map('feature -> 'feature) { fname : String => if (!goodFeatures(fname)) { throw new Exception("ohnoes") } else fname }
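The Option trick works because Option(x) collapses a present-but-null value to None, while a truly missing field stays null after the pivot; a small demonstration:

```scala
// Option(x) distinguishes "present but null" from an actual value;
// after the pivot, only truly missing fields remain null.
val presentValue: AnyRef = "3.2"
val presentNull: AnyRef = null
val wrappedValue = Option(presentValue)  // Some("3.2")
val wrappedNull = Option(presentNull)    // None
```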
Apply an associative/commutative operation on the left field.
reduce(('mass, 'allids) -> ('totalMass, 'idset)) { (left: (Double, Set[Long]), right: (Double, Set[Long])) => (left._1 + right._1, left._2 ++ right._2) }
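Per key, this behaves like a pairwise reduce over (mass, id set) values; a plain-Scala sketch (the combine name is illustrative):

```scala
// Pairwise combination of (mass, id set) values within one key.
def combine(left: (Double, Set[Long]), right: (Double, Set[Long])): (Double, Set[Long]) =
  (left._1 + right._1, left._2 ++ right._2)

val (totalMass, idSet) =
  Seq((1.5, Set(1L)), (2.5, Set(2L, 3L))).reduce(combine _)
// totalMass == 4.0, idSet == Set(1L, 2L, 3L)
```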
Equivalent to a mapReduceMap with trivial (identity) map functions.
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
The previous output goes into the reduce function on the left, like foldLeft, so be aware if your operation is faster with the accumulator on a particular side.
How many values are there for this key?
Compute the count, average and standard deviation in one pass. Example: g.sizeAveStdev('x -> ('cntx, 'avex, 'stdevx))
Equivalent to sorting by a comparison function then take-ing k items. This is MUCH more efficient than doing a total sort followed by a take, since these bounded sorts are done on the mapper, so only a sort of size k is needed.
sortWithTake(('clicks, 'tweet) -> 'topClicks, 5) { (t0: (Long, Long), t1: (Long, Long)) => t0._1 < t1._1 }
topClicks will be a List[(Long,Long)]
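The O(k) behavior can be sketched with a bounded priority queue (plain Scala; boundedTake is an illustrative helper, not the Scalding implementation):

```scala
import scala.collection.mutable

// Keep only the best k items seen so far: the queue's head is the
// current worst of the k, so memory stays O(k) regardless of input size.
def boundedTake[T](items: Seq[T], k: Int)(lt: (T, T) => Boolean): List[T] = {
  val q = mutable.PriorityQueue.empty[T](Ordering.fromLessThan(lt))
  items.foreach { x =>
    q.enqueue(x)
    if (q.size > k) q.dequeue()  // drop the current worst
  }
  q.dequeueAll.toList.reverse    // best-first order
}

val top2 = boundedTake(Seq(5, 1, 4, 2, 3), 2)(_ < _)
// top2 == List(1, 2)
```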
Reverse of above when the implicit ordering makes sense.
Same as above but useful when the implicit ordering makes sense.
The same as sum(fs -> fs)
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
Use Semigroup.plus
to compute a sum. Not called sum to avoid conflicting with standard sum
Your Semigroup[T] should be associative and commutative, else this doesn't make sense.
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
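A minimal sketch of the Semigroup idea (modeled loosely on Algebird, which Scalding uses; the trait and helper names here are illustrative):

```scala
// A Semigroup supplies an associative plus; for this use it should
// be commutative as well.
trait Semigroup[T] { def plus(l: T, r: T): T }

implicit val intSemigroup: Semigroup[Int] =
  new Semigroup[Int] { def plus(l: Int, r: Int) = l + r }

// Sum a non-empty sequence using only Semigroup.plus.
def sumByPlus[T](xs: Seq[T])(implicit sg: Semigroup[T]): T =
  xs.reduce(sg.plus _)

val total = sumByPlus(Seq(1, 2, 3, 4))
// total == 10
```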
The same as times(fs -> fs)
Returns the product of all the items in this grouping
Convert a subset of fields into a list of Tuples. Need to provide the types of the tuple fields.
Implements reductions on top of a simple abstraction for the Fields API. This is for associative and commutative operations (Monoids and Semigroups in particular play a big role here).
We use the f-bounded polymorphism trick to return the type called Self in each operation.
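A minimal illustration of the trick (Ops and Builder are hypothetical names, not Scalding's):

```scala
// F-bounded polymorphism: each operation returns Self, so chained
// calls keep the concrete subtype instead of widening to the trait.
trait Ops[Self <: Ops[Self]] {
  def emit(s: String): Self
}

class Builder(val log: List[String]) extends Ops[Builder] {
  def emit(s: String): Builder = new Builder(log :+ s)
}

val b = new Builder(Nil).emit("a").emit("b")  // still a Builder
// b.log == List("a", "b")
```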