An implementation of map-side combining, which is appropriate for associative and commutative functions.
If a cacheSize is given, it is used; otherwise we query
the config for cascading.aggregateby.threshold (the standard Cascading parameter for the equivalent case);
failing that, we use a default value of 100,000.
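The fallback order above can be sketched as follows. This is a hypothetical helper (not the actual implementation); the config-lookup shape and the name `resolveCacheSize` are assumptions, but the precedence (explicit size, then the Cascading config key, then the default) matches the description:

```scala
// Sketch: resolve the cache size with the precedence described above.
// `config` stands in for whatever key-value view of the job config is available.
def resolveCacheSize(cacheSize: Option[Int], config: Map[String, String]): Int =
  cacheSize
    .orElse(config.get("cascading.aggregateby.threshold").map(_.toInt))
    .getOrElse(100000) // documented default
```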
This keeps a cache of keys up to the cache size, summing values as keys collide.
On eviction, or on completion of this Operation, the key-value pairs are put into the outputCollector.
This NEVER spills to disk and should generally never be a performance penalty. If you have
poor locality in the keys, you simply get no benefit, but little added cost.
Note this means that you may still have repeated keys in the output, even from a single mapper,
since the key space may be so large that you can't fit all of the keys in the cache at the same
time.
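The cache behavior described above (sum on collision, emit on eviction, flush on completion) can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation; the class name, the `(V, V) => V` summing function standing in for a Semigroup, and the insertion-order eviction policy are all assumptions:

```scala
import scala.collection.mutable

// Sketch of a bounded summing cache: values for colliding keys are
// combined with `plus`; when the cache exceeds `capacity`, the oldest
// entry is evicted and returned so the caller can emit it downstream.
class SummingCacheSketch[K, V](capacity: Int, plus: (V, V) => V) {
  private val cache = mutable.LinkedHashMap.empty[K, V]

  // Returns the evicted (key, value) pair, if any, which the caller
  // would forward to the output collector.
  def put(k: K, v: V): Option[(K, V)] = {
    val merged = cache.remove(k).map(plus(_, v)).getOrElse(v)
    cache.put(k, merged)
    if (cache.size > capacity) {
      val (ek, ev) = cache.head
      cache.remove(ek)
      Some((ek, ev))
    } else None
  }

  // On completion of the operation, flush whatever remains in the cache.
  def flush(): List[(K, V)] = {
    val all = cache.toList
    cache.clear()
    all
  }
}
```

Note how an evicted key can reappear later: once ("b", 2) is emitted, a subsequent "b" starts a fresh sum, which is exactly why the output may contain repeated keys.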
You can use this with the Fields-API by doing:
val msr = new MapsideReduce(Semigroup.from(fn), 'key, 'value, None)
// MUST map onto the same key,value space (may be multiple fields)
val mapSideReduced = pipe.eachTo(('key, 'value) -> ('key, 'value)) { _ => msr }
That said, this is equivalent to AggregateBy; its only advantage is that it is much simpler than AggregateBy.
AggregateBy assumes several parallel reductions are happening, and thus has many loops and array lookups
to deal with that. Since this does many fewer allocations and has a smaller code path, it may be faster for
the typed API.