N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.
Perform a skewed full outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind.
cms: count-min sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.
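As a rough illustration of how the error bound interacts with hotKeyThreshold, the sketch below computes the worst-case overestimate eps * N and the lowest true frequency whose estimate could still reach the threshold. The object HotKeyBound, its method names, and the sample numbers are hypothetical and not part of the library.

```scala
object HotKeyBound {
  // Upper bound on how far an estimate may exceed the true frequency
  // (holds with probability at least 1 - delta).
  def maxOverestimate(eps: Double, n: Long): Double = eps * n

  // Lowest true frequency whose estimate could still reach the threshold,
  // given true frequency <= estimate <= true frequency + eps * N.
  def lowestPossiblyHot(hotKeyThreshold: Long, eps: Double, n: Long): Double =
    math.max(0.0, hotKeyThreshold - maxOverestimate(eps, n))

  def main(args: Array[String]): Unit = {
    // Hypothetical numbers: 2,000,000 left hand side records, eps = 0.001.
    val slack = maxOverestimate(0.001, 2000000L)
    val lowest = lowestPossiblyHot(8500L, 0.001, 2000000L)
    println(f"slack = $slack%.0f, lowest possibly hot frequency = $lowest%.0f")
  }
}
```

Under these assumed numbers the slack is a sizeable share of the threshold, which is why the threshold and eps are best chosen together.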
import com.twitter.algebird.CMS
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

// Build a count-min sketch over the left hand side keys
val keyAggregator = CMS.aggregator[K](eps, delta, seed)
val hotKeyCMS = logs.keys.aggregate(keyAggregator)
val p = logs.skewedFullOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.
Perform a skewed full outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind. If you sample the input via sampleFraction, make sure to adjust hotKeyThreshold accordingly.
eps: one-sided error bound on each point query, i.e. frequency estimate. Must lie in (0, 1).
seed: a seed to initialize the random number generator used to create the pairwise independent hash functions.
delta: a bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).
sampleFraction: left side sample fraction. Default is 1.0 (no sampling).
withReplacement: whether to use sampling with replacement, see SCollection.sample.
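One way to read the advice about adjusting hotKeyThreshold when sampling is to scale the threshold by the same fraction. This assumes key frequencies are estimated on the sampled left hand side, so a key that appears c times in the full input shows up roughly c * sampleFraction times in the sample. The helper below is a hypothetical sketch of that arithmetic, not library code.

```scala
object SampledThreshold {
  // Scale a threshold chosen for the full input down to the sampled input.
  // Assumption: hot keys are detected on the sample, not on the full data.
  def scaled(fullDataThreshold: Long, sampleFraction: Double): Long = {
    require(sampleFraction > 0.0 && sampleFraction <= 1.0,
      "sampleFraction must lie in (0, 1]")
    // Never go below 1, otherwise every sampled key would count as hot.
    math.max(1L, math.round(fullDataThreshold * sampleFraction))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical values: a 9000-record threshold and a 10% sample.
    println(scaled(fullDataThreshold = 9000L, sampleFraction = 0.1))
  }
}
```

Small samples make the estimate noisier, so a scaled threshold is only a starting point.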
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

val p = logs.skewedFullOuterJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.join.
Perform a skewed join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind.
cms: count-min sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.
import com.twitter.algebird.CMS
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

// Build a count-min sketch over the left hand side keys
val keyAggregator = CMS.aggregator[K](eps, delta, seed)
val hotKeyCMS = logs.keys.aggregate(keyAggregator)
val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.join.
Perform a skewed join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind. If you sample the input via sampleFraction, make sure to adjust hotKeyThreshold accordingly.
eps: one-sided error bound on each point query, i.e. frequency estimate. Must lie in (0, 1).
seed: a seed to initialize the random number generator used to create the pairwise independent hash functions.
delta: a bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).
sampleFraction: left side sample fraction. Default is 1.0 (no sampling).
withReplacement: whether to use sampling with replacement, see SCollection.sample.
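To make the roles of eps, delta and seed concrete, here is a minimal count-min sketch over Int keys. It is a teaching sketch under textbook assumptions (width = ceil(e / eps), depth = ceil(ln(1 / delta)), one linear hash per row) and is not Algebird's implementation; the class TinyCms and all of its members are hypothetical names.

```scala
import scala.util.Random

// Illustrative count-min sketch over Int keys; not Algebird's implementation.
final class TinyCms(eps: Double, delta: Double, seed: Int) {
  require(eps > 0 && eps < 1, "eps must lie in (0, 1)")
  require(delta > 0 && delta < 1, "delta must lie in (0, 1)")

  // Textbook sizing: eps controls the width, delta controls the depth.
  val width: Int = math.ceil(math.E / eps).toInt
  val depth: Int = math.ceil(math.log(1.0 / delta)).toInt

  private val prime = 2147483647L // 2^31 - 1
  private val rng = new Random(seed)
  // One (a, b) pair per row defines the hash h(x) = ((a * x + b) mod prime) mod width.
  private val coeffs: Array[(Long, Long)] = Array.fill(depth) {
    (1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong)
  }
  private val table: Array[Array[Long]] = Array.ofDim[Long](depth, width)

  private def bucket(row: Int, key: Int): Int = {
    val (a, b) = coeffs(row)
    val x = (key.toLong & 0xffffffffL) % prime // map the key to a non-negative value
    (((a * x + b) % prime) % width).toInt
  }

  // Increment one counter per row.
  def add(key: Int): Unit =
    for (row <- 0 until depth) table(row)(bucket(row, key)) += 1L

  // Collisions only ever add to a counter, so the minimum over rows
  // never falls below the true frequency (the one-sided error).
  def estimate(key: Int): Long =
    (0 until depth).map(row => table(row)(bucket(row, key))).min
}

object TinyCmsDemo {
  def main(args: Array[String]): Unit = {
    val cms = new TinyCms(eps = 0.01, delta = 0.001, seed = 42)
    // One hot key (7) among otherwise distinct keys.
    (1 to 500).foreach(_ => cms.add(7))
    (1000 until 1500).foreach(k => cms.add(k))
    println(s"estimate for hot key 7: ${cms.estimate(7)} (true frequency 500)")
  }
}
```

A smaller eps widens the table and tightens the overestimate, while a smaller delta adds rows and makes a bad estimate less likely.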
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

val p = logs.skewedJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
Perform a skewed left outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind.
cms: count-min sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.
import com.twitter.algebird.CMS
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

// Build a count-min sketch over the left hand side keys
val keyAggregator = CMS.aggregator[K](eps, delta, seed)
val hotKeyCMS = logs.keys.aggregate(keyAggregator)
val p = logs.skewedLeftOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
Perform a skewed left outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind. If you sample the input via sampleFraction, make sure to adjust hotKeyThreshold accordingly.
eps: one-sided error bound on each point query, i.e. frequency estimate. Must lie in (0, 1).
seed: a seed to initialize the random number generator used to create the pairwise independent hash functions.
delta: a bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).
sampleFraction: left side sample fraction. Default is 1.0 (no sampling).
withReplacement: whether to use sampling with replacement, see SCollection.sample.
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

val p = logs.skewedLeftOuterJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
Perform a skewed left outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind.
cms: count-min sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.
(Since version 0.8.0) Use SCollection[(K, V)].skewedLeftOuterJoin(rhs, hotKeyThreshold, cms) instead.
import com.twitter.algebird.CMS
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

// Build a count-min sketch over the left hand side keys
val keyAggregator = CMS.aggregator[K](eps, delta, seed)
val hotKeyCMS = logs.keys.aggregate(keyAggregator)
val p = logs.skewedLeftJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
Perform a skewed left outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated so that, with probability at least 1 - delta, the estimate is within eps * N of the true frequency: true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.
hotKeyThreshold: keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind. If you sample the input via sampleFraction, make sure to adjust hotKeyThreshold accordingly.
eps: one-sided error bound on each point query, i.e. frequency estimate. Must lie in (0, 1).
seed: a seed to initialize the random number generator used to create the pairwise independent hash functions.
delta: a bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).
sampleFraction: left side sample fraction. Default is 1.0 (no sampling).
withReplacement: whether to use sampling with replacement, see SCollection.sample.
(Since version 0.8.0) Use SCollection[(K, V)].skewedLeftOuterJoin(rhs) instead.
// Implicits that enable CMS hashing
import com.twitter.algebird.CMSHasherImplicits._

val p = logs.skewedLeftJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
Extra functions available on SCollections of (key, value) pairs for skewed joins through an implicit conversion.