Class ApproximateUnique
java.lang.Object
    org.apache.beam.sdk.transforms.ApproximateUnique
@Deprecated public class ApproximateUnique extends java.lang.Object
Deprecated. Consider using ApproximateCountDistinct in the zetasketch extension module, which makes use of the HllCount implementation.

If ApproximateCountDistinct does not meet your needs, you can use HllCount directly. Direct usage also lets you save intermediate aggregation results into a sketch for later processing.

For example, to estimate the number of distinct elements in a PCollection<String>:

    PCollection<String> input = ...;
    PCollection<Long> countDistinct =
        input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());

For more details about using HllCount and the zetasketch extension module, see https://s.apache.org/hll-in-beam#bookmark=id.v6chsij1ixo7.

PTransforms for estimating the number of distinct elements in a PCollection, or the number of distinct values associated with each key in a PCollection of KVs.
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | ApproximateUnique.ApproximateUniqueCombineFn<T> | Deprecated. CombineFn that computes an estimate of the number of distinct values that were combined.
static class | ApproximateUnique.Globally<T> | Deprecated. PTransform for estimating the number of distinct elements in a PCollection.
static class | ApproximateUnique.PerKey<K,V> | Deprecated. PTransform for estimating the number of distinct values associated with each key in a PCollection of KVs.
-
Constructor Summary
Constructors
Constructor | Description
ApproximateUnique() | Deprecated.
-
Method Summary
Modifier and Type | Method | Description
static <T> ApproximateUnique.Globally<T> | globally(double maximumEstimationError) | Deprecated. Like globally(int), but specifies the desired maximum estimation error instead of the sample size.
static <T> ApproximateUnique.Globally<T> | globally(int sampleSize) | Deprecated. Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection.
static <K,V> ApproximateUnique.PerKey<K,V> | perKey(double maximumEstimationError) | Deprecated. Like perKey(int), but specifies the desired maximum estimation error instead of the sample size.
static <K,V> ApproximateUnique.PerKey<K,V> | perKey(int sampleSize) | Deprecated. Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection.
-
Method Detail
-
globally
public static <T> ApproximateUnique.Globally<T> globally(int sampleSize)
Deprecated. Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection.

The sampleSize parameter controls the estimation error. The error is about 2 / sqrt(sampleSize), so for ApproximateUnique.globally(10000) the estimation error is about 2%. Similarly, for ApproximateUnique.globally(16) the estimation error is about 50%. If there are fewer than sampleSize distinct elements then the returned result will be exact with extremely high probability (the chance of a hash collision is about sampleSize^2 / 2^65).

This transform approximates the number of elements in a set by computing the top sampleSize hash values, and using that to extrapolate the size of the entire set of hash values by assuming the rest of the hash values are as densely distributed as the top sampleSize.

See also globally(double).

Example of use:

    PCollection<String> pc = ...;
    PCollection<Long> approxNumDistinct =
        pc.apply(ApproximateUnique.<String>globally(1000));
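The relationship between sampleSize and the estimation error can be checked with a few lines of plain Java. This is only a sketch of the arithmetic quoted in the description: the class and method names (SampleSizeMath, approximateError, collisionProbability) are hypothetical and are not part of the Beam API.

```java
/** Sketch of the arithmetic from the description; all names here are hypothetical. */
final class SampleSizeMath {
  private SampleSizeMath() {}

  /** Approximate relative estimation error for a sample size: 2 / sqrt(sampleSize). */
  static double approximateError(int sampleSize) {
    return 2.0 / Math.sqrt(sampleSize);
  }

  /** Approximate chance of a hash collision: sampleSize^2 / 2^65. */
  static double collisionProbability(int sampleSize) {
    double n = sampleSize; // widen first so that n * n cannot overflow an int
    return (n * n) / Math.pow(2.0, 65);
  }

  public static void main(String[] args) {
    System.out.println(approximateError(10000)); // about 2%
    System.out.println(approximateError(16));    // about 50%
    System.out.println(collisionProbability(10000));
  }
}
```

Quadrupling sampleSize halves the approximate error, so accuracy gets more expensive as the target error shrinks.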
- Type Parameters:
  T - the type of the elements in the input PCollection
- Parameters:
  sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
- Throws:
  java.lang.IllegalArgumentException - if the sampleSize argument is too small
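The "top sampleSize hash values" idea from the description can be illustrated with a small standalone estimator. This is a simplified sketch, not Beam's implementation: the class TopHashEstimator, the mix64 hash mixer, and the plain density extrapolation are all assumptions made for illustration, and the real transform may use a different hash function and a more refined formula.

```java
import java.util.TreeSet;

/** Simplified sketch of a top-k hash estimator; hypothetical, not Beam's code. */
final class TopHashEstimator {
  private static final double HASH_SPACE_SIZE = Math.pow(2.0, 64);

  private final int sampleSize;
  private final TreeSet<Long> topHashes = new TreeSet<>();

  TopHashEstimator(int sampleSize) {
    if (sampleSize < 16) {
      throw new IllegalArgumentException("sampleSize must be >= 16");
    }
    this.sampleSize = sampleSize;
  }

  /** Assumed 64-bit mixer (SplitMix64 finalizer); any well-distributed hash would do. */
  static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
  }

  /** Records an element, keeping only the sampleSize largest distinct hash values. */
  void add(Object element) {
    long hash = mix64((long) element.hashCode() * 0x9E3779B97F4A7C15L);
    topHashes.add(hash); // a repeated element hashes to the same value and is ignored
    if (topHashes.size() > sampleSize) {
      topHashes.pollFirst(); // evict the smallest retained hash
    }
  }

  /** Exact below sampleSize distinct hashes; otherwise extrapolates from hash density. */
  long estimate() {
    if (topHashes.size() < sampleSize) {
      return topHashes.size();
    }
    // The retained hashes occupy the range from the smallest retained hash up to
    // Long.MAX_VALUE. Assume the rest of the hash space is equally dense.
    double sampledRange = (double) Long.MAX_VALUE - (double) topHashes.first();
    return Math.round(sampleSize * (HASH_SPACE_SIZE / sampledRange));
  }
}
```

Memory stays bounded by sampleSize regardless of input size, which is the main reason to prefer this kind of estimator over collecting every distinct element.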
-
globally
public static <T> ApproximateUnique.Globally<T> globally(double maximumEstimationError)
Deprecated. Like globally(int), but specifies the desired maximum estimation error instead of the sample size.
- Type Parameters:
  T - the type of the elements in the input PCollection
- Parameters:
  maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
- Throws:
  java.lang.IllegalArgumentException - if the maximumEstimationError argument is out of range
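One way to relate maximumEstimationError to a sample size is to invert the 2 / sqrt(sampleSize) formula given for globally(int), which yields sampleSize = 4 / error^2. The sketch below assumes that inversion and rounds up; the helper name sampleSizeFor is hypothetical, and the conversion Beam actually performs may differ in rounding details.

```java
/** Sketch: derive a sample size from a maximum estimation error; names are hypothetical. */
final class ErrorToSampleSize {
  private ErrorToSampleSize() {}

  /** Inverts error = 2 / sqrt(sampleSize), rounding up so the error bound is met. */
  static long sampleSizeFor(double maximumEstimationError) {
    // Written as a negated range check so that NaN is rejected too.
    if (!(maximumEstimationError >= 0.01 && maximumEstimationError <= 0.5)) {
      throw new IllegalArgumentException(
          "maximumEstimationError must be in [0.01, 0.5], got " + maximumEstimationError);
    }
    return (long) Math.ceil(4.0 / (maximumEstimationError * maximumEstimationError));
  }

  public static void main(String[] args) {
    System.out.println(sampleSizeFor(0.5));  // the largest allowed error
    System.out.println(sampleSizeFor(0.02)); // roughly the globally(10000) case
  }
}
```

Under this assumption the upper bound of 0.5 corresponds to the minimum sample size of 16, which matches the ">= 16" requirement on sampleSize.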
-
perKey
public static <K,V> ApproximateUnique.PerKey<K,V> perKey(int sampleSize)
Deprecated. Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection.

See globally(int) for an explanation of the sampleSize parameter. A separate sampling is computed for each distinct key of the input.

See also perKey(double).

Example of use:

    PCollection<KV<Integer, String>> pc = ...;
    PCollection<KV<Integer, Long>> approxNumDistinctPerKey =
        pc.apply(ApproximateUnique.<Integer, String>perKey(1000));
- Type Parameters:
  K - the type of the keys in the input and output PCollections
  V - the type of the values in the input PCollection
- Parameters:
  sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
- Throws:
  java.lang.IllegalArgumentException - if the sampleSize argument is too small
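To make the shape of the per-key result concrete, the sketch below computes the exact number of distinct values per key over an in-memory list. It is a plain-Java baseline for the quantity that perKey approximates, not a use of the Beam API: the class ExactDistinctPerKey and its method are hypothetical, and Map.Entry stands in for Beam's KV.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Exact per-key distinct count over a list; a hypothetical baseline, not Beam code. */
final class ExactDistinctPerKey {
  private ExactDistinctPerKey() {}

  /** Maps each distinct key to the number of distinct values seen with that key. */
  static <K, V> Map<K, Long> countDistinctPerKey(List<Map.Entry<K, V>> pairs) {
    // Group values by key; each Set drops repeated values for its key.
    Map<K, Set<V>> valuesByKey = new HashMap<>();
    for (Map.Entry<K, V> pair : pairs) {
      valuesByKey.computeIfAbsent(pair.getKey(), k -> new HashSet<>()).add(pair.getValue());
    }
    // One output entry per distinct key, mirroring the KV<K, Long> output shape.
    Map<K, Long> counts = new HashMap<>();
    for (Map.Entry<K, Set<V>> entry : valuesByKey.entrySet()) {
      counts.put(entry.getKey(), (long) entry.getValue().size());
    }
    return counts;
  }
}
```

Unlike the approximate transform, this baseline holds every distinct value in memory, which is the cost that a bounded per-key sample avoids.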
-
perKey
public static <K,V> ApproximateUnique.PerKey<K,V> perKey(double maximumEstimationError)
Deprecated. Like perKey(int), but specifies the desired maximum estimation error instead of the sample size.
- Type Parameters:
  K - the type of the keys in the input and output PCollections
  V - the type of the values in the input PCollection
- Parameters:
  maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
- Throws:
  java.lang.IllegalArgumentException - if the maximumEstimationError argument is out of range
-