Class Sample


  • public class Sample
    extends java.lang.Object
    PTransforms for taking samples of the elements in a PCollection, or samples of the values associated with each key in a PCollection of KVs.

    fixedSizeGlobally(int) and fixedSizePerKey(int) compute uniformly random samples. any(long) is faster, but provides no uniformity guarantees.

    combineFn(int) can also be used manually, in combination with state and with the Combine transform.

    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  Sample.FixedSizedSampleFn<T>
      CombineFn that computes a fixed-size sample of a collection of values.
    • Constructor Summary

      Constructors 
      Constructor Description
      Sample()  
    • Method Summary

      Modifier and Type Method Description
      static <T> PTransform<PCollection<T>,​PCollection<T>> any​(long limit)
      Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.
      static <T> Combine.CombineFn<T,​?,​java.lang.Iterable<T>> anyCombineFn​(int sampleSize)
      Returns a Combine.CombineFn that computes a fixed-size, potentially non-uniform sample of its inputs.
      static <T> Combine.CombineFn<T,​?,​T> anyValueCombineFn()
      Returns a Combine.CombineFn that computes a single, potentially non-uniform sample value of its inputs.
      static <T> Combine.CombineFn<T,​?,​java.lang.Iterable<T>> combineFn​(int sampleSize)
      Returns a Combine.CombineFn that computes a fixed-size uniform sample of its inputs.
      static <T> PTransform<PCollection<T>,​PCollection<java.lang.Iterable<T>>> fixedSizeGlobally​(int sampleSize)
      Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements.
      static <K,​V>
      PTransform<PCollection<KV<K,​V>>,​PCollection<KV<K,​java.lang.Iterable<V>>>>
      fixedSizePerKey​(int sampleSize)
      Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • Sample

        public Sample()
    • Method Detail

      • combineFn

        public static <T> Combine.CombineFn<T,​?,​java.lang.Iterable<T>> combineFn​(int sampleSize)
        Returns a Combine.CombineFn that computes a fixed-size uniform sample of its inputs.
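        A combiner has to build its result from partial accumulators that are merged later, so a uniform sample needs a mergeable representation. The following is a minimal plain-Java sketch of one common technique for this (tag every element with a random number and keep the sampleSize smallest tags). It is not taken from Beam's source: the class MergeableSampleSketch and its nested Tagged type are hypothetical, and the method names only loosely mirror the addInput, mergeAccumulators, and extractOutput steps of a CombineFn.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration class; not part of Beam.
class MergeableSampleSketch<T> {
  // An element paired with the random tag that decides whether it is kept.
  private static final class Tagged<E> {
    final double tag;
    final E value;

    Tagged(double tag, E value) {
      this.tag = tag;
      this.value = value;
    }
  }

  private final int sampleSize;
  // Max-heap on the tag, so the element with the largest tag is evicted first.
  private final PriorityQueue<Tagged<T>> heap =
      new PriorityQueue<>(Comparator.comparingDouble((Tagged<T> t) -> t.tag).reversed());

  MergeableSampleSketch(int sampleSize) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be >= 0");
    }
    this.sampleSize = sampleSize;
  }

  // Tags the new element and keeps it only if its tag is among the smallest.
  void addInput(T value, Random rng) {
    offer(new Tagged<>(rng.nextDouble(), value));
  }

  // Folds another partial sample into this one, reusing the existing tags.
  void merge(MergeableSampleSketch<T> other) {
    for (Tagged<T> t : other.heap) {
      offer(t);
    }
  }

  // Returns the sampled values, in no particular order.
  List<T> extractOutput() {
    List<T> out = new ArrayList<>();
    for (Tagged<T> t : heap) {
      out.add(t.value);
    }
    return out;
  }

  private void offer(Tagged<T> t) {
    heap.add(t);
    if (heap.size() > sampleSize) {
      heap.poll(); // drop the largest tag
    }
  }

  public static void main(String[] args) {
    Random rng = new Random();
    MergeableSampleSketch<String> left = new MergeableSampleSketch<>(2);
    MergeableSampleSketch<String> right = new MergeableSampleSketch<>(2);
    left.addInput("a", rng);
    left.addInput("b", rng);
    left.addInput("c", rng);
    right.addInput("d", rng);
    right.addInput("e", rng);
    left.merge(right);
    System.out.println(left.extractOutput()); // two of the five inputs
  }
}
```

        Because every partial accumulator retains its own sampleSize smallest tags, merging them still yields the sampleSize smallest tags overall, which is what keeps the merged sample uniform in this sketch.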
      • anyCombineFn

        public static <T> Combine.CombineFn<T,​?,​java.lang.Iterable<T>> anyCombineFn​(int sampleSize)
        Returns a Combine.CombineFn that computes a fixed-size, potentially non-uniform sample of its inputs.
      • anyValueCombineFn

        public static <T> Combine.CombineFn<T,​?,​T> anyValueCombineFn()
        Returns a Combine.CombineFn that computes a single, potentially non-uniform sample value of its inputs.
      • any

        public static <T> PTransform<PCollection<T>,​PCollection<T>> any​(long limit)
        Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.

        If limit is greater than or equal to the size of the input PCollection, then all the input's elements will be selected.

        Example of use:

        
         PCollection<String> input = ...;
         PCollection<String> output = input.apply(Sample.<String>any(100));
         
        Type Parameters:
        T - the type of the elements of the input and output PCollections
        Parameters:
        limit - the number of elements to take from the input
      • fixedSizeGlobally

        public static <T> PTransform<PCollection<T>,​PCollection<java.lang.Iterable<T>>> fixedSizeGlobally​(int sampleSize)
        Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements. If the input PCollection has fewer than sampleSize elements, then the output Iterable<T> will be all the input's elements.

        All of the elements of the output PCollection should fit into the main memory of a single worker machine. This operation does not run in parallel.

        Example of use:

        
         PCollection<String> pc = ...;
         PCollection<Iterable<String>> sampleOfSize10 =
             pc.apply(Sample.fixedSizeGlobally(10));
         
        Type Parameters:
        T - the type of the elements
        Parameters:
        sampleSize - the number of elements to select; must be >= 0
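        To make the contract above concrete (sampleSize elements chosen uniformly at random, or every element when the input is smaller), here is a minimal plain-Java sketch using classic reservoir sampling over an in-memory Iterable. It only illustrates the semantics and is not presented as Beam's implementation. The class ReservoirSketch and its sample method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of Beam.
class ReservoirSketch {
  // Returns up to sampleSize elements of input, chosen uniformly at random.
  static <T> List<T> sample(Iterable<T> input, int sampleSize, Random rng) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be >= 0");
    }
    List<T> reservoir = new ArrayList<>();
    long seen = 0;
    for (T element : input) {
      seen++;
      if (reservoir.size() < sampleSize) {
        // The first sampleSize elements always enter the reservoir.
        reservoir.add(element);
      } else if (sampleSize > 0) {
        // Later elements replace a random slot with probability sampleSize / seen.
        long j = (long) (rng.nextDouble() * seen);
        if (j < sampleSize) {
          reservoir.set((int) j, element);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("a", "b", "c", "d", "e");
    System.out.println(sample(words, 3, new Random())); // three of the five words
    System.out.println(sample(words, 10, new Random())); // [a, b, c, d, e]
  }
}
```

        The whole reservoir lives in one list on one machine, which mirrors the note that the sampled elements need to fit in a single worker's memory.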
      • fixedSizePerKey

        public static <K,​V> PTransform<PCollection<KV<K,​V>>,​PCollection<KV<K,​java.lang.Iterable<V>>>> fixedSizePerKey​(int sampleSize)
        Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random. If a key in the input PCollection has fewer than sampleSize values associated with it, then the output Iterable<V> associated with that key will be all the values associated with that key in the input PCollection.

        Example of use:

        
         PCollection<KV<String, Integer>> pc = ...;
         PCollection<KV<String, Iterable<Integer>>> sampleOfSize10PerKey =
             pc.apply(Sample.<String, Integer>fixedSizePerKey(10));
         
        Type Parameters:
        K - the type of the keys
        V - the type of the values
        Parameters:
        sampleSize - the number of values to select for each distinct key; must be >= 0
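        The per-key behaviour described above can be illustrated without a pipeline. The sketch below is plain Java, with Map.Entry standing in for KV and one reservoir kept per distinct key. It is an assumption-laden illustration of the semantics rather than Beam's implementation, and PerKeySampleSketch and samplePerKey are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration class; not part of Beam.
class PerKeySampleSketch {
  // Maps each distinct key to a uniform sample of up to sampleSize of its values.
  static <K, V> Map<K, List<V>> samplePerKey(
      Iterable<? extends Map.Entry<K, V>> input, int sampleSize, Random rng) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be >= 0");
    }
    Map<K, List<V>> samples = new LinkedHashMap<>();
    Map<K, Long> seenPerKey = new LinkedHashMap<>();
    for (Map.Entry<K, V> kv : input) {
      List<V> reservoir = samples.computeIfAbsent(kv.getKey(), k -> new ArrayList<>());
      // Count how many values have been seen for this key so far.
      long seen = seenPerKey.merge(kv.getKey(), 1L, Long::sum);
      if (reservoir.size() < sampleSize) {
        reservoir.add(kv.getValue());
      } else if (sampleSize > 0) {
        // Replace a random slot with probability sampleSize / seen.
        long j = (long) (rng.nextDouble() * seen);
        if (j < sampleSize) {
          reservoir.set((int) j, kv.getValue());
        }
      }
    }
    return samples;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> input = new ArrayList<>();
    for (int i = 0; i < 6; i++) {
      input.add(new SimpleEntry<>("big", i));
    }
    input.add(new SimpleEntry<>("small", 42));
    // "big" maps to two of its six values; "small" maps to its only value.
    System.out.println(samplePerKey(input, 2, new Random()));
  }
}
```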