Class Histogram


  • public class Histogram
    extends Object
    This class encapsulates a Histogram and provides Histogram data. If the data fits in the Cardinality Map, the class simply uses that map to generate the histogram values. Once the cardinality exceeds maxCardinality, the data is tracked using an algorithm based on Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849-872. All data is stored in the Cardinality Map until it is exhausted; from that point on, all values not captured in the Cardinality Map are added (via accept) to the underlying Histogram Sketch. When a Histogram is requested, it is generated either directly from the Cardinality Map or, if maxCardinality has been exceeded, from the Sketch after all entries captured in the Cardinality Map have been added to it.
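The map-then-sketch flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name HybridTracker, its accept(String) signature, and the use of an exact TreeMap as a stand-in for the streaming sketch (the real class uses the Ben-Haim and Tom-Tov algorithm, which keeps a bounded number of bins). It is not this class's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the two-stage tracking; not the library's code.
class HybridTracker {
    private final int maxCardinality;
    // Stage 1: exact counts for at most maxCardinality distinct values.
    private final Map<String, Long> cardinality = new HashMap<>();
    // Stage 2 stand-in: an exact sorted map in place of a real streaming sketch.
    private final TreeMap<Double, Long> overflow = new TreeMap<>();
    private boolean overflowed = false;

    HybridTracker(int maxCardinality) {
        this.maxCardinality = maxCardinality;
    }

    void accept(String value) {
        Long seen = cardinality.get(value);
        if (seen != null) {
            // Value already captured in the map: keep counting it there.
            cardinality.put(value, seen + 1);
        } else if (cardinality.size() < maxCardinality) {
            // Map still has room for a new distinct value.
            cardinality.put(value, 1L);
        } else {
            // Map exhausted: new distinct values go to the overflow structure.
            overflowed = true;
            overflow.merge(Double.parseDouble(value), 1L, Long::sum);
        }
    }

    boolean hasOverflowed() {
        return overflowed;
    }

    // Combined view used to build a histogram: the overflow counts plus
    // every entry captured in the cardinality map.
    TreeMap<Double, Long> counts() {
        TreeMap<Double, Long> all = new TreeMap<>(overflow);
        for (Map.Entry<String, Long> e : cardinality.entrySet()) {
            all.merge(Double.parseDouble(e.getKey()), e.getValue(), Long::sum);
        }
        return all;
    }
}
```

The point of the design is that low-cardinality data is counted exactly, and the approximate structure is only consulted once the exact map can no longer hold every distinct value.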
    • Method Detail

      • setCardinality

        public void setCardinality​(Map<String,​Long> map)
      • setCardinalityOverflow

        public void setCardinalityOverflow​(HistogramSPDT histogramSPDT)
      • getHistogram

        public Histogram.Entry[] getHistogram​(int buckets)
        Get the histogram with the supplied number of buckets.
        Parameters:
        buckets - the number of buckets in the Histogram
        Returns:
        An array of length 'buckets' that constitutes the Histogram (or null if cardinality is zero).
      • tagClusters

        public static void tagClusters​(Histogram.Entry[] buckets)
        Given a Histogram analysis, mark each bucket as part of a cluster and then attach the count and percent for the cluster to all buckets in the cluster. For example, given the distribution 1, 1, 0, 0, 0, 0, 10, 20, 30, 30, 8, we would declare two clusters, the first having 2% and the second having 98%, so the percentages would look as follows: 2, 2, 0, 0, 0, 0, 98, 98, 98, 98, 98.
        Parameters:
        buckets - The set of Histogram buckets for this analysis
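The cluster example above can be reproduced with a small sketch over plain counts. It rests on assumptions the documentation does not confirm: a cluster is taken to be a maximal run of adjacent non-empty buckets, and percentages are rounded to the nearest whole number. The class and method names (ClusterSketch, clusterPercents) are hypothetical, and the sketch works on a long[] rather than on Histogram.Entry, whose fields are not shown here.

```java
// Hypothetical illustration of cluster tagging; not the library's tagClusters.
class ClusterSketch {
    // Returns, for each bucket, the percentage of the total count held by
    // the cluster that bucket belongs to (0 for empty buckets).
    static long[] clusterPercents(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        long[] percents = new long[counts.length];
        if (total == 0) {
            return percents; // nothing to tag
        }
        int i = 0;
        while (i < counts.length) {
            if (counts[i] == 0) {
                i++; // empty buckets separate clusters
                continue;
            }
            // Assumption: a cluster is a maximal run of non-empty buckets.
            int start = i;
            long clusterCount = 0;
            while (i < counts.length && counts[i] != 0) {
                clusterCount += counts[i];
                i++;
            }
            long percent = Math.round(100.0 * clusterCount / total);
            // Attach the cluster-wide percentage to every bucket in the run.
            for (int j = start; j < i; j++) {
                percents[j] = percent;
            }
        }
        return percents;
    }
}
```

Under these assumptions, calling clusterPercents on the counts 1, 1, 0, 0, 0, 0, 10, 20, 30, 30, 8 yields 2, 2, 0, 0, 0, 0, 98, 98, 98, 98, 98, matching the example in the description.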
      • getBucket

        public static Histogram.Entry getBucket​(Histogram.Entry[] buckets,
                                                double value)
        Given a value and a set of buckets, locate the bucket holding this value.
        Parameters:
        buckets - The set of Histogram buckets for this analysis
        value - The value we are searching for
        Returns:
        The bucket containing the supplied value.
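As a rough illustration of the lookup, the sketch below scans a set of buckets for the one whose range covers a value. The Bucket type with low and high bounds is hypothetical, since the fields of Histogram.Entry are not documented here. The boundary rules (each bucket covers low inclusive to high exclusive, the last bucket also includes its upper bound, and null is returned when no bucket matches) are likewise assumptions, not the documented behavior of getBucket.

```java
// Hypothetical bucket type; Histogram.Entry's real fields are not shown here.
final class Bucket {
    final double low;
    final double high;

    Bucket(double low, double high) {
        this.low = low;
        this.high = high;
    }
}

// Hypothetical illustration of a bucket lookup; not the library's getBucket.
class BucketLookup {
    static Bucket find(Bucket[] buckets, double value) {
        for (int i = 0; i < buckets.length; i++) {
            Bucket b = buckets[i];
            boolean last = (i == buckets.length - 1);
            // Assumption: [low, high) for every bucket, with the final
            // bucket also accepting a value equal to its upper bound.
            boolean inRange = value >= b.low
                    && (value < b.high || (last && value == b.high));
            if (inRange) {
                return b;
            }
        }
        return null; // assumption: no match is reported as null
    }
}
```

A linear scan is enough for the small bucket counts typical of a histogram; if the buckets are sorted and numerous, a binary search over the lower bounds would be the natural alternative.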