Class DefaultClusteredGroupPartitioner

    • Constructor Detail

      • DefaultClusteredGroupPartitioner

        public DefaultClusteredGroupPartitioner(RowsAndColumns rac)
    • Method Detail

      • computeBoundaries

        public int[] computeBoundaries(List<String> columns)
        Description copied from interface: ClusteredGroupPartitioner
        Computes and returns the contiguous boundaries of independent groups. All rows within a group must have the same values for the identified columns. Because the data is assumed to be clustered, each distinct set of column values should produce exactly one entry in the return value.

        Note that implementations are not expected to validate that the data is pre-clustered, nor to detect that the same cluster appears non-contiguously. It is up to the caller to ensure that the data is clustered correctly before invoking this method.

        Specified by:
        computeBoundaries in interface ClusteredGroupPartitioner
        Parameters:
        columns - the columns to partition on
        Returns:
        an int[] representing the start (inclusive) and stop (exclusive) offsets of boundaries. Boundaries are contiguous, so the stop of the previous boundary is the start of the subsequent one.
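
The boundary contract described above can be illustrated with a minimal sketch. It does not use this class or RowsAndColumns; `BoundarySketch` and its `computeBoundaries(List<String>)` helper are hypothetical names, and each list element stands in for the combined partition-column values of one row. Returning an empty array for empty input is an assumption of the sketch, not documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; not part of the documented API.
final class BoundarySketch {

    // Returns contiguous offsets: each adjacent pair is the start (inclusive)
    // and stop (exclusive) of one run of equal keys.
    static int[] computeBoundaries(List<String> clusteredKeys) {
        if (clusteredKeys.isEmpty()) {
            return new int[0]; // assumption: no rows means no boundaries
        }
        List<Integer> boundaries = new ArrayList<>();
        boundaries.add(0);
        for (int i = 1; i < clusteredKeys.size(); i++) {
            // A new group starts wherever the key differs from the previous row.
            if (!Objects.equals(clusteredKeys.get(i), clusteredKeys.get(i - 1))) {
                boundaries.add(i);
            }
        }
        boundaries.add(clusteredKeys.size());
        return boundaries.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "b", "b", "c");
        // prints [0, 2, 5, 6]
        System.out.println(Arrays.toString(computeBoundaries(keys)));
    }
}
```

With unclustered input such as `a, b, a`, this sketch returns `[0, 1, 2, 3]`, reporting `a` as two separate groups. That matches the note above: no validation is performed, so the caller must cluster the data first.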
      • partitionOnBoundaries

        public ArrayList<RowsAndColumns> partitionOnBoundaries(List<String> partitionColumns)
        Description copied from interface: ClusteredGroupPartitioner
        Semantically equivalent to computeBoundaries, but returns a list of RowsAndColumns objects instead of just boundary positions. This allows the concrete implementation to return RowsAndColumns objects that are aware of the internal representation of the data and can therefore provide optimized implementations of other semantic interfaces when the "child" RowsAndColumns are used.
        Specified by:
        partitionOnBoundaries in interface ClusteredGroupPartitioner
        Parameters:
        partitionColumns - the columns to partition on
        Returns:
        a list of RowsAndColumns representing the data grouped by the partition columns.
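
As a rough sketch of how boundary offsets relate to the returned partitions, the example below slices a plain list into one sublist per group. `PartitionSketch` and its `partitionOnBoundaries(List<T>, int[])` helper are hypothetical and work on ordinary Java lists rather than RowsAndColumns; the boundary array is assumed to follow the contiguous start/stop layout described for computeBoundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the documented API.
final class PartitionSketch {

    // Produces one sublist per adjacent pair of boundary offsets.
    static <T> List<List<T>> partitionOnBoundaries(List<T> rows, int[] boundaries) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 1; i < boundaries.length; i++) {
            // subList returns a view backed by the original list, so no rows are copied.
            partitions.add(rows.subList(boundaries[i - 1], boundaries[i]));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "a", "b", "b", "b", "c");
        int[] boundaries = {0, 2, 5, 6};
        // prints [[a, a], [b, b, b], [c]]
        System.out.println(partitionOnBoundaries(rows, boundaries));
    }
}
```

Using `subList` views loosely mirrors the motivation given above: each "child" still refers to the parent's storage, which is the kind of knowledge an implementation could exploit for optimized access.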