Interface ClusteredGroupPartitioner

  • All Known Implementing Classes:
    DefaultClusteredGroupPartitioner

    public interface ClusteredGroupPartitioner
    A semantic interface used to partition a data set based on a given set of columns.

    This interface assumes that it is working with pre-clustered data. The groups returned should therefore be contiguous and unique: all rows for a given combination of values exist in exactly one group.

    • Method Detail

      • computeBoundaries

        int[] computeBoundaries(List<String> columns)
        Computes and returns the contiguous boundaries of independent groups as an int[]. All rows in a given group should have the same values for the identified columns. Because the data is assumed to be clustered, each distinct combination of column values should correspond to only a single entry in the return value.

        Note that implementations are not expected to validate that the data is pre-clustered, nor to detect that the same combination of values appears in more than one non-contiguous run. It is up to the caller to ensure that the data is clustered correctly before invoking this method.

        Parameters:
        columns - the columns to partition on
        Returns:
        an int[] representing the start (inclusive) and stop (exclusive) offsets of boundaries. Boundaries are contiguous, so the stop of the previous boundary is the start of the subsequent one.
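
        The boundary contract described above can be illustrated with a minimal sketch. It is not the DefaultClusteredGroupPartitioner implementation: the BoundarySketch class, the Map-of-columns stand-in for the data set, and the numRows parameter are all hypothetical. It also assumes one reading of the return value, in which each offset is listed once, starting with 0 and ending with the row count, so that n groups yield n + 1 entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a clustered data set: column name -> values, one entry per row.
// This is NOT the RowsAndColumns API; it only illustrates the boundary contract.
final class BoundarySketch {

  // Returns offsets such that group i covers rows [result[i], result[i + 1]).
  static int[] computeBoundaries(Map<String, List<String>> data, List<String> columns, int numRows) {
    if (numRows == 0) {
      return new int[0]; // assumption: no rows means no boundaries
    }
    List<Integer> boundaries = new ArrayList<>();
    boundaries.add(0); // the first group always starts at row 0
    for (int row = 1; row < numRows; row++) {
      for (String column : columns) {
        List<String> values = data.get(column);
        // A change in any partition column between adjacent rows starts a new group.
        if (!Objects.equals(values.get(row - 1), values.get(row))) {
          boundaries.add(row);
          break;
        }
      }
    }
    boundaries.add(numRows); // exclusive stop of the last group
    return boundaries.stream().mapToInt(Integer::intValue).toArray();
  }

  public static void main(String[] args) {
    Map<String, List<String>> data = Map.of(
        "country", List.of("US", "US", "US", "FR", "FR"),
        "device", List.of("web", "web", "app", "app", "app")
    );
    // prints [0, 2, 3, 5]
    System.out.println(Arrays.toString(computeBoundaries(data, List.of("country", "device"), 5)));
  }
}
```

        Like the interface itself, the sketch compares only adjacent rows and performs no clustering validation: if the same combination of values reappeared later in the data, it would simply be reported as a separate group.
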
      • partitionOnBoundaries

        ArrayList<RowsAndColumns> partitionOnBoundaries(List<String> partitionColumns)
        Semantically equivalent to computeBoundaries, but returns a list of RowsAndColumns objects instead of just boundary positions. This allows a concrete implementation to return RowsAndColumns objects that are aware of the internal representation of the data and can therefore provide optimized implementations of other semantic interfaces when the "child" RowsAndColumns are used.
        Parameters:
        partitionColumns - the columns to partition on
        Returns:
        a list of RowsAndColumns representing the data grouped by the partition columns.
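
        The relationship between boundary offsets and partitioned "child" objects can be sketched as follows. This is an illustration only, not the actual implementation: the PartitionSketch class is hypothetical, a plain List of rows stands in for RowsAndColumns, and the boundaries array is assumed to list each offset once (for example [0, 2, 3, 5] for three groups). List.subList returns a view over the parent list rather than a copy, which loosely mirrors the idea of children that stay aware of the parent's internal representation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a List<T> of rows stands in for RowsAndColumns.
final class PartitionSketch {

  // Slices rows into one view per group, where group i covers [boundaries[i], boundaries[i + 1]).
  static <T> List<List<T>> partitionOnBoundaries(List<T> rows, int[] boundaries) {
    List<List<T>> partitions = new ArrayList<>();
    for (int i = 1; i < boundaries.length; i++) {
      // subList is a view backed by the parent list; no rows are copied.
      partitions.add(rows.subList(boundaries[i - 1], boundaries[i]));
    }
    return partitions;
  }

  public static void main(String[] args) {
    List<String> rows = List.of("US/web", "US/web", "US/app", "FR/app", "FR/app");
    // prints [[US/web, US/web], [US/app], [FR/app, FR/app]]
    System.out.println(partitionOnBoundaries(rows, new int[]{0, 2, 3, 5}));
  }
}
```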