Interface ClusteredGroupPartitioner
All Known Implementing Classes: DefaultClusteredGroupPartitioner
public interface ClusteredGroupPartitioner
A semantic interface used to partition a data set based on a given set of columns.
This specifically assumes that it is working with pre-clustered data and, as such, the groups returned should be contiguous and unique (that is, all rows for a given combination of values exist in only one grouping).
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| int[] | computeBoundaries(List<String> columns) | Computes and returns a list of contiguous boundaries for independent groups. |
| static ClusteredGroupPartitioner | fromRAC(RowsAndColumns rac) | |
| ArrayList<RowsAndColumns> | partitionOnBoundaries(List<String> partitionColumns) | Semantically equivalent to computeBoundaries, but returns a list of RowsAndColumns objects instead of just boundary positions. |
-
Method Detail
-
fromRAC
static ClusteredGroupPartitioner fromRAC(RowsAndColumns rac)
-
computeBoundaries
int[] computeBoundaries(List<String> columns)
Computes and returns a list of contiguous boundaries for independent groups. All rows in a specific grouping should have the same values for the identified columns. Additionally, because this method assumes it is dealing with clustered data, there should be only a single entry in the return value for a given set of values of the columns.
Note that implementations are not expected to validate that the data is pre-clustered. There is no expectation that an implementation will detect that the same cluster appears non-contiguously. It is up to the caller to ensure that the data is clustered correctly before invoking this method.
Parameters:
columns - the columns to partition on
Returns:
an int[] representing the start (inclusive) and stop (exclusive) offsets of boundaries. Boundaries are contiguous, so the stop of the previous boundary is the start of the subsequent one.
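The boundary contract described above can be illustrated with a minimal sketch. It is not the library's implementation: `BoundarySketch`, its `Map<String, Object[]>` column layout, and the explicit `numRows` parameter are hypothetical stand-ins for a RowsAndColumns. It also assumes the return value is read as n + 1 offsets for n groups (each stop doubling as the next start), and that an empty input yields an empty array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for illustration; not part of the documented API.
final class BoundarySketch {

    // Scans pre-clustered rows once and records an offset wherever the value
    // of any partition column differs from the previous row.
    static int[] computeBoundaries(Map<String, Object[]> table, int numRows, List<String> columns) {
        if (numRows == 0) {
            return new int[0]; // assumption: no rows means no boundaries
        }
        List<Integer> offsets = new ArrayList<>();
        offsets.add(0); // start (inclusive) of the first group
        for (int row = 1; row < numRows; row++) {
            for (String column : columns) {
                Object[] values = table.get(column);
                if (!Objects.equals(values[row - 1], values[row])) {
                    offsets.add(row); // stop of the previous group, start of the next
                    break;
                }
            }
        }
        offsets.add(numRows); // stop (exclusive) of the last group
        return offsets.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        Map<String, Object[]> table = Map.of(
            "country", new Object[] {"US", "US", "US", "FR", "FR", "JP"},
            "device", new Object[] {"ios", "ios", "web", "web", "web", "web"});
        int[] boundaries = computeBoundaries(table, 6, List.of("country", "device"));
        System.out.println(Arrays.toString(boundaries)); // prints [0, 2, 3, 5, 6]
    }
}
```

Like the contract it illustrates, this sketch performs no clustering validation. If the same key appeared in two non-adjacent runs (for example US, FR, US), it would emit two separate ranges for that key, so sorting or clustering remains the caller's responsibility.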
-
partitionOnBoundaries
ArrayList<RowsAndColumns> partitionOnBoundaries(List<String> partitionColumns)
Semantically equivalent to computeBoundaries, but returns a list of RowsAndColumns objects instead of just boundary positions. This is useful because it lets the concrete implementation return RowsAndColumns objects that are aware of the internal representation of the data and can therefore provide optimized implementations of other semantic interfaces when the "child" RowsAndColumns are used.
Parameters:
partitionColumns - the columns to partition on
Returns:
a list of RowsAndColumns representing the data grouped by the partition columns.
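The relationship between the two methods can be sketched as slicing the data along the computed offsets. Everything here is an illustrative assumption: `PartitionSketch` is a hypothetical class, the rows are modeled as a plain `List<T>` rather than a RowsAndColumns, and the boundaries are passed in directly as n + 1 offsets, whereas the documented method takes column names. `List.subList` is used because it returns a view over the parent list, loosely mirroring the idea of "child" objects that stay aware of the parent's internal representation instead of copying it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of the documented API.
final class PartitionSketch {

    // Turns n + 1 contiguous offsets into n groups, each a view over the parent rows.
    static <T> ArrayList<List<T>> partitionOnBoundaries(List<T> rows, int[] boundaries) {
        ArrayList<List<T>> groups = new ArrayList<>();
        for (int i = 1; i < boundaries.length; i++) {
            // boundaries[i - 1] is inclusive, boundaries[i] is exclusive
            groups.add(rows.subList(boundaries[i - 1], boundaries[i]));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("US", "US", "FR", "JP", "JP", "JP");
        int[] boundaries = {0, 2, 3, 6};
        System.out.println(partitionOnBoundaries(rows, boundaries));
        // prints [[US, US], [FR], [JP, JP, JP]]
    }
}
```

Returning views rather than copies is one possible design choice; a concrete implementation is free to build its children however best suits its own storage.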