Class SuperSorter


  • public class SuperSorter
    extends Object
    Sorts and partitions a dataset using parallel external merge sort. Input is provided as a set of ReadableFrameChannel and output is provided as OutputChannels. Work is performed on a provided FrameProcessorExecutor.

    The most central point for SuperSorter logic is the runWorkersIfPossible() method, which determines what needs to be done next based on the current state of the SuperSorter. The logic is:

    1) Read input channels into inputBuffer using FrameChannelBatcher, launched via runNextBatcher(), up to a limit of maxChannelsPerProcessor per batcher.
    2) Merge and write frames from inputBuffer into FrameFile scratch files using FrameChannelMerger launched via runNextLevelZeroMerger().
    3a) Merge level 0 scratch files into level 1 scratch files using FrameChannelMerger launched from runNextMiddleMerger(), processing up to maxChannelsPerProcessor files per merger. Continue this process through increasing level numbers, with the size of scratch files increasing by a factor of maxChannelsPerProcessor each level.
    3b) For the penultimate level, the FrameChannelMerger launched by runNextMiddleMerger() writes partitioned FrameFile scratch files. The penultimate level cannot be written until outputPartitionsFuture resolves, so if it has not resolved yet by this point, the SuperSorter pauses. The SuperSorter resumes and writes the penultimate level's files when the future resolves.
    4) Write the final level using FrameChannelMerger launched from runNextUltimateMerger(). Outputs for this level are written to channels provided by outputChannelFactory, rather than scratch files.

    At all points, higher-level processing is preferred over lower-level processing. Writing to final output files is preferred over intermediate, and writing to intermediate files is preferred over reading inputs. These preferences ensure that the amount of data buffered up in memory does not grow too large.

    Potential future work (things we could optimize if necessary):
    - Collapse merging to a single level if level zero has one merger, and we want to write one output partition.
    - Skip batching, and inject directly into level 0, if input channels are already individually fully-sorted.
    - Combine (for example: aggregate) while merging.
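    The level structure in steps 3a through 4 can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not SuperSorter code: the class MergeLevelSketch and the method filesPerLevel are hypothetical names, and it assumes every merger consumes a full group of up to maxChannelsPerProcessor files. The real level planning also has to account for output partitions, which this page does not describe in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Druid: estimates how many scratch files
// exist at each merge level when each merger reads up to `maxChannelsPerProcessor` files.
final class MergeLevelSketch {

    /** Returns the file count at each level, starting at level 0, until a single file remains. */
    static List<Long> filesPerLevel(long levelZeroFiles, int maxChannelsPerProcessor) {
        if (levelZeroFiles < 1 || maxChannelsPerProcessor < 2) {
            throw new IllegalArgumentException("need at least 1 file and a fan-in of at least 2");
        }
        List<Long> counts = new ArrayList<>();
        long files = levelZeroFiles;
        counts.add(files);
        while (files > 1) {
            // Ceiling division: each merger turns up to `maxChannelsPerProcessor` files into one.
            files = (files + maxChannelsPerProcessor - 1) / maxChannelsPerProcessor;
            counts.add(files);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 1000 level-zero files with a fan-in of 8.
        System.out.println(filesPerLevel(1000, 8)); // prints [1000, 125, 16, 2, 1]
    }
}
```

    Because the file count shrinks by roughly a factor of maxChannelsPerProcessor per level, the number of levels grows only logarithmically with the input size, which is why a larger fan-in means fewer passes over the data.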
    • Constructor Detail

      • SuperSorter

        public SuperSorter(List<ReadableFrameChannel> inputChannels,
                           FrameReader frameReader,
                           List<KeyColumn> sortKey,
                           com.google.common.util.concurrent.ListenableFuture<ClusterByPartitions> outputPartitionsFuture,
                           FrameProcessorExecutor exec,
                           OutputChannelFactory outputChannelFactory,
                           OutputChannelFactory intermediateOutputChannelFactory,
                           int maxActiveProcessors,
                           int maxChannelsPerProcessor,
                           long rowLimit,
                           @Nullable String cancellationId,
                           SuperSorterProgressTracker superSorterProgressTracker)
        Initializes a SuperSorter.
        Parameters:
        inputChannels - input channels. All frames in these channels must be sorted according to ClusterBy.getColumns(), or else sorting will not produce correct output.
        frameReader - frame reader for the input channels
        sortKey - desired sorting order
        outputPartitionsFuture - a future that resolves to the desired output partitions. Sorting will block prior to writing out final outputs until this future resolves. However, the sorter will be able to read all inputs even if this future is unresolved. If output need not be partitioned, use ClusterByPartitions.oneUniversalPartition(). In this case a single sorted channel is generated.
        exec - executor to perform work in
        outputChannelFactory - factory for partitioned, sorted output channels
        intermediateOutputChannelFactory - factory for intermediate data produced by sorting levels
        maxActiveProcessors - maximum number of merging processors to execute at once in the provided FrameProcessorExecutor
        maxChannelsPerProcessor - maximum number of channels to merge at once per merging processor
        rowLimit - limit to apply during sorting. The limit is merely advisory: the actual number of rows returned may be larger than the limit. The limit is applied across all partitions, not to each partition individually.
        cancellationId - cancellation id to use when running processors in the provided FrameProcessorExecutor.
        superSorterProgressTracker - progress tracker
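        To make the roles of maxChannelsPerProcessor and the pre-sorted input requirement concrete, here is a minimal k-way merge sketch. It is only an illustration of the general technique, not the FrameChannelMerger implementation: it merges in-memory lists rather than frame channels, and the names KWayMergeSketch, Head, and merge are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a single merging processor: merges up to
// `maxChannels` inputs, each of which must already be sorted by `order`.
final class KWayMergeSketch {

    /** One heap entry: the current head value of an input and the iterator it came from. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }
    }

    static <T> List<T> merge(List<List<T>> sortedInputs, Comparator<? super T> order, int maxChannels) {
        if (sortedInputs.size() > maxChannels) {
            throw new IllegalArgumentException("too many inputs for one merger");
        }
        Comparator<Head<T>> byValue = (a, b) -> order.compare(a.value, b.value);
        PriorityQueue<Head<T>> heap = new PriorityQueue<>(byValue);
        // Seed the heap with the first element of every non-empty input.
        for (List<T> input : sortedInputs) {
            Iterator<T> it = input.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        // Repeatedly emit the smallest head, then refill from the same input.
        while (!heap.isEmpty()) {
            Head<T> head = heap.poll();
            merged.add(head.value);
            if (head.source.hasNext()) {
                heap.add(new Head<>(head.source.next(), head.source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> inputs = List.of(List.of(1, 4, 7), List.of(2, 5), List.of(3, 6, 8));
        System.out.println(merge(inputs, Comparator.naturalOrder(), 4)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

        The merge only looks at the head of each input, so an unsorted input silently yields unsorted output. That is the reason the inputChannels parameter requires every channel to be sorted in advance.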
    • Method Detail

      • run

        public com.google.common.util.concurrent.ListenableFuture<OutputChannels> run()
        Starts sorting. Can only be called once. Work is performed in the FrameProcessorExecutor that was passed to the constructor. Returns a future containing partitioned sorted output channels.
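        The call-once contract combined with a future-returning method is a common pattern. The sketch below shows one way such a contract could be enforced. It is an assumption for illustration, not a description of how SuperSorter guards run(): the class RunOnceSketch is hypothetical, and it uses the JDK's CompletableFuture and ExecutorService in place of Guava's ListenableFuture and Druid's FrameProcessorExecutor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical illustration of a run-once, future-returning entry point.
final class RunOnceSketch<T> {
    private final Supplier<T> work;
    private final ExecutorService exec;
    private final AtomicBoolean started = new AtomicBoolean(false);

    RunOnceSketch(Supplier<T> work, ExecutorService exec) {
        this.work = work;
        this.exec = exec;
    }

    /** Starts the work on the supplied executor; a second call fails fast. */
    CompletableFuture<T> run() {
        // compareAndSet lets exactly one caller win, even under concurrent calls.
        if (!started.compareAndSet(false, true)) {
            throw new IllegalStateException("run() may only be called once");
        }
        return CompletableFuture.supplyAsync(work, exec);
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            RunOnceSketch<String> sorter = new RunOnceSketch<>(() -> "sorted output", exec);
            System.out.println(sorter.run().join()); // prints sorted output
        } finally {
            exec.shutdown();
        }
    }
}
```

        Callers of a method like this typically keep the returned future and attach their follow-up work to it, rather than calling run() again to obtain the result.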
      • stateString

        public String stateString()
        Returns a string encapsulating the current state of this object.