Interface BucketNumberedShardSpec<T extends BuildingShardSpec>

  • All Superinterfaces:
    ShardSpec
  • All Known Implementing Classes:
    DimensionRangeBucketShardSpec, HashBucketShardSpec, SingleDimensionRangeBucketShardSpec

    public interface BucketNumberedShardSpec<T extends BuildingShardSpec>
    extends ShardSpec
    This is one of the special shardSpecs that are used temporarily during batch ingestion. In Druid, there is a concept of a core partition set: a set of segments that atomically become queryable together in Brokers. The core partition set is represented as a range of partitionIds, i.e., [0, ShardSpec.getNumCorePartitions()).

    When you run a batch ingestion task with a non-linear partitioning scheme, the task populates all possible buckets upfront (see CachingLocalSegmentAllocator) and uses them to partition input rows. However, some of the buckets can still be empty after the task has consumed all of its input if the data is highly skewed. Since Druid doesn't create empty segments, the partitionId should be allocated dynamically, when a bucket is actually in use, so that a packed core partition set can always be created without missing partitionIds.

    BucketNumberedShardSpec exists for this use case. A task with a non-linear partitioning scheme uses it to postpone partitionId allocation until all empty buckets have been identified. See ParallelIndexSupervisorTask.groupGenericPartitionLocationsPerPartition and CachingLocalSegmentAllocator for parallel and sequential ingestion, respectively.

    Note that SegmentId requires a partitionId. Since the segmentId is used everywhere during ingestion, this interface provides a default getPartitionNum() that returns the bucketId instead. This is fine because the segmentId is only used to identify each segment until it is pushed to deep storage, and the bucketId is enough to uniquely identify a segment up to that point. However, when a segment is pushed to deep storage, the partitionId is used to build the segment's storage path (see DataSegmentPusher.getDefaultStorageDir(org.apache.druid.timeline.DataSegment, boolean)), which must use the correctly allocated partitionId. As a result, this shardSpec must not be used when pushing segments.

    This class should be Jackson-serializable, as the subtasks can send it to the parallel task during parallel ingestion.

    This interface doesn't really have to extend ShardSpec. The only reason it does is that ShardSpec is used in many places, such as DataSegment, and modifying all of those places to accept types other than ShardSpec would be quite invasive. Maybe we can clean up this mess someday in the future.
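    For illustration, here is a minimal sketch of the deferred allocation described above (the class, method, and variable names are assumed, not actual Druid code): once the empty buckets are known, each remaining bucket is converted to a BuildingShardSpec with a packed partitionId via convert(int), so the core partition set [0, getNumCorePartitions()) has no gaps.

      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.List;
      import org.apache.druid.timeline.partition.BucketNumberedShardSpec;
      import org.apache.druid.timeline.partition.BuildingShardSpec;

      final class PackedPartitionIdAllocator
      {
        // Hypothetical helper: Druid performs this step inside
        // CachingLocalSegmentAllocator and ParallelIndexSupervisorTask
        // rather than in a standalone class like this one.
        static List<BuildingShardSpec> allocate(List<BucketNumberedShardSpec<?>> nonEmptyBuckets)
        {
          final List<BucketNumberedShardSpec<?>> sorted = new ArrayList<>(nonEmptyBuckets);
          sorted.sort(Comparator.comparingInt(bucket -> bucket.getBucketId()));
          final List<BuildingShardSpec> assigned = new ArrayList<>(sorted.size());
          int nextPartitionId = 0;  // packed: empty (skipped) buckets leave no holes
          for (BucketNumberedShardSpec<?> bucket : sorted) {
            assigned.add(bucket.convert(nextPartitionId++));
          }
          return assigned;
        }
      }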
    See Also:
    BuildingShardSpec
    • Method Detail

      • getBucketId

        int getBucketId()
      • convert

        T convert(int partitionId)
      • getPartitionNum

        default int getPartitionNum()
        Description copied from interface: ShardSpec
        Returns the partition ID of this segment.
        Specified by:
        getPartitionNum in interface ShardSpec
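        As a sketch of what the class-level documentation prescribes (hedged; the exact default body in Druid may differ), the override simply reports the bucketId so that a temporary SegmentId can be formed during ingestion:

          @Override
          default int getPartitionNum()
          {
            return getBucketId();  // the bucketId stands in for the partitionId until push time
          }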
      • getDomainDimensions

        default List<String> getDomainDimensions()
        Description copied from interface: ShardSpec
        Get the dimensions that have a known possible value range for the rows this shard contains.
        Specified by:
        getDomainDimensions in interface ShardSpec
        Returns:
        list of dimensions that have a known possible range; dimensions whose possible range is unknown are not listed
      • possibleInDomain

        default boolean possibleInDomain(Map<String, com.google.common.collect.RangeSet<String>> domain)
        Description copied from interface: ShardSpec
        Returns false if the given domain ranges are not possible in this shard; otherwise returns true.
        Specified by:
        possibleInDomain in interface ShardSpec
        Returns:
        whether rows within the given domain can possibly exist in this shard
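        For context, a hedged caller-side sketch of how this check can prune shards before querying them; the ShardPruner helper and the "country" dimension below are illustrative, not Druid API:

          import com.google.common.collect.ImmutableRangeSet;
          import com.google.common.collect.Range;
          import com.google.common.collect.RangeSet;
          import java.util.List;
          import java.util.Map;
          import java.util.stream.Collectors;
          import org.apache.druid.timeline.partition.ShardSpec;

          final class ShardPruner
          {
            // Keep only the shards that may contain rows whose "country"
            // value falls within ["FR", "US"].
            static List<ShardSpec> prune(List<ShardSpec> shards)
            {
              final Map<String, RangeSet<String>> domain =
                  Map.of("country", ImmutableRangeSet.of(Range.closed("FR", "US")));
              return shards.stream()
                           .filter(shard -> shard.possibleInDomain(domain))
                           .collect(Collectors.toList());
            }
          }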