Interface BucketNumberedShardSpec<T extends BuildingShardSpec>
-
- All Superinterfaces:
ShardSpec
- All Known Implementing Classes:
DimensionRangeBucketShardSpec, HashBucketShardSpec, SingleDimensionRangeBucketShardSpec
public interface BucketNumberedShardSpec<T extends BuildingShardSpec> extends ShardSpec
This is one of the special shardSpecs that are used only temporarily during batch ingestion. Druid has the concept of a core partition set: a set of segments that become queryable together, atomically, in Brokers. The core partition set is represented as a range of partitionIds, i.e., [0, ShardSpec.getNumCorePartitions()).

When you run a batch ingestion task with a non-linear partitioning scheme, the task populates all possible buckets up front (see CachingLocalSegmentAllocator) and uses them to partition input rows. However, if the data is highly skewed, some buckets can still be empty after the task has consumed all inputs. Since Druid doesn't create empty segments, the partitionId should be allocated dynamically, only when a bucket is actually in use, so that the core partition set is always packed, with no missing partitionIds.

BucketNumberedShardSpec exists for this use case. A task with a non-linear partitioning scheme uses it to postpone partitionId allocation until all empty buckets have been identified. See ParallelIndexSupervisorTask.groupGenericPartitionLocationsPerPartition and CachingLocalSegmentAllocator for parallel and sequential ingestion, respectively.

Note that SegmentId requires a partitionId. Since the segmentId is used everywhere during ingestion, this class should implement getPartitionNum(), which returns the bucketId. This should be fine because, until segments are pushed to deep storage, the segmentId is used only to identify each segment, and the bucketId should be enough to identify each segment uniquely. However, when segments are pushed to deep storage, the partitionId is used to create the path under which the segment is stored (DataSegmentPusher.getDefaultStorageDir(org.apache.druid.timeline.DataSegment, boolean)), and that path must be correct. As a result, this shardSpec should not be used when pushing segments.

This class should be Jackson-serializable because, in parallel ingestion, the subtasks can send it to the parallel task.

This interface doesn't really have to extend ShardSpec. The only reason it does is that ShardSpec is used in many places, such as DataSegment, and allowing types other than ShardSpec in those places would require modifying all of them, which seems pretty invasive. Maybe we could clean up this mess someday in the future.
- See Also:
BuildingShardSpec
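The deferred allocation described above can be sketched as follows. This is a minimal illustration, not Druid's implementation: `BucketSpec`, `AllocatedSpec`, and `PackedPartitionAllocation` are hypothetical stand-ins, and the row counts per bucket are assumed to be known once all input has been consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a bucket-numbered shard spec (not Druid's class).
final class BucketSpec {
    final int bucketId;

    BucketSpec(int bucketId) {
        this.bucketId = bucketId;
    }

    // Plays the role of convert(int partitionId): binds a final partitionId to this bucket.
    AllocatedSpec convert(int partitionId) {
        return new AllocatedSpec(bucketId, partitionId);
    }
}

// Hypothetical stand-in for the spec produced once the partitionId is known.
final class AllocatedSpec {
    final int bucketId;
    final int partitionId;

    AllocatedSpec(int bucketId, int partitionId) {
        this.bucketId = bucketId;
        this.partitionId = partitionId;
    }
}

final class PackedPartitionAllocation {
    // rowsPerBucket maps bucketId -> number of rows routed to that bucket.
    static List<AllocatedSpec> allocate(Map<Integer, Integer> rowsPerBucket) {
        List<AllocatedSpec> result = new ArrayList<>();
        int nextPartitionId = 0;
        // Iterate in ascending bucketId order so the allocation is deterministic.
        for (Map.Entry<Integer, Integer> entry : new TreeMap<>(rowsPerBucket).entrySet()) {
            if (entry.getValue() == 0) {
                continue; // empty bucket: no segment is created, so no partitionId is consumed
            }
            result.add(new BucketSpec(entry.getKey()).convert(nextPartitionId++));
        }
        return result;
    }
}
```

With buckets 0 to 4 where buckets 1 and 3 received no rows, the non-empty buckets 0, 2, and 4 receive partitionIds 0, 1, and 2, so the resulting set is packed with no gaps.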
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.druid.timeline.partition.ShardSpec
ShardSpec.Type
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| T | convert(int partitionId) | |
| default <O> PartitionChunk<O> | createChunk(O obj) | |
| int | getBucketId() | |
| default List<String> | getDomainDimensions() | Returns the dimensions that have a known possible range for the rows this shard contains. |
| default int | getNumCorePartitions() | |
| default int | getPartitionNum() | Returns the partition ID of this segment. |
| default boolean | possibleInDomain(Map<String,com.google.common.collect.RangeSet<String>> domain) | Returns false if the given domain ranges are not possible in this shard; otherwise returns true. |
-
Methods inherited from interface org.apache.druid.timeline.partition.ShardSpec
getAtomicUpdateGroupSize, getEndRootPartitionId, getLookup, getMinorVersion, getStartRootPartitionId, getType, sharePartitionSpace
-
Method Detail
-
getBucketId
int getBucketId()
-
convert
T convert(int partitionId)
-
createChunk
default <O> PartitionChunk<O> createChunk(O obj)
- Specified by:
createChunk
in interface ShardSpec
-
getPartitionNum
default int getPartitionNum()
Description copied from interface: ShardSpec
Returns the partition ID of this segment.
- Specified by:
getPartitionNum
in interface ShardSpec
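The class description says that getPartitionNum() returns the bucketId until a real partitionId has been allocated. A minimal sketch of that arrangement is shown below. All `Mini*` types are hypothetical, heavily simplified mirrors written for illustration; Druid's actual interfaces declare more methods, and their default bodies are not shown on this page.

```java
// Hypothetical, simplified mirror of ShardSpec (the real interface has more methods).
interface MiniShardSpec {
    int getPartitionNum();
}

// Hypothetical, simplified mirror of BucketNumberedShardSpec.
interface MiniBucketNumberedShardSpec<T extends MiniShardSpec> extends MiniShardSpec {
    int getBucketId();

    T convert(int partitionId);

    // Until a partitionId is allocated, the bucketId stands in for it.
    @Override
    default int getPartitionNum() {
        return getBucketId();
    }
}

// Hypothetical spec that carries the final partitionId after conversion.
final class MiniBuildingSpec implements MiniShardSpec {
    private final int partitionId;

    MiniBuildingSpec(int partitionId) {
        this.partitionId = partitionId;
    }

    @Override
    public int getPartitionNum() {
        return partitionId;
    }
}

// Hypothetical bucket spec; it only knows its bucketId.
final class MiniHashBucketSpec implements MiniBucketNumberedShardSpec<MiniBuildingSpec> {
    private final int bucketId;

    MiniHashBucketSpec(int bucketId) {
        this.bucketId = bucketId;
    }

    @Override
    public int getBucketId() {
        return bucketId;
    }

    @Override
    public MiniBuildingSpec convert(int partitionId) {
        return new MiniBuildingSpec(partitionId);
    }
}
```

In this sketch, `new MiniHashBucketSpec(7).getPartitionNum()` yields the bucketId 7, while the spec returned by `convert(2)` reports the allocated partitionId 2. Only the converted spec would be suitable for building a deep storage path.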
-
getNumCorePartitions
default int getNumCorePartitions()
- Specified by:
getNumCorePartitions
in interface ShardSpec
-
getDomainDimensions
default List<String> getDomainDimensions()
Description copied from interface: ShardSpec
Returns the dimensions that have a known possible range for the rows this shard contains.
- Specified by:
getDomainDimensions
in interface ShardSpec
- Returns:
- the list of dimensions that have a known possible range; dimensions whose possible range is unknown are not listed
-
possibleInDomain
default boolean possibleInDomain(Map<String,com.google.common.collect.RangeSet<String>> domain)
Description copied from interface: ShardSpec
Returns false if the given domain ranges are not possible in this shard; otherwise returns true.
- Specified by:
possibleInDomain
in interface ShardSpec
- Returns:
- false if the given domain ranges cannot occur in this shard; true otherwise