Metadata describing a Cassandra table partition processed by a single Spark task.
Beware: the term "partition" is overloaded. Here, in the context of Spark,
it means an arbitrary collection of rows that can be processed locally on a single Cassandra cluster node.
A CassandraPartition typically contains multiple CQL partitions, i.e. rows identified by different values of
the CQL partitioning key.
identifier of the partition, used internally by Spark
which nodes the data partition is located on
token ranges determining the row set to be fetched
estimated total row count in a partition
RDD created by repartitionByCassandraReplica with preferred locations mapping to the CassandraReplicas each partition was created for.
Creates CassandraPartitions for a given Cassandra table.
Stores a CQL WHERE predicate matching a range of tokens.
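As an illustration of what such a predicate can look like, here is a minimal sketch in Python. The function name `token_range_predicate` and its parameters are hypothetical and are not part of the connector; only the CQL `token()` function is standard. The sketch assumes a half-open range (start exclusive, end inclusive) that does not wrap around the ring; the connector itself may treat range bounds differently.

```python
from typing import Sequence, Tuple


def token_range_predicate(
    partition_key_columns: Sequence[str],
    start_token: int,
    end_token: int,
) -> Tuple[str, Tuple[int, int]]:
    """Build a CQL WHERE fragment selecting rows whose token lies in (start, end].

    Hypothetical helper for illustration only. Returns the fragment with
    bind markers plus the values to bind.
    """
    if not partition_key_columns:
        raise ValueError("at least one partition key column is required")
    # token() takes all partition key columns, in declaration order.
    cols = ", ".join(partition_key_columns)
    cql = f"token({cols}) > ? AND token({cols}) <= ?"
    return cql, (start_token, end_token)
```

The fragment would be appended to a `SELECT ... WHERE` clause and executed once per token range in the Spark partition.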
Fast token range splitter assuming that data are spread out evenly in the whole range.
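The idea of splitting under an even-distribution assumption can be sketched as follows. This is a Python illustration, not the connector's code: the function `split_evenly` and its parameters are hypothetical, tokens are modeled as plain integers, and the range is assumed not to wrap around the ring.

```python
from typing import List, Tuple


def split_evenly(
    start: int, end: int, row_count: int, rows_per_split: int
) -> List[Tuple[int, int]]:
    """Split the token range [start, end) into equal-width sub-ranges.

    Hypothetical sketch: if rows are spread evenly over the range, cutting
    it into ceil(row_count / rows_per_split) equal pieces gives each piece
    roughly rows_per_split rows without asking the server anything.
    """
    if end < start or rows_per_split <= 0:
        raise ValueError("expected start <= end and rows_per_split > 0")
    width = end - start
    wanted = -(-row_count // rows_per_split)  # ceiling division
    # Never produce more pieces than there are distinct tokens, and at least one.
    n = max(1, min(width, wanted))
    bounds = [start + width * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, a range of width 100 estimated at 1000 rows with 250 rows per split yields four sub-ranges of width 25. The speed comes from doing only arithmetic; the cost is skewed splits when the data is not actually uniform.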
The replica partitioner works on an RDD that is keyed on sets of InetAddresses representing Cassandra hosts. It groups keys that share a common IP address into partitionsPerReplicaSet partitions.
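One way to picture this mapping is the Python sketch below. It is an assumption-laden illustration, not the connector's implementation: the class `ReplicaPartitionerSketch`, the `salt` parameter, and the rule of picking the smallest known address are all hypothetical choices made to keep the example deterministic. Hosts are modeled as strings.

```python
from typing import Iterable


class ReplicaPartitionerSketch:
    """Maps a key (a set of replica addresses) to a Spark partition index.

    Each known host owns a contiguous block of `partitions_per_replica_set`
    partition indices, so keys sharing a host can land in that host's block.
    """

    def __init__(self, hosts: Iterable[str], partitions_per_replica_set: int):
        if partitions_per_replica_set <= 0:
            raise ValueError("partitions_per_replica_set must be positive")
        self.host_index = {h: i for i, h in enumerate(sorted(set(hosts)))}
        self.partitions_per_replica_set = partitions_per_replica_set

    @property
    def num_partitions(self) -> int:
        return len(self.host_index) * self.partitions_per_replica_set

    def get_partition(self, replica_set: Iterable[str], salt: int = 0) -> int:
        known = [h for h in replica_set if h in self.host_index]
        if not known:
            raise ValueError("no known host in replica set")
        # Assumption: pick the smallest address; a real partitioner may
        # choose among the replicas differently (for example at random).
        host = min(known)
        base = self.host_index[host] * self.partitions_per_replica_set
        # `salt` spreads keys across the host's block of partitions.
        return base + salt % self.partitions_per_replica_set
```

With three hosts and two partitions per replica set there are six partitions, and every key whose chosen host is the second one falls into partition 2 or 3.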
Delegates token range splitting to the Cassandra server.
Divides a set of token ranges into groups containing not more than maxRowCountPerGroup rows
and not more than maxGroupSize token ranges. Each group will form a single CassandraRDDPartition.
The algorithm is as follows:
1. Sort token ranges by endpoints lexicographically.
2. Take the highest possible number of token ranges from the beginning of the list,
such that the sum of their rowCounts does not exceed maxRowCountPerGroup and they all share at
least one common endpoint. If that is not possible, take at least one item.
Those token ranges will make a group.
3. Repeat the previous step until no token ranges are left.
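The steps above can be sketched in Python as follows. This is an illustrative reading of the description, not the connector's source: the `TokenRange` class, its field names, and the function `group_token_ranges` are hypothetical, endpoints are modeled as strings, and the maxGroupSize limit from the summary is applied alongside the row-count limit.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class TokenRange:
    start: int
    end: int
    endpoints: FrozenSet[str]  # replica addresses holding this range
    row_count: int             # estimated number of rows in the range


def group_token_ranges(
    ranges: Sequence[TokenRange],
    max_row_count_per_group: int,
    max_group_size: int,
) -> List[List[TokenRange]]:
    # Step 1: sort token ranges by their endpoints, lexicographically.
    remaining = sorted(ranges, key=lambda r: sorted(r.endpoints))
    groups: List[List[TokenRange]] = []
    i = 0
    while i < len(remaining):
        # Step 2: always take at least one range, even if it exceeds the limit.
        first = remaining[i]
        group = [first]
        common = set(first.endpoints)
        rows = first.row_count
        i += 1
        # Extend the group while all limits hold and an endpoint is still shared.
        while i < len(remaining) and len(group) < max_group_size:
            candidate = remaining[i]
            shared = common & candidate.endpoints
            if not shared or rows + candidate.row_count > max_row_count_per_group:
                break
            group.append(candidate)
            common = shared
            rows += candidate.row_count
            i += 1
        groups.append(group)
        # Step 3: the outer loop repeats until no ranges are left.
    return groups
```

Because the ranges are sorted by endpoints first, ranges stored on the same replicas end up adjacent, which is what lets a simple front-to-back scan build groups that can each be read locally from one node.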
Splits a token range into smaller sub-ranges, each with the desired approximate number of rows.
Provides components for partitioning a Cassandra table into smaller parts of appropriate size. Each partition can be processed locally on at least one cluster node.