This is slightly different from Scala's this.type. this.type is the unique singleton type of an object, which is not compatible with other instances of the same type, so returning anything other than this is not really possible without lying to the compiler via explicit casts. Here SelfType is used to return a copy of the object: a different instance of the same type.
Maps each row into an object of a different type using the provided function, taking column value(s) as argument(s). Can be used to convert each row to a tuple or a case class object:
sc.cassandraTable("ks", "table")
  .select("column1")
  .as((s: String) => s)          // yields CassandraRDD[String]

sc.cassandraTable("ks", "table")
  .select("column1", "column2")
  .as((_: String, _: Long))      // yields CassandraRDD[(String, Long)]

case class MyRow(key: String, value: Long)
sc.cassandraTable("ks", "table")
  .select("column1", "column2")
  .as(MyRow)                     // yields CassandraRDD[MyRow]
This method will create the RowWriter required before the RDD is serialized. This is called during getPartitions.
Adds a CQL ORDER BY clause to the query.
It can be applied only when there are clustering columns and a primary key predicate is pushed down in where.
It is useful when the default direction of ordering rows within a single Cassandra partition
needs to be changed.
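As a sketch, assuming a hypothetical table ks.events with partition key key and clustering column time; withDescOrder is the connector's order-reversing modifier, but verify it against your connector version:

```scala
// Hypothetical schema:
//   CREATE TABLE ks.events (key text, time timestamp, value double,
//                           PRIMARY KEY (key, time))
sc.cassandraTable("ks", "events")
  .where("key = ?", "sensor-1") // primary key predicate pushed down
  .withDescOrder                // newest rows first within the partition
```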
When computing a CassandraPartitionKeyRDD, the data is selected via single CQL statements from the specified C* keyspace and table. This will be performed on whatever data is available in the previous RDD in the chain.
Allows this RDD to be copied while changing some of its properties.
Uses the data from the RDD to join with a Cassandra table without retrieving the entire table.
Any RDD that can be used with saveToCassandra can be used with joinWithCassandra, as can any RDD that specifies only the partition key of a Cassandra table. This method executes single-partition requests against the Cassandra table and accepts the functional modifiers that a normal com.datastax.spark.connector.rdd.CassandraTableScanRDD takes.
By default this method uses only the partition key for joining, but any combination of columns acceptable to C* can be used in the join. Specify columns with the joinColumns parameter or the on() method.
Example With Prior Repartitioning:
val source = sc.parallelize(keys).map(x => new KVRow(x))
val repart = source.repartitionByCassandraReplica(keyspace, tableName, 10)
val someCass = repart.joinWithCassandraTable(keyspace, tableName)
Example Joining on Clustering Columns:
val source = sc.parallelize(keys).map(x => (x, x * 100))
val someCass = source.joinWithCassandraTable(keyspace, wideTable).on(SomeColumns("key", "group"))
Keys every row in the RDD with the IP addresses of all the Cassandra nodes which contain a replica of the data specified by that row. The calling RDD must have rows that can be converted into the partition key of the given Cassandra table.
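A hedged sketch, assuming a keyspace ks and a table kv whose partition key matches the rows of an existing RDD rdd; the exact return type may vary across connector versions:

```scala
// Each element is keyed by the set of replica addresses holding its data.
val keyed = rdd.keyByCassandraReplica("ks", "kv")
// e.g. RDD[(Set[java.net.InetAddress], RowType)]
```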
Adds the limit clause to the CQL SELECT statement. The limit will be applied for each created Spark partition. In other words, unless the data is fetched from a single Cassandra partition, the number of results is unpredictable.
The main purpose of passing a limit clause is to fetch the top n rows from a single Cassandra partition, when the table is designed so that it uses clustering keys and a partition key predicate is passed to the where clause.
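A sketch of this top-n pattern; the table and column names are assumptions:

```scala
// Hypothetical schema: PRIMARY KEY (user_id, ts) with ts clustered descending.
// Restricting the query to one Cassandra partition makes limit(10) a true top 10.
sc.cassandraTable("ks", "user_events")
  .where("user_id = ?", "u-42") // single-partition predicate pushed down
  .limit(10)                    // at most 10 rows, taken in clustering order
```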
Filters currently selected set of columns with a new set of columns
Repartitions the data (via a shuffle) based upon the replication of the given keyspaceName and tableName.
Calling this method before using joinWithCassandraTable will ensure that requests will be coordinator-local. partitionsPerHost controls the number of Spark partitions that will be created in this repartitioning event.
The calling RDD must have rows that can be converted into the partition key of the given Cassandra Table.
RowReaderFactory and ClassTag should be provided from implicit parameters in the constructor of the class implementing this trait
CassandraTableScanRDD
Saves the data from the RDD to a new table with a definition taken from the ColumnMapper for this class.
the keyspace in which to create a new table
name of the table to create; the table must not exist
Selects the columns to save data to. Uses only the unique column names, and you must select at least all primary key columns. All other fields are discarded. Non-selected property/column names are left unchanged. This parameter does not affect table creation.
additional configuration object allowing to set consistency level, batch size, etc.
optional, implicit connector to Cassandra
factory for obtaining the row writer to be used to extract column values
from items of the RDD
a column mapper determining the definition of the table
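A minimal sketch; the keyspace, table name, and case class are assumptions:

```scala
import com.datastax.spark.connector._

case class WordCount(word: String, count: Long)

val rdd = sc.parallelize(Seq(WordCount("foo", 20), WordCount("bar", 1)))
// Derives the table definition from the ColumnMapper for WordCount,
// creates ks.words, then writes the rows.
rdd.saveAsCassandraTable("ks", "words")
```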
Saves the data from the RDD to a new table defined by the given TableDef.
First it creates a new table with all columns from the TableDef
and then it saves RDD content in the same way as saveToCassandra.
The table must not exist prior to this call.
table definition used to create a new table
Selects the columns to save data to. Uses only the unique column names, and you must select at least all primary key columns. All other fields are discarded. Non-selected property/column names are left unchanged. This parameter does not affect table creation.
additional configuration object allowing to set consistency level, batch size, etc.
optional, implicit connector to Cassandra
factory for obtaining the row writer to be used to extract column values
from items of the RDD
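A hedged sketch building a TableDef by hand; the names, types, and exact TableDef constructor signature are assumptions that may differ across connector versions:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.{ColumnDef, PartitionKeyColumn, RegularColumn, TableDef}
import com.datastax.spark.connector.types.{IntType, TextType}

// Hypothetical target: words(word text PRIMARY KEY, count int)
val table = TableDef(
  "ks", "words",
  partitionKey      = Seq(ColumnDef("word", PartitionKeyColumn, TextType)),
  clusteringColumns = Seq.empty,
  regularColumns    = Seq(ColumnDef("count", RegularColumn, IntType)))

sc.parallelize(Seq(("foo", 20), ("bar", 1)))
  .saveAsCassandraTableEx(table, SomeColumns("word", "count"))
```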
Saves the data from the RDD to a Cassandra table. Uses the specified column names.
the name of the Keyspace to use
the name of the Table to use
additional configuration object allowing to set consistency level, batch size, etc.
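A minimal sketch, assuming a table ks.kv with columns key text and value int:

```scala
import com.datastax.spark.connector._

sc.parallelize(Seq(("cat", 30), ("fox", 40)))
  .saveToCassandra("ks", "kv", SomeColumns("key", "value"))
```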
Narrows down the selected set of columns.
Use this for better performance when you don't need all the columns in the result RDD. When called multiple times, it selects a subset of the already selected columns, so once a column has been removed by a previous select call, it is not possible to add it back.
The selected columns are NamedColumnRef instances. This type allows specifying columns for straightforward retrieval as well as reading the TTL or write time of regular columns. Implicit conversions included in the com.datastax.spark.connector package make it possible to provide just column names (which is also backward compatible) and optionally add a .ttl or .writeTime suffix in order to create an appropriate NamedColumnRef instance.
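A sketch of the suffix selectors; the table and column names are assumptions:

```scala
import com.datastax.spark.connector._

sc.cassandraTable("ks", "kv")
  .select("key", "value", "value".ttl, "value".writeTime)
// each row then carries the value column plus its TTL and write time
```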
Returns the names of columns to be selected from the table.
Applies a function to each item, and groups consecutive items having the same value together.
Unlike groupBy, items from the same group must already be next to each other in the original collection. Works locally on each partition, so items from different partitions will never be placed in the same group.
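The per-partition grouping semantics can be illustrated in plain Scala; this is only a local analogue, not the connector's implementation:

```scala
// Groups consecutive elements mapping to the same key, the way spanBy
// groups rows within a single Spark partition.
def spanLocally[T, K](items: Seq[T])(f: T => K): List[(K, List[T])] =
  items.foldRight(List.empty[(K, List[T])]) {
    case (item, (k, group) :: rest) if f(item) == k => (k, item :: group) :: rest
    case (item, acc)                                => (f(item), List(item)) :: acc
  }

spanLocally(Seq(1, 1, 2, 2, 1))(identity)
// → List((1, List(1, 1)), (2, List(2, 2)), (1, List(1)))
```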
Groups items with the same key, assuming the items with the same key are next to each other in the collection. It does not perform a shuffle, therefore it is much faster than using the much more universal Spark RDD groupByKey. For this method to be useful with Cassandra tables, the key must represent a prefix of the primary key, containing at least the partition key of the Cassandra table.
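A hedged sketch, assuming a table whose primary key starts with (key, group) so the String key is a valid primary key prefix:

```scala
import com.datastax.spark.connector._

sc.cassandraTable("ks", "events")
  .select("key", "group", "value")
  .as((k: String, g: Int, v: Double) => (k, (g, v)))
  .spanByKey
// groups per key without a shuffle, e.g. RDD[(String, Seq[(Int, Double)])]
```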
Checks for the existence of the keyspace, table, and columns, and whether the number of selected columns corresponds to the number of columns expected by the target type constructor. If successful, does nothing; otherwise throws an appropriate IOException or AssertionError.
Adds CQL WHERE predicate(s) to the query.
Useful for leveraging secondary indexes in Cassandra.
Implicitly adds an ALLOW FILTERING clause to the WHERE clause; beware, however, that some predicates may be rejected by Cassandra, particularly when they filter on an unindexed, non-clustering column.
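A brief sketch; the table and predicate are assumptions:

```scala
sc.cassandraTable("ks", "users")
  .where("name = ?", "Anna") // secondary-index or clustering predicate pushed down
```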
Returns a copy of this Cassandra RDD with the specified connector.
Allows setting a custom read configuration, e.g. consistency level or fetch size.
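A sketch; the ReadConf parameter names follow the connector's API but may differ across versions:

```scala
import com.datastax.driver.core.ConsistencyLevel
import com.datastax.spark.connector.rdd.ReadConf

sc.cassandraTable("ks", "kv")
  .withReadConf(ReadConf(fetchSizeInRows = 500, consistencyLevel = ConsistencyLevel.ONE))
```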
(cassandraJoinRDD: RDDFunctions[(Left, Right)]).sparkContext
(Since version 1.0.0) use mapPartitionsWithIndex and filter
(Since version 1.0.0) use mapPartitionsWithIndex and flatMap
(Since version 1.0.0) use mapPartitionsWithIndex and foreach
(Since version 1.2.0) use TaskContext.get
(Since version 0.7.0) use mapPartitionsWithIndex
(Since version 1.0.0) use mapPartitionsWithIndex
(Since version 1.0.0) use collect
An RDD that will do a selecting join between the left RDD and the specified Cassandra table. This will perform individual selects to retrieve the rows from Cassandra and will take advantage of RDDs that have been partitioned with the com.datastax.spark.connector.rdd.partitioner.ReplicaPartitioner.