Construct a Pool from a DataFrame, also specifying pairs data in an additional DataFrame.
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import ai.catboost.spark.Pool

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("PoolWithPairsTest")
  .getOrCreate()

val srcData = Seq(
  Row(Vectors.dense(0.1, 0.2, 0.11), "0.12", 0x0L, 0.12f, 0L),
  Row(Vectors.dense(0.97, 0.82, 0.33), "0.22", 0x0L, 0.18f, 1L),
  Row(Vectors.dense(0.13, 0.22, 0.23), "0.34", 0x1L, 1.0f, 2L),
  Row(Vectors.dense(0.23, 0.01, 0.0), "0.0", 0x1L, 1.2f, 3L)
)

val srcDataSchema = Seq(
  StructField("features", SQLDataTypes.VectorType),
  StructField("label", StringType),
  StructField("groupId", LongType),
  StructField("weight", FloatType),
  StructField("sampleId", LongType)
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema))

val srcPairsData = Seq(
  Row(0x0L, 0, 1),
  Row(0x1L, 3, 2)
)
val srcPairsDataSchema = Seq(
  StructField("groupId", LongType),
  StructField("winnerId", IntegerType),
  StructField("loserId", IntegerType)
)
val pairsDf = spark.createDataFrame(
  spark.sparkContext.parallelize(srcPairsData),
  StructType(srcPairsDataSchema)
)

val pool = new Pool(df, pairsDf)
  .setGroupIdCol("groupId")
  .setWeightCol("weight")
  .setSampleIdCol("sampleId")

pool.data.show()
pool.pairsData.show()
Construct a Pool from a DataFrame. Call set*Col methods to specify non-default columns. Only the features and label columns, with the default names "features" and "label", are assumed by default.
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import ai.catboost.spark.Pool

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("PoolTest")
  .getOrCreate()

val srcData = Seq(
  Row(Vectors.dense(0.1, 0.2, 0.11), "0.12", 0x0L, 0.12f),
  Row(Vectors.dense(0.97, 0.82, 0.33), "0.22", 0x0L, 0.18f),
  Row(Vectors.dense(0.13, 0.22, 0.23), "0.34", 0x1L, 1.0f)
)

val srcDataSchema = Seq(
  StructField("features", SQLDataTypes.VectorType),
  StructField("label", StringType),
  StructField("groupId", LongType),
  StructField("weight", FloatType)
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema))

val pool = new Pool(df)
  .setGroupIdCol("groupId")
  .setWeightCol("weight")

pool.data.show()
Persist Datasets of this Pool with the default storage level (MEMORY_AND_DISK).
Returns Pool with eagerly checkpointed Datasets.
Returns Pool with checkpointed Datasets.
Returns Pool with checkpointed Datasets.
Whether to checkpoint Datasets immediately
Used to add additional columns to data (for example, estimated features). It is impossible to implement this as an external function because copyValues is protected.
Number of objects in the dataset; analogous to the count method of org.apache.spark.sql.Dataset.
Ensure that, if groups are present, the data in each partition contains whole groups in consecutive order.
Dimension of the formula baseline; 0 if no baseline is specified.
Returns Pool with eagerly locally checkpointed Datasets.
Returns Pool with locally checkpointed Datasets.
Returns Pool with locally checkpointed Datasets.
Whether to checkpoint Datasets immediately
Map over partitions of a quantized Pool.
Number of pairs in the dataset
Persist Datasets of this Pool with the default storage level (MEMORY_AND_DISK).
Returns Pool with Datasets persisted with the given storage level.
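A minimal sketch of how this might be used, assuming the persist method accepts a Spark StorageLevel (mirroring Spark's own Dataset.persist):

```scala
import org.apache.spark.storage.StorageLevel

// Persist the Pool's underlying Datasets in memory only
// (signature assumed to mirror Spark's Dataset.persist).
val persistedPool = pool.persist(StorageLevel.MEMORY_ONLY)

// ... run training several times on persistedPool ...

// Release the cached blocks when done.
persistedPool.unpersist()
```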
Create Pool with quantized features from Pool with raw features.
Create a Pool with quantized features from a Pool with raw features. This variant of the method is useful if a QuantizedFeaturesInfo with the data needed for quantization (borders and NaN modes) has already been computed. It is used, for example, to quantize evaluation datasets after the training dataset has been quantized.
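A sketch of this workflow, assuming the already-quantized Pool exposes its computed quantization data via a quantizedFeaturesInfo member:

```scala
// Quantize the training dataset, computing borders and NaN modes.
val quantizedTrainPool = trainPool.quantize(new QuantizationParams)

// Reuse the same borders and NaN modes for the evaluation dataset,
// so both Pools share a consistent quantization.
val quantizedEvalPool = evalPool.quantize(quantizedTrainPool.quantizedFeaturesInfo)
```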
Create Pool with quantized features from Pool with raw features
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import ai.catboost.spark.{Pool, QuantizationParams}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("QuantizationTest")
  .getOrCreate()

val srcData = Seq(
  Row(Vectors.dense(0.1, 0.2, 0.11), "0.12"),
  Row(Vectors.dense(0.97, 0.82, 0.33), "0.22"),
  Row(Vectors.dense(0.13, 0.22, 0.23), "0.34")
)

val srcDataSchema = Seq(
  StructField("features", SQLDataTypes.VectorType),
  StructField("label", StringType)
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema))

val pool = new Pool(df)

val quantizedPool = pool.quantize(new QuantizationParams)
val quantizedPoolWithTwoBinsPerFeature = pool.quantize(new QuantizationParams().setBorderCount(1))

quantizedPool.data.show()
quantizedPoolWithTwoBinsPerFeature.data.show()
Create Pool with quantized features from Pool with raw features.
Repartition data to the specified number of partitions.
Repartition data to the specified number of partitions. Useful for creating one partition per executor for training (where each executor gets its own CatBoost worker with a part of the training data).
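For example (a sketch; the executor count here is an assumed value for illustration):

```scala
// One partition per executor: each executor then hosts one CatBoost worker
// with its share of the training data.
val executorCount = 4
val repartitionedPool = pool.repartition(executorCount)
```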
Create a subset of this Pool containing the specified fraction of the samples (or of the groups of samples, if present).
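A sketch, assuming this method is named sample and takes a fraction in [0, 1]:

```scala
// Keep roughly 30% of the samples (whole groups are kept together
// when a group id column is specified).
val subsetPool = pool.sample(0.3)
```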
Mark Datasets of this Pool as non-persistent, and remove all blocks for them from memory and disk.
Mark Datasets of this Pool as non-persistent, and remove all blocks for them from memory and disk.
Whether to block until all blocks are deleted.
Mark Datasets of this Pool as non-persistent, and remove all blocks for them from memory and disk.
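For example, assuming an overload that takes a blocking flag (as in Spark's Dataset.unpersist):

```scala
// Block until all cached blocks are actually deleted.
pool.unpersist(blocking = true)
```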
Interface for saving the content out into external storage (API similar to Spark's Dataset).
CatBoost's abstraction of a dataset.
Features data can be stored in raw form (the features column has org.apache.spark.ml.linalg.Vector type) or in quantized form (float feature values are quantized into integer bin values; the features column has Array[Byte] type). A raw Pool can be transformed to quantized form using the quantize method. This is useful if the dataset is used for training multiple times and the quantization parameters do not change. A pre-quantized Pool makes it possible to cache quantized features data and avoid re-running the feature quantization step at the start of each training.
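As an illustration of why pre-quantization helps (a sketch; CatBoostRegressor and its fit method are assumed from the catboost-spark training API):

```scala
// Quantize once...
val quantizedPool = pool.quantize(new QuantizationParams)

// ...then train multiple models without re-running feature quantization.
val model1 = new CatBoostRegressor().setIterations(100).fit(quantizedPool)
val model2 = new CatBoostRegressor().setIterations(500).fit(quantizedPool)
```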