package org.apache.spark.ml.feature

Feature transformers

The ml.feature package provides common feature transformers that convert raw data or features into forms better suited to model fitting. Most feature transformers are implemented as Transformers, which transform one DataFrame into another, e.g., HashingTF. Some are implemented as Estimators, because the transformation requires aggregated information about the dataset, e.g., document frequencies in IDF. For those, Estimator.fit must be called first to obtain a model, e.g., IDFModel, which then applies the transformation. The transformation usually appends new columns to the input DataFrame, so all input columns are carried over.
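The difference between the two kinds of feature transformer can be sketched as follows. This is a minimal illustration rather than a canonical recipe: the local SparkSession, the sample rows, and the column names (`words`, `tf`, `tf_idf`) are assumptions made for the example.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("feature-sketch").getOrCreate()

// Illustrative data: documents that are already tokenized.
val docs = spark.createDataFrame(Seq(
  (0, Seq("spark", "feature", "transformers")),
  (1, Seq("spark", "pipelines"))
)).toDF("id", "words")

// HashingTF is a Transformer, so it can be applied to a DataFrame directly.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .setNumFeatures(1 << 10)
val withTf = hashingTF.transform(docs)

// IDF is an Estimator: fit aggregates document frequencies and returns an IDFModel,
// and the model is what performs the transformation.
val idfModel: IDFModel = new IDF()
  .setInputCol("tf")
  .setOutputCol("tf_idf")
  .fit(withTf)
val withTfIdf = idfModel.transform(withTf)

// Input columns are carried over and the new columns are appended.
println(withTfIdf.columns.mkString(", ")) // prints: id, words, tf, tf_idf
```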

We try to keep each transformer minimal so that feature transformation pipelines are flexible to assemble. Pipeline can be used to chain feature transformers, and VectorAssembler can be used to combine the outputs of multiple feature transformations, for example:

import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline

// a DataFrame with three columns: id (integer), text (string), and rating (double).
val df = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark", 3.0),
  (1, "I wish Java could use case classes", 4.0),
  (2, "Logistic regression models are neat", 4.0)
)).toDF("id", "text", "rating")

// define feature transformers
val tok = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val sw = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered_words")
val tf = new HashingTF()
  .setInputCol("filtered_words")
  .setOutputCol("tf")
  .setNumFeatures(10000)
val idf = new IDF()
  .setInputCol("tf")
  .setOutputCol("tf_idf")
val assembler = new VectorAssembler()
  .setInputCols(Array("tf_idf", "rating"))
  .setOutputCol("features")

// assemble and fit the feature transformation pipeline
val pipeline = new Pipeline()
  .setStages(Array(tok, sw, tf, idf, assembler))
val model = pipeline.fit(df)

// save transformed features with raw data
model.transform(df)
  .select("id", "text", "rating", "features")
  .write.format("parquet").save("/output/path")

Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which makes them more efficient and flexible when handling large and complex datasets.
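The lazy, column-wise behavior can be observed with a simple Transformer such as Binarizer. The sketch below assumes a local SparkSession and made-up rating data; the column names `rating` and `liked` are illustrative only.

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("lazy-sketch").getOrCreate()

val ratings = spark.createDataFrame(Seq(
  (0, 3.0),
  (1, 4.5),
  (2, 1.0)
)).toDF("id", "rating")

val binarizer = new Binarizer()
  .setInputCol("rating")
  .setOutputCol("liked")
  .setThreshold(3.5)

// transform only describes the new column; no rows are computed at this point.
val binarized = binarizer.transform(ratings)

// The output schema is already known, and only the "rating" column is read.
println(binarized.columns.mkString(", ")) // prints: id, rating, liked

// An action such as collect triggers the actual computation.
val liked = binarized.orderBy("id").select("liked").collect().map(_.getDouble(0))
println(liked.mkString(", ")) // prints: 0.0, 1.0, 0.0
```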

See also

scikit-learn.preprocessing

Linear Supertypes
AnyRef, Any

Type Members

  1. final class Binarizer extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable

    Binarize a column of continuous features given a threshold.

  2. class BucketedRandomProjectionLSH extends LSH[BucketedRandomProjectionLSHModel] with BucketedRandomProjectionLSHParams with HasSeed

    :: Experimental ::

  3. class BucketedRandomProjectionLSHModel extends LSHModel[BucketedRandomProjectionLSHModel] with BucketedRandomProjectionLSHParams

    :: Experimental ::

  4. final class Bucketizer extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable

    Bucketizer maps a column of continuous features to a column of feature buckets.

  5. final class ChiSqSelector extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams with DefaultParamsWritable

    Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.

  6. final class ChiSqSelectorModel extends Model[ChiSqSelectorModel] with ChiSqSelectorParams with MLWritable

    Model fitted by ChiSqSelector.

  7. class CountVectorizer extends Estimator[CountVectorizerModel] with CountVectorizerParams with DefaultParamsWritable

    Extracts a vocabulary from document collections and generates a CountVectorizerModel.

  8. class CountVectorizerModel extends Model[CountVectorizerModel] with CountVectorizerParams with MLWritable

    Converts a text document to a sparse vector of token counts.

  9. class DCT extends UnaryTransformer[Vector, Vector, DCT] with DefaultParamsWritable

    A feature transformer that takes the 1D discrete cosine transform of a real vector.

  10. class ElementwiseProduct extends UnaryTransformer[Vector, Vector, ElementwiseProduct] with DefaultParamsWritable

    Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector.

  11. class HashingTF extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable

    Maps a sequence of terms to their term frequencies using the hashing trick.

  12. final class IDF extends Estimator[IDFModel] with IDFBase with DefaultParamsWritable

    Compute the Inverse Document Frequency (IDF) given a collection of documents.

  13. class IDFModel extends Model[IDFModel] with IDFBase with MLWritable

    Model fitted by IDF.

  14. class IndexToString extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable

    A Transformer that maps a column of indices back to a new column of corresponding string values.

  15. class Interaction extends Transformer with HasInputCols with HasOutputCol with DefaultParamsWritable

    Implements the feature interaction transform.

  16. case class LabeledPoint(label: Double, features: Vector) extends Product with Serializable

    Class that represents the features and label of a data point.

  17. class MaxAbsScaler extends Estimator[MaxAbsScalerModel] with MaxAbsScalerParams with DefaultParamsWritable

    Rescale each feature individually to the range [-1, 1] by dividing by the largest absolute value in each feature.

  18. class MaxAbsScalerModel extends Model[MaxAbsScalerModel] with MaxAbsScalerParams with MLWritable

    Model fitted by MaxAbsScaler.

  19. class MinHashLSH extends LSH[MinHashLSHModel] with HasSeed

    :: Experimental ::

  20. class MinHashLSHModel extends LSHModel[MinHashLSHModel]

    :: Experimental ::

  21. class MinMaxScaler extends Estimator[MinMaxScalerModel] with MinMaxScalerParams with DefaultParamsWritable

    Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.

  22. class MinMaxScalerModel extends Model[MinMaxScalerModel] with MinMaxScalerParams with MLWritable

    Model fitted by MinMaxScaler.

  23. class NGram extends UnaryTransformer[Seq[String], Seq[String], NGram] with DefaultParamsWritable

    A feature transformer that converts the input array of strings into an array of n-grams.

  24. class Normalizer extends UnaryTransformer[Vector, Vector, Normalizer] with DefaultParamsWritable

    Normalize a vector to have unit norm using the given p-norm.

  25. class OneHotEncoder extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable

    A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.

  26. class PCA extends Estimator[PCAModel] with PCAParams with DefaultParamsWritable

    PCA trains a model to project vectors to a lower dimensional space of the top PCA.k principal components.

  27. class PCAModel extends Model[PCAModel] with PCAParams with MLWritable

    Model fitted by PCA.

  28. class PolynomialExpansion extends UnaryTransformer[Vector, Vector, PolynomialExpansion] with DefaultParamsWritable

    Perform feature expansion in a polynomial space.

  29. final class QuantileDiscretizer extends Estimator[Bucketizer] with QuantileDiscretizerBase with DefaultParamsWritable

    QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.

  30. class RFormula extends Estimator[RFormulaModel] with RFormulaBase with DefaultParamsWritable

    :: Experimental :: Implements the transforms required for fitting a dataset against an R model formula.

  31. class RFormulaModel extends Model[RFormulaModel] with RFormulaBase with MLWritable

    :: Experimental :: Model fitted by RFormula.

  32. class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer] with DefaultParamsWritable

    A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false).

  33. class SQLTransformer extends Transformer with DefaultParamsWritable

    Implements transformations that are defined by a SQL statement.

  34. class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams with DefaultParamsWritable

    Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

  35. class StandardScalerModel extends Model[StandardScalerModel] with StandardScalerParams with MLWritable

    Model fitted by StandardScaler.

  36. class StopWordsRemover extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable

    A feature transformer that filters out stop words from input.

  37. class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerBase with DefaultParamsWritable

    A label indexer that maps a string column of labels to an ML column of label indices.

  38. class StringIndexerModel extends Model[StringIndexerModel] with StringIndexerBase with MLWritable

    Model fitted by StringIndexer.

  39. class Tokenizer extends UnaryTransformer[String, Seq[String], Tokenizer] with DefaultParamsWritable

    A tokenizer that converts the input string to lowercase and then splits it by white spaces.

  40. class VectorAssembler extends Transformer with HasInputCols with HasOutputCol with DefaultParamsWritable

    A feature transformer that merges multiple columns into a vector column.

  41. class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams with DefaultParamsWritable

    Class for indexing categorical feature columns in a dataset of Vector.

  42. class VectorIndexerModel extends Model[VectorIndexerModel] with VectorIndexerParams with MLWritable

    Model fitted by VectorIndexer.

  43. final class VectorSlicer extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable

    This class takes a feature vector and outputs a new feature vector with a subarray of the original features.

  44. final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase with DefaultParamsWritable

    Word2Vec trains a model of Map(String, Vector), i.e., it transforms a word into a code for further natural language processing or machine learning.

  45. class Word2VecModel extends Model[Word2VecModel] with Word2VecBase with MLWritable

    Model fitted by Word2Vec.
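The signatures above follow a common pattern: an `Estimator[M]` such as StringIndexer is fitted to produce its model `M`, and plain Transformers such as IndexToString are applied directly. As a hedged sketch of how two of the listed classes fit together, the example below indexes a string column and then maps the indices back. The local SparkSession, the sample data, and the column names are assumptions for illustration.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("indexer-sketch").getOrCreate()

val categories = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
)).toDF("id", "category")

// StringIndexer is an Estimator[StringIndexerModel]: fit learns the label ordering
// (most frequent label first).
val indexerModel: StringIndexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(categories)
val indexed = indexerModel.transform(categories)

// IndexToString is a Transformer: it reads the labels from the metadata of the
// indexed column, so no fitting step is needed.
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
val restored = converter.transform(indexed)

val rows = restored
  .orderBy("id")
  .select("category", "categoryIndex", "originalCategory")
  .collect()
rows.foreach(println)
```

Each row should show the same string in `category` and `originalCategory`, with the learned index in between.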

Value Members

  1. object Binarizer extends DefaultParamsReadable[Binarizer] with Serializable

    Annotations
    @Since( "1.6.0" )
  2. object BucketedRandomProjectionLSH extends DefaultParamsReadable[BucketedRandomProjectionLSH] with Serializable

    Annotations
    @Since( "2.1.0" )
  3. object BucketedRandomProjectionLSHModel extends MLReadable[BucketedRandomProjectionLSHModel] with Serializable

    Annotations
    @Since( "2.1.0" )
  4. object Bucketizer extends DefaultParamsReadable[Bucketizer] with Serializable

    Annotations
    @Since( "1.6.0" )
  5. object ChiSqSelector extends DefaultParamsReadable[ChiSqSelector] with Serializable

    Annotations
    @Since( "1.6.0" )
  6. object ChiSqSelectorModel extends MLReadable[ChiSqSelectorModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  7. object CountVectorizer extends DefaultParamsReadable[CountVectorizer] with Serializable

    Annotations
    @Since( "1.6.0" )
  8. object CountVectorizerModel extends MLReadable[CountVectorizerModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  9. object DCT extends DefaultParamsReadable[DCT] with Serializable

    Annotations
    @Since( "1.6.0" )
  10. object ElementwiseProduct extends DefaultParamsReadable[ElementwiseProduct] with Serializable

    Annotations
    @Since( "2.0.0" )
  11. object HashingTF extends DefaultParamsReadable[HashingTF] with Serializable

    Annotations
    @Since( "1.6.0" )
  12. object IDF extends DefaultParamsReadable[IDF] with Serializable

    Annotations
    @Since( "1.6.0" )
  13. object IDFModel extends MLReadable[IDFModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  14. object IndexToString extends DefaultParamsReadable[IndexToString] with Serializable

    Annotations
    @Since( "1.6.0" )
  15. object Interaction extends DefaultParamsReadable[Interaction] with Serializable

    Annotations
    @Since( "1.6.0" )
  16. object MaxAbsScaler extends DefaultParamsReadable[MaxAbsScaler] with Serializable

    Annotations
    @Since( "2.0.0" )
  17. object MaxAbsScalerModel extends MLReadable[MaxAbsScalerModel] with Serializable

    Annotations
    @Since( "2.0.0" )
  18. object MinHashLSH extends DefaultParamsReadable[MinHashLSH] with Serializable

    Annotations
    @Since( "2.1.0" )
  19. object MinHashLSHModel extends MLReadable[MinHashLSHModel] with Serializable

    Annotations
    @Since( "2.1.0" )
  20. object MinMaxScaler extends DefaultParamsReadable[MinMaxScaler] with Serializable

    Annotations
    @Since( "1.6.0" )
  21. object MinMaxScalerModel extends MLReadable[MinMaxScalerModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  22. object NGram extends DefaultParamsReadable[NGram] with Serializable

    Annotations
    @Since( "1.6.0" )
  23. object Normalizer extends DefaultParamsReadable[Normalizer] with Serializable

    Annotations
    @Since( "1.6.0" )
  24. object OneHotEncoder extends DefaultParamsReadable[OneHotEncoder] with Serializable

    Annotations
    @Since( "1.6.0" )
  25. object PCA extends DefaultParamsReadable[PCA] with Serializable

    Annotations
    @Since( "1.6.0" )
  26. object PCAModel extends MLReadable[PCAModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  27. object PolynomialExpansion extends DefaultParamsReadable[PolynomialExpansion] with Serializable

    The expansion is done via recursion.

  28. object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging with Serializable

    Annotations
    @Since( "1.6.0" )
  29. object RFormula extends DefaultParamsReadable[RFormula] with Serializable

    Annotations
    @Since( "2.0.0" )
  30. object RFormulaModel extends MLReadable[RFormulaModel] with Serializable

    Annotations
    @Since( "2.0.0" )
  31. object RegexTokenizer extends DefaultParamsReadable[RegexTokenizer] with Serializable

    Annotations
    @Since( "1.6.0" )
  32. object SQLTransformer extends DefaultParamsReadable[SQLTransformer] with Serializable

    Annotations
    @Since( "1.6.0" )
  33. object StandardScaler extends DefaultParamsReadable[StandardScaler] with Serializable

    Annotations
    @Since( "1.6.0" )
  34. object StandardScalerModel extends MLReadable[StandardScalerModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  35. object StopWordsRemover extends DefaultParamsReadable[StopWordsRemover] with Serializable

    Annotations
    @Since( "1.6.0" )
  36. object StringIndexer extends DefaultParamsReadable[StringIndexer] with Serializable

    Annotations
    @Since( "1.6.0" )
  37. object StringIndexerModel extends MLReadable[StringIndexerModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  38. object Tokenizer extends DefaultParamsReadable[Tokenizer] with Serializable

    Annotations
    @Since( "1.6.0" )
  39. object VectorAssembler extends DefaultParamsReadable[VectorAssembler] with Serializable

    Annotations
    @Since( "1.6.0" )
  40. object VectorIndexer extends DefaultParamsReadable[VectorIndexer] with Serializable

    Annotations
    @Since( "1.6.0" )
  41. object VectorIndexerModel extends MLReadable[VectorIndexerModel] with Serializable

    Annotations
    @Since( "1.6.0" )
  42. object VectorSlicer extends DefaultParamsReadable[VectorSlicer] with Serializable

    Annotations
    @Since( "1.6.0" )
  43. object Word2Vec extends DefaultParamsReadable[Word2Vec] with Serializable

    Annotations
    @Since( "1.6.0" )
  44. object Word2VecModel extends MLReadable[Word2VecModel] with Serializable

    Annotations
    @Since( "1.6.0" )
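The companion objects listed above extend `DefaultParamsReadable` or `MLReadable`, which is what provides `load` for reading a saved transformer or model back. A minimal round trip might look like the sketch below. The local SparkSession, the temporary directory, and the parameter values are assumptions chosen for the example.

```scala
import java.nio.file.Files
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("persist-sketch").getOrCreate()

val binarizer = new Binarizer()
  .setInputCol("rating")
  .setOutputCol("liked")
  .setThreshold(3.5)

// Hypothetical location: a fresh path under a temporary directory.
val path = Files.createTempDirectory("binarizer-sketch").resolve("binarizer").toString

// The class side (DefaultParamsWritable) writes the params to disk.
binarizer.write.overwrite().save(path)

// The companion object (DefaultParamsReadable[Binarizer]) reads them back.
val loaded = Binarizer.load(path)
println(s"${loaded.getInputCol} -> ${loaded.getOutputCol}, threshold = ${loaded.getThreshold}")
```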
