package org.apache.spark.ml.feature

Feature transformers

The ml.feature package provides common feature transformers that help convert raw data or features into forms more suitable for model fitting. Most feature transformers are implemented as Transformers, which transform one DataFrame into another, e.g., HashingTF. Some feature transformers are implemented as Estimators, because the transformation requires some aggregated information about the dataset, e.g., document frequencies in IDF. For those feature transformers, calling Estimator.fit is required to obtain the model first, e.g., IDFModel, before the transformation can be applied. The transformation is usually done by appending new columns to the input DataFrame, so all input columns are carried over.

We try to keep each transformer minimal, so that it remains flexible to assemble feature transformation pipelines. Pipeline can be used to chain feature transformers, and VectorAssembler can be used to combine the outputs of multiple feature transformers into a single feature vector, for example:

import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline

// a DataFrame with three columns: id (integer), text (string), and rating (double).
val df = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark", 3.0),
  (1, "I wish Java could use case classes", 4.0),
  (2, "Logistic regression models are neat", 4.0)
)).toDF("id", "text", "rating")

// define feature transformers
val tok = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val sw = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered_words")
val tf = new HashingTF()
  .setInputCol("filtered_words")
  .setOutputCol("tf")
  .setNumFeatures(10000)
val idf = new IDF()
  .setInputCol("tf")
  .setOutputCol("tf_idf")
val assembler = new VectorAssembler()
  .setInputCols(Array("tf_idf", "rating"))
  .setOutputCol("features")

// assemble and fit the feature transformation pipeline
val pipeline = new Pipeline()
  .setStages(Array(tok, sw, tf, idf, assembler))
val model = pipeline.fit(df)

// save transformed features with raw data
model.transform(df)
  .select("id", "text", "rating", "features")
  .write.format("parquet").save("/output/path")

Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which is more efficient and flexible for handling large and complex datasets.

See also

scikit-learn.preprocessing


Type Members

  1. final class Binarizer extends Transformer with HasInputCol with HasOutputCol

    :: Experimental :: Binarize a column of continuous features given a threshold.
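
    A minimal usage sketch (assuming a SQLContext named sqlContext is in scope, as in the package example above); values greater than the threshold map to 1.0 and the rest to 0.0:

    import org.apache.spark.ml.feature.Binarizer

    val data = sqlContext.createDataFrame(Seq(
      (0, 0.1), (1, 0.8), (2, 0.2)
    )).toDF("id", "feature")

    val binarizer = new Binarizer()
      .setInputCol("feature")
      .setOutputCol("binarized_feature")
      .setThreshold(0.5)

    // 0.1 -> 0.0, 0.8 -> 1.0, 0.2 -> 0.0
    binarizer.transform(data).show()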

    Annotations
    @Experimental()
  2. final class Bucketizer extends Model[Bucketizer] with HasInputCol with HasOutputCol

    :: Experimental :: Bucketizer maps a column of continuous features to a column of feature buckets.
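
    For example, a minimal sketch with four buckets whose outer bounds are infinite (assuming sqlContext as in the package example above); n+1 split points define n buckets, and each value is replaced by the index of its bucket:

    import org.apache.spark.ml.feature.Bucketizer

    val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

    val data = sqlContext.createDataFrame(
      Seq(-999.9, -0.5, -0.3, 0.0, 0.2).map(Tuple1.apply)
    ).toDF("features")

    val bucketizer = new Bucketizer()
      .setInputCol("features")
      .setOutputCol("bucketedFeatures")
      .setSplits(splits)

    // bucket indices: 0.0, 1.0, 1.0, 2.0, 2.0
    bucketizer.transform(data).show()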

    Annotations
    @Experimental()
  3. class CountVectorizer extends Estimator[CountVectorizerModel] with CountVectorizerParams

    :: Experimental :: Extracts a vocabulary from document collections and generates a CountVectorizerModel.
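
    A minimal sketch (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.CountVectorizer

    val df = sqlContext.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    // fit a CountVectorizerModel from the corpus: at most 3 vocabulary terms,
    // each of which must appear in at least 2 documents
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

    // each row becomes a sparse vector of token counts over the vocabulary
    cvModel.transform(df).show()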

    Annotations
    @Experimental()
  4. class CountVectorizerModel extends Model[CountVectorizerModel] with CountVectorizerParams

    :: Experimental :: Converts a text document to a sparse vector of token counts.

    Annotations
    @Experimental()
  5. class DCT extends UnaryTransformer[Vector, Vector, DCT]

    :: Experimental :: A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The returned vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

    More information: http://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II
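
    A minimal sketch (assuming sqlContext as in the package example above; setInverse(false) selects the forward transform):

    import org.apache.spark.ml.feature.DCT
    import org.apache.spark.mllib.linalg.Vectors

    val data = Seq(
      Vectors.dense(0.0, 1.0, -2.0, 3.0),
      Vectors.dense(-1.0, 2.0, 4.0, -7.0)
    )
    val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    val dct = new DCT()
      .setInputCol("features")
      .setOutputCol("featuresDCT")
      .setInverse(false)

    // each output vector has the same length as its input
    dct.transform(df).show()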

    Annotations
    @Experimental()
  6. class ElementwiseProduct extends UnaryTransformer[Vector, Vector, ElementwiseProduct]

    :: Experimental :: Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
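
    A minimal sketch (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.ElementwiseProduct
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      ("a", Vectors.dense(1.0, 2.0, 3.0)),
      ("b", Vectors.dense(4.0, 5.0, 6.0))
    )).toDF("id", "vector")

    val transformer = new ElementwiseProduct()
      .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))
      .setInputCol("vector")
      .setOutputCol("transformedVector")

    // (1,2,3) * (0,1,2) = (0,2,6), element-wise
    transformer.transform(df).show()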

    Annotations
    @Experimental()
  7. class HashingTF extends Transformer with HasInputCol with HasOutputCol

    :: Experimental :: Maps a sequence of terms to their term frequencies using the hashing trick.

    Annotations
    @Experimental()
  8. final class IDF extends Estimator[IDFModel] with IDFBase

    :: Experimental :: Compute the Inverse Document Frequency (IDF) given a collection of documents.

    Annotations
    @Experimental()
  9. class IDFModel extends Model[IDFModel] with IDFBase

    :: Experimental :: Model fitted by IDF.

    Annotations
    @Experimental()
  10. class IndexToString extends Transformer with HasInputCol with HasOutputCol

    :: Experimental :: A Transformer that maps a column of string indices back to a new column of corresponding string values, using either the ML attributes of the input column or, if provided, the labels supplied by the user. All original columns are kept during transformation.

    Annotations
    @Experimental()
    See also

    StringIndexer for converting strings into indices

  11. class MinMaxScaler extends Estimator[MinMaxScalerModel] with MinMaxScalerParams

    :: Experimental :: Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling. The rescaled value for a feature E is calculated as

    Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min

    For the case E_{max} == E_{min}, Rescaled(e_i) = 0.5 * (max + min). Note that since zero values will probably be transformed to non-zero values, the output of the transformer will be a DenseVector even for sparse input.
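
    A minimal sketch rescaling each feature to the default [0, 1] range (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.MinMaxScaler
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.1, -1.0)),
      (1, Vectors.dense(2.0, 1.1, 1.0)),
      (2, Vectors.dense(3.0, 10.1, 3.0))
    )).toDF("id", "features")

    // fitting computes E_min and E_max per feature; transform applies the rescaling
    val scalerModel = new MinMaxScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .fit(df)

    scalerModel.transform(df).show()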

    Annotations
    @Experimental()
  12. class MinMaxScalerModel extends Model[MinMaxScalerModel] with MinMaxScalerParams

    :: Experimental :: Model fitted by MinMaxScaler.

    Annotations
    @Experimental()
  13. class NGram extends UnaryTransformer[Seq[String], Seq[String], NGram]

    :: Experimental :: A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

    When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
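
    A minimal bigram sketch (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.NGram

    val df = sqlContext.createDataFrame(Seq(
      (0, Array("Hi", "I", "heard", "about", "Spark")),
      (1, Array("I", "wish", "Java", "could", "use", "case", "classes"))
    )).toDF("id", "words")

    val ngram = new NGram()
      .setN(2)
      .setInputCol("words")
      .setOutputCol("ngrams")

    // e.g. "Hi I", "I heard", "heard about", "about Spark"
    ngram.transform(df).show()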

    Annotations
    @Experimental()
  14. class Normalizer extends UnaryTransformer[Vector, Vector, Normalizer]

    :: Experimental :: Normalize a vector to have unit norm using the given p-norm.
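
    For example, a minimal sketch normalizing each row to unit L^1 norm (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.Normalizer
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.5, -1.0)),
      (1, Vectors.dense(2.0, 1.0, 1.0))
    )).toDF("id", "features")

    val normalizer = new Normalizer()
      .setInputCol("features")
      .setOutputCol("normFeatures")
      .setP(1.0)

    // the absolute values of each output vector sum to 1.0
    normalizer.transform(df).show()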

    Annotations
    @Experimental()
  15. class OneHotEncoder extends Transformer with HasInputCol with HasOutputCol

    :: Experimental :: A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.
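
    A minimal sketch chaining a StringIndexer with the encoder (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val df = sqlContext.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
    )).toDF("id", "category")

    // map strings to category indices first
    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // then encode the indices as sparse binary vectors
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")

    encoder.transform(indexed).show()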

    Annotations
    @Experimental()
    See also

    StringIndexer for converting categorical values into category indices

  16. class PCA extends Estimator[PCAModel] with PCAParams

    :: Experimental :: PCA trains a model to project vectors to a low-dimensional space using PCA.
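
    A minimal sketch projecting 5-dimensional vectors onto their top 3 principal components (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.mllib.linalg.Vectors

    val data = Seq(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(6.0, 7.0, 0.0, 8.0, 0.0)
    )
    val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    val pcaModel = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)
      .fit(df)

    pcaModel.transform(df).show()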

    Annotations
    @Experimental()
  17. class PCAModel extends Model[PCAModel] with PCAParams

    :: Experimental :: Model fitted by PCA.

    Annotations
    @Experimental()
  18. class PolynomialExpansion extends UnaryTransformer[Vector, Vector, PolynomialExpansion]

    :: Experimental :: Perform feature expansion in a polynomial space. As stated in the Wikipedia article on polynomial expansion (http://en.wikipedia.org/wiki/Polynomial_expansion), "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". Take a 2-variable feature vector (x, y) as an example: if we expand it with degree 2, we get (x, x * x, y, x * y, y * y).
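
    A minimal sketch matching the degree-2 example above (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      Tuple1(Vectors.dense(-2.0, 2.3)),
      Tuple1(Vectors.dense(0.6, -1.1))
    )).toDF("features")

    // expand (x, y) into (x, x*x, y, x*y, y*y)
    val polyExpansion = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)

    polyExpansion.transform(df).show()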

    Annotations
    @Experimental()
  19. class RFormula extends Estimator[RFormulaModel] with RFormulaBase

    :: Experimental :: Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including '.', '~', '+', and '-'. Also see the R formula docs: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html
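
    A minimal sketch (assuming sqlContext as in the package example above); the formula below produces a features vector from country and hour and a label column from clicked:

    import org.apache.spark.ml.feature.RFormula

    val df = sqlContext.createDataFrame(Seq(
      (7, "US", 18, 1.0),
      (8, "CA", 12, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

    val formula = new RFormula()
      .setFormula("clicked ~ country + hour")
      .setFeaturesCol("features")
      .setLabelCol("label")

    formula.fit(df).transform(df).show()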

    Annotations
    @Experimental()
  20. class RFormulaModel extends Model[RFormulaModel] with RFormulaBase

    :: Experimental :: A fitted RFormula. Fitting is required to determine the factor levels of formula terms.

    Annotations
    @Experimental()
  21. class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

    :: Experimental :: A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimum length. It returns an array of strings that can be empty.

    Annotations
    @Experimental()
  22. class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams

    :: Experimental :: Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
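
    A minimal sketch (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.1, -1.0)),
      (1, Vectors.dense(2.0, 1.1, 1.0)),
      (2, Vectors.dense(3.0, 10.1, 3.0))
    )).toDF("id", "features")

    // fitting computes the column summary statistics; transform applies the scaling
    val scalerModel = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithStd(true)
      .setWithMean(false)
      .fit(df)

    scalerModel.transform(df).show()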

    Annotations
    @Experimental()
  23. class StandardScalerModel extends Model[StandardScalerModel] with StandardScalerParams

    :: Experimental :: Model fitted by StandardScaler.

    Annotations
    @Experimental()
  24. class StopWordsRemover extends Transformer with HasInputCol with HasOutputCol

    :: Experimental :: A feature transformer that filters out stop words from input. Note: null values in the input array are preserved unless null is explicitly added to stopWords.
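
    A minimal sketch using the default English stop word list (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.StopWordsRemover

    val df = sqlContext.createDataFrame(Seq(
      (0, Array("I", "saw", "the", "red", "balloon")),
      (1, Array("Mary", "had", "a", "little", "lamb"))
    )).toDF("id", "raw")

    val remover = new StopWordsRemover()
      .setInputCol("raw")
      .setOutputCol("filtered")

    // drops stop words such as "I", "the", "had", "a"
    remover.transform(df).show()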

    Annotations
    @Experimental()
    See also

    http://en.wikipedia.org/wiki/Stop_words

  25. class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerBase

    :: Experimental :: A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.
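
    A minimal sketch (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.StringIndexer

    val df = sqlContext.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
    )).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")

    // "a" is most frequent, so it gets index 0.0; then "c" -> 1.0 and "b" -> 2.0
    indexer.fit(df).transform(df).show()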

    Annotations
    @Experimental()
    See also

    IndexToString for the inverse transformation

  26. class StringIndexerModel extends Model[StringIndexerModel] with StringIndexerBase

    :: Experimental :: Model fitted by StringIndexer.

    NOTE: During transformation, if the input column does not exist, StringIndexerModel.transform would return the input dataset unmodified. This is a temporary fix for the case when target labels do not exist during prediction.

    Annotations
    @Experimental()
  27. class Tokenizer extends UnaryTransformer[String, Seq[String], Tokenizer]

    :: Experimental :: A tokenizer that converts the input string to lowercase and then splits it by white spaces.

    Annotations
    @Experimental()
    See also

    RegexTokenizer

  28. class VectorAssembler extends Transformer with HasInputCols with HasOutputCol

    :: Experimental :: A feature transformer that merges multiple columns into a vector column.

    Annotations
    @Experimental()
  29. class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams

    :: Experimental :: Class for indexing categorical feature columns in a dataset of Vector.

    This has 2 usage modes:

    • Automatically identify categorical features (default behavior)
      • This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
      • Set maxCategories to the maximum number of categories any categorical feature should have.
      • E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
    • Index all features, if all features are categorical
      • If maxCategories is set to be very large, then this will build an index of unique values for all features.
      • Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
      • E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.

    This returns a model which can transform categorical features to use 0-based indices.
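
    A minimal sketch of the default (automatic) mode, matching the example above (assuming sqlContext as in the package example):

    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      Tuple1(Vectors.dense(-1.0, 1.0)),
      Tuple1(Vectors.dense(0.0, 3.0)),
      Tuple1(Vectors.dense(0.0, 5.0))
    )).toDF("features")

    // with maxCategories = 2, feature 0 (2 distinct values) is declared
    // categorical and re-indexed; feature 1 (3 distinct values) stays continuous
    val indexerModel = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(2)
      .fit(df)

    indexerModel.transform(df).show()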

    Index stability:

    • This is not guaranteed to choose the same category index across multiple runs.
    • If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
    • More stability may be added in the future.

    TODO: Future extensions: The following functionality is planned for the future:

    • Preserve metadata in transform; if a feature's metadata is already present, do not recompute.
    • Specify certain features to not index, either via a parameter or via existing metadata.
    • Add warning if a categorical feature has only 1 category.
    • Add option for allowing unknown categories.
    Annotations
    @Experimental()
  30. class VectorIndexerModel extends Model[VectorIndexerModel] with VectorIndexerParams

    :: Experimental :: Transform categorical features to use 0-based indices instead of their original values.

    • Categorical features are mapped to indices.
    • Continuous features (columns) are left unchanged.

    This also appends metadata to the output column, marking features as Numeric (continuous), Nominal (categorical), or Binary (either continuous or categorical). Non-ML metadata is not carried over from the input to the output column.

    This maintains vector sparsity.

    Annotations
    @Experimental()
  31. final class VectorSlicer extends Transformer with HasInputCol with HasOutputCol

    :: Experimental :: This class takes a feature vector and outputs a new feature vector with a subarray of the original features.

    The subset of features can be specified with either indices (setIndices()) or names (setNames()). At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names.

    The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
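
    A minimal sketch selecting features by index (assuming sqlContext as in the package example above):

    import org.apache.spark.ml.feature.VectorSlicer
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.createDataFrame(Seq(
      Tuple1(Vectors.dense(-2.0, 2.3, 0.0))
    )).toDF("userFeatures")

    val slicer = new VectorSlicer()
      .setInputCol("userFeatures")
      .setOutputCol("features")
      .setIndices(Array(1, 2))

    // the output vector contains features 1 and 2: [2.3, 0.0]
    slicer.transform(df).show()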

    Annotations
    @Experimental()
  32. final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase

    :: Experimental :: Word2Vec trains a model of Map(String, Vector), i.e., it transforms each word into a vector (code) for use in further natural language processing or machine learning tasks.
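
    A minimal sketch (assuming sqlContext as in the package example above; the vector size and minimum count are kept tiny for illustration):

    import org.apache.spark.ml.feature.Word2Vec

    // each row is a document, represented as a sequence of words
    val documentDF = sqlContext.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(documentDF)

    // each document is mapped to the average of its word vectors
    model.transform(documentDF).show()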

    Annotations
    @Experimental()
  33. class Word2VecModel extends Model[Word2VecModel] with Word2VecBase

    :: Experimental :: Model fitted by Word2Vec.

    Annotations
    @Experimental()
