package org.apache.spark.mllib.clustering

Type Members

  1. class KMeans extends Serializable with Logging

    K-means clustering with support for multiple parallel runs and a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.). When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency.

    This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.

  2. class KMeansModel extends Serializable

    A clustering model for K-means. Each point belongs to the cluster with the closest center.
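
    The assignment rule above can be sketched in plain Scala, without Spark. This is only an illustration, assuming squared Euclidean distance and dense points stored as Array[Double]; the object NearestCenter and its methods are hypothetical names, not part of KMeansModel.

    ```scala
    object NearestCenter {
      // Squared Euclidean distance between two points of equal dimension.
      def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
        require(a.length == b.length, "points must have the same dimension")
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
      }

      // Index of the center closest to `point` (hypothetical helper, not an MLlib method).
      // Ties are resolved in favor of the lowest index.
      def predict(centers: Array[Array[Double]], point: Array[Double]): Int = {
        require(centers.nonEmpty, "at least one center is required")
        centers.indices.minBy(i => squaredDistance(centers(i), point))
      }
    }
    ```

    With centers (0, 0) and (10, 10), the point (9, 8) is assigned to the second center, so predict returns index 1.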

  3. class StreamingKMeans extends Logging

    :: DeveloperApi :: StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make predictions on streaming data. See StreamingKMeansModel for details on the algorithm and update rules.

    Use a builder pattern to construct a streaming k-means analysis in an application, like:

    val model = new StreamingKMeans()
      .setDecayFactor(0.5)
      .setK(3)
      .setRandomCenters(5, 100.0)
    // trainOn returns Unit, so it is called separately; dstream stands for a DStream[Vector]
    model.trainOn(dstream)

    Annotations
    @DeveloperApi()
  4. class StreamingKMeansModel extends KMeansModel with Logging

    :: DeveloperApi :: StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.

    The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:

    c_{t+1} = [(c_t * n_t * a) + (x_t * m_t)] / [n_t * a + m_t]
    n_{t+1} = n_t * a + m_t

    Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.

    The decay factor a scales the contribution of the clusters as estimated thus far: the accumulated weight n_t of each cluster is discounted by a before the new batch is incorporated. If a = 1, all batches are weighted equally. If a = 0, new centroids are determined entirely by the most recent batch. Lower values correspond to more forgetting.
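
    The effect of the decay factor can be checked with a small standalone sketch. It assumes dense centroids stored as Array[Double] and normalizes by the discounted weight n_t * a + m_t, so that a = 0 returns exactly the batch centroid, as the text above states. The names DecayedUpdate, Cluster and update are hypothetical and are not the StreamingKMeansModel API.

    ```scala
    object DecayedUpdate {
      // State of one cluster: centroid c_t and accumulated weight n_t.
      final case class Cluster(centroid: Array[Double], weight: Double)

      // Applies the decayed mini-batch rule to a single cluster.
      // batchCentroid is x_t, batchCount is m_t, a is the decay factor.
      def update(c: Cluster, batchCentroid: Array[Double], batchCount: Double, a: Double): Cluster = {
        val discounted = c.weight * a            // n_t * a
        val newWeight = discounted + batchCount  // n_{t+1} = n_t * a + m_t
        if (newWeight == 0.0) c                  // nothing to average: keep the cluster unchanged
        else {
          val newCentroid = c.centroid.zip(batchCentroid).map { case (ct, xt) =>
            (ct * discounted + xt * batchCount) / newWeight
          }
          Cluster(newCentroid, newWeight)
        }
      }
    }
    ```

    For a cluster with centroid 0.0 and weight 10 that receives a batch of 10 points with centroid 4.0, a = 1 gives a new centroid of 2.0 (old and new data weighted equally), while a = 0 gives 4.0 (the old estimate is forgotten).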

    Decay can optionally be specified by a half life and an associated time unit. The time unit can be either a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to the data from t is 0.5. The definition remains the same whether the time unit is given as batches or points.
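
    The half-life definition implies a^h = 0.5 when the discount a is applied once per time unit, which gives a = exp(ln(0.5) / h). A minimal sketch of that conversion follows; the object HalfLife and its methods are hypothetical helpers written for illustration, not library methods.

    ```scala
    object HalfLife {
      // Decay factor a such that a^halfLife = 0.5 (hypothetical helper).
      def decayFactor(halfLife: Double): Double = {
        require(halfLife > 0.0, "half life must be positive")
        math.exp(math.log(0.5) / halfLife)
      }

      // Discount applied to data that is `elapsed` time units old.
      def discount(a: Double, elapsed: Double): Double = math.pow(a, elapsed)
    }
    ```

    With a half life of 5 time units, data that is 5 units old is discounted by 0.5 and data that is 10 units old by 0.25, up to floating-point rounding.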

    Annotations
    @DeveloperApi()

Value Members

  1. object KMeans extends Serializable

    Top-level methods for calling K-means clustering.
