Param for distanceToCentroid column name.
Set the names of the features corresponding to each dimension of the input vectors column
Param for probabilityByFeature column name.
The ratio of standard deviation used to compute probability
Extended KMeans algorithm.
Calculates the following:
- cluster (prediction); already available in the default KMeans algorithm
- distance to cluster
- probability
- probability by feature (dimension)
Note: The probability-by-feature algorithm is based on the ideas presented in https://github.com/tupol/naive-ml; see https://github.com/tupol/naive-ml/blob/master/src/main/scala/tupol/ml/clustering/KMeansGaussian.scala.
Note: The probability-by-feature output loses most of its value if a feature/dimension reduction algorithm is applied before XKMeans2, because the exact input feature that caused a record to be classified as an anomaly can no longer be traced back.
Note: This is far from a perfect solution, as it assumes that the data follows a normal distribution, which is not always the case.
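
The quantities listed above can be illustrated with a minimal sketch in plain Scala. It is not the XKMeans implementation: the object name `GaussianClusterSketch`, its method signatures, and the choice to score each feature with an independent univariate Gaussian density around the centroid are all assumptions made for illustration. Note also that a density is not a true probability and can exceed 1 for small variances.

```scala
// Hypothetical sketch, not the library's code: per-cluster scoring under the
// assumption that every feature is independently normally distributed.
object GaussianClusterSketch {

  /** Euclidean distance between a point and a centroid of the same dimension. */
  def distanceToCentroid(point: Array[Double], centroid: Array[Double]): Double = {
    require(point.length == centroid.length, "dimension mismatch")
    math.sqrt(point.zip(centroid).map { case (x, c) => (x - c) * (x - c) }.sum)
  }

  /** Univariate Gaussian density of each feature, evaluated independently. */
  def probabilityByFeature(point: Array[Double],
                           mean: Array[Double],
                           variance: Array[Double]): Array[Double] = {
    require(point.length == mean.length && mean.length == variance.length,
      "dimension mismatch")
    point.indices.map { i =>
      val v = math.max(variance(i), 1e-12) // guard against zero variance
      val d = point(i) - mean(i)
      math.exp(-(d * d) / (2 * v)) / math.sqrt(2 * math.Pi * v)
    }.toArray
  }

  /** Overall score: product of the per-feature densities (independence assumed). */
  def probability(point: Array[Double],
                  mean: Array[Double],
                  variance: Array[Double]): Double =
    probabilityByFeature(point, mean, variance).product
}
```

Pairing the result of `probabilityByFeature` with the configured feature names (for example via `featureNames.zip(probabilities)`, where both names are hypothetical) shows which input dimension contributes the lowest score for a record. This is the traceability that is lost once dimension reduction is applied first.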