Evaluates the testing data by computing the prediction value and returning a pair of the true label and the predicted value. It is important that the implementation chooses a Testing type from which it can extract the true label value.
Fits the estimator to the given input data. The fitting logic is contained in the FitOperation. The computed state will be stored in the implementing class.
Type of the training data
Training data
Additional parameters for the FitOperation
FitOperation which encapsulates the algorithm logic
Predicts testing data according to the learned model. The implementing class has to provide a corresponding implementation of PredictDataSetOperation, which contains the prediction logic.
Type of the testing data
Type of the prediction data
Testing data which shall be predicted
Additional parameters for the prediction
PredictDataSetOperation which encapsulates the prediction logic
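The fit/predict pattern above relies on type classes that carry the algorithm logic separately from the model state. The following self-contained sketch illustrates that pattern with simplified stand-ins: `FitOperation` and `PredictOperation` here are hypothetical simplifications (using `Seq` in place of Flink's `DataSet`), and `MeanClassifier` is an invented toy estimator, not part of the library.

```scala
// Simplified stand-ins for the type classes described above.
// fit computes state; the state is stored in the implementing class.
trait FitOperation[Self, Training] {
  def fit(instance: Self, training: Seq[Training]): Unit
}

trait PredictOperation[Self, Testing, Prediction] {
  def predict(instance: Self, testing: Seq[Testing]): Seq[Prediction]
}

// Generic entry points that resolve the operation implicitly,
// mirroring how the estimator delegates to FitOperation.
def fit[S, T](instance: S, training: Seq[T])(implicit op: FitOperation[S, T]): Unit =
  op.fit(instance, training)

def predict[S, T, P](instance: S, testing: Seq[T])(implicit op: PredictOperation[S, T, P]): Seq[P] =
  op.predict(instance, testing)

// Toy estimator: fit stores a mean threshold as mutable state,
// predict thresholds new points against it.
class MeanClassifier { var mean: Double = 0.0 }

implicit val meanFit: FitOperation[MeanClassifier, Double] =
  new FitOperation[MeanClassifier, Double] {
    def fit(instance: MeanClassifier, training: Seq[Double]): Unit =
      instance.mean = training.sum / training.size
  }

implicit val meanPredict: PredictOperation[MeanClassifier, Double, Double] =
  new PredictOperation[MeanClassifier, Double, Double] {
    def predict(instance: MeanClassifier, testing: Seq[Double]): Seq[Double] =
      testing.map(x => if (x > instance.mean) 1.0 else -1.0)
  }
```

With this wiring, `fit(model, trainingData)` mutates the model's state and `predict(model, testingData)` uses that state, which is the separation of concerns the docs above describe.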
Sets the number of data blocks/partitions
the number of blocks into which the input data will be split.
itself
Sets the number of outer iterations
the maximum number of iterations of the outer loop method
itself
Sets the number of local SDCA iterations
the maximum number of SDCA iterations
itself
Sets whether the predictions should return the raw decision function value or the thresholded binary value.
When set to true, predict and evaluate return the raw decision value, which is the distance from the separating hyperplane. When set to false, they return the thresholded binary values (+1.0, -1.0).
itself
Sets the regularization constant
the regularization constant of the SVM algorithm
itself
Sets the seed value for the random number generator
the seed to initialize the random number generator
itself
Sets the stepsize for the weight vector updates
the initial step size for the updates of the weight vector
itself
Sets the threshold above which elements are classified as positive.
The predict and evaluate functions will return +1.0 for items with a decision function value above this threshold, and -1.0 for items below it.
the limiting value for the decision function above which examples are labeled as positive
itself
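The thresholding behaviour described above can be written out directly. This is an illustrative sketch, not the library's internal code; the function name is invented.

```scala
// Raw decision values above the threshold map to +1.0,
// values below it to -1.0, as described for predict/evaluate.
def thresholdDecision(rawDecision: Double, threshold: Double): Double =
  if (rawDecision > threshold) 1.0 else -1.0
```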
Stores the learned weight vector after the fit operation
Implements a soft-margin SVM using the communication-efficient distributed dual coordinate ascent algorithm (CoCoA) with hinge-loss function.
It can be used for binary classification problems, with the labels set as +1.0 to indicate a positive example and -1.0 to indicate a negative example.
The algorithm solves the following minimization problem:
$$\min_{w \in \mathbb{R}^d} \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} l_i(w^T x_i)$$
with $w$ being the weight vector, $\lambda$ being the regularization constant, $x_i \in \mathbb{R}^d$ being the data points, and $l_i$ being the convex loss functions, which can also depend on the labels $y_i \in \mathbb{R}$. In the current implementation the regularizer is the 2-norm and the loss functions are the hinge-loss functions:
$$l_i = \max(0, 1 - y_i \cdot w^T x_i)$$
With these choices, the problem definition is equivalent to a soft-margin SVM. Thus, the algorithm allows us to train an SVM with soft margin.
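The hinge loss above is simple enough to state in code. A minimal sketch, with an invented function name; `wTx` stands for the decision value $w^T x_i$:

```scala
// Hinge loss: l_i = max(0, 1 - y_i * w^T x_i).
// Zero when the point is correctly classified with margin >= 1,
// and grows linearly with the margin violation otherwise.
def hingeLoss(y: Double, wTx: Double): Double =
  math.max(0.0, 1.0 - y * wTx)
```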
The minimization problem is solved by applying stochastic dual coordinate ascent (SDCA). To make the algorithm efficient in a distributed setting, the CoCoA algorithm calculates several iterations of SDCA locally on a data block before merging the local updates into a valid global state. This state is redistributed to the different data partitions, where the next round of local SDCA iterations is then executed. The number of outer iterations and local SDCA iterations controls the overall network cost, because network communication is required only once per outer iteration. The local SDCA iterations are embarrassingly parallel once the individual data partitions have been distributed across the cluster.
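To make the local update step concrete, here is a hedged, serial sketch of SDCA for the hinge loss on dense vectors. It deliberately omits the distributed CoCoA block/merge machinery and uses the standard closed-form hinge-loss coordinate update; the function name and toy data are illustrative, not the library's implementation.

```scala
// Serial SDCA for the hinge loss (illustrative sketch only; CoCoA runs
// such updates locally per block and then merges the local results).
// Dual representation: w = (1 / (lambda * n)) * sum_i alpha_i * y_i * x_i,
// with each dual variable alpha_i constrained to [0, 1].
def sdca(xs: Array[Array[Double]], ys: Array[Double],
         lambda: Double, iterations: Int): Array[Double] = {
  val n = xs.length
  val d = xs.head.length
  val w = Array.fill(d)(0.0)
  val alpha = Array.fill(n)(0.0)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  for (_ <- 0 until iterations; i <- 0 until n) {
    // Closed-form coordinate update for the hinge loss,
    // clipped so that alpha_i stays within [0, 1].
    val grad = 1.0 - ys(i) * dot(w, xs(i))
    val proposal = lambda * n * grad / dot(xs(i), xs(i))
    val deltaAlpha = math.max(-alpha(i), math.min(1.0 - alpha(i), proposal))
    alpha(i) += deltaAlpha
    // Keep the primal vector w consistent with the updated duals.
    for (j <- 0 until d)
      w(j) += deltaAlpha * ys(i) * xs(i)(j) / (lambda * n)
  }
  w
}
```

On a trivially separable toy set such as `xs = Array(Array(1.0), Array(-1.0))` with labels `ys = Array(1.0, -1.0)`, the learned `w` has a positive first component, so the sign of $w^T x$ classifies both points correctly.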
Further details of the algorithm can be found here.
Parameters
Stepsize: Defines the initial step size for the updates of the weight vector. The effective step size is stepsize/blocks. This value has to be tuned in case the algorithm becomes unstable. (Default value: 1.0)
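Since the effective step size is stated above as stepsize/blocks, increasing the number of blocks shrinks each update, which is worth keeping in mind when tuning. A trivial illustrative helper (not a library function):

```scala
// Effective step size per the parameter description: stepsize / blocks.
def effectiveStepsize(stepsize: Double, blocks: Int): Double =
  stepsize / blocks
```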