smile.classification (Smile Core 1.0.3 API)

Interface Summary
Interface	Description
Classifier<T>	A classifier assigns an input object into one of a given number of categories.
OnlineClassifier<T>	Classifier with online learning capability.

Class Summary
Class	Description
AdaBoost	AdaBoost (Adaptive Boosting) classifier with decision trees.
AdaBoost.Trainer	Trainer for AdaBoost classifiers.
ClassifierTrainer<T>	Abstract classifier trainer.
DecisionTree	Decision tree for classification.
DecisionTree.Trainer	Trainer for decision tree classifiers.
FLD	Fisher's linear discriminant.
FLD.Trainer	Trainer for Fisher's linear discriminant.
GradientTreeBoost	Gradient boosting for classification.
GradientTreeBoost.Trainer	Trainer for GradientTreeBoost classifiers.
KNN<T>	K-nearest neighbor classifier.
KNN.Trainer<T>	Trainer for KNN classifier.
LDA	Linear discriminant analysis.
LDA.Trainer	Trainer for linear discriminant analysis.
LogisticRegression	Logistic regression.
LogisticRegression.Trainer	Trainer for logistic regression.
Maxent	Maximum Entropy Classifier.
Maxent.Trainer	Trainer for maximum entropy classifier.
NaiveBayes	Naive Bayes classifier.
NaiveBayes.Trainer	Trainer for naive Bayes classifier for document classification.
NeuralNetwork	Multilayer perceptron neural network.
NeuralNetwork.Trainer	Trainer for neural networks.
QDA	Quadratic discriminant analysis.
QDA.Trainer	Trainer for quadratic discriminant analysis.
RandomForest	Random forest for classification.
RandomForest.Trainer	Trainer for random forest classifiers.
RBFNetwork<T>	Radial basis function networks.
RBFNetwork.Trainer<T>	Trainer for RBF networks.
RDA	Regularized discriminant analysis.
RDA.Trainer	Trainer for regularized discriminant analysis.
SVM<T>	Support vector machines for classification.
SVM.Trainer<T>	Trainer for support vector machines.

Enum Summary
Enum	Description
DecisionTree.SplitRule	The criterion to choose variable to split instances.
NaiveBayes.Model	The generation models of naive Bayes classifier.
NeuralNetwork.ActivationFunction	The types of activation functions in output layer.
NeuralNetwork.ErrorFunction	The types of error functions.
SVM.Multiclass	The type of multi-class SVMs.

Package smile.classification Description

Classification algorithms. In machine learning and pattern recognition, classification refers to an algorithmic procedure for assigning a given input object into one of a given number of categories. The input object is formally termed an instance, and the categories are termed classes.

The instance is usually described by a vector of features, which together constitute a description of all known characteristics of the instance. Typically, features are either categorical (also known as nominal, i.e. consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting of one of a set of ordered items, e.g. "large", "medium" or "small"), integer-valued (e.g. a count of the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure).

Classification normally refers to a supervised procedure, i.e. a procedure that produces an inferred function to predict the output value of new instances based on a training set of pairs consisting of an input object and a desired output value. The inferred function is called a classifier if the output is discrete or a regression function if the output is continuous.

The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems. The most widely used learning algorithms are AdaBoost and gradient boosting, support vector machines, linear regression, linear discriminant analysis, logistic regression, naive Bayes, decision trees, k-nearest neighbor algorithm, and neural networks (multilayer perceptron).

If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms cannot be easily applied. Many algorithms, including linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees (and boosting algorithms based on decision trees) is that they easily handle heterogeneous data.

If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, linear support vector machines, naive Bayes) generally perform well. However, if there are complex interactions among features, then algorithms such as nonlinear support vector machines, decision trees and neural networks work better. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.

There are several major issues to consider in supervised learning:

Features

The accuracy of the inferred function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.

There are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. More generally, dimensionality reduction may seek to map the input data into a lower dimensional space prior to running the supervised learning algorithm.

Over-fitting

Over-fitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Over-fitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

The potential for over-fitting depends not only on the number of parameters and data but also the conformability of the model structure with the data shape, and the magnitude of model error compared to the expected level of noise or error in the data.

In order to avoid over-fitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison), that can indicate when further training is not resulting in better generalization. The basis of some techniques is either (1) to explicitly penalize overly complex models, or (2) to test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.

Regularization

Regularization involves introducing additional information in order to solve an ill-posed problem or to prevent over-fitting. This information is usually of the form of a penalty for complexity, such as restrictions for smoothness or bounds on the vector space norm.

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Bias-variance tradeoff

Mean squared error (MSE) can be broken down into two components: variance and squared bias, known as the bias-variance decomposition. Thus in order to minimize the MSE, we need to minimize both the bias and the variance. However, this is not trivial. Therefore, there is a tradeoff between bias and variance.

Author:: Haifeng Li