Machine Learning Practice
SVC
NuSVC
LinearSVC
These are similar methods, but they accept slightly different sets of parameters.
Implementation is based on libsvm.
A faster implementation of linear SVM classification, supporting only the linear kernel.
Implementation is based on liblinear.
Array \(X\): holds the training samples, shape \(\rightarrow\) (n_samples, n_features)
Array \(y\): holds the class labels (strings or integers), shape \(\rightarrow\) (n_samples,)
X = [[0, 0], [1, 1]]
y = [0,1]
Step 1: Instantiate a SVC classifier estimator.
from sklearn.svm import SVC
SVC_classifier = SVC()
Step 2: Call fit method on SVC classifier object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
SVC_classifier.fit(X_train, y_train)
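Putting the two steps together, a minimal sketch on the toy data from above (the variable names and the test point are illustrative):
# Minimal end-to-end sketch: fit an SVC on the toy data and predict
from sklearn.svm import SVC

X_train = [[0, 0], [1, 1]]   # feature matrix: 2 samples, 2 features
y_train = [0, 1]             # class labels

SVC_classifier = SVC()
SVC_classifier.fit(X_train, y_train)

# Predict the class of an unseen sample
print(SVC_classifier.predict([[2., 2.]]))   # expected: [1]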
C: regularization parameter (float value). Default: C = 1.0
SVC_classifier = SVC(C=1.0)
kernel: specifies the kernel type to be used; possible values are ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’. Default: ‘rbf’
SVC_classifier = SVC(kernel = 'rbf')
Note:
- If kernel = 'poly', set degree (any integer value).
- If kernel = callable is given, it is used to pre-compute the kernel matrix from data matrices.
gamma: kernel coefficient; can be ‘scale’, ‘auto’, or a float value. Default: ‘scale’
- ‘scale’: value of gamma = \(\frac{1}{\text{number of features}\times X.\text{var}()}\)
- ‘auto’: value of gamma = \(\frac{1}{\text{number of features}}\)
SVC_classifier = SVC(gamma = 'scale')
Note: if kernel = 'poly' or 'sigmoid', set coef0, which is an independent term in the kernel function (any integer value).
After the classifier is fit on the training data, there are a few attributes which reveal the details of the support vectors.
from sklearn.svm import SVC
SVC_classifier = SVC()
clf = SVC_classifier.fit(X_train, y_train)
#to view indices of the support vectors
clf.support_
#to view the support vectors
clf.support_vectors_
#to view the number of support vectors for each class
clf.n_support_
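A small, self-contained illustration of these attributes (the toy data and the linear kernel here are assumptions made for the sketch; the exact values depend on the data):
from sklearn.svm import SVC

X_train = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_train = [0, 0, 1, 1]

clf = SVC(kernel='linear').fit(X_train, y_train)

print(clf.support_)          # indices of the support vectors
print(clf.support_vectors_)  # the support vectors themselves
print(clf.n_support_)        # number of support vectors for each class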
Step 1: Instantiate a NuSVC classifier estimator.
from sklearn.svm import NuSVC
NuSVC_classifier = NuSVC()
Step 2: Call fit method on NuSVC classifier object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
NuSVC_classifier.fit(X_train, y_train)
\(\nu\) is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
The value of \(\nu\) should lie in the interval \((0, 1]\).
Default: \(\nu = 0.5\)
Instead of C in SVC, \(\nu\) is introduced in NuSVC to control the number of support vectors and margin errors.
Other parameters for NuSVC are the same as those of SVC.
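A minimal sketch of the same two steps with NuSVC, setting nu explicitly (the toy data is chosen here only for illustration):
from sklearn.svm import NuSVC

X_train = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_train = [0, 0, 1, 1]

NuSVC_classifier = NuSVC(nu=0.5)        # nu must lie in (0, 1]
NuSVC_classifier.fit(X_train, y_train)
print(NuSVC_classifier.predict([[2.5, 2.5]]))   # expected: [1]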
Step 1: Instantiate a LinearSVC classifier estimator.
from sklearn.svm import LinearSVC
LinearSVC_classifier = LinearSVC()
Step 2: Call fit method on LinearSVC classifier object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
LinearSVC_classifier.fit(X_train, y_train)
penalty: ‘l1’ or ‘l2’. Default: ‘l2’. The ‘l1’ penalty leads to coef_ vectors that are sparse.
LinearSVC_classifier = LinearSVC(penalty = 'l2')
loss parameter: 'hinge' - standard SVM loss; 'squared_hinge' - square of the hinge loss. Default: ‘squared_hinge’
LinearSVC_classifier = LinearSVC(loss = 'squared_hinge')
Combination not supported: penalty='l1' and loss='hinge'
Other parameters:
- C: regularization parameter
- fit_intercept: to calculate the intercept for the model
- dual: whether to solve the dual or the primal optimization problem
- multi_class: ‘ovr’ or ‘crammer_singer’
- decision_function_shape: ‘ovo’ or ‘ovr’
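A sketch that combines the parameters above (the defaults are written out explicitly; the toy data is only for illustration):
from sklearn.svm import LinearSVC

X_train = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_train = [0, 0, 1, 1]

LinearSVC_classifier = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
LinearSVC_classifier.fit(X_train, y_train)

print(LinearSVC_classifier.coef_)       # weight vector(s); with penalty='l1' many entries become zero
print(LinearSVC_classifier.intercept_)  # intercept term(s)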
Some Parameters:
Class: \(\colorbox{lightgrey}{sklearn.svm.SVC}\)
C (float, default=1.0): It is a regularization parameter. The strength of the regularization is inversely proportional to C. It should always be positive.
kernel (‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’, default=’rbf’): Specifies the kernel type to be used in the algorithm.
degree (int, default=3): Degree of the polynomial kernel function (‘poly’). Ignored for all other kernels.
gamma(‘scale’, ‘auto’ or float, default=’scale’): Kernel coefficient for ‘rbf’ (Gaussian), ‘poly’(Polynomial) and ‘sigmoid’.
cache_size(float, default=200): It specifies the size of the kernel cache (in MB).
max_iter(int, default=-1): It represents a hard limit on iterations within the solver, or -1 for no limit.
random_state(int, default=None): Controls the pseudo random number generation for shuffling the data for probability estimates.
Some Parameters: (continued...)
Class: \(\colorbox{lightgrey}{sklearn.svm.NuSVC}\)
Some parameters:
nu(float, default=0.5): An upper bound on the fraction of margin errors and a lower bound of the fraction of support vectors. It should be in the interval (0, 1].
kernel(‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’, default=’rbf’): Specifies the kernel type to be used in the algorithm.
degree (int, default=3): Degree of the polynomial kernel function (‘poly’). Ignored for all other kernels.
gamma(‘scale’, ‘auto’ or float, default=’scale’): Kernel coefficient for ‘rbf’, ‘poly’(Polynomial) and ‘sigmoid’.
cache_size(float, default=200): It specifies the size of the kernel cache (in MB).
max_iter(int, default=-1): It represents a hard limit on iterations within the solver, or -1 for no limit.
random_state(int, default=None): Controls the pseudo random number generation for shuffling the data for probability estimates.
Some Parameters: (continued...)
Class: \(\colorbox{lightgrey}{sklearn.svm.LinearSVC}\)
Some parameters:
penalty(‘l1’, ‘l2’, default=’l2’): It specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
loss(‘hinge’, ‘squared_hinge’, default=’squared_hinge’): It specifies the loss function.
C (float, default=1.0): It is a regularization parameter. The strength of the regularization is inversely proportional to C. It should always be positive.
Some parameters: Continued...
fit_intercept(bool, default=True): Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
dual(bool, default=True): Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
max_iter(int, default=1000): It represents the maximum number of iterations to be run.
The example below shows how to plot the decision surfaces of four SVM classifiers with different kernels; a condensed sketch follows the data description.
We are considering only the first two features of the iris dataset.
- an array x of shape (n_samples, n_features) holding the training samples.
- an array y of class labels (strings or integers), of shape (n_samples).
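A condensed sketch of that example (the kernel choices, gamma/degree values, and grid step are assumptions made here for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target        # only the first two features

models = [SVC(kernel='linear', C=1.0),
          LinearSVC(C=1.0),
          SVC(kernel='rbf', gamma=0.7, C=1.0),
          SVC(kernel='poly', degree=3, C=1.0)]
titles = ['SVC (linear kernel)', 'LinearSVC', 'SVC (RBF kernel)', 'SVC (poly kernel)']

# Grid of points over the feature space, used to colour the decision regions
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 1, X[:, 0].max() + 1, 0.02),
                     np.arange(X[:, 1].min() - 1, X[:, 1].max() + 1, 0.02))

for i, (clf, title) in enumerate(zip(models, titles)):
    clf.fit(X, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(2, 2, i + 1)
    plt.contourf(xx, yy, Z, alpha=0.4)          # decision surface
    plt.scatter(X[:, 0], X[:, 1], c=y, s=15)    # training points
    plt.title(title)
plt.show()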
Below is the plot of the maximum margin separating hyperplane within a two-class separable dataset, using a Support Vector Machine classifier with a linear kernel.
Binary classification with RBF kernel using non-linear SVC.
Illustration of decision function learnt by SVC
SVM with univariate feature selection
It shows how to perform univariate feature selection before running an SVC (support vector classifier) to improve the classification scores.
This model achieves the best performance when we select around 10% of features.
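A sketch of this idea using a pipeline (the added noisy features and the fixed 10% percentile mirror the scikit-learn "SVM-Anova" example; the exact score will vary):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Append non-informative noisy features so that the 4 real ones are about 10%
rng = np.random.RandomState(0)
X = np.hstack([X, rng.uniform(size=(X.shape[0], 36))])

clf = Pipeline([
    ('anova', SelectPercentile(f_classif, percentile=10)),  # keep top 10% of features
    ('scaler', StandardScaler()),
    ('svc', SVC(gamma='auto')),
])
print(cross_val_score(clf, X, y, cv=5).mean())   # classification score with selection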
1. Multi-class classification
2. Scores and Probabilities
3. Unbalanced problems
Multi-class classification
There are two approaches for multi-class classification:
SVC and NuSVC implement the “one-vs-one” approach for multi-class classification. In total, n_classes * (n_classes - 1) / 2 classifiers are constructed and each one trains data from two classes.
To provide a consistent interface with other classifiers, the decision_function_shape option allows the results of the “one-vs-one” classifiers to be monotonically transformed into a “one-vs-rest” decision function of shape (n_samples, n_classes).
LinearSVC implements “one-vs-rest” multi-class strategy, thus training n_classes models.
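A sketch showing the resulting shapes (the synthetic 4-class dataset is only for illustration):
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic 4-class problem
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

clf_ovo = SVC(decision_function_shape='ovo').fit(X, y)
print(clf_ovo.decision_function(X).shape)   # (200, 6): 4*(4-1)/2 pairwise classifiers

clf_ovr = SVC(decision_function_shape='ovr').fit(X, y)
print(clf_ovr.decision_function(X).shape)   # (200, 4): one column per class

clf_lin = LinearSVC().fit(X, y)             # one-vs-rest: n_classes underlying models
print(clf_lin.coef_.shape)                  # (4, 10)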
Scores and Probabilities
Unbalanced problems
In problems where it is desired to give more importance to certain classes or to certain individual samples, the parameters class_weight and sample_weight can be used.
The example illustrates the decision boundary of an unbalanced problem, with and without weight correction.
SVM: Separating hyperplane for unbalanced classes
The figure below illustrates the effect of sample weighting on the decision boundary. The size of the circles is proportional to the sample weights:
SVM: Weighted samples
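A sketch of both mechanisms, class_weight (set at construction) and sample_weight (passed to fit); the weights and the synthetic imbalanced data are illustrative:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced two-class data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Penalize mistakes on the minority class 1 ten times more heavily
wclf = SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

# Alternatively, weight individual samples via fit(..., sample_weight=...)
sample_weight = np.ones(len(y))
sample_weight[y == 1] = 5.0          # up-weight minority samples
clf = SVC(kernel='linear')
clf.fit(X, y, sample_weight=sample_weight)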
Some Parameters:
Class: \(\colorbox{lightgrey}{sklearn.svm.SVR}\)
C (float, default=1.0): It is a regularization parameter. The strength of the regularization is inversely proportional to C. It should always be positive.
kernel (‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’, default=’rbf’): Specifies the kernel type to be used in the algorithm.
degree (int, default=3): Degree of the polynomial kernel function (‘poly’). Ignored for all other kernels.
gamma(‘scale’, ‘auto’ or float, default=’scale’): Kernel coefficient for ‘rbf’ (Gaussian), ‘poly’(Polynomial) and ‘sigmoid’.
epsilon(float, default=0.1): Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
cache_size(float, default=200): It specifies the size of the kernel cache (in MB).
max_iter(int, default=-1): It represents a hard limit on iterations within the solver, or -1 for no limit.
random_state(int, default=None): Controls the pseudo random number generation for shuffling the data for probability estimates.
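A minimal sklearn.svm.SVR sketch using a few of the parameters above (the toy regression data and values are illustrative):
from sklearn.svm import SVR

X_train = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_train = [0.0, 1.1, 1.9, 3.1]

regressor = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale')
regressor.fit(X_train, y_train)
print(regressor.predict([[1.5, 1.5]]))   # predicted target for a new sample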
Some Parameters: (continued...)
Class: \(\colorbox{lightgrey}{sklearn.svm.NuSVR}\)
Some parameters:
nu(float, default=0.5): It represents an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. It should be in the interval (0, 1]. By default, 0.5 will be taken.
C(float, default=1.0): A penalty parameter of the error term.
kernel(‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’, default=’rbf’): Specifies the kernel type to be used in the algorithm.
degree (int, default=3): Degree of the polynomial kernel function (‘poly’). Ignored for all other kernels.
gamma(‘scale’, ‘auto’ or float, default=’scale’): Kernel coefficient for ‘rbf’ (Gaussian), ‘poly’(Polynomial) and ‘sigmoid’.
cache_size(float, default=200): It specifies the size of the kernel cache (in MB).
max_iter(int, default=-1): It represents a hard limit on iterations within the solver, or -1 for no limit.
Some Parameters: (continued...)
Class: \(\colorbox{lightgrey}{sklearn.svm.LinearSVR}\)
Some parameters:
epsilon(float, default=0.0): Epsilon parameter in the epsilon-insensitive loss function. Note that the value of this parameter depends on the scale of the target variable y.
loss(‘epsilon_insensitive’, ‘squared_epsilon_insensitive’, default=’epsilon_insensitive’): It specifies the loss function. The epsilon-insensitive loss (standard SVR) is the L1 loss, while the squared epsilon-insensitive loss is the L2 loss.
C (float, default=1.0): It is a regularization parameter. The strength of the regularization is inversely proportional to C. It should always be positive.
Some parameters: Continued...
fit_intercept(bool, default=True): Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
dual(bool, default=True): Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
max_iter(int, default=1000): It represents the maximum number of iterations to be run.
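A minimal LinearSVR sketch (the toy data, and the raised max_iter to avoid convergence warnings on it, are assumptions made here):
from sklearn.svm import LinearSVR

X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.1, 0.9, 2.1, 2.9]

regressor = LinearSVR(epsilon=0.0, C=1.0, loss='epsilon_insensitive', max_iter=10000)
regressor.fit(X_train, y_train)
print(regressor.coef_, regressor.intercept_)   # fitted weight and intercept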
The class OneClassSVM implements a One-Class SVM, which we use in outlier detection. Outlier detection and novelty detection are both used for anomaly detection.
Novelty detection:
The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
Outlier detection:
The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
An overview of outlier detection methods
An example using a one-class SVM for novelty detection.
One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set.
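A minimal novelty-detection sketch with OneClassSVM (the synthetic data and the nu/gamma values are illustrative):
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)            # regular, unpolluted observations
X_new = np.array([[0.1, 0.1], [4.0, 4.0]])   # one ordinary point, one novelty

clf = OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
clf.fit(X_train)                             # unsupervised: no labels
print(clf.predict(X_new))                    # +1 = inlier, -1 = outlier; roughly [ 1 -1]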
Note: If the data is very sparse, \(n_{features}\) should be replaced by the average number of non-zero features in a sample vector.
The kernel function can be any of the following:
- linear: \(\langle x, x'\rangle\)
- polynomial: \((\gamma \langle x, x'\rangle + r)^d\), where \(d\) is specified by parameter degree and \(r\) by parameter coef0
- rbf: \(\exp(-\gamma \|x - x'\|^2)\), where \(\gamma\) is specified by parameter gamma and must satisfy \(\gamma>0\)
- sigmoid: \(\tanh(\gamma \langle x, x'\rangle + r)\), where \(r\) is specified by parameter coef0
The parameters that must be considered while training an SVM with the Radial Basis Function (RBF) kernel are \(C\) and gamma.
One can define one's own kernels either by giving the kernel as a Python function or by precomputing the Gram matrix.
Using a Python function as the kernel: pass a function to the kernel parameter. The function must take as arguments two matrices of shape (n_samples_1, n_features) and (n_samples_2, n_features), and return a kernel matrix of shape (n_samples_1, n_samples_2).
Using the Gram matrix: set kernel='precomputed' and pass the Gram matrix instead of X to the fit method.
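A sketch of both options, a Python function as the kernel and a precomputed Gram matrix (the linear kernel and toy data are chosen here only to keep the example small):
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y_train = [0, 0, 1, 1]
X_test = np.array([[2.5, 2.5]])

# (1) Kernel given as a Python function: takes (n_samples_1, n_features) and
# (n_samples_2, n_features), returns (n_samples_1, n_samples_2)
def my_kernel(A, B):
    return np.dot(A, B.T)                    # a simple linear kernel

clf = SVC(kernel=my_kernel).fit(X_train, y_train)
print(clf.predict(X_test))

# (2) Precomputed Gram matrix: pass the kernel matrix instead of X
gram_train = np.dot(X_train, X_train.T)      # shape (n_train, n_train)
clf2 = SVC(kernel='precomputed').fit(gram_train, y_train)
gram_test = np.dot(X_test, X_train.T)        # shape (n_test, n_train)
print(clf2.predict(gram_test))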
Classifiers with custom kernels behave the same way as any other classifiers, with a few exceptions:
- The field support_vectors_ is now empty.
It will plot the decision surface and the support vectors.
Example: SVM with custom kernel
The figure below shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called “support vectors”:
Given training vectors \(x_i\in \mathbb{R}^p\), \(i=1, \cdots, n\), in two classes, and a vector \(y \in \{-1, 1\}^n\), our goal is to find \(w \in \mathbb{R}^p\) and \(b \in \mathbb{R}\) such that the prediction given by \(\operatorname{sign}(w^T\phi(x)+b)\) is correct for most samples.
Primal Problem:
\(\min\limits_{w, b, \zeta}\dfrac{1}{2}w^Tw+C\sum\limits_{i=1}^{n}\zeta_i\)
subject to \(~~~~~~~~~~~~y_i(w^T\phi(x_i)+b)\geq 1-\zeta_i, ~~~~~~~~\zeta_i\geq 0, i: 1,\cdots, n.\)
Dual Problem:
\(\min\limits_{\alpha}\dfrac{1}{2}\alpha^TQ\alpha - e^T\alpha\)
subject to \(~~~~~~~~~~~~~~~y^T\alpha = 0, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\)
\(0\leq \alpha_i\leq C, i = 1, \cdots, n.\)
Once the optimization problem is solved, the output of decision_function for a given sample \(x\) becomes:
\(\sum\limits_{i\in SV}y_i\alpha_iK(x_i, x)+b\)
These parameters can be accessed through the attributes:
- dual_coef_: holds the product \(y_i \alpha_i\)
- support_vectors_: holds the support vectors
- intercept_: holds the independent term \(b\).
The primal problem can be formulated as
\(\min\limits_{w, b}\dfrac{1}{2}w^Tw+C\sum\limits_{i=1}^{n}\max(0, 1-y_i(w^T\phi(x_i)+b))\)
A margin error corresponds to a sample that lies on the wrong side of its margin boundary: it is either misclassified, or correctly classified but does not lie beyond the margin.
Given training vectors \(x_i\in \mathbb{R}^p, i: 1,\cdots, n\), and a vector \(y\in \mathbb{R}^n\), \(\epsilon\)-SVR solves the following primal problem:
\(\min\limits_{w, b, \zeta, \zeta^*}\dfrac{1}{2}w^Tw+C\sum\limits_{i=1}^{n}(\zeta_i+\zeta_i^*)\)
subject to \(~~~~~~~~~~~~y_i-w^T\phi(x_i)-b\leq \epsilon +\zeta_i \)
\(w^T\phi(x_i)+b-y_i\leq \epsilon +\zeta_i^*\)
\(\zeta_i, \zeta_i^*\geq 0, i: 1, \cdots, n\)
Dual Problem:
\(\min\limits_{\alpha, \alpha^*}\dfrac{1}{2}(\alpha-\alpha^*)^TQ(\alpha-\alpha^*)+\epsilon e^T(\alpha+\alpha^*)-y^T(\alpha-\alpha^*)\)
subject to \(~~~~~~~~~~~~~~~~~~~e^T(\alpha-\alpha^*) = 0\)
\(0\leq \alpha_i, \alpha_i^*\leq C, i=1, \cdots, n\)
Here, we are penalizing samples whose prediction is at least \(\epsilon\) away from their true target.
Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function \(\phi\).
The prediction is:
\(\sum\limits_{i \in SV}(\alpha_i - \alpha_i^*)K(x_i, x)+b\)
These parameters can be accessed through the attributes:
- dual_coef_: holds the difference \(\alpha_i - \alpha_i^*\)
- support_vectors_: holds the support vectors
- intercept_: holds the independent term \(b\).
Primal problem:
\(\min\limits_{w, b}\dfrac{1}{2}w^Tw+C\sum\limits_{i=1}^{n}\max(0, \mid y_i-(w^T\phi(x_i)+b)\mid-\epsilon)\)
libsvm:
- Different SVM formulations
- Efficient multi-class classification
- Cross validation for model selection
- Probability estimates
- Various kernels (including precomputed kernel matrix)
- GUI demonstrating SVM classification and regression
- Automatic model selection which can generate contour of cross validation accuracy.
liblinear:
- Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
- Cross validation for model evaluation.
- Automatic parameter selection
- Probability estimates (logistic regression only)