Machine Learning Practice
There are broadly two types of APIs based on their functionality:
Specific: specialized solvers for optimization (e.g. RidgeClassifier, LogisticRegression).
Generic: uses gradient descent for optimization and we need to specify a loss function (e.g. SGDClassifier).
All sklearn estimators for classification implement a few common methods for model training, prediction and evaluation.
Model training:
fit(X, y[, coef_init, intercept_init, …])
Prediction:
predict(X) - predicts class labels for samples.
decision_function(X) - predicts confidence scores for samples.
Evaluation:
score(X, y[, sample_weight]) - returns the mean accuracy on the given test data and labels.
There are a few common miscellaneous methods as follows:
get_params([deep]) - gets parameters of this estimator.
set_params(**params) - sets the parameters of this estimator.
densify() - converts coefficient matrix to dense array format.
sparsify() - converts coefficient matrix to sparse format.
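A minimal sketch of these housekeeping methods; SGDClassifier is used here only as an illustrative estimator and the alpha value is an assumption for the example:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
print(clf.get_params())       # dict of current hyper-parameters, e.g. {'alpha': 0.0001, ...}
clf.set_params(alpha=0.001)   # update hyper-parameters in place
# densify() and sparsify() convert the learnt coef_ matrix, so they are called after fit().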
Now let's study how to implement different classifiers with sklearn APIs.
Let's start with implementation of least square classification (LSC) with RidgeClassifier API.
RidgeClassifier can be used for both binary classification and multiclass classification.
Step 1: Instantiate a classification estimator without passing any arguments to it. This creates a ridge classifier object.
from sklearn.linear_model import RidgeClassifier
ridge_classifier = RidgeClassifier()
Step 2: Call fit method on ridge classifier object with training feature matrix and label vector as arguments.
Note: The model is fitted using X_train and y_train.
# Model training with feature matrix X_train and
# label vector or matrix y_train
ridge_classifier.fit(X_train, y_train)
Set alpha (regularization strength) to a float value. The default value is 1.0.
from sklearn.linear_model import RidgeClassifier
ridge_classifier = RidgeClassifier(alpha=0.001)
The solver can be set to one of the following options:
svd - uses a Singular Value Decomposition of the feature matrix to compute the Ridge coefficients.
cholesky - uses the scipy.linalg.solve function to obtain the closed-form solution.
sparse_cg - uses the conjugate gradient solver of scipy.sparse.linalg.cg.
lsqr - uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr and is the fastest.
sag - uses a Stochastic Average Gradient descent iterative procedure.
saga - an unbiased and more flexible version of 'sag'.
lbfgs - uses the L-BFGS-B algorithm implemented in scipy.optimize.minimize; can be used only when coefficients are forced to be positive.
For large-scale data, use the 'sparse_cg' solver. When both n_samples and n_features are large, use the 'sag' or 'saga' solvers.
Note that fast convergence is only guaranteed on features with approximately the same scale.
auto - chooses the solver automatically based on the type of data.
ridge_classifier = RidgeClassifier(solver='auto')
if solver == 'auto':
if return_intercept:
# only sag supports fitting intercept directly
solver = "sag"
elif not sparse.issparse(X):
solver = "cholesky"
else:
solver = "sparse_cg"
The default choice for solver is 'auto'.
If the data is already centered, set fit_intercept to False so that no intercept is used in calculations.
ridge_classifier = RidgeClassifier(fit_intercept=False)
Default: fit_intercept = True
Use the predict method to predict class labels for samples.
Step 1: Arrange the data for prediction in a feature matrix of shape (#samples, #features) or in sparse matrix format.
Step 2: Call the predict method on the classifier object with the feature matrix as an argument.
# Predict labels for feature matrix X_test
y_pred = ridge_classifier.predict(X_test)
Other classifiers also use the same predict method.
RidgeClassifierCV implements RidgeClassifier with built-in cross validation.
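A minimal sketch of RidgeClassifierCV; the alpha grid is an illustrative assumption and X_train, y_train are assumed to be defined as above:
from sklearn.linear_model import RidgeClassifierCV
# Cross-validated search over a small grid of regularization strengths
ridge_cv_classifier = RidgeClassifierCV(alphas=[0.01, 0.1, 1.0, 10.0])
ridge_cv_classifier.fit(X_train, y_train)
print(ridge_cv_classifier.alpha_)   # regularization strength chosen by cross validation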
Let's implement perceptron classifier with Perceptron API.
It is a simple classification algorithm suitable for large-scale learning.
Perceptron uses SGD for training; Perceptron() is equivalent to
SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)
Step 1: Instantiate a Perceptron estimator without passing any arguments to it to create a classifier object.
from sklearn.linear_model import Perceptron
perceptron_classifier = Perceptron()
Step 2: Call fit method on perceptron estimator object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
perceptron_classifier.fit(X_train, y_train)
Perceptron can be further customized with the following parameters (a usage sketch follows this list):
penalty (default = None)
alpha (default = 0.0001)
l1_ratio (default = 0.15)
fit_intercept (default = True)
max_iter (default = 1000)
tol (default = 1e-3)
eta0 (default = 1)
early_stopping (default = False)
validation_fraction (default = 0.1)
n_iter_no_change (default = 5)
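A minimal sketch, assuming the same X_train and y_train as above; the specific parameter values are illustrative only:
from sklearn.linear_model import Perceptron
# Perceptron with L2 regularization and early stopping enabled
perceptron_classifier = Perceptron(
    penalty='l2',             # add a regularization term (default is no penalty)
    alpha=0.0001,             # regularization strength
    max_iter=1000,            # maximum number of epochs
    early_stopping=True,      # hold out validation_fraction of the data
    validation_fraction=0.1,
    n_iter_no_change=5)
perceptron_classifier.fit(X_train, y_train)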
Let's implement logistic regression classifier with LogisticRegression API.
Step 1: Instantiate a classifier estimator without passing any arguments to it. This creates a logistic regression object.
from sklearn.linear_model import LogisticRegression
logit_classifier = LogisticRegression()
Step 2: Call fit method on logistic regression classifier object with training feature matrix and label vector as arguments
# Model training with feature matrix X_train and
# label vector or matrix y_train
logit_classifier.fit(X_train, y_train)
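Once fitted, class labels and per-class probabilities can be obtained; a minimal sketch assuming X_test is defined:
# Predicted class labels for the test feature matrix
y_pred = logit_classifier.predict(X_test)
# Per-class probability estimates (columns follow logit_classifier.classes_)
y_proba = logit_classifier.predict_proba(X_test)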
Logistic regression uses specific algorithms for solving the optimization problem in training. These algorithms are known as solvers.
The choice of the solver depends on the classification problem set up such as size of the dataset, number of features and labels.
The solver parameter accepts one of the following values: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'.
'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss.
'liblinear' is limited to one-versus-rest schemes.
logit_classifier = LogisticRegression(solver='lbfgs')
'liblinear', 'lbfgs' and 'newton-cg' are robust solvers.
By default, logistic regression uses the 'lbfgs' solver.
Regularization is applied by default because it improves numerical stability.
The type of regularization is set with the penalty parameter.
logit_classifier = LogisticRegression(penalty='l2')
By default, it uses L2 penalty.
L2 penalty is supported by all solvers
L1 penalty is supported only by a few solvers.
Solver | Penalty |
---|---|
‘newton-cg’ | [‘l2’, ‘none’] |
‘lbfgs’ | [‘l2’, ‘none’] |
‘liblinear’ | [‘l1’, ‘l2’] |
‘sag’ | [‘l2’, ‘none’] |
‘saga’ | [‘elasticnet’, ‘l1’, ‘l2’, ‘none’] |
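As a sketch of the table above, the elastic-net penalty is only available with the 'saga' solver and needs an l1_ratio; the value 0.5 is an illustrative assumption and X_train, y_train are assumed to be defined:
from sklearn.linear_model import LogisticRegression
# Elastic-net penalty: only the 'saga' solver supports it
logit_en_classifier = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)
logit_en_classifier.fit(X_train, y_train)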
LogisticRegression classifier has a class_weight parameter in its constructor.
What purpose does it serve?
Exercise: Read stack overflow discussion on this parameter.
This parameter is available in classifier estimators in sklearn.
LogisticRegressionCV implements logistic regression with built-in cross-validation support to find the best values of the C and l1_ratio parameters according to the specified scoring attribute.
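A minimal sketch; the grid size and fold count are illustrative assumptions and X_train, y_train are assumed to be defined:
from sklearn.linear_model import LogisticRegressionCV
# Search 10 log-spaced values of C with 5-fold stratified cross validation
logit_cv_classifier = LogisticRegressionCV(Cs=10, cv=5, scoring='accuracy')
logit_cv_classifier.fit(X_train, y_train)
print(logit_cv_classifier.C_)   # best C found for each class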
These classifiers can also be implemented with a generic SGDClassifier API by setting the loss parameter appropriately.
Let's study SGDClassifier API.
We need to set the loss parameter appropriately to train the classifier of our interest with SGDClassifier.
The loss parameter can take the following values:
'hinge' - (soft-margin) linear Support Vector Machine
'modified_huber' - smoothed hinge loss brings tolerance to outliers as well as probability estimates
'log' - logistic regression
'squared_hinge' - like hinge but is quadratically penalized
'perceptron' - linear loss used by the perceptron algorithm
‘squared_error’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’ - regression losses
By default SGDClassifier uses hinge loss and hence trains linear support vector machine classifier.
SGDClassifier(loss='log') trains a logistic regression classifier using SGD.
SGDClassifier(loss='hinge') trains a linear support vector machine.
Advantages of the SGD approach: efficiency and ease of implementation on large-scale data.
Disadvantages: it requires a number of hyperparameters (such as the regularization parameter and the number of iterations) and is sensitive to feature scaling.
Step 1: Instantiate a SGDClassifer estimator by setting appropriate loss parameter to define classifier of interest. By default it uses hinge loss, which is used for training linear support vector machine.
from sklearn.linear_model import SGDClassifier
SGD_classifier = SGDClassifier(loss='log')
Step 2: Call fit method on SGD classifier object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
SGD_classifier.fit(X_train, y_train)
Here we have used `log` loss that defines a logistic regression classifier.
The penalty parameter can be 'l2', 'l1' or 'elasticnet'; the elastic net penalty is (1 - l1_ratio) * L2 + l1_ratio * L1, where l1_ratio controls the convex combination of L1 and L2 penalty (default = 0.15).
SGD_classifier = SGDClassifier(penalty='l2')
Default: penalty = 'l2'
alpha is the constant that multiplies the regularization term (default = 0.0001).
The maximum number of passes over the training data (aka epochs) is an integer that can be set by the max_iter parameter.
SGD_classifier = SGDClassifier(max_iter=100)
Default: max_iter = 1000
Other SGDClassifier parameters (a usage sketch follows this list):
learning_rate - 'constant', 'optimal', 'invscaling' or 'adaptive'.
Stopping criteria - controlled by tol, n_iter_no_change, max_iter, early_stopping and validation_fraction.
warm_start - True or False; if True, the solution of the previous call to fit is reused as initialization.
average - True or False; if True, the averaged SGD weights across updates are stored in coef_.
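A minimal sketch combining several of these parameters; the values are illustrative assumptions and X_train, y_train are assumed to be defined:
from sklearn.linear_model import SGDClassifier
SGD_classifier = SGDClassifier(
    loss='hinge',              # linear SVM
    penalty='elasticnet',      # convex combination of L1 and L2
    l1_ratio=0.15,
    alpha=0.0001,              # regularization strength
    learning_rate='optimal',
    max_iter=1000,
    early_stopping=True,       # stop when validation score stops improving
    validation_fraction=0.1,
    n_iter_no_change=5)
SGD_classifier.fit(X_train, y_train)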
We learnt how to implement the following classifiers with sklearn APIs:
Alternatively we can use SGDClassifier with appropriate loss setting for implementing these classifiers:
Classification estimators implement a few common methods like fit, score, decision_function, and predict.
Let's extend these classifiers to multi-learning (multi-class, multi-label & multi-output) settings.
Multilabel classification: each output variable is binary (total #labels per output = 2).
Multiclass multioutput classification: each output variable can take more than two values (total #labels per output > 2).
We will refer to both of these as multi-label classification models, where the number of output labels is greater than 1.
Multiclass, multilabel, multioutput problems are referred to as multi-learning problems.
We shall discuss two modules and their meta-estimators for these problem types:
Multiclass classification (sklearn.multiclass): OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier
Multilabel classification (sklearn.multioutput): MultiOutputClassifier, ClassifierChain
sklearn estimators fall into the following support categories: inherently multiclass, multiclass as OVO, multiclass as OVR, and multilabel.
Inherently multiclass:
LogisticRegression (multi_class = 'multinomial')
LogisticRegressionCV (multi_class = 'multinomial')
RidgeClassifier
RidgeClassifierCV
Multiclass as OVR:
Perceptron
LogisticRegression (multi_class = 'ovr')
LogisticRegressionCV (multi_class = 'ovr')
SGDClassifier
RidgeClassifier
RidgeClassifierCV
First we will study multiclass APIs in sklearn.
In the Iris dataset, each example is assigned one of 3 class labels.
In the MNIST digit recognition dataset, each example is assigned one of 10 class labels (digits 0 to 9).
from sklearn.preprocessing import LabelBinarizer
import numpy as np
y = np.array(['apple', 'pear', 'apple', 'orange'])
y_dense = LabelBinarizer().fit_transform(y)
[[1 0 0]
 [0 0 1]
 [1 0 0]
 [0 1 0]]
Let's say you are given labels as part of the training set; how do we check if they are suitable for multi-class classification?
from sklearn.utils.multiclass import type_of_target
type_of_target(y)
type_of_target can determine different types of multi-learning targets: given y, it returns a target_type string such as 'multiclass', 'multiclass-multioutput', 'multilabel-indicator' or 'unknown'.
>>> type_of_target([1, 0, 2])
'multiclass'
>>> type_of_target([1.0, 0.0, 3.0])
'multiclass'
>>> type_of_target(['a', 'b', 'c'])
'multiclass'
>>> type_of_target(np.array([[1, 2], [3, 1]]))
'multiclass-multioutput'
>>> type_of_target(np.array([[0, 1], [1, 1]]))
'multilabel-indicator'
>>> type_of_target([[1, 2]])
'multilabel-indicator'
Apart from these, type_of_target can determine three more types corresponding to regression and binary classification: 'continuous', 'continuous-multioutput' and 'binary'.
All classifiers in scikit-learn perform multiclass classification out-of-the-box.
OneVsRest classifier also supports multilabel classification. We need to supply labels as indicator matrix of shape \((n, k)\).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
The OneVsOne classifier processes a subset of the data at a time and is useful when the underlying classifier does not scale well with the amount of data.
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC
OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y)
OneVsRestClassifier fits one classifier per class, against all the other classes.
OneVsOneClassifier fits one classifier per pair of classes; at prediction time, the class which received the most votes is selected.
[Diagram: one-vs-rest — the input feature matrix (X) is passed to classifiers #1 … #k, which predict class #1 … class #k respectively.]
For multilabel and multioutput problems, sklearn.multioutput provides the MultiOutputClassifier and ClassifierChain meta-estimators.
So far we learnt how to train classifiers for binary, multi-class and multi-label/output cases.
We will learn how to evaluate these classifiers with different scoring functions and with cross-validation.
We will also study how to set hyper-parameters for classifiers.
Many cross-validation and HPT methods discussed in the regression context are also applicable to classifiers.
There may be issues like class imbalance in classification, which tend to impact the cross validation folds.
The overall class distribution and the ones in folds may be different and this has implications in effective model training.
The sklearn.model_selection module provides the following three stratified APIs to create folds such that the overall class distribution is replicated in individual folds.
Note: splits obtained via StratifiedShuffleSplit are random, so folds may overlap and are not guaranteed to be completely different.
LogisticRegressionCV parameters:
cv specifies the cross validation iterator.
scoring specifies the scoring function to use for HPT.
Cs specifies the regularization strengths to experiment with.
refit = True: scores are averaged across folds, the values corresponding to the best score are selected, and a final refit is done with these parameters.
refit = False: the coefs, intercepts and C that correspond to the best scores across folds are averaged.
Now let's look at classification metrics implemented in sklearn.
sklearn.metrics implements a bunch of classification scoring metrics based on true labels and predicted labels as inputs.
accuracy_score
balanced_accuracy_score
top_k_accuracy_score
roc_auc_score
precision_score
recall_score
f1_score
score(actual_labels, predicted_labels)
confusion_matrix evaluates classification accuracy by computing the confusion matrix, with each row corresponding to the true class.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_predicted)
Entry \(i,j\) in the confusion matrix is the number of observations actually in group \(i\), but predicted to be in group \(j\).
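Example: a small sketch with hand-made labels (the values are illustrative only):
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 1, 1, 2, 2]
y_predicted = [0, 1, 1, 1, 2, 0]
print(confusion_matrix(y_true, y_predicted))
# [[1 1 0]     row 0: one true 0 predicted as 0, one predicted as 1
#  [0 2 0]     row 1: both true 1s predicted as 1
#  [1 0 1]]    row 2: one true 2 predicted as 0, one predicted as 2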
Confusion matrix can be displayed with ConfusionMatrixDisplay API in sklearn.metrics.
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
The classification_report
function builds a text report showing the main classification metrics.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_predicted))
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=2)
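A minimal end-to-end sketch of obtaining scores and feeding them to these curve functions; the synthetic dataset and model choice are assumptions made for illustration:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_scores)
print(roc_auc_score(y_test, y_scores))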
For multiclass and multilabel problems, binary metrics are extended by setting the average parameter (see the sketch after this list):
macro - calculates the mean of the binary metrics.
weighted - computes the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.
micro - gives each sample-class pair an equal contribution to the overall metric.
samples - calculates the metric over the true and predicted classes for each sample in the evaluation data, and returns their average.
None - returns an array with the score for each class.
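A small sketch showing how the average parameter changes the result of f1_score; the labels are illustrative:
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print(f1_score(y_true, y_pred, average='macro'))   # unweighted mean over classes
print(f1_score(y_true, y_pred, average='micro'))   # computed from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average=None))      # per-class scores as an array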
For a given class variable \(y\) and dependent feature vector \(x_1\) through \(x_m\),
the naive conditional independence assumption is given by:
\(P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_m) = P(x_i \mid y)\)
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.
ComplementNB
GaussianNB
BernoulliNB
CategoricalNB
MultinomialNB
GaussianNB
implements the Gaussian Naive Bayes algorithm for classification
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
Instantiate a GaussianNB estimator and then call the fit method using X_train and y_train.
MultinomialNB
implements the naive Bayes algorithm for multinomially distributed data
(text classification)
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
Instantiate a MultinomialNB estimator and then call the fit method using X_train and y_train.
ComplementNB
implements the complement naive Bayes (CNB) algorithm, which is suited for imbalanced data sets.
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()
cnb.fit(X_train, y_train)
Instantiate a ComplementNB estimator and then call the fit method using X_train and y_train.
CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.
BernoulliNB
implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions.
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
Instantiate a BernoulliNB estimator and then call the fit method using X_train and y_train.
CategoricalNB
implements the categorical naive Bayes algorithm suitable for classification with discrete features that are categorically distributed
from sklearn.naive_bayes import CategoricalNB
canb = CategoricalNB()
canb.fit(X_train, y_train)
Instantiate a CategoricalNB estimator and then call the fit method using X_train and y_train.
assumes that each feature, which is described by the index \(i\), has its own categorical distribution.
scikit-learn implements two nearest neighbors classifiers: KNeighborsClassifier and RadiusNeighborsClassifier.
Step 1: Instantiate a KNeighborsClassifier estimator without passing any arguments to it to create a classifier object.
from sklearn.neighbors import KNeighborsClassifier
kneighbor_classifier = KNeighborsClassifier()
Step 2: Call fit method on KNeighbors classifier object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
kneighbor_classifier.fit(X_train, y_train)
The number of neighbors is specified with the n_neighbors parameter.
kneighbor_classifier = KNeighborsClassifier(n_neighbors=3)
Default: n_neighbors = 5
The weight function used in prediction is set with the weights parameter.
kneighbor_classifier = KNeighborsClassifier(weights='uniform')
Default: weights = 'uniform'
The weights parameter also accepts a user-defined function which takes an array of distances as input, and returns an array of the same shape containing the weights.
Example:
def user_weights(weights_array):
    return weights_array
kneighbor_classifier = KNeighborsClassifier(weights=user_weights)
The algorithm parameter selects how the nearest neighbors are computed:
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘brute’ will use a brute-force search
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit method.
kneighbor_classifier = KNeighborsClassifier(algorithm='auto')
Default: algorithm = 'auto'
For the 'ball_tree' and 'kd_tree' algorithms, there are some other parameters to be set: leaf_size, metric and p (see the sketch below).
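A minimal sketch combining these parameters; the values are illustrative assumptions and X_train, y_train are assumed to be defined:
from sklearn.neighbors import KNeighborsClassifier
kneighbor_classifier = KNeighborsClassifier(
    n_neighbors=3,         # number of neighbors that vote
    weights='distance',    # closer neighbors get larger weights
    algorithm='kd_tree',   # tree-based neighbor search
    leaf_size=30,          # leaf size of the tree
    metric='minkowski',    # distance metric
    p=2)                   # p=2 makes minkowski the Euclidean distance
kneighbor_classifier.fit(X_train, y_train)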
Step 1: Instantiate a RadiusNeighborsClassifier estimator without passing any arguments to it to create a classifier object.
from sklearn.neighbors import RadiusNeighborsClassifier
radius_classifier = RadiusNeighborsClassifier()
Step 2: Call fit method on RadiusNeighbors classifier object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
radius_classifier.fit(X_train, y_train)
The radius of the neighborhood is specified with the radius parameter.
radius_classifier = RadiusNeighborsClassifier(radius=1.0)
Default: radius = 1.0
Other parameters (a usage sketch follows this list):
weights - ‘uniform’, ‘distance’ or a [callable] function (default = 'uniform')
algorithm - ‘ball_tree’, ‘kd_tree’, ‘brute’ or ‘auto’ (default = ‘auto’)
leaf_size (default = 30)
metric (default = 'minkowski')
p (default = 2)
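A minimal sketch; the radius value is an illustrative assumption and X_train, y_train are assumed to be defined:
from sklearn.neighbors import RadiusNeighborsClassifier
radius_classifier = RadiusNeighborsClassifier(
    radius=1.5,            # neighbors within this radius vote on the label
    weights='distance',    # closer neighbors get larger weights
    algorithm='ball_tree',
    leaf_size=30)
radius_classifier.fit(X_train, y_train)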
We shall discuss two modules and their meta-estimators for these problem types:
Multiclass classification (sklearn.multiclass): OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier
Multilabel classification (sklearn.multioutput): MultiOutputClassifier, ClassifierChain
In multiclass classification, each sample is labelled with exactly one of n_classes possible classes.
from sklearn.utils.multiclass import type_of_target
type_of_target(y)
Input parameter: y (array-like)
Output: target_type (string)
The sklearn.multiclass meta-estimators are OneVsRestClassifier, OneVsOneClassifier and OutputCodeClassifier.
OneVsOneClassifier constructs one classifier per pair of classes; at prediction time, the class which received the most votes is selected.
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
clf = OneVsRestClassifier(SGDClassifier(loss = 'hinge'))
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('y_test',y_test)
print('y_pred',y_pred)
We shall use the iris dataset, which contains 150 datapoints, each with 4 features, and 3 classes.
from sklearn import datasets
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
clf = OneVsOneClassifier(SGDClassifier(loss = 'hinge'))
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('y_test',y_test)
print('y_pred',y_pred)
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.utils.multiclass import type_of_target
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('target is a', type_of_target(y))
clf = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=2))
clf = clf.fit(X, y)
y_test = y[-2:]
y_pred = clf.predict(X[-2:])
print('y_test \n',y_test)
print('y_pred \n',y_pred)
We shall create a synthetic multilabel dataset for this example.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.multioutput import ClassifierChain
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import jaccard_score
from sklearn.linear_model import LogisticRegression
# Load a multi-label dataset
X, Y = fetch_openml("yeast", version=4, return_X_y=True)
Y = Y == "TRUE"
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
base_lr = LogisticRegression()
ovr = OneVsRestClassifier(base_lr)
ovr.fit(X_train, Y_train)
Y_pred_ovr = ovr.predict(X_test)
ovr_jaccard_score = jaccard_score(Y_test, Y_pred_ovr, average="samples")
chains = [ClassifierChain(base_lr, order="random", random_state=i) for i in range(4)]
for chain in chains:
chain.fit(X_train, Y_train)
Y_pred_chains = np.array([chain.predict(X_test) for chain in chains])
chain_jaccard_scores = [
jaccard_score(Y_test, Y_pred_chain >= 0.5, average="samples")
for Y_pred_chain in Y_pred_chains]
Y_pred_ensemble = Y_pred_chains.mean(axis=0)
ensemble_jaccard_score = jaccard_score(Y_test, Y_pred_ensemble >= 0.5, average="samples")
model_scores = [ovr_jaccard_score] + chain_jaccard_scores
model_scores.append(ensemble_jaccard_score)
model_names = ("Independent","Chain 1","Chain 2","Chain 3","Chain 4","Ensemble")
# Let us plot all the scores
x_pos = np.arange(len(model_names))
fig, ax = plt.subplots(figsize=(7, 4))
ax.grid(True)
ax.set_title("Classifier Chain Ensemble Performance Comparison")
ax.set_xticks(x_pos)
ax.set_xticklabels(model_names, rotation="vertical")
ax.set_ylabel("Jaccard Similarity Score")
ax.set_ylim([min(model_scores) * 0.9, max(model_scores) * 1.1])
colors = ["r"] + ["b"] * len(chain_jaccard_scores) + ["g"]
ax.bar(x_pos, model_scores, alpha=0.5, color=colors)
plt.tight_layout()
plt.show()
Output of ClassifierChain
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
import numpy as np
X, y = np.random.randint(1,50,50), np.hstack(([0] * 45, [1] * 5))
print('X',X)
print('y',y)
skf = StratifiedKFold(n_splits=3)
print('StratifiedKFold')
count = 1
for train, test in skf.split(X, y):
print('Split', count)
print('train - {} | test - {}'.format(np.bincount(y[train]), np.bincount(y[test])))
print('train',X[train])
print('test',X[test])
count+=1
print('StratifiedShuffleSplit')
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
count = 1
for train_index, test_index in sss.split(X, y):
print('Split', count)
print('train - {} | test - {}'.format(np.bincount(y[train_index]), np.bincount(y[test_index])))
print('train',X[train_index])
print('test',X[test_index])
count+=1
Example to compare StratifiedKFold and StratifiedShuffleSplit
Output:
Calibration curves
compare how well the probabilistic predictions of a binary classifier are calibrated.
plots the true frequency of the positive label against its predicted probability, for binned predictions.
x axis : average predicted probability in each bin
y axis : fraction of positives, i.e. the proportion of samples whose class is the positive class (in each bin).
Image Source: https://scikit-learn.org/stable/modules/calibration.html
LogisticRegression returns well calibrated predictions by default as it directly optimizes Log loss.
GaussianNB tends to push probabilities to 0 or 1.
RandomForestClassifier peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare.
LinearSVC focus on difficult to classify samples that are close to the decision boundary (the support vectors).
from sklearn.calibration import CalibratedClassifierCV
calibrated_clf = CalibratedClassifierCV()
CalibratedClassifierCV estimates the parameters of a classifier and subsequently calibrates the classifier.
Its main parameters are base_estimator, method, cv and ensemble.
The ensemble parameter determines how the calibrator is fitted when cv is not 'prefit'; it is ignored when cv='prefit'.
With cv='prefit', the base estimator is assumed to be already fitted and all of the data is used for calibration.
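A minimal sketch of calibrating a naive Bayes classifier; the base estimator, method and fold count are assumptions made for illustration, and X_train, y_train, X_test are assumed to be defined:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
# Wrap an unfitted GaussianNB and calibrate its probabilities with 3-fold CV
calibrated_clf = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=3)
calibrated_clf.fit(X_train, y_train)
calibrated_proba = calibrated_clf.predict_proba(X_test)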
Class: sklearn.linear_model.RidgeClassifier
Some Parameters:
alpha (default = 1.0) - regularization strength
fit_intercept (default = True)
solver (default = 'auto') - 'svd', 'cholesky', 'sparse_cg', 'lsqr', 'sag', 'saga' or 'lbfgs'
Class: sklearn.linear_model.LogisticRegression
Some Parameters:
penalty (default = 'l2')
'none' - no penalty is added
'l2' - add a L2 penalty term and it is the default choice
'l1' - add a L1 penalty term
'elasticnet' - both L1 and L2 penalty terms are added
solver (default = 'lbfgs')
'liblinear' - uses a coordinate descent (CD) algorithm
'lbfgs' - an optimizer in the family of quasi-Newton methods.
'newton-cg', 'sag', 'saga'
SGDClassifier(loss='log') trains a logistic regression classifier using SGD.
SGDClassifier(loss='hinge') trains a linear support vector machine.
Class: sklearn.linear_model.SGDClassifier
This estimator implements regularized linear models with SGD.
The gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing learning rate.
Some parameters
penalty - 'l2’, ‘l1’, ‘elasticnet’ (default = 'l2')
loss (default = 'hinge')
'hinge' - (soft-margin) linear Support Vector Machine,
'modified_huber' - smoothed hinge loss brings tolerance to outliers as well as probability estimates
'log' - logistic regression
'squared_hinge' - like hinge but is quadratically penalized
'perceptron' - linear loss used by the perceptron algorithm
regression losses - ‘squared_error’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’
alpha (default = 0.0001)
constant that multiplies the regularization term.
fit_intercept (default = True)
If False, the data is assumed to be already centered.
max_iter (default = 1000)
maximum number of passes over the training data (aka epochs).
learning_rate (default = ’optimal’)
‘constant’: eta = eta0
(default eta0=0.0, initial learning rate)
‘optimal’: eta = 1.0 / (alpha * (t + t0))
where t0 is chosen by a heuristic proposed by Leon Bottou.
‘invscaling’: eta = eta0 / pow(t, power_t)
‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol, or fail to increase the validation score by tol if early_stopping is True, the current learning rate is divided by 5.
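A minimal sketch selecting one of these schedules; the values are illustrative assumptions:
from sklearn.linear_model import SGDClassifier
# Keep the learning rate at eta0 until progress stalls, then divide it by 5
sgd_adaptive = SGDClassifier(learning_rate='adaptive', eta0=0.01, tol=1e-3, n_iter_no_change=5)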
tol (default = 1e-3)
stopping criterion.
If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change
consecutive epochs.
Convergence is checked against the training loss or the validation loss depending on the early_stopping
parameter.
early_stopping (default = False)
to terminate training when validation score is not improving.
If set to True, it will automatically set aside a stratified fraction of training data as validation and terminate training when validation score returned by the score
method is not improving by at least tol for n_iter_no_change consecutive epochs
validation_fraction (default = 0.1)
proportion of training data to set aside as validation set for early stopping
. Must be between 0 and 1.
Only used if early_stopping is True.
n_iter_no_change (default = 5)
Number of iterations with no improvement to wait before stopping fitting.
Convergence is checked against the training loss or the validation loss depending on the early_stopping parameter.
class_weight (default = None) {class_label: weight} or “balanced”,
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
.
Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one.
It is a simple classification algorithm suitable for large-scale learning.
Class: sklearn.linear_model.Perceptron
Some Parameters:
penalty - 'l2', 'l1', 'elasticnet' (default = None)
alpha - (default = 0.0001)
l1_ratio - (default = 0.15)
fit_intercept - (default = True)
max_iter - (default = 1000)
tol - (default = 1e-3)
eta0 - (default = 1)
early_stopping - (default = False)
validation_fraction - (default = 0.1)
n_iter_no_change - (default = 5)
scikit-learn implements two different nearest neighbors classifiers:
KNeighborsClassifier | RadiusNeighborsClassifier |
---|---|
implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. | implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. |
most commonly used technique; the choice of the value k is highly data-dependent | used in cases where the data is not uniformly sampled |
larger k suppresses the effects of noise, but makes the classification boundaries less distinct. | the user specifies a fixed radius r, such that points in sparser neighborhoods use fewer nearest neighbors for the classification |
Class: sklearn.neighbors.KNeighborsClassifier
Some Parameters
n_neighbors (default = 5) - number of neighbors to use
Class: sklearn.neighbors.RadiusNeighborsClassifier
Some Parameters
radius (default = 1.0) - range of parameter space to use by default for radius_neighbors queries
weights (‘uniform’, ‘distance’, [callable], default = ’uniform')
algorithm (‘ball_tree’, ‘kd_tree’, ‘brute’, ‘auto’, default = 'auto'
leaf_size (default = 30)
p (default = 2)
metric (default = ’minkowski’)
Distance metric to use for the tree.
GaussianNB - implements the Gaussian Naive Bayes algorithm for classification
MultinomialNB - implements the naive Bayes algorithm for multinomially distributed data (text classification)
ComplementNB - implements the complement naive Bayes (CNB) algorithm (suited for imbalanced data sets)
BernoulliNB - implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions
CategoricalNB - implements the categorical naive Bayes algorithm for categorically distributed data.
Problem types and their meta-estimators:
Multiclass classification (sklearn.multiclass): OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier
Multilabel classification (sklearn.multioutput): MultiOutputClassifier, ClassifierChain
sklearn.utils.multiclass.type_of_target
determines the type of data indicated by the target.
Parameters : y (array-like), Returns : target_type (string)
target_type | y |
---|---|
'continuous' | array-like of floats that are not all integers and is 1d or a column vector. |
'continuous-multioutput' | 2d array of floats that are not all integers, and both dimensions are of size > 1. |
‘binary’ | contains <= 2 discrete values and is 1d or a column vector. |
‘multiclass’ | contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector. |
‘multiclass-multioutput’ | 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1. |
‘unknown’ | array-like but none of the above, such as a 3d array, sequence of sequences, or an array of non-sequence objects. |
>>> from sklearn.utils.multiclass import type_of_target
>>> import numpy as np
>>> type_of_target([0.1, 0.6])
'continuous'
>>> type_of_target([1, -1, -1, 1])
'binary'
>>> type_of_target(['a', 'b', 'a'])
'binary'
>>> type_of_target([1.0, 2.0])
'binary'
>>> type_of_target([1, 0, 2])
'multiclass'
>>> type_of_target([1.0, 0.0, 3.0])
'multiclass'
>>> type_of_target(['a', 'b', 'c'])
'multiclass'
>>> type_of_target(np.array([[1, 2], [3, 1]]))
'multiclass-multioutput'
>>> type_of_target(np.array([[1.5, 2.0], [3.0, 1.6]]))
'continuous-multioutput'
Constructs one classifier per pair of classes.
At prediction time, the class which received the most votes is selected. In the event of a tie, it selects the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers.
classifiers needed = \(\dfrac{n_{classes}\times(n_{classes} - 1)}{2} \)
slower than one-vs-the-rest, due to its \(O(n_{classes}^2)\) complexity.
advantage: suitable for kernel algorithms which don’t scale well with the number of samples.
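For example, with \(n_{classes} = 10\), one-vs-one needs \(\dfrac{10 \times 9}{2} = 45\) binary classifiers, whereas one-vs-rest needs only 10.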
target_type | y |
---|---|
‘multilabel-indicator’ | label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values. |
>>> type_of_target(np.array([[0, 1], [1, 1]]))
'multilabel-indicator'
>>> type_of_target([[1, 2]])
'multilabel-indicator'
from sklearn import datasets
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
clf = OutputCodeClassifier(SGDClassifier(loss='hinge'), code_size=2)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('y_test',y_test)
print('y_pred',y_pred)
MultiOutputClassifier | ClassifierChain |
---|---|
Strategy consists of fitting one classifier per target. | Way of combining a number of binary classifiers into a single multi-label model that is capable of exploiting correlations among targets. |
Allows multiple target variable classifications. Able to estimate a series of target functions that are trained on a single predictor matrix to predict a series of responses. | For a multi-label classification problem with N classes, N binary classifiers are assigned an integer between 0 and N-1. These integers define the order of models in the chain. |
Calibration curves
compare how well the probabilistic predictions of a binary classifier are calibrated.
plots the true frequency of the positive label against its predicted probability, for binned predictions.
x axis : average predicted probability in each bin
y axis : fraction of positives, i.e. the proportion of samples whose class is the positive class (in each bin).
Image Source: https://scikit-learn.org/stable/modules/calibration.html
LogisticRegression returns well calibrated predictions by default as it directly optimizes Log loss.
GaussianNB tends to push probabilities to 0 or 1.
RandomForestClassifier peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare.
LinearSVC focus on difficult to classify samples that are close to the decision boundary (the support vectors).
ensemble = True (default): for each cv split, a copy of the base estimator is fitted on the training subset and calibrated on the test subset, and the predictions of these calibrated copies are averaged.
ensemble = False: cross-validated predictions are used to fit a single calibrator on top of the base estimator.
Model selection for classification
Class: sklearn.model_selection.StratifiedKFold
Some Parameters:
n_splits (default = 5)
Number of folds. Must be at least 2.
shuffle (default = False)
whether to shuffle each class’s samples before splitting into batches; samples within each split will not be shuffled.
random_state RandomState instance or None, (default=None)
set random_state
when shuffle = True
because it affects the ordering of the indices, which controls the randomness of each fold for each class.
Class: sklearn.model_selection.StratifiedShuffleSplit
Some Parameters:
n_splits (default = 10) - number of re-shuffling and splitting iterations.
test_size, train_size - proportion (float) or absolute number (int) of samples in the test/train split.
random_state - controls the randomness of the splits.
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
import numpy as np
X, y = np.random.randint(1,50,50), np.hstack(([0] * 45, [1] * 5))
print('X',X)
print('y',y)
skf = StratifiedKFold(n_splits=3)
print('StratifiedKFold')
count = 1
for train, test in skf.split(X, y):
print('Split', count)
print('train - {} | test - {}'.format(np.bincount(y[train]), np.bincount(y[test])))
print('train',X[train])
print('test',X[test])
count+=1
print('StratifiedShuffleSplit')
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
count = 1
for train_index, test_index in sss.split(X, y):
print('Split', count)
print('train - {} | test - {}'.format(np.bincount(y[train_index]), np.bincount(y[test_index])))
print('train',X[train_index])
print('test',X[test_index])
count+=1
Example to compare StratifiedKFold and StratifiedShuffleSplit
Output:
LogisticRegressionCV performs cross-validation over a grid of Cs values and l1_ratios values.
Class: sklearn.linear_model.LogisticRegressionCV
Some Parameters:
'Cs' (default = 10)
Each of the values in Cs describes the inverse of regularization strength.
If int, then a grid of values = logarithmic scale between \(1e^{-4}\) & \(1e^4\).
'cv' (default = None)
The default cross-validation generator used is Stratified K-Folds.
If an integer is provided, then it is the number of folds used.
scoring (default = None)
A string or scorer(estimator, X, y)
. (default scoring option used is ‘accuracy’).
penalty (‘l1’, ‘l2’, ‘elasticnet’, default=‘l2’)
refit (default = True)
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters.
Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
l1_ratios list of float, (default = None)
The list of Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1
.
Only used if penalty='elasticnet'.
A value of 0 is equivalent to using penalty='l2', while 1 is equivalent to using penalty='l1'.
For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
Multiclass classification
Note: The multiclass and multilabel metrics also work for binary classification.
1. sklearn.metrics.precision_recall_curve
2. sklearn.metrics.roc_curve
3. sklearn.metrics.det_curve
average
parameter.
"macro"
- calculates the mean of the binary metrics, giving equal weight to each class.
"weighted"
- computes the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.
"micro"
- gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). (preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.)
"samples"
- calculates the metric over the true and predicted classes for each sample in the evaluation data, and returns their (sample_weight-weighted) average.
"None"
will return an array with the score for each class.
1. sklearn.metrics.confusion_matrix
2. sklearn.metrics.balanced_accuracy_score
3. sklearn.metrics.cohen_kappa_score
4. sklearn.metrics.hinge_loss
5. sklearn.metrics.matthews_corrcoef
6. sklearn.metrics.roc_auc_score
7. sklearn.metrics.top_k_accuracy_score
1. sklearn.metrics.accuracy_score
2. sklearn.metrics.multilabel_confusion_matrix
from sklearn.metrics import multilabel_confusion_matrix
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
multilabel_confusion_matrix(y_true, y_pred,labels=["ant", "bird", "cat"])
3. sklearn.metrics.classification_report
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
4. sklearn.metrics.zero_one_loss
If the normalize parameter is True, this function returns the fraction of misclassifications (float), else it returns the number of misclassifications (int). For example:
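A small sketch of zero_one_loss with the same hand-made labels (the values are illustrative only):
from sklearn.metrics import zero_one_loss
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
print(zero_one_loss(y_true, y_pred))                    # fraction misclassified: 0.4
print(zero_one_loss(y_true, y_pred, normalize=False))   # number misclassified: 2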
5. sklearn.metrics.hamming_loss
6. sklearn.metrics.log_loss
requires probability estimates, such as those returned by the predict_proba method.
7. sklearn.metrics.jaccard_score
compares sets of predicted labels for samples to the corresponding sets of labels in y_true.
8. Precision, recall and F-measures
Note: Best value is 1 and the worst value is 0 for these scores.
Metric | Description |
---|---|
sklearn.metrics.precision_score | computes precision, which is intuitively the ability of the classifier not to label as positive a sample that is negative. |
sklearn.metrics.recall_score | computes recall, which is intuitively the ability of the classifier to find all the positive samples. |
sklearn.metrics.f1_score | computes the harmonic mean of precision and recall |
sklearn.metrics.fbeta_score | computes the weighted harmonic mean of precision and recall |
sklearn.metrics.average_precision_score | computes the average precision from prediction scores (this score does not support multiclass) |
9. sklearn.metrics.precision_recall_fscore_support