Machine Learning Practice
DummyRegressor helps in creating a baseline for regression.
from sklearn.dummy import DummyRegressor

# Baseline model that always predicts the mean of the training labels
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, y_train)
dummy_regr.predict(X_test)
dummy_regr.score(X_test, y_test)
The strategy parameter can be set to one of the following values:
mean
median
quantile (additionally requires the quantile argument)
constant (additionally requires the constant argument)
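For the last two strategies, a minimal sketch (with illustrative argument values of our own choosing) looks like this:

from sklearn.dummy import DummyRegressor

# 'quantile' needs the quantile to predict; 'constant' needs the constant value.
dummy_quantile = DummyRegressor(strategy="quantile", quantile=0.5)    # predicts the median
dummy_constant = DummyRegressor(strategy="constant", constant=42.0)   # always predicts 42.0 (illustrative value)
dummy_constant.fit(X_train, y_train)
dummy_constant.predict(X_test)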
Step 1: Instantiate an object of a suitable linear regression estimator from one of the following two options.

Option #1: Normal equation
from sklearn.linear_model import LinearRegression
linear_regressor = LinearRegression()

Option #2: Iterative optimization
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor()
Step 2: Call fit method on linear regression object with training feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
linear_regressor.fit(X_train, y_train)
Works for both single and multi-output regression.
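As an illustration, a small sketch with made-up synthetic data (X_demo and y_demo are hypothetical names) showing a multi-output fit:

import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.random.rand(100, 3)   # 100 samples, 3 features
y_demo = np.random.rand(100, 2)   # 100 samples, 2 output labels per sample
multi_output_regressor = LinearRegression()
multi_output_regressor.fit(X_demo, y_demo)
print(multi_output_regressor.predict(X_demo).shape)   # (100, 2): one prediction per output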
SGDRegressor accepts the following key parameters:
loss = 'squared_error' or 'huber'
penalty = 'l1', 'l2' or 'elasticnet'
learning_rate = 'constant', 'optimal', 'invscaling' or 'adaptive'
early_stopping = True or False
It's a good idea to use a random seed of your choice while instantiating the SGDRegressor object. It helps us get reproducible results.

Set random_state to a seed of your choice.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(random_state=42)
Note: In the rest of the presentation, we won't set the random seed for the sake of brevity. However, while coding, always set the random seed in the constructor.
SGD is sensitive to feature scaling, so it is highly recommended to scale the input feature matrix.
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sgd = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('sgd_regressor', SGDRegressor())])
sgd.fit(X_train, y_train)
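A short usage sketch for the fitted pipeline (assuming X_test and y_test from the earlier split):

# The pipeline behaves like a single estimator: scaling is applied automatically.
y_predicted = sgd.predict(X_test)
sgd.score(X_test, y_test)
# The fitted SGDRegressor inside the pipeline can be accessed by its step name.
sgd.named_steps['sgd_regressor'].coef_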
Note: Features extracted using PCA should be scaled by some constant \(c\) such that the average L2 norm of the training data equals one.
By default, SGDRegressor shuffles the training data after each epoch (shuffle=True).
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(shuffle=True)
The learning_rate parameter can be set to 'constant', 'invscaling' or 'adaptive'.
What is the default setting?
The default is learning_rate='invscaling', with eta0=1e-2 and power_t=0.25.
from sklearn.linear_model import SGDRegressor
# no learning_rate argument: the default 'invscaling' schedule is used
linear_regressor = SGDRegressor(random_state=42)
The learning rate decreases after every iteration: eta = eta0 / pow(t, power_t)
Note: You can make changes to these parameters to speed up or slow down the training process.
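To see how the default schedule behaves, here is a tiny illustrative sketch (values shown only for intuition):

eta0, power_t = 1e-2, 0.25
for t in [1, 10, 100, 1000]:
    # learning rate used at iteration t under the 'invscaling' schedule
    print(t, eta0 / pow(t, power_t))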
How to set a constant learning rate?

Set learning_rate='constant' and eta0 to the desired learning rate, e.g. eta0=1e-2.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(learning_rate='constant',
                                eta0=1e-2)
The same learning rate eta0 is used throughout training.
How to set an adaptive learning rate?

Set learning_rate='adaptive' and eta0 to the initial learning rate.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(learning_rate='adaptive',
                                eta0=1e-2)
The learning rate is kept at the initial value as long as the training loss decreases.
When the stopping criterion is reached, the learning rate is divided by 5, and the training loop continues.
The algorithm stops when the learning rate goes below \(10^{-6}\).
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(max_iter=100)
Set max_iter to desired #epochs. The default value is 1000.
Remember one epoch is one full pass over the training data.
Practical tip: SGD converges after observing approximately \(10^6\) training samples. Thus, a reasonable first guess for a training set with \(n\) samples is max_iter = np.ceil(10**6 / n).
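A minimal sketch of this heuristic (assuming X_train as in the earlier snippets):

import numpy as np
from sklearn.linear_model import SGDRegressor

n = X_train.shape[0]   # number of training samples
linear_regressor = SGDRegressor(max_iter=int(np.ceil(10**6 / n)))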
Option #1: Set tol, n_iter_no_change and max_iter.

from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss='squared_error',
                                max_iter=500,
                                tol=1e-3,
                                n_iter_no_change=5)

The SGDRegressor stops when the training loss does not improve by more than tol for n_iter_no_change consecutive epochs, or when max_iter epochs are reached.
Option #2: Set early_stopping and validation_fraction (along with tol, n_iter_no_change and max_iter).

from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss='squared_error',
                                early_stopping=True,
                                max_iter=500,
                                tol=1e-3,
                                validation_fraction=0.2,
                                n_iter_no_change=5)

A validation_fraction of the training set is set aside as a validation set, and the score method is used to obtain the validation score. The SGDRegressor stops when the validation score does not improve by at least tol for n_iter_no_change consecutive epochs, or when max_iter epochs are reached.
Set the loss parameter to one of the supported values studied in this course, e.g. 'squared_error'.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss='squared_error')
SGDRegressor also supports other losses, as documented in the sklearn API.
Averaged SGD updates the weight vector to the average of the weights from previous updates.
Option #1: Averaging across all updates. Set average=True.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(average=True)
Option #2: Set average to an int value.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(average=10)
Averaging begins once the total number of samples seen reaches average. Setting average=10 starts averaging after seeing 10 samples.
Averaged SGD works best with a larger number of features and a higher eta0.
Set warm_start=True while instantiating the SGDRegressor object.
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(warm_start=True)
By default, warm_start=False.
Make use of warm_start=True when the model needs to be trained incrementally, e.g. one epoch at a time with evaluation on a validation set after each call to fit:

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# tol=None disables loss-based stopping, so each fit call runs exactly one epoch
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_predict)
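A possible extension of this loop (a sketch, not from the slides) keeps a snapshot of the model with the lowest validation error, which amounts to a simple manual early stopping:

from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)
best_val_error, best_model = np.inf, None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)   # continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:
        # keep a copy of the best model seen so far
        best_val_error, best_model = val_error, deepcopy(sgd_reg)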
The model is \(\hat{y} = \color{red}{w_0} \color{black}+ \color{red}{w_1} \color{black}{x_1} + \color{red}{w_2} \color{black}{x_2} + \ldots + \color{red}{w_m} \color{black}{x_m}\) = \( \color{red}{\mathbf{w}^T}\color{black} \mathbf{x}\)

The weights \(\color{red}{w_1, w_2, \ldots, w_m}\) are stored in the coef_ class variable:
linear_regressor.coef_

The intercept \(\color{red}{w_0}\) is stored in the intercept_ class variable:
linear_regressor.intercept_

Note: These code snippets work for both LinearRegression and SGDRegressor, and for that matter for all regression estimators that we will study in this module. Why? All of them are estimators.
Step 1: Arrange data for prediction in a feature matrix of shape (#samples, #features) or in sparse matrix format.
Step 2: Call the predict method on the linear regression object with the feature matrix as an argument.
# Predict labels for feature matrix X_test
linear_regressor.predict(X_test)
The same code works for all regression estimators.
STEP 1: Split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
STEP 2: Fit linear regression estimator on the training set.
STEP 3: Calculate training error (a.k.a. empirical error).
STEP 4: Calculate test error (a.k.a. generalization error).
Compare training and test errors (a minimal sketch of these steps follows).
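A minimal sketch of STEPS 2 to 4, using mean squared error as one possible choice of error metric:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)                                          # STEP 2
train_error = mean_squared_error(y_train, linear_regressor.predict(X_train))    # STEP 3
test_error = mean_squared_error(y_test, linear_regressor.predict(X_test))       # STEP 4
print(train_error, test_error)                                                  # compare the two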
Using score method on linear regression object:
# Evaluation on the eval set with
# 1. feature matrix
# 2. label vector or matrix (single/multi-output)
linear_regressor.score(X_test, y_test)
The score method returns \(R^2\), the coefficient of determination:

\(R^2 = 1 - \dfrac{u}{v}\)

where \(u=(\mathbf{Xw}-\mathbf{y})^T(\mathbf{Xw}-\mathbf{y})\) is the residual sum of squares (sum of squared errors between actual and predicted labels), and \(v =(\mathbf{y}-\mathbf{\hat{y}_{mean}})^T (\mathbf{y}-\mathbf{\hat{y}_{mean}})\) is the total sum of squares (sum of squared errors between actual labels and the mean label).

The best possible score is 1.0. It is obtained when \(u = 0\), i.e. the sum of squared errors is zero.
A constant model that always predicts the expected value of \(y\) gets a score of 0.0, which happens when \(u = v\).
The score can be negative, because the model can be arbitrarily worse.
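To connect the formula to code, here is a small sketch (single-output case) that computes \(R^2\) from \(u\) and \(v\) and compares it with r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_pred = linear_regressor.predict(X_test)
u = np.sum((y_test - y_pred) ** 2)            # residual sum of squares
v = np.sum((y_test - y_test.mean()) ** 2)     # total sum of squares
r2_manual = 1 - u / v
r2_from_sklearn = r2_score(y_test, y_pred)    # matches r2_manual for single-output data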
sklearn provides a number of regression metrics to evaluate performance of the trained estimator on the evaluation set:
mean_absolute_error
mean_squared_error
r2_score (same as the output of the score method)
These metrics can also be used in a multi-output regression setup.

from sklearn.metrics import mean_absolute_error
eval_score = mean_absolute_error(y_test, y_predicted)

from sklearn.metrics import mean_squared_error
eval_score = mean_squared_error(y_test, y_predicted)

from sklearn.metrics import r2_score
eval_score = r2_score(y_test, y_predicted)
mean_squared_log_error
mean_absolute_percentage_error
median_absolute_error
from sklearn.metrics import mean_squared_log_error
eval_score = mean_squared_log_error(y_test, y_predicted)
from sklearn.metrics import mean_absolute_percentage_error
eval_score = mean_absolute_percentage_error(y_test, y_predicted)
from sklearn.metrics import median_absolute_error
eval_score = median_absolute_error(y_test, y_predicted)
Use the max_error metric to obtain the worst case error.

Worst case error on the test set can be calculated as follows:
from sklearn.metrics import max_error
test_error = max_error(y_test, y_predicted)

Worst case error on the train set can be calculated as follows:
from sklearn.metrics import max_error
train_error = max_error(y_train, y_predicted)

This metric can, however, be used only for single-output regression. It does not support multi-output regression.
Convert an error metric to a score metric by adding the neg_ prefix.
Function | Scoring
---|---
metrics.mean_absolute_error | neg_mean_absolute_error
metrics.mean_squared_error | neg_mean_squared_error
metrics.mean_squared_error | neg_root_mean_squared_error
metrics.mean_squared_log_error | neg_mean_squared_log_error
metrics.median_absolute_error | neg_median_absolute_error
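As a quick illustration (using get_scorer, which is not covered in the slides), the scorer built from a neg_ name returns the negated error, so a higher value means a better model:

from sklearn.metrics import get_scorer, mean_absolute_error

scorer = get_scorer("neg_mean_absolute_error")
score_value = scorer(linear_regressor, X_test, y_test)            # assumes a fitted regressor
error_value = mean_absolute_error(y_test, linear_regressor.predict(X_test))
# score_value == -error_value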
If we get comparable performance on train and test with this split, is this performance guaranteed on other splits too?
We use cross validation for robust evaluation of model performance.
It enables us to estimate the variability in the generalization performance of the model.
sklearn implements the following cross validation iterators:
KFold
RepeatedKFold
LeaveOneOut
ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
score = cross_val_score(lin_reg, X, y, cv=5)
An alternate way of writing the same thing:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
# cv=5 uses 5-fold cross validation, equivalent to KFold(n_splits=5)
kfold_cv = KFold(n_splits=5)
score = cross_val_score(lin_reg, X, y, cv=kfold_cv)
Leave-one-out cross validation:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
loocv = LeaveOneOut()
score = cross_val_score(lin_reg, X, y, cv=loocv)

which is the same as
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
n = X.shape[0]
kfold_cv = KFold(n_splits=n)
score = cross_val_score(lin_reg, X, y, cv=kfold_cv)
ShuffleSplit is also called the random permutation based cross validation strategy. In each iteration, it shuffles the order of data samples and then splits them into train and test.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
score = cross_val_score(lin_reg, X, y, cv=shuffle_split)
The scoring parameter can be set to one of the scoring schemes implemented in sklearn, as follows:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
score = cross_val_score(lin_reg, X, y, cv=shuffle_split,
                        scoring='neg_mean_absolute_error')
neg_mean_absolute_error
neg_mean_squared_error
neg_root_mean_squared_error
neg_mean_squared_log_error
neg_median_absolute_error
r2
max_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)
cv_results = cross_validate(
    regressor, data, target, cv=cv, scoring="neg_mean_absolute_error")
The results are stored in a Python dictionary with the following keys:
fit_time
score_time
test_score
estimator (optional)
train_score (optional)
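For instance, a short sketch of inspecting the returned dictionary (continuing from the snippet above, where the scores are negated mean absolute errors):

test_errors = -cv_results["test_score"]        # flip the sign to get the errors back
print(test_errors.mean(), test_errors.std())   # average error and its variability across splits
print(cv_results["fit_time"].mean())           # average fitting time per split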
Set return_train_score=True and return_estimator=True to also obtain the training scores and the fitted estimators:

from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=40, test_size=0.3,
                  random_state=0)
cv_results = cross_validate(
    regressor, data, target,
    cv=cv, scoring="neg_mean_absolute_error",
    return_train_score=True,
    return_estimator=True)

The estimators can be accessed through the estimator key of the dictionary returned by cross_validate.
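A small sketch of accessing them (assuming the regressor is a linear model, so that coef_ and intercept_ exist):

fitted_models = cv_results["estimator"]   # one fitted estimator per split
first_model = fitted_models[0]
print(first_model.coef_, first_model.intercept_)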
cross_validate, unlike cross_val_score, allows us to specify multiple scoring metrics:

from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=40, test_size=0.3,
                  random_state=0)
cv_results = cross_validate(
    regressor, data, target,
    cv=cv,
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
    return_train_score=True,
    return_estimator=True)
STEP 1: Call the learning_curve function with the estimator, training data, training set sizes, cross validation strategy and scoring scheme as arguments.

from sklearn.model_selection import learning_curve
results = learning_curve(
    lin_reg, X_train, y_train, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error")
train_size, train_scores, test_scores = results[:3]
# Convert the scores into errors
train_errors, test_errors = -train_scores, -test_scores

STEP 2: Plot training and test scores as a function of the size of the training set, and assess model fit: underfitting, overfitting or a good fit (a plotting sketch follows).
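A possible sketch of STEP 2 with matplotlib (plot styling is a matter of choice):

import matplotlib.pyplot as plt

# Average the errors over the cross validation folds for each training set size.
plt.plot(train_size, train_errors.mean(axis=1), label="training error")
plt.plot(train_size, test_errors.mean(axis=1), label="test error")
plt.xlabel("Number of training samples")
plt.ylabel("Mean absolute error")
plt.legend()
plt.show()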
STEP 1: Fit linear models with different number of features.
STEP 2: For each model, obtain training and test errors.
STEP 3: Plot #features vs error graph - one each for training and test errors.
STEP 4: Examine the graphs to detect under/overfitting.
We can replace #features with any other tunable hyperparameter to perform the same diagnosis and set that hyperparameter to an appropriate value (a sketch of the #features version follows).
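A rough sketch of these steps, assuming X_train and X_test are NumPy arrays and taking "the first k features" as the nested feature subsets:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

train_errs, test_errs = [], []
num_features = X_train.shape[1]
for k in range(1, num_features + 1):
    model = LinearRegression().fit(X_train[:, :k], y_train)                        # STEP 1
    train_errs.append(mean_squared_error(y_train, model.predict(X_train[:, :k])))  # STEP 2
    test_errs.append(mean_squared_error(y_test, model.predict(X_test[:, :k])))
# STEPS 3 and 4: plot k vs train_errs and k vs test_errs and inspect the gap.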