Machine Learning Practice Course
1. Steps in ML projects
2. Illustration through practical set up
An excellent wine company wants to develop an ML model that predicts wine quality from certain physicochemical characteristics, in order to replace an expensive quality sensor.
Let's understand steps involved in addressing this problem.
1.1 Frame the problem
Let's load the basic libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(data_url, sep=";")
Now that the data is loaded, let's examine it.
Let's look at a few data samples with head() method.
data.head()
It's a good idea to understand the significance of each feature by consulting domain experts.
Feature | Significance |
---|---|
Fixed acidity | Most acids involved with wine are fixed or nonvolatile (do not evaporate readily) |
Volatile acidity | The amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste |
Citric acid | Found in small quantities, citric acid can add 'freshness' and flavor to wines |
Residual sugar | It's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet |
Chlorides | The amount of salt in the wine |
… | … |
Alcohol | The percentage of alcohol content in the wine |
feature_list = data.columns[:-1].values
label = [data.columns[-1]]
print ("Feature list:", feature_list)
print ("Label:", label)
Feature list: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar' 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH' 'sulphates' 'alcohol'] Label: ['quality']
Let's use the info() method to get a quick description of the data.
data.info()
In order to understand the nature of the numeric attributes, we use the describe() method.
data.describe()
This prints the count and statistical properties of each numeric attribute: mean, standard deviation and quartiles.
Let's also look at the distribution of the label:
data['quality'].value_counts()
The same information can be visualized with a histogram plot.
sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')
Note taller bars for quality 5 and 6 compared to the other qualities.
In a similar manner, we can plot histograms of all the numerical attributes for a quick examination, as sketched below.
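A quick way to do that (the bin count and figure size are arbitrary choices on our part, not from the original):
# Plot a histogram of every numeric attribute in a single figure.
data.hist(bins=20, figsize=(12, 8))
plt.tight_layout()
plt.show()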
A few observations based on these plots: the attributes are on very different scales, and several of them have long right tails (they are skewed). We will need to take care of scaling during preprocessing.
Before any further exploration, it's a good idea to set aside a test set and not look at it, so that we have a clean evaluation set.
Let's write a function to split the data into training and test sets. We set the random seed so that we get the same test set on every run.
def split_train_test(data, test_ratio):
    # Set the random seed so the split is reproducible.
    np.random.seed(42)
    # Shuffle the dataset.
    shuffled_indices = np.random.permutation(len(data))
    # Calculate the size of the test set.
    test_set_size = int(len(data) * test_ratio)
    # Split the shuffled indices to get training and test sets.
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(data, 0.2)
Scikit-Learn provides a few functions for creating test sets based on random or stratified sampling.
The train_test_split() function performs random sampling. It accepts a random_state parameter to set the seed, and it can process multiple datasets with an identical number of rows, selecting the same indices from each of them (see the sketch after the random split below).
from sklearn.model_selection import train_test_split
We can read the documentation for this function by using the following line of code:
?train_test_split
Let's perform random sampling on our dataset:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
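As noted above, train_test_split can also split several aligned datasets in one call, using the same shuffled indices for each. A minimal sketch (the variable names X and y are our own choice, not from the original):
# Split features and labels together so their rows stay aligned.
X = data.drop('quality', axis=1)
y = data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)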
Recall the label distribution in our dataset: It's not uniform!
sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')
How do we sample in such cases? We use stratified sampling on the quality label, so that the test set preserves roughly the same proportion of each quality level as the full dataset:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["quality"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
Let's examine the test-set distribution of wine quality, the attribute we used for stratified sampling.
strat_dist = strat_test_set["quality"].value_counts() / len(strat_test_set)
Now compare this with the overall distribution:
overall_dist = data["quality"].value_counts() / len(data)
Let's look at them side-by-side:
dist_comparison = pd.DataFrame({'overall': overall_dist, 'stratified': strat_dist})
dist_comparison['diff(s-o)'] = dist_comparison['stratified'] - dist_comparison['overall']
dist_comparison['diff(s-o)_pct'] = 100*(dist_comparison['diff(s-o)']/dist_comparison['overall'])
You can notice that there is a small difference in most strata.
dist_comparison
Let's contrast this with random sampling:
random_dist = test_set["quality"].value_counts() / len(test_set)
random_dist
Now compare the differences in distribution under stratified and random sampling:
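The random-sampling comparison columns used next are not computed in the excerpt above; here is one way they could be added (the column names simply mirror the diff(s-o) ones):
# Add the random-sampling distribution and its deviation from the overall one.
dist_comparison['random'] = random_dist
dist_comparison['diff(r-o)'] = dist_comparison['random'] - dist_comparison['overall']
dist_comparison['diff(r-o)_pct'] = 100*(dist_comparison['diff(r-o)']/dist_comparison['overall'])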
dist_comparison.loc[:, ['diff(s-o)_pct', 'diff(r-o)_pct']]
Data exploration is performed on the training set. When the training set is very large, we sample examples to form a smaller exploration set (a sketch follows this paragraph). Exploration enables us to understand the features, their relationships among themselves, and their relationship with the output label.
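For illustration only, if the training set were large, an exploration set could be sampled as shown below (the sample size is an arbitrary choice of ours):
# Explore a random sample instead of the full training set (not needed here).
# exploration_sample = strat_train_set.sample(n=1000, random_state=42)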
In our case the training set is small, so we use all of it for data exploration; there is no need to create a separate exploration set.
It's a good idea to create a copy of the training set so that we can freely manipulate it without affecting the original set.
exploration_set = strat_train_set.copy()
With seaborn library:
sns.scatterplot(x='fixed acidity', y='density', hue='quality',
                data=exploration_set)
With matplotlib:
exploration_set.plot(kind='scatter', x='fixed acidity', y='density', alpha=0.5,
                     c="quality", cmap=plt.get_cmap("jet"))
Let's calculate correlations between our features.
corr_matrix = exploration_set.corr()
Let's check features that are correlated with the label, which is quality in our case.
corr_matrix['quality']
Notice that quality has a fairly strong positive correlation with alcohol content (0.48) and a negative correlation with volatile acidity (-0.38).
Let's visualize correlation matrix with heatmap:
plt.figure(figsize=(14,7))
sns.heatmap(corr_matrix, annot=True)
Another option to visualize the relationships between the features is a scatter matrix.
from pandas.plotting import scatter_matrix
attribute_list = ['citric acid', 'pH', 'alcohol', 'sulphates', 'quality']
scatter_matrix(exploration_set[attribute_list])
For convenience of visualization, we show it for a small number of attributes.
We often need to preprocess the data before using it for model building, for a variety of reasons: there may be missing values, features may be on very different scales, and some features may be non-numeric (categorical).
Typical steps in data preprocessing are therefore handling missing values, encoding categorical attributes, and feature scaling, as illustrated below.
It's a good practice to make a copy of the data and apply preprocessing on that copy. This ensures that if something goes wrong, we will at least have the original copy of the data intact.
# Copy all features leaving aside the label.
wine_features = strat_train_set.drop("quality", axis=1)
# Copy the label list
wine_labels = strat_train_set['quality'].copy()
Let's first check whether there are any missing values in the feature set. One way to find out is column-wise:
wine_features.isna().sum() # counts the number of NaN in each column of wine_features
In this dataset, we do not have any missing values.
If any column had a non-zero count here, we would have a missing-values problem to deal with.
There are three standard ways to handle missing values: drop the rows that contain them, drop the whole attribute, or fill them in with some value (see the sketch below).
Scikit-learn provides the SimpleImputer class for filling in missing values with, say, the median value.
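For reference, here is a sketch of the three options using pandas; none of them is needed here because the dataset has no missing values, and 'pH' is just an example column:
# Option 1: drop rows where a value is missing.
# wine_features.dropna(subset=["pH"])
# Option 2: drop the whole attribute.
# wine_features.drop("pH", axis=1)
# Option 3: fill missing values with, for example, the median.
# wine_features["pH"] = wine_features["pH"].fillna(wine_features["pH"].median())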
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
The strategy argument specifies how to replace the missing values; in this case, each missing value is replaced by the median of its column.
imputer.fit(wine_features)
If the feature set contains non-numeric attributes, they need to be dropped before calling the fit method on the imputer object.
SimpleImputer(add_indicator=False, copy=True, fill_value=None, missing_values=nan, strategy='median', verbose=0)
Let's check the statistics learnt by the imputer on the training set:
imputer.statistics_
array([ 7.9 , 0.52 , 0.26 , 2.2 , 0.08 , 14. , 39. , 0.99675, 3.31 , 0.62 , 10.2 ])
Note that these are median values for each feature. We can cross-check it by calculating median on the feature set:
wine_features.median()
Finally we use the trained imputer to transform the training set such that the missing values are replaced by the medians:
tr_features = imputer.transform(wine_features)
This returns a NumPy array, and we can convert it to a DataFrame if needed:
tr_features.shape
(1279, 11)
wine_features_tr = pd.DataFrame(tr_features, columns=wine_features.columns)
If the dataset contained categorical (text) attributes, we could convert them to numbers with OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
One issue with this representation is that an ML algorithm would assume that two nearby values are more similar than two distant ones, which is not true for most nominal categories. One-hot encoding avoids this by creating one binary attribute per category:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
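As a small illustration (the place_of_manufacturing column and its values are made up; our wine dataset has no categorical features):
# Hypothetical categorical column for demonstration only.
places = pd.DataFrame({'place_of_manufacturing': ['France', 'India', 'France', 'USA']})
print(OrdinalEncoder().fit_transform(places))            # each category mapped to an integer
print(OneHotEncoder().fit_transform(places).toarray())   # one binary column per category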
Most ML algorithms do not perform well when numeric attributes are on very different scales, so we also standardize the features. We can chain the imputation and scaling steps with a Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
transform_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
wine_features_tr = transform_pipeline.fit_transform(wine_features)
Let's understand what is happening here: the Pipeline constructor takes a list of (name, estimator) pairs defining a sequence of steps, and all but the last estimator must be transformers. When we call fit_transform() on the pipeline, it calls fit_transform() on each transformer in order, passing the output of each step as the input to the next.
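To make this concrete, the pipeline above is roughly equivalent to applying the two steps by hand (shown only for illustration; the pipeline version is what we actually use):
# Manual equivalent of the pipeline: impute first, then standardize.
imputed = SimpleImputer(strategy="median").fit_transform(wine_features)
wine_features_tr_manual = StandardScaler().fit_transform(imputed)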
from sklearn.compose import ColumnTransformer
ColumnTransformer lets us apply different transformations to different columns, which is useful when a dataset has features of mixed types. In our dataset, all features are numeric, so we do not actually need it. For illustration purposes, suppose the data also had a categorical feature such as place_of_manufacturing; then we could write:
num_attribs = list(wine_features)
cat_attribs = ["place_of_manufacturing"]
full_pipeline = ColumnTransformer([
    ("num", transform_pipeline, num_attribs),   # the numeric pipeline defined above
    ("cat", OneHotEncoder(), cat_attribs),
])
wine_features_tr = full_pipeline.fit_transform(wine_features)
Now that the features are preprocessed, let's select and train a model. We start with linear regression:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(wine_features_tr, wine_labels)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now that we have a working regression model, let's evaluate its performance on the training as well as the test set.
from sklearn.metrics import mean_squared_error
quality_predictions = lin_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)
0.4206571060060278
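MSE is in squared quality units; if we prefer an error in the same units as the label, we can take the square root (a small addition on our part, not in the original):
# Root mean squared error of the linear regression on the training set.
lin_rmse = np.sqrt(mean_squared_error(wine_labels, quality_predictions))
lin_rmse   # roughly 0.65 for an MSE of about 0.42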
Let's evaluate performance on the test set.
# copy all features leaving aside the label.
wine_features_test = strat_test_set.drop("quality", axis=1)
# copy the label list
wine_labels_test = strat_test_set['quality'].copy()
# apply transformations (transform only; the pipeline was already fit on the training set)
wine_features_test_tr = transform_pipeline.transform(wine_features_test)
# call predict function and calculate MSE.
quality_test_predictions = lin_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)
0.39759130875015164
Let's visualize the error between the actual and predicted values.
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
The model seems to be making larger errors on the best and the poorest quality wines.
Let's try another model: DecisionTreeRegressor.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(wine_features_tr, wine_labels)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')
Notice the similarity between the two code snippets:
Linear regression | Decision Trees |
---|---|
lin_reg.fit(wine_features_tr, wine_labels) | tree_reg.fit(wine_features_tr, wine_labels) |
quality_predictions = tree_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)
0.0
quality_test_predictions = tree_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)
0.58125
Note that the training error is 0, while the test error is 0.58. This is an example of an overfitted model.
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
We can use cross-validation (CV) for robust evaluation of model performance.
from sklearn.model_selection import cross_val_score
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
Note that cross_val_score expects a utility function (greater is better), so we use the neg_mean_squared_error scoring and negate the scores to get MSE values:
scores = cross_val_score(lin_reg, wine_features_tr, wine_labels,
                         scoring="neg_mean_squared_error", cv=10)
lin_reg_mse_scores = -scores
display_scores(lin_reg_mse_scores)
Scores: [0.56364537 0.4429824 0.38302744 0.40166681 0.29687635 0.37322622 0.33184855 0.50182048 0.51661311 0.50468542]
Mean: 0.431639217212196
Standard deviation: 0.08356359730413976
scores = cross_val_score(tree_reg, wine_features_tr, wine_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_mse_scores = -scores
display_scores(tree_mse_scores)
Scores: [0.6171875 0.6875 0.6328125 0.5078125 0.4609375 0.640625 0.65625 0.7109375 0.859375 1.07874016]
Mean: 0.6852177657480315
Standard deviation: 0.16668343331737054
Let's compare the cross-validation scores of linear regression (LinReg) and the decision tree (DT): the decision tree's mean MSE (0.69) is considerably worse than linear regression's (0.43) despite its zero training error, which confirms that it overfits (a quick side-by-side comparison is sketched below).
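One convenient way to put the two sets of scores side by side (a sketch, not part of the original code):
# Collect the per-fold MSE scores of both models in one DataFrame.
cv_comparison = pd.DataFrame({'LinReg': lin_reg_mse_scores, 'DT': tree_mse_scores})
cv_comparison.describe()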
Let's try one more model, RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(wine_features_tr, wine_labels)
scores = cross_val_score(forest_reg, wine_features_tr, wine_labels,
                         scoring="neg_mean_squared_error", cv=10)
forest_mse_scores = -scores
display_scores(forest_mse_scores)
Scores: [0.36989922 0.41363672 0.29063438 0.31722344 0.21798125 0.30233828 0.27124922 0.38747344 0.42379219 0.46229449]
Mean: 0.34565226131889765
Standard deviation: 0.0736322184302973
quality_test_predictions = forest_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)
0.34449875
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
Random forest looks more promising than the other two models.
Based on the diagnosis of the model, we can apply one of the following remedies:
Model diagnosis | Remedy |
---|---|
Underfitting | Models with more capacity; fewer constraints/less regularization |
Overfitting | More data; simpler model; more constraints/regularization |
We can fine-tune our model by searching over its hyperparameters with GridSearchCV.
from sklearn.model_selection import GridSearchCV
For example, RandomForest regression has a number of hyperparameters, such as n_estimators (the number of trees), max_features (the number of features considered when splitting a node) and bootstrap (whether bootstrap samples are used when building trees). We specify the combinations to try in a parameter grid:
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
Here the parameter grid contains two sub-grids.
The first one results in \( 3 \times 4 = 12 \) combinations and the second one in \( 1 \times 2 \times 3 = 6 \) combinations, so the total number of combinations evaluated by the grid search is \( 12 + 6 = 18 \).
Let's create an object of GridSearchCV:
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
Let's launch the hyperparameter search:
grid_search.fit(wine_features_tr, wine_labels)
The best parameter combination can be obtained as follows:
grid_search.best_params_
{'max_features': 6, 'n_estimators': 30}
Let's find out the error at different parameter settings:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(-mean_score, params)
As you can notice the lowest MSE is obtained for the best parameter combination.
Let's obtain the best estimator as follows:
grid_search.best_estimator_
Note: GridSearchCV is initialized with refit=True option, which retrains the best estimator on the full training set. This is likely to lead us to a better model as it is trained on a larger dataset.
When the hyperparameter search space is large, it is usually preferable to use RandomizedSearchCV instead, which evaluates a fixed number of randomly sampled parameter combinations:
from sklearn.model_selection import RandomizedSearchCV
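A minimal usage sketch (the parameter ranges and n_iter are illustrative choices of ours, not from the original):
# Randomized search samples n_iter parameter combinations from the given distributions.
from scipy.stats import randint
param_distribs = {'n_estimators': randint(1, 200), 'max_features': randint(1, 11)}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(wine_features_tr, wine_labels)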
Analysis of the model provides useful insights about the features. Let's obtain the feature importances as learnt by the model:
feature_importances = grid_search.best_estimator_.feature_importances_
sorted(zip(feature_importances, feature_list), reverse=True)
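Optionally, the importances can be visualized as a bar chart (a small sketch of our own, not in the original):
# Sort the importances and plot them as a horizontal bar chart.
importance_series = pd.Series(feature_importances, index=feature_list).sort_values()
importance_series.plot(kind='barh')
plt.xlabel('Feature importance')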
Now that we have a reasonably good model, we evaluate its performance on the test set. The following steps are involved in the process:
1. Separate the test features and labels, and apply the transformations to the features.
# copy all features leaving aside the label.
wine_features_test = strat_test_set.drop("quality", axis=1)
# copy the label list
wine_labels_test = strat_test_set['quality'].copy()
# apply transformations (transform only; the pipeline was already fit on the training set)
wine_features_test_tr = transform_pipeline.transform(wine_features_test)
2. Use the predict method with the trained model and the test set.
quality_test_predictions = grid_search.best_estimator_.predict(
wine_features_test_tr)
3. Compare the predicted labels with the actual ones and report the evaluation metrics.
mean_squared_error(wine_labels_test, quality_test_predictions)
0.35345138888888883
4. It's a good idea to get a 95% confidence interval of the evaluation metric. It can be obtained with the following code:
from scipy import stats
confidence = 0.95
squared_errors = (quality_test_predictions - wine_labels_test) ** 2
stats.t.interval(confidence, len(squared_errors) - 1,
                 loc=squared_errors.mean(),
                 scale=stats.sem(squared_errors))
(0.29159276569581916, 0.4153100120819586)
Once we have a satisfactory model based on its performance on the test set, we reach the pre-launch phase.
Before launch, we need to present our solution, document what we have done, and prepare the model for deployment, monitoring and maintenance in production.
In this module, we studied the steps involved in an end-to-end machine learning project, using prediction of wine quality as an example.