Machine Learning Techniques
Regression is a supervised learning problem where the output label is a real number.
Single label regression: single output label \(y \in \mathbf{R}\), e.g. predict temperature for tomorrow based on weather features of past days.
Multi label regression: multiple output labels, total labels \(k\): \( \mathbf{y} \in \mathbf{R}^k \), e.g. predict temperature for the next 5 days based on weather features of past days.
All concepts will be discussed for the single label regression setup.
In the end, we will extend them to the multi-label regression setup.
Linear regression is a machine learning algorithm that predicts a real-valued output label based on a linear combination of input features.
The training data is a set of \(n\) ordered pairs of a feature vector \(\mathbf{x}\) and a label \(y\), each pair representing an example. We denote it by \(D\):
\( D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n} \)
\(\mathbf{X}\) is a feature matrix corresponding to \(n\) training examples, each represented with \(m\) features and has shape \(n \times m\). In this matrix, each feature vector is transposed and represented as a row.
Feature vector for \(i\)-th training example \(\mathbf{x}^{(i)}\) can be obtained by \(\mathbf{X}[i]\)
For single label problem, \(\mathbf{y}\) is a label vector of shape \(n \times 1\).
The \(i\)-th entry in this vector, \(\mathbf{y}[i]\) gives label for \(i\)-th example, which is denoted by \(y^{(i)} \in \mathbb{R}\)
Sample House Pricing data is presented below: a table of 4 examples (House 1 to House 4, \(n = 4\)) with 3 features each (Area, No. of Floors, Age; \(m = 3\)).
\( D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n} \)
It has three features: area in square feet, no. of floors and age in years, for 4 houses (House 1 to House 4).
Here the number of samples \(n\) is 4 and the number of features \(m\) is 3.
House Pricing data, first feature only (Area), for House 1 to House 4.
House Pricing data, one sample only (House 1), with its features: No. of Floors, Area, Age.
House Pricing data, labels (Price, in Rs.) for House 1 to House 4, alongside the features No. of Floors, Area, Age.
\( D = \left\{ (\mathbf{X}_{4 \times 3}, \mathbf{y}_{4\times 1})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{4} \)
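As a quick illustration, the feature matrix and label vector for this dataset could be held in numpy arrays of the corresponding shapes; the arrays below are placeholders that show only the shapes, not the actual house pricing values.

import numpy as np

# Placeholder arrays: only the shapes (4, 3) and (4,) matter here;
# the actual area/floors/age/price values are not reproduced.
X = np.zeros((4, 3))     # n = 4 examples, m = 3 features (area, no. of floors, age)
y = np.zeros(4)          # price label for each of the 4 houses
print(X.shape, y.shape)  # (4, 3) (4,)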
Label \(y\) is computed as a linear combination of features \(x_1,x_2,...,x_m\):
\(y = \color{red}{w_0} \color{black}+ \color{red}{w_1} \color{black}{x_1} + \color{red}{w_2} \color{black}{x_2} + \ldots + \color{red}{w_m} \color{black}{x_m}\)
Here \(\color{red}{w_0,w_1, w_2, \ldots, w_m}\color{black}\) are weights.
Introducing a dummy feature \(x_0 = 1\), this becomes
\(y = \color{red}{w_0} \color{blue}{x_0} \color{black}+ \color{red}{w_1} \color{black}{x_1} + \ldots + \color{red}{w_m} \color{black}{x_m} \)
It can be written compactly as
\(y = \sum\limits_{i=0}^{m} \color{red}{w_i} \color{black}{x_i}\)
Take a simple model with a single feature \(x_1\):
\(y = \color{red}{w_0} \color{black}+ \color{red}{w_1} \color{black}{x_1} \)
# model parameters (2): \( \color{red}{w_0} \color{black}, \color{red}{w_1} \)
Each new combination of \( \color{red}{w_0} \color{black}, \color{red}{w_1} \) leads to a new model.
Total # of possible models = infinite
# Features (m) | Model | Geometry | # Params (# w) |
---|---|---|---|
\(1\) | \(y = \color{red}{w_0} \color{black}+ \color{red}{w_1} \color{black}{x_1} \) | line | 2 |
\(2\) | \(y = \color{red}{w_0} \color{black}+ \color{red}{w_1} \color{black}{x_1} + \color{red}{w_2} \color{black}{x_2} \) | plane | 3 |
\( \ge 3 \) | \(y = \sum\limits_{i=0}^{m} \color{red}{w_i} \color{black}{x_i}\) | hyperplane in (m +1)-D | \(m+1\) |
\(y = \sum\limits_{j=0}^{m} \color{red}{w_j} \color{black}{x_j} = \color{red}{\mathbf{w}^T}\color{black} \mathbf{x}\)
All weights \(w_0, w_1, \ldots, w_m\) can be arranged in a vector \(\mathbf{w}\) with shape \((m+1) \times 1\)
Work out the vector product and verify that it corresponds to the linear combination of features.
The features for an \(i\)-th example can also be represented as a feature vector \(\mathbf{x}^{(i)}\). Note that we add a special feature \(x_0\).
Setting \( x_0 = 1 \), let's multiply these vectors to obtain \(y^{(i)}\):
\(y^{(i)} = \color{red}{w_0} \color{blue}{x_0} \color{black} + \color{red}{w_1} \color{black} x_1^{(i)} + \color{red}{w_2} \color{black} x_2^{(i)} + \ldots + \color{red}{w_m} \color{black} x_m^{(i)}\)
\(y^{(i)} = \color{red}{w_0} \color{black}\times 1 + \color{red}{w_1} \color{black}\times x_1^{(i)} + \color{black} \ldots + \color{red}{w_m} \color{black} \times x_m^{(i)}\)
\(y^{(i)} = \sum\limits_{j=0}^{m} \color{red}{w_j} \color{black}{x_j^{(i)}}\)
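A quick numeric check of this identity, with arbitrary placeholder values for the weights and features (not values from the lecture), confirming that \(\mathbf{w}^T \mathbf{x}\) equals the explicit linear combination:

import numpy as np

w = np.array([0.5, 2.0, -1.0])   # placeholder w_0, w_1, w_2
x = np.array([1.0, 3.0, 4.0])    # x_0 = 1 (dummy feature), then x_1, x_2

lhs = w @ x                                    # w^T x
rhs = w[0] * x[0] + w[1] * x[1] + w[2] * x[2]  # w_0*x_0 + w_1*x_1 + w_2*x_2
assert np.isclose(lhs, rhs)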
A weight vector \(\mathbf{w}\) with shape \(\left(m+1 \right) \times 1\).
A feature matrix \(\mathbf{X}\) has shape \((n, m+1)\) containing features for \(n\) examples.
The label vector \(\widehat{\mathbf{y}}\) containing labels for all the \(n\) examples can be computed as follows:
def predict(X, w):
    # Check that the shapes of the features and weights are compatible.
    assert X.shape[-1] == w.shape[0], "X and w don't have compatible dimensions"
    y = X @ w
    return y
This implements the model inference in vectorized form:
\(\widehat{\mathbf{y}} = \color{black}{\mathbf{X}}\color{red} \mathbf{w} \)
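A possible usage sketch of predict, assuming X_raw holds the raw features so the dummy \(x_0 = 1\) column still has to be prepended; X_raw and w below are placeholder arrays, not values from the lecture.

import numpy as np

X_raw = np.zeros((4, 3))                              # placeholder raw features (n=4, m=3)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # prepend the x_0 = 1 column
w = np.zeros(X.shape[1])                              # placeholder weight vector, shape (m+1,)
y_hat = predict(X, w)                                 # predicted labels, shape (4,)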
\(y = \color{red}{w_0}\color{black}+\color{red}{w_1} \color{black}{x_1}\)
Loss or error function
SSE is the sum of squared differences between the actual and predicted labels over the training points.
The total loss \(J(\mathbf{w})\) is the sum of errors at each training point: \(J(\mathbf{w}) = \sum\limits_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right)^2\)
We multiply this by \(\dfrac{1}{2}\) for mathematical convenience in later use: \(J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right)^2\)
The loss depends on the value of \(\mathbf{w}\): as these values change, we get a new model, which results in a different prediction and hence a different error at each training point.
Recall, \(\widehat{\mathbf{y}} = \mathbf{X} \mathbf{w} \), so
\(J(\mathbf{w}) = \frac{1}{2} \left( \color{blue}{{\mathbf{X} \mathbf{w}}} \color{black}-\mathbf{y} \right)^T \left( \color{blue}{{\mathbf{X} \mathbf{w}}} \color{black}-\mathbf{y} \right) \)
import numpy as np

def loss(features, labels, weights):
    # Compute the error vector: predictions minus actual labels.
    e = predict(features, weights) - labels
    # Compute the (halved) SSE loss: (1/2) * e^T e.
    loss = (1/2) * (np.transpose(e) @ e)
    return loss
[Figure: fit of the model \( h_\mathbf{w}(\mathbf{x}): y = \color{red} w_0\color{black}+ \color{red}w_1 \color{black}x_1\) alongside its loss (SSE)]
How do we find the model with the least loss or error?
Key calculation is the derivative of loss function \(J(\mathbf{w})\) with respect to \(\mathbf{w}\).
Let's look at a couple of useful identities for this:
Derivative of a linear function: \(\dfrac{\partial}{\partial \mathbf{w}} \left( \mathbf{x}^T \mathbf{w}\right) = \dfrac{\partial}{\partial \mathbf{w}} \left( \mathbf{w}^T \mathbf{x} \right) = \mathbf{x} \), similar to \(\dfrac{d}{dw} (xw) = x\).
Derivative of a quadratic function: \(\dfrac{\partial}{\partial \mathbf{w}} \left( \mathbf{w}^T \mathbf{A} \mathbf{w} \right) = 2 \mathbf{A} \mathbf{w}\) for a symmetric matrix \(\mathbf{A}\), similar to \(\dfrac{d}{dw} (xw^2) = 2xw\).
Set the partial derivative to 0 and solve with analytical method to obtain the weight vector
Iteratively change the weights based on the partial derivative of the loss function until convergence (Iterative optimization)
Recall \(\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{X}^T \mathbf{X} \mathbf{w} - \mathbf{X}^T \mathbf{y} \).
Let's set \(\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\) to 0 and solve for \(\mathbf{w}\): \(\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}\), so \(\mathbf{w} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}\) (the normal equation).
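A minimal sketch of the normal equation solution, assuming the feature matrix X already contains the dummy \(x_0 = 1\) column; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability.

import numpy as np

def fit_normal_equation(X, y):
    # Solve (X^T X) w = X^T y for the weight vector w.
    return np.linalg.solve(X.T @ X, X.T @ y)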
Calculate the gradient of the loss function w.r.t. the weights \(w_0, w_1, \ldots, w_m\), which are \(\dfrac{\partial J(\mathbf{w})}{\partial w_0}, \dfrac{\partial J(\mathbf{w})}{\partial w_1}, \ldots, \dfrac{\partial J(\mathbf{w})}{\partial w_m}\) respectively, and update the weights in the opposite direction of the gradient:
\(\mathbf{w} := \mathbf{w} - \alpha \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{w} - \alpha \left( \mathbf{X}^T \mathbf{X} \mathbf{w} - \mathbf{X}^T \mathbf{y} \right)\)
Here \(\alpha\) is the learning rate.
After this step, we have a new weight vector.
This vectorized implementation makes sure that all the parameters are updated in one go.
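A minimal sketch of the full gradient descent loop under the same setup (X includes the \(x_0 = 1\) column); the default learning rate and iteration count are illustrative choices, not values from the lecture.

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    # Start from an all-zero weight vector of shape (m+1,).
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)   # gradient of J(w) = (1/2) ||Xw - y||^2
        w = w - alpha * grad       # update all weights in one go
    return w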
[Figure: fit of the model \( h_\mathbf{w}(\mathbf{x}): y = \color{red} w_0\color{black}+ \color{red}w_1 \color{black}x_1\) alongside its loss (SSE)]
GD is a very generic optimization algorithm that can be used for learning weight vectors of most of the ML models.
GD obtains optimal weight vector by making small changes to their values in each iteration proportional to the gradient of the loss function.
Once GD reaches the minima of the loss function, we obtain the optimal weights corresponding to the minima.
GD can be stopped when: (1) the weight vector changes by a very small margin in successive (or the last few) iterations, or (2) a fixed number of iterations has been completed; see the sketch below.
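A minimal sketch combining both stopping criteria; the tolerance tol and the maximum iteration count are assumed values for illustration.

import numpy as np

def gradient_descent_early_stop(X, y, alpha=0.01, max_iters=10000, tol=1e-6):
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):                # criterion 2: fixed number of iterations
        w_new = w - alpha * (X.T @ (X @ w - y))
        if np.linalg.norm(w_new - w) < tol:   # criterion 1: weights barely change
            return w_new
        w = w_new
    return w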
1. How do we set the learning rate \(\alpha\) ?
2. How do we decide the number of iterations?
Try different values of \(\alpha\): \( \{0.0001, 0.001, 0.01, 0.1, 1 \} \)
The top one, with \(\alpha = 0.0001\), takes longer to reach the optimal point than \(\alpha = 0.001\).
Visualization-based diagnosis is not possible for real-world problems with more features; learning curves are our best friends in the general case.
[Figure: learning curves for a small \(\alpha\), a just right \(\alpha\), and a too large \(\alpha\)]
A couple of variations exist for faster weight updates and hence faster convergence. Note that GD uses all \(n\) training examples for each weight update:
Mini-batch gradient descent (MBGD): uses \(k \ll n\) examples for the weight update in each iteration.
Stochastic gradient descent (SGD): uses \(k = 1\) example for the weight update in each iteration.
The key time-consuming step in GD is the gradient computation, which performs a summation over all training examples: \(\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{X}^T \left( \mathbf{X} \mathbf{w} - \mathbf{y} \right) = \sum\limits_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}\)
All steps are same as GD except each step processes a small number of examples.
When we use \(k = 1\) in mini-batch GD, it is called stochastic gradient descent (SGD).
Total weight updates after processing the full training set once: 1 for GD, \(n/k\) for mini-batch GD, and \(n\) for SGD.
Step by step trajectories
SGD computes weight updates based on a single example as against \(k\) examples used in mini-batch GD.
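A minimal sketch of mini-batch gradient descent under the same setup; setting batch_size = 1 turns it into SGD. The batch size, learning rate, epoch count, and per-epoch shuffling are illustrative choices.

import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=8, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            w = w - alpha * (Xb.T @ (Xb @ w - yb))  # update from k examples only
    return w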
Evaluation uses the root-mean-squared-error (RMSE) measure, which is derived from SSE: \(\text{RMSE} = \sqrt{\dfrac{1}{n} \sum\limits_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 }\)
The square root brings the error to the same unit and scale as the output label.
Division by \(n\), the size of the dataset, enables us to compare model performance on datasets of different sizes on the same footing.
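A minimal sketch of the RMSE computation, assuming y_true and y_pred are numpy arrays of the same shape:

import numpy as np

def rmse(y_true, y_pred):
    # Square root of the mean of squared errors.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))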
(1) Data: features and a label that is a real number; \( D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n} \)
(2) Model: linear combination of features; \(h_\mathbf{w}: \mathbf{y} = \color{blue}{{\mathbf{X} \mathbf{w}}} \)
(3) Loss function: sum of squared error (SSE); \(J(\mathbf{w}) = \frac{1}{2} \left( \color{blue}{{\mathbf{X} \mathbf{w}}} \color{black}-\mathbf{y} \right)^T \left( \color{blue}{{\mathbf{X} \mathbf{w}}} \color{black}-\mathbf{y} \right) \)
(4) Optimization procedure: (1) normal eq. (2) GD/MBGD/SGD
(5) Evaluation: root mean squared error (RMSE)
Often, the relationship between the input features and the output label is non-linear, and simple linear models are not adequate to learn such mappings.
\(\phi_k(\text{features})\) is called a polynomial transformation of order \(k\).
Linear regression without polynomial transformation.
Clearly, this is not a great fit and we need to search for a better model.
\(h_\mathbf{w}(x)=\color{red}w_0\color{black}+\color{red}w_1\color{black}x\)
In this case, we are able to visualize the model fitment since we are dealing with data in 1D feature space.
As the number of features grow, it won't be possible for us to visualize the fitment in this manner.
We rely on learning curve to determine quality of the fitment:
In this case, the validation loss is less than the training loss (which is unusual); this is due to the small amount of validation data.
The linear model underfits: it is not flexible enough to model the relationship between features and labels present in the training data.
For training data with a single feature \(x_1\), the polynomial model of degree \(k\) is \(h_\mathbf{w}(x)=\color{red}{w_0}\color{black}+\color{red}{w_1}\color{black}x+\color{red}{w_2}\color{black}x^2+\ldots+\color{red}{w_k}\color{black}x^k\).
Note that the model is a non-linear function of \(x\), but is a linear function of weight vector \(\mathbf{w} = [w_0, w_1, \ldots, w_k]\).
Let's represent this transformation in the form of a vector \(\mathbf{\phi}\) with \(k\) components.
Each component denotes a specific transform to be applied to the input feature.
The polynomial regression model becomes: \(h_\mathbf{w}(x) = \color{red}{\mathbf{w}^T} \color{black} \boldsymbol{\phi}(x)\)
import itertools
import functools
import numpy as np

def polynomial_transform(x, degree):
    '''Performs transformation of input x into polynomial features of the given degree.

    Arguments:
        x: Data of shape (n,) or (n, m)
        degree: degree of the polynomial

    Returns:
        Polynomial transformation of x
    '''
    if x.ndim == 1:
        x = x[:, None]
    x_t = x.transpose()
    # Start with the constant (bias) feature x^0 = 1 for every example.
    features = [np.ones(len(x))]
    for deg in range(1, degree + 1):
        # All monomials of degree deg, e.g. x1*x2, x1^2, ...
        for items in itertools.combinations_with_replacement(x_t, deg):
            features.append(functools.reduce(lambda a, b: a * b, items))
    return np.asarray(features).transpose()
An example:
polynomial_transform(np.array([[1, 2], [3, 4]]), degree=3)
Output:
array([[ 1.,  1.,  2.,  1.,  2.,  4.,  1.,  2.,  4.,  8.],
       [ 1.,  3.,  4.,  9., 12., 16., 27., 36., 48., 64.]])
Let's fit a polynomial model of different orders \( k =\{0, 1, 2, \ldots, 9\} \) on this data.
(1) Train polynomial model of degrees: \(\{0, 1, \ldots, k\}\)
(2) Calculate training and validation RMSE.
(3) Plot degree vs. RMSE plot.
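A minimal sketch of steps (1)-(3), reusing polynomial_transform from above together with a normal-equation fit; the arrays x_train, y_train, x_val, y_val are assumed to be given, and matplotlib is assumed to be available for the final plot.

import numpy as np
import matplotlib.pyplot as plt

# x_train, y_train, x_val, y_val: assumed to be given 1-D numpy arrays.
degrees = range(0, 10)
train_rmse, val_rmse = [], []
for d in degrees:
    Phi_train = polynomial_transform(x_train, degree=d)
    Phi_val = polynomial_transform(x_val, degree=d)
    w = np.linalg.solve(Phi_train.T @ Phi_train, Phi_train.T @ y_train)  # normal equation
    train_rmse.append(np.sqrt(np.mean((Phi_train @ w - y_train) ** 2)))
    val_rmse.append(np.sqrt(np.mean((Phi_val @ w - y_val) ** 2)))

plt.plot(degrees, train_rmse, label="train RMSE")
plt.plot(degrees, val_rmse, label="validation RMSE")
plt.xlabel("degree")
plt.ylabel("RMSE")
plt.legend()
plt.show()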
Observe that the higher degree polynomial features are assigned larger weights than the others, both for the degree 9 model \(h_\mathbf{w}(\mathbf{x})= \sum\limits_{j=0}^{9}w_jx^j\) and for the degree 7 model \(h_\mathbf{w}(\mathbf{x})= \sum\limits_{j=0}^{7}w_jx^j\).
Train a polynomial model of degree 9 with two datasets: (i) with 15 points and (ii) with 100 points.
The larger dataset results in a smoother fit.
Overfitting is caused by large weights assigned to the higher order polynomial terms.
Regularization: at a high level, we modify the loss function by adding a penalty term for such large weights to control this behaviour.
This modification of the loss function changes its derivative, and hence the weight update rule in the gradient descent procedure.
Ridge regression uses the squared L2 norm of the weight vector as the penalty term:
\(J(\mathbf{w}) = \frac{1}{2} \sum\limits_{i=1}^{n} (\mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)})^2 + \frac{\lambda}{2} \sum_{i=1}^{m} w_i^2 \)
This can be written in a vectorized form as follows: \(J(\mathbf{w}) = \frac{1}{2} \left( \mathbf{X} \mathbf{w} - \mathbf{y} \right)^T \left( \mathbf{X} \mathbf{w} - \mathbf{y} \right) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w} \)
These steps are the same as the gradient calculation in linear regression, except for the penalty term.
Let's set \(\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\) to 0 and solve for \(\mathbf{w}\): this gives \(\mathbf{w} = \left( \mathbf{X}^T \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y}\).
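A minimal sketch of this closed-form ridge solution; note that the identity matrix below penalizes all weights including \(w_0\), a simplification of the penalty written above.

import numpy as np

def fit_ridge(X, y, lam=0.1):
    # Closed form: w = (X^T X + lambda * I)^(-1) X^T y
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)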
Typically such a chart is bowl shaped; however, due to the small number of features and samples here, it takes the shape shown above.
The most appropriate value of \(\lambda\) results in lowest cross validation error.
The most appropriate value of \(\lambda\) from the above figure is 0.1.
We need specialized optimization algorithms to estimate the weight vector in lasso regression, which are beyond the scope of this course.
We will use the sklearn implementation of Lasso in the ML practice course.
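As a small preview of that usage, a sketch with sklearn's Lasso; the alpha value and the placeholder arrays are illustrative only (sklearn's alpha plays the role of \(\lambda\), and the intercept is fitted separately, so no dummy column is needed).

import numpy as np
from sklearn.linear_model import Lasso

X = np.zeros((4, 3))        # placeholder feature matrix
y = np.zeros(4)             # placeholder labels
model = Lasso(alpha=0.1)    # alpha: regularization strength
model.fit(X, y)
print(model.coef_, model.intercept_)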