Regularization: Bias-Variance Tradeoff, \(l_2\) regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout

Mitesh M. Khapra

CS7015 (Deep Learning) : Lecture 6

 

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

   References/Acknowledgments

Ali Ghodsi’s Video Lectures on Regularization [1] [2]

So far, we have focused on minimizing the objective function using a variety of optimization algorithms

Deep learning models typically have BILLIONS of parameters whereas the training data may have only MILLIONS of samples

Therefore, they are called over-parameterized models

Over-parameterized models are prone to a phenomenon called over-fitting

To understand this, let's start with bias and variance of a model with respect to its capacity

The Problem

Module 6.1 : Bias and Variance


We will begin with a quick overview of bias, variance and the trade-off between them.

Let us consider the problem of fitting a curve through a given set of points

We consider two models:

Simple (degree 1): \(y = \hat{f}(x) = w_1x + w_0\)

Complex (degree 25): \(y = \hat{f}(x) = \displaystyle \sum_{i=1}^{25} w_i x^i + w_0\)

Note that in both cases we are making an assumption about how \(y\) is related to \(x\). We have no idea about the true relation \(f(x)\)

The points were drawn from a sinusoidal function (the true \(f(x)\))

The training data consists of 500 points

We sampled 500 points from the training data and trained a simple and a complex model

We repeat the process ‘k’ times to train multiple models (each model sees a different sample of the training data)

We make a few observations from these plots


Simple models trained on different samples of the data do not differ much from each other

However, they are very far from the true sinusoidal curve (underfitting)

On the other hand, complex models trained on different samples of the data are very different from each other (high variance)

Let \(f(x)\) be the true model (sinusoidal in this case) and \(\hat {f}(x)\) be our estimate of the model (simple or complex, in this case) then,

Bias\((\hat{f}(x)) = E[\hat{f}(x)] - f(x)\)

[Figure: Green line: average value of \(\hat{f}(x)\) for the simple model; Blue curve: average value of \(\hat{f}(x)\) for the complex model; Red curve: true model \(f(x)\)]

\(E\) [\(\hat {f}(x)\)] is the average (or expected) value of the model

We can see that for the simple model the average value (green line) is very far from the true value \(f(x)\) (sinusoidal function)

Mathematically, this means that the simple model has a high bias

On the other hand, the complex model has a low bias

We now define,

Variance \(\hat{f}(x) = E\bigg[(\hat{f}(x) - E[\hat{f}(x)])^2\bigg]\) (Standard definition from statistics)

Roughly speaking, it tells us how much the different \(\hat{f}(x)\)'s (trained on different samples of the data) differ from each other

It is clear that the simple model has a low variance whereas the complex model has a high variance

In summary (informally)

Simple model: high bias, low variance

Complex model: low bias, high variance

There is always a trade-off between the bias and variance

Both bias and variance contribute to the mean square error. Let us see how
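Before that, here is a minimal simulation (not from the lecture, assuming only numpy) that mimics the experiment above: fit a simple (degree-1) and a complex (degree-25) polynomial to many resampled noisy sinusoids and estimate the bias\(^2\) and variance empirically. Sample sizes and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                   # the "unknown" true function
x_grid = np.linspace(0, 2 * np.pi, 100)      # points at which the fits are compared

def fit_models(degree, n_points=50, k=100):
    """Train k models of the given degree, each on a fresh noisy sample."""
    preds = []
    for _ in range(k):
        x = rng.uniform(0, 2 * np.pi, n_points)
        y = f(x) + rng.normal(0, 0.3, n_points)   # y = f(x) + eps
        coeffs = np.polyfit(x, y, degree)          # least-squares fit (high degree may warn about conditioning)
        preds.append(np.polyval(coeffs, x_grid))
    return np.array(preds)                         # shape (k, len(x_grid))

for degree in (1, 25):
    preds = fit_models(degree)
    bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)   # (E[f_hat(x)] - f(x))^2, averaged over x
    variance = np.mean(preds.var(axis=0))                     # E[(f_hat(x) - E[f_hat(x)])^2], averaged over x
    print(f"degree {degree:2d}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```

The simple model should show the larger bias\(^2\) and the complex model the larger variance, mirroring the plots described above.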

Module 6.2 : Train error vs Test error


Consider a new point (\(x, y)\) which was not seen during training

If we use the model  \(\hat {f}(x)\) to predict the value of \(y\) then the mean square error is given by

\(E[(y − \hat{f}(x))^2]\)

(average square error in predicting \(y\) for many such unseen points)

We can show that

\(E[(y − \hat{f}(x))^2] = Bias^2 + Variance + \sigma^2 \ (\text{irreducible error})\)

The parameters of \(\hat {f}(x)\) (all \(w_i\) ’s) are trained using a training set \(\{(x_i , y_i)\} ^n_ {i=1}\)

However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training

This gives rise to the following two entities of interest:

\(train_{err}\) (say, mean square error)

\(test_{err}\) (say, mean square error)

Typically these errors exhibit the trend shown in the adjacent figure

 

[Figure: error vs. model complexity, annotated with a high-bias region (simple models), a high-variance region (complex models) and a sweet spot in between (the perfect tradeoff, ideal model complexity)]

Let there be \(n\) training points and \(m\) test (validation) points

Intuitions developed so far

\(train_{err}\) = \(\frac {1}{n} \displaystyle \sum_{i=1}^n  (y_{i} - \hat f {(x_i)})^2\)

\(test_{err} =\) \(\frac {1}{m} \displaystyle \sum_{i=n+1}^{n+m}  (y_{i} - \hat f {(x_i)})^2\)

As the model complexity increases \(train_{err}\) becomes overly optimistic and gives us a wrong picture of how close \(\hat f\) is to \(f\)

The validation error gives the real picture of how close \(\hat f\) is to \(f\)

We will concretize this intuition mathematically now and eventually show how to account for the optimism in the training error

Let \(D = \{(x_i, y_i)\}_{i=1}^{m+n}\)

 

We know that some true function exists, s.t., 

\(y = f(x) + ε\)

which means that \(y_i\) is related to \(x_i\) by some true function \(f\) but there is also some noise \(ε\) in the relation

For simplicity, we assume

\(ε ∼ N (0, σ^2 )\)

and of course we do not know \(f\)

Further, we use \(\hat f\) to approximate \(f\) and estimate its parameters using \(T \subset D\) such that

\(\hat y = \hat f(x)\)

We are interested in knowing

\(E[( \hat f(x) − f(x))^2 ]\)

but we cannot estimate this directly because we do not know \(f\)

We will see how to estimate this empirically using the observation \(y_i\) & prediction \(\hat y_i\)

\(E[( \hat y − y)^ 2 ]\)

\(= E[( \hat f(x) − f(x) − ε)^ 2 ]\)            (\(y = f(x) + ε)\)

\(= E[( \hat f(x) − f(x))^2 − 2ε( \hat f(x) − f(x)) + ε^2 ]\)

\(= E[( \hat f(x) − f(x))^2 ] − 2E[ε( \hat f(x) − f(x))] + E[ε^2 ]\)

\(∴ E[( \hat f(x) − f(x))^2 ] = E[( \hat y − y)^2 ] − E[ε^2 ] + 2E[ ε( \hat f(x) − f(x)) ]\)

\(E[( \hat y_i − y_i)^ 2 ]\)

\(= E[( \hat f(x_i) − f(x_i) − ε_i)^ 2 ]\)            (\(y_i = f(x_i) + ε_i)\)

\(= E[( \hat f(x_i) − f(x_i))^2 − 2ε_i( \hat f(x_i) − f(x_i)) + ε^2_i ]\)

\(= E[( \hat f(x_i) − f(x_i))^2 ] − 2E[ε_i( \hat f(x_i) − f(x_i))] + E[ε^2_i ]\)

\(∴ E[( \hat f(x_i) − f(x_i))^2 ] = E[( \hat y_i − y_i)^2 ] − E[ε^2_i ] + 2E[ ε_i( \hat f(x_i) − f(x_i)) ]\)

We will take a small detour to understand how to empirically estimate an Expectation and then return to our derivation

Suppose we have observed the goals scored (\(z)\) in \(k\) matches as

\(z_1\) = 2, \(z_2\) = 1, \(z_3\) = 0, ... \(z_k\) = 2

Now we can empirically estimate \(E[z]\) i.e. the expected number of goals scored as

\(E[z] = \frac {1}{k} \displaystyle \sum_{i=1}^k z_i\)

Analogy with our derivation: We have a certain number of observations \(y_i\) & predictions \(\hat y_i\) using which we can estimate

 

\(E[(\hat y − y)^2] = \frac{1}{m} \displaystyle \sum_{i=1}^{m} (\hat y_i - y_i)^2\)

... returning back to our derivation

\(E[( \hat f(x) − f(x))^2 ] = E[( \hat y − y)^2 ] − E[ε^2] + 2E[ ε( \hat f(x) − f(x)) ]\)

We can empirically evaluate R.H.S using training observations or test observations

Case 1: Using test observations

\(\underbrace{E[(\hat f(x) − f(x))^2]}_{true \space error} = \underbrace{\frac{1}{m} \displaystyle \sum_{i=n+1}^{n+m} (\hat y_i - y_i)^2}_{empirical \space estimation \space of \space error} - \underbrace{\sigma^2}_{small \space constant} + \underbrace{2E[ε(\hat f(x) − f(x))]}_{= \space covariance(ε, \space \hat f(x) − f(x))}\)

\(∵ \text{covariance}(X, Y) = E[(X − \mu_X)(Y − \mu_Y)]\)

\(= E[X(Y − \mu_Y)]\)  (if \(\mu_X = E[X] = 0\))

\(= E[XY] − E[X\mu_Y] = E[XY] − \mu_Y E[X] = E[XY]\)


None of the test observations participated in the estimation of \(\hat f(x)\) [the parameters of \(\hat f(x)\) were estimated only using training data]

\(\because ε = y − f(x)\), we have \(ε \perp (\hat f(x) − f(x))\), i.e., \((y - f(x)) \perp (\hat f(x) − f(x))\)

\(∴ E[ε \cdot (\hat f(x) − f(x))] = E[ε] \cdot E[\hat f(x) − f(x)] = 0 \cdot E[\hat f(x) − f(x)] = 0\)

∴ true error = empirical test error + small constant

Hence, we should always use a validation set (independent of the training set) to estimate the error

Case 2: Using training observations

\(\underbrace{E[(\hat f(x) − f(x))^2]}_{true \space error} = \underbrace{\frac{1}{n} \displaystyle \sum_{i=1}^{n} (\hat y_i - y_i)^2}_{empirical \space estimation \space of \space error} - \underbrace{\sigma^2}_{small \space constant} + \underbrace{2E[ε(\hat f(x) − f(x))]}_{= \space covariance(ε, \space \hat f(x) − f(x))}\)

Now, \(y - f(x) \space \cancel{\perp} \space \hat f(x) - f(x)\) because the training data was used for estimating the parameters of \(\hat f(x)\)

\(∴ E[ε \cdot (\hat f(x) − f(x))] \neq E[ε] \cdot E[\hat f(x) − f(x)]\), i.e., it is \(\neq 0\)

Hence, the empirical train error is smaller than the true error and does not give a true picture of the error

But how is this related to model complexity? Let us see
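Before moving to model complexity, here is a small numerical check of the point above (an assumed toy setup, not from the lecture): for a flexible model, the empirical train error underestimates the true error, while the error on held-out points estimates true error + \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
f, sigma = np.sin, 0.3

x = rng.uniform(0, 2 * np.pi, 60)
y = f(x) + rng.normal(0, sigma, 60)                 # y_i = f(x_i) + eps_i
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

coeffs = np.polyfit(x_tr, y_tr, 10)                 # a fairly complex model fit on training data only
pred = lambda z: np.polyval(coeffs, z)

train_err = np.mean((y_tr - pred(x_tr)) ** 2)       # empirical train error
test_err = np.mean((y_te - pred(x_te)) ** 2)        # empirical test (validation) error
true_err = np.mean((pred(x_te) - f(x_te)) ** 2)     # E[(f_hat - f)^2], computable only because f is known here

print("train error          :", round(train_err, 3))   # typically optimistic (smaller than the true error)
print("true error           :", round(true_err, 3))
print("test error           :", round(test_err, 3))    # typically close to true error + sigma^2
print("true error + sigma^2 :", round(true_err + sigma**2, 3))
```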

Module 6.3 : True error and Model complexity


Using Stein’s Lemma (and some trickery) we can show that the covariance term \(E[ε(\hat f(x) − f(x))]\) can be estimated as

\(\frac{1}{n} \displaystyle \sum_{i=1}^{n} ε_i(\hat f(x_i) − f(x_i)) = \frac{\sigma^2}{n} \displaystyle \sum_{i=1}^{n} \frac{∂\hat f(x_i)}{∂y_i}\)

When will \(\frac{∂\hat f(x_i)}{∂y_i}\) be high? When a small change in the observation causes a large change in the estimation (\(\hat f\))

Can you link this to model complexity?

Yes, indeed a complex model will be more sensitive to changes in observations whereas a simple model will be less sensitive to changes in observations

Hence, we can say that true error \(=\) empirical train error \(+\) small constant \(+\) \(Ω\)(model complexity)


Let us verify that indeed a complex model is more sensitive to minor changes in the data

We have fitted a simple and complex model for some given data

We now change one of these data points


The simple model does not change much as compared to the complex model 

Hence while training, instead of minimizing the training error \(\mathscr {L}_{train} (\theta)\) we should minimize

\(\min_{\theta} \space \mathscr{L}_{train}(\theta) + Ω(\theta) = \mathscr{L}(\theta)\)

Where \(Ω(\theta)\) would be high for complex models and small for simple models

\(Ω(\theta)\) acts as an approximation to \(\frac{\sigma^2}{n}\textstyle \sum_{i=1}^{n} \frac{∂\hat f(x_i)}{∂y_i}\)

This is the basis for all regularization methods

We can show that \(l_1\) regularization, \(l_2\) regularization, early stopping and injecting noise in input are all instances of this form of regularization.

[Figure: error vs. model complexity — high bias on the left, high variance on the right, with a sweet spot in between]

\(Ω(\theta)\) should ensure that the model has reasonable complexity

Why do we care about this bias variance tradeoff and model complexity?

Deep Neural networks are highly complex models.

Many parameters, many non-linearities. 

It is easy for them to overfit and drive training error to 0. 

Hence we need some form of regularization. 

Different forms of regularization

\(l_2\) regularization

Dataset augmentation

Parameter sharing and tying

Adding noise to the inputs

Adding noise to the outputs

Early stopping

Ensemble methods

Dropout

Module 6.4 : \(l_2\) regularization



For \(l_2\) regularization we have,

\(\widetilde{\mathscr {L}}(w) = \mathscr {L}(w) + \frac{\alpha}{2}\|w\|^2 \)

For SGD (or its variants), we are interested in

\(\nabla \widetilde{\mathscr {L}}(w) = \nabla \mathscr {L}(w) + \alpha w \)

Update rule:

\(w_{t+1} = w_t - \eta \nabla \mathscr {L}(w_t) - \eta \alpha w_t \)

Requires a very small modification to the code
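A minimal sketch of that modification, assuming a function `grad_L` that returns \(\nabla \mathscr{L}(w)\) for the unregularized loss (names and values are illustrative):

```python
import numpy as np

def sgd_step_with_l2(w, grad_L, eta=0.1, alpha=0.01):
    """One SGD step on the regularized loss: w_{t+1} = w_t - eta*grad L(w_t) - eta*alpha*w_t."""
    return w - eta * grad_L(w) - eta * alpha * w

# usage with a toy quadratic loss L(w) = 0.5 * ||w||^2 whose gradient is w itself
w = np.array([1.0, -2.0])
w = sgd_step_with_l2(w, grad_L=lambda w: w)
print(w)
```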

Let us see the geometric interpretation of this

Assume \(w^∗\) is the optimal solution for \(\mathscr {L}(w)\)  [not \(\widetilde{ \mathscr {L}}(w)\)] i.e. the solution in the absence of regularization (\(w^∗\) optimal → \(\nabla \mathscr {L}(w^*)=\) 0)

Consider \(u = w − w^∗\). Using the Taylor series approximation (up to \(2^{nd}\) order)

\(\mathscr {L}(w^* + u) = \mathscr {L}(w^*) + u^T \nabla \mathscr {L}(w^*) + \dfrac{1}{2}u^THu\)

\(\mathscr {L}(w) = \mathscr {L}(w^*) + (w - w^*)^T \nabla \mathscr {L}(w^*) + \dfrac{1}{2}(w - w^*)^TH(w - w^*)\)

\(= \mathscr {L}(w^*) + \dfrac{1}{2}(w - w^*)^TH(w - w^*)\)                \((∵ \nabla \mathscr {L}(w^*) = 0)\)

\(\nabla \mathscr {L}(w)= \nabla \mathscr {L}(w^*) + H(w - w^*)\)  

\(= H(w - w^*)\)  

Now,

\(\nabla \widetilde{\mathscr {L}}(w) = \nabla \mathscr{L}(w) + \alpha w\)

\(= H(w - w^*) + \alpha w\)

Let \(\widetilde w\) be the optimal solution for \(\widetilde{\mathscr{L}}(w)\) [i.e., the regularized loss]

\(∵ \nabla \widetilde{\mathscr {L}}(\widetilde w) = 0\)

\(H( \widetilde{w} - w^*) + \alpha \widetilde w = 0\)

\(∴ (H + \alpha \mathbb I) \widetilde{w} = H w^* \)

\(∴ \widetilde w = (H + \alpha \mathbb I)^{-1} H w^* \)

Notice that if \(\alpha → 0\) then \(\widetilde w → w^∗\) [no regularization]

But we are interested in the case when \(\alpha \neq 0\). Let us analyse this case

If H is symmetric Positive Semi Definite

\(H = Q \Lambda Q^T\)  [\(Q\) is orthogonal, \(QQ^T = Q^TQ = \mathbb I\)]

\(∴ \widetilde w = (H + \alpha \mathbb I)^{-1} H w^* \)

\(= (Q \Lambda Q^T + \alpha \mathbb I)^{-1} Q \Lambda Q^T w^*\)

\(= (Q \Lambda Q^T + \alpha Q \mathbb I Q^T)^{-1} Q \Lambda Q^T w^*\)

\(= [Q (\Lambda+ \alpha \mathbb I) Q^T]^{-1} Q \Lambda Q^T w^*\)

\(= Q^{T^{-1}} (\Lambda+ \alpha \mathbb I)^{-1} Q^{-1} Q \Lambda Q^T w^*\)

\(= Q (\Lambda+ \alpha \mathbb I)^{-1} \Lambda Q^T w^*\)               (\(∵ Q^{T^{-1}} = Q\))

\(∴ \widetilde w = Q D Q^{T} w^* \)

where \(D = (\Lambda + \alpha \mathbb I)^{−1} \Lambda\) is a diagonal matrix, which we will see in more detail soon

\(∴ \widetilde w = Q(\Lambda + \alpha \mathbb I)^{−1} \Lambda Q^T w^* \)

 \(= QDQ^T w^* \)

So what is happening here?

\(w^∗\) first gets rotated by \(Q^T\) to give \(Q^T w^∗\)

However if \(\alpha = 0\) then \(Q\) rotates \(Q^T w^∗\) back to give \(w^∗\)

If \(\alpha \space \cancel{=} \space 0\) then let us see what \(D\) looks like

\((\Lambda + \alpha \mathbb I)^{−1} = \begin{bmatrix} \frac{1}{\lambda_1 + \alpha} & & & \\ & \frac{1}{\lambda_2 + \alpha} & & \\ & & \ddots & \\ & & & \frac{1}{\lambda_n + \alpha} \end{bmatrix}\)

\(D = (\Lambda + \alpha \mathbb I)^{−1} \Lambda = \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + \alpha} & & & \\ & \frac{\lambda_2}{\lambda_2 + \alpha} & & \\ & & \ddots & \\ & & & \frac{\lambda_n}{\lambda_n + \alpha} \end{bmatrix}\)

So what is happening now?

\(∴ \widetilde w = Q(\Lambda + \alpha \mathbb I)^{−1} \Lambda Q^T w^* \)

 \(= QDQ^T w^* \)

Each element \(i\) of \(Q^T w^∗\) gets scaled by \(\frac{\lambda_i}{\lambda_i + \alpha}\) before it is rotated back by \(Q\)

if \(\lambda_i \gg \alpha\) then \(\frac{\lambda_i}{\lambda_i + \alpha} \approx 1\)

if \(\lambda_i \ll \alpha\) then \(\frac{\lambda_i}{\lambda_i + \alpha} \approx 0\)

Thus only significant directions (larger eigenvalues) will be retained.

 


Effective parameters \( = \displaystyle \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \alpha} < n \)

The weight vector \((w^∗)\) is getting rotated to \((\widetilde w)\)

All of its elements are shrinking but some are shrinking more than the others

This ensures that only important features are given high weights

[Figure: the unregularized optimum \(w^*\) and the regularized solution \(\tilde w\) in the \((w_1, w_2)\) plane]
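A small numerical check of this analysis (assumed toy quadratic loss, not from the lecture): compute \(\widetilde w = (H + \alpha \mathbb I)^{-1} H w^*\) directly and verify that, in the eigenbasis of \(H\), each component of \(w^*\) is scaled by \(\frac{\lambda_i}{\lambda_i + \alpha}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 4, 1.0
A = rng.normal(size=(n, n))
H = A @ A.T + 0.1 * np.eye(n)          # a symmetric positive definite "Hessian"
w_star = rng.normal(size=n)            # the (assumed) unregularized optimum

w_tilde = np.linalg.solve(H + alpha * np.eye(n), H @ w_star)   # (H + alpha*I)^{-1} H w*

lam, Q = np.linalg.eigh(H)             # H = Q diag(lam) Q^T
print("lambda_i / (lambda_i + alpha):", np.round(lam / (lam + alpha), 4))
print("(Q^T w_tilde) / (Q^T w*)     :", np.round((Q.T @ w_tilde) / (Q.T @ w_star), 4))
```

The two printed vectors should match: directions with small eigenvalues (less significant directions) are shrunk the most.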

Module 6.5 :  Dataset augmentation



[Given training data: an image with label = 2]

[Augmented data: created using some knowledge of the task]

We exploit the fact that certain transformations to the image do not change the label of the image.

rotated by \(20\degree\)

rotated by \(65\degree\)

shifted vertically

shifted horizontally 

blurred

changed some pixels

label = 2

Typically, more data means better learning

Works well for image classification / object recognition tasks

Also shown to work well for speech

For some tasks it may not be clear how to generate such data
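For image tasks, where such label-preserving transformations are easy to define, here is a hedged sketch using torchvision (the specific transforms and parameters are illustrative choices, not prescribed by the lecture):

```python
from torchvision import transforms

# each epoch the model sees a freshly transformed copy of every image, so the
# effective training set is much larger than the stored one
augment = transforms.Compose([
    transforms.RandomRotation(degrees=20),                     # small rotations keep the label
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # horizontal / vertical shifts
    transforms.GaussianBlur(kernel_size=3),                    # blurring
    transforms.ToTensor(),
])
# e.g. datasets.MNIST(root="data", train=True, transform=augment) with a standard DataLoader
```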

Module 6.6 :  Parameter Sharing and tying



Parameter Sharing

Used in CNNs

Same filter applied at different positions of the image

Or same weight matrix acts on different input neurons

Parameter Tying

Typically used in autoencoders  [Figure: \(x \rightarrow h(x) \rightarrow \hat x\)]

The encoder and decoder weights are tied (a sketch follows below).
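A minimal sketch of parameter tying in an autoencoder, assuming PyTorch: the decoder reuses the transpose of the encoder weight instead of learning a separate matrix, halving the number of weight parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_in) * 0.01)  # the single shared weight matrix
        self.b_h = nn.Parameter(torch.zeros(n_hidden))
        self.b_out = nn.Parameter(torch.zeros(n_in))

    def forward(self, x):
        h = torch.sigmoid(F.linear(x, self.W, self.b_h))            # encoder: h = sigma(W x + b_h)
        x_hat = torch.sigmoid(F.linear(h, self.W.t(), self.b_out))  # decoder: x_hat = sigma(W^T h + b_out)
        return x_hat

model = TiedAutoencoder(n_in=784, n_hidden=128)
print(sum(p.numel() for p in model.parameters()))   # W is counted once, not twice
```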

Module 6.7 :  Adding Noise to the inputs



[Figure: a denoising setup — the input \(x\) is corrupted by a noise process \(P(\widetilde x \mid x)\) to give \(\widetilde x\), which is mapped to a hidden representation \(h\) and reconstructed as \(\hat x\)]

We saw this in the (denoising) autoencoder

We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay (\(L_2\) regularization)

Can be viewed as data augmentation

\(ε ∼  N (0, \sigma^2 )\)

\(\widetilde {x}_i = x_i + ε_i\)

\(\widehat y = \displaystyle \sum_{i=1}^{n} w_i x_i\)

\(\widetilde y = \displaystyle \sum_{i=1}^{n} w_i \widetilde{x}_i = \displaystyle \sum_{i=1}^{n} w_i x_i + \displaystyle \sum_{i=1}^{n} w_i ε_i = \widehat y + \displaystyle \sum_{i=1}^{n} w_i ε_i\)

We are interested in \(E[(\widetilde y - y)^2]\)

\(E[(\widetilde y - y)^2] = E \Biggl [(\hat y + \displaystyle \sum_{i=1}^{n} w_i ε_i -y)^2 \Biggl]  \)  

\(= E \Biggl [\bigg(\big(\hat y - y \big) + \big(\displaystyle \sum_{i=1}^{n} w_i ε_i )\bigg)^2 \Biggl]  \)  

\(= E  [(\hat y - y)^2] + E \Big[2(\hat y - y) \displaystyle \sum_{i=1}^{n} w_i ε_i\Big] + E \Big[ \big(\displaystyle \sum_{i=1}^{n} w_i ε_i \big)^2 \Big]  \)  

\(= E  [(\hat y - y)^2] + 0 + E \Big[ \displaystyle \sum_{i=1}^{n} {w_i}^2 {ε_i}^2 \Big]  \)  

(\(∵ ε_i\) is independent of \(ε_j\) and \(ε_i\) is independent of \((\hat y-y)\))

\(= E \Big[(\hat y - y)^2\Big] + \sigma^2  \displaystyle \sum_{i=1}^{n} {w_i}^2\)    (same as the \(L_2\) norm penalty)

[Figure: a single-output network with inputs \(x_1 + ε_1, x_2 + ε_2, \ldots, x_n + ε_n\) and weights \(w_i\), computing \(\widetilde y = \sum_{i=1}^{n} w_i \widetilde{x}_i\)]
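A quick Monte-Carlo check of the result above (assumed toy setup with fixed \(x\), \(w\) and \(y\)): the expected squared error under input noise matches the noiseless error plus \(\sigma^2 \sum_i w_i^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 10, 0.1
w = rng.normal(size=n)
x = rng.normal(size=n)
y = 1.0                                          # an arbitrary target
y_hat = w @ x                                    # prediction on the clean input

eps = rng.normal(0, sigma, size=(200_000, n))    # Gaussian noise added to the inputs
y_tilde = (x + eps) @ w                          # predictions on the noisy inputs

print("Monte-Carlo E[(y_tilde - y)^2]     :", round(float(np.mean((y_tilde - y) ** 2)), 4))
print("(y_hat - y)^2 + sigma^2 * sum w_i^2:", round(float((y_hat - y) ** 2 + sigma**2 * np.sum(w**2)), 4))
```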

Module 6.8 :  Adding Noise to the outputs



minimize : \(-\displaystyle \sum_{i=0}^{9} p_i \space \log \space q_i\)

true distribution : \(p = \{0, 0, 1, 0, 0, 0, 0, 0, 0, 0\}\)

estimated distribution : \(q\)

Intuition

Do not trust the true labels, they may be noisy

Instead, use soft targets

Hard targets: \(0 \quad 0 \quad 1 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0\)

\(ε =\) small positive constant

minimize : \(-\displaystyle \sum_{i=0}^{9} p_i \space \log \space q_i\)

true distribution + noise : \(p =\{\frac{\varepsilon}{9}, \frac{\varepsilon}{9}, 1 - \varepsilon, \frac{\varepsilon}{9}, ... \}  \)

estimated distribution : \(q\)

Soft targets: \(\frac{ε}{9} \quad \frac{ε}{9} \quad 1-ε \quad \frac{ε}{9} \quad \frac{ε}{9} \quad \frac{ε}{9} \quad \frac{ε}{9} \quad \frac{ε}{9} \quad \frac{ε}{9} \quad \frac{ε}{9}\)
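A minimal sketch of these soft targets (label smoothing) for a 10-class problem, following the scheme on this slide: the true class gets \(1-ε\) and the remaining \(ε\) is spread evenly over the other 9 classes.

```python
import numpy as np

def soft_targets(label, num_classes=10, eps=0.1):
    p = np.full(num_classes, eps / (num_classes - 1))   # eps/9 on every wrong class
    p[label] = 1.0 - eps                                 # 1 - eps on the true class
    return p

def cross_entropy(p, q):
    return -np.sum(p * np.log(q + 1e-12))                # minimize: -sum_i p_i log q_i

q = np.full(10, 0.1)                                     # some estimated distribution
print(soft_targets(2))                                   # [eps/9, eps/9, 1-eps, eps/9, ...]
print(cross_entropy(soft_targets(2), q))
```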

Module 6.9 :  Early stopping



Track the validation error

Have a patience parameter \(p\)

If you are at step \(k\) and there was no improvement in validation error in the previous \(p\) steps then stop training and return the model stored at step \(k − p\)

Basically, stop the training early before it drives the training error to 0 and blows up the validation error

[Figure: training and validation error vs. steps — validation error stops improving at step \(k-p\) (return this model) while training error keeps decreasing until step \(k\) (stop)]
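A schematic sketch of the patience rule (not from the lecture): `train_one_step` and `validation_error` are placeholder callables for the actual training and evaluation code; only the stopping logic is the point.

```python
import copy

def train_with_early_stopping(model, train_one_step, validation_error, p, max_steps):
    """Stop when validation error has not improved for p consecutive steps."""
    best_err, best_model, best_step = float("inf"), copy.deepcopy(model), 0
    for k in range(max_steps):
        train_one_step(model)                      # one update (or one epoch)
        err = validation_error(model)
        if err < best_err:                         # validation error improved: remember this model
            best_err, best_model, best_step = err, copy.deepcopy(model), k
        elif k - best_step >= p:                   # no improvement in the previous p steps
            break                                  # stop and return the model stored at step k - p
    return best_model
```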

Very effective and the most widely used form of regularization

Can be used even with other regularizers (such as \(l_2\)) 

How does it act as a regularizer ?

We will first see an intuitive explanation and then a mathematical analysis


Recall that the update rule in SGD is

\(w_{t} = w_{t-1} - \eta \nabla \mathscr{L}(w_{t-1}) = w_0 - \eta \displaystyle \sum_{i=0}^{t-1} \nabla \mathscr{L}(w_i)\)

Let \(τ\) be the maximum magnitude of the gradients \(\nabla \mathscr{L}(w_i)\); then

\(\| w_{t} - w_0 \| \space \leq \space \eta \space t \space τ\)

Thus, \(t\) controls how far \(w_t\) can go from the initial \(w_0\)

In other words it controls the space of exploration 


We will now see a mathematical analysis of this

Recall that the Taylor series approximation for \(\mathscr{L} (w)\) is

\(\mathscr{L}(w) = \mathscr{L}(w^*) + (w - w^*)^T \nabla \mathscr{L}(w^*) + \dfrac{1}{2} (w - w^*)^T H(w - w^*) \)

\(= \mathscr{L}(w^*) + \dfrac{1}{2} (w - w^*)^T H(w - w^*) \)             \([ w^* \space is \space optimal \space so \space \nabla \mathscr{L} (w^∗) \space is \space 0 ]\)

\( \nabla \mathscr{L}(w) =  H(w - w^*) \)

Now the SGD update rule is:

\(w_t = w_{t-1} - \eta \nabla \mathscr{L} (w_{t-1})\)

\(= w_{t-1} - \eta H (w_{t-1} - w^*)\)

\(= (I - \eta H)w_{t-1} + \eta H w^*\)

\(w_t = (I - \eta H)w_{t-1} + \eta H w^*\)

Using EVD of \(H\) as \(H\) = \(Q \Lambda Q^T\) , we get:

                               \(w_t = (I - \eta Q \Lambda Q^T)w_{t-1} + \eta Q \Lambda Q^T w^*\)

If we start with \(w_0 = 0\) then we can show that (see Appendix)

                               \(w_t = Q[I - (I - \eta \Lambda)^t] Q^T w^*\)

Compare this with the expression we had for the optimum \(\widetilde w\) with \(L_2\) regularization

                               \(\widetilde w = Q[I - (\Lambda + \alpha I)^{-1} \alpha] Q^T w^*\)

We observe that \(w_t = \widetilde w\), if we choose \(\eta, t\) and \(\alpha\) such that

                               \((I - \eta \Lambda)^t =  (\Lambda + \alpha I)^{-1} \alpha \)

Things to remember

Early stopping only allows \(t\) updates to the parameters.

If a parameter \(w\) corresponds to a dimension which is important for the loss \(\mathscr{L} (\theta)\) then \(\frac {∂\mathscr{L} (\theta)}{∂w}\) will be large

However if a parameter is not important (\(\frac {∂\mathscr{L} (\theta)}{∂w}\) is small) then its updates will be small and the parameter will not be able to grow large in '\(t\)'  steps 

Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).


Module 6.10 :   Ensemble methods



Combine the output of different models to reduce generalization error

The models can correspond to different classifiers

It could be different instances of the same classifier trained with:

different hyperparameters

different features 

different samples of the training data

[Figure: an ensemble — inputs \(x_1, \ldots, x_4\) are fed to a Logistic Regression, an SVM and a Naive Bayes classifier, whose outputs \(y_{lr}, y_{svm}, y_{nb}\) are combined into \(y_{final}\)]

Bagging: form an ensemble using different instances of the same classifier

From a given dataset, construct multiple training sets by sampling with replacement \((T_1, T_2, ..., T_k)\)

Train \(i^{th}\) instance of the classifier using training set \(T_i\)

Each model trained with a different sample of the data (sampling with replacement)

[Figure: bagging — three Logistic Regression models, each trained on a different sample of the data, produce \(y_{lr1}, y_{lr2}, y_{lr3}\), which are combined into \(y_{final}\)]

When would bagging work?

Consider a set of k LR models

Suppose that each model makes an error \(\varepsilon_i\) on a test example

The error made by the average prediction of all the models is \(\frac {1}{k} \textstyle \sum_{i} \varepsilon_i\)

Let \(\varepsilon_i\) be drawn from a zero mean multivariate normal distribution

\(Variance = E[\varepsilon_i^2] = V\)

\(Covariance = E[\varepsilon_i \varepsilon_j] = C\)

The expected squared error is :

\(mse = E\Big[\big(\frac{1}{k} \textstyle \sum_{i} \varepsilon_i\big)^2\Big]\)

\(= \frac{1}{k^2} E\Big[ \displaystyle \sum_{i} \displaystyle \sum_{j = i} \varepsilon_i \varepsilon_j + \displaystyle \sum_{i} \displaystyle \sum_{j \neq i} \varepsilon_i \varepsilon_j \Big]\)

\(= \frac{1}{k^2} E\Big[ \displaystyle \sum_{i} \varepsilon_i^2 + \displaystyle \sum_{i} \displaystyle \sum_{j \neq i} \varepsilon_i \varepsilon_j \Big]\)

\(= \frac{1}{k^2} \Big(\displaystyle \sum_{i} E[\varepsilon_i^2] + \displaystyle \sum_{i} \displaystyle \sum_{j \neq i} E[\varepsilon_i \varepsilon_j]\Big)\)

\(= \frac{1}{k^2} (kV + k(k-1)C)\)

\(= \frac{1}{k}V + \dfrac{k-1}{k}C\)

When would bagging work?

If the errors of the model are perfectly correlated then \(V = C\) and \(mse = V\) [bagging does not help: the mse of the ensemble is as bad as the individual models] 

If the errors of the model are independent or uncorrelated then \(C = 0\) and the mse of the ensemble reduces to \(\frac{1}{k}V\)

\(mse = \frac {1}{k}V + \dfrac{k-1}{k}C\)

On average, the ensemble will perform at least as well as its individual members
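A Monte-Carlo check of this expression (assumed zero-mean normal errors with variance \(V\) and pairwise covariance \(C\)):

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, C = 5, 1.0, 0.3
cov = np.full((k, k), C) + (V - C) * np.eye(k)          # Var = V on the diagonal, Cov = C elsewhere

errors = rng.multivariate_normal(np.zeros(k), cov, size=200_000)   # one row per trial: (eps_1, ..., eps_k)
mse_empirical = np.mean(errors.mean(axis=1) ** 2)                   # E[(1/k sum_i eps_i)^2]

print("empirical mse  :", round(float(mse_empirical), 4))
print("V/k + (k-1)C/k :", round(V / k + (k - 1) * C / k, 4))
```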

Module 6.11 :   Dropout



Typically, model averaging (a bagging ensemble) helps

Training several large neural networks for making an ensemble is prohibitively expensive

Option 1: Train several neural networks having different architectures (obviously expensive)

Option 2: Train multiple instances of the same network using different training samples (again expensive) 

Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real time applications

Dropout is a technique which addresses both these issues.

Effectively it allows training several neural networks without any significant computational overhead.

Also gives an efficient approximate way of combining exponentially many different neural networks.

Dropout refers to dropping out units

Temporarily remove a node and all its incoming/outgoing connections resulting in a thinned network

Each node is retained with a fixed probability (typically \(p = 0.5\) for hidden nodes and \(p = 0.8\) for visible nodes)

Suppose a neural network has \(n\) nodes

Using the dropout idea, each node can be retained or dropped

For example, in the above case we drop \(5\) nodes to get a thinned network

Given a total of \(n\) nodes, what are the total number of thinned networks that can be formed? \(2^n\)

Of course, this is prohibitively large and we cannot possibly train so many networks

Trick: (1) Share the weights across all the networks

(2) Sample a different network for each training instance

Let us see how?

We initialize all the parameters (weights) of the network and start training

For the first training instance (or mini-batch), we apply dropout resulting in a thinned network

We compute the loss and backpropagate

Which parameters will we update? Only those which are active


For the second training instance (or mini-batch), we again apply dropout resulting in a different thinned network

We again compute the loss and backpropagate to the active weights

If a weight was active for both training instances then it would have received two updates by now

If it was active for only one of the training instances then it would have received only one update by now

Each thinned network gets trained rarely (or even never) but the parameter sharing ensures that no model has untrained or poorly trained parameters

What happens at test time?

Impossible to aggregate the outputs of \(2^n\) thinned networks 

Instead we use the full Neural Network and scale the output of each node by the fraction of times it was on during training

[Figure: at training time a unit is present with probability \(p\), with outgoing weights \(w_1, w_2, w_3, w_4\); at test time the unit is always present and the outgoing weights are scaled to \(pw_1, pw_2, pw_3, pw_4\)]
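A minimal sketch of this train/test behaviour for a single layer's activations (numpy, illustrative): keep each unit with probability \(p\) during training, and scale by \(p\) at test time as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, train):
    if train:
        mask = (rng.random(h.shape) < p).astype(h.dtype)   # each unit retained with probability p
        return h * mask                                     # dropped units output 0 for this instance
    return h * p                                            # test time: all units present, outputs scaled by p

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(h, p=0.5, train=True))    # a random thinned activation
print(dropout(h, p=0.5, train=False))   # the full network's activation, scaled by p
```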

Dropout essentially applies a masking noise to the hidden units

Prevents hidden units from coadapting 

Essentially a hidden unit cannot rely too much on other units as they may get dropped out any time

Each hidden unit has to learn to be more robust to these random dropouts

Here is an example of how dropout helps in ensuring redundancy and robustness

Suppose \(h_i\) learns to detect a face by firing on detecting a nose

Dropping \(h_i\) then corresponds to erasing the information that a nose exists

The model should then learn another \(h_i\) which redundantly encodes the presence of a nose

Or the model should learn to detect the face using other features


Visualizing the loss surface of a deep network* before and after applying regularization

*There are computationally intensive techniques which allow us to visualize the loss landscape of deep networks either in 1-D or 2-D. For more, refer to this paper

The loss landscape

Feel free to play with the loss landscape : https://losslandscape.com/explorer

Researchers have correlated the local flatness (curvature) of the loss surface with the generalization of a model, using different measures of curvature

Flat surface \(\rightarrow\) better generalization

Sharp surface \(\rightarrow\) poor generalization

It has been shown that explicit regularization techniques smoothen the loss surface

Module 6.12 : Implicit Regularization


\(\mathscr{L}(\theta)=\frac{1}{N}\displaystyle\sum_{i=1}^{N} l\big(f(x_i,\theta), y_i\big)\)

Grouping Regularization Techniques

Data \(x_i\)

1. Data Augmentation

2. Noise Injection

Architecture choice \(f(\cdot)\)

1. Dropout

2. Skip connections (CNN)

3. Weight sharing 

4. Pooling

Penalize cost \(\mathscr{L}(\cdot)\)

1. \(L_1\)

2. \(L_2\)

Optimizer \(\nabla\)

1. SGD

2. Early stopping

\(\mathscr{L}(\theta)=\frac{1}{N}\displaystyle\sum_{i=1}^{N} l\big(f(x_i,\theta), y_i\big)\)

Grouping Regularization Techniques

Explicit

Data Augmentation

Noise Injection

Dropout

Skip connections (CNN)

Weight sharing 

 Pooling

Implicit

\(L_1\), \(L_2\)

SGD

Large initial learning rate

Small initial learning rate

Early stopping

Is using regularization techniques (adopted from statistical learning theory) really the fundamental reason for the better generalization of deep learning models?

Let's find out

Random Labelling : paper

Let's consider a deep neural net that performed well on some task (say, classification) and fix its complexity (that is, the size of the network and hence the number of parameters), the optimizer used to train the model, the initialization, ...

Suppose that we replace the true labels of the training data by some random labels

By doing so, we remove the relation (regularity) between the inputs and the corresponding outputs

What is the consequence then?

Let's try to answer this with a simple (and toy) example that we have already used at the beginning of the lecture

The plot on the left shows some data points sampled from a noisy sinusoid.

Suppose that we use a model with the complexity  of 19 (i.e, 19 parameters)

After training the model, the model fits the data points and closely represents the true function

What happens if we randomly shuffle \(y\)?

Does the model (with its fixed capacity) fit the data?

Let's see


This behaviour is expected!

Because, by shuffling \(y\), we remove the regularity (relation) in the data points.

Therefore, the training error remains high!

Do deep neural nets behave the same way?

Surprisingly, No! They drive the training error to zero with random labelling (which removes the relation between samples and labels)!

Note, however, random labels are fixed and consistent across epochs
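A hedged sketch of this randomization experiment (assuming scikit-learn is available; the dataset, network size and iteration budget are arbitrary illustrative choices): train the same small network once on true labels and once on randomly shuffled labels and compare training accuracy. With enough capacity and iterations, the second run also tends to reach high training accuracy, illustrating the memorization effect.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
y_random = rng.permutation(y)                     # destroy the relation between inputs and labels

for name, labels in [("true labels  ", y), ("random labels", y_random)]:
    net = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0, max_iter=2000, random_state=0)
    net.fit(X, labels)                            # labels stay fixed and consistent across epochs
    print(name, "training accuracy:", round(net.score(X, labels), 3))
```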

The observation

..observations on both explicit and implicit regularizers are consistently suggesting that regularizers, when properly tuned, could help to improve the generalization performance.

However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers are removed.

Why does one model generalize better than the other?

What are the factors that help good generalization of the model? 

Experimental evidences suggest that the value of the objective may imply nothing about generalization (w.r.t to Rademacher complexity measure) [1]

The finding implies that the vastly over-parameterized deep learning  models implicitly regularize while optimizing the parameters via SGD/GD (that is, discrete steps of GD implicitly regularize the models) [Paper]

Recap

\(l_2\) regularization

Dataset augmentation

Parameter Sharing and tying

Adding Noise to the inputs

Adding Noise to the outputs

Early stopping

Ensemble methods

Dropout

Implicit Regularization

Appendix

To prove: The below two equations are equivalent

\(w_t = (I - \eta Q \Lambda Q^T)w_{t-1} + \eta Q \Lambda Q^T w^*\)

                     \(w_t = Q[I - (I - \eta \Lambda)^t]Q^T w^*\)

Proof by induction: 

Base case: \(t = 1\) and \(w_0=0\): 

\(w_1\) according to the first equation:

\(w_1 = (I - \eta Q \Lambda Q^T)w_0 + \eta Q \Lambda Q^T w^*\)

\(= \eta Q \Lambda Q^T w^*\)

\(w_1\) according to the second equation:

\(w_1 = Q[I - (I - \eta \Lambda)^1]Q^T w^*\)

\(= \eta Q \Lambda Q^T w^*\)

Induction step: Let the two equations be equivalent for \(t^{th}\) step

\(w_t = (I - \eta Q \Lambda Q^T)w_{t-1} + \eta Q \Lambda Q^T w^*\)

\(= Q[I - (I - \eta \Lambda)^t]Q^T w^*\)

Proof that this will hold for \((t + 1)^{th}\) step

\(w_{t+1} = (I - \eta Q \Lambda Q^T)w_{t} + \eta Q \Lambda Q^T w^*\)

(using \(w_{t} = Q[I - (I - \eta \Lambda)^t]Q^T w^*\))

\(= (I - \eta Q \Lambda Q^T) \textcolor{red}{Q(I - (I - \eta \Lambda)^t)Q^T w^*} + \eta Q \Lambda Q^T w^*\)

(Opening this bracket)

\(= IQ(I - (I - \eta \Lambda)^t)Q^T w^* - \eta Q \Lambda Q^T Q(I - (I - \eta \Lambda)^t)Q^T w^* + \eta Q \Lambda Q^T w^*\)

\(= Q(I - (I - \eta \Lambda)^t)Q^T w^* - \eta Q \Lambda Q^T Q(I - (I - \eta \Lambda)^t)Q^T w^* + \eta Q \Lambda Q^T w^*\)

Continuing

\(w_{t+1} = Q(I - (I - \eta \Lambda)^t)Q^T w^* - \eta Q \Lambda Q^T Q(I - (I - \eta \Lambda)^t)Q^T w^* + \eta Q \Lambda Q^T w^*\)

\( = Q(I - (I - \eta \Lambda)^t)Q^T w^* - \eta Q \Lambda (I - (I - \eta \Lambda)^t)Q^T w^* + \eta Q \Lambda Q^T w^*\) (\(∵Q^TQ = I\))


\( = Q[(I - (I - \eta \Lambda)^t) - \eta \Lambda (I - (I - \eta \Lambda)^t) + \eta \Lambda] Q^T w^*\)

\( = Q[(I - (I - \eta \Lambda)^t) + \eta \Lambda (I - \eta \Lambda)^t ] Q^T w^*\)

\( = Q[I - (I - \eta \Lambda)^t (I - \eta \Lambda)] Q^T w^*\)

\( = Q(I - (I - \eta \Lambda)^{t+1}) Q^T w^*\)

Hence, proved !