CS6910: Fundamentals of Deep Learning
Lecture 5
Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD, AdaGrad, AdaDelta, RMSProp, Adam (and its variants)
Mitesh M. Khapra
AI4Bharat, Department of Computer Science and Engineering, IIT Madras
Module 5.1: Learning Parameters :
Infeasible (Guess Work)
Input for Training
\(\{x_i,y_i\}_{i=1}^N \rightarrow N \ \text{pairs of} \ (x,y) \)
Training Objective
Find \( w\) and \(b \) such that
\( \min\limits_{(w,b) \in \mathbb{R}^2} \mathscr{L}(w,b)= \frac{1}{N}\sum \limits_{i=1}^N (y_i-f(x_i))^2 \)
What does it mean to train the network?
Suppose we train the network with \( (x, y) = (0.5, 0.2) \) and \((2.5, 0.9)\)
At the end of training we expect to find \(w^*\), \(b^*\) such that:
\( f(0.5) \rightarrow 0.2 \ \text{and} \ f(2.5) \rightarrow 0.9 \)
In other words
We hope to find a sigmoid function such that \((0.5, 0.2)\) and \((2.5, 0.9)\) lie on this sigmoid
Let us see this in more detail ...
Can we find \( w^*,b^* \) manually?
Let us start with a Random Guess: \( w=3,b=-1\)
Clearly, it is not good, but how bad is it?
Let us revisit \(\mathscr{L} (w, b) \) to see how bad it is ...
We want \(\mathscr{L} (w, b) \) to be as close to zero as possible
Let us try some other values for \((w,b)\) by changing the sliders
| w | b | Loss |
|---|---|---|
| 0.50 | 0.00 | 0.0730 |
| -0.10 | 0.00 | 0.14 |
| 0.94 | -0.94 | 0.0214 |
| 1.42 | -1.73 | 0.0028 |
| 1.65 | -2.08 | 0.0003 |
| 1.78 | -2.27 | 0.0000 |
A few Snapshots
Let us look for something better than our "guess work" algorithm...
Since we have only 2 points and 2 parameters \((w, b)\) we can easily plot \(\mathscr{L}(w, b)\) for different values of \((w, b)\) and pick the one where \(\mathscr{L}(w, b)\) is minimum
Random Search on Error Surface
But of course, this becomes intractable once you have many more data points and many more parameters!
Further, even here we have plotted the error surface only for a small range of \((w, b)\) [from \(-6\) to \(+6\), and not from \(-\infty\) to \(+\infty\)]
Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface
Play with the Search Space
Module 5.2: Learning Parameters :
Gradient Descent
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
Vector of parameters, say,
randomly initialized
\( \theta = [w,b]\)
\( \Delta \theta = [\Delta w, \Delta b]\)
change in the value of
\(w,b\)
We moved in the
direction of \(\Delta \theta\)
Let us be a bit conservative:
move only by a small amount \(\eta \)
\(\theta_{new}=\theta+\eta \cdot \Delta \theta \)
Question: What is the right \(\Delta \theta\) to use ?
The answer comes from Taylor series
What is Taylor Series?
Taylor series is a way of approximating any continuously differentiable function \(\mathscr{L}(w)\) using polynomials of degree \(n\). The higher the degree, the better the approximation!
\(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+\frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2+\frac{\mathscr{L'''}(w_0)}{3!}(w-w_0)^3+ \cdots\)
Linear Approximation (\( n=1 \))
\( \mathscr{L}(w)=\mathscr{L}(w_0)+ \frac{\mathscr{L'}(w_0)}{1!}(w-w_0) \)
Quadratic Approximation (\(n=2\))
\(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+ \frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2 \)
Taylor Series Approximation
Try different functions \(\mathscr{L}(w)\) such as \(\exp(w), \sin(w), w^3, \cdots\) and see how closely the Taylor series approximates the function at the point \(w = w_0\) and in its \(\epsilon\)-neighbourhood
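To make the idea concrete, here is a small sketch (not from the original slides) that computes the linear and quadratic approximations numerically; the choice \(\mathscr{L}(w) = \exp(w)\) and \(w_0 = 0\) is arbitrary and only for illustration.

import numpy as np

# First and second order Taylor approximations of L(w) = exp(w) around w0.
# exp is chosen only because its derivatives are easy; any smooth L(w) works.
def L(w):   return np.exp(w)
def dL(w):  return np.exp(w)   # L'(w)
def d2L(w): return np.exp(w)   # L''(w)

w0 = 0.0
w  = np.linspace(-1, 1, 5)     # a small neighbourhood around w0

linear    = L(w0) + dL(w0)*(w - w0)              # n = 1
quadratic = linear + 0.5*d2L(w0)*(w - w0)**2     # n = 2

print(np.round(L(w), 3))         # true values
print(np.round(linear, 3))       # close only very near w0
print(np.round(quadratic, 3))    # better over a wider neighbourhood

As expected, both approximations agree with \(\mathscr{L}(w)\) exactly at \(w_0\) and diverge from it as we move away, with the quadratic one staying accurate over a wider neighbourhood.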
Approximating 2D function
For ease of notation, let \(\Delta \theta = u\); then from the Taylor series, we have
\( \mathscr{L}(\theta + \eta u) = \mathscr{L}(\theta) + \eta \, u^T \nabla_{\theta}\mathscr{L}(\theta) + \frac{\eta^2}{2} u^T \nabla_{\theta}^2 \mathscr{L}(\theta) \, u + \cdots \)
\( \mathscr{L}(\theta + \eta u) \approx \mathscr{L}(\theta) + \eta \, u^T \nabla_{\theta}\mathscr{L}(\theta) \) [ \( \eta\) is typically small, so \( \eta^2, \eta^3, \cdots \rightarrow 0\)]
Note that the move (\(\eta u\)) would be favorable only if
\( \mathscr{L}(\theta + \eta u) - \mathscr{L}(\theta) < 0 \) [i.e., if the new loss is less than the previous loss]
This implies,
\( u^T \nabla_{\theta}\mathscr{L}(\theta) < 0 \)
But, what is the range of \( u^T \nabla_{\theta}\mathscr{L}(\theta) \)?
Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathscr{L}(\theta)\), then we know that
\(-1 \leq \cos(\beta) = \frac{ u^T \nabla_{\theta} \mathscr{L}(\theta)}{ ||u|| \cdot ||\nabla_{\theta} \mathscr{L}(\theta)||} \leq 1 \)
Multiplying throughout by \(k = ||u|| \cdot ||\nabla_{\theta} \mathscr{L}(\theta)|| \),
\(-k \leq k \cos(\beta) = u^T \nabla_{\theta} \mathscr{L}(\theta) \leq k \)
Thus, \( \mathscr{L}(\theta+\eta u) - \mathscr{L}(\theta) \approx \eta \, u^T \nabla_{\theta} \mathscr{L}(\theta) = \eta k \cos(\beta) \) is most negative when \(\cos(\beta)=-1\), i.e., when \(\beta = 180^{\circ}\)
Gradient Descent Rule
The direction \(u\) that we intend to move in should be at 180° w.r.t. the gradient
In other words, move in a direction opposite to the gradient
Parameter Update Rule
\( w_{t+1} = w_t - \eta \nabla w_t\)
\(b_{t+1} = b_t - \eta \nabla b_t \)
where, \( \nabla w_t = \frac{\partial \mathscr{L}(w,b)}{\partial w}\), at \(w=w_t,b=b_t \),
and, \( \nabla b_t = \frac{\partial \mathscr{L}(w,b)}{\partial b}\), at \(w=w_t,b=b_t \),
So we now have a more principled way of moving in the (\(w,b\)) plane than our “guess work” algorithm
Let us create an algorithm for this rule ...
To see this algorithm in practice let us first derive \( \nabla w \) and \( \nabla b\) for our toy neural network
max_iterations = 1000
w, b = random(), random()
while max_iterations:
    # dw, db: gradients of the loss w.r.t. w and b at the current (w, b)
    w = w - eta*dw
    b = b - eta*db
    max_iterations -= 1
Algorithm: gradient_descent()
\(t \leftarrow 0;\)
max_iterations \(\leftarrow 1000\);
w,b \(\leftarrow\) initialize randomly
while \(t <\) max_iterations do
\(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);
\(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);
\( t \leftarrow t+1;\)
end
Let us assume there is only one point \((x, y)\) to fit
So, if we have only one point \((x, y)\), we have
\( \mathscr{L}(w,b) = \frac{1}{2}(f(x) - y)^2 \)
\( \nabla w = \frac{\partial \mathscr{L}(w,b)}{\partial w} = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \cdot x \)
\( \nabla b = \frac{\partial \mathscr{L}(w,b)}{\partial b} = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \)
For two points,
\( \nabla w = \sum\limits_{i=1}^{2} (f(x_i) - y_i) \cdot f(x_i) \cdot (1 - f(x_i)) \cdot x_i \)
\( \nabla b = \sum\limits_{i=1}^{2} (f(x_i) - y_i) \cdot f(x_i) \cdot (1 - f(x_i)) \)
import numpy as np

X = [0.5,2.5]
Y = [0.2,0.9]

def f(x,w,b):
    return 1/(1+np.exp(-(w*x+b)))

def error(w,b):
    err = 0.0
    for x,y in zip(X,Y):
        fx = f(x,w,b)
        err += (fx-y)**2
    return 0.5*err

def grad_b(x,w,b,y):
    fx = f(x,w,b)
    return (fx-y)*fx*(1-fx)

def grad_w(x,w,b,y):
    fx = f(x,w,b)
    return (fx-y)*fx*(1-fx)*x

def do_gradient_descent():
    w,b,eta,max_epochs = -2,-2,1.0,1000
    for i in range(max_epochs):
        dw,db = 0,0
        for x,y in zip(X,Y):
            dw += grad_w(x,w,b,y)
            db += grad_b(x,w,b,y)
        # one update per pass over the data (batch gradient descent)
        w = w - eta*dw
        b = b - eta*db
Let us do_gradient_descent()
The same code as above, but this time initialized at \(w = -4, b = -4\) (a different starting point on the error surface).
When the curve is steep the gradient \( \frac{\Delta y}{\Delta x}\) is large.
When the curve is gentle the gradient \( \frac{\Delta y}{\Delta x}\) is small.
Recall that our weight updates are proportional to the gradient \( w = w - \eta \nabla w \)
Hence in the areas where the curve is gentle the updates are small whereas in the areas where the curve is steep the updates are large
Drag the blue point to see how the slope changes.
Let’s see what happens when we start from a different point
Irrespective of where we start from, once we hit a surface which has a gentle slope, the progress slows down
Try: \(w_{init}=-2,b_{init}=2\)
Try: \(w_{init}=-3,b_{init}=2\)
Set \(t_{max}=150\)
Module 5.3 : Contours
Visualizing things in 3d can sometimes become a bit cumbersome
Can we do a 2d visualization of this traversal along the error surface?
Yes, let’s take a look at something known as contours
Suppose I take horizontal slices of this error surface at regular intervals along the vertical axis
How would this look from the topview ?
A small distance between the contours indicates a steep slope along that direction
A large distance between the contours indicates a gentle slope along that direction
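A minimal sketch (an illustration, not part of the slides) of how such contours can be drawn for our toy loss surface with matplotlib; f and error are redefined here only so the snippet is self-contained, and the \([-6, 6]\) grid matches the range mentioned earlier.

import numpy as np
import matplotlib.pyplot as plt

X, Y = [0.5, 2.5], [0.2, 0.9]          # the two training points used earlier

def f(x, w, b):
    return 1/(1 + np.exp(-(w*x + b)))

def error(w, b):
    return 0.5*sum((f(x, w, b) - y)**2 for x, y in zip(X, Y))

# evaluate the loss on a grid of (w, b) values
ws = np.linspace(-6, 6, 200)
bs = np.linspace(-6, 6, 200)
W, B = np.meshgrid(ws, bs)
L = np.vectorize(error)(W, B)

# horizontal slices of the 3d surface, seen from the top, are the contours
plt.contour(W, B, L, levels=20)
plt.xlabel('w'); plt.ylabel('b')
plt.show()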
Gradient Descent
Module 5.4 : Momentum Based Gradient Descent
It is helpful to remember that the first iteration of any optimization algorithm starts with computing \(\nabla w_0\).
Some observations about gradient descent
It takes a lot of time to navigate regions having a gentle slope
This is because the gradient in these regions is very small
Can we do something better?
Yes, let’s take a look at Momentum based gradient descent
Before moving further
Suppose that we want to calculate the average of \(n\) numbers, say \(n=4\): \(x = [1,2,3,4]\). How do we do it?
Suppose that we want to calculate the average of \(n\) numbers, say \(n=4\). However, this time not all the numbers are readily available; they arrive one at a time (at time step \(t\)). How do we do it?
Approach: 1
Accumulate all the numbers as they come.
Update the average as a new number comes in
This is not a good approach if we are only interested in the average!
It requires us to store all numbers and therefore involves redundant computation
Is there a better way of doing it?
Approach: 2
Do we need to sum all the numbers every time a new number comes in?
No! We can also compute the average incrementally:
\( s_t = s_{t-1} + \frac{x_t - s_{t-1}}{t} \)
with initial condition \(s_0=0\)
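A small sketch contrasting the two approaches on the toy stream \([1, 2, 3, 4]\) (illustration only):

# Approach 1: store everything and recompute the average each time
# Approach 2: keep only the running average s_t and update it incrementally
stream = [1, 2, 3, 4]

seen = []
s = 0.0                                 # s_0 = 0
for t, x in enumerate(stream, start=1):
    seen.append(x)
    avg_full = sum(seen)/len(seen)      # recomputed from all stored numbers
    s = s + (x - s)/t                   # incremental update, no storage needed
    print(t, avg_full, s)               # the two columns are identical

Both give the same answer at every step; the second needs neither the stored list nor the repeated summation.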
Update rule for momentum based gradient descent
\( u_t = \beta u_{t-1} + \nabla w_t \)
\( w_{t+1} = w_t - \eta u_t \)
with \( u_{-1}=0\), \(w_0 = rand()\) and \(0 \leq \beta < 1\)
In addition to the current update, also look at the history of updates.
Intuition
If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking bigger steps in that direction
Just as a ball gains momentum while rolling down a slope
\(u_t\) is an exponentially weighted average of the current and all past gradients.
def do_mgd(max_epochs):
    w,b,eta = -2,-2,1.0
    prev_uw,prev_ub,beta = 0,0,0.9
    for i in range(max_epochs):
        dw,db = 0,0
        for x,y in zip(X,Y):
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)
        # history term plus the current (scaled) gradient
        uw = beta*prev_uw + eta*dw
        ub = beta*prev_ub + eta*db
        w = w - uw
        b = b - ub
        prev_uw = uw
        prev_ub = ub
Some Observations and questions
Even in the regions having gentle slopes, momentum based gradient descent is able to take large steps because the momentum carries it along
Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?
Let us change our input data so that we end up with a different error surface and then see what happens ...
In this case, the error is high on either side of the minima valley
Could momentum be detrimental in such cases... let’s see....
Momentum based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley
Takes a lot of u-turns before finally converging
\(w_0=-2, b_0=-4, \eta=1 \)
Despite these u-turns it still converges faster than vanilla gradient descent
After 50 iterations (25 iterations per second) momentum-based method has reached an error of 0.00001 whereas vanilla gradient descent is still stuck at an error of 0.16
Module 5.5 : Nesterov Accelerated Gradient Descent
Questions
Can we do something to reduce these oscillations ?
Yes, let’s look at Nesterov accelerated gradient
Intuition
Look before you leap
Update rule for NAG
\( w_{lookahead} = w_t - \beta u_{t-1} \)
\( u_t = \beta u_{t-1} + \nabla w_{lookahead} \)
\( w_{t+1} = w_t - \eta u_t \)
with \( u_{-1}=0\) and \(0 \leq \beta < 1\)
Recall that \( u_t = \beta u_{t-1}+\nabla w_t\)
So, we know that we are going to move at least by \( \beta u_{t-1}\) and then a bit more by \(\nabla w_t\)
Why not make use of this to look ahead, instead of relying only on the current \(\nabla w_t\)?
def do_nag(max_epochs):
    w,b,eta = -2,-2,1.0
    prev_vw,prev_vb,beta = 0,0,0.9
    for i in range(max_epochs):
        dw,db = 0,0
        # do partial updates (the history component)
        v_w = beta*prev_vw
        v_b = beta*prev_vb
        for x,y in zip(X,Y):
            # Look ahead: gradients computed at the partially updated point
            dw += grad_w(w-v_w,b-v_b,x,y)
            db += grad_b(w-v_w,b-v_b,x,y)
        vw = beta*prev_vw + eta*dw
        vb = beta*prev_vb + eta*db
        w = w - vw
        b = b - vb
        prev_vw = vw
        prev_vb = vb
Observations about NAG
Looking ahead helps NAG in correcting its course quicker than momentum-based gradient descent.
Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller.
Why do we care for an algorithm with quicker convergence?
It is not only about time but also about cost!
Module 5.6 : Stochastic And Mini-Batch Gradient Descent
Let’s digress a bit and talk about the stochastic version of these algorithms...
import numpy as np

X = [0.5,2.5]
Y = [0.2,0.9]

def do_gradient_descent():
    w,b,eta,max_epochs = -2,-2,1.0,1000
    for i in range(max_epochs):
        dw,db = 0,0
        for x,y in zip(X,Y):
            dw += grad_w(x,w,b,y)
            db += grad_b(x,w,b,y)
        # one update per pass over the entire data
        w = w - eta*dw
        b = b - eta*db
Notice that the algorithm goes over the entire data once before updating the parameters
Why?
Because this is the true gradient of the loss as derived earlier (sum of the gradients of the losses corresponding to each data point)
No approximation.
Hence, theoretical guarantees hold (in other words each step guarantees that the loss will decrease)
What’s the flipside?
Imagine we have a million points in the training data.
To make 1 update to \(w, b\) the algorithm makes a million calculations. Obviously very slow!!
Can we do something better?
Yes, let’s look at stochastic gradient descent
def do_stochastic_gradient_descent():
    w,b,eta,max_epochs = -2,-2,1.0,1000
    for i in range(max_epochs):
        for x,y in zip(X,Y):
            dw = grad_w(x,w,b,y)
            db = grad_b(x,w,b,y)
            # update the parameters for every single data point
            w = w - eta*dw
            b = b - eta*db
Notice that the algorithm updates the parameters for every single data point
Now if we have a million data points we will make a million updates in each epoch (1 epoch = 1 pass over the data; 1 step = 1 update)
What is the flipside ?
Stochastic because we are estimating the total gradient based on a single data point.
It is an approximate (rather stochastic) gradient
Almost like tossing a coin only once and estimating P(heads).
No guarantee that each step will decrease the loss
Let’s see this algorithm in action when we have a few data points
We see many oscillations. Why ?
Because we are making greedy decisions.
Each point is trying to push the parameters in a direction most favorable to it (without being aware of how this affects other points)
A parameter update which is locally favorable to one point may harm other points (it is almost as if the data points are competing with each other)
Can we reduce the oscillations by improving our stochastic estimates of the gradient? (currently estimated from just 1 data point at a time)
Yes, let’s look at mini-batch gradient descent
def do_minibatch_stochastic_gradient_descent():
    w,b,eta,max_epochs = -2,-2,1.0,500
    mini_batch_size = 25
    for i in range(max_epochs):
        dw,db,num_points_seen = 0,0,0
        for x,y in zip(X,Y):
            dw += grad_w(x,w,b,y)
            db += grad_b(x,w,b,y)
            num_points_seen += 1
            if num_points_seen % mini_batch_size == 0:
                # update after seeing mini_batch_size points
                w = w - eta*dw
                b = b - eta*db
                dw,db = 0,0
Notice that the algorithm updates the parameters after it sees mini_batch_size number of data points
The stochastic estimates are now slightly better
Let’s see this algorithm in action when we have \(k = 25\)
With a batch size of \(k=25\) the oscillations have reduced to a good extent. Why ?
Because we now have slightly better estimates of the gradient [analogy: we are now tossing the coin \(k=25\) times to estimate P(heads)]
The higher the value of \(k\) the more accurate are the estimates
Of course, there are still oscillations and they will always be there as long as we are using an approximate gradient as opposed to the true gradient
Some things to remember..
1 epoch = one pass over the entire data
1 step = one update of the parameters
\(N\) = number of data points, \(B\) = mini-batch size

| Algorithm | # of steps in 1 epoch |
|---|---|
| Vanilla (Batch) Gradient Descent | 1 |
| Stochastic Gradient Descent | \(N\) |
| Mini-Batch Gradient Descent | \(N/B\) |
Similarly, we can have stochastic versions of Momentum based gradient descent and Nesterov accelerated based gradient descent
\(\beta =0.5\)
SGD with Momentum
SGD-NAG-Momentum
\(\beta=0.85\)
While the stochastic versions of both Momentum [green] and NAG [blue] exhibit oscillations, the relative advantage of NAG over Momentum still holds (i.e., NAG takes relatively shorter u-turns)
Further, both of them are faster than stochastic gradient descent (after 500 steps, stochastic gradient descent [black] still exhibits a very high error whereas NAG and Momentum are close to convergence)
And, of course, you can also have the mini batch version of Momentum and NAG...I leave that as an exercise
Module 5.7 : Tips for Adjusting learning Rate and Momentum
Before moving on to advanced optimization algorithms let us revisit the problem of learning rate in gradient descent
One could argue that we could have solved the problem of navigating gentle slopes by setting the learning rate high (i.e., blow up the small gradient by multiplying it with a large \(\eta\))
Let us see what happens if we set the learning rate to 10
On the regions which have a steep slope, the already large gradient blows up further
It would be good to have a learning rate which could adjust to the gradient ...
we will see a few such algorithms soon..
Tips for initial learning rate
Tune the learning rate [try different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0]
Run a few epochs with each of these and figure out a learning rate which works best
Now do a finer search around this value [for example, if the best learning rate was 0.1 then now try some values around it: 0.05, 0.2, 0.3]
Disclaimer: these are just heuristics ... no clear winner strategy
Tips for annealing learning rate
Step Decay:
Halve the learning rate after every 5 epochs or
Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch
Tips for annealing learning rate
Exponential Decay:
\(\eta=\eta_0 e^{-kt}\) where \(\eta_0\) and \(k\) are
hyperparameters and \(t\) is the step number
Tips for annealing learning rate
1/t Decay:
\(\eta=\frac{\eta_0}{1+kt}\) where \(\eta_0\) and \(k\) are
hyperparameters and \(t\) is a step number
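The three annealing schemes above can be written in a few lines of code; this is only a sketch, and the constants (eta0 = 0.1, k = 0.1, the 5-epoch halving interval) are illustrative assumptions rather than values from the slides.

import numpy as np

eta0, k = 0.1, 0.1                       # assumed initial rate and decay constant

def step_decay(epoch, interval=5):       # halve the rate every `interval` epochs
    return eta0 * (0.5 ** (epoch // interval))

def exponential_decay(t):                # eta = eta0 * exp(-k t)
    return eta0 * np.exp(-k * t)

def one_by_t_decay(t):                   # eta = eta0 / (1 + k t)
    return eta0 / (1 + k * t)

for t in [0, 5, 10, 50]:
    print(t, step_decay(t), exponential_decay(t), one_by_t_decay(t))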
Tips for momentum
The following schedule was suggested by Sutskever et al., 2013
\( \beta_t = \min\left(1-2^{-1-\log_2(\lfloor \frac{t}{250} \rfloor+1)},\ \beta_{max}\right) \)
where \(\beta_{max}\) is chosen from \(\{0.999, 0.995, 0.99, 0.9, 0\}\)
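A direct transcription of this schedule into code (a sketch; beta_max = 0.99 is just one of the listed choices):

import numpy as np

def momentum_schedule(t, beta_max=0.99):
    # beta_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), beta_max)
    return min(1 - 2 ** (-1 - np.log2(np.floor(t/250) + 1)), beta_max)

for t in [1, 250, 500, 1000, 5000]:
    print(t, round(momentum_schedule(t), 4))

The momentum starts around 0.5 and grows towards beta_max as training progresses, so larger steps in a repeated direction are taken only after the early, noisy phase.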
Module 5.8: Line Search
Just one last thing before we move on to some other algorithms ...
def do_line_search_gradient_descent(max_epochs):
    w,b,etas = -2,-2,[0.1,0.5,1,2,10]
    for i in range(max_epochs):
        dw,db = 0,0
        for x,y in zip(X,Y):
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)
        min_error = 10000 # some large value
        # try every candidate learning rate and keep the best update
        for eta in etas:
            temp_w = w - eta * dw
            temp_b = b - eta * db
            if error(temp_w,temp_b) < min_error:
                best_w = temp_w
                best_b = temp_b
                min_error = error(best_w,best_b)
        w = best_w
        b = best_b
In practice, often a line search is done to find a relatively better value of \(\eta\)
Update \(w\) using different values of \(\eta\)
Now retain that updated value of \(w\) which gives the lowest loss
Essentially at each step, we are trying to use the best \(\eta\) value from the available choices
What’s the flipside? We are doing many more computations in each step
Blue: Gradient descent with fixed learning rate \(\eta=1\)
Black: Gradient descent with line search
Let us see line search in action
Convergence is faster than vanilla gradient descent
We see some oscillations, but note that these oscillations are different from what we see in momentum and NAG
Module 5.9 : Gradient Descent with Adaptive Learning Rate
Given this network, it should be easy to see that, for a single point \((\mathbf{x}, y)\),
\( \nabla w^2 = (f(\mathbf{x}) - y) \cdot f(\mathbf{x}) \cdot (1 - f(\mathbf{x})) \cdot x^2 \)
If there are \(n\) points, we can just sum the gradients over all the \(n\) points to get the total gradient
What happens if the feature \(x^2\) is very sparse? (i.e., if its value is 0 for most inputs)
\(\nabla w^2\) will be 0 for most inputs (see formula) and hence \(w^2\) will not get enough updates
If \(x^2\) happens to be sparse as well as important we would want to take the updates to \(w^2\) more seriously.
Can we have a different learning rate for each parameter which takes care of the frequency of features?
Intuition
Decay the learning rate for parameters in proportion to their update history (more updates means more decay)
Update Rule for AdaGrad
\( v_t = v_{t-1} + (\nabla w_t)^2 \)
\( w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t}+\epsilon} \nabla w_t \)
... and a similar set of equations for \(b_t\)
To see this in action we need to first create some data where one of the features is sparse
How would we do this in our toy network ?
Well, our network has just two parameters, \(w\) and \(b\).
def do_adagrad(max_epochs):
    # Initialization
    w,b,eta = -2,-2,0.1
    v_w,v_b,eps = 0,0,1e-8
    for i in range(max_epochs):
        # zero grad
        dw,db = 0,0
        for x,y in zip(X,Y):
            # compute (and accumulate) the gradients
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)
        # accumulate the squared gradient history
        v_w = v_w + dw**2
        v_b = v_b + db**2
        # update parameters with a per-parameter effective learning rate
        w = w - eta*dw/(np.sqrt(v_w)+eps)
        b = b - eta*db/(np.sqrt(v_b)+eps)
Take some time to think about it
The input/feature corresponding to \(b\) is always on, so we can't really make it sparse. The only option is to make \(x\) sparse.
Solution: We created 500 random \((x, y)\) pairs and then, for roughly 80% of these pairs, we set \(x\) to \(0\), thereby making the feature for \(w\) sparse.
There is something interesting that these 3 algorithms are doing for this dataset. Can you spot it?
Initially, all three algorithms are moving mainly along the vertical \((b)\) axis and there is very little movement along the horizontal \((w)\) axis
Why?
Because in our data, the feature corresponding to \(w\) is sparse and hence \(w\) undergoes very few updates; on the other hand, \(b\) is very dense and undergoes many updates
Such sparsity is very common in large neural networks containing thousands of input features and hence we need to address it
Let's see what AdaGrad does..
Adagrad
learning rate \(\eta = 0.1\) for all the algorithms
Momentum \(\gamma = 0.9\)
Number of points: 500, 80% set to zero
Momentum and NAG converge more quickly near the optima than Adagrad.
Adagrad slows down near the optima due to the decaying learning rate
Recall that, for \(w\), \( v_t = v_{t-1} + (\nabla w_t)^2 \)
Since \(x\) is sparse, \(\nabla w_t\) is zero for most of the time steps, so \(v_t\) grows slowly.
Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays slowly for \(w\).
By using a parameter specific learning rate it ensures that despite sparsity \(w\) gets a higher learning rate and hence larger updates
Further, it also ensures that if \(b\) undergoes a lot of updates its effective learning rate decreases because of the growing denominator
In practice, this does not work so well if we remove the square root from the denominator (something to ponder about)
What's the flipside? Over time, the effective learning rate for \(b\) will decay to an extent that there will be no further updates to \(b\)
Can we avoid this?
Intuition
Adagrad decays the learning rate very aggressively (as the denominator grows). As a result, after a while, the frequent parameters will start receiving very small updates because of the decayed learning rate.
To avoid this, why not decay the denominator and prevent its rapid growth?
Update Rule for RMSProp
\( v_t = \beta v_{t-1} + (1-\beta)(\nabla w_t)^2 \)
\( w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t}+\epsilon} \nabla w_t \)
... and a similar set of equations for \(b_t\)
Recall that,
AdaGrad: \( v_t = v_{t-1} + (\nabla b_t)^2 \); therefore \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays rapidly for \(b\)
RMSProp: \( v_t = \beta v_{t-1} + (1-\beta)(\nabla b_t)^2 \); therefore \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays slowly (compared to AdaGrad) for \(b\)
def do_rmsprop(max_epochs):
    # Initialization
    w,b,eta = -4,4,0.1
    beta = 0.5
    v_w,v_b,eps = 0,0,1e-4
    for i in range(max_epochs):
        # zero grad
        dw,db = 0,0
        for x,y in zip(X,Y):
            # compute (and accumulate) the gradients
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)
        # exponentially decaying average of squared gradients
        v_w = beta*v_w + (1-beta)*dw**2
        v_b = beta*v_b + (1-beta)*db**2
        # update parameters
        w = w - eta*dw/(np.sqrt(v_w)+eps)
        b = b - eta*db/(np.sqrt(v_b)+eps)
However, why are there oscillations?
Does it imply that after some iterations, there is a chance that the learning rate remains constant so that the algorithm possibly gets into an infinite oscillation around the minima?
RMSProp converged more quickly than AdaGrad by being less aggressive on decay
RMSProp
AdaGrad
In the case of AdaGrad, the learning rate monotonically decreases due to the ever-growing denominator!
In the case of RMSProp, the learning rate may increase, decrease, or remain constant due to the moving average of the (squared) gradients in the denominator.
The figure below shows the gradients across 500 iterations. Can you see the reason for the oscillation around the minimum?
What is the solution?
If the learning rate is constant, there is a chance that the descent algorithm oscillates around the local minimum.
We have to set the initial learning rate appropriately; in this case, setting \(\eta=0.05\) solves the problem.
Sensitive to initial learning rate, initial conditions of parameters and corresponding gradients
If the initial gradients are large, the learning rates will be low for the remainder of training (in AdaGrad)
Later, if a gentle curvature is encountered, no way to increase the learning rate (In Adagrad)
AdaDelta
Avoids setting an initial learning rate \(\eta_0\).
Update Rule for AdaDelta (for \(w\); a similar set of equations holds for \(b\)):
\( v_t = \beta v_{t-1} + (1-\beta)(\nabla w_t)^2 \)
\( \Delta w_t = \frac{\sqrt{u_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \nabla w_t \)
\( u_t = \beta u_{t-1} + (1-\beta)(\Delta w_t)^2 \)
\( w_{t+1} = w_t - \Delta w_t \)
Reason why it is called Adaptive Delta: the running average \(u_t\) of past (squared) updates \(\Delta w\) takes the place of the learning rate \(\eta\).
We want the learning rate to decrease on a surface with high curvature and to increase on a surface with low curvature. Let us trace two consecutive updates in each case (the numerical values shown on the original slides are omitted; only the conclusions are kept).
Surface with high curvature:
\(\Delta w_0 < \nabla w_0\) and \(u_0 < v_0\): a very small first update.
For the next update, \(u_0 < v_1\), so the effective scaling \(\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} < 1\): a smaller update.
We want the learning rate to decrease. The algorithm does it.
Surface with low curvature:
\(\Delta w_0 < \nabla w_0\) and \(u_0 < v_0\): a very small first update.
Let's say \(\nabla w_1\) is near zero; squaring it pushes it further towards zero, so we can safely ignore its contribution (just for argument). Then \(u_0 > v_1\), so the effective scaling \(\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} > 1\): a bigger update.
We want the learning rate to increase. The algorithm does it!
def do_adadelta(max_epochs):
    # Initialization
    w,b = -4,-4
    beta = 0.99
    v_w,v_b,eps = 0,0,1e-4
    u_w,u_b = 0,0
    for i in range(max_epochs):
        dw,db = 0,0
        for x,y in zip(X,Y):
            # compute (and accumulate) the gradients
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)
        # running average of squared gradients
        v_w = beta*v_w + (1-beta)*dw**2
        v_b = beta*v_b + (1-beta)*db**2
        # step adapted by the history of past updates (no eta needed)
        delta_w = dw*np.sqrt(u_w+eps)/(np.sqrt(v_w+eps))
        delta_b = db*np.sqrt(u_b+eps)/(np.sqrt(v_b+eps))
        # running average of squared updates
        u_w = beta*u_w + (1-beta)*delta_w**2
        u_b = beta*u_b + (1-beta)*delta_b**2
        w = w - delta_w
        b = b - delta_b
Observe the gradient profile along the trajectory
It starts off with high curvature region and eventually reaches the low curvature region
Let's take a look at how the learning rate changed dynamically over the iterations.
What's your guess?
Let's put it all together in one place
Adam (Adaptive Moments)
Intuition
Do everything that RMSProp and AdaDelta do to solve the decay problem of Adagrad, plus use a cumulative history of the gradients: Adam incorporates classical momentum (a running average of the gradients) along with the running average of squared gradients (an \(L_2\)-norm-like quantity).
Typically, \(\beta_1=0.9\), \(\beta_2=0.999\)
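For reference, the standard Adam update for \(w\) (it matches the code below; a similar set of equations holds for \(b\)):

\( m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla w_t \)
\( v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla w_t)^2 \)
\( \hat{m}_t = \dfrac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \dfrac{v_t}{1-\beta_2^t} \)
\( w_{t+1} = w_t - \dfrac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t \)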
def do_adam_sgd(max_epochs):
    # Initialization
    w,b,eta = -4,-4,0.1
    beta1,beta2 = 0.5,0.5
    m_w,m_b,v_w,v_b = 0,0,0,0
    eps = 1e-10
    for i in range(max_epochs):
        for x,y in zip(X,Y):
            # compute the gradients at the current point
            dw = grad_w_sgd(w,b,x,y)
            db = grad_b_sgd(w,b,x,y)
            # moving averages of the gradients and squared gradients
            m_w = beta1*m_w+(1-beta1)*dw
            m_b = beta1*m_b+(1-beta1)*db
            v_w = beta2*v_w+(1-beta2)*dw**2
            v_b = beta2*v_b+(1-beta2)*db**2
            # bias correction
            m_w_hat = m_w/(1-np.power(beta1,i+1))
            m_b_hat = m_b/(1-np.power(beta1,i+1))
            v_w_hat = v_w/(1-np.power(beta2,i+1))
            v_b_hat = v_b/(1-np.power(beta2,i+1))
            # update parameters
            w = w - eta*m_w_hat/(np.sqrt(v_w_hat)+eps)
            b = b - eta*m_b_hat/(np.sqrt(v_b_hat)+eps)
Million Dollar Question: Which algorithm to use?
Adam seems to be more or less the default choice now \((\beta_1 = 0.9, \beta_2 = 0.999 \ \text{and} \ \epsilon = 10^{-8} )\).
Although it is supposed to be robust to initial learning rates, we have observed that for sequence generation problems \(\eta = 0.001, 0.0001\) works best
Having said that, many papers report that SGD with momentum (Nesterov or classical) with a simple annealing learning rate schedule also works well in practice (typically, starting with \(\eta = 0.001, 0.0001\) for sequence generation problems)
Adam might just be the best choice overall!!
Some recent work suggests that there is a problem with Adam and it will not converge in some cases. Also, it is observed that models trained using the Adam optimizer usually don't generalize well.
Explanation for why we need bias correction in Adam
Update Rule for Adam
Note that we are taking a running average of the gradients as \(m_t\)
The reason we are doing this is that we don’t want to rely too much on the current gradient and instead rely on the overall behaviour of the gradients over many timesteps
One way of looking at this is that we are interested in the expected value of the gradients and not on a single point estimate computed at time t
However, instead of computing \(E[\nabla w_t]\) we are computing \(m_t\) as an exponential moving average
Ideally we would want \(E[m_t ]\) to be equal to \(E[\nabla w_t ]\)
Let us see if that is the case
Recall the momentum equation from Module 5.4: \( u_t = \beta u_{t-1} + \nabla w_t \)
What we have now in Adam is a slight modification to the above equation (let's treat \(\beta_1\) as \(\beta\)):
\( m_t = \beta m_{t-1} + (1-\beta) \nabla w_t \)
Let's take expectation on both sides
Assumption: All \(\nabla w_\tau\) comes from the same distribution,i.e.,
\(E[\nabla w_\tau] = E[\nabla w] \quad \forall \tau\)
Expanding \(m_t\) and using this assumption, the last factor that appears is the sum of a GP with common ratio \(\beta\).
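Filling in the steps of this standard derivation (with \(m_0 = 0\)):

\( m_t = (1-\beta) \sum\limits_{\tau=1}^{t} \beta^{t-\tau} \nabla w_\tau \)
\( E[m_t] = (1-\beta) \sum\limits_{\tau=1}^{t} \beta^{t-\tau} E[\nabla w_\tau] = E[\nabla w](1-\beta)(1 + \beta + \cdots + \beta^{t-1}) = E[\nabla w](1-\beta^t) \)
\( \therefore \ E[\hat{m}_t] = E\left[\dfrac{m_t}{1-\beta^t}\right] = E[\nabla w] \)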
Hence we apply the bias correction: the expected value of \(\hat{m}_t\) is then the same as \(E[\nabla w_t]\)
AdaMax
Replace the \(L_2\) norm in Adam's \(v_t\) with the \(L_{max}\) (\(L_\infty\)) norm
Do the same modification for RMSprop and call it MaxaProp
Update Rule for AdaMax (compare with the Adam update given earlier)
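The corresponding equations, reconstructed here from the code that follows (shown for \(w\); similarly for \(b\)):

\( m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla w_t, \quad \hat{m}_t = \dfrac{m_t}{1-\beta_1^t} \)
\( v_t = \max(\beta_2 v_{t-1},\ |\nabla w_t|) \)
\( w_{t+1} = w_t - \dfrac{\eta}{v_t + \epsilon} \hat{m}_t \)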
def do_adamax_sgd(max_epochs):
    # Initialization
    w,b,eta = -4,-4,0.1
    beta1,beta2 = 0.9,0.99
    m_w,m_b,v_w,v_b = 0,0,0,0
    eps = 1e-10
    for i in range(max_epochs):
        for x,y in zip(X,Y):
            # compute the gradients at the current point
            dw = grad_w_sgd(w,b,x,y)
            db = grad_b_sgd(w,b,x,y)
            # moving average of gradients and running max of |gradient|
            m_w = beta1*m_w+(1-beta1)*dw
            m_b = beta1*m_b+(1-beta1)*db
            v_w = np.max([beta2*v_w,np.abs(dw)])
            v_b = np.max([beta2*v_b,np.abs(db)])
            # bias correction only for the first moment
            m_w_hat = m_w/(1-np.power(beta1,i+1))
            m_b_hat = m_b/(1-np.power(beta1,i+1))
            # update parameters
            w = w - eta*m_w_hat/(v_w+eps)
            b = b - eta*m_b_hat/(v_b+eps)
Intuition
We know that NAG is superior to Momentum. Why not just incorporate it with Adam?
We just need to modify \(m_t\), as it depends on the gradient.
Update Rule for Adam
Recall the equation for Momentum
Recall the equation for NAG
We skip the multiplication of factor \(1-\beta\) with \(\nabla w_t\) for brevity
So can we write \(m_t\) as follows?
Would it be helpful if we untangle \(\nabla(w_t-\beta_1 m_{t-1})\) by rewriting NAG?
Yes. But at the cost of intuition
Try computing bias correction like we did earlier..
NAG
Rewritten NAG
Observe that the momentum vector \(m_{t-1}\) is used twice (while computing gradient and current momentum vector). As a consequence, there are two weight updates (which is costly for big networks)!
Look ahead
Look ahead
Now, look ahead is computed only at the weight update equation with the current momentum vector \(m_t\)
Now, we just need to plug this into Adam. Instead of \(m_t\) we use the bias-corrected version, \(\hat{m}_t\).
Update Rule for Adam
Let's extend the weight update equation (treat \(\beta_1\) as \(\beta\)) and let \(\nabla w_t=g_t\)
We replace \(\hat{m}_{t-1}\) by \(\hat{m}_t\) (as we did while rewriting NAG). Here is the update rule for NAdam.
Relaxing the requirement of \(\beta^{t-1}\) in the denominator of \(\hat{m}_{t-1}\), we can write the update equation as
Rewritten NAG
We need not modify \(v_t\)
Update Rule for NAdam
\( w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1 \hat{m}_t + \frac{(1-\beta_1)\nabla w_t}{1-\beta_1^t}\right) \)
def do_nadam_sgd(max_epochs):
    # Initialization
    w,b,eta = -4,-4,0.1
    beta1,beta2 = 0.9,0.99
    m_w,m_b,v_w,v_b = 0,0,0,0
    eps = 1e-10
    for i in range(max_epochs):
        for x,y in zip(X,Y):
            # compute the gradients at the current point
            dw = grad_w_sgd(w,b,x,y)
            db = grad_b_sgd(w,b,x,y)
            # moving averages of gradients and squared gradients
            m_w = beta1*m_w+(1-beta1)*dw
            m_b = beta1*m_b+(1-beta1)*db
            v_w = beta2*v_w+(1-beta2)*dw**2
            v_b = beta2*v_b+(1-beta2)*db**2
            # bias correction
            m_w_hat = m_w/(1-beta1**(i+1))
            m_b_hat = m_b/(1-beta1**(i+1))
            v_w_hat = v_w/(1-beta2**(i+1))
            v_b_hat = v_b/(1-beta2**(i+1))
            # update parameters using the look-ahead (Nesterov) momentum term
            w = w - (eta/np.sqrt(v_w_hat+eps)) * \
                (beta1*m_w_hat+(1-beta1)*dw/(1-beta1**(i+1)))
            b = b - (eta/np.sqrt(v_b_hat+eps)) * \
                (beta1*m_b_hat+(1-beta1)*db/(1-beta1**(i+1)))
Parameterized version of NAdam
SGD with momentum outperformed Adam in certain problems like object detection and machine translation
It is attributed to the exponential averaging of squared gradients \((\nabla w)^2\) in \(v_t\) of Adam, which suppresses high-valued gradient information over iterations. So, AMSGrad uses a \(\max()\) function to retain it.
Update Rule for Adam
Update Rule for AMSGrad
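The two rules side by side (AMSGrad as it is commonly stated; note that it keeps a running maximum of \(v_t\) and uses the uncorrected \(m_t\)):

Adam: \( v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla w_t)^2, \quad w_{t+1} = w_t - \dfrac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t \)
AMSGrad: \( v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla w_t)^2, \quad \hat{v}_t = \max(\hat{v}_{t-1}, v_t), \quad w_{t+1} = w_t - \dfrac{\eta}{\sqrt{\hat{v}_t}+\epsilon} m_t \)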
Note that there are no bias corrections in AMSGrad!
Well, no one knows what AMSGrad stands for!
Once again, which optimizer to use?
It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.
-Pedro Domingos
It is known that \(L_2\) regularization and weight decay are equivalent (coupled) for SGD, but not for algorithms that use adaptive gradients, like Adam. For SGD with \(L_2\) regularization,
\( \widetilde{\mathscr{L}}(w) = \mathscr{L}(w) + \frac{\lambda}{2}||w||^2 \)
\( w_{t+1} = w_t - \eta \nabla \mathscr{L}(w_t) - \eta \lambda w_t \)
The terms on the right can be rearranged and written as
\( w_{t+1} = (1 - \eta \lambda) w_t - \eta \nabla \mathscr{L}(w_t) \)
which is exactly a weight-decay update.
However, it is observed that it is not the case with the Adam optimizer! (In general, any adaptive gradient based optimizers)
Update Rule for AdamW (Decoupled Weight Decay)
Therefore, if you wish to implement weight decay, do it directly with the update equation. The \(L_2\) regularization and weight decay are no longer coupled for adaptive gradient-based optimizers.
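A minimal sketch of a single AdamW step for our toy parameter \(w\) (illustrative only; the function name adamw_step and the weight-decay coefficient lamda = 0.01 are assumptions, not from the slides):

import numpy as np

def adamw_step(w, dw, m, v, t, eta=0.1, beta1=0.9, beta2=0.999,
               lamda=0.01, eps=1e-8):
    # usual Adam moments with bias correction
    m = beta1*m + (1-beta1)*dw
    v = beta2*v + (1-beta2)*dw**2
    m_hat = m/(1-beta1**t)
    v_hat = v/(1-beta2**t)
    # decoupled weight decay: subtract eta*lamda*w directly in the update,
    # instead of adding lamda*w to the gradient (as L2 regularization would)
    w = w - eta*m_hat/(np.sqrt(v_hat)+eps) - eta*lamda*w
    return w, m, v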
Now we can extend the same to NAdam and call it NAdamW, and similarly RMSPropW and AdaGradW.
A similar situation arises when we apply normalization techniques (which we will learn about in the next set of lectures) to deep learning models with adaptive gradients. To rectify it, Adam is modified and called AdamP.
Now we need to digress a bit to a new set of adaptive learning rate schemes (as it is a pre-requisite for RAdam (Rectified Adam))
Are you excited?
Learning rate schemes
Based on epochs: 1. Step decay, 2. Exponential decay, 3. Cyclical, 4. Cosine annealing, 5. Warm restart
Based on validation: 1. Line search, 2. Log search
Based on gradients: 1. AdaGrad, 2. RMSProp, 3. AdaDelta, 4. Adam, 5. AdaMax, 6. NAdam, 7. AMSGrad, 8. AdamW
Suppose the loss surface looks as shown in the figure. The surface has a saddle point.
Suppose further that the parameters are initialized to \((w_0,b_0)\) (yellow point on the surface) and the learning rate \(\eta\) is decreased exponentially over iterations.
After some iterations, the parameters \((w,b)\) will reach near the saddle point.
Since the learning rate has decreased exponentially, the algorithm has no way of coming out of the saddle point (despite the possibility of coming out of it)
What if we allow the learning rate to increase after some iterations? At least there is a chance to escape the saddle point.
Rationale: Often, difficulty in minimizing the loss arises from saddle points rather than poor local minima.
Therefore, it is beneficial if the learning rate schemes support a way to increase the learning rate near the saddle points.
Adaptive learning rate schemes may help in this case. However, it comes with an additional computational cost.
A simple alternative is to vary the learning rate cyclically
CLR: Triangular
\( cycle = \left\lfloor 1 + \frac{t}{2\mu} \right\rfloor, \quad x = \left| \frac{t}{\mu} - 2 \cdot cycle + 1 \right| \)
\( \eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \max(0,\ 1-x) \)
where \(\mu\) is called the step size (step_size in the code below)
def cyclic_lr(iteration,max_lr,base_lr,step_size):
    cycle = np.floor(1+iteration/(2*step_size))
    x = np.abs(iteration/step_size - 2*cycle + 1)
    lr = base_lr + (max_lr-base_lr)*np.maximum(0, (1-x))
    return lr

def do_gradient_descent_clr(max_epochs):
    w,b = -2,0
    w_trace,b_trace = [w],[b]
    for i in range(max_epochs):
        # gradients over the full data (grad_w/grad_b assumed to sum over X,Y)
        dw = grad_w(w,b)
        db = grad_b(w,b)
        w = w - cyclic_lr(i,max_lr=0.1,base_lr=0.001,step_size=30) * dw
        b = b - cyclic_lr(i,max_lr=0.1,base_lr=0.001,step_size=30) * db
        w_trace.append(w)
        b_trace.append(b)
    return w_trace,b_trace
Cosine Annealing (Warm Re-Start)
Let's see how changing the learning rate cyclically also helps faster convergence with the following example
Cosine Annealing (Warm Re-Start)
Reached minimum at 46th Iteration (of course, if we run it for a few more iterations, it crosses the minimum)
However, we could use techniques such as early stopping to roll back to the minimum
Cosine Annealing (Warm Re-Start)
\(\eta_{max}\): Maximum value for the learning rate
\(\eta_{min}\): Minimum value for the learning rate
\(t\): Current epoch
\(T\): Restart interval (can be adaptive)
Modified formula from the paper for batch gradient (original deals with SGD and restarts after a particular epoch \(T_i\))
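The formula itself is presumably the standard cosine annealing schedule, which satisfies the two boundary conditions listed next:

\( \eta_t = \eta_{min} + \dfrac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\pi \dfrac{t}{T}\right)\right) \)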
\(\eta_t=\eta_{max}\), for \(t=0\)
\(\eta_t=\eta_{min}\), for \(t=T\)
Note the abrupt changes after \(T\) (That's why it is called warm re-start)
Warm-start
There are other learning rate schedulers, like warm-start, that are found to be helpful in achieving quicker convergence in architectures like transformers.
Typically, we set the initial learning rate to a high value and then decay it.
On the contrary, using a low initial learning rate for the first few steps helps the model warm up and converge better. This is called warm-start.
warmupSteps=4000
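For reference, this is the warm-up schedule from the original Transformer paper (Vaswani et al., 2017), with warmupSteps = 4000 as quoted above and \(d_{model}\) the model dimension:

\( \eta_t = d_{model}^{-0.5} \cdot \min\left(t^{-0.5},\ t \cdot warmupSteps^{-1.5}\right) \)

The rate grows linearly for the first warmupSteps steps and then decays proportionally to \(t^{-0.5}\).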
CS6910: Lecture 5
By Arun Prakash