CS6910: Fundamentals of Deep Learning

Lecture 5

Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD, AdaGrad, AdaDelta, RMSProp, Adam (and its variants)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

For most of the lecture, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on youtube)

Some content is based on the course CS231n by Andrej Karpathy and others

Acknowledgements

Module 5.1: Learning Parameters: Infeasible (Guess Work)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

1
x
\sigma
f(x) = \frac{1}{1+e^{-(wx+b)}}
\hat{y}=f(x)
w
b

Input for Training

\(\{x_i,y_i\}_{i=1}^N \rightarrow N \ \text{pairs of} \ (x,y) \)

Training Objective

Find \( w\) and \(b \) such that

\( \min \limits_{(w,b) \in \mathbb{R}^2} \mathscr{L}(w,b)= \frac{1}{N}\sum \limits_{i=1}^N (y_i-f(x_i))^2 \)

What does it mean to train the network?

Suppose we train the network with \( (x, y) = (0.5, 0.2) \) and \((2.5, 0.9)\)

x
y
(x_1,y_1)
(x_2,y_2)
f(x) = \frac{1}{1+e^{-(wx+b)}}

At the end of training we expect to find \(w^*\), \(b^*\) such that:

\( f(0.5) \rightarrow 0.2 \ \text{and} \ f(2.5) \rightarrow 0.9 \)

In other words

We hope to find a sigmoid function such that \((0.5, 0.2)\) and \((2.5, 0.9)\) lie on this sigmoid


Let us see this in more detail ...

Can we find \( w^*,b^* \) manually?

Let us start with a Random Guess: \( w=3,b=-1\)

Clearly, it is not good, but how bad is it ?

Let us revisit \(\mathscr{L} (w, b) \) to see how bad it is ...

\mathscr{L(w,b)}= \frac{1}{N} \sum \limits_{i=1}^N (y_i-f(x_i))^2
= \frac{1}{2} * \big( (y_1-f(x_1))^2 + (y_2-f(x_2))^2 \big)
= \frac{1}{2} * \big( (0.2-f(0.5))^2 + (0.9-f(2.5))^2 \big)
= \frac{1}{2} * \big( (0.20-0.62)^2 + (0.90-0.99)^2 \big)
\approx 0.092

We want  \(\mathscr{L} (w, b) \) to be as close to zero as possible

Let us try some other values for \((w,b)\) by changing the sliders

w       b       Loss
0.50    0.00    0.0730
-0.10   0.00    0.14
0.94    -0.94   0.0214
1.42    -1.73   0.0028
1.65    -2.08   0.0003
1.78    -2.27   0.0000

A few Snapshots

Let us Look for something better

than our "Guess work Algorithm"..

Since we have only 2 points and 2 parameters \((w, b)\) we can easily plot \(\mathscr{L}(w, b)\) for different values of \((w, b)\) and pick the one where \(\mathscr{L}(w, b)\) is minimum
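
A minimal sketch of this brute-force search (the grid range \([-6, 6]\) matches the range plotted below; the grid resolution is an arbitrary choice):

import numpy as np

X, Y = [0.5, 2.5], [0.2, 0.9]

def f(x, w, b):
  return 1/(1 + np.exp(-(w*x + b)))

def loss(w, b):
  return 0.5*sum((y - f(x, w, b))**2 for x, y in zip(X, Y))

# evaluate the loss on a coarse grid over (w, b) and keep the best point
ws = np.linspace(-6, 6, 121)
bs = np.linspace(-6, 6, 121)
best = min(((loss(w, b), w, b) for w in ws for b in bs), key=lambda t: t[0])
print(best)   # (minimum loss found, w*, b*)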

Random Search on Error Surface

But of course, this becomes intractable once you have many more data points and many more parameters !!

Further, even here we have plotted the error surface only for a small range of \((w, b)\) [from \(-6\) to \(+6\), and not from \((-\infty, +\infty)\)]

Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface


Module 5.2: Learning Parameters: Gradient Descent

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Goal

Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!

Vector of parameters, say,
randomly initialized

\( \theta = [w,b]\)

\( \Delta \theta = [\Delta w, \Delta b]\)

change in the value of

\(w,b\)

We moved in the

direction of \(\Delta \theta\)

Let us be a bit conservative:

move only by a small amount \(\eta \)

\(\theta_{new}=\theta+\eta \cdot \Delta \theta \)

Question: What is the right \(\Delta \theta\) to use ?

The answer comes from Taylor series

\theta
\Delta \theta
\color{red} \eta \cdot \Delta \theta
\theta_{new}

What is Taylor Series?

Taylor series is a way of approximating any sufficiently differentiable function \(\mathscr{L}(w)\) using polynomials of degree \(n\). The higher the degree, the better the approximation!

 \(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+\frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2+\frac{\mathscr{L'''}(w_0)}{3!}(w-w_0)^3+ \cdots\)

w
\mathscr{L}(w)
w_0

Linear Approximation (\( n=1 \))

\( \mathscr{L}(w)=\mathscr{L}(w_0)+ \frac{\mathscr{L'}(w_0)}{1!}(w-w_0) \)

Quadratic Approximation (\(n=2\))

 \(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+ \frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2 \)

\mathscr{L}(w_0)

Taylor Series Approximation

Try with different \( L(w) \) such as \(exp(w), \sin(w), w^3, \cdots \) and see how closely the Taylor series approximates the function at the point \(w = w_0\) and in its \(\epsilon\)-neighbourhood
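
A quick numerical check of these approximations (the choice of \(L(w)=\sin(w)\) and \(w_0\) below is mine, purely for illustration):

import numpy as np

w0 = 1.0                                            # point around which we approximate L(w) = sin(w)
L0, L1, L2 = np.sin(w0), np.cos(w0), -np.sin(w0)    # L(w0), L'(w0), L''(w0)

def linear(w):        # n = 1
  return L0 + L1*(w - w0)

def quadratic(w):     # n = 2
  return L0 + L1*(w - w0) + 0.5*L2*(w - w0)**2

for eps in [0.1, 0.5, 1.0]:
  w = w0 + eps
  print(eps, np.sin(w), linear(w), quadratic(w))
# both approximations are good near w0 and degrade as we move away;
# the quadratic one stays accurate over a larger neighbourhood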

Approximating 2D function

For ease of notation, let \(\Delta \theta  =  u\) , then from Taylor series, we have,

\mathscr{L}(\theta+ \eta u) = \mathscr{L}(\theta)+\eta*u^T \nabla_{\theta}\mathscr{L}(\theta)+\frac{\eta^2}{2!}*u^T\nabla_{\theta}^2\mathscr{L}(\theta)u+ \frac{\eta^3}{3!} \cdots +\frac{\eta^4}{4!} \cdots
=\mathscr{L}(\theta)+\eta*u^T \nabla_{\theta}\mathscr{L}(\theta)

[ \( \eta\) is typically small, so \( \eta^2,\eta^3  \cdots\rightarrow 0\)]

Note that the move (\(\eta u\)) would be favorable only if

\mathscr{L}(\theta+\eta u)-\mathscr{L}(\theta) < 0

[i.e., if the new loss is less than the previous loss]

This implies,

u^T \nabla_{\theta}\mathscr{L}(\theta) < 0

But, what is the range of \( u^T \nabla_{\theta}\mathscr{L}(\theta) \)?

Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathscr{L}(\theta)\), then we know that

\(-1 \leq  \cos(\beta) = \frac{ u^T  \nabla_{\theta} \mathscr{L}(\theta)}{ ||u|| *  ||\nabla_{\theta} \mathscr{L}(\theta)||} \leq 1 \)

Multiplying throughout by \(k = ||u|| *  ||\nabla_{\theta} \mathscr{L}(\theta)|| \)

\(-k \leq  k*\cos(\beta) = u^T  \nabla_{\theta} \mathscr{L}(\theta)  \leq k \)

Thus \( \mathscr{L}(\theta+\eta u) - \mathscr{L}(\theta) = \eta * u^T \nabla_{\theta} \mathscr{L}(\theta) = \eta k*\cos(\beta) \) is most negative when \(\cos(\beta)=-1\) (i.e., \(\beta\) = \(180^o\))

Gradient Descent Rule

The direction \(u\) that we intend to move in should be at 180° w.r.t. the gradient

In other words, move in a direction opposite to the gradient

Parameter Updation Rule

\( w_{t+1} = w_t - \eta \nabla w_t\)

\(b_{t+1} = b_t - \eta \nabla b_t \)

where, \( \nabla w_t = \frac{\partial \mathscr{L}(w,b)}{\partial w}\), at \(w=w_t,b=b_t \), 

and, \( \nabla b_t = \frac{\partial \mathscr{L}(w,b)}{\partial b}\), at \(w=w_t,b=b_t \), 

So we now have a more principled way of moving in the (\(w,b\)) plane than our “guess work” algorithm

Let us create an algorithm for this rule ...

To see this algorithm in practice let us first derive \( \nabla w \) and \( \nabla b\)  for our toy neural network

max_iterations = 1000
w = random()
b = random()
while max_iterations:
  w = w - eta*dw
  b = b - eta*db
  max_iterations -= 1

Algorithm: gradient_descent()

\(t \leftarrow 0;\)

max_iterations \(\leftarrow 1000\);

w,b \(\leftarrow\) initialize randomly

while \(t <\) max_iterations  do

        \(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);

        \(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);

         \( t \leftarrow t+1;\)

end

f(x) = \frac{1}{1+e^{-(wx+b)}}
1
x
\sigma
\hat{y}=f(x)
w
b

Let us assume there is only one point to fit

\mathscr{L}(w,b)=\frac{1}{2}*(f(x)-y)^2
\nabla w = \frac{\partial \mathscr{L}(w,b)}{\partial w} = \frac{\partial}{\partial w}[\frac{1}{2}*(f(x)-y)^2]
\nabla w = \frac{\partial}{\partial w}[\frac{1}{2}*(f(x)-y)^2]
= \frac{1}{2} [2*(f(x)-y) * \frac{\partial}{\partial w} (f(x)-y)]
=(f(x)-y) * \frac{\partial}{\partial w} (f(x))
=(f(x)-y) * \frac{\partial}{\partial w} (\frac{1}{1+e^{-(wx+b)}})
\frac{\partial}{\partial w} (\frac{1}{1+e^{-(wx+b)}})
= \frac{-1}{(1+e^{-(wx+b)})^2} \frac{\partial}{\partial w} (e^{-(wx+b)})
= \frac{-1}{(1+e^{-(wx+b)})^2}* (e^{-(wx+b)})* \frac{\partial}{\partial w} (-(wx+b))
= \frac{-1}{(1+e^{-(wx+b)})}* \frac{e^{-(wx+b)}}{(1+e^{-(wx+b)})}* -x
= \frac{1}{(1+e^{-(wx+b)})}* \frac{e^{-(wx+b)}}{(1+e^{-(wx+b)})}* x
= f(x)*(1-f(x))* x
=(f(x)-y) * f(x)*(1-f(x))*x
f(x) = \frac{1}{1+e^{-(wx+b)}}
1
x
\sigma
\hat{y}=f(x)
w
b
(x_1,y_1)
(x_2,y_2)
\nabla w=(f(x)-y) * f(x)*(1-f(x))*x

So If we have only one point (\(x,y\)), we have,

For two points,

\nabla w=\sum \limits_{i=1}^{2}(f(x_i)-y_i) * f(x_i)*(1-f(x_i))*x_i
\nabla b=\sum \limits_{i=1}^{2}(f(x_i)-y_i) * f(x_i)*(1-f(x_i))
import numpy as np
X = [0.5,2.5]
Y = [0.2,0.9]

def f(x,w,b): 
  return 1/(1+np.exp(-(w*x+b)))

def error(w,b):
  err = 0.0
  for x,y in zip(X,Y):
    fx = f(x,w,b)
    err += (fx-y)**2
  return 0.5*err

def grad_b(x,w,b,y):
  fx = f(x,w,b)
  return (fx-y)*fx*(1-fx)

def grad_w(x,w,b,y):
  fx = f(x,w,b)
  return (fx-y)*fx*(1-fx)*x

def do_gradient_descent():
  
  w,b,eta,max_epochs = -2,-2,1.0,1000
  
  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw += grad_w(x,w,b,y)
      db += grad_b(x,w,b,y)
    
    w = w - eta*dw
    b = b - eta*db

Let us do_gradient_descent()

import numpy as np
X = [0.5,2.5]
Y = [0.2,0.9]

def f(x,w,b): 
  return 1/(1+np.exp(-(w*x+b)))

def error(w,b):
  err = 0.0
  for x,y in zip(X,Y):
    fx = f(x,w,b)
    err += (fx-y)**2
  return 0.5*err

def grad_b(x,w,b,y):
  fx = f(x,w,b)
  return (fx-y)*fx*(1-fx)

def grad_w(x,w,b,y):
  fx = f(x,w,b)
  return (fx-y)*fx*(1-fx)*x

def do_gradient_descent():
  
  w,b,eta,max_epochs = -4,-4,1.0,1000
  
  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw += grad_w(x,w,b,y)
      db += grad_b(x,w,b,y)
    
    w = w - eta*dw
    b = b - eta*db

When the curve is steep the gradient \( \frac{\Delta y}{\Delta x}\) is large.

When the curve is gentle the gradient \( \frac{\Delta y}{\Delta x}\) is small.

Recall that our weight updates are proportional to the gradient \( w = w - \eta \nabla w \)

Hence in the areas where the curve is gentle the updates are small whereas in the areas where the curve is steep the updates are large

Drag the blue point to see how the slope changes.

Let’s see what happens when we start from a different point

Irrespective of where we start from once we hit a surface which has a gentle slope, the progress slows down

Try: \(w_{init}=-2,b_{init}=2\)

Try: \(w_{init}=-3,b_{init}=2\)

Set  \(t_{max}=150\)

Module 5.3 : Contours

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Visualizing things in 3d can sometimes become a bit cumbersome

Can we do a 2d visualization of this traversal along the error surface?

Yes, let’s take a look at something known as contours

Suppose I take horizontal slices of this error surface at regular intervals along the vertical axis

How would this look from the top view?


A small distance between the contours indicates a steep slope along that direction

A large distance between the contours indicates a gentle slope along that direction

Gradient Descent

Module 5.4 : Momentum Based Gradient Descent

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

It is helpful to remember that the first iteration of any optimization algorithm starts with computing \(\nabla w_0\).

Some observations about gradient descent

It takes a lot of time to navigate regions having a gentle slope

This is because the gradient in these regions is very small

Can we do something better? 

Yes, let’s take a look at Momentum based gradient descent

Before moving further

Suppose that we want to calculate the average of \(n\) numbers, say \(n=4\), \(x_n=[1,2,3,4]\). How do we do it?

\mu_x = \frac{1+2+3+4}{4}=2.5

Suppose that we want to calculate the average of \(n\) numbers, say \(n=4\). However, this time not all the numbers are readily available, they are coming one at a time \(t\). How do we do it?

Approach: 1

Accumulate the numbers as they come in.

Update the average as a new number comes in

t=1
x=[1], \mu_x=\frac{1}{1}=1
t=2
x=[1,2], \mu_x=\frac{1+2}{2}=1.5
t=3
x=[1,2,3], \mu_x=\frac{1+2+3}{3}=2
t=4
x=[1,2,3,4], \mu_x=\frac{1+2+3+4}{4}=2.5

This is not a good approach if we are only interested in the average!

It requires us to store all numbers and therefore involves redundant computation

Is there a better way of doing it?

Approach: 2

t=1
\mu_1=x_1=1
t=2
\mu_2=\frac{1+2}{2}=1.5
t=3
\mu_3=\frac{1+2+3}{3}=2
t=4
\mu_4=\frac{1+2+3+4}{4}=2.5
\mu_t=\frac{s_t}{t}

Do we need to sum all the numbers every time a new number comes in?

No! We can instead maintain a running sum

s_t= s_{t-1}+x_t

with initial condition \(s_0=0\), and compute the average as \(\mu_t=\frac{s_t}{t}\)
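
A minimal sketch of Approach 2 in code:

s, numbers = 0, [1, 2, 3, 4]
for t, x in enumerate(numbers, start=1):
  s = s + x       # s_t = s_{t-1} + x_t, with s_0 = 0
  mu = s/t        # mu_t = s_t / t
  print(t, mu)    # 1.0, 1.5, 2.0, 2.5 -- no need to store all the numbers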

Update rule for momentum based gradient descent

u_t = \beta u_{t-1}+ \nabla w_t
w_{t+1} = w_t - \eta u_t

with, \( u_{-1}=0\), \(w_0 = rand()\) and \(0 \leq \beta < 1\)

In addition to the current update, also look at the history of updates.

Intuition

If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking bigger steps in that direction

Just as a ball gains momentum while rolling down a slope

u_t = \beta u_{t-1}+ \nabla w_t
u_0 = \nabla w_0 \quad \because u_{-1}=0
u_1 = \beta u_0+\nabla w_1 = \beta (\nabla w_0)+\nabla w_1
u_2 = \beta u_1+\nabla w_2 = \beta (\beta \nabla w_0+\nabla w_1)+\nabla w_2=\beta^2 \nabla w_0+\beta \nabla w_1+\nabla w_2
u_t = \sum \limits_{\tau=0}^t \beta^{t-\tau} \nabla w_{\tau}

The exponentially weighted average of current and all past gradients.
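
A small check that the recursive update really produces this exponentially weighted sum (the gradient values are made up):

beta, grads = 0.9, [0.5, -0.2, 0.1, 0.4]   # hypothetical gradient values

u = 0.0
for g in grads:              # u_t = beta*u_{t-1} + grad_t
  u = beta*u + g

t = len(grads) - 1
u_closed = sum(beta**(t - tau)*g for tau, g in enumerate(grads))
print(u, u_closed)           # both give the same value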

def do_mgd(max_epochs):
    w,b,eta = -2,-2,1.0
    prev_uw,prev_ub,beta = 0,0,0.9
   
    for i in range(max_epochs):
        dw,db = 0,0        
        for x,y in zip(X,Y):
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)
            
        uw = beta*prev_uw+eta*dw
        ub = beta*prev_ub+eta*db
        w = w - uw
        b = b - ub
        prev_uw = uw
        prev_ub = ub
        

Some Observations and questions

Even in the regions having gentle slopes, momentum based gradient descent is able to take large steps because the momentum carries it along

Is moving fast always good? Would there be a situation where momentum causes us to run past our goal?

Let us change our input data so that we end up with a different error surface and then see what happens ...

In this case, the error is high on either side of the minima valley

Could momentum be detrimental in such cases... let’s see....

Momentum based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley

Takes a lot of u-turns before finally converging

\(w_0=-2, b_0=-4, \eta=1 \)

Despite these u-turns it still converges faster than vanilla gradient descent

After 50 iterations, the momentum-based method has reached an error of 0.00001 whereas vanilla gradient descent is still stuck at an error of 0.16

Module 5.5 : Nesterov Accelerated Gradient Descent

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Questions

Can we do something to reduce these oscillations ?

Yes, let’s look at Nesterov accelerated gradient

Intuition

Look before you leap

Update rule for NAG

u_t = \beta u_{t-1}+ \nabla (w_t - \beta u_{t-1})
w_{t+1} = w_t - \eta u_t

with, \( u_{-1}=0\) and \(0 \leq \beta < 1\)

Recall that \( u_t = \beta u_{t-1}+\nabla w_t\)

So, we know that we are going to move at least by \( \beta u_{t-1}\) and then a bit more by \(\nabla w_t\)

Why not make use of this to look ahead instead of relying only on the current \(\nabla w_t\)

def do_nag(max_epochs):
    w,b,eta = -2,-2,1.0
    prev_vw,prev_vb,beta = 0,0,0.9
   
    for i in range(max_epochs):
        dw,db = 0,0
        # do partial updates
        v_w = beta*prev_vw
        v_b = beta*prev_vb
        for x,y in zip(X,Y):
            # Look ahead
            dw += grad_w(w-v_w,b-v_b,x,y)
            db += grad_b(w-v_w,b-v_b,x,y)
        vw = beta*prev_vw+eta*dw
        vb = beta*prev_vb+eta*db
        w = w - vw
        b = b - vb
        prev_vw = vw
        prev_vb = vb
        

Observations about NAG

Looking ahead helps NAG in correcting its course quicker than momentum-based gradient descent.

Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller.

Why do we care for an algorithm with quicker convergence?

Not only about time but also about the cost!


Module 5.6 : Stochastic And Mini-Batch Gradient Descent

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Let’s digress a bit and talk about the stochastic version of these algorithms...

import numpy as np
X = [0.5,2.5]
Y = [0.2,0.9]



def do_gradient_descent():
  
  w,b,eta,max_epochs = -2,-2,1.0,1000
  
  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw += grad_w(x,w,b,y)
      db += grad_b(x,w,b,y)
    
    w = w - eta*dw
    b = b - eta*db

Notice that the algorithm goes over the entire data once before updating the parameters

Why?

Because this is the true gradient of the loss as derived earlier (sum of the gradients of the losses corresponding to each data point)

No approximation.

Hence, theoretical guarantees hold (in other words each step guarantees that the loss will decrease)

What’s the flipside?

Imagine we have a million points in the training data.

To make 1 update to \(w, b\) the algorithm makes a million calculations. Obviously very slow!!

Can we do something better?

Yes, let’s look at stochastic gradient descent


def do_stochastic_gradient_descent():
  
  w,b,eta,max_epochs = -2,-2,1.0,1000
  
  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw = grad_w(x,w,b,y)
      db = grad_b(x,w,b,y)
      w = w - eta*dw
      b = b - eta*db

Notice that the algorithm updates the parameters for every single data point

Now if we have a million data points we will make a million updates in each epoch (1 epoch = 1 pass over the data; 1 step = 1 update)

What is the flipside ?

Stochastic because we are estimating the total gradient based on a single data point.

It is an approximate (rather stochastic) gradient

Almost like tossing a coin only once and estimating P(heads).

No guarantee that each step will decrease the loss

Let’s see this algorithm in action when we have a few data points

We see many oscillations. Why ?

Because we are making greedy decisions.

Each point is trying to push the parameters in a direction most favorable to it (without being aware of how this affects other points)

A parameter update which is locally favorable to one point may harm other points (it is almost as if the data points are competing with each other)

Can we reduce the oscillations by improving our stochastic estimates of the gradient? (currently estimated from just 1 data point at a time)

Yes, let’s look at mini-batch gradient descent

def do_stochastic_gradient_descent():
  
  w,b,eta,max_epochs = -2,-2,1.0,500
  
  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw = grad_w(x,w,b,y)
      db = grad_b(x,w,b,y)
      w = w - eta*dw
      b = b - eta*db
def do_minibatch_stochastic_gradient_descent():
  
  w,b,eta,max_epochs = -2,-2,1.0,500 
  mini_batch_size = 25
  
  for i in range(max_epochs):   
    dw,db,num_points_seen = 0,0,0
    for x,y in zip(X,Y):
      dw += grad_w(x,w,b,y)
      db += grad_b(x,w,b,y)  
      num_points_seen += 1
      
      if num_points_seen % mini_batch_size == 0:
        w = w - eta*dw
        b = b - eta*db
        dw,db = 0,0

Notice that the algorithm updates the parameters after it sees mini_batch_size number of data points

The stochastic estimates are now slightly better

Let’s see this algorithm in action when we have \(k = 25\)

With a batch size of \(k=25\) the oscillations have reduced to a good extent.

With a batch size of \(k=25\) the oscillations have reduced to a good extent. Why ?

Because we now have slightly better estimates of the gradient  [analogy: we are now tossing the coin \(k=25\) times to estimate P(heads)]

The higher the value of \(k\) the more accurate are the estimates

Of course, there are still oscillations and they will always be there as long as we are using an approximate gradient as opposed to the true gradient

Some things to remember..

1 epoch = one pass over the entire data

Algorithm                          # of steps in 1 epoch
Vanilla (Batch) Gradient Descent   1
Stochastic Gradient Descent        N
Mini-Batch Gradient Descent        \frac{N}{B}

1 step = one update of the parameters

\(N\) = Number of data points

\(B\) = Mini Batch size
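
For example, with \(N = 10{,}000\) data points and a mini-batch size of \(B = 100\), one epoch corresponds to 1 step of vanilla (batch) gradient descent, 10,000 steps of stochastic gradient descent, and \(N/B = 100\) steps of mini-batch gradient descent.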

Similarly, we can have stochastic versions of Momentum based gradient descent and Nesterov accelerated based gradient descent

\(\beta =0.5\)

SGD with Momentum

SGD-NAG-Momentum

\(\beta=0.85\)

While the stochastic versions of both Momentum [green] and NAG [blue] exhibit oscillations the relative advantage of NAG over Momentum still holds (i.e., NAG takes relatively shorter u-turns)

Further both of them are faster than stochastic gradient descent (after 500 steps, stochastic gradient descent [black] still exhibits a very high error whereas NAG and Momentum are close to convergence)

And, of course, you can also have the mini batch version of Momentum and NAG...I leave that as an exercise 

Module 5.7 : Tips for Adjusting learning Rate and Momentum

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Before moving on to advanced optimization algorithms let us revisit the problem of learning rate in gradient descent

One could argue that we could have solved the problem of navigating gentle slopes by setting the learning rate high (i.e., blow up the small gradient by multiplying it with a large \(\eta\))

Let us see what happens if we set the learning rate to 10

On the regions which have a steep slope, the already large gradient blows up further

It would be good to have a learning rate which could adjust to the gradient ...

we will see a few such algorithms soon..

Tips for initial learning rate

Tune learning rate [Try different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0]

Run a few epochs with each of these and figure out a learning rate which works best

Now do a finer search around this value [for example, if the best learning rate was 0.1 then now try some values around it: 0.05, 0.2, 0.3]

Disclaimer: these are just heuristics ... no clear winner strategy

Tips for annealing learning rate

Step Decay:

Halve the learning rate after every 5 epochs or

Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch

Tips for annealing learning rate

Exponential Decay:

\(\eta=\eta_0 e^{-kt}\) where \(\eta_0\) and \(k\) are

hyperparameters and \(t\) is a step number

Tips for annealing learning rate

1/t Decay:

\(\eta=\frac{\eta_0}{1+kt}\) where \(\eta_0\) and \(k\) are

hyperparameters and \(t\) is a step number
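
A minimal sketch of these three annealing schedules in code (the constants are placeholder values, not recommendations):

import numpy as np

eta0, k = 0.1, 0.05                       # hypothetical hyperparameters

def step_decay(epoch, drop=0.5, every=5):
  # halve the learning rate after every 5 epochs
  return eta0*(drop**(epoch//every))

def exponential_decay(t):
  return eta0*np.exp(-k*t)                # eta = eta0 * e^(-kt)

def one_over_t_decay(t):
  return eta0/(1 + k*t)                   # eta = eta0 / (1 + kt)

for t in [0, 5, 10, 20]:
  print(t, step_decay(t), exponential_decay(t), one_over_t_decay(t))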

Tips for momentum

The following schedule was suggested by Sutskever et. al., 2013

\( \beta_t = min(1-2^{-1-\log_2(\lfloor \frac{t}{250} \rfloor+1)},\beta_{max}) \)

where, \(\beta_{max}\) is chosen from \(\{0.999,0.995,0.99,0.9,0\}\)
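
The same schedule in code (a sketch; \(\beta_{max}=0.999\) is picked arbitrarily from the listed set):

import numpy as np

def momentum_schedule(t, beta_max=0.999):
  # beta_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), beta_max)
  return min(1 - 2**(-1 - np.log2(np.floor(t/250) + 1)), beta_max)

for t in [0, 250, 500, 1000, 5000]:
  print(t, momentum_schedule(t))   # beta_t rises from 0.5 towards beta_max as training progresses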

Module 5.8: Line Search

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Just one last thing before we move on to some other algorithms ...

def do_line_search_gradient_descent(max_epochs):
    w,b,etas = -2,-2,[0.1,0.5,1,2,10] 
    
    for i in range(max_epochs):
        
        dw,db = 0,0       
        for x,y in zip(X,Y):
            dw += grad_w(w,b,x,y)
            db += grad_b(w,b,x,y)      
        
        min_error = 10000  # some large value
        for eta in etas:
          temp_w = w - eta * dw
          temp_b = b - eta * db
          if error(temp_w,temp_b) < min_error:
            best_w = temp_w
            best_b = temp_b            
            min_error = error(best_w,best_b)
        w = best_w
        b = best_b
    

In practice, often a line search is done to find a relatively better value of \(\eta\)

Update \(w\) using different values of \(\eta\)

Now retain that updated value of \(w\) which gives the lowest loss

Essentially at each step, we are trying to use the best \(\eta\) value from the available choices


What’s the flipside? We are doing many more computations in each step

Blue: Gradient descent with fixed learning rate \(\eta=1\)

Black: Gradient descent with line search

Let us see line search in action

Convergence is faster than vanilla gradient descent

We see some oscillations, but note that these oscillations are different from what we see in momentum and NAG

Module 5.9 : Gradient Descent with Adaptive Learning Rate

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

\sigma
1
x^1
x^2
x^3
x^4
y
y = f(x)=\frac{1}{1+e^{-(\mathbf{w}^T\mathbf{x}+b)}}
x=\{x^1,x^2,x^3,x^4\}
w=\{w^1,w^2,w^3,w^4\}

Given this network, it should be easy to see that given a single point (\(\mathbf{x},y\))

\nabla w^1= (f(\mathbf{x})-y)*f(\mathbf{x})*(1-f(\mathbf{x}))*x^1
\nabla w^2= (f(\mathbf{x})-y)*f(\mathbf{x})*(1-f(\mathbf{x}))*x^2 \quad \text{...and so on}

If there are \(n\) points, we can just sum the gradients over all the \(n\) points to get the total gradient

What happens if the feature \(x^2\) is very sparse? (i.e., if its value is 0 for most inputs)

\(\nabla w^2\) will be 0 for most inputs (see formula) and hence \(w^2\) will not get enough updates

If \(x^2\) happens to be sparse as well as important we would want to take the updates to \(w^2\) more seriously.

Can we have a different learning rate for each parameter which takes care of the frequency of features?

Intuition

Decay the learning rate for parameters in proportion to their update history (more updates means more decay)

v_t=v_{t-1}+(\nabla w_t)^2
w_{t+1}=w_{t}-\frac{\eta}{\sqrt{v_t+\epsilon}}*\nabla w_{t}

... and a similar set of equations for \(b_t\)

Update Rule for AdaGrad

To see this in action we need to first create some data where one of the features is sparse

How would we do this in our toy network ?

Well, our network has just two parameters \(w\) and \(b\). Of these, the input/feature corresponding to \(b\) is always on (so we can't really make it sparse)

def do_adagrad(max_epochs):
  
  #Initialization
  w,b,eta = -2,-2,0.1
  v_w,v_b,eps = 0,0,1e-8
  for i in range(max_epochs): 
    # zero grad
    dw,db = 0,0        
    for x,y in zip(X,Y):
        
        #compute the gradients
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)

    #compute intermediate values
    v_w = v_w + dw**2
    v_b = v_b + db**2      

    #update parameters
    w = w - eta*dw/(np.sqrt(v_w)+eps)
    b =b - eta*db/(np.sqrt(v_b)+eps)

Take some time to think about it


The only option is to make \(x\) sparse

Solution: We created 500 random \((x, y)\) pairs and then for roughly 80% of these pairs we set \(x\) to \(0\) thereby, making the feature for \(w\) sparse
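
A sketch of how such data could be generated (the exact recipe used for the plots is not shown in the lecture, so the constants here, including the "true" sigmoid parameters, are assumptions):

import numpy as np

np.random.seed(0)
N = 500
X = np.random.rand(N)*2.5                    # random inputs
X[np.random.rand(N) < 0.8] = 0.0             # ~80% of the x values set to 0 -> sparse feature for w
Y = 1/(1 + np.exp(-(1.78*X - 2.27)))         # targets from an assumed "true" sigmoid (w*=1.78, b*=-2.27)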


There is something interesting that these 3 algorithms are doing for this dataset. Can you spot it?

Initially, all three algorithms are moving mainly along the vertical \((b)\) axis and there is very little movement along the horizontal \((w)\) axis


Why? Because in our data, the feature corresponding to \(w\) is sparse and hence \(w\) undergoes very few updates ...on the other hand b is very dense and undergoes many updates

Such sparsity is very common in large neural networks containing \(1000s\) of input features and hence we need to address it

Let's see what AdaGrad does..

Adagrad vs. SGD, Momentum and NAG

learning rate \(\eta\) = 0.1 for all the algorithms

Momentum \(\gamma = 0.9\)

Number of points 500, 80% set to zero

Momentum and NAG converge more quickly near the optima than Adagrad.

Adagrad slows down near the optima due to decaying learning rate


v_t=v_{t-1}+(\nabla w_t)^2
\nabla w=(f(x)-y) * f(x)*(1-f(x))*x
v_0=(\nabla w_0)^2
v_1=(\nabla w_0)^2+(\nabla w_1)^2
v_2=(\nabla w_0)^2+(\nabla w_1)^2+(\nabla w_2)^2
v_t=(\nabla w_0)^2+(\nabla w_1)^2+(\nabla w_2)^2+ \cdots+ (\nabla w_t)^2

Recall that,

Since \(x\) is sparse, the gradient is zero for most of the time steps.

Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays slowly.

By using a parameter specific learning rate it ensures that despite sparsity \(w\) gets a higher learning rate and hence larger updates

Further, it also ensures that if \(b\) undergoes a lot of updates its effective learning rate decreases because of the growing denominator


In practice, this does not work so well if we remove the square root from the denominator (something to ponder about)


What's the flipside? Over time the effective learning rate for \(b\) will decay to an extent that there will be no further updates to \(b\)

Can we avoid this?

Intuition

Adagrad decays the learning rate very aggressively (as the denominator grows)

Update Rule for RMSprop

v_t = \beta v_{t-1}+(1-\beta)(\nabla w_t)^2
w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t+\epsilon}}\nabla w_t

... and a similar set of equations for \(b_t\)

As a result, after a while, the parameters corresponding to frequent features start receiving very small updates because of the decayed learning rate

To avoid this why not decay the denominator and prevent its rapid growth

Recall that, for AdaGrad,

v_t=v_{t-1}+(\nabla b_t)^2
\nabla b=(f(x)-y) * f(x)*(1-f(x))
v_0=\nabla b_0^2
v_1=\nabla b_0^2+\nabla b_1^2
v_2=\nabla b_0^2+\nabla b_1^2+\nabla b_2^2
v_t=\nabla b_0^2+\nabla b_1^2+\nabla b_2^2+ \cdots+ \nabla b_t^2

Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays rapidly for \(b\)

For RMSProp, with \(\beta = 0.9\),

v_t= \beta v_{t-1}+(1-\beta)(\nabla b_t)^2, \quad \beta\in[0,1)
v_0= 0.1\nabla b_0^2
v_1=0.09 \nabla b_0^2+0.1 \nabla b_1^2
v_2 \approx 0.08 \nabla b_0^2+0.09 \nabla b_1^2+0.1 \nabla b_2^2
v_t= (1-\beta)\sum \limits_{\tau=0}^t \beta^{t-\tau} \nabla b_\tau^2

Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays slowly (compared to AdaGrad) for \(b\)
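
A tiny numerical comparison of the two accumulators for a dense sequence of (made-up) gradient values like the ones \(b\) receives:

import numpy as np

grads = np.full(100, 0.5)                      # hypothetical dense gradients for b
eta, beta, eps = 0.1, 0.9, 1e-8

v_ada, v_rms = 0.0, 0.0
for g in grads:
  v_ada += g**2                                # AdaGrad accumulator: keeps growing
  v_rms = beta*v_rms + (1 - beta)*g**2         # RMSProp accumulator: stays bounded

print(eta/np.sqrt(v_ada + eps))                # effective step has become tiny
print(eta/np.sqrt(v_rms + eps))                # effective step is still usable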

def do_rmsprop(max_epochs):
  #Initialization
  w,b,eta = -4,4,0.1
  beta = 0.5 
  v_w,v_b,eps = 0,0,1e-4 
  
  for i in range(max_epochs): 
    # zero grad
    dw,db = 0,0   
    
    for x,y in zip(X,Y):     

        #compute the gradients
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)

    #compute intermediate values
    v_w = beta*v_w +(1-beta)*dw**2
    v_b = beta*v_b + (1-beta)*db**2

    #update parameters
    w = w - eta*dw/(np.sqrt(v_w)+eps)
    b =b - eta*db/(np.sqrt(v_b)+eps)   
 
\eta = 0.1, \beta=0.5

However, why are there oscillations?

Does it imply that after some iterations, there is a chance that the learning rate remains constant so that the algorithm possibly gets into an infinite oscillation around the minima?

RMSProp converged  more quickly than AdaGrad by being less aggressive on decay


In the case of AdaGrad, the learning rate monotonically decreases due to the ever-growing denominator!

In the case of RMSProp, the learning rate may increase, decrease, or remain constant due to the moving average of gradients in the denominator.

The figure below shows the gradients across 500 iterations. Can you see the reason for the oscillation around the minimum?

What is the solution?

If the learning rate is constant, there is a chance that the algorithm will oscillate around the local minimum.

\eta=0.05, \quad \epsilon=10^{-4}
\frac{\eta}{\sqrt{\epsilon}}=\frac{0.05}{\sqrt{10^{-4}}}

We have to set the initial learning rate appropriately; in this case, setting \(\eta=0.05\) solves the problem.

Sensitive to initial learning rate, initial conditions of parameters and corresponding gradients

 If the initial gradients are large, the learning rates will be low for the remainder of training (in AdaGrad)

Later, if a gentle curvature is encountered, no way to increase the learning rate (In Adagrad)


AdaDelta

for \quad t \quad in \quad range(1,N):
1. \rightarrow \nabla w_t
2. \rightarrow v_t = \beta v_{t-1}+(1-\beta)(\nabla w_t)^2
3. \rightarrow \Delta w_t = -\frac{\sqrt{u_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \nabla w_t
4. \rightarrow u_t = \beta u_{t-1}+(1-\beta)(\Delta w_t)^2
5. \rightarrow w_{t+1}=w_t+\Delta w_t

Avoids setting initial learning rate \(\eta_0\).

Reason why it is called Adaptive Delta

v_t= \beta v_{t-1}+(1-\beta)(\nabla w_t)^2, \quad \beta\in[0,1)

With \(\beta = 0.9\),

v_0= 0.1\nabla w_0^2
v_1=0.09 \nabla w_0^2+0.1 \nabla w_1^2
v_2 \approx 0.08 \nabla w_0^2+0.09 \nabla w_1^2+0.1 \nabla w_2^2
v_3 \approx 0.07 \nabla w_0^2+0.08 \nabla w_1^2+0.09 \nabla w_2^2+0.1 \nabla w_3^2
v_t= (1-\beta)\sum \limits_{\tau=0}^t \beta^{t-\tau} \nabla w_\tau^2

On a surface with high curvature we want the effective learning rate to decrease, whereas on a surface with gentle curvature we want it to increase. Let us see whether AdaDelta achieves this.

v_0= 0.1\nabla w_0^2
\Delta w_0 = -\frac{\sqrt{\epsilon}}{\sqrt{v_0+\epsilon}} \nabla w_0
u_0= 0.1 \Delta w_0^2

Surface with high curvature

\(\Delta w_0 < \nabla w_0\)

\(u_0 < v_0\)

w_{1}=w_0+\Delta w_0

A very small update

v_1= 0.08 \nabla w_0^2+0.1\nabla w_1^2
\Delta w_1 = -\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} \nabla w_1
u_1= 0.08 \Delta w_0^2+0.1 \Delta w_1^2
w_{2}=w_1+\Delta w_1

A smaller update, since the effective learning rate \(\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} < 1\)

\(u_0 < v_1\)

We want the learning rate to decrease. The algorithm does it

v_0= 0.1\nabla w_0^2
\Delta w_0 = -\frac{\sqrt{\epsilon}}{\sqrt{v_0+\epsilon}} \nabla w_0
u_0= 0.1 \Delta w_0^2

Surface with low curvature

\(\Delta w_0 < \nabla w_0\)

\(u_0 < v_0\)

w_{1}=w_0+\Delta w_0

A very small update

v_1= 0.08 \nabla w_0^2+0.1\nabla w_1^2
\Delta w_1 = -\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} \nabla w_1
u_1= 0.08 \Delta w_0^2+0.1 \Delta w_1^2
w_{2}=w_1+\Delta w_1

A bigger update, since the effective learning rate \(\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} > 1\)

\(u_0 > v_1\)

Let's say \(\nabla w_1\) is near zero and squaring it pushes it towards zero, so we can safely ignore its contribution (just for argument)

We want the learning rate to increase. The algorithm does it!

def do_adadelta(max_epochs): 
  #Initialization
  w,b= -4,-4
  beta = 0.99
  v_w,v_b,eps = 0,0,1e-4
  u_w,u_b = 0,0   
 
  for i in range(max_epochs): 
    dw,db = 0,0       
    for x,y in zip(X,Y):            

        #compute the gradients
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)       
   
    v_w = beta*v_w + (1-beta)*dw**2
    v_b = beta*v_b + (1-beta)*db**2
    
    delta_w = dw*np.sqrt(u_w+eps)/(np.sqrt(v_w+eps))
    delta_b = db*np.sqrt(u_b+eps)/(np.sqrt(v_b+eps))    
    u_w = beta*u_w + (1-beta)*delta_w**2
    u_b = beta*u_b + (1-beta)*delta_b**2  

    
    w = w - delta_w
    b = b - delta_b    

Observe the gradient profile along the trajectory

It starts off in a high curvature region and eventually reaches the low curvature region

Let's take a look at how the learning rate changed dynamically over the iterations.

What's your guess?

Let's put it all together in one place

Adam (Adaptive Moments)

Intuition

Do everything that RMSProp and AdaDelta does to solve the decay problem of Adagrad

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Here \(m_t\) incorporates classical momentum, i.e., a cumulative history of the gradients, while \(v_t\) accumulates the squared (\(L_2\) norm of the) gradients as in RMSProp.

Typically, \(\beta_1=0.9\), \(\beta_2=0.999\)

def do_adam_sgd(max_epochs):
  
  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.5,0.5
  m_w,m_b,v_w,v_b = 0,0,0,0   
  
  for i in range(max_epochs): 
    dw,db = 0,0
    eps = 1e-10     
    for x,y in zip(X,Y):       

        #compute the gradients
        dw += grad_w_sgd(w,b,x,y)
        db += grad_b_sgd(w,b,x,y)

    #compute intermediate values
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = beta2*v_w+(1-beta2)*dw**2
    v_b = beta2*v_b+(1-beta2)*db**2

    m_w_hat = m_w/(1-np.power(beta1,i+1))
    m_b_hat = m_b/(1-np.power(beta1,i+1))
    v_w_hat = v_w/(1-np.power(beta2,i+1))
    v_b_hat = v_b/(1-np.power(beta2,i+1))      

    #update parameters
    w = w - eta*m_w_hat/(np.sqrt(v_w_hat)+eps)
    b = b - eta*m_b_hat/(np.sqrt(v_b_hat)+eps)
       

Million Dollar Question: Which algorithm to use?

Adam seems to be more or less the default choice now \((\beta_1 = 0.9, \beta_2 = 0.999 \ \text{and} \ \epsilon = 1e^{-8} )\).

Although it is supposed to be robust to initial learning rates, we have observed that for sequence generation problems \(\eta = 0.001, 0.0001\) works best

Having said that, many papers report that SGD with momentum (Nesterov or classical) with a simple annealing learning rate schedule also works well in practice (typically, starting with \(\eta = 0.001, 0.0001 \)for sequence generation problems)

Adam might just be the best choice overall!!

Some recent work suggests that there is a problem with Adam and that it will not converge in some cases. It is also observed that models trained using the Adam optimizer usually don't generalize well.

Explanation for why we need bias correction in Adam

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Note that we are taking a running average of the gradients as \(m_t\)

The reason we are doing this is that we don’t want to rely too much on the current gradient and instead rely on the overall behaviour of the gradients over many timesteps

One way of looking at this is that we are interested in the expected value of the gradients and not on a single point estimate computed at time t

However, instead of computing \(E[\nabla w_t]\) we are computing \(m_t\) as the exponentially moving average

Ideally we would want \(E[m_t ]\) to be equal to \(E[\nabla w_t ]\)

Let us see if that is the case

Recall the momentum equations from module 5.4

u_t = \beta u_{t-1}+ \nabla w_t
u_t = \sum \limits_{\tau=0}^t \beta^{t-\tau} \nabla w_{\tau}

What we have now in Adam is a slight modification to the above equation (let's treat \(\beta_1\) as \(\beta\) )

m_t = \beta m_{t-1}+(1-\beta)\nabla w_t
\rbrace
m_t = (1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau} \nabla w_{\tau}
m_0 = 0
m_1 = \beta m_0+(1-\beta)\nabla w_1=(1-\beta)\nabla w_1
m_2 =\beta m_1 + (1-\beta) \nabla w_2
=\beta (1-\beta) \nabla w_1+ (1-\beta) \nabla w_2
= (1-\beta) (\beta\nabla w_1+ \nabla w_2)
m_3=\beta m_2 + (1-\beta) \nabla w_3
=\beta ((1-\beta) (\beta\nabla w_1+ \nabla w_2) )+(1-\beta) \nabla w_3
=(1-\beta) (\beta^2 \nabla w_1+\beta \nabla w_2)+(1-\beta) \nabla w_3
m_3 = (1-\beta)\sum \limits_{\tau=1}^3 \beta^{3-\tau} \nabla w_{\tau}


Let's take expectation on both sides

E[m_t] =E[ (1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau} \nabla w_{\tau}]
E[m_t] = (1-\beta)\sum \limits_{\tau=1}^t E[ \beta^{t-\tau} \nabla w_{\tau}]
E[m_t] = (1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau} E[ \nabla w_{\tau}]

Assumption: All \(\nabla w_\tau\) come from the same distribution, i.e.,

\(E[\nabla w_\tau] = E[\nabla w] \quad \forall \tau\)


E[m_t] = E[ \nabla w](1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau}
= E[ \nabla w](1-\beta)(\beta^{t-1}+\beta^{t-2}+\cdots+\beta^{0})
= E[ \nabla w](1-\beta)\frac{1-\beta^{t}}{1-\beta}
= E[ \nabla w](1-\beta^{t})

The last sum is a geometric progression (GP) with common ratio \(\beta\)

E[\frac{m_t}{1-\beta^{t}}] = E[ \nabla w]
E[\hat{m_t}] = E[ \nabla w] \quad \text{where,} \hat{m_t}=\frac{m_t}{1-\beta^{t}}

Hence we apply the bias correction: the expected value of \(\hat{m_t}\) is then the same as \(E[\nabla w_t]\)
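
A quick numerical illustration: if every gradient equals the same constant \(g\), then \(m_t = (1-\beta^t)g\), so dividing by \(1-\beta^t\) recovers \(g\) exactly (the constant-gradient assumption is only for illustration):

beta, g = 0.9, 2.0       # hypothetical constant gradient

m = 0.0
for t in range(1, 11):
  m = beta*m + (1 - beta)*g
  m_hat = m/(1 - beta**t)
  print(t, m, m_hat)     # m is biased towards 0 for small t; m_hat stays at 2.0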


AdaMax

Replace the \(L_2\) norm in Adam's \(v_t\) by the \(L_\infty\) (max) norm

v_t = max(\beta_2v_{t-1},|\nabla w|)
w_{t+1}=w_t-\eta \frac{ \hat{m_t}}{v_t+\epsilon}

Do the same modification for RMSprop and call it MaxaProp

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Update Rule for AdaMax

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = max(\beta_2v_{t-1},|\nabla w_t|)
w_{t+1}=w_t-\eta \frac{ \hat{m_t}}{v_t+\epsilon}
def do_adamax_sgd(max_epochs):

  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.9,0.99
  m_w,m_b,v_w,v_b = 0,0,0,0
  m_w_hat,m_b_hat,v_w_hat,v_b_hat = 0,0,0,0
    
  
  for i in range(max_epochs): 
    dw,db = 0,0
    eps = 1e-10 
    
    for x,y in zip(X,Y):       

        #compute the gradients
        dw += grad_w_sgd(w,b,x,y)
        db += grad_b_sgd(w,b,x,y)
        

    #compute intermediate values
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = np.max([beta2*v_w,np.abs(dw)])
    v_b = np.max([beta2*v_b,np.abs(db)])

    m_w_hat = m_w/(1-np.power(beta1,i+1))
    m_b_hat = m_b/(1-np.power(beta1,i+1))    

    #update parameters
    w = w - eta*m_w_hat/(v_w+eps)
    b = b - eta*m_b_hat/(v_b+eps)
    

Intuition

NAdam (Nesterov Adam) : Paper

We know that NAG is superior to Momentum.

We just need to modify \(m_t\) as it depends on the gradient.

 Why not just incorporate it with ADAM?

Update Rule for Adam

m_t = \beta_1 m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Recall the equations for Momentum

u_t=\beta u_{t-1}+ \nabla w_{t}
w_{t+1}=w_t - \eta u_t

Recall the equations for NAG

u_t = \beta u_{t-1}+ \nabla (w_t - \beta u_{t-1})
w_{t+1} = w_t - \eta u_t

We skip the multiplication of factor \(1-\beta\) with \(\nabla w_t\) for brevity


So can we write \(m_t\) as follows?

m_t = \beta m_{t-1}+(1-\beta)\nabla (w_t-\beta m_{t-1})

Would it be helpful if we untangle \(\nabla(w_t-\beta_1 m_{t-1})\)by rewriting NAG ?

Yes. But at the cost of intuition

Try computing bias correction like we did earlier..

g_t = \nabla (w_t - \beta m_{t-1})
m_t = \beta m_{t-1}+ g_t
w_{t+1} = w_t - \eta m_t

NAG

g_t = \nabla w_t
m_t = \beta m_{t-1}+g_t
w_{t+1} = w_t - \eta ( \beta m_t+g_t)

Rewritten NAG

Observe that the momentum vector \(m_{t-1}\) is used twice (while computing the gradient and while computing the current momentum vector). As a consequence, every step requires two updates of the weights (the partial look-ahead update and the actual update), which is costly for big networks!


Now, look ahead is computed only at the weight update equation with the current momentum vector \(m_t\)

Now, we just need to plug this into Adam. Instead of \(m_t\) we use the bias-corrected version, \(\hat{m_t}\).

w_{t+1} = w_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\dfrac{\beta m_{t-1}}{1 - \beta^t} + \dfrac{(1 - \beta) g_t}{1 - \beta^t})

Update Rule for Adam

m_t = \beta_1 m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Let's extend the weight update equation (treating \(\beta_1\) as \(\beta\)) and let \(\nabla w_t=g_t\)

w_{t+1} = w_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta \hat{m_{t-1} }+ \dfrac{(1 - \beta) g_t}{1 - \beta^t})

We know that \( \hat{m}_{t-1} =\frac{m_{t-1}}{1-\beta^{t-1}}\). Relaxing the requirement of \(\beta^{t-1}\) in its denominator is what gives the equation above.

Now, we replace \(\hat{m}_{t-1}\) by \(\hat{m_t}\) (as we did while rewriting NAG). This gives the update rule for NAdam


w_{t+1} = w_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta \hat{m_{t} }+ \dfrac{(1 - \beta) g_t}{1 - \beta^t})

We need not modify \(v_t\)

Update Rule for NAdam

m_t = \beta_1 m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1 ^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{v_t} =\frac{v_t}{1-\beta_2^t}
w_{t+1} = w_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta_1 \hat{m_{t} }+ \dfrac{(1 - \beta_1) g_t}{1 - \beta_1^t})
def do_nadam_sgd(max_epochs):

  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.9,0.99
  m_w,m_b,v_w,v_b = 0,0,0,0
  m_w_hat,m_b_hat,v_w_hat,v_b_hat = 0,0,0,0

  for i in range(max_epochs):
    dw,db = 0,0
    eps = 1e-10
    for x,y in zip(X,Y):
        #compute the gradients
        dw += grad_w_sgd(w,b,x,y)
        db += grad_b_sgd(w,b,x,y)

    #compute intermediate values
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = beta2*v_w+(1-beta2)*dw**2
    v_b = beta2*v_b+(1-beta2)*db**2

    m_w_hat = m_w/(1-beta1**(i+1))
    m_b_hat = m_b/(1-beta1**(i+1))
    v_w_hat = v_w/(1-beta2**(i+1))
    v_b_hat = v_b/(1-beta2**(i+1))

    #update parameters (NAdam: the look-ahead is applied inside the update step)
    w = w - (eta/np.sqrt(v_w_hat+eps))* \
        (beta1*m_w_hat+(1-beta1)*dw/(1-beta1**(i+1)))
    b = b - (eta/(np.sqrt(v_b_hat+eps)))* \
        (beta1*m_b_hat+(1-beta1)*db/(1-beta1**(i+1)))
    
Parameterized version of NAdam (with a time-varying momentum parameter \(\beta_t\))

m_{-1} = 0, \quad v_{-1} = 0
\beta_t = \beta_1(1-\frac{1}{2}0.96^{\frac{t}{250}}), \quad \beta_{t+1} = \beta_1(1-\frac{1}{2}0.96^{\frac{t+1}{250}})
g_t = \nabla w_t, \quad \hat{g_t} = \frac{g_t}{1-\prod_{i=1}^t\beta_i}
m_t = \beta_t m_{t-1}+(1-\beta_t)g_t
\hat{m_t} =\frac{m_t}{1-\prod_{i=1}^t\beta_i}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2, \quad \hat{v_t} =\frac{v_t}{1-\beta_2^t}
\bar{m_t} = (1-\beta_t)\hat{g_t}+\beta_{t+1}\hat{m_t}
w_{t+1}=w_t-\eta \frac{ \bar{m_t}}{\sqrt{\hat{v_t}}+\epsilon}

SGD with momentum outperformed Adam in certain problems like object detection and machine translation

This is attributed to the exponential averaging of squared gradients \((\nabla w)^2\) in \(v_t\) of Adam, which suppresses high-valued gradient information over iterations. So, AMSGrad uses a \(max()\) function to retain it.

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Update Rule for AMSGrad

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{v}_t = \text{max}(\hat{v}_{t-1}, v_t)
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}m_t

Note that there are no bias corrections in AMSGrad!

Well, no one knows what AMSGrad stands for!
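
A minimal sketch of AMSGrad in the same style as the earlier routines (it reuses X, Y, grad_w, grad_b and numpy as np from the previous snippets; the hyperparameters are the usual Adam defaults):

def do_amsgrad(max_epochs):
  w,b,eta = -4,-4,0.1
  beta1,beta2,eps = 0.9,0.999,1e-8
  m_w,m_b,v_w,v_b = 0,0,0,0
  v_w_hat,v_b_hat = 0,0

  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw += grad_w(w,b,x,y)
      db += grad_b(w,b,x,y)

    m_w = beta1*m_w + (1-beta1)*dw
    m_b = beta1*m_b + (1-beta1)*db
    v_w = beta2*v_w + (1-beta2)*dw**2
    v_b = beta2*v_b + (1-beta2)*db**2

    # keep the maximum of all v_t seen so far (note: no bias correction)
    v_w_hat = max(v_w_hat, v_w)
    v_b_hat = max(v_b_hat, v_b)

    w = w - eta*m_w/(np.sqrt(v_w_hat)+eps)
    b = b - eta*m_b/(np.sqrt(v_b_hat)+eps)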

Once again, which optimizer to use?

It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.

-Pedro Domingos


\(L_2\) regularization and weight decay are equivalent for SGD algorithms but not for algorithms that use adaptive gradient  like Adam

It is known that using \(L_2\) regularization and weight decay are equivalent for SGD algorithms. In other words, they are coupled as shown below

\mathscr{L}(w)=\frac{1}{2}\sum \limits_{i=1}^N(y_i-f(x_i))^2
\mathscr{L}^{reg}(w)=\frac{1}{2}\sum \limits_{i=1}^N(y_i-f(x_i))^2 +\frac{\lambda}{2} ||w||^2
w_{t+1} = w_t-\eta \nabla_{w_t} \mathscr{L}
\mathscr{L}^{reg}(w)= \mathscr{L}(w)+\frac{\lambda}{2} ||w||^2
w_{t+1} = w_t-\eta \nabla_{w_t} \mathscr{L}^{reg}
w_{t+1} = w_t-\eta \nabla_{w_t} \mathscr{L}-\eta \lambda w_t

The terms on the right can be rearranged and written as

w_{t+1} = (1-\eta \lambda) w_t-\eta \nabla_{w_t} \mathscr{L}

However, it is observed that it is not the case with the Adam optimizer! (In general, any adaptive gradient based optimizers)

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\hat{v_t} =\frac{v_t}{1-\beta_2^t}

Update Rule for AdamW

Decoupling weight Decay

w_{t+1} = (1-\eta \lambda) w_t-\eta \nabla_{w_t} \mathscr{L}
m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m_t} =\frac{m_t}{1-\beta_1^t}
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{v_t} =\frac{v_t}{1-\beta_2^t}
w_{t+1}=(1-\eta \lambda)w_t-\frac{ \eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}

Therefore, if you wish to implement weight decay, do it directly with the update equation. The \(L_2\) regularization and weight decay are no longer coupled for adaptive gradient-based optimizers.
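
A minimal sketch of the AdamW update on the same toy setup (X, Y, grad_w, grad_b and numpy as np are reused from the earlier snippets; \(\lambda\) is a hypothetical weight-decay coefficient):

def do_adamw(max_epochs):
  w,b,eta,lam = -4,-4,0.1,0.01
  beta1,beta2,eps = 0.9,0.999,1e-8
  m_w,m_b,v_w,v_b = 0,0,0,0

  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw += grad_w(w,b,x,y)
      db += grad_b(w,b,x,y)

    m_w = beta1*m_w + (1-beta1)*dw
    m_b = beta1*m_b + (1-beta1)*db
    v_w = beta2*v_w + (1-beta2)*dw**2
    v_b = beta2*v_b + (1-beta2)*db**2

    m_w_hat = m_w/(1-beta1**(i+1))
    m_b_hat = m_b/(1-beta1**(i+1))
    v_w_hat = v_w/(1-beta2**(i+1))
    v_b_hat = v_b/(1-beta2**(i+1))

    # decoupled weight decay: shrink w directly instead of adding lambda*w to the gradient
    # (weight decay is typically not applied to the bias)
    w = (1 - eta*lam)*w - eta*m_w_hat/(np.sqrt(v_w_hat)+eps)
    b = b - eta*m_b_hat/(np.sqrt(v_b_hat)+eps)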

Now we can extend the same idea to NAdam and call it NAdamW; similarly, RMSPropW and AdaGradW.

A similar situation arises when we apply normalization techniques (which we will learn about in the next set of lectures) to a deep learning model with adaptive gradients. To rectify it, Adam is modified and called AdamP.

Now we need to digress a bit to a new set of adaptive learning rate schemes (as it is a pre-requisite for RAdam (Rectified Adam))

Are you excited?

Learning rate Schemes

Based on epochs
1. Step Decay
2. Exponential Decay
3. Cyclical
4. Cosine annealing
5. Warm-Restart

Based on validation
1. Line search
2. Log search

Based on Gradients
1. AdaGrad
2. RMSProp
3. AdaDelta
4. Adam
5. AdaMax
6. NAdam
7. AMSGrad
8. AdamW

Cyclical Learning Rate (CLR) : Paper

Suppose the loss surface looks as shown in the figure. The surface has a saddle point.

Suppose further that the parameters are initialized to \((w_0,b_0)\) (yellow point on the surface) and the learning rate \(\eta\) is decreased exponentially over iterations.

After some iterations, the parameters \((w,b)\) will reach near the saddle point.

Since the learning rate has decreased exponentially, the algorithm has no way of coming out of the saddle point (despite the possibility of coming out of it)

What if we allow the learning rate to increase after some iterations? At least then there is a chance to escape the saddle point.

Cyclical Learning Rate (CLR) : Paper

Rationale: Often, difficulty in minimizing the loss arises from saddle points rather than poor local minima.

Therefore, it is beneficial if the learning rate schemes support a way to increase the learning rate near the saddle points.

Adaptive learning rate schemes may help in this case. However, it comes with an additional computational cost.

A simple alternative is to vary the learning rate cyclically

CLR : Triangular

\eta_{max}=0.5
\eta_{min}=0.01
\mu=20
\eta_t=\eta_{min}+(\eta_{max}-\eta_{min})\cdot \max\Big(0,\ 1-\Big|\frac{t}{\mu}-2\Big\lfloor 1+\frac{t}{2\mu}\Big\rfloor+1\Big|\Big)

Where, \(\mu\) is called a step size

def cyclic_lr(iteration,max_lr,base_lr,step_size):  
  cycle = np.floor(1+iteration/(2*step_size))
  x = np.abs(iteration/step_size - 2*cycle + 1)
  lr = base_lr + (max_lr-base_lr)*np.maximum(0, (1-x))
  return lr

def do_gradient_descent_clr(max_epochs):
    w,b = -2,0
    w_trace,b_trace = [w],[b]
    for i in range(max_epochs):
        # grad_w, grad_b here return the full-batch gradients at the current (w, b)
        dw = grad_w(w,b)
        db = grad_b(w,b)
        w = w - cyclic_lr(i,max_lr=0.1,base_lr=0.001,step_size=30) * dw
        b = b - cyclic_lr(i,max_lr=0.1,base_lr=0.001,step_size=30) * db
        w_trace.append(w)
        b_trace.append(b)

    return w_trace,b_trace

Cosine Annealing (Warm Re-Start) 

Let's see how changing the learning rate cyclically also helps faster convergence with the following example

Cosine Annealing (Warm Re-Start) 

Reached minimum at 46th Iteration (of course, if we run it for a few more iterations, it crosses the minimum)

However, we could use techniques such as early stopping to roll back to the minimum

Cosine Annealing (Warm Re-Start) 

\eta_t = \eta_{min} + \frac{\eta_{max} - \eta_{min}}{2} \left(1 + \cos(\pi \frac{(t \%(T+1))}{T})\right)

\(\eta_{max}\): Maximum value for the learning rate

\(\eta_{min}\): Minimum value for the learning rate

\(t\): Current epoch

\(T\): Restart interval (can be adaptive)

Modified formula from the paper for batch gradient (original deals with SGD and restarts after a particular epoch \(T_i\))

\(\eta_t=\eta_{max}\),  for  \(t=0\)

\(\eta_t=\eta_{min}\),  for  \(t=T\)

\eta_t = \eta_{min} + \frac{\eta_{max} - \eta_{min}}{2} \left(1 + \cos(\pi \frac{(t \%(T+1))}{T})\right)
\eta_{max}=1
\eta_{min}=0.1
T=50

Note the abrupt changes after \(T\) (That's why it is called warm re-start)
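
A sketch of this schedule in code, in the same style as cyclic_lr above (numpy as np is assumed):

def cosine_annealing_lr(t,eta_max,eta_min,T):
  # warm re-start: t is taken modulo (T+1), so eta jumps back to eta_max after each restart
  t = t % (T+1)
  return eta_min + 0.5*(eta_max-eta_min)*(1+np.cos(np.pi*t/T))

for t in [0,25,50,51,75]:
  print(t, cosine_annealing_lr(t,eta_max=1.0,eta_min=0.1,T=50))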

Warm-start

There are other learning rate schedules, like warm-start, that are found to be helpful in achieving quicker convergence in architectures like transformers

Typically, we set the initial learning rate to a high value and then decay it.

On the contrary, starting with a low learning rate and increasing it for the first few steps helps the model warm up and converge better. This is called warm-start.

warmupSteps=4000
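
A minimal sketch of one such warm-up schedule (linear warm-up to a peak learning rate followed by an inverse square-root decay; the peak value and the decay rule here are assumptions for illustration, not the exact schedule used in any particular paper):

def warmup_lr(step, peak_lr=0.001, warmup_steps=4000):
  # linear warm-up from 0 to peak_lr, then inverse square-root decay
  if step < warmup_steps:
    return peak_lr*step/warmup_steps
  return peak_lr*(warmup_steps/step)**0.5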