Module 5.9 : Gradient Descent with Adaptive Learning Rate

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

[Figure: a single sigmoid neuron \(\sigma\) with inputs \(x^1, x^2, x^3, x^4\), a bias input \(1\), and output \(y\)]
y = f(\mathbf{x})=\frac{1}{1+e^{-(\mathbf{w}^T\mathbf{x}+b)}}
\mathbf{x}=\{x^1,x^2,x^3,x^4\}
\mathbf{w}=\{w^1,w^2,w^3,w^4\}

Given this network, it is easy to see that, for a single point (\(\mathbf{x},y\)),

\nabla w^1= (f(\mathbf{x})-y)*f(\mathbf{x})*(1-f(\mathbf{x}))*x^1
\nabla w^2= (f(\mathbf{x})-y)*f(\mathbf{x})*(1-f(\mathbf{x}))*x^2 \quad \ldots \text{and so on}

If there are \(n\) points, we can just sum the gradients over all the \(n\) points to get the total gradient
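The code sketches that follow assume helper functions grad_w and grad_b (and a toy dataset X, Y). Here is a minimal sketch, under the assumption of a single scalar input \(x\) and a squared error loss, of what these helpers might look like; the function f and the exact loss are assumptions, while grad_w and grad_b are the names used in the later snippets.

import numpy as np

def f(w,b,x):
  # sigmoid neuron: f(x) = 1/(1 + exp(-(w*x + b)))
  return 1.0/(1.0 + np.exp(-(w*x + b)))

def grad_w(w,b,x,y):
  # gradient of the squared error loss (1/2)(f(x)-y)^2 w.r.t. w
  fx = f(w,b,x)
  return (fx - y)*fx*(1 - fx)*x

def grad_b(w,b,x,y):
  # gradient w.r.t. b (the "input" multiplying b is always 1)
  fx = f(w,b,x)
  return (fx - y)*fx*(1 - fx)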

What happens if the feature \(x^2\) is very sparse? (i.e., if its value is 0 for most inputs)

\(\nabla w^2\) will be 0 for most inputs (see formula) and hence \(w^2\) will not get enough updates

If \(x^2\) happens to be sparse as well as important we would want to take the updates to \(w^2\) more seriously.

Can we have a different learning rate for each parameter which takes care of the frequency of features?

Intuition

Decay the learning rate for parameters in proportion to their update history (more updates means more decay)

Update Rule for AdaGrad

v_t=v_{t-1}+(\nabla w_t)^2
w_{t+1}=w_{t}-\frac{\eta}{\sqrt{v_t+\epsilon}}*\nabla w_{t}

... and a similar set of equations for \(b_t\)


To see this in action we first need to create some data where one of the features is sparse.

How would we do this in our toy network?

Well, our network has just two parameters, \(w\) and \(b\). Of these, the input/feature corresponding to \(b\) is always on (so we can't really make it sparse).

The only option is to make \(x\) sparse.

Solution: We created 500 random \((x, y)\) pairs and then, for roughly 80% of these pairs, we set \(x\) to \(0\), thereby making the feature for \(w\) sparse.
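One way this sparse toy dataset might be created (the exact distributions and the fixed seed are assumptions, not from the original):

import numpy as np

np.random.seed(0)                 # fixed seed (an assumption, for reproducibility)
n = 500
X = np.random.randn(n)            # 500 random x values (distribution assumed)
Y = np.random.rand(n)             # 500 random targets in [0, 1] (assumed)
mask = np.random.rand(n) < 0.8    # pick roughly 80% of the points...
X[mask] = 0                       # ...and set their x to 0, making the feature for w sparse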

def do_adagrad(max_epochs):

  #Initialization
  w,b,eta = -2,-2,0.1
  v_w,v_b,eps = 0,0,1e-8
  for i in range(max_epochs):
    # zero grad
    dw,db = 0,0
    for x,y in zip(X,Y):

        #compute the gradients (accumulated over all points)
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)

    #accumulate the squared gradients
    v_w = v_w + dw**2
    v_b = v_b + db**2

    #update parameters
    w = w - eta*dw/(np.sqrt(v_w)+eps)
    b = b - eta*db/(np.sqrt(v_b)+eps)

Take some time to think about it


There is something interesting that these 3 algorithms are doing for this dataset. Can you spot it?

Initially, all three algorithms are moving mainly along the vertical \((b)\) axis and there is very little movement along the horizontal \((w)\) axis

Why? Because in our data, the feature corresponding to \(w\) is sparse and hence \(w\) undergoes very few updates ...on the other hand \(b\) is very dense and undergoes many updates

Such sparsity is very common in large neural networks containing \(1000s\) of input features and hence we need to address it

Let's see what AdaGrad does..

AdaGrad

Setup: learning rate \(\eta_0 = 0.1\) for all the algorithms, momentum \(\beta = 0.9\), 500 points with 80% of the \(x\) values set to zero.

AdaGrad slows down near the minimum due to the decaying learning rate.

v_t=v_{t-1}+(\nabla w_t)^2
v_0=(\nabla w_0)^2
v_1=(\nabla w_0)^2+(\nabla w_1)^2
v_2=(\nabla w_0)^2+(\nabla w_1)^2+(\nabla w_2)^2
v_t=(\nabla w_0)^2+(\nabla w_1)^2+(\nabla w_2)^2+ \cdots+ (\nabla w_t)^2

Recall that
\nabla w=(f(x)-y) * f(x)*(1-f(x))*x

Since \(x\) is sparse, the gradient is zero for most of the time steps, so \(v_t\) grows slowly.

Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays slowly.

Let's examine it a bit more closely.

v_t=v_{t-1}+(\nabla b_t)^2
v_0=(\nabla b_0)^2
v_1=(\nabla b_0)^2+(\nabla b_1)^2
v_2=(\nabla b_0)^2+(\nabla b_1)^2+(\nabla b_2)^2
v_t=(\nabla b_0)^2+(\nabla b_1)^2+(\nabla b_2)^2+ \cdots+ (\nabla b_t)^2

Recall that
\nabla b=(f(x)-y) * f(x)*(1-f(x))

Though \(x\) is sparse, the gradient of \(b\) will not be zero for most of the time steps (unless the neuron saturates, e.g., when \(x\) takes a very large value).

Therefore, \(v_t\) grows rapidly.

Hence, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays rapidly.

The effective learning rate

\eta_t = \frac{\eta_0}{\sqrt{v_t+\epsilon}}
v_t = v_{t-1}+ (\nabla w_t)^2

\(v_t^w\) grows gradually, so the effective learning rate decays gradually for the parameter \(w\). For example, at \(t=0\),
\frac{\eta_0}{\sqrt{v_0+\epsilon}} = \frac{0.1}{\sqrt{0.019}} = 0.72
\eta_t = \frac{\eta_0}{\sqrt{v_t+\epsilon}}
v_t = v_{t-1}+ (\nabla b_t)^2

\(v_t^b\) grows rapidly because of the accumulating gradients (for example, \((\nabla b_0)^2=(-9.19)^2=84.45\)), so the effective learning rate decays rapidly for the parameter \(b\). At \(t=0\),
\frac{\eta_0}{\sqrt{v_0+\epsilon}} = \frac{0.1}{\sqrt{84.45}} = 0.01

Observe that in AdaGrad, \(v_t^w\) and \(v_t^b\) never become zero, despite the fact that the gradients become zero after some iterations.

v_t^b= v_{t-1}^b+ (\nabla b_t)^2
v_t^w = v_{t-1}^w+ (\nabla w_t)^2

By using a parameter-specific learning rate, AdaGrad ensures that despite sparsity \(w\) gets a relatively higher effective learning rate and hence larger updates.

Further, it also ensures that if \(b\) undergoes a lot of updates, its effective learning rate decreases because of the growing denominator.

In practice, this does not work so well if we remove the square root from the denominator (something to ponder about).


What's the flipside? Over time, the effective learning rate for \(b\) will decay to such an extent that there will be no further updates to \(b\).

Can we avoid this?

Intuition

AdaGrad decays the learning rate very aggressively (as the denominator grows). As a result, after a while, the frequently-updated parameters will start receiving very small updates because of the decayed learning rate.

To avoid this, why not decay the denominator and prevent its rapid growth?

Update Rule for RMSProp

v_t = \beta v_{t-1}+(1-\beta)(\nabla w_t)^2
w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t+\epsilon}}\nabla w_t

... and a similar set of equations for \(b_t\)

AdaGrad

v_t=v_{t-1}+(\nabla b_t)^2
v_0=(\nabla b_0)^2
v_1=(\nabla b_0)^2+(\nabla b_1)^2
v_2=(\nabla b_0)^2+(\nabla b_1)^2+(\nabla b_2)^2
v_t=(\nabla b_0)^2+(\nabla b_1)^2+(\nabla b_2)^2+ \cdots+ (\nabla b_t)^2

Recall that \(\nabla b=(f(x)-y) * f(x)*(1-f(x))\). Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays rapidly for \(b\).

RMSProp (with \(\beta = 0.9\))

v_t= \beta v_{t-1}+(1-\beta)(\nabla b_t)^2, \quad \beta\in[0,1)
v_0= 0.1(\nabla b_0)^2
v_1=0.09 (\nabla b_0)^2+0.1 (\nabla b_1)^2
v_2 \approx 0.08 (\nabla b_0)^2+0.09 (\nabla b_1)^2+0.1 (\nabla b_2)^2
v_t= (1-\beta)\sum \limits_{\tau=0}^t \beta^{t-\tau} (\nabla b_\tau)^2

Therefore, \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\) decays slowly (compared to AdaGrad) for \(b\).
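To make the contrast concrete, here is a small sketch (with an assumed constant, dense gradient for \(b\); the specific values are made up) of how the two denominators behave and what that does to the effective learning rate:

import numpy as np

eta, eps, beta = 0.1, 1e-8, 0.9
db = 1.0                                   # assumed constant dense gradient for b
v_ada, v_rms = 0.0, 0.0
for t in range(1, 101):
    v_ada = v_ada + db**2                  # AdaGrad: history keeps growing
    v_rms = beta*v_rms + (1-beta)*db**2    # RMSProp: history saturates near db**2
    if t in (1, 10, 100):
        # AdaGrad's effective learning rate keeps shrinking, RMSProp's stabilises
        print(t, eta/np.sqrt(v_ada + eps), eta/np.sqrt(v_rms + eps))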

def do_rmsprop(max_epochs):
  #Initialization
  w,b,eta = -4,4,0.1
  beta = 0.5
  v_w,v_b,eps = 0,0,1e-4

  for i in range(max_epochs):
    # zero grad
    dw,db = 0,0

    for x,y in zip(X,Y):

        #compute the gradients (accumulated over all points)
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)

    #exponentially weighted history of squared gradients
    v_w = beta*v_w + (1-beta)*dw**2
    v_b = beta*v_b + (1-beta)*db**2

    #update parameters
    w = w - eta*dw/(np.sqrt(v_w)+eps)
    b = b - eta*db/(np.sqrt(v_b)+eps)
 

However, why are there oscillations?

Does it imply that after some iterations, there is a chance that the learning rate remains constant so that the algorithm possibly gets into an infinite oscillation around the minima?

RMSProp converged  more quickly than AdaGrad by being less aggressive on decay

\eta = 0.1, \beta=0.5

Recall that in AdaGrad, \(v_t = v_{t-1}+(\nabla w_t)^2\) never decreases, despite the fact that the gradients become zero after some iterations. Is that the case for RMSProp?

Observe that the gradients \(dw\) and \(db\) are oscillating after some iterations.

RMSProp: \(v_t = \beta v_{t-1}+(1-\beta)(\nabla w_t)^2\)

AdaGrad: \(v_t = v_{t-1}+(\nabla w_t)^2\)

In the case of AdaGrad, the learning rate monotonically decreases due to the ever-growing denominator!

In the case of RMSProp, the learning rate may increase, decrease, or remain constant due to the moving average of gradients in the denominator.

The figure below shows the gradients across 500 iterations. Can you see the reason for the oscillation around the minimum?

What is the solution?

If the learning rate is (effectively) constant, there is a chance that the algorithm oscillates around the local minimum.

\eta=0.05, \epsilon=0.0001
\eta=\frac{0.05}{\sqrt{10^{-4}}}

We have to set the initial learning rate appropriately; in this case, setting \(\eta=0.05\) solves the oscillation problem.

What happens if we initialize the \(\eta_0\) with different values?

\eta_0=0.6
\eta_t=\frac{0.6}{\sqrt{v_t+\epsilon}}
v_t=\beta v_{t-1}+(1-\beta)(\nabla w_t)^2
\eta_0=0.1
\eta_t=\frac{0.1}{\sqrt{v_t+\epsilon}}
v_t=\beta v_{t-1}+(1-\beta)(\nabla w_t)^2


Which value for \(\eta_0\) is a good one? Does this make any difference in terms of how quickly the effective learning rate adapts to the gradient of the surface?


Fix \(\eta_0\) to 0.6 in RMSProp (top right plot).

It satisfies our wishlist: decrease the learning rate at steep curvatures and increase the learning rate at gentle (or near-flat) curvatures.

Let's set the value for \(\eta_0\) to 0.1 (bottom right)

Which one is better?

\text{In steep regions, say } \sqrt{v_t} = 1.25:
\eta_t = \frac{0.6}{1.25} = 0.48
\eta_t = \frac{0.1}{1.25} = 0.08
\text{In flat regions, say } \sqrt{v_t} = 0.1:
\eta_t = \frac{0.6}{0.1} = 6
\eta_t = \frac{0.1}{0.1} = 1

\(\eta_t = 0.08\) is better in a steep region, whereas \(\eta_t=6\) is better in a gentle region. Therefore, we would like the numerator also to change with the gradient/slope.

Both RMSProp and AdaGrad are sensitive to the initial learning rate, the initial conditions of the parameters, and the corresponding gradients.

If the initial gradients are large, the learning rates will be low for the remainder of training (in AdaGrad).

Later, if a gentle curvature is encountered, there is no way to increase the learning rate (in AdaGrad).


AdaDelta

for \(t\) in range(1, N):
1. \rightarrow \nabla w_t \quad \text{(compute the gradient)}
2. \rightarrow v_t = \beta v_{t-1}+(1-\beta)(\nabla w_t)^2
3. \rightarrow \Delta w_t = -\frac{\sqrt{u_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \nabla w_t
4. \rightarrow w_{t+1}=w_t+\Delta w_t
5. \rightarrow u_t = \beta u_{t-1}+(1-\beta)(\Delta w_t)^2

Avoids setting initial learning rate \(\eta_0\).

Since we use \(\Delta w_t\) to update the weights, it is called Adaptive Delta (AdaDelta).

Let's see how it does.

Now the numerator in the effective learning rate is a function of past gradients (whereas it was a constant in RMSProp and AdaGrad).

Observe that the \(u_t\) that we compute at \(t\) will be used only in the next iteration.

Also, notice that we take only a small fraction \((1-\beta)\) of \((\Delta w_t)^2\).

The question is, what difference does \(u_t\) make in adapting the learning rate?

v_t= \beta v_{t-1}+(1-\beta)(\nabla w_t)^2, \quad \beta\in[0,1)
\text{with } \beta = 0.9:
v_0= 0.1(\nabla w_0)^2
v_1=0.09 (\nabla w_0)^2+0.1 (\nabla w_1)^2
v_2 \approx 0.08 (\nabla w_0)^2+0.09 (\nabla w_1)^2+0.1 (\nabla w_2)^2
v_3 \approx 0.07 (\nabla w_0)^2+0.08 (\nabla w_1)^2+0.09 (\nabla w_2)^2+0.1 (\nabla w_3)^2
v_t= (1-\beta)\sum \limits_{\tau=0}^t \beta^{t-\tau} (\nabla w_\tau)^2

In a steep region (large recent gradients), we want the learning rate to decrease.


In a gentle region (small recent gradients), we want the learning rate to increase.

Starting at a high-curvature region (say \(\epsilon=10^{-6}\), \(\beta=0.9\), \(v_{-1}=0\), \(u_{-1}=0\)):

t=0
v_0= 0.1(\nabla w_0)^2
\Delta w_0 = -\frac{\sqrt{\epsilon}}{\sqrt{v_0+\epsilon}} \nabla w_0
u_0= 0.1 (\Delta w_0)^2
w_{1}=w_0+\Delta w_0

\(\Delta w_0 \ll \nabla w_0\) (if we ignore \(\epsilon\) in the denominator, then \(\Delta w_0=-3.16\sqrt{\epsilon}\))

\(u_0 \ll v_0\), because of squaring the (already tiny) delta

A very small update (what we wish for), and a fraction of this history is stored for the next iteration.
t=1
v_1= 0.9 v_0+ 0.1 (\nabla w_1)^2 = 0.09 (\nabla w_0)^2+0.1 (\nabla w_1)^2
\Delta w_1 = -\frac{\sqrt{u_0+\epsilon}}{\sqrt{v_1+\epsilon}} \nabla w_1
u_1= 0.09 (\Delta w_0)^2+0.1 (\Delta w_1)^2
w_{2}=w_1+\Delta w_1

\(u_0 \ll v_1\), therefore \(\Delta w_1 \ll \nabla w_1\): again a small update.

t=2 (ignoring \(\epsilon\))
v_2= 0.9 v_1+ 0.1 (\nabla w_2)^2 \approx 0.08 (\nabla w_0)^2+0.09 (\nabla w_1)^2+0.1(\nabla w_2)^2
\Delta w_2 = -\frac{\sqrt{u_1}}{\sqrt{v_2}} \nabla w_2
u_2 \approx 0.08 (\Delta w_0)^2+0.09 (\Delta w_1)^2+0.1 (\Delta w_2)^2
w_{3}=w_2+\Delta w_2

If the gradient remains high, then the numerator grows more slowly than the denominator. Therefore, the rate of change of the learning rate is determined by the previous gradients!

As long as we stay in the high-curvature region, at each iteration both \(v_t\) and \(u_t\) increase. However, the magnitude of \(u_t\) stays well below the magnitude of \(v_t\), since \(u_t\) accumulates only a fraction of the squared updates \((\Delta w_t)^2\), which are themselves small.

The effective learning rate in AdaDelta is \(\dfrac{\sqrt{u_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}}\), giving the update
-\frac{\sqrt{u_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \nabla w_t
whereas in RMSProp it is \(\dfrac{\eta}{\sqrt{v_t+\epsilon}}\), giving the update
-\frac{\eta}{\sqrt{v_t+\epsilon}} \nabla w_t

Therefore, at time step \(t\), the numerator in AdaDelta uses the accumulated history of updates until the previous time step \((t-1)\). Even on a surface with a steep curvature, it won't aggressively reduce the learning rate the way RMSProp does.

In a low-curvature region

v_i= 0.034 (\nabla w_0)^2+0.038 (\nabla w_1)^2+ \cdots +0.08 (\nabla w_{i-1})^2+0.1(\nabla w_i)^2
\Delta w_i = -\frac{\sqrt{u_{i-1}+\epsilon}}{\sqrt{v_i+\epsilon}} \nabla w_i
u_i= 0.034 (\Delta w_0)^2+0.038 (\Delta w_1)^2+\cdots+0.08(\Delta w_{i-1})^2+0.1(\Delta w_i)^2
w_{i+1}=w_i+\Delta w_i

After some \(i\) iterations, \(v_t\) will start decreasing and the ratio of the numerator to the denominator starts increasing.

If the gradient remains low for the subsequent time steps, the learning rate grows accordingly.

Therefore, AdaDelta allows the numerator to increase or decrease based on the current and past gradients.

def do_adadelta(max_epochs):
  #Initialization
  w,b = -4,-4
  beta = 0.99
  v_w,v_b,eps = 0,0,1e-4
  u_w,u_b = 0,0

  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):

        #compute the gradients (accumulated over all points)
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)

    #exponentially weighted history of squared gradients
    v_w = beta*v_w + (1-beta)*dw**2
    v_b = beta*v_b + (1-beta)*db**2

    #adaptive delta (the minus sign is folded into the update below)
    delta_w = dw*np.sqrt(u_w+eps)/(np.sqrt(v_w+eps))
    delta_b = db*np.sqrt(u_b+eps)/(np.sqrt(v_b+eps))

    #exponentially weighted history of squared updates
    u_w = beta*u_w + (1-beta)*delta_w**2
    u_b = beta*u_b + (1-beta)*delta_b**2

    #update parameters
    w = w - delta_w
    b = b - delta_b

It starts off with  a (moderately) high curvature region and after 35 iterations it reaches the minimum

v_t = 0.9 v_{t-1}+0.1(\nabla w_t)^2
u_t = 0.9 u_{t-1}+0.1(\frac{\sqrt{u_{t-1}}}{\sqrt{v_t}}\nabla w_t)^2

Note the shape and magnitude of both \(v_t\) and \(u_t\).

The shape is alike but the magnitude differs

This implies that if \(v_t\) grows, \(u_t\) will also grow proportionally, and vice versa.

[Plots: the effective learning rate \(\dfrac{\sqrt{u_w+\epsilon}}{\sqrt{v_w+\epsilon}}\) for AdaDelta vs \(\dfrac{0.013}{\sqrt{v_t^w+\epsilon}}\) for RMSProp]

Let's initialize RMSProp with \(\eta_0=0.013\) (assume we chose this by chance) and start both RMSProp and AdaDelta at \(w_0=-4, b_0=-4\).

Which algorithm converges more quickly? Take a guess.

AdaDelta converged more quickly than RMSProp, as its learning rate is adapted more wisely.

RMSProp decays the learning rate more aggressively than AdaDelta, as we can see from the plot on the left.

Let's put all these together in one place

Adam (Adaptive Moments)

Intuition

Do everything that RMSProp and AdaDelta do to solve the decay problem of AdaGrad

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{m}_t =\frac{m_t}{1-\beta_1^t}
\hat{v}_t =\frac{v_t}{1-\beta_2^t}
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t

This incorporates classical momentum (the first moment \(m_t\)) and the \(L^2\) norm of the gradient history (the second moment \(v_t\)), i.e., it uses a cumulative history of the gradients.

Typically, \(\beta_1=0.9\), \(\beta_2=0.999\)

def do_adam_sgd(max_epochs):

  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.9,0.999
  m_w,m_b,v_w,v_b = 0,0,0,0
  eps = 1e-10

  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):

        #compute the gradients (accumulated over all points)
        dw += grad_w_sgd(w,b,x,y)
        db += grad_b_sgd(w,b,x,y)

    #compute intermediate values
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = beta2*v_w+(1-beta2)*dw**2
    v_b = beta2*v_b+(1-beta2)*db**2

    #bias correction
    m_w_hat = m_w/(1-np.power(beta1,i+1))
    m_b_hat = m_b/(1-np.power(beta1,i+1))
    v_w_hat = v_w/(1-np.power(beta2,i+1))
    v_b_hat = v_b/(1-np.power(beta2,i+1))

    #update parameters
    w = w - eta*m_w_hat/(np.sqrt(v_w_hat)+eps)
    b = b - eta*m_b_hat/(np.sqrt(v_b_hat)+eps)
       

Million Dollar Question: Which algorithm to use?

Adam seems to be more or less the default choice now \((\beta_1 = 0.9, \beta_2 = 0.999 \ \text{and} \ \epsilon = 10^{-8})\).

Although it is supposed to be robust to initial learning rates, we have observed that for sequence generation problems \(\eta = 0.001, 0.0001\) works best

Having said that, many papers report that SGD with momentum (Nesterov or classical) with a simple annealing learning rate schedule also works well in practice (typically starting with \(\eta = 0.001\) or \(0.0001\) for sequence generation problems).

Adam might just be the best choice overall!!

Some works suggest that there is a problem with Adam and that it will not converge in some cases. It has also been observed that models trained using the Adam optimizer often do not generalize as well.

Explanation for why we need bias correction in Adam

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{m}_t =\frac{m_t}{1-\beta_1^t}
\hat{v}_t =\frac{v_t}{1-\beta_2^t}
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t

Note that we are taking a running average of the gradients as \(m_t\)

The reason we are doing this is that we don’t want to rely too much on the current gradient and instead rely on the overall behaviour of the gradients over many timesteps

One way of looking at this is that we are interested in the expected value of the gradients and not on a single point estimate computed at time t

However, instead of computing \(E[\nabla w_t]\) we are computing \(m_t\) as the exponentially moving average

Ideally we would want \(E[m_t ]\) to be equal to \(E[\nabla w_t ]\)

Let us see if that is the case

Recall the momentum equations from module 5.4:

u_t = \beta u_{t-1}+ \nabla w_t \implies u_t = \sum \limits_{\tau=1}^t \beta^{t-\tau} \nabla w_{\tau}

What we have now in Adam is a slight modification to the above equation (let's treat \(\beta_1\) as \(\beta\)):

m_t = \beta m_{t-1}+(1-\beta)\nabla w_t \implies m_t = (1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau} \nabla w_{\tau}

m_0 = 0
m_1 = \beta m_0+(1-\beta)\nabla w_1=(1-\beta)\nabla w_1
m_2 =\beta m_1 + (1-\beta) \nabla w_2 =\beta (1-\beta) \nabla w_1+ (1-\beta) \nabla w_2 = (1-\beta) (\beta\nabla w_1+ \nabla w_2)
m_3=\beta m_2 + (1-\beta) \nabla w_3 =\beta \big((1-\beta) (\beta\nabla w_1+ \nabla w_2)\big)+(1-\beta) \nabla w_3 =(1-\beta) (\beta^2 \nabla w_1+\beta \nabla w_2)+(1-\beta) \nabla w_3
m_3 = (1-\beta)\sum \limits_{\tau=1}^3 \beta^{3-\tau} \nabla w_{\tau}


Let's take expectation on both sides

E[m_t] =E[ (1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau} \nabla w_{\tau}]
E[m_t] = (1-\beta)\sum \limits_{\tau=1}^t E[ \beta^{t-\tau} \nabla w_{\tau}]
E[m_t] = (1-\beta)\sum \limits_{\tau=1}^t \beta^{t-\tau} E[ \nabla w_{\tau}]

Assumption: All \(\nabla w_\tau\) come from the same distribution, i.e.,

\(E[\nabla w_\tau] = E[\nabla w] \quad \forall \tau\)


E[m_t] = E[ \nabla w](1-\beta) \sum \limits_{\tau=1}^t \beta^{t-\tau}
E[m_t] = E[ \nabla w](1-\beta) (\beta^{t-1}+\beta^{t-2}+\cdots+\beta^{0})
E[m_t] = E[ \nabla w](1-\beta) \frac{1-\beta^{t}}{1-\beta}
E[m_t] = E[ \nabla w](1-\beta^{t})

The sum in the brackets is a geometric progression with common ratio \(\beta\), which equals \(\frac{1-\beta^t}{1-\beta}\).

E[\frac{m_t}{1-\beta^{t}}] = E[ \nabla w]
E[\hat{m}_t] = E[ \nabla w] \quad \text{where } \hat{m}_t=\frac{m_t}{1-\beta^{t}}

Hence we apply the bias correction: the expected value of \(\hat{m}_t\) is then the same as \(E[\nabla w_t]\).

The zero-bias problem of exponential averaging: an illustrative example

Assume that we have only noisy observations of a true function. We estimate the true function using an exponentially weighted average, first without bias correction.

As expected, it gives a poor approximation for the first few iterations (obvious from the plot: the estimate is biased towards zero).

Now let's see the bias-corrected version of it (exponentially weighted average with bias correction).
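A minimal sketch of this experiment (the particular "true" function, noise level, and seed are assumptions; only the averaging formulas come from the slides):

import numpy as np

np.random.seed(0)                               # assumption, for reproducibility
T, beta = 200, 0.9
true = 1 + np.sin(np.linspace(0, 3*np.pi, T))   # an assumed "true" function (starts at 1, not 0)
noisy = true + 0.1*np.random.randn(T)           # noisy observations of it

m = 0.0
ewma, ewma_corrected = [], []
for t, obs in enumerate(noisy, start=1):
    m = beta*m + (1 - beta)*obs
    ewma.append(m)                              # biased towards 0 for small t
    ewma_corrected.append(m / (1 - beta**t))    # bias-corrected estimate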

What if we don't do bias correction?

"..a lack of initialization bias correction would lead to initial steps that are much larger" [from the paper]

v_t=\beta_2 v_{t-1}+ (1-\beta_2)(\nabla w_t)^2, \quad \beta_2=0.999

Suppose \(\nabla w_0=0.1\) and the subsequent gradients are zero.

Without bias correction:
v_0=0.999 * 0+ 0.001(0.1)^2 = 0.00001
\eta_0=\frac{1}{\sqrt{0.00001}}=316.22
v_1=0.999 * v_0+ 0.001(0)^2 = 0.0000099
\eta_1=\frac{1}{\sqrt{0.0000099}}=316.38

With bias correction:
\hat{v}_0=\frac{v_0}{1-0.999} = \frac{0.00001}{0.001} = 0.01
\eta_0=\frac{1}{\sqrt{0.01}}=10
\hat{v}_1=\frac{v_1}{1-0.999^2} = \frac{0.0000099}{0.0019} = 0.0052
\eta_1=\frac{1}{\sqrt{0.0052}}=13.8

Therefore, doing a bias correction attenuates the initial learning rate to a great extent.
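The few lines below simply re-check the arithmetic above in code (nothing new is assumed beyond the slide's numbers):

import numpy as np

beta2 = 0.999
g0 = 0.1                                   # the first gradient, as in the slide
v0 = (1 - beta2)*g0**2                     # 0.00001
print(1/np.sqrt(v0))                       # ~316.22 : huge first step without bias correction
v0_hat = v0/(1 - beta2**1)                 # 0.01 after bias correction
print(1/np.sqrt(v0_hat))                   # 10.0 : a much smaller first step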

[Plot: Adam vs Adam-No-BC, with \(\eta_t=\frac{\eta_0}{\sqrt{\hat{v}_t}+\epsilon}\) (bias-corrected) vs \(\eta_t=\frac{\eta_0}{\sqrt{v_t}+\epsilon}\) (no bias correction), \(\beta_2=0.999\), \(\eta_0=0.1\)]

Is doing a bias correction a must?

No. For example, the Keras framework doesn't implement bias correction.

Let's revisit \(L^p\) norm

L^p=(|x_1|^p+|x_2|^p+\cdots+|x_n|^p)^{\frac{1}{p}}

In order to visualize it, let's fix \(L^p=1\) and vary \(p\)

1=(|x_1|^p+|x_2|^p)^{\frac{1}{p}}
1^p=|x_1|^p+|x_2|^p
1=|x_1|^p+|x_2|^p

We can choose any value for \(p \ge 1\)

However, you might notice that as \(p \rightarrow \infty\), the \(L^p\) norm can simply be replaced with \(\max(|x_1|,|x_2|,\cdots,|x_n|)\).

\(|x_i|\), raised to a high value of \(p\), becomes too small to represent (when \(|x_i|<1\)). This leads to numerical instability in the calculations.

So, what is the point we are trying to make?

Therefore, we can replace \(\sqrt{v_t}\) by the \(\max()\) norm as follows:

v_t=\max(\beta_2^{t-1}|\nabla w_1|,\beta_2^{t-2}|\nabla w_2|,\cdots,|\nabla w_t|)
v_t=\max(\beta_2v_{t-1},|\nabla w_t|)
w_{t+1}=w_t-\frac{\eta_0}{v_t}\hat{m}_t

Observe that we didn't use the bias-corrected \(v_t\), as the max norm is not susceptible to the initial zero bias.

Recall the equation of \(v_t\) from Adam:

v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{v}_t =\frac{v_t}{1-\beta_2^t}
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t

We can view \(\sqrt{v_t}\) as an \(L^2\) norm of the gradient history. Why not replace it with the \(\max()\) (\(L^{\infty}\)) norm?

Let's see an illustrative example for this.

Assume that we have only noisy observations

Estimate the true function using

the max norm, \(\max(\beta v_{t-1},|\text{noisy } f(t)|)\)

Max norm is not susceptible to bias towards zero (i.e., zero initialization)

So, what is the point we are trying to make?

(v_t)^{\frac{1}{p}} = (\beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2)^{\frac{1}{p}}

We used \(L^2\) norm,  \(\sqrt{v_t}\),  in Adam. Why don't we just generalize it to \(L^p\) norm?

As discussed, for larger values of \(p\), the \(L^p\) norm becomes numerically unstable.

However, as \(p \rightarrow \infty\), it becomes stable and can be replaced with the \(\max()\) function.

v_t = \beta_2^p v_{t-1}+(1-\beta_2^p)|\nabla w_t|^p

Let's define

(v_t)^{\frac{1}{p}} = (\beta_2^p v_{t-1}+(1-\beta_2^p)|\nabla w_t|^p)^{\frac{1}{p}}
\lim\limits_{p \to \infty}(v_t)^{\frac{1}{p}} =\lim\limits_{p \to \infty}\Big((1-\beta_2^p) \sum \limits_{\tau=1}^t \beta_2^{(t-\tau)p} |\nabla w_\tau|^p\Big)^{\frac{1}{p}}
=\lim\limits_{p \to \infty}(1-\beta_2^p)^{\frac{1}{p}} \lim\limits_{p \to \infty} \Big(\sum \limits_{\tau=1}^t \big(\beta_2^{(t-\tau)} |\nabla w_\tau|\big)^p\Big)^{\frac{1}{p}}
=\lim\limits_{p \to \infty} \Big(\sum \limits_{\tau=1}^t \big(\beta_2^{(t-\tau)} |\nabla w_\tau|\big)^p\Big)^{\frac{1}{p}}
=\max(\beta_2^{t-1}|\nabla w_1|,\beta_2^{t-2}|\nabla w_2|,\cdots,|\nabla w_t|)
v_t=\max(\beta_2 v_{t-1},|\nabla w_t|), \quad \beta_2=0.999
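The limit above says that the explicit max over decayed past gradients equals the simple recursion. A quick numerical check of this equivalence (the gradient sequence below is an assumption chosen for illustration):

import numpy as np

beta2 = 0.999
grads = [0.9, 0.0, 0.2, 0.0, 0.1]          # an assumed gradient sequence
t = len(grads) - 1
# explicit form: max over tau of beta2^(t - tau) * |grad_tau|
v_explicit = max(beta2**(t - tau) * abs(g) for tau, g in enumerate(grads))
# recursive form: v_t = max(beta2 * v_{t-1}, |grad_t|)
v_recursive = 0.0
for g in grads:
    v_recursive = max(beta2 * v_recursive, abs(g))
print(v_explicit, v_recursive)             # both print the same value (~0.8964)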

Suppose that we initialize \(w_0\) such that the gradient at \(w_0\) is high.

Suppose further that the gradients for the subsequent iterations are zero (because \(x\) is sparse).

Ideally, we don't want the learning rate to change (especially, increase) its value when \(\nabla w_t=0\) (because \(x=0\))

v_t=\max(\beta_2 v_{t-1},|\nabla w_t|), \quad \beta_2=0.999
v_0=\max(0,|\nabla w_0|)=1 \implies \eta_0=\frac{1}{1}=1
v_1=\max(0.999*1,0)=0.999 \implies \eta_1=\frac{1}{0.999}=1.001
v_2=\max(0.999*0.999,|\nabla w_2|) \approx \max(0.999,1)=1 \implies \eta_2=\frac{1}{1}=1

(Here, 50% of the inputs are zero.)

The problem with exponential averaging (even with bias correction) is that it increases the learning rate despite  \(\nabla w_t=0\)

v_t=\beta_2 v_{t-1}+ (1-\beta_2)(\nabla w_t)^2, \quad \beta_2=0.999
v_0=0.999 * 0+ 0.001(\nabla w_0)^2 = 0.001, \quad \hat{v}_0=\frac{0.001}{1-0.999}=1 \implies \eta_0=\frac{1}{\sqrt{1}}=1
v_1=0.999*(0.001)+ 0.001(0)^2 = 0.000999, \quad \hat{v}_1=\frac{0.000999}{1-0.999^2}=0.499 \implies \eta_1=\frac{1}{\sqrt{0.499}}=1.41
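A small sketch of this comparison for a sparse gradient profile (the alternating profile below is an assumption; \(\eta_0 = 1\)):

import numpy as np

beta2 = 0.999
grads = [1.0, 0.0, 1.0, 0.0]               # assumed sparse gradient profile (x is 0 half the time)
v_ewma, v_max = 0.0, 0.0
for t, g in enumerate(grads, start=1):
    v_ewma = beta2*v_ewma + (1 - beta2)*g**2
    v_hat = v_ewma/(1 - beta2**t)          # bias-corrected exponential average
    v_max = max(beta2*v_max, abs(g))
    # effective learning rates with eta_0 = 1: the exponential average keeps
    # increasing the rate on zero-gradient steps, the max norm barely moves
    print(t, 1/np.sqrt(v_hat), 1/v_max)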

Let's see the behaviour of exponential averaging and the max norm for the following gradient profile.

For the first few iterations, \(t=(0,1,\cdots,10)\), the gradient is high and then it gradually decreases.

\frac{1}{\max(0,1.16)}=0.86 \qquad \frac{1}{v_0}=\frac{1}{1.34}=0.74

Exponential averaging in RMSProp keeps increasing the learning rate, whereas the max norm is quite conservative in increasing the learning rate.

In terms of convergence, both the algorithms converge to the minimum eventually as the loss surface is smooth and convex

In general, having low learning rate at steep surfaces and high learning rate at gentle surfaces is desired.

For a smooth convex loss surface, exponential averaging wins, as increasing the learning rate does no harm there.

Update Rule for RMSProp

v_t = \beta v_{t-1}+(1-\beta)(\nabla w_t)^2
w_{t+1}=w_t-\frac{ \eta}{\sqrt{v_t}+\epsilon}\nabla w_t

Update Rule for MaxProp

v_t = \max(\beta v_{t-1},|\nabla w_t|)
w_{t+1}=w_t-\frac{ \eta}{v_t+\epsilon}\nabla w_t

We can extend the same idea to Adam and call it AdaMax 

In fact, using max norm in place of \(L^2\) norm was proposed in the same paper where Adam was proposed.

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{m}_t =\frac{m_t}{1-\beta_1^t}
\hat{v}_t =\frac{v_t}{1-\beta_2^t}
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t

Update Rule for AdaMax

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
\hat{m}_t =\frac{m_t}{1-\beta_1^t}
v_t = \max(\beta_2v_{t-1},|\nabla w_t|)
w_{t+1}=w_t-\frac{ \eta}{v_t+\epsilon}\hat{m}_t

Note that bias correction for \(v_t\) is not required for Max function, as it is not susceptible to initial bias towards zero.

50% of the elements in the input \(x\) are set to zero randomly. The weights are updated using SGD with \(\beta_2=0.999,\ \eta_0=0.025\).

Observe that the weights (\(w\)) get updated even when \(x=0\ (\implies \nabla w_t=0)\); that is because of momentum, not because of the current gradient.

At the 10th iteration, both Adam and AdaMax have moved in the same direction. Let's look at the gradients \(dw\):

Adam: \hat{v}_0 =\frac{ 0.001*(-0.0314)^2}{0.001}=0.00098, \quad \eta_t = \frac{0.025}{\sqrt{0.00098}}=0.79

AdaMax: v_0=\max(0,|-0.0314|)=0.0314, \quad \eta_t = \frac{0.025}{0.0314}=0.79

AdaMax works well for SGD; for batch gradient descent, Adam wins the race.

def do_adamax_gd(max_epochs):

  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.9,0.99
  m_w,m_b,v_w,v_b = 0,0,0,0
  eps = 1e-10

  for i in range(max_epochs):
    dw,db = 0,0

    for x,y in zip(X,Y):

        #compute the gradients (accumulated over the full data)
        dw += grad_w_sgd(w,b,x,y)
        db += grad_b_sgd(w,b,x,y)

    #compute intermediate values
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = np.max([beta2*v_w,np.abs(dw)])
    v_b = np.max([beta2*v_b,np.abs(db)])

    #bias correction only for the first moment (the max norm needs none)
    m_w_hat = m_w/(1-np.power(beta1,i+1))
    m_b_hat = m_b/(1-np.power(beta1,i+1))

    #update parameters
    w = w - eta*m_w_hat/(v_w+eps)
    b = b - eta*m_b_hat/(v_b+eps)
    
def do_adamax_sgd(max_epochs):

  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.9,0.99
  m_w,m_b,v_w,v_b = 0,0,0,0
  eps = 1e-10
  t = 0

  for i in range(max_epochs):
    for x,y in zip(X,Y):
        t += 1

        #compute the gradients for the current point (stochastic)
        dw = grad_w_sgd(w,b,x,y)
        db = grad_b_sgd(w,b,x,y)

        #compute intermediate values
        m_w = beta1*m_w+(1-beta1)*dw
        m_b = beta1*m_b+(1-beta1)*db
        v_w = np.max([beta2*v_w,np.abs(dw)])
        v_b = np.max([beta2*v_b,np.abs(db)])

        m_w_hat = m_w/(1-np.power(beta1,t))
        m_b_hat = m_b/(1-np.power(beta1,t))

        #update parameters after every point
        w = w - eta*m_w_hat/(v_w+eps)
        b = b - eta*m_b_hat/(v_b+eps)
    

Intuition

NAdam (Nesterov Adam) : Paper

We know that NAG is better than Momentum based GD

We just need to modify \(m_t\) to get NAG

Why not just incorporate it into Adam?

Update Rule for Adam

m_t = \beta_1 m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{m}_t =\frac{m_t}{1-\beta_1^t}
\hat{v}_t =\frac{v_t}{1-\beta_2^t}
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t

Recall the equation for momentum:

u_t=\beta u_{t-1}+ \nabla w_{t}
w_{t+1}=w_t - \eta u_t

Recall the equation for NAG:

u_t = \beta u_{t-1}+ \nabla (w_t - \beta \eta u_{t-1})
w_{t+1} = w_t - \eta u_t

We skip the multiplication of factor \(1-\beta\) with \(\nabla w_t\) for brevity

NAG

g_t = \nabla (w_t - \eta \beta m_{t-1}) \quad \text{(look ahead)}
m_t = \beta m_{t-1}+ g_t
w_{t+1} = w_t - \eta m_t

Observe that the momentum vector \(m_{t-1}\) is used twice: once while computing the look-ahead gradient and once while computing the current momentum vector. As a consequence, there are two weight updates per step (which is costly for big networks)!

Worked example (with \(\eta = 1\), \(m_{-1} = 0\)):

g_0=\nabla w_0, \quad m_0=g_0, \quad w_1 = w_0 -\nabla(w_0-0)
g_1 = \nabla (w_1-\beta m_0), \quad m_1=\beta m_0 +g_1
w_2 = w_1 -m_1 = w_1-(\beta m_0 +\nabla(w_1-\beta m_0))


Is there a way to fix this?

Why don't we do this look-ahead in the previous time step? (Because \(\beta m_0\) is already available from the previous time step!)

w_1 - \beta m_0 = (w_0 -\nabla(w_0-0))-\beta m_0
Rewritten NAG

g_{t+1} = \nabla w_{t}
m_{t+1} = \beta m_{t}+ g_{t+1}
w_{t+1} = w_t - \eta ( \beta m_{t+1}+g_{t+1})

Now, the look-ahead is computed only at the momentum step \(m_{t+1}\); the gradient of the look-ahead point is computed in the next step.

Worked example (with \(\eta = 1\), \(m_{0} = 0\)):

g_1=\nabla w_0, \quad m_1 = \beta m_0+g_1 = g_1, \quad w_1 = w_0-\beta g_1-g_1
g_2=\nabla w_1, \quad m_2 = \beta m_1+g_2=\beta g_1+g_2, \quad w_2 = w_1- (\beta(\beta g_1+g_2)+g_2)


Update Rule for NAdam

m_{t+1} = \beta_1 m_{t}+(1-\beta_1)\nabla w_t
\hat{m}_{t+1} =\frac{m_{t+1}}{1-\beta_1 ^{t+1}}
v_{t+1} = \beta_2 v_{t}+(1-\beta_2)(\nabla w_t)^2
\hat{v}_{t+1} =\frac{v_{t+1}}{1-\beta_2^{t+1}}
w_{t+1} = w_{t} - \dfrac{\eta}{\sqrt{\hat{v}_{t+1}} + \epsilon} (\beta_1 \hat{m}_{t+1} + \dfrac{(1 - \beta_1) \nabla w_t}{1 - \beta_1^{t+1}})
def do_nadam(max_epochs):

  #Initialization
  w,b,eta = -4,-4,0.1
  beta1,beta2 = 0.9,0.99
  m_w,m_b,v_w,v_b = 0,0,0,0
  eps = 1e-10

  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):
        #compute the gradients (accumulated over all points)
        dw += grad_w_sgd(w,b,x,y)
        db += grad_b_sgd(w,b,x,y)

    #compute intermediate values
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = beta2*v_w+(1-beta2)*dw**2
    v_b = beta2*v_b+(1-beta2)*db**2

    m_w_hat = m_w/(1-beta1**(i+1))
    m_b_hat = m_b/(1-beta1**(i+1))
    v_w_hat = v_w/(1-beta2**(i+1))
    v_b_hat = v_b/(1-beta2**(i+1))

    #update parameters (Nesterov-style combination of momentum and current gradient)
    w = w - (eta/np.sqrt(v_w_hat+eps)) * \
        (beta1*m_w_hat+(1-beta1)*dw/(1-beta1**(i+1)))
    b = b - (eta/np.sqrt(v_b_hat+eps)) * \
        (beta1*m_b_hat+(1-beta1)*db/(1-beta1**(i+1)))
    

SGD with momentum outperformed Adam in certain problems like object detection and machine translation

This is attributed to the exponential averaging of squared gradients \((\nabla w)^2\) in \(v_t\) of Adam, which suppresses high-valued gradient information over iterations. So, AMSGrad uses the \(\max()\) function to retain it.

Update Rule for Adam

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{m}_t =\frac{m_t}{1-\beta_1^t}
\hat{v}_t =\frac{v_t}{1-\beta_2^t}
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t

Update Rule for AMSGrad

m_t = \beta_1m_{t-1}+(1-\beta_1)\nabla w_t
v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2
\hat{v}_t = \max(\hat{v}_{t-1}, v_t)
w_{t+1}=w_t-\frac{ \eta}{\sqrt{\hat{v}_t}+\epsilon}m_t

Note that there are no bias corrections in AMSGrad!
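Since the earlier algorithms all come with a code sketch, here is a minimal sketch of AMSGrad in the same style; it assumes the same toy setup (X, Y, grad_w, grad_b, and numpy imported as np), and the hyperparameter values are assumptions.

def do_amsgrad(max_epochs):

  #Initialization (same toy setup as the earlier snippets; values are assumptions)
  w,b,eta = -4,-4,0.1
  beta1,beta2,eps = 0.9,0.999,1e-8
  m_w,m_b,v_w,v_b = 0,0,0,0
  v_w_hat,v_b_hat = 0,0

  for i in range(max_epochs):
    dw,db = 0,0
    for x,y in zip(X,Y):

        #compute the gradients (accumulated over all points)
        dw += grad_w(w,b,x,y)
        db += grad_b(w,b,x,y)

    #first and second moments (no bias correction in AMSGrad)
    m_w = beta1*m_w+(1-beta1)*dw
    m_b = beta1*m_b+(1-beta1)*db
    v_w = beta2*v_w+(1-beta2)*dw**2
    v_b = beta2*v_b+(1-beta2)*db**2

    #keep the running maximum of the second moment
    v_w_hat = max(v_w_hat,v_w)
    v_b_hat = max(v_b_hat,v_b)

    #update parameters
    w = w - eta*m_w/(np.sqrt(v_w_hat)+eps)
    b = b - eta*m_b/(np.sqrt(v_b_hat)+eps)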

Well, no one knows what AMSGrad stands for!

Once again, which optimizer to use?

It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.

-Pedro Domingos

Module 5.10 : A few more Learning Rate Schedulers

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Learning rate schemes

Based on epochs: 1. Step decay  2. Exponential decay  3. Cyclical  4. Cosine annealing  5. Warm restart

Based on validation: 1. Line search  2. Log search

Based on gradients: 1. AdaGrad  2. RMSProp  3. AdaDelta  4. Adam  5. AdaMax  6. NAdam  7. AMSGrad  8. AdamW

Cyclical Learning Rate (CLR) : Paper

Suppose the loss surface looks as shown in the figure. The surface has a saddle point.

Suppose further that the parameters are initialized to \((w_0,b_0)\) (yellow point on the surface) and the learning rate \(\eta\) is decreased exponentially over iterations.

After some iterations, the parameters \((w,b)\) will reach near the saddle point.

Since the learning rate has decreased exponentially, the algorithm has no way of coming out of the saddle point (even though escaping it is possible).

What if we allow the learning rate to increase after some iterations? At least then there is a chance to escape the saddle point.

Cyclical Learning Rate (CLR) : Paper

Rationale: Often, difficulty in minimizing the loss arises from saddle points rather than poor local minima.

Therefore, it is beneficial if the learning rate schemes support a way to increase the learning rate near the saddle points.

Adaptive learning rate schemes may help in this case. However, they come with an additional computational cost.

A simple alternative is to vary the learning rate cyclically

CLR : Triangular

\eta_{max}=0.5
\eta_{min}=0.01
\mu=20
\eta_t=\eta_{min}+(\eta_{max}-\eta_{min})\cdot \max\left(0,\ 1-\left|\frac{t}{\mu}-2 \left\lfloor 1+\frac{t}{2\mu}\right\rfloor+1\right|\right)

where \(\mu\) is called the step size.

t=20
2*\lfloor 1+\frac{20}{40}\rfloor=2
|\frac{20}{20}-2+1|=0
\max(0,(1-0))=1
\eta_t = \eta_{min}+(\eta_{max}-\eta_{min})*1=\eta_{max}
def cyclic_lr(iteration,max_lr,base_lr,step_size):  
  cycle = np.floor(1+iteration/(2*step_size))
  x = np.abs(iteration/step_size - 2*cycle + 1)
  lr = base_lr + (max_lr-base_lr)*np.maximum(0, (1-x))
  return lr

def do_gradient_descent_clr(max_epochs):
    w,b = -2,0.0001        
    for i in range(max_epochs):
        dw,db = 0,0               
        dw = grad_w(w,b)
        db = grad_b(w,b)        
        w = w - cyclic_lr(i,max_lr=0.1,base_lr=0.001,step_size=30) * dw
        b = b - cyclic_lr(i,max_lr=0.1,base_lr=0.001,step_size=30) * db        
        
 

Cosine Annealing (Warm Re-Start) 

Let's see how changing the learning rate cyclically also helps faster convergence with the following example


Reached minimum at 46th Iteration (of course, if we run it for a few more iterations, it crosses the minimum)

However, we could use techniques such as early stopping to roll back to the minimum

Cosine Annealing (Warm Re-Start) 

\eta_t = \eta_{min} + \frac{\eta_{max} - \eta_{min}}{2} \left(1 + \cos(\pi \frac{(t \%(T+1))}{T})\right)

\(\eta_{max}\): Maximum value for the learning rate

\(\eta_{min}\): Minimum value for the learning rate

\(t\): Current epoch

\(T\): Restart interval (can be adaptive)

Modified formula from the paper for batch gradient (original deals with SGD and restarts after a particular epoch \(T_i\))

\(\eta_t=\eta_{max}\),  for  \(t=0\)

\(\eta_t=\eta_{min}\),  for  \(t=T\)

\eta_{max}=1, \quad \eta_{min}=0.1, \quad T=50

Note the abrupt change after every \(T\) epochs (that's why it is called a warm re-start).
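A minimal sketch of the schedule above as a function (the function name is ours; the parameter values are the ones used for the plot):

import numpy as np

def cosine_annealing_lr(t, eta_max, eta_min, T):
    # eta_t = eta_min + (eta_max - eta_min)/2 * (1 + cos(pi * (t % (T+1)) / T))
    return eta_min + 0.5*(eta_max - eta_min)*(1 + np.cos(np.pi*(t % (T + 1))/T))

# values from the plot: eta_max = 1, eta_min = 0.1, restart interval T = 50
lrs = [cosine_annealing_lr(t, 1.0, 0.1, 50) for t in range(150)]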

Warm-start

There are other learning rate schedulers, like warm-start (warm-up), that are found to be helpful in achieving quicker convergence in architectures like transformers.

Typically, we set the initial learning rate to a high value and then decay it.

On the contrary, starting with a low learning rate and increasing it over the first few steps helps the model warm up and converge better. This is called warm-start.

warmupSteps = 4000
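As a concrete example, a sketch of the warm-up schedule from the Transformer paper ("Attention Is All You Need"), which uses warmupSteps = 4000 and d_model = 512; the function name is ours:

def transformer_warmup_lr(step, d_model=512, warmup_steps=4000):
    # linear warm-up for the first warmup_steps, then inverse-square-root decay
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

lrs = [transformer_warmup_lr(s) for s in range(1, 20001)]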