CS6910: Fundamentals of Deep Learning

Lecture 6

Gradient Descent, Momentum Based GD,

Nesterov Accelerated GD,

Mitesh M. Khapra

Department of Computer Science and Engineering, IIT Madras

Learning Objectives

Acknowledgment

For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on youtube)

For Module 3.5, I have borrowed ideas from this excellent *book which is available online

I am sure I would have been influenced and borrowed ideas from other sources and I apologize if I have failed to acknowledge them

Sigmoid Neuron

The story ahead

Enough about boolean functions!

What about arbitrary functions of the form \(y=f(x)\) where \(x \in \mathbb{R}^n\) (instead of \(\{0,1\}^n\)) and \(y \in \mathbb{R}\) (instead of \(\{0,1\}\))?

Can we have a network which can (approximately) represent such functions?

Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...

Recall

A perceptron will fire if the weighted sum of its inputs is greater than the threshold \((-w_0)\)

The thresholding logic used by a perceptron is very harsh !

For example, let us return to our problem of deciding whether we will like or dislike a movie

Consider that we base our decision only on one input \( x_1 = criticsRating \in (0,1)\)

If the threshold is 0.5 (\(-w_0=0.5\)) and \(w_1=1\), then what would be the decision for a movie with criticsRating = 0.51? (like)

What about a movie with criticsRating = 0.49 ? (dislike)

It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
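The harshness is easy to reproduce. The snippet below is a minimal sketch of the one-input perceptron from this example, using the values \(w_1 = 1\) and \(-w_0 = 0.5\) stated above; the function name `perceptron` is only an illustrative choice.

```python
def perceptron(x1, w1=1.0, w0=-0.5):
    """Fire (return 1) when w0 + w1*x1 >= 0, i.e. when w1*x1 reaches the threshold -w0."""
    return 1 if w0 + w1 * x1 >= 0 else 0

# Two movies whose ratings differ by only 0.02 get opposite decisions
for rating in (0.49, 0.51):
    decision = "like" if perceptron(rating) == 1 else "dislike"
    print(f"criticsRating = {rating:.2f} -> {decision}")
```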

[Figure: perceptron with a single input \(x_1\), weight \(w_1\), bias \(w_0\), and output \(y\)]

This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose

It is a characteristic of the perceptron function itself which behaves like a step function

There will always be this sudden change in the decision (from 0 to 1) when \(\sum \limits_{i=1}^{n}w_ix_i \) crosses the threshold \(-w_0\)  

[Figure: step function plotting \(y\) against \(z = \sum \limits_{i=1}^{n}w_ix_i\), jumping from 0 to 1 at \(z = -w_0\)]

For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1

Introducing sigmoid neurons where the output function is much smoother than the step function

Here is one form of the sigmoid function called the logistic function

\[ y = \frac{1}{1+\exp\big(-(w_0+\sum \limits_{i=1}^{n}w_ix_i)\big)} \]

We no longer see a sharp transition around the threshold

[Figure: logistic function plotting \(y\) against \(z = \sum \limits_{i=1}^{n}w_ix_i\), rising smoothly through the threshold \(-w_0\)]

Also the output \(y\) is no longer binary but a real value between 0 and 1 which can be interpreted as a probability.

Instead of a like/dislike decision we get the probability of liking the movie.
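As a small illustration, the sketch below evaluates the logistic neuron on the same two ratings used in the perceptron example, again assuming \(w_1 = 1\) and \(w_0 = -0.5\). The name `sigmoid_neuron` is an illustrative choice.

```python
import math

def sigmoid_neuron(x1, w1=1.0, w0=-0.5):
    """Logistic output 1 / (1 + exp(-(w0 + w1*x1))) for a single input."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x1)))

# Nearby ratings now give nearby probabilities instead of opposite decisions
for rating in (0.49, 0.51):
    print(f"criticsRating = {rating:.2f} -> P(like) = {sigmoid_neuron(rating):.4f}")
```

With these particular weights the two probabilities sit just below and just above 0.5, so the output changes gradually around the threshold.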

Perceptron

\[ y = \begin{cases} 1 & \text{if} \ \sum \limits_{i=0}^{n}w_ix_i \geq 0 \\ 0 & \text{if} \ \sum \limits_{i=0}^{n}w_ix_i < 0 \end{cases} \]

Not smooth, not continuous at \(-w_0\), not differentiable

Sigmoid (Logistic) Neuron

\[ y = \frac{1}{1+\exp\big(-(w_0+\sum \limits_{i=1}^{n}w_ix_i)\big)} \]

Smooth, continuous, differentiable

[Figure: both neurons take inputs \(x_0 = 1, x_1, x_2, \cdots, x_n\) with weights \(w_0 = -\theta, w_1, w_2, \cdots, w_n\) and produce output \(y_1\); the sigmoid neuron applies \(\sigma\). Each output is plotted against \(z = \sum \limits_{i=1}^{n}w_ix_i\) with the threshold marked at \(-w_0\)]

What next ?

Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron

Before we see such an algorithm we will revisit the concept of error

[Figure: sigmoid neuron \(\sigma\) with inputs \(x_0 = 1, x_1, x_2, \cdots, x_n\), weights \(w_0 = -\theta, w_1, w_2, \cdots, w_n\), and output \(y\)]

Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable

What does “cannot deal with” mean?

What would happen if we use a perceptron model to classify this data?

We would probably end up with lines like these ...

A Typical Supervised Machine Learning Setup

This brings us to a typical machine learning setup which has the following components...

Data: \(\{x_i,y_i\}_{i=1}^{n}\)

Model: Our approximation of the relation between \(x\) and \(y\).

For example, \( \hat{y} = \frac{1}{1+e^{-\mathbf{w^T}x}}\) or \(\hat{y} = \mathbf{w^T}x\) or \(\hat{y} = x^T \mathbf{W} x\)

Learning algorithm: An algorithm for learning the parameters \(\mathbf{w} \) of the model (for example, perceptron learning algorithm, gradient descent, etc.)

Objective/Loss/Error function: To guide the learning algorithm; the learning algorithm should aim to minimize the loss function

As an illustration, consider our movie example

Data: \(\{x_i = \text{movie},\ y_i = \text{like/dislike}\}_{i=1}^{n}\)

Model: Our approximation of the relation between \(x\) and \(y\) (the probability of liking a movie)

\[ \hat{y} = \frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}}} \]

Parameters: \( \mathbf{w} \)

Learning algorithm: gradient descent (we will see this soon)

Objective/Loss/Error function: One possibility is

\[ \mathcal{L}(\mathbf{w}) = \sum \limits_{i=1}^n (\hat{y}_i-y_i)^2 \]

The learning algorithm should aim to find a \( \mathbf{w} \) which minimizes the above function (squared error between \( y \) and \( \hat{y} \))
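To make the pieces of this setup concrete, here is a minimal sketch of the model and the squared error loss. The three "movies", their feature layout (a constant input 1 followed by a critics' rating), and both weight vectors are made-up values for illustration only.

```python
import math

def predict(w, x):
    """Sigmoid model: y_hat = 1 / (1 + exp(-w^T x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_loss(w, data):
    """Sum over all examples of (y_hat_i - y_i)^2."""
    return sum((predict(w, x) - y) ** 2 for x, y in data)

# Hypothetical data: x = [1 (constant input), criticsRating], y = 1 for like, 0 for dislike
data = [([1.0, 0.9], 1), ([1.0, 0.2], 0), ([1.0, 0.7], 1)]

loss_zero = squared_error_loss([0.0, 0.0], data)     # every prediction is 0.5
loss_better = squared_error_loss([-3.0, 6.0], data)  # a hand-picked alternative
print(f"loss with w = [0, 0]:  {loss_zero:.4f}")
print(f"loss with w = [-3, 6]: {loss_better:.4f}")
```

The loss gives the learning algorithm a single number to compare candidate parameter vectors by: the lower value marks the better fit to this data.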

Learning Parameters: (Infeasible) Guess Work

[Figure: sigmoid neuron \(\sigma\) with inputs \(x_0 = 1, x_1, x_2, \cdots, x_n\), weights \(w_0 = -\theta, w_1, w_2, \cdots, w_n\), and output \(y_1\)]

\[ f(x) = \frac{1}{1+\exp\big(-(\sum \limits_{i=1}^{n}w_ix_i+w_0)\big)} \]

Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function

\(\sigma\) stands for the sigmoid function (logistic function in this case)

For ease of explanation, we will consider a very simplified version of the model having just 1 input

Further, to be consistent with the literature, from now on we will refer to \(w_0\) as \(b\) (bias)

Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating (\(y\)) given imdbRating (\(x\)) (for no particular reason)

[Figure: sigmoid neuron \(\sigma\) with input \(x\), weight \(w\), bias \(b\) on a constant input 1, and output \(\hat{y}=f(x)\)]

\[ \hat{y} = f(x) = \frac{1}{1+e^{-(wx+b)}} \]

Input for Training

\(\{x_i,y_i\}_{i=1}^N \rightarrow N\) pairs of \((x,y)\)

Training Objective

Find \(w\) and \(b\) that minimize the loss:

\[ \underset{w,b \in \mathbb{R}}{\text{minimize}} \ \mathcal{L}(w,b)= \frac{1}{N}\sum \limits_{i=1}^N (y_i-f(x_i))^2 \]

What does it mean to train the network?

Suppose we train the network with \( (x, y) = (0.5, 0.2) \) and \((2.5, 0.9)\)

[Figure: the points \((x_1,y_1)\) and \((x_2,y_2)\) in the \(x\)-\(y\) plane, together with the sigmoid \(f(x) = \frac{1}{1+e^{-(wx+b)}}\)]

At the end of training we expect to find \(w^*\), \(b^*\) such that:

\( f(0.5) \rightarrow 0.2 \ \text{and} \ f(2.5) \rightarrow 0.9 \)

In other words

We hope to find a sigmoid function such that \((0.5, 0.2)\) and \((2.5, 0.9)\) lie on this sigmoid

Let us see this in more detail ...

Can we find \( w^*,b^* \) manually?

Random guess: \( w=3,\ b=-1\)

Clearly not good, but how bad is it?

Let us revisit \(\mathcal{L} (w, b) \) to see how bad it is ...

\[
\begin{aligned}
\mathcal{L}(w,b) &= \frac{1}{N} \sum \limits_{i=1}^N (y_i-f(x_i))^2 \\
&= \frac{1}{2} \big( (y_1-f(x_1))^2 + (y_2-f(x_2))^2 \big) \\
&= \frac{1}{2} \big( (0.2-f(0.5))^2 + (0.9-f(2.5))^2 \big) \\
&= \frac{1}{2} \big( (0.2-0.622)^2 + (0.9-0.998)^2 \big) \\
&\approx 0.09
\end{aligned}
\]

We want  \(\mathcal{L} (w, b) \) to be as close to zero as possible

Let us try some other values for \((w,b)\)
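The guess-and-check procedure can be sketched in a few lines. The first guess is the one from the text; the remaining \((w,b)\) pairs are arbitrary values chosen here for illustration.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    """Sigmoid neuron with one input."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    """Mean squared error over the training points."""
    return sum((y - f(x, w, b)) ** 2 for x, y in zip(X, Y)) / len(X)

# (3, -1) is the guess from the text; the other guesses are made up for illustration
guesses = [(3.0, -1.0), (0.5, 0.0), (1.0, -1.5), (1.8, -2.3)]
for w, b in guesses:
    print(f"w = {w:+.2f}, b = {b:+.2f}, loss = {loss(w, b):.4f}")
```

Each guess is judged only by its loss, so without some rule for choosing the next guess this remains trial and error.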

Let us look for something better than our “guess work” algorithm ...

Since we have only 2 points and 2 parameters \((w, b)\) we can easily plot \(\mathcal{L}(w, b)\) for different values of \((w, b)\) and pick the one where \(\mathcal{L}(w, b)\) is minimum

Random Search on Error Surface

But of course this becomes intractable once you have many more data points and many more parameters !!

Further, even here we have plotted the error surface only for a small range of \((w, b)\) [from (−6, +6) and not from \((-\infty, +\infty)\)]
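A brute-force search over the plotted range might look like the sketch below. The grid spacing of 0.1 is an assumption made here; it is not taken from the original plot.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return sum((y - f(x, w, b)) ** 2 for x, y in zip(X, Y)) / len(X)

# Evaluate the loss on a 121 x 121 grid covering [-6, 6] in both w and b
steps = 121
grid = [-6.0 + 12.0 * i / (steps - 1) for i in range(steps)]
best_loss, best_w, best_b = min((loss(w, b), w, b) for w in grid for b in grid)
print(f"best grid point: w = {best_w:.1f}, b = {best_b:.1f}, loss = {best_loss:.6f}")
```

Even this tiny problem needs 14,641 loss evaluations, and the count grows exponentially with the number of parameters, which is why the approach does not scale.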

Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface

Play with the Search Space

Learning Parameters: Gradient Descent

Goal

Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!

Vector of parameters, say, randomly initialized: \( \theta = [w,b]\)

Change in the values of \(w,b\): \( \Delta \theta = [\Delta w, \Delta b]\)

We move in the direction of \(\Delta \theta\)

Let us be a bit conservative: move only by a small amount \(\eta \)

\(\theta_{new}=\theta+\eta \cdot \Delta \theta \)

Question: What is the right \(\Delta \theta\) to use?

The answer comes from Taylor series

[Figure: the vectors \(\theta\), \(\Delta \theta\), the scaled step \(\eta \cdot \Delta \theta\), and the resulting \(\theta_{new}\)]

What is Taylor Series?

Try with different \( f(x) \) such as \(\exp(x), \sin(x), x^3, \cdots \) and see how closely the Taylor series approximates the function at the point \(x = X\) and in a small neighbourhood \(\epsilon\) around it
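The same experiment can be run without the interactive plot. The sketch below uses \(\exp(x)\), whose derivatives all equal \(\exp(x)\), so the Taylor polynomial is easy to write down; the expansion point \(X = 1\) and the offset \(\epsilon = 0.1\) are arbitrary choices.

```python
import math

def taylor_exp(x, x0, order):
    """Taylor polynomial of exp around x0; every derivative of exp at x0 is exp(x0)."""
    return sum(math.exp(x0) * (x - x0) ** k / math.factorial(k) for k in range(order + 1))

x0, eps = 1.0, 0.1  # expansion point and offset (arbitrary choices)
errors = []
for order in (1, 2, 3):
    approx = taylor_exp(x0 + eps, x0, order)
    errors.append(abs(math.exp(x0 + eps) - approx))
    print(f"order {order}: approximation = {approx:.6f}, error = {errors[-1]:.2e}")
```

Close to the expansion point even the first-order (linear) approximation is already accurate, which is the property gradient descent relies on.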

Approximating 2D function

For ease of notation, let \(\Delta \theta = u\); then from Taylor series, we have

\[
\begin{aligned}
\mathcal{L}(\theta+ \eta u) &= \mathcal{L}(\theta)+\eta \, u^T \nabla_{\theta}\mathcal{L}(\theta)+\frac{\eta^2}{2!} \, u^T\nabla_{\theta}^2\mathcal{L}(\theta)u+ \frac{\eta^3}{3!} \cdots +\frac{\eta^4}{4!} \cdots \\
&\approx \mathcal{L}(\theta)+\eta \, u^T \nabla_{\theta}\mathcal{L}(\theta)
\end{aligned}
\]

[\( \eta\) is typically small, so the terms in \( \eta^2,\eta^3, \cdots\) are negligible]
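One way to see how good the linear approximation is: compare it with the exact loss for shrinking \(\eta\). This is a sketch under assumptions: it uses the two training points and the half sum-of-squares loss of the toy problem, estimates the gradient by central differences, and picks an arbitrary starting point \(\theta\) and direction \(u\).

```python
import math

X, Y = [0.5, 2.5], [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(theta):
    w, b = theta
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def numerical_gradient(theta, h=1e-6):
    """Central-difference estimate of the gradient of the loss at theta."""
    grads = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += h
        minus[i] -= h
        grads.append((loss(plus) - loss(minus)) / (2 * h))
    return grads

theta = [-2.0, -2.0]  # arbitrary starting point
u = [1.0, 0.5]        # arbitrary direction
grad = numerical_gradient(theta)

gaps = []
for eta in (1.0, 0.1, 0.01):
    exact = loss([t + eta * ui for t, ui in zip(theta, u)])
    linear = loss(theta) + eta * sum(ui * gi for ui, gi in zip(u, grad))
    gaps.append(abs(exact - linear))
    print(f"eta = {eta}: exact = {exact:.6f}, linear = {linear:.6f}, gap = {gaps[-1]:.2e}")
```

The gap between the exact loss and its linear approximation shrinks rapidly as \(\eta\) decreases, which is what justifies dropping the higher-order terms.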

Note that the move \(\eta u\) would be favorable only if

\mathcal{L}(\theta+\eta u)-\mathcal{L}(\theta) < 0

[i.e., if the new loss is less than the previous loss]

This implies,

u^T \nabla_{\theta}\mathcal{L}(\theta) < 0

But, what is the range of \( u^T \nabla_{\theta}\mathcal{L}(\theta) \)?

Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathcal{L}(\theta)\), then we know that

\[ -1 \leq \cos(\beta) = \frac{ u^T \nabla_{\theta} \mathcal{L}(\theta)}{ \|u\| \, \|\nabla_{\theta} \mathcal{L}(\theta)\|} \leq 1 \]

Multiply throughout by \(k = \|u\| \, \|\nabla_{\theta} \mathcal{L}(\theta)\| \)

\[ -k \leq k\cos(\beta) = u^T \nabla_{\theta} \mathcal{L}(\theta) \leq k \]

Thus \( \mathcal{L}(\theta+\eta u) - \mathcal{L}(\theta) \approx \eta \, u^T \nabla_{\theta} \mathcal{L}(\theta) = \eta \, k\cos(\beta) \) is most negative when \(\cos(\beta)=-1\) (i.e., \(\beta = 180^\circ\))

[Figure: level set \(\mathcal{L}(\theta) = k\), where \(\theta \in \mathbb{R}^2\), showing \(\nabla_{\theta}\mathcal{L}(\theta)\) and \(\Delta \theta\)]
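The claim can be checked numerically by scanning unit directions \(u\) and recording which one makes \(u^T \nabla_{\theta}\mathcal{L}(\theta)\) smallest. The gradient vector below is a made-up example, not one computed from the toy problem.

```python
import math

grad = [3.0, -4.0]  # hypothetical gradient at the current theta
grad_norm = math.hypot(grad[0], grad[1])

# Try unit vectors at every whole degree and keep the one with the smallest dot product
best_angle, best_value = None, float("inf")
for angle in range(360):
    rad = math.radians(angle)
    value = math.cos(rad) * grad[0] + math.sin(rad) * grad[1]
    if value < best_value:
        best_angle, best_value = angle, value

best_u = (math.cos(math.radians(best_angle)), math.sin(math.radians(best_angle)))
steepest = (-grad[0] / grad_norm, -grad[1] / grad_norm)  # unit vector opposite to the gradient
print(f"best direction found: ({best_u[0]:.3f}, {best_u[1]:.3f})")
print(f"negative unit gradient: ({steepest[0]:.3f}, {steepest[1]:.3f})")
```

Up to the one-degree resolution of the scan, the best direction coincides with the negative of the normalized gradient.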

Adjust the slider \(p\) to see the linear approximation of \(f(x)\) at the point \(p\)

Notice the gradient (\(dp\)) value.

Change the value of \(p\) according to the gradient value. That is, take a step \(p\pm dp\) (enter the new value for \(p\) in the input box only)

After a few adjustments, did you reach the local minimum?

If not, repeat the game and make the necessary changes to land in a local minimum.

Find the Local Minima (Blind Folded)

Gradient Descent Rule

  • The direction \(u\) that we intend to move in should be at 180° w.r.t. the gradient

  • In other words, move in a direction opposite to the gradient

Parameter Update Rule

\( w_{t+1} = w_t - \eta \nabla w_t\)

\(b_{t+1} = b_t - \eta \nabla b_t \)

where \( \nabla w_t = \frac{\partial \mathcal{L}(w,b)}{\partial w}\) and \( \nabla b_t = \frac{\partial \mathcal{L}(w,b)}{\partial b}\), both evaluated at \(w=w_t,\ b=b_t \)

So we now have a more principled way of moving in the (\(w,b\)) plane than our “guess work” algorithm

Let us create an algorithm for this rule ...

Algorithm: gradient_descent()

\(t \leftarrow 0;\)

max_iterations \(\leftarrow 1000\);

initialize \(w_0, b_0\) randomly;

while \(t <\) max_iterations do

        \(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);

        \(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);

        \( t \leftarrow t+1;\)

end

To see this algorithm in practice let us first derive \( \nabla w \) and \( \nabla b\)  for our toy neural network

[Figure: toy network \(\sigma\) with input \(x\), weight \(w\), bias \(b\) on a constant input 1, and output \(\hat{y}=f(x) = \frac{1}{1+e^{-(wx+b)}}\)]

Let us assume there is only one point to fit

\[ \mathcal{L}(w,b)=\frac{1}{2}(f(x)-y)^2 \]

\[
\begin{aligned}
\nabla w = \frac{\partial \mathcal{L}(w,b)}{\partial w} &= \frac{\partial}{\partial w}\Big[\frac{1}{2}(f(x)-y)^2\Big] \\
&= \frac{1}{2} \Big[2(f(x)-y) \, \frac{\partial}{\partial w} (f(x)-y)\Big] \\
&= (f(x)-y) \, \frac{\partial}{\partial w} f(x) \\
&= (f(x)-y) \, \frac{\partial}{\partial w} \Big(\frac{1}{1+e^{-(wx+b)}}\Big) \\
&= (f(x)-y) \, f(x)(1-f(x)) \, x
\end{aligned}
\]

where the last step uses

\[
\begin{aligned}
\frac{\partial}{\partial w} \Big(\frac{1}{1+e^{-(wx+b)}}\Big) &= \frac{-1}{(1+e^{-(wx+b)})^2} \, \frac{\partial}{\partial w} (e^{-(wx+b)}) \\
&= \frac{-1}{(1+e^{-(wx+b)})^2} \, e^{-(wx+b)} \, \frac{\partial}{\partial w} (-(wx+b)) \\
&= \frac{-1}{1+e^{-(wx+b)}} \cdot \frac{e^{-(wx+b)}}{1+e^{-(wx+b)}} \cdot (-x) \\
&= \frac{1}{1+e^{-(wx+b)}} \cdot \frac{e^{-(wx+b)}}{1+e^{-(wx+b)}} \cdot x \\
&= f(x)(1-f(x)) \, x
\end{aligned}
\]
So if we have only one point \((x,y)\), we have

\[ \nabla w=(f(x)-y) \, f(x)(1-f(x)) \, x \]

For two points \((x_1,y_1)\) and \((x_2,y_2)\),

\[ \nabla w=\sum \limits_{i=1}^{2}(f(x_i)-y_i) \, f(x_i)(1-f(x_i)) \, x_i \]

\[ \nabla b=\sum \limits_{i=1}^{2}(f(x_i)-y_i) \, f(x_i)(1-f(x_i)) \]
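A quick sanity check on these formulas is to compare them with finite-difference estimates of the same derivatives. The sketch below assumes the half sum-of-squares loss over the two training points; the evaluation point \((w,b) = (-2,-2)\) and the step size `h` are arbitrary choices.

```python
import math

X, Y = [0.5, 2.5], [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b):
    """Analytic gradient w.r.t. w, summed over the training points."""
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in zip(X, Y))

def grad_b(w, b):
    """Analytic gradient w.r.t. b, summed over the training points."""
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in zip(X, Y))

w, b, h = -2.0, -2.0, 1e-6  # evaluation point and step size (arbitrary choices)
num_w = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
num_b = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
print(f"grad_w: analytic = {grad_w(w, b):.8f}, numerical = {num_w:.8f}")
print(f"grad_b: analytic = {grad_b(w, b):.8f}, numerical = {num_b:.8f}")
```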
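import numpy as np

# Training data: the two (x, y) points used above
X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):  # Sigmoid with input x, parameters w, b
  return 1 / (1 + np.exp(-(w * x + b)))

def error(w, b):  # Half the sum of squared errors over all points
  err = 0.0
  for x, y in zip(X, Y):
    fx = f(x, w, b)
    err += (fx - y) ** 2
  return 0.5 * err

def grad_b(x, w, b, y):  # Derivative of the loss w.r.t. b for one point
  fx = f(x, w, b)
  return (fx - y) * fx * (1 - fx)

def grad_w(x, w, b, y):  # Derivative of the loss w.r.t. w for one point
  fx = f(x, w, b)
  return (fx - y) * fx * (1 - fx) * x

def do_gradient_descent():
  w, b, eta, max_epochs = -2, -2, 1.0, 1000

  for i in range(max_epochs):
    # Accumulate the gradient over all training points
    dw, db = 0, 0
    for x, y in zip(X, Y):
      dw += grad_w(x, w, b, y)
      db += grad_b(x, w, b, y)

    # Move against the gradient
    w = w - eta * dw
    b = b - eta * db

  return w, b  # learned parameters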

Let us do_gradient_descent()
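To watch the loss fall during training, the loop can record the error at every epoch. The following is a self-contained sketch that mirrors the listing above, with the same starting point, learning rate and epoch count, but uses the standard `math` module instead of NumPy; the `history` list and the function name `gradient_descent` are additions made here for illustration.

```python
import math

X, Y = [0.5, 2.5], [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def gradient_descent(w=-2.0, b=-2.0, eta=1.0, max_epochs=1000):
    history = []  # loss before each update
    for _ in range(max_epochs):
        history.append(loss(w, b))
        dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in zip(X, Y))
        db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in zip(X, Y))
        w, b = w - eta * dw, b - eta * db
    history.append(loss(w, b))
    return w, b, history

w, b, history = gradient_descent()
print(f"initial loss = {history[0]:.4f}, final loss = {history[-1]:.6f}")
print(f"learned parameters: w = {w:.3f}, b = {b:.3f}")
```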

Later on in the course we will look at gradient descent in much more detail and discuss its variants

For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron

So where do we head from here ?

Representation Power of a Multilayer Network of Sigmoid Neurons

Representation power of a multilayer network of perceptrons

A multilayer network of perceptrons with a single hidden layer can be used to represent any boolean function precisely (no errors)

Representation power of a multilayer network of sigmoid neurons

A multilayer network of sigmoid neurons with a single hidden layer can be used to approximate any continuous function to any desired precision

In other words, there is a guarantee that for any continuous function \(f(x) : \mathbb{R}^n \rightarrow \mathbb{R}^m\), we can always find a neural network (with 1 hidden layer containing enough neurons) whose output \(g(x)\) satisfies \(|g(x) - f(x)| < \epsilon\) !!

Proof: We will see an illustrative proof of this... [Cybenko, 1989], [Hornik, 1991]

See this link* for an excellent illustration of this proof

 

The discussion in the next few slides is based on the ideas presented at the above link

We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)

We observe that such an arbitrary function can be approximated by several “tower” functions

The more such “tower” functions we use, the better the approximation

To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions

Sifting Property of Tower function (impulse or delta function)

Sifting Property: Any function can be approximated to an arbitrary accuracy by summing shifted and scaled versions of tower functions
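The effect of adding more towers can be sketched numerically. In the example below the target function \(\sin(x)\) on \([0, \pi]\), the choice of setting each tower's height to the function value at the tower's midpoint, and the tower counts are all assumptions made for illustration.

```python
import math

def tower(x, left, right, height):
    """A tower: constant height on [left, right), zero elsewhere."""
    return height if left <= x < right else 0.0

def tower_approximation(func, x, lo, hi, n_towers):
    """Sum of n_towers equal-width towers, each as tall as func at its midpoint."""
    width = (hi - lo) / n_towers
    total = 0.0
    for k in range(n_towers):
        left, right = lo + k * width, lo + (k + 1) * width
        total += tower(x, left, right, func((left + right) / 2))
    return total

def max_error(func, lo, hi, n_towers, samples=1000):
    """Largest absolute error over evenly spaced sample points inside [lo, hi]."""
    xs = [lo + (hi - lo) * (i + 0.5) / samples for i in range(samples)]
    return max(abs(func(x) - tower_approximation(func, x, lo, hi, n_towers)) for x in xs)

errors = {n: max_error(math.sin, 0.0, math.pi, n) for n in (5, 20, 80)}
for n, err in errors.items():
    print(f"{n:3d} towers: max error = {err:.4f}")
```

The maximum error shrinks as the towers get narrower.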

Suppose there is a black box which takes the original input \((x)\) and constructs these tower functions

[Figure: the input \(x\) feeds several “Tower Maker” boxes whose outputs are summed (+)]

We can then have a simple network which can just add them up to approximate the function

Our job now is to figure out what is inside this black box

We will figure this out over the next few slides ...

If we take the logistic function and set \(w\) to a very high value we will recover the step function

Subtracting one such step function from another that is shifted slightly to the right gives a tower function

Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another?

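[Figure: network in which the input \(x\) feeds two sigmoid neurons \(h_{11}\) (parameters \(w_1,b_1\)) and \(h_{12}\) (parameters \(w_2,b_2\)), whose outputs are combined with weights \(+1\) and \(-1\) into \(h_{21}\)]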
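A minimal sketch of this construction follows. The weight \(w = 1000\) and the step positions 0.4 and 0.6 are illustrative assumptions; each bias is set to \(b = -w \cdot \text{step\_at}\) so that the sigmoid crosses 0.5 at the desired position.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def steep_sigmoid(x, step_at, w=1000.0):
    """Sigmoid neuron with a large weight w and bias b = -w * step_at: nearly a step at step_at."""
    return sigmoid(w * x - w * step_at)

def tower(x, left=0.4, right=0.6, w=1000.0):
    """Output neuron: +1 times the first steep sigmoid, -1 times the second."""
    return steep_sigmoid(x, left, w) - steep_sigmoid(x, right, w)

for x in (0.2, 0.5, 0.8):
    print(f"x = {x}: tower(x) = {tower(x):.6f}")
```

The output is close to 1 between the two step positions and close to 0 outside them, which is the tower shape needed above.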

What if we have more than one input?

Suppose we are trying to take a decision about whether we will find oil at a particular location on the ocean bed (Yes/No) 

Further, suppose we base our decision on two factors: Salinity \((x_1)\) and Pressure \((x_2)\)

We are given some data and it seems that \(y(oil|no-oil)\) is a complex function of \(x_1\) and \(x_2\)

We want a neural network to approximate this function

Creating 2D Tower Function

Lecture 5

By Arun Prakash

Sigmoid Neuron to Feedforward Neural Networks
