CS6910: Fundamentals of Deep Learning
Lecture 6
Gradient Descent, Momentum Based GD,
Nesterov Accelerated GD
Mitesh M. Khapra
Department of Computer Science and Engineering, IIT Madras
Learning Objectives
Acknowledgment
For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)
For Module 3.5, I have borrowed ideas from this excellent book, which is available online
I am sure I would have been influenced and borrowed ideas from other sources and I apologize if I have failed to acknowledge them
Sigmoid Neuron
The story ahead
Enough about boolean functions!
What about arbitrary functions of the form \(y=f(x)\) where \(x \in R^n\) (instead of \(\{0,1\}^n\)) and \(y \in R\) (instead of \(\{0,1\}\))?
Can we have a network which can (approximately) represent such functions?
Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
Recall
A perceptron will fire if the weighted sum of its inputs is greater than the threshold \((-w_0)\)
The thresholding logic used by a perceptron is very harsh !
For example, let us return to our problem of deciding whether we will like or dislike a movie
Consider that we base our decision only on one input \( x_1 = criticsRating \in (0,1)\)
If the threshold is 0.5 \((-w_0=0.5)\) and \((w_1=1)\)
then what would be the decision for a movie with criticsRating = 0.51 ? (like)
What about a movie with criticsRating = 0.49 ? (dislike)
It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
It is a characteristic of the perceptron function itself which behaves like a step function
There will always be this sudden change in the decision (from 0 to 1) when \(\sum \limits_{i=1}^{n}w_ix_i \) crosses the threshold \(-w_0\)
For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1
Introducing sigmoid neurons where the output function is much smoother than the step function
Here is one form of the sigmoid function, called the logistic function: \( y = \frac{1}{1+e^{-(w_0 + \sum_{i=1}^{n} w_i x_i)}} \)
We no longer see a sharp transition around the threshold
Also the output \(y\) is no longer binary but a real value between 0 and 1 which can be interpreted as a probability.
Instead of a like/dislike decision we get the probability of liking the movie.
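To make the contrast concrete, here is a minimal sketch that evaluates both decision functions on the criticsRating example, using \(w_1 = 1\) and \(w_0 = -0.5\) as above. The function names `perceptron` and `sigmoid_neuron` are illustrative choices, not names from the lecture.

```python
import math

def perceptron(x, w1=1.0, w0=-0.5):
    """Step function: outputs 1 only if w1*x exceeds the threshold -w0."""
    return 1 if w1 * x + w0 > 0 else 0

def sigmoid_neuron(x, w1=1.0, w0=-0.5):
    """Logistic function of the same weighted sum; the output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(w1 * x + w0)))

for rating in (0.49, 0.51):
    print(rating, perceptron(rating), round(sigmoid_neuron(rating), 4))
```

The perceptron flips from 0 to 1 between the two ratings, while the sigmoid outputs stay within about 0.005 of each other (roughly 0.4975 and 0.5025).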
Perceptron: not smooth, not continuous at \(-w_0\), not differentiable
Sigmoid (Logistic) Neuron: smooth, continuous, differentiable
What next ?
Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
Before we see such an algorithm we will revisit the concept of error
Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
What does “cannot deal with” mean?
What would happen if we use a perceptron model to classify these data ?
We would probably end up with lines like these ...
A Typical Supervised Machine Learning Setup
This brings us to a typical machine learning setup which has the following components...
Data: \(\{x_i,y_i\}_{i=1}^{n}\)
Model: Our approximation of the relation between \(x\) and \(y\).
For example, \( \hat{y} = \frac{1}{1+e^{-\mathbf{w^T}x}}\) or \(\hat{y} = \mathbf{w^T}x\) or \(\hat{y} = x^T \mathbf{W} x\)
Learning algorithm: An algorithm for learning the parameters \(\mathbf{w} \) of the model (for example, perceptron learning algorithm, gradient descent, etc.)
Objective/Loss/Error function: To guide the learning algorithm
the learning algorithm should aim to minimize the loss function
As an illustration, consider our movie example
Data: \(\{x_i = \text{movie},\ y_i = \text{like/dislike}\}_{i=1}^{n}\)
Model: Our approximation of the relation between \(x\) and \(y\) (the probability of liking a movie)
Parameters: \( \mathbf{w} \)
Learning algorithm: gradient descent (We will see soon)
Objective/Loss/Error function: One possibility is \( \mathcal{L}(\mathbf{w}) = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \)
The learning algorithm should aim to find a \( \mathbf{w} \) which minimizes the above function (squared error between \( y \ \text{and} \ \hat{y} \))
Learning Parameters: (Infeasible) Guess Work
Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
\(\sigma\) stands for the sigmoid function (logistic function in this case)
For ease of explanation, we will consider a very simplified version of the model having just 1 input
Further to be consistent with the literature, from now on, we will refer to \(w_0\) as b (bias)
Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating(\(y\)) given imdbRating(\(x\)) (for no particular reason)
Input for Training
\(\{x_i,y_i\}_{i=1}^N \rightarrow N \ \text{pairs of} \ (x,y) \)
Training Objective
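Find \(w\) and \(b\) that minimize \( \mathcal{L}(w,b) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i) - y_i)^2 \), where \( f(x) = \frac{1}{1+e^{-(wx+b)}} \) (the factor \(\frac{1}{2}\) only simplifies the derivative)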
What does it mean to train the network?
Suppose we train the network with \( (x, y) = (0.5, 0.2) \) and \((2.5, 0.9)\)
At the end of training we expect to find \(w^*\), \(b^*\) such that:
\( f(0.5) \rightarrow 0.2 \ \text{and} \ f(2.5) \rightarrow 0.9 \)
In other words
We hope to find a sigmoid function such that \((0.5, 0.2)\) and \((2.5, 0.9)\) lie on this sigmoid
Let us see this in more detail ...
Can we find \( w^*,b^* \) manually?
Random Guess: \( w=3,b=-1\)
Clearly not good, but how bad is it ?
Let us revisit \(\mathcal{L} (w, b) \) to see how bad it is ...
We want \(\mathcal{L} (w, b) \) to be as close to zero as possible
Let us try some other values for \((w,b)\)
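The guess work can be sketched in a few lines. This assumes the loss \(\mathcal{L}(w,b)\) is half the sum of squared errors over the two training points, as in the `error` function of the code listing later in the lecture. Only the first guess \((3, -1)\) comes from the text; the other \((w, b)\) pairs are hypothetical guesses chosen for illustration.

```python
import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    """Sigmoid neuron with one input."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def loss(w, b):
    """Half the sum of squared errors over the two training points."""
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

# First guess is from the text; the rest are hypothetical.
guesses = [(3.0, -1.0), (1.0, -1.0), (2.0, -2.0), (1.8, -2.3)]
for w, b in guesses:
    print(f"w={w:5.2f}  b={b:5.2f}  loss={loss(w, b):.6f}")
```

Each new guess is kept only if it lowers the loss, which is exactly the manual procedure described above and the reason it does not scale.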
Let us look for something better than our "Guess Work Algorithm" ...
Since we have only 2 points and 2 parameters \((w, b)\) we can easily plot \(\mathcal{L}(w, b)\) for different values of \((w, b)\) and pick the one where \(\mathcal{L}(w, b)\) is minimum
Random Search on Error Surface
But of course this becomes intractable once you have many more data points and many more parameters !!
Further, even here we have plotted the error surface only for a small range of \((w, b)\) [from \(-6\) to \(+6\), not from \(-\infty\) to \(+\infty\)]
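A brute-force evaluation of the error surface over this range might look like the sketch below. The grid resolution (121 values per axis) and the helper name `loss_surface` are arbitrary choices made here, and the loss is again assumed to be half the sum of squared errors.

```python
import numpy as np

X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

def loss_surface(ws, bs):
    """Evaluate L(w, b) = 0.5 * sum_i (f(x_i) - y_i)^2 on a grid of (w, b)."""
    W, B = np.meshgrid(ws, bs, indexing="ij")   # each of shape (len(ws), len(bs))
    Z = W[..., None] * X + B[..., None]         # shape (len(ws), len(bs), 2)
    F = 1.0 / (1.0 + np.exp(-Z))                # sigmoid output for each point
    return 0.5 * np.sum((F - Y) ** 2, axis=-1)

ws = np.linspace(-6, 6, 121)
bs = np.linspace(-6, 6, 121)
L = loss_surface(ws, bs)
i, j = np.unravel_index(np.argmin(L), L.shape)
print("best grid point: w =", ws[i], " b =", bs[j], " loss =", L[i, j])
```

Even this small grid needs 121 × 121 loss evaluations for two parameters; the count grows exponentially with the number of parameters.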
Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface
Learning Parameters: Gradient Descent
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
\( \theta = [w,b]\): vector of parameters, say, randomly initialized
\( \Delta \theta = [\Delta w, \Delta b]\): change in the values of \(w, b\)
We move in the direction of \(\Delta \theta\)
Let us be a bit conservative: move only by a small amount \(\eta\)
\(\theta_{new}=\theta+\eta \cdot \Delta \theta \)
Question: What is the right \(\Delta \theta\) to use ?
The answer comes from Taylor series
What is Taylor Series?
Try different functions \( f(x) \) such as \(\exp(x), \sin(x), x^3, \cdots\) and see how closely the Taylor series approximates each function at the point \(x = X\) and in a small neighbourhood \(\epsilon\) around it
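As a small numerical illustration, the sketch below uses the standard expansion \( f(x_0 + \epsilon) \approx f(x_0) + \epsilon f'(x_0) + \frac{\epsilon^2}{2!} f''(x_0) + \cdots \) for \(f = \exp\), where every derivative equals \(\exp(x_0)\). The function name `taylor_exp` and the chosen values of \(x_0\) and \(\epsilon\) are assumptions made for this example.

```python
import math

def taylor_exp(x0, eps, order):
    """Taylor approximation of exp at x0 + eps, using terms up to `order`.

    Every derivative of exp is exp itself, so the k-th term is
    exp(x0) * eps**k / k!.
    """
    return sum(math.exp(x0) * eps ** k / math.factorial(k)
               for k in range(order + 1))

x0 = 1.0
for eps in (0.5, 0.1, 0.01):
    exact = math.exp(x0 + eps)
    errors = [abs(exact - taylor_exp(x0, eps, n)) for n in (1, 2, 3)]
    print(f"eps={eps}: errors for orders 1, 2, 3 = {errors}")
```

The error shrinks both when more terms are kept and when \(\epsilon\) gets smaller, which is why a first-order (linear) approximation is adequate for small steps.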
Approximating 2D function
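For ease of notation, let \(\Delta \theta = u\); then from the Taylor series we have
\( \mathcal{L}(\theta + \eta u) = \mathcal{L}(\theta) + \eta \, u^T \nabla_{\theta}\mathcal{L}(\theta) + \frac{\eta^2}{2!} \, u^T \nabla^2_{\theta}\mathcal{L}(\theta) \, u + \frac{\eta^3}{3!} \cdots \)
\( \approx \mathcal{L}(\theta) + \eta \, u^T \nabla_{\theta}\mathcal{L}(\theta) \) [\( \eta\) is typically small, so \( \eta^2,\eta^3, \cdots \rightarrow 0\)]
Note that the move \(\eta u\) would be favorable only if
\( \mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) < 0 \) [i.e., if the new loss is less than the previous loss]
This implies
\( u^T \nabla_{\theta}\mathcal{L}(\theta) < 0 \)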
But what is the range of \( u^T \nabla_{\theta}\mathcal{L}(\theta) \)?
Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathcal{L}(\theta)\); then we know that
\(-1 \leq \cos(\beta) = \frac{ u^T \nabla_{\theta} \mathcal{L}(\theta)}{ ||u|| \cdot ||\nabla_{\theta} \mathcal{L}(\theta)||} \leq 1 \)
Multiply throughout by \(k = ||u|| \cdot ||\nabla_{\theta} \mathcal{L}(\theta)|| \):
\(-k \leq k \cos(\beta) = u^T \nabla_{\theta} \mathcal{L}(\theta) \leq k \)
Thus \( \mathcal{L}(\theta+\eta u) - \mathcal{L}(\theta) \approx \eta \, u^T \nabla_{\theta} \mathcal{L}(\theta) = \eta \, k \cos(\beta) \) is most negative when \(\cos(\beta)=-1\) (i.e., \(\beta = 180^\circ\))
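This claim can be checked numerically. The sketch below takes a small step of fixed length in many unit directions \(u\) and records which angle \(\beta\) to the gradient gives the largest decrease in loss. It assumes the two-point sigmoid loss used in this lecture, with the gradient expression that the `grad_w` and `grad_b` functions of the later code listing implement; the starting point \(\theta = (1, 1)\), the step size, and the 5° angular resolution are arbitrary choices.

```python
import numpy as np

X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

def loss(theta):
    w, b = theta
    f = 1.0 / (1.0 + np.exp(-(w * X + b)))
    return 0.5 * np.sum((f - Y) ** 2)

def grad(theta):
    w, b = theta
    f = 1.0 / (1.0 + np.exp(-(w * X + b)))
    g = (f - Y) * f * (1 - f)            # derivative w.r.t. the pre-activation
    return np.array([np.sum(g * X), np.sum(g)])

theta = np.array([1.0, 1.0])             # arbitrary starting point
eta = 1e-3                               # small step, so the linear term dominates
g_dir = grad(theta) / np.linalg.norm(grad(theta))
perp = np.array([-g_dir[1], g_dir[0]])   # unit vector orthogonal to the gradient

angles = np.deg2rad(np.arange(0, 360, 5))
changes = [loss(theta + eta * (np.cos(a) * g_dir + np.sin(a) * perp)) - loss(theta)
           for a in angles]
best = float(np.rad2deg(angles[int(np.argmin(changes))]))
print("angle with the largest decrease in loss:", best)
```

For a sufficiently small \(\eta\) the best angle is 180°, i.e. the direction opposite to the gradient.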
Find the Local Minimum (Blindfolded)
Adjust the slider \(p\) to see the linear approximation of \(f(x)\) at the point \(p\)
Notice the gradient (\(dp\)) value.
Change the value of \(p\) according to the gradient value. That is, take a step \(p\pm dp\) (enter the new value for \(p\) in the input box only)
After a few adjustments, did you reach the local minimum?
If not, repeat the game and make the necessary changes to land in a local minimum.
Gradient Descent Rule
- The direction \(u\) that we intend to move in should be at 180° w.r.t. the gradient
- In other words, move in a direction opposite to the gradient
Parameter Update Rule
\( w_{t+1} = w_t - \eta \nabla w_t\)
\(b_{t+1} = b_t - \eta \nabla b_t \)
where, \( \nabla w_t = \frac{\partial \mathcal{L}(w,b)}{\partial w}\), at \(w=w_t,b=b_t \),
and, \( \nabla b_t = \frac{\partial \mathcal{L}(w,b)}{\partial b}\), at \(w=w_t,b=b_t \),
So we now have a more principled way of moving in the (\(w,b\)) plane than our “guess work” algorithm
Let us create an algorithm for this rule ...
Algorithm: gradient_descent()
\(t \leftarrow 0;\)
max_iterations \(\leftarrow 1000\);
initialize \(w_0, b_0\) (e.g., randomly);
while \(t <\) max_iterations do
    \(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);
    \(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);
    \( t \leftarrow t+1;\)
end
To see this algorithm in practice let us first derive \( \nabla w \) and \( \nabla b\) for our toy neural network
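Let us assume there is only one point \((x,y)\) to fit, so that
\( \mathcal{L}(w,b) = \frac{1}{2} (f(x) - y)^2 \) with \( f(x) = \frac{1}{1+e^{-(wx+b)}} \)
\( \nabla w = \frac{\partial \mathcal{L}(w,b)}{\partial w} = (f(x) - y) \cdot \frac{\partial f(x)}{\partial w} = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \cdot x \)
So if we have only one point (\(x,y\)), we have
\( \nabla w = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \cdot x \)
\( \nabla b = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \)
For two points,
\( \nabla w = \sum_{i=1}^{2} (f(x_i) - y_i) \cdot f(x_i) \cdot (1 - f(x_i)) \cdot x_i \)
\( \nabla b = \sum_{i=1}^{2} (f(x_i) - y_i) \cdot f(x_i) \cdot (1 - f(x_i)) \)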
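A quick way to sanity-check analytic gradients like these is to compare them with central finite differences. The sketch below does this for a single point, using the same expressions that `grad_w` and `grad_b` implement in the lecture's code; the helper `numeric_grad`, the step `h`, and the test values are assumptions made for this example.

```python
import numpy as np

def f(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def loss(x, w, b, y):
    """Loss for a single point: 0.5 * (f(x) - y)^2."""
    return 0.5 * (f(x, w, b) - y) ** 2

def grad_w(x, w, b, y):
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(x, w, b, y):
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx)

def numeric_grad(x, w, b, y, h=1e-6):
    """Central-difference estimates of dL/dw and dL/db."""
    dw = (loss(x, w + h, b, y) - loss(x, w - h, b, y)) / (2 * h)
    db = (loss(x, w, b + h, y) - loss(x, w, b - h, y)) / (2 * h)
    return dw, db

x, y, w, b = 0.5, 0.2, -2.0, -2.0
print("analytic:", grad_w(x, w, b, y), grad_b(x, w, b, y))
print("numeric: ", numeric_grad(x, w, b, y))
```

The two pairs of numbers should agree to many decimal places; a mismatch usually points to an error in the derivation.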
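import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    # Sigmoid with input x, parameters w, b
    return 1 / (1 + np.exp(-(w * x + b)))

def error(w, b):
    # Half the sum of squared errors over all training points
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(x, w, b)
        err += (fx - y) ** 2
    return 0.5 * err

def grad_b(x, w, b, y):
    # Derivative of 0.5 * (f(x) - y)**2 with respect to b
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx)

def grad_w(x, w, b, y):
    # Derivative of 0.5 * (f(x) - y)**2 with respect to w
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx) * x

def do_gradient_descent():
    w, b, eta, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        dw, db = 0, 0
        # Accumulate the gradient over all points, then take one step
        for x, y in zip(X, Y):
            dw += grad_w(x, w, b, y)
            db += grad_b(x, w, b, y)
        w = w - eta * dw
        b = b - eta * db
    return w, b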
Let us do_gradient_descent()
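One way to watch the algorithm at work is to record the loss after every update. The sketch below is a self-contained variant of the lecture's code with the same data, initial point \((-2, -2)\), learning rate, and number of epochs; the function name `run_gradient_descent` and the loss history it returns are additions made here for illustration.

```python
import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def error(w, b):
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def run_gradient_descent(w=-2.0, b=-2.0, eta=1.0, max_epochs=1000):
    """Gradient descent on the two-point loss, recording the loss after each step."""
    losses = [error(w, b)]
    for _ in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            fx = f(x, w, b)
            common = (fx - y) * fx * (1 - fx)   # shared factor of both gradients
            dw += common * x
            db += common
        w, b = w - eta * dw, b - eta * db
        losses.append(error(w, b))
    return w, b, losses

w, b, losses = run_gradient_descent()
print("final w, b:", w, b)
print("initial loss:", losses[0], " final loss:", losses[-1])
```

Plotting `losses` against the epoch index gives a simple convergence curve for this toy problem.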
Later on in the course we will look at gradient descent in much more detail and discuss its variants
For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron
So where do we head from here ?
Representation Power of a Multilayer Network of Sigmoid Neurons
Representation power of a multilayer network of perceptrons
A multilayer network of perceptrons with a single hidden layer can be used to represent any boolean function precisely (no errors)
Representation power of a multilayer network of sigmoid neurons
A multilayer network of sigmoid neurons with a single hidden layer can be used to approximate any continuous function to any desired precision
In other words, there is a guarantee that for any continuous function \(f(x) : R^n \rightarrow R^m\), we can always find a neural network (with 1 hidden layer containing enough neurons) whose output \(g(x)\) satisfies \(|g(x) - f(x)| < \epsilon\) !!
Proof: We will see an illustrative proof of this... [Cybenko, 1989], [Hornik, 1991]
See this link* for an excellent illustration of this proof
The discussion in the next few slides is based on the ideas presented at the above link
We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)
We observe that such an arbitrary function can be approximated by several “tower” functions
The more such “tower” functions we use, the better the approximation
To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions
Sifting Property of Tower function (impulse or delta function)
Sifting Property: Any function can be approximated to an arbitrary accuracy by summing shifted and scaled versions of tower functions
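A minimal sketch of this idea, assuming ideal rectangular towers of equal width on the interval \([0, 1)\), each scaled by the value of the target function at the centre of its sub-interval. The target function \(\sin(2\pi x)\) and the helper name `tower_approximation` are arbitrary choices for illustration.

```python
import numpy as np

def tower_approximation(func, x, n_towers, lo=0.0, hi=1.0):
    """Approximate func on [lo, hi) by a sum of n_towers scaled, shifted towers.

    Each tower is 1 on its own sub-interval and 0 elsewhere; it is scaled by
    the value of func at the centre of that sub-interval.
    """
    edges = np.linspace(lo, hi, n_towers + 1)
    approx = np.zeros_like(x)
    for left, right in zip(edges[:-1], edges[1:]):
        tower = ((x >= left) & (x < right)).astype(float)   # unit-height tower
        approx += func((left + right) / 2) * tower          # scale and add
    return approx

def target(t):
    return np.sin(2 * np.pi * t)

x = np.linspace(0.0, 1.0, 1000, endpoint=False)
for n in (5, 20, 80):
    err = np.max(np.abs(tower_approximation(target, x, n) - target(x)))
    print(f"{n:3d} towers: max error = {err:.4f}")
```

The maximum error falls as the number of towers grows, in line with the statement that more towers give a better approximation.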
Suppose there is a black box which takes the original input \((x)\) and constructs these tower functions
We can then have a simple network which can just add them up to approximate the function
Our job now is to figure out what is inside this black box
[Figure: the input \(x\) feeds several “Tower Maker” black boxes]
We will figure this out over the next few slides ...
If we take the logistic function and set \(w\) to a very high value we will recover the step function
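The effect of increasing \(w\) can be seen numerically. In the sketch below the bias is fixed at \(b = 0\), so the transition sits at \(x = -b/w = 0\); the sample points and the values of \(w\) are arbitrary choices.

```python
import numpy as np

def sigmoid(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.array([-0.5, -0.1, -0.01, 0.01, 0.1, 0.5])
for w in (1, 10, 100, 1000):
    # As w grows, outputs left of 0 approach 0 and outputs right of 0 approach 1
    print(f"w={w:5d}:", np.round(sigmoid(x, w, 0.0), 3))
```

Changing \(b\) for a fixed large \(w\) shifts the position of the step to \(x = -b/w\).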
Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another ?
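One possible construction, sketched under the assumption of a single input, two steep sigmoid hidden neurons, and a linear output neuron with weights \(+1\) and \(-1\). The function name `tower`, the steepness \(w = 200\), and the interval \([0.4, 0.6]\) are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(x, left, right, w=200.0):
    """One input, two steep sigmoid hidden neurons, one linear output neuron.

    h1 steps up at x = left and h2 steps up at x = right, so h1 - h2 is
    close to 1 between them and close to 0 elsewhere.
    """
    h1 = sigmoid(w * (x - left))     # hidden neuron 1: weight w, bias -w * left
    h2 = sigmoid(w * (x - right))    # hidden neuron 2: weight w, bias -w * right
    return 1.0 * h1 + (-1.0) * h2    # output neuron with weights +1 and -1

x = np.array([0.0, 0.3, 0.5, 0.7, 1.0])
print(np.round(tower(x, left=0.4, right=0.6), 3))
```

The two biases control where the tower starts and ends, and a larger \(w\) makes its walls steeper.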
What if we have more than one input?
Suppose we are trying to take a decision about whether we will find oil at a particular location on the ocean bed (Yes/No)
Further, suppose we base our decision on two factors: Salinity \((x_1)\) and Pressure \((x_2)\)
We are given some data and it seems that \(y\) (oil / no oil) is a complex function of \(x_1\) and \(x_2\)
We want a neural network to approximate this function
Creating 2D Tower Function
Lecture 5
By Arun Prakash
Sigmoid Neuron to Feedforward Neural Networks