Lecture 3: Sigmoid Neurons, Gradient Descent,
Feedforward Neural Networks, Representation Power of Feedforward Neural Networks

IIT Madras

AI4Bharat

CS6910: Fundamentals of Deep Learning

Mitesh M. Khapra

Learning Objectives

At the end of this lecture, students will have a good understanding of the following topics:

 

  • Sigmoid Neurons

  • Gradient Descent

  • Feedforward Neural Networks

  • Representation Power of Feedforward Neural Networks.

Acknowledgment

For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)

For Module 3.5, I have borrowed ideas from this excellent book* which is available online

I am sure I have been influenced by, and have borrowed ideas from, other sources, and I apologize if I have failed to acknowledge them

Module 3.1: Sigmoid Neuron


 The story ahead

Enough about boolean functions!

What about arbitrary functions of the form \(y=f(x)\) where \(x \in R^n\) (instead of \(\{0,1\}^n\)) and \(y \in R\) (instead of \(\{0,1\}\))?

Can we have a network which can (approximately) represent such functions?

Before answering the above question we will first have to graduate from perceptrons to sigmoidal neurons ...















 

Recall

A perceptron will fire if the weighted sum of its inputs is greater than the threshold \((-w_0)\)

The thresholding logic used by a perceptron is very harsh !

For example, let us return to our problem of deciding whether we will like or dislike a movie

Consider that we base our decision only on one input \( x_1 = criticsRating \in (0,1)\)


If the threshold is 0.5 \((-w_0=0.5)\) and \(w_1=1\), then what would be the decision for a movie with criticsRating = 0.51? (like)

What about a movie with criticsRating = 0.49 ? (dislike)

It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
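The harshness of the threshold can be seen in a few lines of code. This is a minimal sketch, assuming the values from the example above (\(w_1 = 1\), \(w_0 = -0.5\)); the function name `perceptron` is only illustrative.

```python
def perceptron(x1, w1=1.0, w0=-0.5):
    """Fire (return 1) when w0 + w1*x1 >= 0, i.e. when w1*x1 reaches the threshold -w0."""
    return 1 if w0 + w1 * x1 >= 0 else 0

# Two almost identical ratings end up on opposite sides of the threshold.
for rating in (0.49, 0.51):
    decision = "like" if perceptron(rating) == 1 else "dislike"
    print(rating, decision)
```

A difference of 0.02 in the input flips the decision completely, which is the behaviour the next slides set out to soften.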

[Figure: perceptron with a single input \(x_1\) (criticsRating), weight \(w_1\), threshold \(-w_0 = 0.5\), and output \(y\)]

This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose


It is a characteristic of the perceptron function itself which behaves like a step function

There will always be this sudden change in the decision (from 0 to 1) when \(z = \sum \limits_{i=1}^{n}w_ix_i\) crosses the threshold \(-w_0\)

For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1


We now introduce sigmoid neurons, whose output function is much smoother than the step function

Here is one form of the sigmoid function called the logistic function

y = \frac{1}{1+exp\big(-(w_0+\sum \limits_{i=1}^{n}w_ix_i)\big)}

We no longer see a sharp transition around the threshold


Also the output \(y\) is no longer binary but a real value between 0 and 1 which can be interpreted as a probability.


Instead of a like/dislike decision we get the probability of liking the movie.
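For comparison with the perceptron, here is a small sketch of the logistic neuron on the same single-input movie example. It assumes the same illustrative values \(w_1 = 1\) and \(w_0 = -0.5\); the function name `sigmoid_neuron` is hypothetical.

```python
import math

def sigmoid_neuron(x1, w1=1.0, w0=-0.5):
    """Logistic output 1 / (1 + exp(-(w0 + w1*x1))), a value strictly between 0 and 1."""
    z = w0 + w1 * x1
    return 1.0 / (1.0 + math.exp(-z))

# Ratings close to the threshold now give outputs close to each other.
for rating in (0.49, 0.50, 0.51):
    print(rating, round(sigmoid_neuron(rating), 4))
```

With these weights the output changes only slightly between 0.49 and 0.51; a larger \(w_1\) would make the transition steeper.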

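Perceptron

y = 1 \ \ \text{if} \ \sum \limits_{i=0}^{n}w_ix_i \geq 0
= 0 \ \ \text{if} \ \sum \limits_{i=0}^{n}w_ix_i < 0

[Figure: perceptron with inputs \(x_0 = 1, x_1, x_2, \cdots, x_n\), weights \(w_0 = -\theta, w_1, w_2, \cdots, w_n\), and output \(y\); plot of \(y\) against \(z = \sum \limits_{i=1}^{n}w_ix_i\) with a step at \(-w_0\)]

Not smooth, not differentiable at \(z = -w_0\)

Sigmoid (Logistic) Neuron

y = \frac{1}{1+exp\big(-(w_0+\sum \limits_{i=1}^{n}w_ix_i)\big)}

[Figure: sigmoid neuron with inputs \(x_0 = 1, x_1, x_2, \cdots, x_n\), weights \(w_0 = -\theta, w_1, w_2, \cdots, w_n\), activation \(\sigma\), and output \(y\); plot of \(y\) against \(z = \sum \limits_{i=1}^{n}w_ix_i\) with a gradual transition around \(-w_0\)]

Smooth, continuous, differentiable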

 Module 3.2: A typical Supervised Machine  Learning Setup


Sigmoid (Logistic) Neuron

What next ?

Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron

Before we see such an algorithm we will revisit the concept of error


Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable

What does “cannot deal with” mean?

What would happen if we use a perceptron model to classify these data ?

We would probably end up with  lines like these ...

From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error

This brings us to a typical supervised machine learning setup which has the following components...

Data: \(\{x_i,y_i\}_{i=1}^{n}\)

Model: Our approximation of the relation between \(\mathbf{x}\) and \(y\).

For example, \( \hat{y} = \frac{1}{1+e^{-\mathbf{w^Tx}}}\) or \(\hat{y} = \mathbf{w^Tx}\) or \(\hat{y} =  \mathbf{x^TWx} \)

Parameters: In all the above cases, \(\mathbf{w}\) (or the matrix \(\mathbf{W}\) in the last case) is a parameter which needs to be learned from the data

Objective/Loss/Error function: To guide the learning algorithm; the learning algorithm should aim to minimize the loss function

Learning algorithm: An algorithm for learning the parameters \(\mathbf{w} \) of the model (for example, perceptron learning algorithm, gradient descent, etc.)
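The components listed above can be sketched in code. The three model forms are the ones named in the text; the helper names (`dot`, `logistic_model`, `linear_model`, `quadratic_model`, `squared_error`) and the tiny data set are assumptions made only for illustration.

```python
import math

def dot(u, v):
    """Inner product u^T v of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def logistic_model(w, x):
    """y_hat = 1 / (1 + exp(-w^T x))"""
    return 1.0 / (1.0 + math.exp(-dot(w, x)))

def linear_model(w, x):
    """y_hat = w^T x"""
    return dot(w, x)

def quadratic_model(W, x):
    """y_hat = x^T W x, where W is a list of rows."""
    return dot(x, [dot(row, x) for row in W])

def squared_error(model, params, data):
    """Objective: sum of squared differences between predictions and targets."""
    return sum((model(params, x) - y) ** 2 for x, y in data)

# Hypothetical data: two inputs with two features each, and their targets.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
print(squared_error(logistic_model, [2.0, -2.0], data))
print(squared_error(linear_model, [1.0, 0.0], data))
```

A learning algorithm would then search for the parameters that make `squared_error` as small as possible for the chosen model.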

As an illustration, consider our movie example

Data: \(\{x_i = \text{movie}, \ y_i = \text{like/dislike}\}_{i=1}^{n}\)

Model: Our approximation of the relation between x and y (the probability of liking a movie)

\hat{y} = \frac{1}{1+e^{-\mathbf{w^Tx}}}

Parameters: \( \mathbf{w} \)

Objective/Loss/Error function: One possibility is

\mathscr{L}(\mathbf{w})= \sum \limits_{i=1}^n (\hat{y}_i-y_i)^2

Learning algorithm: Gradient descent (we will see it soon)

The learning algorithm should aim to find a \( \mathbf{w} \) which minimizes the above function (squared error between \( y \ \text{and} \ \hat{y} \))

Module 3.3: Learning Parameters: (Infeasible) Guess Work


[Figure: sigmoid neuron with inputs \(x_0 = 1, x_1, x_2, \cdots, x_n\), weights \(w_0 = -\theta, w_1, w_2, \cdots, w_n\), activation \(\sigma\), and output \(f(x)\)]

f(x) = \frac{1}{1+exp\big(-(\sum \limits_{i=1}^{n}w_ix_i+w_0)\big)}

Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function

\(\sigma\) stands for the sigmoid function (logistic function in this case)

[Figure: simplified sigmoid neuron with a single input \(x\), weight \(w\), bias \(b\), and output \(\hat{y}=f(x) = \frac{1}{1+e^{-(wx+b)}}\)]

For ease of explanation, we will consider a very simplified version of the model having just 1 input

Further to be consistent with the literature, from now on, we will refer to \(w_0\) as \(b\) (bias)

Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating(\(y\)) given imdbRating(\(x\)) (for no particular reason)


Input for Training

\(\{x_i,y_i\}_{i=1}^N \rightarrow N \ \text{pairs of} \ (x,y) \)

Training Objective

Find \( w\) and \(b \) such that

\( \underset{w,b \in R}{\text{minimize}} \ \mathscr{L}(w,b)= \frac{1}{N}\sum \limits_{i=1}^N (y_i-f(x_i))^2 \)


What does it mean to train the network?

Suppose we train the network with \( (x, y) = (0.5, 0.2) \) and \((2.5, 0.9)\)

[Figure: the two training points \((x_1,y_1)\) and \((x_2,y_2)\) in the \(x\)-\(y\) plane together with the sigmoid \(f(x) = \frac{1}{1+e^{-(wx+b)}}\)]

At the end of training we expect to find \(w^*\), \(b^*\) such that:

\( f(0.5) \rightarrow 0.2 \ \text{and} \ f(2.5) \rightarrow 0.9 \)

In other words

We hope to find a sigmoid function such that \((0.5, 0.2)\) and \((2.5, 0.9)\) lie on this sigmoid


Let us see this in more detail ...

Can we find \( w^*,b^* \) manually?

Let us start with a Random Guess: \( w=3,b=-1\)

Clearly, it is not good, but how bad is it ?

Let us revisit \(\mathscr{L} (w, b) \) to see how bad it is ...

\mathscr{L}(w,b)= \frac{1}{N} \sum \limits_{i=1}^N (y_i-f(x_i))^2
= \frac{1}{2} * \big( (y_1-f(x_1))^2 + (y_2-f(x_2))^2 \big)
= \frac{1}{2} * \big( (0.9-f(2.5))^2 + (0.2-f(0.5))^2 \big)
= \frac{1}{2} * \big( (0.9-0.998)^2 + (0.2-0.622)^2 \big)
\approx 0.094
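The loss for any guess can also be evaluated directly in code. This sketch uses the two training points and the definitions of \(f\) and \(\mathscr{L}\) from above; the function names `f` and `loss` are just convenient labels.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    """Sigmoid neuron with one input."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    """Mean squared error (1/N) * sum_i (y_i - f(x_i))^2 over the training points."""
    return sum((y - f(x, w, b)) ** 2 for x, y in zip(X, Y)) / len(X)

# Loss for the random guess w = 3, b = -1
print(round(loss(3.0, -1.0), 4))
```

Calling `loss` with other pairs \((w, b)\) is all that the guess work procedure amounts to.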

We want  \(\mathscr{L} (w, b) \) to be as close to zero as possible

Let us try some other values for \((w,b)\)

w       b       Loss
0.50    0.00    0.0730
-0.10   0.00    0.14
0.94    -0.94   0.0214
1.42    -1.73   0.0028
1.65    -2.08   0.0003
1.78    -2.27   0.0000


A few Snapshots

Let us look for something better than our “guess work” algorithm ...

Since we have only 2 points and 2 parameters \((w, b)\) we can easily plot \(\mathscr{L}(w, b)\) for different values of \((w, b)\) and pick the one where \(\mathscr{L}(w, b)\) is minimum

Random Search on Error Surface

But of course this becomes intractable once you have many more data points and many more parameters !!

Further, even here we have plotted the error surface only for a small range of \((w, b)\) [\((-6, +6)\) rather than \((-\infty, +\infty)\)]
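A brute-force version of this idea evaluates the loss on a grid over the plotted range and keeps the best point. The sketch below assumes a grid spacing of 0.1 over \([-6, 6]\); the function name `grid_search` and the grid resolution are illustrative choices, not part of the lecture.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return sum((y - f(x, w, b)) ** 2 for x, y in zip(X, Y)) / len(X)

def grid_search(lo=-6.0, hi=6.0, steps=121):
    """Evaluate the loss on a steps x steps grid and return the best (loss, w, b)."""
    best = None
    for i in range(steps):
        w = lo + (hi - lo) * i / (steps - 1)
        for j in range(steps):
            b = lo + (hi - lo) * j / (steps - 1)
            candidate = (loss(w, b), w, b)
            if best is None or candidate < best:
                best = candidate
    return best

best_loss, best_w, best_b = grid_search()
print(best_w, best_b, best_loss)
```

Even this tiny problem needs 121 x 121 loss evaluations; the count grows exponentially with the number of parameters, which is why the approach does not scale.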

Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface


Module 3.4:  Learning Parameters : Gradient Descent


Goal

Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!

\( \theta = [w,b]\): vector of parameters, say, randomly initialized

\( \Delta \theta = [\Delta w, \Delta b]\): change in the values of \(w,b\)

Adding \(\Delta \theta\) to \(\theta\) moves us in the direction of \(\Delta \theta\)

Let us be a bit conservative: move only by a small amount \(\eta \)

\(\theta_{new}=\theta+\eta \cdot \Delta \theta \)

Question: What is the right \(\Delta \theta\) to use ?

The answer comes from Taylor series

[Figure: the vectors \(\theta\), \(\Delta \theta\), the scaled step \(\eta \cdot \Delta \theta\) (in red), and the resulting \(\theta_{new}\)]

What is Taylor Series?

A Taylor series is a way of approximating any sufficiently differentiable function \(\mathscr{L}(w)\) near a point \(w_0\) using polynomials of degree \(n\). Typically, the higher the degree, the better the approximation!

 \(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+\frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2+\frac{\mathscr{L'''}(w_0)}{3!}(w-w_0)^3+ \cdots\)


Linear Approximation (\( n=1 \))

\( \mathscr{L}(w)=\mathscr{L}(w_0)+ \frac{\mathscr{L'}(w_0)}{1!}(w-w_0) \)

Quadratic Approximation (\(n=2\))

 \(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+ \frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2 \)
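The linear and quadratic approximations can be compared numerically. This sketch takes \(\mathscr{L}(w) = \exp(w)\) as an example function (chosen because every derivative of \(\exp\) is \(\exp\) itself) and expands around an assumed point \(w_0 = 1\); the function name `taylor_exp` is hypothetical.

```python
import math

def taylor_exp(w, w0, degree):
    """Degree-n Taylor polynomial of exp around w0: sum_k exp(w0) * (w - w0)^k / k!"""
    return sum(math.exp(w0) * (w - w0) ** k / math.factorial(k)
               for k in range(degree + 1))

w0 = 1.0
for w in (1.1, 1.5, 2.0):
    exact = math.exp(w)
    linear_error = abs(exact - taylor_exp(w, w0, 1))
    quadratic_error = abs(exact - taylor_exp(w, w0, 2))
    print(w, linear_error, quadratic_error)
```

Both errors are small close to \(w_0\) and grow as \(w\) moves away, with the quadratic approximation staying accurate over a wider range.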


Taylor Series Approximation

Try different \( \mathscr{L}(w) \) such as \(\exp(w), \sin(w), w^3, \cdots\) and see how closely the Taylor series approximates the function at the point \(w = w_0\) and within a small neighbourhood \(\epsilon\) around it

Approximating 2D function

For ease of notation, let \(\Delta \theta  =  u\); then from the Taylor series we have

\mathcal{L}(\theta+ \eta u) = \mathcal{L}(\theta)+\eta*u^T \nabla_{\theta}\mathcal{L}(\theta)+\frac{\eta^2}{2!}*u^T\nabla_{\theta}^2\mathcal{L}(\theta)u+ \frac{\eta^3}{3!} \cdots +\frac{\eta^4}{4!} \cdots
\approx \mathcal{L}(\theta)+\eta*u^T \nabla_{\theta}\mathcal{L}(\theta)

[ \( \eta\) is typically small, so the terms in \( \eta^2,\eta^3, \cdots\) are negligible]

Note that the move \(\eta u\) would be favorable only if

\mathcal{L}(\theta+\eta u)-\mathcal{L}(\theta) < 0

[i.e., if the new loss is less than the previous loss]

This implies (since \(\eta > 0\)),

u^T \nabla_{\theta}\mathcal{L}(\theta) < 0

But, what is the range of \( u^T \nabla_{\theta}\mathcal{L}(\theta) \)?

Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathcal{L}(\theta)\), then we know that

\(-1 \leq  \cos(\beta) = \frac{ u^T  \nabla_{\theta} \mathcal{L}(\theta)}{ ||u|| *  ||\nabla_{\theta} \mathcal{L}(\theta)||} \leq 1 \)

Multiply throughout by \(k = ||u|| *  ||\nabla_{\theta} \mathcal{L}(\theta)|| \)

\(-k \leq  k*\cos(\beta) = u^T  \nabla_{\theta} \mathcal{L}(\theta)  \leq k \)

Thus \( \mathcal{L}(\theta+\eta u) - \mathcal{L}(\theta) \approx \eta * u^T \nabla_{\theta} \mathcal{L}(\theta) = \eta * k*\cos(\beta) \) is most negative when \(\cos(\beta)=-1\) (i.e., \(\beta = 180^\circ\))
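The conclusion can be checked numerically by trying many unit directions \(u\) and seeing which one makes \(u^T \nabla_{\theta} \mathcal{L}(\theta)\) most negative. The gradient value used here is a made-up example, and the function name `directional_derivative` is only illustrative.

```python
import math

def directional_derivative(grad, angle):
    """u^T grad for the unit vector u = (cos(angle), sin(angle))."""
    u = (math.cos(angle), math.sin(angle))
    return u[0] * grad[0] + u[1] * grad[1]

grad = (3.0, -4.0)  # hypothetical gradient, chosen only for illustration

# Try unit directions at 1-degree intervals and keep the most negative one.
angles = [2 * math.pi * k / 360 for k in range(360)]
best_angle = min(angles, key=lambda a: directional_derivative(grad, a))
best_u = (math.cos(best_angle), math.sin(best_angle))

# Unit vector pointing exactly opposite to the gradient.
norm = math.hypot(*grad)
opposite = (-grad[0] / norm, -grad[1] / norm)
print(best_u, opposite)
```

Up to the 1-degree resolution of the search, the best direction coincides with the unit vector opposite to the gradient.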

Land at Local Minima with Gradient as Guidance

Set \(\eta=1\) initially. The function \(f(w)\) is evaluated at the initial value \(w=1.2\); note the value of the gradient (\(dw\)).

Change the value of \(w\) according to the gradient, that is, take a step \(w - \eta \, dw\) (for the first iteration this is \(1.2-0.31\); do the same for the remaining iterations)

After a few iterations, did you land at the local minimum?

If not, repeat the game and make the necessary changes to land at the local minimum.

Gradient Descent Rule

The direction \(u\) that we intend to move in should be at 180° w.r.t. the gradient

In other words, move in a direction opposite to the gradient

Parameter Update Rule

\( w_{t+1} = w_t - \eta \nabla w_t\)

\(b_{t+1} = b_t - \eta \nabla b_t \)

where \( \nabla w_t = \frac{\partial \mathcal{L}(w,b)}{\partial w}\) at \(w=w_t,b=b_t \)

and \( \nabla b_t = \frac{\partial \mathcal{L}(w,b)}{\partial b}\) at \(w=w_t,b=b_t \)

So we now have a more principled way of moving in the (\(w,b\)) plane than our “guess work” algorithm

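Let us create an algorithm for this rule ...

Algorithm: gradient_descent()

\(t \leftarrow 0;\)

max_iterations \(\leftarrow 1000\);

\(w, b \leftarrow\) initialize randomly;

while \(t <\) max_iterations do

        \(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);

        \(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);

        \( t \leftarrow t+1;\)

end

The same loop as a Python sketch (dw and db stand for the gradients \(\nabla w\) and \(\nabla b\), which still have to be computed):

import random

eta = 1.0               # learning rate (step size)
max_iterations = 1000
w = random.random()     # random initial parameters
b = random.random()
while max_iterations:
  # dw, db: gradients of the loss at the current (w, b)
  w = w - eta*dw
  b = b - eta*db
  max_iterations -= 1

To see this algorithm in practice let us first derive \( \nabla w \) and \( \nabla b\) for our toy neural network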

[Figure: simplified sigmoid neuron with input \(x\), weight \(w\), bias \(b\), and output \(\hat{y}=f(x) = \frac{1}{1+e^{-(wx+b)}}\)]

Let us assume there is only one point to fit

\mathscr{L}(w,b)=\frac{1}{2}*(f(x)-y)^2

\nabla w = \frac{\partial \mathscr{L}(w,b)}{\partial w} = \frac{\partial}{\partial w}[\frac{1}{2}*(f(x)-y)^2]
= \frac{1}{2} [2*(f(x)-y) * \frac{\partial}{\partial w} (f(x)-y)]
= (f(x)-y) * \frac{\partial}{\partial w} (f(x))
= (f(x)-y) * \frac{\partial}{\partial w} (\frac{1}{1+e^{-(wx+b)}})
= (f(x)-y) * f(x)*(1-f(x))*x

where the last step uses

\frac{\partial}{\partial w} (\frac{1}{1+e^{-(wx+b)}})
= \frac{-1}{(1+e^{-(wx+b)})^2} \frac{\partial}{\partial w} (e^{-(wx+b)})
= \frac{-1}{(1+e^{-(wx+b)})^2}* (e^{-(wx+b)})* \frac{\partial}{\partial w} (-(wx+b))
= \frac{-1}{(1+e^{-(wx+b)})}* \frac{e^{-(wx+b)}}{(1+e^{-(wx+b)})}* (-x)
= \frac{1}{(1+e^{-(wx+b)})}* \frac{e^{-(wx+b)}}{(1+e^{-(wx+b)})}* x
= f(x)*(1-f(x))* x

So if we have only one point (\(x,y\)), we have

\nabla w=(f(x)-y) * f(x)*(1-f(x))*x

For two points,

\nabla w=\sum \limits_{i=1}^{2}(f(x_i)-y_i) * f(x_i)*(1-f(x_i))*x_i
\nabla b=\sum \limits_{i=1}^{2}(f(x_i)-y_i) * f(x_i)*(1-f(x_i))
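A quick way to gain confidence in these formulas is to compare them with a finite-difference estimate of the gradient. The sketch below uses the two training points from earlier; the helper `numeric_grad`, the step `eps`, and the test point \((w, b) = (0.5, 0)\) are assumptions made for illustration.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    """Half the sum of squared errors, matching the derivation above."""
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b):
    """Analytic gradient w.r.t. w: sum_i (f - y) * f * (1 - f) * x"""
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x
               for x, y in zip(X, Y))

def grad_b(w, b):
    """Analytic gradient w.r.t. b: sum_i (f - y) * f * (1 - f)"""
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b))
               for x, y in zip(X, Y))

def numeric_grad(w, b, eps=1e-6):
    """Central finite-difference estimates of the two partial derivatives."""
    dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return dw, db

w, b = 0.5, 0.0
print(grad_w(w, b), grad_b(w, b))
print(numeric_grad(w, b))
```

The two printed pairs should agree to several decimal places; a mismatch would point to an error in the derivation or in the code.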
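import numpy as np

# Training data: two (x, y) points
X = [0.5,2.5]
Y = [0.2,0.9]

def f(x,w,b):
  # Sigmoid neuron with a single input
  return 1/(1+np.exp(-(w*x+b)))

def error(w,b):
  # Half the sum of squared errors over all points
  err = 0.0
  for x,y in zip(X,Y):
    fx = f(x,w,b)
    err += (fx-y)**2
  return 0.5*err

def grad_b(x,w,b,y):
  # Gradient of the loss w.r.t. b for one point
  fx = f(x,w,b)
  return (fx-y)*fx*(1-fx)

def grad_w(x,w,b,y):
  # Gradient of the loss w.r.t. w for one point
  fx = f(x,w,b)
  return (fx-y)*fx*(1-fx)*x

def do_gradient_descent():
  # Initial parameters, learning rate, and number of epochs
  w,b,eta,max_epochs = -2,-2,1.0,1000

  for i in range(max_epochs):
    # Accumulate the gradients over all points
    dw,db = 0,0
    for x,y in zip(X,Y):
      dw += grad_w(x,w,b,y)
      db += grad_b(x,w,b,y)

    # Move against the gradient
    w = w - eta*dw
    b = b - eta*db

  return w,b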
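To watch the loss fall as the parameters are updated, the loop can record the loss after every epoch. This is a self-contained sketch under the same settings as above (start at \(w = b = -2\), \(\eta = 1\), 1000 epochs); the function name `gradient_descent` and the `history` list are additions for illustration.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def gradient_descent(w=-2.0, b=-2.0, eta=1.0, max_epochs=1000):
    """Run gradient descent and return the final (w, b) and the loss after each epoch."""
    history = [loss(w, b)]
    for _ in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            fx = f(x, w, b)
            dw += (fx - y) * fx * (1 - fx) * x
            db += (fx - y) * fx * (1 - fx)
        # Update both parameters using gradients taken at the same (w, b)
        w, b = w - eta * dw, b - eta * db
        history.append(loss(w, b))
    return w, b, history

w, b, history = gradient_descent()
print(w, b, history[0], history[-1])
```

Plotting `history` against the epoch number gives the usual training curve for this toy problem.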

Let us do_gradient_descent()

Later on in the course we will look at gradient descent in much more detail and discuss its variants

For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron

So where do we head from here ?

Module 3.5: Representation Power of a Multilayer Network of Sigmoid Neurons


Representation power of a multilayer network of perceptrons

A multilayer network of perceptrons with a single hidden layer can be used to represent any boolean function precisely (no errors)

Representation power of a multilayer network of sigmoid neurons

A multilayer network of sigmoid neurons with a single hidden layer can be used to approximate any continuous function to any desired precision

In other words, there is a guarantee that for any continuous function \(f(x) : R^n \rightarrow R^m\), we can always find a neural network (with 1 hidden layer containing enough neurons) whose output \(g(x)\) satisfies \(|g(x) - f(x)| < \epsilon\) !!

Proof: We will see an illustrative proof of this... [Cybenko, 1989], [Hornik, 1991]

See this link* for an excellent illustration of this proof

 

The discussion in the next few slides is based on the ideas presented at the above link

We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)

We observe that such an arbitrary function can be approximated by several “tower” functions

The more such “tower” functions we use, the better the approximation

To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions

Sifting Property of Tower function (impulse or delta function)

Any function can be represented  by summing shifted and scaled versions of tower functions 
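The effect of using more towers can be sketched with ideal rectangular towers. In the code below each tower has equal width and a height equal to the function value at its centre; the target function \(\sin\) on \([0, \pi]\), the function names, and the sampling density are assumptions chosen for illustration.

```python
import math

def tower_approximation(func, lo, hi, n_towers):
    """Return g(x): a sum of n_towers non-overlapping rectangular towers on [lo, hi]."""
    width = (hi - lo) / n_towers
    # Height of each tower: the function value at the tower's centre.
    heights = [func(lo + (k + 0.5) * width) for k in range(n_towers)]

    def g(x):
        # The towers do not overlap, so their sum at x is the height of
        # the single tower that covers x.
        k = min(int((x - lo) / width), n_towers - 1)
        return heights[k]

    return g

def max_error(func, g, lo, hi, samples=1000):
    """Largest |func(x) - g(x)| over evenly spaced sample points."""
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(func(x) - g(x)) for x in xs)

for n in (5, 20, 80):
    g = tower_approximation(math.sin, 0.0, math.pi, n)
    print(n, max_error(math.sin, g, 0.0, math.pi))
```

The worst-case error shrinks as the number of towers grows, which is the sense in which the sum of towers approximates the function.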

Suppose there is a black box which takes the original input \((x)\) and constructs these tower functions

We can then have a simple network which can just add them up to approximate the function

Our job now is to figure out what is inside this blackbox

[Figure: the input \(x\) feeds several “Tower Maker” black boxes whose outputs are added (+) to approximate the function]

We will figure this out over the next few slides ...

If we take the logistic function and set \(w\) to a very high value we will recover the step function
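This limiting behaviour is easy to observe numerically. The sketch below evaluates \(\sigma(wx + b)\) with \(b = 0\) for increasing \(w\); the sample points and the clamp on \(z\) (added only to keep `math.exp` from overflowing) are assumptions of this example.

```python
import math

def sigmoid(x, w, b):
    """Logistic function 1 / (1 + exp(-(w*x + b)))."""
    z = w * x + b
    # Clamp z so that math.exp cannot overflow for very large |w|.
    z = max(-500.0, min(500.0, z))
    return 1.0 / (1.0 + math.exp(-z))

xs = [-1.0, -0.1, -0.01, 0.01, 0.1, 1.0]
for w in (1, 10, 1000):
    print(w, [round(sigmoid(x, w, 0.0), 3) for x in xs])
```

For \(w = 1000\) the outputs are essentially 0 to the left of \(x = 0\) and 1 to the right, while for \(w = 1\) the change is gradual. In general the transition sits where \(wx + b = 0\), that is, at \(x = -b/w\).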

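Taking two such step functions with different thresholds and subtracting one from the other gives a tower: \(h_{21}=h_{11}-h_{12}\)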

Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another ?

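[Figure: the input \(x\) feeds two sigmoid neurons \(h_{11}\) (parameters \(w_1,b_1\)) and \(h_{12}\) (parameters \(w_2,b_2\)); their outputs are combined with weights \(+1\) and \(-1\) to give \(h_{21}\)]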

What could be the suitable activation function for the neuron \(h_{21}\)?
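The subtraction can be written as a tiny network in code. This is a sketch that treats \(h_{21}\) as a plain weighted sum of its inputs (a linear activation), which is one natural answer to the question above; the tower edges 0.4 and 0.6 and the steepness 1000 are made-up values.

```python
import math

def sigmoid(z):
    # Clamp z so that math.exp cannot overflow for steep sigmoids.
    z = max(-500.0, min(500.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def tower(x, left=0.4, right=0.6, steepness=1000.0):
    """1D tower built from two steep sigmoid neurons and a linear output neuron."""
    h11 = sigmoid(steepness * (x - left))    # w = steepness, b = -steepness*left
    h12 = sigmoid(steepness * (x - right))   # w = steepness, b = -steepness*right
    h21 = (+1) * h11 + (-1) * h12            # weights +1 and -1, no squashing
    return h21

for x in (0.2, 0.5, 0.8):
    print(x, round(tower(x), 3))
```

The output is close to 1 between the two edges and close to 0 outside them, so shifting `left` and `right` moves the tower and scaling the output changes its height.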

What if we have more than one input?

Suppose we are trying to take a decision about whether we will find oil at a particular location on the ocean bed (Yes/No) 

Further, suppose we base our decision on two factors: Salinity \((x_1)\) and Pressure \((x_2)\)

We are given some data and it seems that \(y \in \{oil,no \ oil\}\) is a complex function of \(x_1\) and \(x_2\)

We want a neural network to approximate this function

f(x_1,x_2)=\frac{1}{1+\exp(-(w_{1}x_1+w_{2}x_2+b))}

We need to figure out how to get a tower in this case

Let us vary the parameter \(w_1\) keeping \(w_2=0\) and \(b=0\).

Let us vary the parameter \(w_2\) keeping \(w_1=0\) and \(b=0\).

Let us vary the parameter \(b\) keeping \(w_1=k\) and \(w_2=0\).

What if we take two such step functions (with different \(b\) values) and subtract one from the other?

We still don’t get a tower (or we get a tower which is open from two sides)

[Figure: network with inputs \(x_1\) and \(x_2\), weights \(w_1, w_2, w_3, w_4\), and two sigmoid neurons \(h_{11}\) and \(h_{12}\) whose outputs are combined with weights \(+1\) and \(-1\) to give \(y=f(x_1,x_2)\)]

Can we come up with a neural network to represent this entire procedure of constructing a 3 dimensional tower ?
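One possible construction is sketched below: two steep sigmoids per input form a ridge along each axis, and a final steep sigmoid keeps only the region where both ridges overlap. The tower edges (0.4 and 0.6), the steepness, and the threshold of 1.5 on the summed ridges are assumptions of this sketch, not values taken from the lecture.

```python
import math

def sigmoid(z):
    # Clamp z so that math.exp cannot overflow for steep sigmoids.
    z = max(-500.0, min(500.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def ridge(x, lo, hi, k=1000.0):
    """Difference of two steep sigmoids: about 1 for lo < x < hi, about 0 elsewhere."""
    return sigmoid(k * (x - lo)) - sigmoid(k * (x - hi))

def tower_2d(x1, x2, lo=0.4, hi=0.6, k=1000.0):
    r1 = ridge(x1, lo, hi, k)   # h11 - h12: depends only on x1, so open along x2
    r2 = ridge(x2, lo, hi, k)   # h13 - h14: depends only on x2, so open along x1
    # r1 + r2 is about 2 inside the square, 1 on the arms of the cross, 0 elsewhere.
    # A steep sigmoid with its threshold at 1.5 keeps only the square.
    return sigmoid(k * (r1 + r2 - 1.5))

for point in ((0.5, 0.5), (0.5, 0.9), (0.9, 0.5), (0.9, 0.9)):
    print(point, round(tower_2d(*point), 3))
```

Only the point inside the square gives an output near 1; points on either ridge alone are suppressed by the final thresholding neuron.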

Creating 2D Tower Function

[Figure: network with inputs \(x_1\) and \(x_2\) feeding four sigmoid neurons \(h_{11}, h_{12}, h_{13}, h_{14}\); their outputs are combined with weights \(+1, -1, +1, -1\) and passed through a further sigmoid neuron]

Parameter values shown in the figure:

w_{11}=200, w_{12}=6, b_1=-100
w_{21}=200, w_{22}=6, b_2=100
w_{31}=6, w_{32}=200, b_3=-100
w_{41}=6, w_{42}=-200, b_4=100
w_1 = 50, w_{2}=0, b=-100


For 1 dimensional input we needed 2 neurons to construct a tower

For 2 dimensional input we needed 4 neurons to construct a tower

How many neurons will you need to construct a tower in \(n\) dimensions?

Think