CS6910: Fundamentals of Deep Learning

Lecture 4: Feedforward Neural Networks, Backpropagation 

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

   References/Acknowledgments

See the excellent videos by Hugo Larochelle on Backpropagation and Andrej Karpathy's lecture (CS231n, Winter 2016) on Backpropagation and Neural Networks






 

Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

The input to the network is an \(n\)-dimensional vector

The network contains \(L-1\) hidden layers (\(2\), in this case) having \(n\) neurons each

Finally, there is one output layer containing \(k\) neurons (say, corresponding to \(k\) classes)


Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation (\(a_i\) and \(h_i\) are vectors)


The input layer can be called the \(0\)-th layer and the output layer can be called the \(L\)-th layer

\(W_i \in \R^{n \times n}\) and \(b_i \in \R^n\) are the weight and bias between layers \(i-1\) and \(i\) \((0 < i < L)\)

\(W_L \in \R^{k \times n}\) and \(b_L \in \R^k\) are the weight and bias between the last hidden layer and the output layer (\(L = 3\) in this case)

[Figure: a feedforward network with inputs \(x_1, x_2, \dots, x_n\), pre-activations \(a_1, a_2, a_3\), activations \(h_1, h_2\), output \(h_L = \hat{y} = \hat{f}(x)\), and parameters \(W_1, b_1, W_2, b_2, W_3, b_3\)]
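To make these dimensions concrete, here is a minimal NumPy sketch that allocates parameters with exactly these shapes (the sizes \(n = 4\), \(k = 3\) and the initialization scale are illustrative assumptions, not from the lecture):

```python
import numpy as np

n, k, L = 4, 3, 3   # n inputs / hidden units per layer, k classes, L layers

# W_1, ..., W_{L-1} are n x n; W_L is k x n (last hidden layer -> output layer)
W = [np.random.randn(n, n) * 0.01 for _ in range(L - 1)] + [np.random.randn(k, n) * 0.01]
# b_1, ..., b_{L-1} are in R^n; b_L is in R^k
b = [np.zeros(n) for _ in range(L - 1)] + [np.zeros(k)]

for i in range(L):
    print(f"W_{i+1}: {W[i].shape}, b_{i+1}: {b[i].shape}")
```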

The pre-activation at layer \(i\) is given by

\(a_i(x) = b_i +W_ih_{i-1}(x)\)

The activation at layer \(i\) is given by

\(h_i(x) = g(a_i(x))\)

where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

\(\hat{f}(x) = h_L(x)=O(a_L(x))\)

where \(O\) is the output activation function (for example, softmax, linear, etc.)

To simplify notation we will refer to \(a_i(x)\) as \(a_i\) and \(h_i(x)\) as \(h_i\)


With this simplified notation, the pre-activation at layer \(i\) is given by

\(a_i = b_i +W_ih_{i-1}\)

the activation at layer \(i\) is given by

\(h_i = g(a_i)\)

and the activation at the output layer is given by

\(\hat{f}(x) = h_L=O(a_L)\)
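Put together, the entire forward computation is a short loop. A minimal NumPy sketch, assuming the logistic function for \(g\) and a linear \(O\) (both are just example choices from the lists above):

```python
import numpy as np

def g(a):
    # one possible choice of activation function g: the logistic function
    return 1.0 / (1.0 + np.exp(-a))

def O(a):
    # one possible choice of output function O: linear (identity)
    return a

def forward(x, W, b):
    """Forward pass: a_i = b_i + W_i h_{i-1}, h_i = g(a_i), h_L = O(a_L)."""
    h = x
    for W_i, b_i in zip(W[:-1], b[:-1]):   # hidden layers 1, ..., L-1
        h = g(b_i + W_i @ h)
    return O(b[-1] + W[-1] @ h)            # output layer L

# e.g. y_hat = forward(np.random.randn(n), W, b) with the W, b sketched earlier
```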


Data: \(\lbrace x_i,y_i \rbrace_{i=1}^N\)

Model:

\(\hat y_i = \hat{f}(x_i) = O(W_3\, g(W_2\, g(W_1 x_i + b_1) + b_2) + b_3)\)

Parameters:

\(\theta = [W_1, ..., W_L, b_1, b_2, ..., b_L]\) \((L = 3)\)

Algorithm: Gradient Descent with Back-propagation (we will see soon)

Objective/Loss/Error function: Say,

\(\min \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)

In general, \(\min\ \mathscr{L}(\theta)\), where \(\mathscr{L}(\theta)\) is some function of the parameters


Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

   The story so far...

We have introduced feedforward neural networks
We are now interested in finding an algorithm for learning the parameters of this model






 

Recall our gradient descent algorithm



Algorithm: gradient_descent()

\(t \gets 0;\)

\(max\_iterations \gets 1000;\)

\(Initialize \ \ w_0, b_0;\)

while \(t\)++ \(< max\_iterations\)  do

\(\quad w_{t+1} \gets w_t - \eta \nabla w_t\)

\(\quad b_{t+1} \gets b_t - \eta \nabla b_t\)

end

We can write it more concisely as


Algorithm: gradient_descent()

\(t \gets 0;\)

\(max\_iterations \gets 1000;\)

\(Initialize \ \ \theta_0 = [w_0,b_0];\)

while \(t\)++ \(< max\_iterations\)  do

\(\quad \theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)

end

where \(\nabla \theta_t = [\frac {\partial \mathscr{L}(\theta)}{\partial w_t},\frac {\partial \mathscr{L}(\theta)}{\partial b_t}]^T\)

Now, in this feedforward neural network, instead of \(\theta = [w,b]\) we have \(\theta = [W_1, ..., W_L, b_1, b_2, ..., b_L]\)

We can still use the same algorithm for learning the parameters of our model

Algorithm: gradient_descent()

\(t \gets 0;\)

\(max\_iterations \gets 1000;\)

\(Initialize\) \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)

while \(t\)++ \(< max\_iterations\)  do

\(\quad \theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)

end

where \(\nabla \theta_t = [\frac {\partial \mathscr{L}(\theta)}{\partial W_{1,t}},\dots,\frac {\partial \mathscr{L}(\theta)}{\partial W_{L,t}}, \frac {\partial \mathscr{L}(\theta)}{\partial b_{1,t}},\dots,\frac {\partial \mathscr{L}(\theta)}{\partial b_{L,t}}]^T\)
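As a code sketch, the update is identical whether \(\theta\) holds two scalars or all the weight matrices and bias vectors. Here `grad` is a placeholder for whatever computes \(\nabla \theta_t\) (backpropagation, which we derive soon, or even numerical differentiation); the names and defaults are assumptions for illustration:

```python
def gradient_descent(theta, grad, eta=0.1, max_iterations=1000):
    """theta: list of parameter arrays [W_1, ..., W_L, b_1, ..., b_L].
    grad(theta): assumed to return dL/dp for every array p in theta."""
    for t in range(max_iterations):
        gradients = grad(theta)
        # theta_{t+1} = theta_t - eta * grad(theta_t), applied array by array
        theta = [p - eta * dp for p, dp in zip(theta, gradients)]
    return theta
```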


Except that now our \(\nabla \theta\) looks much nastier

\(\nabla \theta \) is thus composed of

\(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}, \nabla W_L \in \R^{k \times n},\) 

\(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}, \nabla b_L \in \R^{k}\) 

\(\nabla \theta = \left[\frac {\partial \mathscr{L}(\theta)}{\partial W_{111}}, \dots, \frac {\partial \mathscr{L}(\theta)}{\partial W_{1nn}},\ \frac {\partial \mathscr{L}(\theta)}{\partial W_{211}}, \dots, \frac {\partial \mathscr{L}(\theta)}{\partial W_{2nn}},\ \dots,\ \frac {\partial \mathscr{L}(\theta)}{\partial W_{L,11}}, \dots, \frac {\partial \mathscr{L}(\theta)}{\partial W_{L,nk}},\ \frac {\partial \mathscr{L}(\theta)}{\partial b_{11}}, \dots, \frac {\partial \mathscr{L}(\theta)}{\partial b_{1n}},\ \dots,\ \frac {\partial \mathscr{L}(\theta)}{\partial b_{L1}}, \dots, \frac {\partial \mathscr{L}(\theta)}{\partial b_{Lk}}\right]^T\)

We need to answer two questions

How to choose the loss function \(\mathscr{L}(\theta)\)?

How to compute \(\nabla \theta\), which is composed of \(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}\), \(\nabla W_L \in \R^{k \times n}\), \(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}\), and \(\nabla b_L \in \R^{k}\)?







 

Module 4.3: Output Functions and Loss Functions

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

We need to answer two questions

How to choose the loss function \(\mathscr{L}(\theta)\)?

How to compute \(\nabla \theta\), which is composed of \(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}\), \(\nabla W_L \in \R^{k \times n}\), \(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}\), and \(\nabla b_L \in \R^{k}\)?















 

The choice of loss function depends on the problem at hand

We will illustrate this with the help of two examples

Consider our movie example again but this time we are interested in predicting ratings

Here \(y_j \in \R ^3\)

The loss function should capture how much \(\hat y_j\) deviates from \(y_j\)

If \(y_j\in \R ^k\) then the squared error loss can capture this deviation

\(\mathscr {L}(\theta) = \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)
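For concreteness, a minimal NumPy version of this loss, assuming `y_hat` and `y` are \(N \times k\) arrays of predictions and true ratings (names are illustrative):

```python
import numpy as np

def squared_error(y_hat, y):
    # mean over the N examples of the summed squared deviation across the k outputs
    return np.sum((y_hat - y) ** 2) / y.shape[0]
```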

[Figure: a neural network with \(L-1\) hidden layers; the input \(x_i\) encodes features such as isActor Damon and isDirector Nolan, and the three outputs are the imdb Rating, Critics Rating, and RT Rating, e.g. \(y_j = \{7.5,\ 8.2,\ 7.7\}\)]

A related question: What should the output function '\(O\)' be if \(y_j \in \R\)?

More specifically, can it be the logistic function?

No, because it restricts \(\hat y_j\) to a value between \(0\) & \(1\) but we want \(\hat y_j \in \R\)

So, in such cases, it makes sense to have '\(O\)' be a linear function

\(\hat{f}(x) = h_L = O(a_L) \)

\(= W_Oa_L + b_O \)

\(\hat y_j= \hat{f}(x_i)\) is no longer bounded between \(0\) and \(1\)



Watch this lecture on Information and Entropy here


Now let us consider another problem for which a different loss function would be appropriate

Suppose we want to classify an image into 1 of \(k\) classes

Here again we could use the squared error loss to capture the deviation

But can you think of a better function?

[Figure: a neural network with \(L-1\) hidden layers classifying an image into one of four classes: Apple, Mango, Orange, Banana; the true label is \(y = [1\ \ 0\ \ 0\ \ 0]\)]

Notice that \(y\) is a probability distribution

Therefore we should also ensure that \(\hat y\) is a probability distribution


What choice of the output activation '\(O\)' will ensure this?

\(a_L = W_Lh_{L-1} + b_L\)

\(\hat y_j = O(a_L)_j = \cfrac {e^{a_{L,j}}}{\sum_{i=1}^k e^{a_{L,i}}}\)

\(O(a_L)_j\) is the \(j^{th}\) element of \(\hat y\) and \(a_{L,j}\) is the \(j^{th}\) element of the vector \(a_L\).

This function is called the \(softmax\) function
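A minimal NumPy sketch of this function; subtracting \(\max(a_L)\) before exponentiating is a standard numerical-stability trick added on top of the formula above (it cancels in the ratio and leaves the output unchanged):

```python
import numpy as np

def softmax(a_L):
    # exponentiate and normalize so that the entries sum to 1
    e = np.exp(a_L - np.max(a_L))   # subtracting max(a_L) avoids overflow
    return e / np.sum(e)
```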




Now that we have ensured that both \(y\) & \(\hat y\) are probability distributions, can you think of a function which captures the difference between them?

Cross-entropy

\(\mathscr {L}(\theta) = - \displaystyle \sum_{c=1}^k y_c \log \hat y_c \)

Notice that

\(y_c = 1\) if \(c = \ell\) (the true class label) and \(y_c = 0\) otherwise

\(\therefore\)   \(\mathscr {L}(\theta) = - \log \hat y_\ell\)
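In code the loss therefore needs only the predicted probability of the true class. A minimal sketch, assuming `y_hat` is the softmax output and `ell` is the index \(\ell\) of the true class (both names are illustrative):

```python
import numpy as np

def cross_entropy(y_hat, ell):
    # -log of the predicted probability of the true class ell
    return -np.log(y_hat[ell])

# e.g. loss = cross_entropy(softmax(a_L), ell) with softmax as sketched earlier
```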


So, for a classification problem (where you have to choose \(1\) of \(k\) classes), we use the following objective function

\(\underset{\theta}{\text{minimize}}\ \ \mathscr {L}(\theta) = - \log \hat y_\ell\)

or

\(\underset{\theta}{\text{maximize}}\ \ {-\mathscr {L}(\theta)} = \log \hat y_\ell\)

But wait!

Is \(\hat y_\ell\) a function of \(\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]\)?

Yes, it is indeed a function of \(\theta\)

\(\hat y_\ell = [O(W_3 g(W_2 g(W_1 x + b_1) + b_2) + b_3)]_\ell\)

What does \(\hat y_\ell\) encode?

It is the probability that \(x\) belongs to the \(\ell^{th}\) class (bring it as close to \(1\)).

\(\log \hat y_\ell\) is called the \(log\text -likelihood\) of the data.


Outputs:             Real Values      Probabilities
Output Activation:   Linear           Softmax
Loss Function:       Squared Error    Cross Entropy

Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often

For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy

Module 4.4: Backpropagation (Intuition)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

We need to answer two questions

How to choose the loss function \(\mathscr{L}(\theta)\)?

How to compute \(\nabla \theta\), which is composed of \(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}\), \(\nabla W_L \in \R^{k \times n}\), \(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}\), and \(\nabla b_L \in \R^{k}\)?