CS6910: Fundamentals of Deep Learning

Lecture 4: Feedforward Neural Networks, Backpropagation 

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Learning Objectives

At the end of this lecture, students will understand

  • the fundamentals of feedforward neural networks

  • the mathematical formulation of the backpropagation algorithm

   References/Acknowledgments

See the excellent videos by Hugo Larochelle on Backpropagation and Andrej Karpathy's lecture (CS231n, Winter 2016) on Backpropagation and Neural Networks






 

Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

The input to the network is an \(n\)-dimensional vector

The network contains \(L-1\) hidden layers (2, in this case) having \(n\) neurons each

Finally, there is one output layer containing \(k\) neurons (say, corresponding to \(k\) classes)


Each neuron in the hidden layer and output layer can be split into two parts : pre-activation and activation (\(a_i\) and \(h_i\) are vectors)


The input layer can be called the \(0\)-th layer and the output layer can be called the \(L\)-th layer

\(W_i \in \R^{n \times n}\) and \(b_i \in \R^n\) are the weights and biases between layers \(i-1\) and \(i\) \((0 < i < L)\)

\(W_L \in \R^{k \times n}\) and \(b_L \in \R^k\) are the weights and biases between the last hidden layer and the output layer (\(L = 3\) in this case)

[Figure: a feedforward network with inputs \(x_1, x_2, \dots, x_n\), pre-activations \(a_1, a_2, a_3\), activations \(h_1, h_2\), output \(h_L = \hat{y} = \hat{f}(x)\), and parameters \(W_1, b_1, W_2, b_2, W_3, b_3\)]

The pre-activation at layer \(i\) is given by

\(a_i(x) = b_i +W_ih_{i-1}(x)\)

The activation at layer \(i\) is given by

\(h_i(x) = g(a_i(x))\)

where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

\(f(x) = h_L(x)=O(a_L(x))\)

where \(O\) is the output activation function (for example, softmax, linear, etc.)

To simplify notation we will refer to \(a_i(x)\) as \(a_i\) and \(h_i(x)\) as \(h_i\)
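As a rough illustration of these equations (not part of the slides), here is a minimal numpy sketch of the forward computation, assuming a logistic activation \(g\), a linear output \(O\), and made-up dimensions \(n = 4\), \(k = 3\); all names in the snippet are illustrative:

```python
import numpy as np

def g(a):
    # logistic activation applied elementwise: g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases, output_fn):
    # h_0 = x; a_i = b_i + W_i h_{i-1}; h_i = g(a_i); h_L = O(a_L)
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h
        h = g(a)
    a_L = biases[-1] + weights[-1] @ h
    return output_fn(a_L)

# toy dimensions: n = 4 inputs, two hidden layers of n neurons, k = 3 outputs
rng = np.random.default_rng(0)
n, k = 4, 3
weights = [rng.standard_normal((n, n)), rng.standard_normal((n, n)), rng.standard_normal((k, n))]
biases  = [rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(k)]
x = rng.standard_normal(n)

y_hat = forward(x, weights, biases, output_fn=lambda a: a)  # linear output O
print(y_hat)
```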


With this simplified notation, the pre-activation at layer \(i\) is \(a_i = b_i + W_i h_{i-1}\), the activation is \(h_i = g(a_i)\), and the output is \(\hat{f}(x) = h_L = O(a_L)\).


Data: \(\lbrace x_i, y_i \rbrace_{i=1}^N\)

Model:

\(\hat y_i = \hat{f}(x_i) = O(W_3\, g(W_2\, g(W_1 x_i + b_1) + b_2) + b_3)\)

Parameters:

\(\theta = W_1, ..., W_L, b_1, ..., b_L \ \ (L = 3)\)

Algorithm: Gradient Descent with Backpropagation (we will see soon)

Objective/Loss/Error function: Say,

\(\min \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)

In general, \(\min \ \mathscr{L}(\theta)\)

where \(\mathscr{L}(\theta)\) is some function of the parameters
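A tiny sketch of this squared-error objective (the numbers are made up; in practice \(\hat y\) would come from the forward pass sketched earlier):

```python
import numpy as np

def squared_error(y_hat, y):
    # (1/N) * sum_i sum_j (yhat_ij - y_ij)^2 for N examples with k outputs each
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

y     = np.array([[7.5, 8.2, 7.7]])   # one example, k = 3 targets
y_hat = np.array([[7.0, 8.0, 8.0]])   # illustrative model predictions
print(squared_error(y_hat, y))        # 0.38
```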


Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

   The story so far...

We have introduced feedforward neural networks
We are now interested in finding an algorithm for learning the parameters of this model






 

Recall our gradient descent algorithm

Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(w_0, b_0;\)
while \(t\)++ \(< max\_iterations\) do
\(w_{t+1} \gets w_t - \eta \nabla w_t\)
\(b_{t+1} \gets b_t - \eta \nabla b_t\)
end

We can write it more concisely as

Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(\theta_0 = [w_0, b_0];\)
while \(t\)++ \(< max\_iterations\) do
\(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)
end

where \(\nabla \theta_t = [\frac {\partial \mathscr{L}(\theta)}{\partial w_t},\frac {\partial \mathscr{L}(\theta)}{\partial b_t}]^T\)

Now, in this feedforward neural network, instead of \(\theta = [w,b]\) we have \(\theta = [W_1, ..., W_L, b_1, b_2, ..., b_L]\)

We can still use the same algorithm for learning the parameters of our model

Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
\(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)
end

where \(\nabla \theta_t = [\frac {\partial \mathscr{L}(\theta)}{\partial W_{1,t}},\dots,\frac {\partial \mathscr{L}(\theta)}{\partial W_{L,t}}, \frac {\partial \mathscr{L}(\theta)}{\partial b_{1,t}},\dots,\frac {\partial \mathscr{L}(\theta)}{\partial b_{L,t}}]^T\)
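A minimal sketch of this update when \(\theta\) is stored as lists of weight matrices and bias vectors (the gradients here are placeholders; computing them is exactly what backpropagation, described later, gives us):

```python
import numpy as np

def gd_step(Ws, bs, dWs, dbs, eta=0.1):
    # theta_{t+1} = theta_t - eta * grad(theta_t), applied block-wise to each W_k and b_k
    Ws = [W - eta * dW for W, dW in zip(Ws, dWs)]
    bs = [b - eta * db for b, db in zip(bs, dbs)]
    return Ws, bs

# illustrative usage with dummy parameters and placeholder gradients
rng = np.random.default_rng(0)
Ws  = [rng.standard_normal((3, 3)) for _ in range(3)]
bs  = [rng.standard_normal(3) for _ in range(3)]
dWs = [np.ones((3, 3)) for _ in range(3)]   # placeholder gradients
dbs = [np.ones(3) for _ in range(3)]
Ws, bs = gd_step(Ws, bs, dWs, dbs)
```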


Except that now our \(\nabla \theta\) looks much nastier

\(\nabla \theta \) is thus composed of

\(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}, \nabla W_L \in \R^{k \times n},\) 

\(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}, \nabla b_L \in \R^{k}\) 

\(\nabla \theta = \Bigg[ \frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{11n}}, \frac{\partial \mathscr{L}(\theta)}{\partial W_{121}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{12n}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{1nn}}, \frac{\partial \mathscr{L}(\theta)}{\partial W_{211}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{2nn}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,11}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,nk}}, \frac{\partial \mathscr{L}(\theta)}{\partial b_{11}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial b_{1n}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial b_{L1}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial b_{Lk}} \Bigg]\)

We need to answer two questions

1. How to choose the loss function \(\mathscr{L}(\theta)\)?

2. How to compute \(\nabla \theta\), which is composed of \(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}\), \(\nabla W_L \in \R^{k \times n}\), \(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}\), and \(\nabla b_L \in \R^{k}\)?







 

Module 4.3: Output Functions and Loss Functions

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras


The choice of loss function depends on the problem at hand

We will illustrate this with the help of two examples

Consider our movie example again but this time we are interested in predicting ratings

Here \(y_j \in \R ^3\)

The loss function should capture how much \(\hat y_j\) deviates from \(y_j\)

If \(y_j\in \R ^k\) then the squared error loss can capture this deviation

\(\mathscr {L}(\theta) = \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)

[Figure: a neural network with \(L-1\) hidden layers; the input \(x_i\) encodes features such as isActor Damon and isDirector Nolan, and the outputs are the imdb, Critics, and RT ratings, e.g. \(y_j = \{7.5, \ 8.2, \ 7.7\}\)]

A related question: What should the output function '\(O\)' be if \(y_j \in \R\)?

More specifically, can it be the logistic function?

No, because it restricts \(\hat y_j\) to a value between \(0\) and \(1\), but we want \(\hat y_j \in \R\)

So, in such cases it makes sense to have '\(O\)' as a linear function

\(\hat{f}(x) = h_L = O(a_L) = W_O a_L + b_O\)

\(\hat y_j = \hat{f}(x_i)\) is no longer bounded between \(0\) and \(1\)



Watch this lecture on Information and Entropy here

[Figure: a neural network with \(L-1\) hidden layers classifying an image into one of \(k = 4\) classes (Apple, Mango, Orange, Banana), with true label \(y = [1 \ \ 0 \ \ 0 \ \ 0]\)]

Now let us consider another problem for which a different loss function would be appropriate

Suppose we want to classify an image into 1 of \(k\) classes

Here again we could use the squared error loss to capture the deviation

But can you think of a better function?


Notice that \(y\) is a probability distribution

Therefore we should also ensure that \(\hat y\) is a probability distribution


What choice of the output activation '\(O\)' will ensure this ?

\(a_L = W_Lh_{L-1} + b_L\)

\(\hat y_j = O(a_L)_j = \cfrac {e^{a_{L,j}}}{\sum_{i=1}^k e^{a_{L,i}}}\)

\(O(a_L)_j\) is the \(j^{th}\) element of \(\hat y\) and \(a_{L,j}\) is the \(j^{th}\) element of the vector \(a_L\).

This function is called the \(softmax\) function
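A small numpy sketch of the softmax output function; subtracting the maximum before exponentiating is a common numerical-stability trick and an implementation choice, not something the formula above requires:

```python
import numpy as np

def softmax(a_L):
    # yhat_j = exp(a_{L,j}) / sum_i exp(a_{L,i}), shifted by max(a_L) for numerical stability
    e = np.exp(a_L - np.max(a_L))
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1, -1.0])
y_hat = softmax(a_L)
print(y_hat, y_hat.sum())   # components are positive and sum to 1
```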




Now that we have ensured that both \(y\) & \(\hat y\) are probability distributions can you think of a function which captures the difference between them?

Cross-entropy

\(\mathscr {L}(\theta) = - \displaystyle \sum_{c=1}^k y_c \log \hat y_c \)

Notice that

\(y_c = 1 \ \text{ if } \ c = \ell\) (the true class label)

\(y_c = 0 \ \text{ otherwise}\)

\(\therefore \ \mathscr{L}(\theta) = -\log \hat y_\ell\)
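A minimal sketch of the cross-entropy loss for a one-hot label (the small `eps` guards against \(\log 0\) and is an implementation detail, not part of the formula above):

```python
import numpy as np

def cross_entropy(y_hat, label, eps=1e-12):
    # L(theta) = -sum_c y_c log(yhat_c) = -log(yhat_l) when y is one-hot with true class `label`
    return -np.log(y_hat[label] + eps)

y_hat = np.array([0.7, 0.1, 0.1, 0.1])   # e.g. a softmax output over 4 classes
print(cross_entropy(y_hat, label=0))     # -log(0.7) ~= 0.357
```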


So, for a classification problem (where you have to choose \(1\) of \(k\) classes), we use the following objective function

\(\underset{\theta}{\text{minimize}} \ \ \mathscr{L}(\theta) = -\log \hat y_\ell\)

or

\(\underset{\theta}{\text{maximize}} \ \ {-\mathscr{L}(\theta)} = \log \hat y_\ell\)

But wait!

Is \(\hat y_\ell\) a function of \(\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]\)?

Yes, it is indeed a function of \(\theta\)

\(\hat y_\ell = [O(W_3 g(W_2 g(W_1 x + b_1) + b_2) + b_3)]_\ell\)

What does \(\hat y_\ell\) encode?

It is the probability that \(x\) belongs to the \(\ell^{th}\) class (we want to bring it as close to \(1\) as possible).

\(\log \hat y_\ell\) is called the log-likelihood of the data.


Outputs         Output Activation    Loss Function
Real Values     Linear               Squared Error
Probabilities   Softmax              Cross Entropy

Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often.

For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy.

Module 4.4: Backpropagation (Intuition)

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

We need to answer two questions

1. How to choose the loss function \(\mathscr{L}(\theta)\)?

2. How to compute \(\nabla \theta\), which is composed of \(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}\), \(\nabla W_L \in \R^{k \times n}\), \(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}\), and \(\nabla b_L \in \R^{k}\)?


Let us focus on this one weight (\(W_{112}\)).

To learn this weight using SGD we need a formula for \(\frac{\partial \mathscr{L}(\theta)}{\partial W_{112}}\).

We will see how to calculate this.


First let us take the simple case when we have a deep but thin network.

In this case it is easy to find the derivative by chain rule.

\(\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathscr{L}(\theta)}{\partial \hat y} \frac{\partial \hat y}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial h_{21}} \frac{\partial h_{21}}{\partial a_{21}} \frac{\partial a_{21}}{\partial h_{11}} \frac{\partial h_{11}}{\partial a_{11}} \frac{\partial a_{11}}{\partial W_{111}}\)

[Figure: a deep, thin network: \(x_1 \xrightarrow{W_{111}} a_{11} \rightarrow h_{11} \xrightarrow{W_{211}} a_{21} \rightarrow h_{21} \xrightarrow{W_{L11}} a_{L1} \rightarrow \hat y = \hat f(x) \rightarrow \mathscr{L}(\theta)\)]

Compressing the chain rule:

\(\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{11}} \frac{\partial h_{11}}{\partial W_{111}}\)

\(\frac{\partial \mathscr{L}(\theta)}{\partial W_{211}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{21}} \frac{\partial h_{21}}{\partial W_{211}}\)

\(\frac{\partial \mathscr{L}(\theta)}{\partial W_{L11}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial W_{L11}}\)
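To make the chain rule concrete, here is a sketch that checks it numerically on a deep, thin network; the logistic activation, squared-error loss, and weight values below are illustrative choices, not from the lecture:

```python
import numpy as np

def g(a):
    # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

def gprime(a):
    return g(a) * (1 - g(a))

# thin network: x -> a1 -> h1 -> a2 -> h2 -> a3 -> h3, with squared-error loss (illustrative)
w = [0.5, -0.3, 0.8]          # scalar weights W_111, W_211, W_311 (made-up values)
x, y = 1.0, 1.0

# forward pass, storing pre-activations a_i and activations h_i
a, h = [], [x]
for wi in w:
    a.append(wi * h[-1])
    h.append(g(a[-1]))
loss = (h[-1] - y) ** 2

# chain rule: dL/dW_111 = dL/dh3 * dh3/da3 * da3/dh2 * dh2/da2 * da2/dh1 * dh1/da1 * da1/dW_111
analytic = (2 * (h[-1] - y) * gprime(a[2]) * w[2]
            * gprime(a[1]) * w[1]
            * gprime(a[0]) * h[0])

# finite-difference check of the same derivative
def loss_with_first_weight(w1):
    hh = x
    for wi in [w1, w[1], w[2]]:
        hh = g(wi * hh)
    return (hh - y) ** 2

eps = 1e-6
numeric = (loss_with_first_weight(w[0] + eps) - loss_with_first_weight(w[0] - eps)) / (2 * eps)
print(analytic, numeric)      # the two estimates should agree closely
```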


Let us see an intuitive explanation of backpropagation before we get into the
mathematical details

We get a certain loss at the output and we try to figure out who is responsible for this loss


So, we talk to the output layer and say "Hey! You are not producing the desired output, better take responsibility".

The output layer says "Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me". After all,

\(f(x) = \hat y = O(W_L h_{L-1} + b_L)\)

So, we talk to \(W_L, b_L\) and \(h_{L-1}\) and ask them "What is wrong with you?"

\(W_L\) and \(b_L\) take full responsibility, but \(h_{L-1}\) says "Well, please understand that I am only as good as the pre-activation layer".

The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it.

We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e., all the parameters of the model).

But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do)

\(\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\substack{\text{Talk to the}\\ \text{weight directly}}} = \underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat y} \frac{\partial \hat y}{\partial a_3}}_{\substack{\text{Talk to the}\\ \text{output layer}}} \ \underbrace{\frac{\partial a_3}{\partial h_2} \frac{\partial h_2}{\partial a_2}}_{\substack{\text{Talk to the previous}\\ \text{hidden layer}}} \ \underbrace{\frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1}}_{\substack{\text{Talk to the previous}\\ \text{hidden layer}}} \ \underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\substack{\text{and now talk}\\ \text{to the weights}}}\)


Quantities of interest (roadmap for the remaining part):

Gradient w.r.t. output units

Gradient w.r.t. hidden units

Gradient w.r.t. weights and biases

Our focus is on cross-entropy loss and softmax output.

Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras


Let us first consider the partial derivative w.r.t. the \(i\)-th output

\(\mathscr{L}(\theta) = -\log \hat y_\ell\)   (\(\ell =\) true class label)

\(\cfrac{\partial}{\partial \hat y_i}(\mathscr{L}(\theta)) = \cfrac{\partial}{\partial \hat y_i}(-\log \hat y_\ell)\)

\(= -\cfrac{1}{\hat y_\ell} \quad \text{if } i = \ell\)

\(= 0 \quad \text{otherwise}\)

More compactly,

\(\cfrac{\partial}{\partial \hat y_i}(\mathscr{L}(\theta)) = -\cfrac{\mathbb{1}_{i=\ell}}{\hat y_\ell}\)

We can now talk about the gradient w.r.t. the vector \(\hat y\)

\(\nabla_{\hat y} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial \hat y_1} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial \hat y_k} \end{bmatrix} = -\cfrac{1}{\hat y_\ell} \begin{bmatrix} \mathbb{1}_{\ell=1} \\ \mathbb{1}_{\ell=2} \\ \vdots \\ \mathbb{1}_{\ell=k} \end{bmatrix} = -\cfrac{1}{\hat y_\ell}\, e_\ell\)

where \(e_\ell\) is a \(k\)-dimensional vector whose \(\ell\)-th element is \(1\) and all other elements are \(0\).

What we are actually interested in is

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial a_{Li}} = \cfrac{\partial(-\log \hat y_\ell)}{\partial a_{Li}} = \cfrac{\partial(-\log \hat y_\ell)}{\partial \hat y_\ell} \cfrac{\partial \hat y_\ell}{\partial a_{Li}}\)

where \(\hat y_\ell = \cfrac{\exp(a_{L\ell})}{\sum_{i'} \exp(a_{Li'})}\)

Does \(\hat y_\ell\) depend on \(a_{Li}\)? Indeed, it does.

Having established this, we will now derive the full expression

\(\cfrac{\partial}{\partial a_{Li}}(-\log \hat y_\ell) = \cfrac{-1}{\hat y_\ell} \cfrac{\partial}{\partial a_{Li}} \hat y_\ell\)

\(= \cfrac{-1}{\hat y_\ell} \cfrac{\partial}{\partial a_{Li}}\, softmax(a_L)_\ell\)

\(= \cfrac{-1}{\hat y_\ell} \cfrac{\partial}{\partial a_{Li}} \cfrac{\exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}}\)

\(= \cfrac{-1}{\hat y_\ell} \Bigg( \cfrac{\frac{\partial}{\partial a_{Li}} \exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} - \cfrac{\exp(a_L)_\ell \big( \frac{\partial}{\partial a_{Li}} \sum_{i'} \exp(a_L)_{i'} \big)}{\big( \sum_{i'} \exp(a_L)_{i'} \big)^2} \Bigg)\)

\(= \cfrac{-1}{\hat y_\ell} \Bigg( \cfrac{\mathbb{1}_{\ell=i} \exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} - \cfrac{\exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} \cfrac{\exp(a_L)_i}{\sum_{i'} \exp(a_L)_{i'}} \Bigg)\)

\(= \cfrac{-1}{\hat y_\ell} \Big( \mathbb{1}_{\ell=i}\, softmax(a_L)_\ell - softmax(a_L)_\ell\, softmax(a_L)_i \Big)\)

\(= \cfrac{-1}{\hat y_\ell} \big( \mathbb{1}_{\ell=i}\, \hat y_\ell - \hat y_\ell\, \hat y_i \big)\)

\(= -(\mathbb{1}_{\ell=i} - \hat y_i)\)

(Here we used the quotient rule: \(\cfrac{\partial}{\partial x} \cfrac{g(x)}{h(x)} = \cfrac{\partial g(x)}{\partial x} \cfrac{1}{h(x)} - \cfrac{g(x)}{h(x)^2} \cfrac{\partial h(x)}{\partial x}\))

So far we have derived the partial derivative w.r.t. the \(i\)-th element of \(a_L\)

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial a_{Li}} = -(\mathbb{1}_{\ell=i} - \hat y_i)\)

We can now write the gradient w.r.t. the vector \(a_L\)

\(\nabla_{a_L} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{L1}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{Lk}} \end{bmatrix} = \begin{bmatrix} -(\mathbb{1}_{\ell=1} - \hat y_1) \\ -(\mathbb{1}_{\ell=2} - \hat y_2) \\ \vdots \\ -(\mathbb{1}_{\ell=k} - \hat y_k) \end{bmatrix} = -(e(\ell) - \hat y)\)
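A sketch of this result in numpy: with softmax output and cross-entropy loss, \(\nabla_{a_L} \mathscr{L}(\theta)\) is simply \(\hat y - e(\ell)\) (the values below are made up):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a_L = np.array([1.0, 2.0, 0.5, -1.0])   # illustrative output pre-activations
l = 1                                    # index of the true class
y_hat = softmax(a_L)

e_l = np.zeros_like(y_hat)
e_l[l] = 1.0
grad_aL = -(e_l - y_hat)                 # = y_hat - e(l)
print(grad_aL)
```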


Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras


Chain rule along multiple paths: if a function \(p(z)\) can be written as a function of intermediate results \(q_m(z)\), then we have:

\cfrac {\partial p(z)}{\partial z} = \displaystyle \sum_{m} \cfrac {\partial p(z)}{\partial q_m} \cfrac {\partial q_m}{\partial z}

In our case:

\(p(z)\) is the loss function \(\mathscr{L} (\theta)\)

\(z=h_{ij}\)

\(q_m(z) = a_{Lm}\)


\(\cfrac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} = \displaystyle\sum_{m=1}^k \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}} \cfrac{\partial a_{i+1,m}}{\partial h_{ij}} = \displaystyle\sum_{m=1}^k \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}}\, W_{i+1,m,j} \qquad [\because a_{i+1} = W_{i+1} h_i + b_{i+1}]\)

Now consider these two vectors,

\(\nabla_{a_{i+1}} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,1}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,k}} \end{bmatrix}; \quad W_{i+1,\,\cdot\,,j} = \begin{bmatrix} W_{i+1,1,j} \\ \vdots \\ W_{i+1,k,j} \end{bmatrix}\)

\(W_{i+1,\,\cdot\,,j}\) is the \(j\)-th column of \(W_{i+1}\); see that,

\((W_{i+1,\,\cdot\,,j})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) = \displaystyle\sum_{m=1}^k \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}}\, W_{i+1,m,j}\)

We have, \(\cfrac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} = (W_{i+1,\,\cdot\,,j})^T \nabla_{a_{i+1}} \mathscr{L}(\theta)\)

We can now write the gradient w.r.t. \(h_i\)

\(\nabla_{h_i} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial h_{i1}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{i2}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{in}} \end{bmatrix} = \begin{bmatrix} (W_{i+1,\,\cdot\,,1})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) \\ (W_{i+1,\,\cdot\,,2})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) \\ \vdots \\ (W_{i+1,\,\cdot\,,n})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) \end{bmatrix} = (W_{i+1})^T \big(\nabla_{a_{i+1}} \mathscr{L}(\theta)\big)\)

We are almost done except that we do not know how to calculate \(\nabla_{a_{i+1}} \mathscr{L}(\theta)\) for \(i < L-1\)

We will see how to compute that
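Before that, here is a sketch of the vectorized rule we just derived, \(\nabla_{h_i} \mathscr{L}(\theta) = (W_{i+1})^T \nabla_{a_{i+1}} \mathscr{L}(\theta)\); shapes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W_next      = rng.standard_normal((3, 4))   # W_{i+1}: maps h_i (dim 4) to a_{i+1} (dim 3)
grad_a_next = rng.standard_normal(3)        # nabla_{a_{i+1}} L, assumed already computed

grad_h = W_next.T @ grad_a_next             # nabla_{h_i} L = W_{i+1}^T nabla_{a_{i+1}} L
print(grad_h.shape)                         # (4,)
```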


\(\cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} \cfrac{\partial h_{ij}}{\partial a_{ij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}\, g'(a_{ij}) \qquad [\because h_{ij} = g(a_{ij})]\)

\(\nabla_{a_i} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial h_{i1}}\, g'(a_{i1}) \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{in}}\, g'(a_{in}) \end{bmatrix} = \nabla_{h_i} \mathscr{L}(\theta) \odot [\dots, g'(a_{ik}), \dots]\)
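Continuing the sketch, the elementwise step looks like this, assuming a logistic \(g\) (the vectors are made up):

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

a_i    = np.array([0.2, -1.0, 0.5, 1.5])       # pre-activations of layer i
grad_h = np.array([0.1, -0.3, 0.2, 0.05])      # nabla_{h_i} L from the previous step

grad_a = grad_h * (g(a_i) * (1 - g(a_i)))      # nabla_{a_i} L = nabla_{h_i} L ⊙ g'(a_i)
print(grad_a)
```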


Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras


Recall that,

\(a_k = b_k + W_k h_{k-1}\)

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial W_{kij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \cfrac{\partial a_{ki}}{\partial W_{kij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\, h_{k-1,j}\)

\(\nabla_{W_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k12}} & \dots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k1n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial W_{kn1}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{kn2}} & \dots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{knn}} \end{bmatrix}\)


Let's take a simple example of a \(W_k \in \R^{3 \times 3}\) and see what each entry looks like

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial W_{kij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \cfrac{\partial a_{ki}}{\partial W_{kij}}\)

\(\nabla_{W_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k12}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k13}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial W_{k21}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k22}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k23}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial W_{k31}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k32}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k33}} \end{bmatrix} = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} h_{k-1,3} \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} h_{k-1,3} \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}} h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}} h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}} h_{k-1,3} \end{bmatrix} = \nabla_{a_k} \mathscr{L}(\theta) \cdot h_{k-1}^T\)

Finally, coming to the biases

\(a_{ki} = b_{ki} + \displaystyle\sum_{j} W_{kij}\, h_{k-1,j}\)

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial b_{ki}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \cfrac{\partial a_{ki}}{\partial b_{ki}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\)

We can now write the gradient w.r.t. the vector \(b_k\)

\(\nabla_{b_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{kn}} \end{bmatrix} = \nabla_{a_k} \mathscr{L}(\theta)\)
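A sketch of both parameter-gradient rules, the outer product for \(W_k\) and the pass-through for \(b_k\) (shapes and values are illustrative):

```python
import numpy as np

grad_a_k = np.array([0.3, -0.2, 0.5])     # nabla_{a_k} L
h_prev   = np.array([1.0, 0.5, -1.0])     # h_{k-1}

grad_W_k = np.outer(grad_a_k, h_prev)     # nabla_{W_k} L = nabla_{a_k} L . h_{k-1}^T
grad_b_k = grad_a_k                       # nabla_{b_k} L = nabla_{a_k} L
print(grad_W_k.shape)                     # (3, 3)
```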


Module 4.8: Backpropagation: Pseudo code

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Finally, we have all the pieces of the puzzle

\(\nabla_{a_L} \mathscr{L}(\theta)\)   (gradient w.r.t. output layer)

\(\nabla_{h_k} \mathscr{L}(\theta), \nabla_{a_k} \mathscr{L}(\theta)\)   (gradient w.r.t. hidden layers, \(1 \leq k < L\))

\(\nabla_{W_k} \mathscr{L}(\theta), \nabla_{b_k} \mathscr{L}(\theta)\)   (gradient w.r.t. weights and biases, \(1 \leq k \leq L\))

We can now write the full learning algorithm

Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
\(a_1, h_1, a_2, h_2, ..., a_{L-1}, h_{L-1}, a_L, \hat y = forward\_propagation(\theta_t)\)
\(\nabla \theta_t = backward\_propagation(h_1, h_2, ..., h_{L-1}, a_1, a_2, ..., a_L, y, \hat y)\)
\(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)
end

Just do a forward propagation and compute all \(h_i\)'s, \(a_i\)'s, and \(\hat y\):

Algorithm: forward_propagation(\(\theta\))
for \(k = 1\) to \(L-1\) do
\(a_k = b_k + W_k h_{k-1};\)   (with \(h_0 = x\))
\(h_k = g(a_k);\)
end
\(a_L = b_L + W_L h_{L-1};\)
\(\hat y = O(a_L);\)

Algorithm: back_propagation(\(h_1, h_2, ..., h_{L-1}, a_1, a_2, ..., a_L, y, \hat y\))
// Compute output gradient ;
\(\nabla_{a_L} \mathscr{L}(\theta) = -(e(y) - \hat y);\)
for \(k = L\) to \(1\) do
// Compute gradients w.r.t. parameters ;
\(\nabla_{W_k} \mathscr{L}(\theta) = \nabla_{a_k} \mathscr{L}(\theta)\, h_{k-1}^T;\)
\(\nabla_{b_k} \mathscr{L}(\theta) = \nabla_{a_k} \mathscr{L}(\theta);\)
// Compute gradients w.r.t. layer below ;
\(\nabla_{h_{k-1}} \mathscr{L}(\theta) = W_k^T \nabla_{a_k} \mathscr{L}(\theta);\)
// Compute gradients w.r.t. layer below (pre-activation) ;
\(\nabla_{a_{k-1}} \mathscr{L}(\theta) = \nabla_{h_{k-1}} \mathscr{L}(\theta) \odot [..., g'(a_{k-1,j}), ...];\)
end
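Putting the three routines together, here is a compact numpy sketch for the softmax + cross-entropy case with logistic hidden units; the dimensions, initialization, learning rate, and single training example are illustrative choices, not reference code from the course:

```python
import numpy as np

def g(a):  return 1.0 / (1.0 + np.exp(-a))           # logistic activation
def gp(a): return g(a) * (1 - g(a))                   # its derivative g'(a)
def softmax(a):
    e = np.exp(a - a.max()); return e / e.sum()

def forward_propagation(Ws, bs, x):
    a_list, h_list = [], [x]                          # h_0 = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h_list[-1]
        a_list.append(a); h_list.append(g(a))
    a_L = bs[-1] + Ws[-1] @ h_list[-1]
    a_list.append(a_L)
    return a_list, h_list, softmax(a_L)

def back_propagation(Ws, a_list, h_list, y_hat, label):
    e_y = np.zeros_like(y_hat); e_y[label] = 1.0
    grad_a = -(e_y - y_hat)                           # nabla_{a_L} L
    dWs, dbs = [], []
    for k in reversed(range(len(Ws))):                # k = L..1 (0-indexed here)
        dWs.insert(0, np.outer(grad_a, h_list[k]))    # nabla_{W_k} L = nabla_{a_k} L h_{k-1}^T
        dbs.insert(0, grad_a)                         # nabla_{b_k} L = nabla_{a_k} L
        if k > 0:
            grad_h = Ws[k].T @ grad_a                 # nabla_{h_{k-1}} L
            grad_a = grad_h * gp(a_list[k - 1])       # nabla_{a_{k-1}} L
    return dWs, dbs

# toy run: n = 4 inputs, two hidden layers of 4 units, k = 3 classes, one example
rng = np.random.default_rng(0)
sizes = [4, 4, 4, 3]
Ws = [0.1 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
bs = [np.zeros(sizes[i + 1]) for i in range(3)]
x, label, eta = rng.standard_normal(4), 2, 0.5

for t in range(100):
    a_list, h_list, y_hat = forward_propagation(Ws, bs, x)
    dWs, dbs = back_propagation(Ws, a_list, h_list, y_hat, label)
    Ws = [W - eta * dW for W, dW in zip(Ws, dWs)]
    bs = [b - eta * db for b, db in zip(bs, dbs)]

print(-np.log(y_hat[label]))   # the cross-entropy loss should have decreased
```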

Module 4.9: Derivative of the activation function

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Now, the only thing we need to figure out is how to compute \(g'\)

Logistic function:

\(g(z) = \sigma(z) = \cfrac{1}{1+e^{-z}}\)

\(g'(z) = (-1) \cfrac{1}{(1+e^{-z})^2} \cfrac{d}{dz}(1+e^{-z})\)

\(= (-1) \cfrac{1}{(1+e^{-z})^2}(-e^{-z})\)

\(= \cfrac{1}{(1+e^{-z})} \cfrac{1+e^{-z}-1}{1+e^{-z}}\)

\(= g(z)(1-g(z))\)

tanh function:

\(g(z) = \tanh(z) = \cfrac{e^z - e^{-z}}{e^z + e^{-z}}\)

\(g'(z) = \cfrac{(e^z + e^{-z})\frac{d}{dz}(e^z - e^{-z}) - (e^z - e^{-z})\frac{d}{dz}(e^z + e^{-z})}{(e^z + e^{-z})^2}\)

\(= \cfrac{(e^z + e^{-z})^2 - (e^z - e^{-z})^2}{(e^z + e^{-z})^2}\)

\(= 1 - \cfrac{(e^z - e^{-z})^2}{(e^z + e^{-z})^2}\)

\(= 1 - (g(z))^2\)
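A short sketch of both derivative formulas, with a finite-difference check at an arbitrary test point:

```python
import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z):  return sigmoid(z) * (1 - sigmoid(z))     # g'(z) = g(z)(1 - g(z))
def tanh_prime(z):     return 1 - np.tanh(z) ** 2               # g'(z) = 1 - g(z)^2

z, eps = 0.7, 1e-6
print(sigmoid_prime(z), (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps))
print(tanh_prime(z),    (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps))
```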