Deep Learning: Distilled for practitioners using PyTorch

Arun Prakash A

DEEP LEARNING

Deep Neural Networks

Deep Matrix-Vector Products

ML methods: SVM, DT, PGM, KNN, DNN

Why do we need Deep Neural Networks?

Text

Natural Image 

Video

Speech

These data are unstructured, and DL models work well for unstructured data (and comparatively poorly for structured data)

One learning approach (gradient-based) works for all these data types

This paves the way to build a single model that learns from two (or more) different modalities (example: image captioning)

A model's performance scales up with increasing data size

Data type : Image 

Dataset: CIFAR-10 (Natural images)

A single thread

Models:  A single neuron, FCNN, CNN

Data Collection

Transformations

Training model

Testing model

It is alright if some words are alien to you

You may check out my GitHub repo if you want to go deeper into PyTorch.

Be One With Data

What we see: an image. What a machine sees: a matrix of pixel values.

A \(3 \times 3\) image is a \(3 \times 3\) matrix:

\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}

A \(10 \times 10\) image is a \(10 \times 10\) matrix:

\begin{bmatrix} 0 & 1 & 0 & \cdots &1 \\ 1 & 0& 1 & \cdots& 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 1 & \cdots & 0\\ \end{bmatrix}

In general, an image of size \(height \times width\) is a matrix (tensor) of dim \( m \times n \)


A colour image of size \(3 \times 3\) has three channels (R, G, B), so it is a \(3 \times 3 \times 3\) tensor:

R: \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}

G: \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}

B: \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}


Requirements for DL Framework:

Ability to create \(n\)-dimensional arrays, also called tensors

Scalar (0-dim tensor): 0.1

Vector (1-dim tensor): [0.1, 0.2, 0.3]

Matrix (2-dim tensor): \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}

Stacked matrix (3-dim tensor): \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} stacked along a new dimension

Provide efficient ways to manipulate them (Accessing elements, mathematical operations, ..)

Most commonly used operations:

flatten(): \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow \begin{bmatrix} 0 & 1 & 2 & 1 & 1 & 1 & 0 & 1 & 0 \end{bmatrix}

reshape(1,9): \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow \big[\begin{bmatrix} 0 & 1 & 2 & 1 & 1 & 1 & 0 & 1 & 0 \end{bmatrix}\big]

transpose(,0,1): \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 2 & 1 & 0 \end{bmatrix}

cat(,dim=1): \bigg( \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 2 & 1 & 0 \end{bmatrix} \bigg) \rightarrow \begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 0 &2 & 1 & 0 \end{bmatrix}
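These operations map directly onto PyTorch tensor methods; a minimal sketch using the example tensor above:

import torch

# the 3x3 example tensor used above
t = torch.tensor([[0, 1, 2],
                  [1, 1, 1],
                  [0, 1, 0]])

print(t.flatten())          # tensor([0, 1, 2, 1, 1, 1, 0, 1, 0]), shape (9,)
print(t.reshape(1, 9))      # same values, but shape (1, 9)
print(t.transpose(0, 1))    # swaps dims 0 and 1: rows become columns

u = torch.tensor([[0, 1, 0],
                  [1, 1, 1],
                  [2, 1, 0]])
print(torch.cat((t, u), dim=1))   # 3x6 tensor: u's columns appended to t's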

Most commonly used operations (reductions):

sum(): \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow 7, a scalar (dim=0)

sum(,dim=0): \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow [1,3,3], a vector (dim=1)

sum(,dim=1): \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow [3,3,1], a vector (dim=1)

Other common operations: mean (another reduction), accessing elements (indexing)
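A sketch of the same reductions in PyTorch (mean needs a floating-point tensor, hence the cast):

import torch

t = torch.tensor([[0, 1, 2],
                  [1, 1, 1],
                  [0, 1, 0]])

print(t.sum())        # tensor(7)          0-dim tensor (scalar)
print(t.sum(dim=0))   # tensor([1, 3, 3])  column-wise sums, 1-dim tensor
print(t.sum(dim=1))   # tensor([3, 3, 1])  row-wise sums,    1-dim tensor

print(t.float().mean())   # tensor(0.7778)     mean over all 9 elements
print(t[0, 2])            # tensor(2)          accessing a single element
print(t[:, 1])            # tensor([1, 1, 1])  accessing a whole column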

Most commonly used operations (element-wise non-linearities):

sigmoid(): applies the logistic function element-wise to \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}

softmax(,dim=0): normalizes \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} across rows (each column sums to 1)

softmax(,dim=1): normalizes \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} across columns (each row sums to 1)
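A sketch with torch's functional API; the dim argument picks the direction along which softmax normalizes:

import torch
import torch.nn.functional as F

t = torch.tensor([[0., 1., 2.],
                  [1., 1., 1.],
                  [0., 1., 0.]])

print(torch.sigmoid(t))      # logistic function applied element-wise

print(F.softmax(t, dim=0))   # each column sums to 1 (normalized across rows)
print(F.softmax(t, dim=1))   # each row sums to 1 (normalized across columns)

print(F.softmax(t, dim=0).sum(dim=0))   # tensor([1., 1., 1.])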

Creating Tensors in PyTorch: Switch to colab
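A minimal sketch of what that notebook starts with (assuming only that torch is installed):

import torch

scalar = torch.tensor(0.1)                       # 0-dim tensor
vector = torch.tensor([0.1, 0.2, 0.3])           # 1-dim tensor
matrix = torch.tensor([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])               # 2-dim tensor

# stacking three 3x3 matrices gives a 3-dim tensor (e.g. R, G, B channels)
stacked = torch.stack([matrix, matrix, matrix])

print(scalar.dim(), vector.dim(), matrix.dim(), stacked.dim())   # 0 1 2 3
print(stacked.shape)                                             # torch.Size([3, 3, 3])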

Why is Image Recognition a Complex Task?

How many patterns are possible with an image of size \(3 \times 3\)?

2^{3 \times 3}= 2^9=512

Of course, not all of them are relevant for a particular problem in most situations.

Recognize whether an image contains a single vertical line or not

The output is True for a very small subset of images and False for the rest

What do you perceive?

How many patterns are possible with an image of size \(5 \times 5\)?

2^{5 \times 5 }= 2^{25} \approx 33 \times 10^6

How many patterns are possible with a colour image of size \(5 \times 5\)?

256^{5 \times 5 \times 3}= 256^{75} \approx 10^{181}

This is a challenge for Generative modelling 

Our focus is on discriminative modelling

Two configurations out of millions of possibilities

Neuron

\hat{y}=f(\mathbf{w}^T\mathbf{x}) \quad \mathbf{x} \in \mathbb{R}^{n+1}, \quad \mathbf{w} \in \mathbb{R}^{n+1}
[Neuron diagram: inputs \(x_1, \dots, x_k, \dots, x_n\) with weights \(w_1, \dots, w_k, \dots, w_n\), a bias input \(x_0 = 1\) with weight \(w_0\), the weighted sum \(\sum x_iw_i\), the activation \(f\), and the output \(\hat{y}\)]

The parameter vector \(\mathbf{w}\) is randomly initialized

loss=\mathscr{L}(\hat{y},y)

Compute the gradient of the loss w.r.t. \(\mathbf{w}\), \(\nabla \mathbf{w}\)

Update rule:

\mathbf{w}:= \mathbf{w}-\eta \nabla \mathbf{w}

The non-linear function \(f\) in a neuron is called the activation function

Check the performance with a set of criteria, and iteratively update the parameters to improve it
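A minimal sketch of this loop for a single sigmoid neuron, using autograd to get \(\nabla \mathbf{w}\) (the data here is made up purely for illustration):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(100, 10)                      # 100 made-up inputs with n = 10 features
y = (x.sum(dim=1) > 5).float()               # made-up binary targets

w = torch.randn(10, requires_grad=True)      # randomly initialized parameters
b = torch.zeros(1, requires_grad=True)       # bias term (the w_0 above)
eta = 0.1                                    # learning rate

for step in range(100):
    y_hat = torch.sigmoid(x @ w + b)                 # f(w^T x)
    loss = F.binary_cross_entropy(y_hat, y)          # L(y_hat, y)
    loss.backward()                                  # gradients of loss w.r.t. w, b
    with torch.no_grad():
        w -= eta * w.grad                            # w := w - eta * grad
        b -= eta * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(loss.item())                                   # loss after the last update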

Wait, the image is not 1D, so how do we feed it to a neuron as an input?

\begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}

Note: Input elements are real-valued in general.
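We simply flatten the image into a vector before feeding it to the neuron; a short sketch:

import torch

img = torch.tensor([[1., 0., 1.],
                    [0., 1., 0.],
                    [1., 0., 1.]])      # a 3x3 image

x = img.flatten()                       # shape (9,): the input vector to the neuron
print(x)                                # tensor([1., 0., 1., 0., 1., 0., 1., 0., 1.])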


List of Activation functions

Few more design choices (see the sketch after this list):

  • Loss functions
  • Optimizers: SGD, adaptive methods
  • Hyper-parameters: learning rate (schedulers), momentum, ...
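A minimal sketch of how these choices are wired up in PyTorch (the particular model, loss, optimizer, and scheduler here are illustrative, not prescriptive):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                               # any model works here

criterion = nn.CrossEntropyLoss()                      # loss function
optimizer = torch.optim.SGD(model.parameters(),        # optimizer: SGD ...
                            lr=0.01, momentum=0.9)     # ... with its hyper-parameters
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# one illustrative update with made-up data
x = torch.randn(4, 10)
y = torch.tensor([0, 1, 1, 0])
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()        # schedulers are typically stepped once per epoch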

Why do we need different Activation functions?

Mainly to ensure that the gradients flow across the entire network

Feed Forward Neural Networks

(Multi-Layer (network of) Perceptron)

The input to the network is an \(n\)-dimensional vector

The network contains \(L-1\) hidden layers (2, in this case) having \(n\) neurons each

Finally, there is one output layer containing \(k\) neurons (say, corresponding to \(k\) classes)

Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation (\(a_i\) and \(h_i\) are vectors)

The input layer can be called the \(0\)-th layer and the output layer can be called the \(L\)-th layer

\(W_i \in \R^{n \times n}\) and \(b_i \in \R^n\) are the weight and bias between layers \(i-1\) and \(i\) \((0 < i < L)\)

\(W_L \in \R^{k \times n}\) and \(b_L \in \R^k\) are the weight and bias between the last hidden layer and the output layer (\(L = 3\) in this case)

[Network diagram: inputs \(x_1, x_2, \dots, x_n\); pre-activations \(a_1, a_2, a_3\); activations \(h_1, h_2\) and \(h_L=\hat {y} = \hat{f}(x)\); parameters \(W_1, b_1, W_2, b_2, W_3, b_3\)]

The pre-activation at layer \(i\) is given by

\(a_i(x) = b_i +W_ih_{i-1}(x)\)

The activation at layer \(i\) is given by

\(h_i(x) = g(a_i(x))\)

where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

\(f(x) = h_L(x)=O(a_L(x))\)

where \(O\) is the output activation function (for example, softmax, linear, etc.)

To simplify notation we will refer to \(a_i(x)\) as \(a_i\) and \(h_i(x)\) as \(h_i\)


The pre-activation at layer \(i\) is given by

\(a_i = b_i +W_ih_{i-1}\)

The activation at layer \(i\) is given by

\(h_i = g(a_i)\)

where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

\(f(x) = h_L=O(a_L)\)

where \(O\) is the output activation function (for example, softmax, linear, etc.)
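These two equations translate almost line-for-line into code; a sketch of the forward pass as a loop, assuming sigmoid for \(g\) and softmax for \(O\) (shapes chosen to match the network above: \(n\) hidden neurons, \(k\) outputs):

import torch

def forward(x, weights, biases):
    """h_0 = x; a_i = b_i + W_i h_{i-1}; h_i = g(a_i); output = O(a_L)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                 # pre-activation
        h = torch.sigmoid(a)          # activation g
    a_L = biases[-1] + weights[-1] @ h
    return torch.softmax(a_L, dim=0)  # output activation O

n, k = 4, 3
weights = [torch.randn(n, n), torch.randn(n, n), torch.randn(k, n)]
biases  = [torch.randn(n), torch.randn(n), torch.randn(k)]
y_hat = forward(torch.randn(n), weights, biases)
print(y_hat.sum())   # ~1: the softmax output sums to one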


Data: \(\lbrace x_i,y_i \rbrace_{i=1}^N\)

Model:

\(\hat y_i = \hat{f}(x_i) = O(W_3 \, g(W_2 \, g(W_1 x_i + b_1) + b_2) + b_3)\)

Parameters:

\(\theta = W_1, ..., W_L, b_1, ..., b_L \quad (L = 3)\)

Algorithm: Gradient Descent with Back-propagation

Objective/Loss/Error function: Say,

\(\min \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)

\(\text {In general, } \min \ \mathscr{L}(\theta)\)

where \(\mathscr{L}(\theta)\) is some function of the parameters
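The same model written with torch.nn modules; a sketch assuming sigmoid hidden activations and a softmax output as above (the layer sizes are placeholders):

import torch
import torch.nn as nn

class FFNN(nn.Module):
    """y_hat = O(W3 g(W2 g(W1 x + b1) + b2) + b3)."""
    def __init__(self, n, k):
        super().__init__()
        self.layer1 = nn.Linear(n, n)     # W1, b1
        self.layer2 = nn.Linear(n, n)     # W2, b2
        self.layer3 = nn.Linear(n, k)     # W3, b3

    def forward(self, x):
        h1 = torch.sigmoid(self.layer1(x))
        h2 = torch.sigmoid(self.layer2(h1))
        return torch.softmax(self.layer3(h2), dim=-1)

model = FFNN(n=3, k=2)
print(sum(p.numel() for p in model.parameters()))   # number of entries in theta

Note that in practice the softmax is often folded into the loss: nn.CrossEntropyLoss expects raw pre-activations (logits) rather than probabilities.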


"Forward Pass" (a worked example)

Input: \(x=[1.5, 2.5, 3]\), with target \(y=[1, 0]\)

Parameters:

\(W_1=\) \begin{bmatrix} 0.05 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.05 \\ \end{bmatrix} , \(b_1\) = [0.01,0.02,0.03]

\(W_2=\) \begin{bmatrix} 0.025 & 0.025 & 0.025 \\ 0.025 & 0.025 & 0.025 \\ 0.025 & 0.025 & 0.025 \\ \end{bmatrix} , \(b_2\) = [0.01,0.02,0.03]

\(W_3=\) \begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ \end{bmatrix} , \(b_3\) = [0.01,0.02]

Layer 1:

\([a_1]=[1.5,2.5,3]*W_1 + [0.01,0.02,0.03] = [0.36, 0.37, 0.38]\)

\([h_1]=sigmoid(a_1) = [0.589, 0.591, 0.593]\)

Layer 2:

\([a_2]=[0.589,0.591,0.593]*W_2 + [0.01,0.02,0.03] = [0.054, 0.064, 0.074]\)

\([h_2]=sigmoid(a_2) = [0.513, 0.516, 0.518]\)

Output layer:

\([a_3]=[0.513,0.516,0.518]*W_3 + [0.01,0.02] = [1.558, 1.568]\)

\([h_3]=softmax(a_3) = [0.497, 0.502]\)

\(\hat y = h_3 \)

"Binary Cross Entropy Loss"

\(\mathscr {L}(\theta) = -\frac{1}{N} \sum_{i=1}^N (y_i\log(\hat y_i)+(1-y_i)\log(1- \hat y_i)) = 0.6981\)