Developing Deep Learning Models using PyTorch

(For Beginners)

Arun Prakash A

DEEP LEARNING

 DEEP Neural Networks

Deeply Stacked Matrix-Vector Products

[Figure: ML techniques: SVM, DT, PGM, KNN, DNN]

Why do we need Deep Neural Networks?

Text

Natural Image 

Video

Speech

The data are unstructured, and DL models work really well with unstructured data (but poorly for structured data)

One learning approach (gradient-based) for all these data types

Multi-modal: paves the way to building a single model that learns from two (or more) different modalities (example: Image Captioning)

Model performance scales up with an increase in data size

Data type : Image 

Dataset: CIFAR-10 (Natural images)

A single thread

Models:  A single neuron, FCNN, CNN

Data Collection

Transformations

Training model

Testing model

It is alright if some words are alien to you.

You may check out my GitHub repo if you want to go deeper into PyTorch.

Be One With Data

What we see: an image of size \(height \times width\)

What a machine sees: a matrix (tensor) of dim \( m \times n \)

\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}_{3 \times 3}
\qquad
\begin{bmatrix} 0 & 1 & 0 & \cdots &1 \\ 1 & 0& 1 & \cdots& 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 1 & \cdots & 0\\ \end{bmatrix}_{10 \times 10}

A colour image of size \(3 \times 3\) is a \(3 \times 3 \times 3\) tensor: one \(3 \times 3\) matrix per channel

\begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}_{R}
\quad
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}_{G}
\quad
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}_{B}

Representing Text Data

" The course contents are well  organized"

Tokenize

[The, course, contents, are, well, organized]

Numericalize

{The: 2, course: 4, contents: 1, are: 3, well: 5, organized: 6}

Embedding

V=\{contents,The,are,course,...\}
\begin{bmatrix} 0.1, 0.05, -0.25, \cdots \end{bmatrix}_{1 \times 1024}
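A minimal sketch of this tokenize → numericalize → embed pipeline in PyTorch; the vocabulary indices and the 1024-dimensional embedding follow the slide, the variable names are illustrative:

```python
import torch
import torch.nn as nn

sentence = "The course contents are well organized"

# Tokenize: split the sentence into tokens
tokens = sentence.split()       # ['The', 'course', 'contents', 'are', 'well', 'organized']

# Numericalize: map each token to an integer id (indices as on the slide)
vocab = {'contents': 1, 'The': 2, 'are': 3, 'course': 4, 'well': 5, 'organized': 6}
ids = torch.tensor([vocab[t] for t in tokens])   # tensor([2, 4, 1, 3, 5, 6])

# Embed: look up a dense 1 x 1024 vector for every token id
embedding = nn.Embedding(num_embeddings=len(vocab) + 1, embedding_dim=1024)
vectors = embedding(ids)        # shape: (6, 1024)
print(vectors.shape)
```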

Requirements for DL Framework:

Ability to create \(n\)-dimensional arrays, also called tensors

Scalar (0-dim tensor): 0.1

Vector (1-dim tensor): [0.1,0.2,0.3]

Matrix (2-dim tensor):

\begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}

Stacked matrices (3-dim tensor):

\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
\quad
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}
\quad
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
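In PyTorch these four cases look like this (a minimal sketch; the example values are illustrative):

```python
import torch

scalar = torch.tensor(0.1)                      # 0-dim tensor
vector = torch.tensor([0.1, 0.2, 0.3])          # 1-dim tensor
matrix = torch.tensor([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])              # 2-dim tensor
stacked = torch.stack([matrix, matrix, matrix]) # 3-dim tensor of shape (3, 3, 3)

for t in (scalar, vector, matrix, stacked):
    print(t.dim(), tuple(t.shape))
# 0 ()    1 (3,)    2 (3, 3)    3 (3, 3, 3)
```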

Provide efficient ways to manipulate them (Accessing elements, mathematical operations, ..)

Most commonly used operations:

flatten():

\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow \begin{bmatrix} 0 & 1 & 2 & 1 & 1 & 1 & 0 & 1 & 0 \end{bmatrix}

reshape(1,9):

\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow \big[\begin{bmatrix} 0 & 1 & 2 & 1 & 1 & 1 & 0 & 1 & 0 \end{bmatrix}\big]

transpose(0,1):

\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix} \rightarrow \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 2 & 1 & 0 \end{bmatrix}

cat(, dim=1):

\bigg( \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 2 & 1 & 0 \end{bmatrix} \bigg) \rightarrow \begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 0 &2 & 1 & 0 \end{bmatrix}
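The same manipulations in PyTorch (a minimal sketch using the matrix above):

```python
import torch

x = torch.tensor([[0, 1, 2],
                  [1, 1, 1],
                  [0, 1, 0]])

x.flatten()               # tensor([0, 1, 2, 1, 1, 1, 0, 1, 0])    -> shape (9,)
x.reshape(1, 9)           # tensor([[0, 1, 2, 1, 1, 1, 0, 1, 0]])  -> shape (1, 9)
y = x.transpose(0, 1)     # swaps rows and columns                 -> shape (3, 3)
torch.cat((x, y), dim=1)  # glue the two matrices side by side     -> shape (3, 6)
```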

Most commonly used operations:

\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
sum()
7
\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
sum(,dim=0)
[1,3,3]
\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
sum(,dim=1)
[3,3,1]

Scalar (dim=0)

Vector (dim=1)

Vector (dim=1)

Other reduction operations: mean, …

Also: accessing elements (indexing)
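The same reductions in PyTorch (a sketch; the mean and indexing lines are extra illustrations):

```python
import torch

x = torch.tensor([[0, 1, 2],
                  [1, 1, 1],
                  [0, 1, 0]])

x.sum()           # tensor(7)          -> scalar (0-dim)
x.sum(dim=0)      # tensor([1, 3, 3])  -> column-wise sums, 1-dim
x.sum(dim=1)      # tensor([3, 3, 1])  -> row-wise sums, 1-dim
x.float().mean()  # tensor(0.7778)     -> mean needs a floating-point tensor
x[0, 2]           # tensor(2)          -> accessing a single element
```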

Most commonly used operations:

\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
sigmoid()
\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
softmax(,dim=0)
\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}
softmax(,dim=1)

across rows

across columns
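In PyTorch (a minimal sketch; note that sigmoid is element-wise while softmax needs a dim):

```python
import torch

x = torch.tensor([[0., 1., 2.],
                  [1., 1., 1.],
                  [0., 1., 0.]])

torch.sigmoid(x)          # element-wise, same shape as x
torch.softmax(x, dim=0)   # each column sums to 1 (normalized across rows)
torch.softmax(x, dim=1)   # each row sums to 1 (normalized across columns)
```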

Core: tensors

JIT

nn

Optim

multiprocessing

quantization

sparse

ONNX

Distributed

fast.ai

Detectron 2

Horovod

Flair

AllenNLP

torchvision

BoTorch

Glow

Lightning

Skorch

Creating Tensors in PyTorch: switch to Colab
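A minimal sketch of common tensor-creation calls you can try in Colab (shapes and values are arbitrary):

```python
import torch

torch.tensor([[1, 0, 1], [0, 1, 0], [1, 0, 1]])  # from a Python list
torch.zeros(3, 3)        # all zeros
torch.ones(2, 3)         # all ones
torch.rand(3, 3)         # uniform random values in [0, 1)
torch.randn(3, 3)        # standard normal random values
torch.arange(0, 10, 2)   # tensor([0, 2, 4, 6, 8])
torch.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1
```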

Why is image recognition a complex task?

How many patterns are possible with an image of size \(3 \times 3\)?

2^{3 \times 3}= 2^9=512

Of course, not all of them are relevant to a particular problem in most situations.

Recognize whether an image contains a single vertical line or not

The output is True for a very small subset of images and False for the rest

What do you perceive?

How many patterns are possible with an image of size \(5 \times 5\)?

2^{5 \times 5 }= 2^{25} \approx 33 \times 10^6

How many patterns are possible with a colour image of size \(5 \times 5\)?

256^{5 \times 5 \times 3}= 256^{75} \approx 10^{180}

This is a challenge for Generative modelling 

Our focus is on discriminative modelling

Two configurations out of millions of possibilities

Neuron

\hat{y}=f(\mathbf{w}^T\mathbf{x}) \quad \mathbf{x} \in \mathbb{R}^{n+1}, \quad \mathbf{w} \in \mathbb{R}^{n+1}
[Figure: a single neuron with inputs \(x_0, x_1, \dots, x_n\), weights \(w_0, w_1, \dots, w_n\), weighted sum \(\sum x_i w_i\), activation \(f\), and output \(\hat{y}\)]

The parameter \(\mathbf{w}\) is randomly initialized

loss=\mathscr{L}(\hat{y},y)

Compute the gradient of the loss w.r.t. \(\mathbf{w}\): \(\nabla_{\mathbf{w}} \mathscr{L}\)

Update rule:

\mathbf{w}:= \mathbf{w}-\eta \nabla_{\mathbf{w}} \mathscr{L}

The non-linear function \(f\) in neurons is called activation function

Check the performance with a set of criteria. Iteratively update the parameters to improve the performance
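A minimal sketch of this recipe for a single neuron, using autograd for the gradient; the toy input, label, learning rate, and the choice of sigmoid with binary cross entropy are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# toy data: one flattened 9-pixel image and its binary label (illustrative values)
x = torch.tensor([1., 0., 1., 0., 1., 0., 1., 0., 1.])
y = torch.tensor([1.])

# randomly initialized parameters: weights w and bias b
w = torch.randn(9, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

eta = 0.1                                 # learning rate
for step in range(100):
    y_hat = torch.sigmoid(w @ x + b)      # f(w^T x), with f = sigmoid
    loss = F.binary_cross_entropy(y_hat, y)
    loss.backward()                       # gradient of the loss w.r.t. w and b
    with torch.no_grad():                 # update rule: w := w - eta * grad
        w -= eta * w.grad
        b -= eta * b.grad
        w.grad.zero_()
        b.grad.zero_()
```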

Wait, the image is not 1D, so how do we feed it to a neuron as an input?

\begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}

Note: Input elements are real-valued in general.

[Figure: the flattened image vector fed as input to a single neuron, with \(x_0 = 1\) as the bias input]

List of Activation functions

A few more design choices (a quick sketch of these in PyTorch follows the list):

  • Loss functions
  • Optimizers: SGD, adaptive
  • Hyper-parameters: learning rate (schedulers), momentum, ..
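A hedged sketch of how these choices look in PyTorch; the nn.Linear placeholder model, the learning rates and the StepLR scheduler are illustrative, not prescriptions:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                  # any model; a single linear layer as a placeholder

# Loss functions
mse = nn.MSELoss()
bce = nn.BCELoss()
ce  = nn.CrossEntropyLoss()

# Optimizers: plain SGD (optionally with momentum) or an adaptive method such as Adam
sgd  = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = optim.Adam(model.parameters(), lr=1e-3)

# Hyper-parameters such as the learning rate can be scheduled during training
scheduler = optim.lr_scheduler.StepLR(sgd, step_size=10, gamma=0.1)
```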

Why do we need different Activation functions?

Mainly to ensure that the gradients flow across the entire network

Feed Forward Neural Networks

(Multi-Layer Perceptron: a network of perceptrons)

The input to the network is an \(n\)-dimensional vector

The network contains \(L-1\) hidden layers (2, in this case) having \(n\) neurons each

Finally, there is one output layer containing \(k\) neurons (say, corresponding to \(k\) classes)

[Figure: a feed-forward network with inputs \(x_1, x_2, \dots, x_n\), pre-activations \(a_1, a_2, a_3\), activations \(h_1, h_2\) and \(h_L=\hat {y} = \hat{f}(x)\), weights \(W_1, W_2, W_3\) and biases \(b_1, b_2, b_3\)]

Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation (\(a_i\) and \(h_i\) are vectors)

The input layer can be called the \(0\)-th layer and the output layer can be called the \(L\)-th layer

\(W_i \in \R^{n \times n}\) and \(b_i \in \R^n\) are the weight and bias between layers \(i-1\) and \(i\) \((0 < i < L)\)

\(W_L \in \R^{k \times n}\) and \(b_L \in \R^k\) are the weight and bias between the last hidden layer and the output layer (\(L = 3\) in this case)

The pre-activation at layer \(i\) is given by

\(a_i(x) = b_i +W_ih_{i-1}(x)\)

The activation at layer \(i\) is given by

\(h_i(x) = g(a_i(x))\)

where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

\(f(x) = h_L(x)=O(a_L(x))\)

where \(O\) is the output activation function (for example, softmax, linear, etc.)

To simplify notation we will refer to \(a_i(x)\) as \(a_i\) and \(h_i(x)\) as \(h_i\)


The pre-activation at layer \(i\) is given by

\(a_i = b_i +W_ih_{i-1}\)

The activation at layer \(i\) is given by

\(h_i = g(a_i)\)

where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

\(f(x) = h_L=O(a_L)\)

where \(O\) is the output activation function (for example, softmax, linear, etc.)
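A sketch of the network described above as an nn.Module; the sigmoid activation, softmax output and the layer sizes passed in are illustrative choices consistent with the slides:

```python
import torch
import torch.nn as nn

class FFNN(nn.Module):
    """L = 3: two hidden layers of n neurons and an output layer of k neurons."""
    def __init__(self, n, k):
        super().__init__()
        self.layer1 = nn.Linear(n, n)   # W_1, b_1
        self.layer2 = nn.Linear(n, n)   # W_2, b_2
        self.layer3 = nn.Linear(n, k)   # W_3 (i.e. W_L), b_3
        self.g = nn.Sigmoid()           # activation function g

    def forward(self, x):
        h1 = self.g(self.layer1(x))       # a_1 = b_1 + W_1 x,   h_1 = g(a_1)
        h2 = self.g(self.layer2(h1))      # a_2 = b_2 + W_2 h_1, h_2 = g(a_2)
        a3 = self.layer3(h2)              # a_3 = b_3 + W_3 h_2
        return torch.softmax(a3, dim=-1)  # h_L = O(a_3), with O = softmax

y_hat = FFNN(n=3, k=2)(torch.tensor([[1.5, 2.5, 3.0]]))
```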


Data: \(\lbrace x_i,y_i \rbrace_{i=1}^N\)

Model:

\(\hat y_i = \hat{f}(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)\)

Parameters:

\(\theta = W_1, ..., W_L, b_1, ..., b_L \ (L = 3)\)

Algorithm: Gradient Descent with Back-propagation

Objective/Loss/Error function: Say,

\(\min \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)

\(\text {In general,}\) \(\min \ \mathscr{L}(\theta)\)

where \(\mathscr{L}(\theta)\) is some function of the parameters
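A sketch of "gradient descent with back-propagation" for this setup; the random toy data, the squared-error loss, the learning rate and the epoch count are illustrative:

```python
import torch
import torch.nn as nn

n, k, N = 3, 2, 100
model = nn.Sequential(                  # same architecture as above
    nn.Linear(n, n), nn.Sigmoid(),
    nn.Linear(n, n), nn.Sigmoid(),
    nn.Linear(n, k), nn.Softmax(dim=-1),
)
X = torch.randn(N, n)                   # toy data {x_i, y_i}, i = 1..N (illustrative)
Y = torch.softmax(torch.randn(N, k), dim=-1)

criterion = nn.MSELoss()                # squared-error objective (mean over all entries)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    optimizer.zero_grad()               # clear old gradients
    loss = criterion(model(X), Y)       # forward pass and loss L(theta)
    loss.backward()                     # back-propagation: gradients w.r.t. theta
    optimizer.step()                    # theta := theta - lr * grad
```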


"Forward Pass"

Input: \(x=[1.5, 2.5, 3]\), target: \(y=[1, 0]\)

\(W_1=\) \begin{bmatrix} 0.05 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.05 \\ \end{bmatrix}, \(b_1\) = [0.01,0.02,0.03]

\(W_2=\) \begin{bmatrix} 0.025 & 0.025 & 0.025 \\ 0.025 & 0.025 & 0.025 \\ 0.025 & 0.025 & 0.025 \\ \end{bmatrix}, \(b_2\) = [0.01,0.02,0.03]

\(W_3=\) \begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ \end{bmatrix}, \(b_3\) = [0.01,0.02]

\([a_1]=[1.5,2.5,3]*W_1 + [0.01,0.02,0.03] = [0.36, 0.37, 0.38]\)

\([h_1]=sigmoid(a_1) = [0.589, 0.591, 0.593]\)

\([a_2]=[0.589,0.591,0.593]*W_2 + [0.01,0.02,0.03] = [0.054, 0.064, 0.074]\)

\([h_2]=sigmoid(a_2) = [0.513, 0.516, 0.518]\)

\([a_3]=[0.513,0.516,0.518]*W_3 + [0.01,0.02] = [1.558, 1.568]\)

\(\hat y = [h_3]=softmax(a_3) = [0.497, 0.502]\)

"Binary Cross Entropy Loss"

\(\mathscr {L}(\theta) = -\frac{1}{N} \sum_{i=1}^N (y_i\log(\hat y_i)+(1-y_i)\log(1- \hat y_i))\) = 0.6981

[Figure: the flattened image vector fed to a single neuron, as before]

[Figure: the same neuron, with weights \(w_0, w_1, w_2, w_3, w_4\), applied to successive patches of the image (e.g. \([1,1,1,0]\), \([0,1,0,1]\), \([0,1,1,1]\), ...), producing the outputs \(O_1, O_2, O_3, O_4\)]

The weights are shared!

Convolution

[Figure: convolving the Image with a filter/kernel produces an Activation Map]

A single neuron produces one activation map

[Figure: the activation map contains \(O_1, O_2, O_3, O_4\); the weights are shared]

Max Pooling

[Figure: max pooling over the activation map \(O_1, \dots, O_4\)]
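A minimal sketch of convolution and max pooling in PyTorch; the \(4 \times 4\) image, the \(2 \times 2\) kernel and the ReLU are illustrative choices:

```python
import torch
import torch.nn as nn

# a single-channel 4x4 "image" (illustrative values), shape (batch, channels, H, W)
image = torch.tensor([[[[1., 1., 0., 0.],
                        [1., 0., 1., 0.],
                        [0., 1., 1., 1.],
                        [0., 1., 0., 1.]]]])

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2)  # one filter = one neuron
activation_map = torch.relu(conv(image))   # the shared weights slide over every 2x2 patch
print(activation_map.shape)                # (1, 1, 3, 3): one activation map

pool = nn.MaxPool2d(kernel_size=2, stride=1)
print(pool(activation_map).shape)          # (1, 1, 2, 2): max over each 2x2 window
```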

Recurrent Neural Networks

Sequence

"Text is something different from images"

Tokenizer

[Text, is, something, different, from, images]

Numericalize

\begin{bmatrix} 10\\ 2\\ 1\\ 6\\ 9\\ 5 \end{bmatrix}

Text is an ordered sequence

Embedding Matrix

[Figure: the embedding matrix maps each token index (10, 2, ..., 5) to a dense embedding vector]
"Text:10"

Embedding

[Figure: the embedding vector of "Text" fed to a single neuron]

"Is:2"

Embedding

[Figure: the embedding vector of "Is" fed to the same neuron]

The weights are shared!

"Images:5"

Embedding

[Figure: the embedding vector of "Images" fed to the same neuron]

Replace the single neuron with a layer of neurons!

The weights are shared!

Text is an ordered sequence

We want the model to capture the dependency!
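A minimal sketch of this idea with nn.Embedding and nn.RNN, which applies the same (shared) weights at every step of the ordered sequence; the vocabulary size, embedding and hidden dimensions are illustrative:

```python
import torch
import torch.nn as nn

# token ids for "Text is something different from images" (as on the slides)
ids = torch.tensor([[10, 2, 1, 6, 9, 5]])        # shape: (batch=1, seq_len=6)

embedding = nn.Embedding(num_embeddings=20, embedding_dim=8)  # sizes are illustrative
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # same weights at every step

vectors = embedding(ids)       # (1, 6, 8): one dense vector per token
outputs, h_n = rnn(vectors)    # outputs: (1, 6, 16), h_n: (1, 1, 16)
# h_n summarizes the whole ordered sequence, capturing the dependency between tokens
```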