Deep Learning Practice - NLP

Recap of NN models, Roadmap, Datasets

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Module 1 : Natural Language Processing


What is it?

Enabling computers to understand, interpret, and generate human language.

I was watching a movie with my friends at INOX, Chennai. The movie was stunning!

So that, given an input text, we can ask the computer to, for example:

  • NER: I was watching a movie with my friends at INOX[ORG], Chennai[LOC]. The movie was stunning!

  • Translate: சென்னை INOXல் நண்பர்களுடன் படம் பார்த்துக்கொண்டிருந்தேன். படம் பிரமிக்க வைத்தது!

  • Sentiment: predict the sentiment of the text

What are the methods?

[Timeline: task-solving capacity of language models, 1990 to 2024]

  • 1990: Statistical LMs (n-gram models): specific task helpers

  • 2013: Neural LMs (word2vec, context modelling): task-agnostic feature learners

  • 2017: Transformers and transfer learning for NLP (ELMo, GPT, BERT): pre-train, then fine-tune

  • 2020: LLMs (GPT-4, Llama): general language models with emergent abilities

  • 2024: LLMs continue to grow in scale and task-solving capacity

Module 2 : Neural Network Models


MLP

[MLP diagram: inputs \(x_1, x_2, \dots, x_n\) passing through three layers with parameters \((W_1, b_1)\), \((W_2, b_2)\), \((W_3, b_3)\)]

Let's quickly recap the four types of architectures that we learned in the deep learning theory course

We started with a simple Multi-Layer Perceptron (MLP) and the backpropagation algorithm for learning the parameters

It is a feedforward fully connected neural network

Neural Network Models

CNN

It is also a feedforward neural network commonly used in the domain of computer vision

We introduced the concept of weight sharing, where the weights of a kernel are shared across the input.

This idea of weight sharing is prevalent in almost all modern neural networks

Recurrent networks such as RNNs and LSTMs, as well as Transformers, use the same weights for all elements/timesteps in the input sequence

RNN

[RNN diagram: input \(x_i\), hidden state \(s_i\), output \(y_i\), with shared parameters \(U\), \(W\), \(V\)]

Transformer

However, the training of Transformers can be parallelized across timesteps, unlike that of RNNs and LSTMs

Weight sharing makes these models well suited to Natural Language Processing tasks, as it enables them to generalize to sequences of any length

The parallelism, in turn, gives the Transformer an edge over recurrent neural networks

Today, we can quickly build any of these architectures using modern deep learning frameworks such as PyTorch and TensorFlow

In this course, we will use PyTorch or frameworks built on top of it


In PyTorch, every learnable layer is derived from `torch.nn.Module`


All architectures contain a sequence of learnable layers (linear, convolution, RNN, embedding, ...)

Therefore, any architecture can be composed by subclassing `torch.nn.Module`

Let's build all four architectures with a few lines of code

MLP

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, x_dim, h1_dim, h2_dim, out_dim):
        super().__init__()
        # three fully connected (linear) layers
        self.w1 = nn.Linear(x_dim, h1_dim, bias=True)
        self.w2 = nn.Linear(h1_dim, h2_dim, bias=True)
        self.w3 = nn.Linear(h2_dim, out_dim, bias=True)

    def forward(self, x):
        # ReLU non-linearity after every linear layer, as in the diagram
        out = torch.relu(self.w1(x))
        out = torch.relu(self.w2(out))
        out = torch.relu(self.w3(out))
        return out

Implementation in PyTorch


The model is composed of linear layers and non-linear activations (here, ReLU). Therefore, the code contains only `nn.Linear` and `torch.relu`
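As a quick sanity check, the MLP can be instantiated and run on a random batch (a minimal sketch; the dimensions below are arbitrary illustrative choices):

model = MLP(x_dim=16, h1_dim=32, h2_dim=16, out_dim=4)
x = torch.randn(8, 16)    # a batch of 8 input vectors of dimension 16
out = model(x)
print(out.shape)          # torch.Size([8, 4])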

CNN

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # two convolution + pooling stages (assumes a 3x32x32 input)
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # fully connected classifier head
        self.fc1 = nn.Linear(16 * 5 * 5, 10)
        self.fc2 = nn.Linear(10, 4)
        self.fc3 = nn.Linear(4, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)   # flatten the feature maps
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Implementation in PyTorch

The model is composed of `nn.Conv2d`, `nn.MaxPool2d`, `nn.Linear`, and `F.relu`
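For example, passing a random batch of images through this CNN (a sketch; the 3x32x32 input size is an assumption that matches the 16 * 5 * 5 flattened dimension):

model = CNN()
images = torch.randn(8, 3, 32, 32)   # a batch of 8 RGB images of size 32x32
logits = model(images)
print(logits.shape)                  # torch.Size([8, 2])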

RNN

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,
                                      embed_dim,
                                      padding_idx=0)
        self.rnn = nn.RNN(embed_dim,
                          hidden_dim,
                          batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_class)

    def forward(self, x, length):
        x = self.embedding(x)
        # pack the padded batch so the RNN skips the padding positions
        x = pack_padded_sequence(x,
                                 lengths=length,
                                 enforce_sorted=False,
                                 batch_first=True)
        _, hidden = self.rnn(x)
        # classify using the final hidden state of the last layer
        return self.fc(hidden[-1])


Implementation in PyTorch

The model is composed of `nn.Linear`, `nn.Embedding`, and `nn.RNN`

`nn.RNN` is itself composed of linear transformations, a non-linearity, and other required modules
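A usage sketch for this RNN classifier (hypothetical sizes; sequences are padded with index 0 and their true lengths are passed alongside):

model = RNN(vocab_size=100, embed_dim=16, hidden_dim=32, num_class=2)
x = torch.tensor([[5, 8, 2, 0],
                  [7, 3, 0, 0]])     # two padded token-id sequences
lengths = torch.tensor([3, 2])       # true lengths before padding
logits = model(x, lengths)
print(logits.shape)                  # torch.Size([2, 2])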

Transformer

import torch
import torch.nn as nn

class TRANSFORMER(nn.Module):
    def __init__(self, vocab_size, d_model, nhead,
                 num_encoder_layers, num_decoder_layers, dim_feedforward):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,
                                      d_model,
                                      padding_idx=0)
        self.transformer = nn.Transformer(d_model,
                                          nhead,
                                          num_encoder_layers,
                                          num_decoder_layers,
                                          dim_feedforward,
                                          batch_first=True)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # nn.Transformer expects both a source and a target sequence
        src = self.embedding(src)
        tgt = self.embedding(tgt)
        out = self.transformer(src, tgt)
        # project each output position onto the vocabulary
        return self.fc(out)

[Transformer diagram: encoder block (Multi-Head Attention → Add&Norm → Feed-forward NN → Add&Norm) and decoder block (Multi-Head Masked Attention → Add&Norm → Multi-Head Cross-Attention → Add&Norm → Feed-forward NN → Add&Norm)]

Implementation in PyTorch

The model is composed of `nn.Embedding`, `nn.Transformer`, and `nn.Linear`

`nn.Transformer` is itself composed of `nn.MultiheadAttention` and other required modules
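A usage sketch for this Transformer (hypothetical sizes, assuming the batch_first layout used above; it maps a source and a target sequence of token ids to vocabulary logits at every target position):

model = TRANSFORMER(vocab_size=1000, d_model=64, nhead=4,
                    num_encoder_layers=2, num_decoder_layers=2,
                    dim_feedforward=128)
src = torch.randint(1, 1000, (8, 12))   # 8 source sequences of length 12
tgt = torch.randint(1, 1000, (8, 10))   # 8 target sequences of length 10
logits = model(src, tgt)
print(logits.shape)                     # torch.Size([8, 10, 1000])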

We can create any architecture with a few lines of code. However, this is just a part of the entire training setup.


The training setup contains many other components. Let's see

Experimental Setup

Dataset → Encode Text → Initialize Model → Train the Model → Decode Predictions

How do we encode the input text?

Experimental Setup

Dataset → Tokenizer (learn the vocabulary) → Encode Text → Initialize Model → Train the Model → Decode Predictions

Experimental Setup

Dataset → Tokenizer (learn the vocabulary) → Encode Text → Initialize Model → Train the Model → Decode Predictions → Evaluate the Model

It requires a significant amount of time and effort to set this up properly


 Hugging Face Does the Heavy Lifting

import datasets
import tokenizers
tokenizer.encode()
tokenizer.decode()
import transformers
import evaluate
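As a rough end-to-end sketch of how these pieces fit together (the IMDB dataset, the bert-base-uncased checkpoint, and the accuracy metric are illustrative assumptions; the tokenizer here comes from the transformers wrappers rather than the low-level tokenizers library):

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate

dataset = load_dataset("imdb")                            # a dataset from the Hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("The movie was stunning!")         # text -> token ids
print(tokenizer.decode(ids))                              # token ids -> text
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
metric = evaluate.load("accuracy")                        # an evaluation metric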

 Hugging Face Hub


We have access to 150K+ datasets and 350K+ models via the Hugging Face Hub

Datasets: 150K | Models: 350K | Spaces: 170K

Rich Ecosystem

import datasets
import tokenizers
import transformers
import evaluate
from accelerate ...
from optimum ...
from peft ...
import bitsandbytes
...

Module 2 : Roadmap


RNN based models for NLP


Transformer based models  for NLP


Paradigm shift

Traditional NLP: task-specific supervised training. A separate Transformer is trained for each task, e.g. input text → predict the class/sentiment, input text → summarize, question + input text → answer.

Modern NLP: language modelling over (cleaned) raw text data, i.e. pre-training, followed by fine-tuning and instruction tuning. A single model takes a prompt (the input text, e.g. predict the sentiment, summarize, fill in the blank, generate a story) and produces an output response conditioned on the prompt.
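As a minimal illustration of the prompting paradigm (a sketch assuming the Hugging Face text-generation pipeline and the publicly available gpt2 checkpoint; an instruction-tuned model would follow such prompts far more reliably):

from transformers import pipeline

# one general-purpose model; the task is specified in the prompt itself
generator = pipeline("text-generation", model="gpt2")
prompt = "Review: The movie was stunning! Sentiment:"
print(generator(prompt, max_new_tokens=3)[0]["generated_text"])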

Dataset → Tokenizer → Encode Text → Initialize Model → Train the Model → Decode Predictions → Evaluate the Model

Syllabus Outline

import datasets
import tokenizers
tokenizer.encode()
tokenizer.decode()
import transformers
import evaluate

We will start with the `datasets` module and see how HF simplifies loading different datasets. We will also learn to use some commonly used functions from this module

The HF Hub contains 150K+ datasets
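For example, a dataset can be pulled from the Hub in a couple of lines (a minimal sketch; the IMDB sentiment dataset is just an illustrative choice):

from datasets import load_dataset

dataset = load_dataset("imdb")     # downloads and caches the dataset from the Hub
print(dataset)                     # a DatasetDict with its splits
print(dataset["train"][0])         # the first training sample: {'text': ..., 'label': ...}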


For a given dataset, we will then use the `tokenizers` module to learn a vocabulary using different tokenization algorithms such as BPE, WordPiece, etc.
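For instance, a BPE tokenizer can be trained on a raw text corpus with the tokenizers library (a minimal sketch; corpus.txt and the vocabulary size are illustrative assumptions):

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # learn the vocabulary
ids = tokenizer.encode("The movie was stunning!").ids    # text -> token ids
print(tokenizer.decode(ids))                             # token ids -> text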


Finally, we will learn various strategies to train/fine-tune models using the `transformers` module

The Hub contains 350K+ models
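For example, fine-tuning a pre-trained model on a classification dataset can be set up with the Trainer API (a rough sketch; the bert-base-uncased checkpoint, the IMDB dataset, and the hyperparameters are illustrative assumptions):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
args = TrainingArguments(output_dir="out", num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)
trainer.train()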

Each week, we cover the concepts required to understand the various terminologies that are helpful while developing solutions to NLP problems

We expect you to revisit the transformer architecture in detail as covered in the deep learning theory course

This week, we will delve deeper into the dataset part of the development cycle.

Module 3 : Datasets


Learning starts with a dataset

We have  a wide range of NLP tasks  such as 

  • Sentiment classification

  • Machine Translation

  • Named Entity Recognition

  • Question Answering

  • Textual entailment

  • Summarization

  • Generation

For each of these tasks, we may have hundreds of datasets with thousands of samples

For example, for sentiment classification we have

  • Amazon reviews

  • IMDB movie reviews

  • Twitter financial news sentiment

  • Poem sentiment

All these datasets typically have thousands to millions (if not billions) of words

Similarly, one can find numerous datasets for other tasks

In modern NLP, pre-training datasets contain billions to trillions of tokens

Task-specific datasets vs. pre-training datasets, which now contain trillions of tokens

[Chart: growth of pre-training dataset sizes, 2015 to 2024]

  • BookCorpus: 5 GB

  • Wikipedia: 21 GB

  • WebText (closed): 40 GB

  • RealNews: 110 GB

  • C4

  • The Pile: 800 GB

  • ROOTS: 1.6 TB

  • Falcon: 2.6 TB

  • RedPajama: 5 TB

  • DOLMA: 11 TB

These datasets come in different formats and sizes.

It would be convenient to have a single place from which we can load any of these datasets with a consistent call signature.

Otherwise, we need to understand each format and may need to write a script to parse and load the text data.

If memory is limited, we also need to implement mechanisms to stream samples from the dataset.

That's where Hugging Face's `datasets` module helps us.
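For instance, a large corpus can be streamed sample by sample instead of being downloaded in full (a sketch using the C4 corpus as an illustrative choice):

from datasets import load_dataset

# stream samples on the fly instead of downloading the full corpus
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for sample in c4.take(3):
    print(sample["text"][:80])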

What is pre-training?

What are tokens?

Are tokens and the words in a sentence one and the same?

Well, we will discuss all these in subsequent lectures. 

For now, let us just look at the datasets module from HF