Workshop on Transformer-based NLP using Hugging Face

Arun Prakash A

Outline

  • Conceptual Guide to Transformer Models

    • Transformer architecture
    • Processing input and output
    • Pre-training objectives for BERT, GPT, BART
    • Decoding strategies for GPT models
    • Evolution tree, and what is the best setting?

  • Why Hugging Face?

    • What is the motivation?
    • Various modules
    • HF Hub

Note: in this workshop we briefly cover the core concepts so that one can sail through the Hugging Face documentation.

For more details, you may read the lectures by Prof. Mitesh Khapra here (all lecture recordings will be made available on YouTube) or watch the lectures by Andrej Karpathy.

I assume that

  • You have completed the Deep Learning course

  • You have minimal experience in PyTorch or TensorFlow (so that you know the typical training cycle)

This talk covers the core concepts that one should know to use/fine-tune/pre-train language models

Let's get started

Generations of Language Modelling (in order of increasing task-solving capacity)

  • 1990: Statistical LMs. Specific Task Helper. n-gram models
  • 2013: Neural LMs. Task-agnostic Feature Learners. word2vec, context modelling
  • 2017: Transformers. Transfer Learning for NLP. ELMo, GPT, BERT (pre-train, fine-tune)
  • 2020: LLMs. General Language Models. GPT-3, InstructGPT (emergent abilities)

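Scaling Parameters

[Figure: number of synapses compared with number of model parameters]

  • # Synapses: Fruit Fly (>10^6), Honey Bee (10^9), Mouse (10^{12}), Cat (10^{13}), Brain (10^{15})
  • # Parameters: Transformer (400M), GPT-2 (1.5B), Megatron LM (10B), GPT-3 (175B), GShard onwards (1.1 T, 1.6 T, 30 T, ?)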

Scaling Dataset

[Figure: dataset sizes on a log axis with ticks at 1 GB, 10 GB, 100 GB, 1000 GB]

  • BookCorpus: 5GB
  • Wikipedia: 21GB
  • OpenWebText: 40GB
  • RealNews: 110GB
  • C4, The Pile: 800GB
  • ROOTS: 1600GB

What is the scaling ratio between Model and Dataset?

Kaplan Scaling law, Chinchilla Scaling law

How best can we make use of the available compute resources?

  • Train a larger model for fewer training steps
  • Train a smaller model for more training steps
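The trade-off above can be made concrete with a small back-of-the-envelope sketch. It assumes two widely quoted approximations that are not stated on the slides: training compute C ≈ 6·N·D FLOPs for N parameters and D training tokens, and the Chinchilla rule of thumb of roughly 20 tokens per parameter. The function name and the default constant are illustrative only.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumption: C ~= 6 * N * D and D ~= tokens_per_param * N.
    Both are rules of thumb, not exact laws.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equivalent to training 70e9 parameters on 1.4e12 tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_split(budget)
print(f"parameters ~ {n_params:.2e}, tokens ~ {n_tokens:.2e}")
```

Under these assumptions, a fixed budget is better spent on a smaller model trained on more tokens than on a very large model trained briefly.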

 Image source: Mooler


Scaling laws


GPT-3 with 175 Billion parameters (proprietary) 

GPT-J with 6 Billion parameters (open sourced)

Jurassic-1 with 178 Billion parameters (proprietary) 

Gopher with 280 Billion parameters (proprietary) 

GLaM with 1200 Billion parameters (proprietary) 

InstructGPT proposed an RLHF-based alignment objective for conversational models

Decoder-only models dominated the field

Meta released OPT with 175 Billion parameters (open sourced)

ChatGPT built on top of the ideas in InstructGPT

The Transformer

A simple encoder-decoder model with an attention mechanism at its core

Takes in a sequence [of words: tokens: embeddings]

Outputs a sequence [conditional probabilities: predicted tokens]

[Figure: encoder-decoder Transformer, each stack repeated N \times]

  • Encoder layer (fed by the Source Input Block): Multi-Head Attention, Add&Norm, Feed forward NN, Add&Norm
  • Decoder layer (fed by the Target Input Block): Multi-Head Masked Attention, Add&Norm, Multi-Head cross Attention, Add&Norm, Feed forward NN, Add&Norm
  • Output Block (tied)

Let's first understand the input block

Encoder (Source Input Block): I am reading a book

Decoder (Target Input Block): Naan oru puthagathai padiththu kondirukiren

Input Block

I am reading a book -> Tokenizer -> Token Ids -> embeddings

Input Block

Tokenizer

I am reading a book -> ["i", "am", "reading", "a", "book"]

Contains:

  1. Normalizer: Lowercase (I -> i)
  2. Pre-tokenizer: Whitespace
  3. Tokenization algorithm: BPE

  • We have to train the tokenization algorithm using all the samples from a dataset

  • It constructs a vocabulary of size \(|V|\) (typically 30000+ or 50000+)

  • Then the tokenizer splits the input sequence into tokens (a token could be a whole word or a sub-word)
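The three stages can be sketched in plain Python. This is a toy illustration, not the implementation used by any Hugging Face tokenizer: it omits end-of-word markers, byte-level handling, and special tokens, and all function names are made up for this example.

```python
from collections import Counter

def normalize(text):
    # Normalizer: lowercase (I -> i)
    return text.lower()

def pre_tokenize(text):
    # Pre-tokenizer: split on whitespace
    return text.split()

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` by the merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(corpus, num_merges):
    # Start from characters and repeatedly merge the most frequent pair.
    words = Counter()
    for line in corpus:
        for word in pre_tokenize(normalize(line)):
            words[tuple(word)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges

def tokenize(text, merges):
    # Apply the learned merges, in training order, to each word.
    tokens = []
    for word in pre_tokenize(normalize(text)):
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens

corpus = ["I am reading a book"]
merges = train_bpe(corpus, num_merges=50)
print(tokenize("I am reading a book", merges))
print(tokenize("Books", merges))  # unseen word falls back to sub-words
```

With a real corpus the number of merges controls the vocabulary size: more merges yield more whole-word tokens, fewer merges yield more sub-word tokens.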

Input Block

Tokenizer

I am reading a book
["i", "am", "reading", "a", "book"]
  • Each token is assigned a unique integer (ID)

  • These IDs are specific to a tokenizer trained on a particular dataset

  • Therefore, we have to use the same tokenizer (the one used to train the model) for all downstream tasks

Token Ids

["i":2, "am":8, "reading":75, "a":4, "book":100]
  • Model-specific special tokens are inserted into the existing list of token IDs, for example

["[BOS]":1, "i":2, "am":8, "reading":75, "a":4, "book":100, "[EOS]":3]
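A minimal sketch of the token-to-ID step, using the example IDs listed above. The `encode` and `decode` helpers are hypothetical names for this illustration; real tokenizers expose their own methods.

```python
# Vocabulary with the example IDs from the slide.
vocab = {"[BOS]": 1, "i": 2, "[EOS]": 3, "a": 4, "am": 8, "reading": 75, "book": 100}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def encode(tokens, vocab, bos="[BOS]", eos="[EOS]"):
    # Map each token to its ID and wrap the sequence in special tokens.
    return [vocab[bos]] + [vocab[t] for t in tokens] + [vocab[eos]]

def decode(ids, id_to_token):
    return [id_to_token[i] for i in ids]

ids = encode(["i", "am", "reading", "a", "book"], vocab)
print(ids)
print(decode(ids, id_to_token))
```

Because the mapping lives in the tokenizer's vocabulary, a model paired with a different tokenizer would receive IDs that point to the wrong embedding rows.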

Input Block

I am reading a book -> Tokenizer -> Token Ids -> embeddings

[[BOS]:1, i:2, am:8, reading:75, a:4, book:100, [EOS]:0]

  • Embedding is a look-up table that returns a vector of size, say, 512, 1024, 2048, ...

[Figure: embedding table with rows indexed 0, 1, 2, ..., 7, \vdots, |V|^+]
  • The token "i" is assigned a vector at index 2 in the embedding table

  • The token "a" is assigned a vector at index 4 in the embedding table

  • This mapping goes on for all the tokens in an input sequence

  • All the embedding vectors are LEARNABLE
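The look-up can be sketched as plain list indexing. The sizes below are toy values chosen for illustration, and the table here is only randomly initialized; in a real model each row is a parameter updated during training.

```python
import random

random.seed(0)
vocab_size, d_model = 128, 8  # toy sizes, far smaller than in practice

# One row per token ID, randomly initialized.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(d_model)]
    for _ in range(vocab_size)
]

def embed(token_ids, table):
    # Row i of the table is the vector for token ID i.
    return [table[i] for i in token_ids]

token_ids = [1, 2, 8, 75, 4, 100, 3]  # [BOS] i am reading a book [EOS]
vectors = embed(token_ids, embedding_table)
print(len(vectors), "vectors of size", len(vectors[0]))
```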

Input Block

  • We have another embedding table to encode the position of tokens [learnable or fixed]

  • We add these position embeddings to the corresponding token embeddings

  • The number of position embeddings depends on the context (window) length of the model

  • THE ENTIRE PROCESS IS REPEATED FOR THE TARGET INPUT BLOCK

[Figure: token embedding table (rows 0, 1, 2, ..., |V|^+) + position embedding table (rows 0, 1, 2, ..., T), applied to the tokens "I am reading a book"]
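As one example of the "fixed" option, the sketch below builds the sinusoidal position table from the original Transformer paper and adds it to the token embeddings. Learnable position embeddings would instead be another trainable look-up table. The helper names are illustrative.

```python
import math

def sinusoidal_positions(context_length, d_model):
    # PE[pos][2k] = sin(pos / 10000^(2k/d_model)), PE[pos][2k+1] = cos(same angle)
    table = []
    for pos in range(context_length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(token_vectors, position_table):
    # The context length bounds how many tokens the model can take in.
    if len(token_vectors) > len(position_table):
        raise ValueError("sequence is longer than the context length")
    return [
        [t + p for t, p in zip(token_vec, position_table[pos])]
        for pos, token_vec in enumerate(token_vectors)
    ]

position_table = sinusoidal_positions(context_length=16, d_model=4)
token_vectors = [[0.0] * 4 for _ in range(5)]  # stand-in for 5 token embeddings
inputs = add_positions(token_vectors, position_table)
print(inputs[0])
```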

Input Block

Source Input Block

Embedding for each input token (one row per token of "I am reading a book"):

0.1, -0.01, 0.8
0.05, -0, 0.12
-0.5, 0.3, 0.21
0.9, 0.1, 0.25
0.11, -0.1, 0.06

  • The embedding vectors are randomly initialized

  • Along with these, we also pass in an attention mask and a padding mask for batch processing
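A minimal sketch of the padding mask used for batch processing. Sequences in a batch are padded to a common length, and the mask marks real tokens with 1 and padding with 0. The pad ID of 0 and the function name are assumptions made for this example.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[1, 2, 8, 75, 4, 100, 3], [1, 2, 8, 3]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention layers use the mask to ignore padded positions, so the padding does not influence the representations of real tokens.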

  • The encoder and decoder blocks of the transformer take in embedding vectors as input and produce an output probability distribution over the tokens in the vocabulary

[Figure: encoder-decoder Transformer (N \times) mapping the source embeddings of "I am reading a book" to the target tokens "Naan", "puthagathai", "padiththu", "kondirukiren", annotated with attention weights such as \alpha_{11}, \alpha_{25}, \alpha_{34}, \alpha_{42} and example embedding vectors for the source and target tokens]

Configuration

One can construct a transformer architecture given a configuration file with the following fields

  • \(d_{model}\): model dimension (= embedding dimension)
  • \(n_{heads}\): number of heads
  • \(d_{ff}\): hidden dimension of the feed-forward network (often, \(d_{ff}=4d_{model}\))
  • \(n_{layers}\): number of layers
  • \(dropout_{prob}\): for feed forward, attention, embedding
  • activation function
  • tie weights?
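These fields can be collected in a small configuration object. The class below is a hypothetical sketch, not the configuration class of any particular library; the default values are illustrative and follow the base setting of the original Transformer paper.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int = 512          # model (= embedding) dimension
    n_heads: int = 8            # number of attention heads
    n_layers: int = 6           # number of layers
    d_ff: int = 0               # 0 means "use 4 * d_model"
    dropout_prob: float = 0.1   # feed forward, attention, embedding
    activation: str = "relu"
    tie_weights: bool = True    # share input embedding and output projection

    def __post_init__(self):
        if self.d_ff == 0:
            self.d_ff = 4 * self.d_model
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self):
        # Each head works on an equal slice of the model dimension.
        return self.d_model // self.n_heads

config = TransformerConfig()
print(config, config.head_dim)
```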

The Transformer 

Originally proposed for the machine translation task

The field has evolved rapidly since then in multiple directions

  • Improvements in attention mechanisms, positional encoding techniques and so on

  • Scaling the size of the model (parameters, datasets, training steps)

  • Extensive studies on the choices of hyperparameters

Architectural Improvements

Position Encoding

  • Absolute
  • Relative
  • RoPE, NoPE
  • ALiBi

Attention

  • Full/sparse
  • Multi/grouped query
  • Flash
  • Paged

Normalization

  • Layer norm
  • RMS norm
  • DeepNorm

Activation

  • ReLU
  • GELU
  • Swish
  • SwiGLU
  • GeGLU

Feed forward

  • FFN
  • MoE

Norm placement

  • PostNorm
  • Pre-Norm
  • Sandwich

Pre-training and Fine-Tuning

Preparing labelled data for each task is laborious and costly

On the other hand,

We have a large amount of unlabelled text easily available on the internet 

[Figure: a separate Transformer per task: Input text -> Predict the class/sentiment; Input text -> Summarize; Question + Input text -> Answer]

Can we make use of such unlabelled data to train a model?

That is called pre-training.

However, what should be the training objective?

Pre-training  Objectives

Encoder

  • Layer: Multi-Head Attention, Add&Norm, Feed forward NN, Add&Norm
  • Input: \(x_1,<mask>,\cdots,x_{T}\)
  • Output: \(P(x_2=?)\)
  • Objective: MLM
  • Example: BERT

Decoder

  • Layer: Multi-Head masked Attention, Add&Norm, Feed forward NN, Add&Norm
  • Input: \(x_1,x_2,\cdots,x_{i-1}\)
  • Output: \(P(x_i)\)
  • Objective: CLM
  • Example: GPT

Encoder-Decoder

  • Encoder layer: Multi-Head Attention, Add&Norm, Feed forward NN, Add&Norm
  • Decoder layer: Multi-Head Masked Attention, Add&Norm, Multi-Head cross Attention, Add&Norm, Feed forward NN, Add&Norm
  • Encoder input: \(x_1,<mask>,\cdots,x_{T}\)
  • Decoder input: \(<go>\)
  • Output: \(\cdots,P(x_2|x_1,\cdots),\cdots\)
  • Objective: PLM, Denoising
  • Example: BART, T5

 Generative Pre-trained model (GPT)

Pre-training Objective: CLM

[Figure: a stack of five Transformer blocks (Transformer Block 1 to Transformer Block 5). The inputs are \langle go \rangle, x_1, x_2, x_3; block l produces hidden states h_{l1}, h_{l2}, h_{l3}, h_{l4}; the outputs of the last block are projected by W_v to give p(x_1), \cdots, p(x_4|x_3,x_2,x_1)]

\mathscr{L}=\sum \limits_{i=1}^T \log (P(x_i|x_1,\cdots,x_{i-1}))

The causal mask allows us to compute all these probabilities in a single go

Assume that we have now (pre)trained a model.

Now, we can fine-tune it on different tasks (with slight modifications in the output head) and put it to use for inference
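A small sketch of the two ideas above: the causal mask, which lets position i attend only to positions up to i, and the log-likelihood \(\mathscr{L}\) computed from the probabilities the model assigns to the correct next tokens. The probabilities below are made-up numbers; in practice one minimizes the negative of this sum.

```python
import math

def causal_mask(seq_len):
    # mask[i][j] == 1 means position i may attend to position j (j <= i).
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def clm_log_likelihood(correct_token_probs):
    # Sum over positions of log P(x_i | x_1, ..., x_{i-1}).
    return sum(math.log(p) for p in correct_token_probs)

mask = causal_mask(4)
for row in mask:
    print(row)

# Hypothetical probabilities assigned to the true tokens x_1..x_4.
probs = [0.2, 0.5, 0.9, 0.7]
log_likelihood = clm_log_likelihood(probs)
print("log-likelihood:", log_likelihood, "loss to minimize:", -log_likelihood)
```

Because every row of the mask hides the future positions, one forward pass over the whole sequence yields all T conditional probabilities at once.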

Decoding Strategies

Deterministic (suitable for translation)

  • Greedy Search
  • Beam Search

Stochastic (suitable for tasks like text generation, summarization)

  • Top-K sampling
  • Top-P (Nucleus Sampling)
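The sketch below illustrates three of these strategies for a single decoding step, given a next-token probability distribution. Beam search is omitted because it needs scores across several steps. The function names and the toy distribution are assumptions made for this example.

```python
import random

def greedy(probs):
    # Deterministic: always pick the most probable token.
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    # Keep the k most probable tokens and renormalize.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(distribution, rng):
    # Stochastic: draw one token ID according to the filtered distribution.
    ids = list(distribution)
    return rng.choices(ids, weights=[distribution[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution
rng = random.Random(0)
print("greedy:", greedy(probs))
print("top-k (k=2):", sample(top_k_filter(probs, 2), rng))
print("top-p (p=0.9):", sample(top_p_filter(probs, 0.9), rng))
```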

BERT: MLM

[Figure: the input "[mask] enjoyed the [mask] transformers" passes through a stack of Encoder Layers (attention, FFN, Normalization, Residual connection); W_v projects the outputs at the masked positions to the predictions \hat{y_1} and \hat{y_4}]

\mathscr{L_1}=-\log(\hat{y_1})
\mathscr{L_2}=-\log(\hat{y_4})
\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum \limits_{y_i \in \mathcal{M}} -\log(\hat{y_i})

A multi-layer bidirectional transformer encoder architecture.

The BERT Base model contains 12 layers with 12 attention heads per layer

The masked words (15%) in an input sequence are sampled uniformly

Of these, 80% are replaced with the [mask] token, 10% are replaced with random words, and the remaining 10% are retained as they are. (Why?)

Because the special [mask] token won't be a part of the dataset while adapting for downstream tasks

Is the pre-training objective of MLM sufficient for downstream tasks like Question-Answering where interaction between sentences is important?
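Before moving on to sentence pairs, the masking rule described above (select about 15% of the tokens, then apply the 80/10/10 split) can be sketched as follows. This is a simplified illustration: it selects each token independently with probability 0.15 rather than sampling an exact 15%, and the function name is made up.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15, mask_token="[mask]"):
    inputs = list(tokens)
    labels = [None] * len(tokens)  # None means "not predicted, no loss"
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_token          # 80%: replace with [mask]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return inputs, labels

vocab = ["i", "enjoyed", "the", "movie", "transformers", "visuals", "were", "amazing"]
tokens = ["i", "enjoyed", "the", "movie", "transformers"]
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
print(inputs)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, which matches the average over the masked set \(\mathcal{M}\) in the formula above.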

Now, let's extend the input with a pair of sentences (A,B) and the label that indicates whether the sentence B naturally follows sentence A.   

Next Sentence Prediction (NSP)

Special tokens: [CLS], [SEP]

input: Sentence A: [CLS] I enjoyed the movie transformers [SEP]
input: Sentence B: The visuals were amazing
Label: IsNext

[Figure: the token sequence passes through Self-Attention and the Feed Forward Network; W_v projects the output to the prediction \hat{y}]

\mathscr{L}=-\log(\hat{y})

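[Figure: input representation for the sentence pair. Each input embedding is the sum of a token embedding (E_T), a segment embedding (E_A or E_B), and a position embedding (E_0 to E_{10}); the sums are fed to Self-Attention and the Feed Forward Network, with \mathscr{L}=-\log(\hat{y})]

Token: Token Embeddings + Segment Embeddings + Position Embeddings

[CLS]: E_T + E_A + E_0
I: E_T + E_A + E_1
enjoyed: E_T + E_A + E_2
the: E_T + E_A + E_3
movie: E_T + E_A + E_4
transformers: E_T + E_A + E_5
[SEP]: E_T + E_A + E_6
The: E_T + E_B + E_7
visuals: E_T + E_B + E_8
were: E_T + E_B + E_9
amazing: E_T + E_B + E_{10}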
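A minimal sketch of how the three index sequences for a sentence pair can be built, following the layout shown above. The function name is hypothetical, and the layout mirrors the slide: the original BERT implementation also appends a final [SEP] after sentence B, which is left out here.

```python
def build_pair_input(tokens_a, tokens_b):
    # [CLS] + sentence A + [SEP] belong to segment A (0); sentence B to segment B (1).
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * len(tokens_b)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

sentence_a = ["I", "enjoyed", "the", "movie", "transformers"]
sentence_b = ["The", "visuals", "were", "amazing"]
tokens, segment_ids, position_ids = build_pair_input(sentence_a, sentence_b)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{tok}: E_T + E_{'AB'[seg]} + E_{pos}")
```

Each of the three ID sequences indexes its own embedding table, and the three looked-up vectors are summed per position.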

[Figure: the input "[CLS] [mask] enjoyed the [mask] transformers [SEP] The [mask] were amazing" passes through a stack of Encoder Layers; W_v projects the outputs at [CLS] and at the three masked positions]

Minimize the objective:

\mathscr{L_{cls}}=-\log(\hat{y})
\mathscr{L_1}=-\log(\hat{y_1})
\mathscr{L_2}=-\log(\hat{y_4})
\mathscr{L_3}=-\log(\hat{y_8})
\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum \limits_{y_i \in \mathcal{M}} -\log(\hat{y_i})+\mathscr{L_{cls}}
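As a worked example of the combined objective, the sketch below averages the negative log-probabilities at the masked positions and adds the NSP term. The probabilities are made-up numbers and the function name is illustrative.

```python
import math

def bert_pretraining_loss(masked_token_probs, nsp_prob):
    # MLM term: mean of -log(y_hat_i) over the masked positions.
    mlm_loss = sum(-math.log(p) for p in masked_token_probs) / len(masked_token_probs)
    # NSP term: -log(y_hat) for the correct IsNext / NotNext label.
    nsp_loss = -math.log(nsp_prob)
    return mlm_loss + nsp_loss

# Hypothetical probabilities of the true tokens at positions 1, 4 and 8,
# and of the correct label "IsNext" at [CLS].
loss = bert_pretraining_loss([0.6, 0.3, 0.8], nsp_prob=0.9)
print("total loss:", loss)
```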

What is the best setting?

Architecture (each studied at scale: small, medium, large):

  • encoder. Objective: MLM (corruption rate, token deletion, span mask)
  • encoder-decoder. Objective: de-noising (corruption rate, continuous masking)
  • decoder. Objective: CLM, prefix-LM, conditional

Pretraining:

  • Wiki
  • Books
  • Web Crawl

FineTuning:

  • GLUE
  • SQuAD
  • SuperGLUE
  • WMT-14
  • WMT-15
  • WMT-16
  • CNN/DM

hyp-params:

  • num. of train steps
  • learning rate scheme
  • optimizer

Language Models are Few Shot Learners, conditional learners

Text to text framework

Instruction fine-tuning

Aligning with user intent

Retrieval Augmented Generation

Agents

Hugging Face Is Here to Help us

Applying a novel machine learning architecture to a new task can be a complex undertaking, and usually involves the following steps:*

  • Implement the model architecture in code, typically based on PyTorch or TensorFlow.

  • Load the pretrained weights (if available) from a server

  • Preprocess the inputs, pass them through the model, and apply some task specific post-processing

  • Implement data loaders and define loss functions and optimizers to train the model.

Each of these steps requires custom logic for each model and task.

Why Hugging Face?

When research groups publish a new article, they usually also release the code along with the model weights. However, this code is rarely standardized and often requires days of engineering to adapt to new use cases.

That's where HF Transformers comes to the rescue

*All points are taken from the book NLP with Transformers by Lewis Tunstall

[Figure: PyTorch core and ecosystem]

  • Core: tensors, JIT, nn, Optim, multiprocessing, quantization, sparse, ONNX, Distributed
  • Ecosystem: fast.ai, Detectron 2, Horovod, Flair, AllenNLP, torchvision, BoTorch, Glow, Lightning, Skorch

[Figure: Hugging Face modules]

  • Transformers, Datasets, evaluate, trainer, accelerate, Gradio, Inference Endpoints, PEFT

Transformers has layered APIs that allow us to interact at various levels of abstraction

With the Transformers, Tokenizers, and Datasets libraries, we have everything we need to train our very own transformer models!

We have a lot more modules, such as Trainer and evaluate, and modules for image and audio processing.

We will get into the details of the core modules in the notebooks

Rich Ecosystem

Transformers-Distilled-HF-Workshop-IITMBS

By Arun Prakash
