Note: in this workshop we will only briefly cover the core concepts, so that one can sail through the Hugging Face documentation.
For more details, you may read the lectures by Prof. Mitesh Khapra here (all lecture recordings will be made available on YouTube) or watch the lectures by Andrej Karpathy.
Specific Task Helpers: n-gram models
Task-agnostic Feature Learners: word2vec, context modelling
Transfer Learning for NLP: ELMo, GPT, BERT (pre-train, fine-tune)
General Language Models: GPT-3, InstructGPT (emergent abilities)
[Diagram: Transformer encoder-decoder. The Source Input Block feeds the Encoder (Multi-Head Attention → Add&Norm → Feed-forward NN → Add&Norm); the Target Input Block feeds the Decoder (Multi-Head Masked Attention → Add&Norm → Multi-Head Cross-Attention → Add&Norm → Feed-forward NN → Add&Norm), which produces the Output Block (tied).]
I am reading a book (source, English)
Naan oru puthagathai padiththu kondirukiren (target, Tamil: "I am reading a book")
Source Input Block / Target Input Block: Tokenizer → Token Ids → embeddings

Tokenizer:
"I am reading a book" → ["i", "am", "reading", "a", "book"]
Contains:
Normalizer: Lowercase (I → i)
Pre-tokenizer: Whitespace
Tokenization algorithm: BPE

Token Ids:
["i", "am", "reading", "a", "book"] → [i:2, am:8, reading:75, a:4, book:100]
With special tokens: [[BOS]:1, i:2, am:8, reading:75, a:4, book:100, [EOS]:3]
I am reading a book
Tokenizer → Token Ids → embeddings → position embeddings
[[BOS]:1, i:2, am:8, reading:75, a:4, book:100, [EOS]:3]
Source Input Block
Embedding for each input token
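A sketch of the "Embedding for each input token" step in PyTorch; the vocabulary size, model width, and learned positional embeddings are assumptions for illustration:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768    # assumed sizes

token_emb = nn.Embedding(vocab_size, d_model)      # one vector per token id
pos_emb = nn.Embedding(max_len, d_model)           # one vector per position (learned variant)

# Token ids for "[BOS] i am reading a book [EOS]" from the toy vocabulary above
ids = torch.tensor([[1, 2, 8, 75, 4, 100, 3]])
positions = torch.arange(ids.size(1)).unsqueeze(0)  # positions 0..6

# Input to the first encoder block: token embedding + position embedding
x = token_emb(ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 7, 768])
```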
[Diagram: the same encoder-decoder stack; cross-attention lets each target token (Naan, puthakathai, padithtu, kondirukiren) attend over the source tokens (I, am, reading, a, book).]
Position Encoding
Attention
Normalization
Activation
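A sketch of the Attention component (scaled dot-product attention, the core of Multi-Head Attention), assuming PyTorch; the multi-head variant just runs this in parallel over several projected subspaces:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k) tensors; returns the attention-weighted values."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g. causal mask in the decoder
    weights = torch.softmax(scores, dim=-1)                    # attention distribution over positions
    return weights @ v

# Toy self-attention: 1 sentence, 7 tokens, 64-dimensional projections (assumed sizes)
x = torch.randn(1, 7, 64)
out = scaled_dot_product_attention(x, x, x)   # q = k = v = x
print(out.shape)  # torch.Size([1, 7, 64])
```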
Input text → Predict the class/sentiment
Input text → Summarize
Question → Answer
Encoder-only stack: Input text → [Multi-Head Attention → Add&Norm → Feed-forward NN → Add&Norm]
Input \(x_1, <mask>, \cdots, x_{T}\) → predict the masked token, \(P(x_2 = ?)\)

Decoder-only stack: [Multi-Head Masked Attention → Add&Norm → Feed-forward NN → Add&Norm]
Input \(x_1, x_2, \cdots, x_{i-1}\) → predict the next token, \(P(x_i)\)
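A small sketch (PyTorch, assumed sequence length) contrasting the two attention patterns: the encoder lets every token see every other token, while the decoder's causal mask lets position i see only earlier positions:

```python
import torch

T = 5  # sequence length (assumed)

# Encoder-style (bidirectional) attention: every position may attend to every position.
full_mask = torch.ones(T, T)

# Decoder-style (causal) attention: position i attends only to positions <= i,
# which is what allows training on next-token prediction P(x_i | x_1, ..., x_{i-1}).
causal_mask = torch.tril(torch.ones(T, T))
print(causal_mask)
```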
Encoder-decoder stack: Encoder [Multi-Head Attention → Add&Norm → Feed-forward NN → Add&Norm] + Decoder [Multi-Head Masked Attention → Add&Norm → Multi-Head Cross-Attention → Add&Norm → Feed-forward NN → Add&Norm]
Encoder input \(x_1, <mask>, \cdots, x_{T}\); the decoder starts from \(<go>\) and predicts the target tokens step by step, e.g. \(P(x_2 \mid x_1, \ldots)\)
Transformer Block 1
Transformer Block 2
Transformer Block 3
Transformer Block 4
Transformer Block 5
Suitable for translation
Suitable for tasks like text generation and summarization
[Diagram: BERT-style masked language modelling. The masked input "[CLS] [mask] enjoyed the [mask] transformers [SEP] The visuals were amazing" passes through stacked Encoder Layers (attention, FFN, Normalization, Residual connection; i.e. Self-Attention + Feed Forward Network), and the model predicts the original tokens "I" and "movie" at the masked positions.]
Input: Sentence A, Sentence B
Label: IsNext
Special tokens: [CLS], [SEP]
[Diagram: BERT input representation. For the sequence "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing", each token's input vector is the sum of its Token Embedding, Segment Embedding, and Position Embedding; the masked version "[CLS] [mask] enjoyed the [mask] transformers [SEP] The [mask] were amazing" is fed through the stacked Encoder Layers (Self-Attention + Feed Forward Network).]
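A quick sketch with the Hugging Face transformers tokenizer showing the [CLS]/[SEP] special tokens and the segment ids for a sentence pair; bert-base-uncased is just an illustrative public checkpoint:

```python
from transformers import AutoTokenizer

# Assumption: bert-base-uncased is used purely as an illustrative checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("I enjoyed the movie transformers", "The visuals were amazing")

# Tokens start with [CLS]; the two sentences are separated and terminated by [SEP].
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Segment (token_type) ids distinguish sentence A (0) from sentence B (1);
# BERT adds these segment embeddings to the token and position embeddings.
print(enc["token_type_ids"])
```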
[Table: model comparison along architecture (encoder vs decoder), scale, pre-training objective (e.g. MLM), fine-tuning setup, and hyper-parameters.]
All points are taken from the book Natural Language Processing with Transformers by Lewis Tunstall, Leandro von Werra, and Thomas Wolf.
Core: tensors, JIT, nn, optim, multiprocessing, quantization, sparse, ONNX, distributed
Ecosystem: fast.ai, Detectron2, Horovod, Flair, AllenNLP, torchvision, BoTorch, Glow, Lightning, Skorch
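A minimal sketch of how the core pieces named above (tensors, nn, optim, autograd) fit together; shapes and hyper-parameters are arbitrary, for illustration only:

```python
import torch
import torch.nn as nn

# A tiny model built from torch.nn, trained with torch.optim on random tensors.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 10)            # a batch of 16 examples (assumed shapes)
y = torch.randint(0, 2, (16,))     # random class labels

for step in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                # autograd computes gradients w.r.t. the parameters
    optimizer.step()
    print(step, loss.item())
```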
Hugging Face ecosystem: Transformers, Datasets, Evaluate, Trainer, Accelerate, Gradio, Inference Endpoints, PEFT
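As a first taste of the Transformers library, the pipeline entry point; the default checkpoint is whatever the library selects, so the exact output is illustrative:

```python
from transformers import pipeline

# Sentiment analysis with the library's default checkpoint for this task.
classifier = pipeline("sentiment-analysis")
print(classifier("I enjoyed the movie Transformers"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```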