Figure: an encoder-only transformer block. Multi-Head Attention and a feed-forward NN, each followed by Add & Norm; input \(x_1,<mask>,\cdots,x_{T}\), output \(P(<mask>)\) at the masked position.
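A minimal sketch of one such encoder layer in PyTorch; layer names and sizes are illustrative, not taken from the original figure:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention and a feed-forward
    network, each wrapped in a residual connection plus layer norm
    (the "Add & Norm" boxes in the figure)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)       # Add & Norm
        x = self.norm2(x + self.ffn(x))    # Add & Norm
        return x

x = torch.randn(2, 7, 512)                # (batch, tokens, d_model)
y = EncoderLayer()(x)                      # same shape out
```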
Figure: a decoder-only transformer block. Multi-Head masked attention and a feed-forward NN, each followed by Add & Norm; input \(x_1,x_2,\cdots,x_{i-1}\), output \(P(x_i)\), the next-token distribution.
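The only structural difference from the encoder block is the mask: position \(i\) may attend to positions \(\le i\) only. A sketch of building such a causal mask (sizes illustrative):

```python
import torch

T = 5  # sequence length (illustrative)
# True above the diagonal marks positions that must NOT be attended to
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
# Applied to raw attention scores before the softmax:
scores = torch.randn(T, T)                        # stand-in for Q K^T / sqrt(d)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)           # row i only covers x_1..x_i
```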
Figure: an encoder-decoder transformer. The encoder stacks Multi-Head Attention and a feed-forward NN, each with Add & Norm, over \(x_1,<mask>,\cdots,x_{T}\). The decoder stacks Multi-Head masked attention, Multi-Head cross-attention over the encoder output, and a feed-forward NN, each with Add & Norm; decoding starts from \(<go>\) and produces \(P(<mask>)\).
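In the decoder half of this figure, cross-attention takes its queries from the decoder states and its keys and values from the encoder output. A minimal sketch, with all names and sizes illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 10, d_model)   # encoder output over x_1..x_T
dec_x   = torch.randn(1, 4, d_model)    # decoder states, starting from <go>

# Queries come from the decoder; keys/values from the encoder output.
ctx, _ = cross_attn(query=dec_x, key=enc_out, value=enc_out)
```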
Figure: candidate next words (play, watch, go, read) ranked by the language model. Example of context dependence: in "Nothing has shipped its new OS to Nothing Phone 2", "Nothing" names a company.
Figure: masked language modelling with the encoder-only transformer. Some input positions are replaced by [mask] (e.g. \(x_1,\cdots,[mask],\cdots,x_{T}\)), and at each masked position \(i\) the model outputs \(P(y_i = x_i)\), the probability of the original token (e.g. \(P(y_2 = x_2)\) when position 2 is masked).
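A minimal sketch of this objective: the cross-entropy loss is computed only at masked positions, where the model must recover the original token (PyTorch, illustrative sizes and token ids):

```python
import torch
import torch.nn.functional as F

vocab, T = 30522, 8
logits = torch.randn(1, T, vocab)   # model output P(y_i) at each position
labels = torch.full((1, T), -100)   # -100 = ignored by cross_entropy
labels[0, 3] = 2054                 # only masked positions keep the true x_i
loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1),
                       ignore_index=-100)
```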
Figure: self-attention over the sentence "I enjoyed the movie transformers" with a 0/1 attention-mask matrix; entry \((i, j) = 1\) means token \(i\) may attend to token \(j\).
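A sketch of the computation this figure depicts: single-head self-attention with a 0/1 mask, zeros becoming \(-\infty\) in the score matrix so those tokens receive no attention. The projection matrices for Q, K, V are omitted for brevity; all values are illustrative:

```python
import math
import torch

def self_attention(x, mask):
    """Single-head self-attention with a 0/1 mask like the figure's;
    Q/K/V weight matrices omitted for brevity (illustrative sketch)."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d)   # Q K^T / sqrt(d)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x          # weighted sum of values

x = torch.randn(5, 16)      # 5 tokens: "I enjoyed the movie transformers"
mask = torch.ones(5, 5)     # fully visible mask, as in the encoder figure
out = self_attention(x, mask)
```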
Figure: the same self-attention applied to the masked input "[mask] enjoyed the [mask] transformers"; the outputs at the masked positions are used to recover the original tokens.
Figure (repeated with example numbers): the self-attention computation on the masked input, showing a binary mask matrix and illustrative real-valued score entries (0.3, 0.2, 0.1, 0.5, ...) feeding the prediction at each [mask] position.
Figure: one encoder layer, Self-Attention followed by a Feed Forward Network, applied to "[mask] enjoyed the [mask] transformers".
Figure: next sentence prediction. A stack of encoder layers (attention, FFN, normalization, residual connection) processes the pair "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing". Input: Sentence A and Sentence B; label: IsNext.
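A sketch of the next-sentence-prediction head: the final hidden state at [CLS] goes through a binary classifier (IsNext vs. NotNext). Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model = 768
nsp_head = nn.Linear(d_model, 2)      # 2 classes: IsNext, NotNext

hidden = torch.randn(1, 12, d_model)  # encoder output for the sentence pair
cls_state = hidden[:, 0]              # position 0 is [CLS]
logits = nsp_head(cls_state)
is_next_prob = torch.softmax(logits, dim=-1)[:, 0]
```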
Figure: BERT's input representation. Each token of "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing" enters the encoder (Self-Attention + Feed Forward Network) as the sum of its Token Embedding, Segment Embedding, and Position Embedding.
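A sketch of how the three embeddings combine: each token's input vector is the elementwise sum of its token, segment, and position embeddings. Sizes follow BERT-base; the token ids are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab, max_len = 768, 30522, 512
tok_emb = nn.Embedding(vocab, d_model)     # Token Embeddings
seg_emb = nn.Embedding(2, d_model)         # Segment Embeddings (A=0, B=1)
pos_emb = nn.Embedding(max_len, d_model)   # Position Embeddings

ids  = torch.tensor([[101, 1045, 5632, 1996, 3185, 102]])  # illustrative ids
segs = torch.zeros_like(ids)                               # all sentence A
pos  = torch.arange(ids.size(1)).unsqueeze(0)

x = tok_emb(ids) + seg_emb(segs) + pos_emb(pos)  # input to encoder layer 1
```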
Figure: the masked pair "[CLS] [mask] enjoyed the [mask] transformers [SEP] The [mask] were amazing" passing through a stack of Encoder Layers.
Figure: BERT pretraining. The input "[CLS] Masked Sentence A [SEP] Masked Sentence B [PAD]" is fed to a bidirectional transformer encoder, a stack of 12 layers (Encoder Layer-1 through Encoder Layer-12); the pretraining corpora are BooksCorpus (800M words) and English Wikipedia (2,500M words).
*Actual vocabulary size is 30,522; parameters for layer normalization are excluded.
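The footnote's vocabulary size makes the embedding parameter count easy to check; with BERT-base sizes (hidden 768, maximum length 512, 2 segments) the embedding tables alone account for roughly 23.8M parameters:

```python
vocab, d_model, max_len, segments = 30522, 768, 512, 2
token_params    = vocab * d_model        # 23,440,896
position_params = max_len * d_model      #    393,216
segment_params  = segments * d_model     #      1,536
print(token_params + position_params + segment_params)  # 23,835,648
```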
Figure: the general input format "[CLS] Sentence A [SEP] Sentence B [PAD]" to the bidirectional transformer encoders (the masked pretraining variant and corpus sizes are as above).
Paragraph: What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface.
Question: What is unique about the mission?
Answer: role of artificial intelligence (AI) in guiding the spacecraft
Starting token: 9
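A sketch of extractive span prediction: two linear heads score every token as a possible answer start or end, and the answer is the span between the highest-scoring positions (the example's starting token index 9 would come out of the start argmax). Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, T = 768, 40
span_head = nn.Linear(d_model, 2)     # start and end logits per token

hidden = torch.randn(1, T, d_model)   # encoder output over [CLS] Q [SEP] P
start_logits, end_logits = span_head(hidden).split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=-1)   # e.g. token 9 ("role")
end   = end_logits.squeeze(-1).argmax(dim=-1)
# answer = tokens[start : end + 1]
# e.g. "role of artificial intelligence (AI) in guiding the spacecraft"
```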
Figure: extractive question answering with BERT. The question and paragraph are packed as "[CLS] what is unique about the mission [SEP] What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface." and fed to the bidirectional transformer encoder; the model marks a start and an end token in the paragraph, and the span between them, "role of artificial intelligence (AI) in guiding the spacecraft", is returned as the answer.
Return an empty string (implying that the answer is not in the paragraph).
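For unanswerable questions, a common convention (used for SQuAD 2.0-style data) is that both heads pointing at [CLS] (index 0) encodes "no answer": the null score is compared against the best non-null span. A hedged, self-contained sketch of that check:

```python
import torch

start_logits = torch.randn(40)   # per-token start scores (illustrative)
end_logits   = torch.randn(40)

best = (start_logits[1:].argmax() + 1, end_logits[1:].argmax() + 1)
null_score = start_logits[0] + end_logits[0]         # span = [CLS] only
best_score = start_logits[best[0]] + end_logits[best[1]]
answer_is_empty = bool(null_score > best_score)      # -> return ""
```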
Figure: a bidirectional transformer encoder over a sequence delimited by [GO] and [EOS], with several positions replaced by [mask]; the encoder predicts the original tokens (e.g. \(x_3\) and \(x_7\)) at the masked positions.