Lecture 3: Bidirectional Encoder Representations from Transformers

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Introduction to Large Language Models

Using only the encoder of the transformer (encoder-only models)

Using only the decoder of the transformer (decoder-only models)

Using both the encoder and decoder of the transformer (encoder-decoder models)

[Figure: the three transformer configurations. Encoder-only: Multi-Head Attention and Feed-Forward NN with Add&Norm, input \(x_1,<mask>,\cdots,x_{T}\), output \(P(<mask>)\). Decoder-only: Multi-Head Masked Attention and Feed-Forward NN with Add&Norm, input \(x_1,x_2,\cdots,x_{i-1}\), output \(P(x_i)\). Encoder-decoder: the encoder stack plus a decoder stack with Multi-Head Masked Attention and Multi-Head Cross-Attention, inputs \(x_1,<mask>,\cdots,x_{T}\) and \(<go>\), output \(P(<mask>)\).]

In the previous lecture, we used the decoder of the transformer to build a language model


In this lecture, we will see how to use the encoder of the transformer to build a language model

In GPT, the representation of language is learned in a unidirectional (left-to-right) way.

[Figure: the decoder-only (GPT-style) model reads "i like to" and assigns probabilities \(P(x_i)\) to candidate next words such as play, watch, go, read.]

This is a natural fit for tasks like text generation

How about tasks like Named Entity Recognition (NER)...

Nothing has shipped its new OS to Nothing Phone 2 (here, "Nothing" should be tagged as a Company)

...and fill-in-the-blank tasks?

i ___ to read a ___ [eos]

Module 3.1 : Masked Language Modelling

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

We need to look at both directions (surrounding words) to predict the masked words

[Figure: the encoder-only model takes \(x_1,<mask>,\cdots,x_{T}\) and predicts \(P(y_2=x_2)\) at the masked position; the decoder-only and encoder-decoder models are shown again for comparison.]

i ___ to read a ___ [eos]

Predict the masked words using the context words in both directions (like CBOW)

This is called Masked Language Modelling (MLM)

We cannot use the decoder component of the transformer, as it is inherently unidirectional.

We can use the encoder part of the transformer.

[Figure: the encoder-only model takes \(x_1,\cdots,[mask],\cdots,x_{T}\) as input and outputs \(P(y_i=x_i)\) at the masked position.]

We need to find a way to mask words in the input text.

Now, the problem is cast as

P(y_i|x_1,\cdots,x_{i}=[mask],\cdots,x_T)=?

It is assumed that the predicted tokens are independent of each other 

Let's see.

We know that each word attends to every other word in the sequence in the self-attention layer.

[Figure: the self-attention layer over the sentence "I enjoyed the movie transformers".]

Masked Language Modelling (MLM)

[Figure: the same self-attention layer applied to "[mask] enjoyed the [mask] transformers".]

Our objective is to mask a few words randomly and predict the masked words.

Can we MASK the attention weights of the words to be masked as in CLM?

[Figure: the scaled dot-product attention pipeline (MatMul \(Q^TK\), Scale by \(\frac{1}{\sqrt{d_k}}\), Mask, Softmax, MatMul with \(V\)) applied to "I enjoyed the movie transformers", with the candidate mask]

Mask = \begin{bmatrix}-\infty &0 & 0 & -\infty & 0 \\ -\infty &0 & 0 & -\infty & 0 \\ -\infty &0 & 0 & -\infty & 0 \\ -\infty &0 & 0 & -\infty & 0 \\ -\infty &0 & 0 & -\infty & 0 \\ \end{bmatrix}

that blocks attention to the masked positions (columns 1 and 4, i.e. "I" and "movie").

No. Why?

Because we want the model to learn (attend to) what the blanks are.

We can think of the [mask] token as noise that corrupts the original input text.


The model is then tasked with recovering the original token.

This is similar to the denoising objective of autoencoders.

For this reason, MLM is also called a denoising pre-training objective.

Using Special Tokens

A simple approach is to use a [mask] token for the words to be masked.

Add [mask] as a special token in the vocabulary and get an embedding for it.

[Figure: the masked sentence "[mask] enjoyed the [mask] transformers" is fed through the self-attention and feed-forward network layers; the losses \(\mathscr{L}_1=-\log(\hat{y}_1)\) and \(\mathscr{L}_2=-\log(\hat{y}_4)\) are computed at the two masked positions.]

Note carefully that only the set \(\mathcal{M}\) of masked words is predicted, and hence the cross-entropy loss is computed only for those predictions.

Typically, 15% of words in the input sequence are masked

Very high masking results in a severe loss of context information, and the model will not learn good representations.

On the other hand, very little masking takes a long time to converge (because the gradient values are comparatively small) and makes training inefficient.

However, the tradeoff can be adjusted based on the model size, masking scheme, and optimization algorithm. Refer to "Should you mask 15% in MLM?" to know more.

\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum \limits_{y_i \in \mathcal{M}} -\log(\hat{y_i})
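To make the loss computation concrete, here is a minimal sketch in PyTorch (the tensor names and shapes are illustrative assumptions, not the actual BERT implementation): the cross-entropy is computed only at the masked positions, exactly as in the formula above.

```python
import torch
import torch.nn.functional as F

# logits: (batch, seq_len, vocab_size) -- encoder outputs projected onto the vocabulary
# labels: (batch, seq_len) -- original token ids at masked positions, -100 elsewhere
def mlm_loss(logits, labels):
    vocab_size = logits.size(-1)
    # ignore_index skips the unmasked positions, so the loss is
    # averaged only over the |M| masked tokens
    return F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )

# toy usage: only positions (0, 1) and (1, 4) contribute to the loss
logits = torch.randn(2, 8, 30000)
labels = torch.full((2, 8), -100)
labels[0, 1], labels[1, 4] = 2054, 3185
print(mlm_loss(logits, labels))
```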

BERT

[Figure: a stack of encoder layers (attention, FFN, normalization, residual connections) processes "[mask] enjoyed the [mask] transformers"; the losses \(\mathscr{L}_1=-\log(\hat{y}_1)\) and \(\mathscr{L}_2=-\log(\hat{y}_4)\) at the masked positions are averaged into \(\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum \limits_{y_i \in \mathcal{M}} -\log(\hat{y_i})\).]

A multi-layer bidirectional transformer encoder architecture.

The BERT-Base model contains 12 layers with 12 attention heads per layer.

The masked words (15% of the input sequence) are sampled uniformly at random.

Of these, 80% are replaced with the [mask] token, 10% are replaced with random words, and the remaining 10% are retained as they are. Why? Because the special [mask] token will not be part of the data when the model is adapted to downstream tasks.
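Here is a minimal sketch of this 15% / 80-10-10 masking scheme (a simplified illustration with hypothetical token ids, not the exact BERT preprocessing code):

```python
import random

MASK_ID = 103          # hypothetical id of the [mask] token
VOCAB_SIZE = 30000

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted input, labels); label is None where no prediction is needed."""
    inputs, labels = list(token_ids), [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:        # sample 15% of the positions uniformly
            labels[i] = tok                    # the model must recover the original token
            r = random.random()
            if r < 0.8:                        # 80%: replace with [mask]
                inputs[i] = MASK_ID
            elif r < 0.9:                      # 10%: replace with a random word
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the token as it is
    return inputs, labels
```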

Is the MLM pre-training objective sufficient for downstream tasks like Question-Answering, where the interaction between sentences is important?

Next Sentence Prediction (NSP)

Now, let's extend the input to a pair of sentences (A, B) and a label that indicates whether sentence B naturally follows sentence A.

[Figure: input Sentence A: "I enjoyed the movie transformers", input Sentence B: "The visuals were amazing", Label: IsNext. The pair is fed to the encoder as [CLS] I enjoyed the movie transformers [SEP] The visuals were amazing, and the loss \(\mathscr{L}=-\log(\hat{y})\) is computed on the classification output.]

The two sentences are separated with a special token [SEP].

The hidden representation corresponding to the [CLS] token is used for the final classification (prediction).


In 50% of the instances, sentence B is the natural next sentence that follows sentence A (labelled IsNext).

In the other 50% of the instances, sentence B is a random sentence from the corpus (labelled NotNext).

Pre-training with the NSP objective significantly improves the performance on QA and similar tasks.
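A minimal sketch of how the 50/50 NSP training pairs can be constructed (assuming the corpus is stored as a list of documents, each a list of sentences; this is an illustrative helper, not the original preprocessing code):

```python
import random

def make_nsp_pair(corpus):
    """corpus: list of documents; each document is a list of sentence strings."""
    doc = random.choice([d for d in corpus if len(d) > 1])
    idx = random.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if random.random() < 0.5:                  # 50%: the actual next sentence
        sent_b, label = doc[idx + 1], "IsNext"
    else:                                      # 50%: a random sentence from the corpus
        sent_b, label = random.choice(random.choice(corpus)), "NotNext"
    return sent_a, sent_b, label
```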

To indicate whether a token belongs to sentence A or sentence B, a separate learnable segment embedding is used in addition to the token and position embeddings.

[Figure: for the input [CLS] I enjoyed the movie transformers [SEP] The visuals were amazing, the representation of each input token is the sum of its token embedding \(E_T\), its segment embedding (\(E_A\) or \(E_B\)), and its position embedding (\(E_0, E_1, \cdots, E_{10}\)).]
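A minimal sketch of the three embeddings being summed (assuming PyTorch; the class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # token embeddings E_T
        self.seg = nn.Embedding(n_segments, hidden)   # segment embeddings E_A / E_B
        self.pos = nn.Embedding(max_len, hidden)      # learned position embeddings E_0 ... E_511

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # input representation = token + segment + position embedding
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)

# toy usage: 11 tokens, the first 7 from sentence A (segment 0), the rest from sentence B (segment 1)
ids = torch.randint(0, 30000, (1, 11))
segs = torch.tensor([[0] * 7 + [1] * 4])
print(BertStyleEmbeddings()(ids, segs).shape)   # torch.Size([1, 11, 768])
```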

[Figure: the full pre-training setup. A stack of encoder layers processes [CLS] [mask] enjoyed the [mask] transformers [SEP] The [mask] were amazing; MLM losses \(\mathscr{L}_1=-\log(\hat{y}_1)\), \(\mathscr{L}_2=-\log(\hat{y}_4)\), \(\mathscr{L}_3=-\log(\hat{y}_8)\) are computed at the masked positions and \(\mathscr{L}_{cls}=-\log(\hat{y})\) at the [CLS] position.]

Minimize the objective:

\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum \limits_{y_i \in \mathcal{M}} -\log(\hat{y_i})+\mathscr{L_{cls}}

Pre-training

Dataset: BookCorpus (800M words) and Wikipedia (2500M words)

Vocabulary size: 30,000

Context length (T): \(\leq\) 512 tokens

Model Architecture:

Num. of layers: 12 (base) / 24 (large)

Hidden size: 768 / 1024 (\(h \in \mathbb{R}^{768}\) for base)

Num. of attention heads: 12 / 16

Intermediate size: 4H = 3072 / 4096

[Figure: the input [CLS] Masked Sentence A [SEP] Masked Sentence B [PAD] is fed through encoder layers 1 to 12; the objective \(\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum_{y_i \in \mathcal{M}} -\log(\hat{y_i})+\mathscr{L_{cls}}\) is minimized.]

Parameter Calculation

[Figure: the embedding layer adds the token, segment, and position embeddings for each input token, producing vectors in \(\mathbb{R}^{768}\).]

Token Embeddings: \(30000 \times 768 \approx 23M\)

Segment Embeddings: \(2 \times 768 = 1536\)

Position Embeddings: \(512 \times 768 \approx 0.4M\)

Total (embedding layer): \(\approx 23.4M\)

For a single encoder layer:

Self-attention (ignoring bias): \(W_Q, W_K, W_V\): \((768 \times 64 \times 3) \times 12 = 1.7M\); output projection: \(768 \times 768 \approx 0.6M\)

FFN: \(768 \times 3072 + 3072 \times 768 + (3072 + 768) \approx 4.7M\)

Total per layer: \(\approx 7M\)

For 12 layers: \(\approx 84M\)

Total number of parameters: \(23.4M + 84M \approx 107.4M\)* (110M in the paper)

*Actual vocabulary size is 30522; parameters for layer normalization are excluded.
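The same back-of-the-envelope calculation as a short script (a sketch that, like the slide, ignores biases in attention and all layer-normalization parameters):

```python
# Rough parameter count for BERT-Base
V, H, LAYERS, HEADS, D_HEAD, FFN = 30000, 768, 12, 12, 64, 3072

embeddings = V * H + 2 * H + 512 * H            # token + segment + position  ~ 23.4M
attention  = (H * D_HEAD * 3) * HEADS + H * H   # Q,K,V projections + output projection ~ 2.3M
ffn        = H * FFN + FFN * H + (FFN + H)      # two linear layers plus their biases ~ 4.7M
per_layer  = attention + ffn                    # ~ 7M
total      = embeddings + LAYERS * per_layer    # ~ 108M (107.4M with the slide's rounding)

print(f"embeddings ~ {embeddings/1e6:.1f}M, per layer ~ {per_layer/1e6:.1f}M, total ~ {total/1e6:.1f}M")
```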

Module 3.2 : Adapting to Downstream tasks

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Two Approaches

BERT as feature extractor

Fine-tuning BERT

Classification: Feature Based

[Figure: [CLS] Sentence A [SEP] Sentence B [PAD] is fed to the stack of bidirectional transformer encoders (the body); the resulting feature vector is passed to a classification head (Logistic Regression, Naive Bayes, NN, ...) that outputs \(\hat{y} \in \{0,1\}\).]

Take a sentence of length less than 512 and feed it as input to BERT (with appropriate padding, if required).

Take the final hidden representation (the output of the last encoder) as a feature vector for the entire sentence.

This representation is superior to merely concatenating the representations (say, from word2vec) of the individual words.

Finally, we can use any ML model (called a head), such as Logistic Regression, Naive Bayes, or a Neural Network, for classification.

All the parameters of BERT are frozen and only the classification head is trained from scratch.
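A minimal sketch of the feature-based approach (assuming the Hugging Face transformers library and scikit-learn are available; the two example sentences and labels are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()    # pre-trained BERT body

texts = ["I enjoyed the movie transformers", "the plot was terrible"]   # placeholder data
labels = [1, 0]

with torch.no_grad():                                           # BERT stays frozen: no gradients
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    features = bert(**enc).last_hidden_state[:, 0, :]           # [CLS] representation per sentence

head = LogisticRegression().fit(features.numpy(), labels)       # only the head is trained
print(head.predict(features.numpy()))
```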

Classification: Fine-Tuning

[Figure: the same setup, but the classification head and all BERT parameters are now trained together; the head outputs \(\hat{y} \in \{0,1\}\).]

Take a sentence of length less than 512 and feed it as input to BERT (with appropriate padding, if required).

Add a classification head (again, it could be any suitable ML model) and initialize its parameters randomly.

Now, train the entire model, including the parameters of the pre-trained BERT, on the new dataset.

It is observed that the model used in the classification head converges quickly, with fewer labelled training samples, compared to the feature-based approach.

Note, however, that we do not mask words in the input sequence (this is the reason why we replaced 10% of the masked words with random words during pre-training).
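A minimal sketch of the fine-tuning approach (again assuming the Hugging Face transformers library; the batch, learning rate, and number of steps are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# a randomly initialized classification head is added on top of the pre-trained encoder
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)      # all parameters are updated

texts, labels = ["I enjoyed the movie transformers"], torch.tensor([1])   # placeholder batch
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

model.train()
for _ in range(3):                                              # a few illustrative steps
    out = model(**enc, labels=labels)                           # cross-entropy loss on the head
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```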

Extractive Question-Answering

[Figure: pre-training on unlabelled data (BookCorpus, 800M words; Wikipedia, 2500M words) followed by fine-tuning on labelled data.]

Fine-Tuning

Paragraph: What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface.

Question: What is unique about the mission?

Answer: role of artificial intelligence (AI) in guiding the spacecraft

Starting token: 9

[Figure: the question ("what is ... mission") and the paragraph ("what sets ... surface") are packed into one input sequence starting with [CLS]; the encoder produces final representations \(h_1, h_2, \cdots, h_{25}\).]

We need to make use of these final representations (\(h_1, h_2, \cdots, h_{25}\)) to find the start and end tokens.

One approach is to pose this as a classification problem.

The classification head takes in all these final representations and returns the probability distribution for a token to be a start or end token.

Let \(S\) denote a start vector of the same size as \(h_i\).

The probability that the \(i\)-th word in the paragraph is the start token is

s_i= \frac{exp(S \cdot h_i)}{\sum \limits_{j=1}^{25}exp(S\cdot h_j)}

Similarly, let \(E\) denote an end vector of the same size as \(h_i\).

The probability that the \(i\)-th word in the paragraph is the end token is

e_i= \frac{exp(E\cdot h_i)}{\sum \limits_{j=1}^{25}exp(E\cdot h_j)}

Both \(S\) and \(E\) are learnable parameters.

[Figure: the probability distribution for the start token over the paragraph words, peaking at \(s_9\), and the probability distribution for the end token, peaking at \(e_{17}\).]

If the start index is \(i=9\) and the end index is \(j=17\), then \(j \geq i\) and the predicted span is "role of artificial intelligence (AI) in guiding the spacecraft".

It is possible that the predicted end token index is less than the start token index (for example, \(i=9\) and \(j=7\), so \(j < i\)).

In that case, return an empty string (this implies that the answer is not in the paragraph).
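A minimal sketch of the span prediction (assuming PyTorch; \(H\), \(S\), and \(E\) are placeholder tensors standing in for the final representations and the learned start/end vectors):

```python
import torch

def predict_span(H, S, E):
    """H: (T, d) final token representations; S, E: (d,) start and end vectors."""
    s = torch.softmax(H @ S, dim=0)        # s_i = exp(S.h_i) / sum_j exp(S.h_j)
    e = torch.softmax(H @ E, dim=0)        # e_i = exp(E.h_i) / sum_j exp(E.h_j)
    i, j = int(s.argmax()), int(e.argmax())
    return (i, j) if j >= i else None      # None -> empty string: answer not in the paragraph

H = torch.randn(25, 768)                   # e.g. 25 paragraph tokens, hidden size 768
S, E = torch.randn(768), torch.randn(768)
print(predict_span(H, S, E))
```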

Module 3.3 : BERT Variations

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Instead of masking tokens randomly,

mask a contiguous span of random tokens, which is more natural for downstream tasks like Question-Answering.

[Figure: a bidirectional transformer encoder in which a contiguous span of tokens is replaced by [mask]; the model predicts \(p(y_4), p(y_5), p(y_6)\) at the masked positions.]

What should the span length be?

It is a random variable that follows a Geometric distribution (which is skewed towards shorter spans).

The NSP objective is dropped; instead, an auxiliary objective called the Span Boundary Objective (SBO) is added, which predicts the entire masked span using the representations of the (unmasked) span boundary tokens (\(x_3\) and \(x_7\), in this case).
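A minimal sketch of span masking with a geometrically distributed span length (the masking probability and the clipping value are illustrative assumptions):

```python
import random

MASK_ID = 103                                    # hypothetical [mask] token id

def sample_span_length(p=0.2, max_len=10):
    """Geometric(p), clipped at max_len: shorter spans are more likely."""
    k = 1
    while random.random() > p and k < max_len:
        k += 1
    return k

def mask_contiguous_span(token_ids):
    length = sample_span_length()
    start = random.randrange(1, len(token_ids) - length)   # keep an unmasked left boundary token
    masked = list(token_ids)
    masked[start:start + length] = [MASK_ID] * length
    # the boundary tokens at start-1 and start+length stay unmasked;
    # SBO uses their representations to predict the whole span
    return masked, start, length
```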

Proposed a set of robust design choices for BERT (because training is an expensive process):

  1. Training the model longer, with bigger batches (10 to 30 fold), over more data (almost 10-fold).

  2. Removing the next sentence prediction objective.

  3. Training on longer sequences.

  4. Dynamically changing the masking pattern applied to the training data.

It stands for Robustly Optimized BERT pre-training Approach (RoBERTa).

[Figure: the same stack of 12 encoder layers, now trained with the MLM objective \(\mathscr{L}=\frac{1}{|\mathcal{M}|}\sum_{y_i \in \mathcal{M}} -\log(\hat{y_i})\) alone (no NSP loss).]

The other aspect is to reduce the number of parameters in BERT using factorization of the embedding matrix, weight sharing, and knowledge distillation (KD) techniques.

This is helpful for storing and running BERT on resource-constrained devices.

The model size is reduced from 108 MB to 12 MB, with about a 5 to 9 times speedup in inference time.

Instead of masking tokens randomly,

replace tokens randomly with plausible alternatives sampled from a small generator network.

Then a discriminator network is tasked with distinguishing the replaced tokens from the real tokens.

This improves performance without increasing the training data size, and with comparatively less compute.

The performance is comparable to that of RoBERTa and ALBERT (A Lite BERT).
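A minimal sketch of the replaced-token-detection objective (placeholder tensors, assuming PyTorch; the generator and discriminator networks themselves are omitted):

```python
import torch
import torch.nn.functional as F

# original_ids: the real tokens; corrupted_ids: the generator's output, where some
# masked positions were filled with plausible (but possibly wrong) alternatives
original_ids  = torch.tensor([[101, 1045, 5632, 1996, 3185, 102]])
corrupted_ids = torch.tensor([[101, 1045, 5632, 1996, 2143, 102]])   # one token replaced

labels = (corrupted_ids != original_ids).float()     # 1 = replaced, 0 = real
disc_logits = torch.randn_like(labels)               # stand-in for per-token discriminator scores

# binary cross-entropy over every token (not only the masked ones)
loss = F.binary_cross_entropy_with_logits(disc_logits, labels)
print(loss)
```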

Lecture-3-BERT

By Arun Prakash