Lecture 3: Bidirectional Encoder Representations from Transformers
Mitesh M. Khapra, Arun Prakash A
AI4Bharat, Department of Computer Science and Engineering, IIT Madras
Introduction to Large Language Models
Using only the encoder of the transformer (encoder only models)
Using only the decoder of the transformer (decoder only models)
Using both the encoder and decoder of the transformer (encoder decoder models)
[Figure: the three model families. Encoder-only (Multi-Head Attention and Feed forward NN, each followed by Add&Norm): input \(x_1,<mask>,\cdots,x_{T}\), output \(P(<mask>)\). Decoder-only (Multi-Head Masked Attention and Feed forward NN, each followed by Add&Norm): input \(x_1,x_2,\cdots,x_{i-1}\), output \(P(x_i)\). Encoder-decoder (decoder with Multi-Head Masked Attention, Multi-Head Cross Attention and Feed forward NN, each followed by Add&Norm): encoder input \(x_1,<mask>,\cdots,x_{T}\), decoder input \(<go>\), output \(P(<mask>)\).]
In the previous lecture, we used the decoder of the transformer to build a language model
In this lecture, we will see how to use the encoder of the transformer to build a language model
In GPT, the representation of language is learned in a unidirectional (left-to-right) way.
[Figure: decoder-only block; input \(x_1,x_2,\cdots,x_{i-1}\), output \(P(x_i)\) over candidate next words such as play, watch, go, read]
This is a natural fit for tasks like text generation
How about tasks like Named Entity Recognition (NER)?
Example: "Nothing has shipped its new OS to Nothing Phone 2" (here "Nothing" is tagged as Company)
... and tasks like filling in the blanks?
Module 3.1 : Masked Language Modelling
We need to look at both directions (surrounding words) to predict the masked words
[Figure: encoder-only block alongside the decoder-only and encoder-decoder models; encoder input \(x_1,<mask>,\cdots,x_{T}\), output \(P(y_2=x_2)\)]
Predict the masked words using the context words in both directions (like CBOW)
This is called Masked Language Modelling (MLM)
We cannot use the decoder component of the transformer, as it is inherently unidirectional
We can use the encoder part of the transformer
[Figure: encoder-only block alongside the decoder-only and encoder-decoder models; encoder input \(x_1,\cdots,[mask],\cdots,x_{T}\), output \(P(y_i=x_i)\)]
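We need to find ways to mask the words in the input text.
Now, the problem is cast as predicting \(P(y_i=x_i)\) for each masked position \(i\), given the masked input \(x_1,\cdots,[mask],\cdots,x_{T}\)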
It is assumed that the predicted tokens are independent of each other
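Under the independence assumption above, one way to write the joint prediction over the masked positions is sketched below. The symbols \(\tilde{x}\) (the masked input sequence) and \(\mathcal{M}\) (the set of masked positions) are notation assumed here for illustration.

\[
P\big(\{y_i\}_{i \in \mathcal{M}} \mid \tilde{x}\big) = \prod_{i \in \mathcal{M}} P(y_i = x_i \mid \tilde{x})
\]

Each factor is conditioned on the same masked input, not on the other predicted tokens.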
[Figure: self-attention layer over the input "I enjoyed the movie transformers"]
Let's see
We know that, in the self-attention layer, each word attends to every other word in the sequence
[Figure: self-attention layer over the masked input "[mask] enjoyed the [mask] transformers"]
Masked Language Modelling (MLM)
Our objective is to mask a few words randomly and predict the masked words
Can we mask the attention weights of the words to be masked, as in Causal Language Modelling (CLM)?
[Figure: self-attention layer with a candidate attention mask ("Mask =") for the input "I ... movie ... transformers"]
No. Why?
Because we want the model to learn (attend to) what the blanks are.
We can think of the [mask] token as noise that corrupts the original input text
The model is then tasked with recovering the original token
This is similar to the denoising objective of autoencoders
For this reason, MLM is also called a denoising pre-training objective
[Figure: self-attention layer with the [mask] token in place of the words to be masked]
Using Special Tokens
A simple approach is to use a [mask] token in place of the words to be masked.
Add [mask] as a special token to the vocabulary and learn an embedding for it.
[Figure: the masked input "[mask] enjoyed the [mask] transformers" passed through Self-Attention and a Feed Forward Network]
Note carefully that only the set \(\mathcal{M}\) of masked words is predicted, and hence the cross-entropy loss is computed only for those predictions
Typically, 15% of words in the input sequence are masked
Masking too many words causes a severe loss of context information, and the model will not learn good representations
On the other hand, masking too few words makes convergence slow (because the gradient values are comparatively small) and training inefficient
However, the right trade-off depends on the model size, masking scheme and optimization algorithm. Refer to "Should you mask 15% in MLM?" to know more
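The point that the loss is computed only over the masked set can be illustrated with a minimal sketch. It assumes the model's output at each position is available as a word-to-probability dictionary and that the loss is averaged over the masked positions; the function name `masked_lm_loss` and the toy probabilities are hypothetical.

```python
import math

def masked_lm_loss(probs, targets, masked_positions):
    """Average cross-entropy over the masked positions only.

    probs: one dict per position, mapping word -> predicted probability
    targets: the original (uncorrupted) words
    masked_positions: indices that were masked (the set M)
    """
    if not masked_positions:
        return 0.0
    total = 0.0
    for i in masked_positions:
        # Unmasked positions never enter the sum
        total += -math.log(probs[i][targets[i]])
    return total / len(masked_positions)

# Toy example (made-up probabilities): positions 0 and 3 were masked
targets = ["I", "enjoyed", "the", "movie", "transformers"]
probs = [
    {"I": 0.5, "we": 0.5},
    {"enjoyed": 1.0},
    {"the": 1.0},
    {"movie": 0.25, "film": 0.75},
    {"transformers": 1.0},
]
loss = masked_lm_loss(probs, targets, {0, 3})
print(round(loss, 4))
```

With a 15% masking rate, only about 15% of the positions contribute a gradient signal per sequence, which is one way to see why very low masking rates make training inefficient.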
BERT
[Figure: the masked input "[mask] enjoyed the [mask] transformers" passed through a stack of Encoder Layers (attention, FFN, normalization, residual connection)]
A multi-layer bidirectional transformer encoder architecture.
The BERT Base model contains 12 layers with 12 attention heads per layer
The words to be masked (15%) in an input sequence are sampled uniformly
Of these, 80% are replaced with the [mask] token, 10% are replaced with random words and the remaining 10% are retained as they are. (Why?)
Because the special [mask] token won't be a part of the dataset while adapting to downstream tasks
Is the pre-training objective of MLM sufficient for downstream tasks like Question-Answering, where the interaction between sentences is important?
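The 15% sampling and the 80/10/10 replacement rule described above can be sketched as follows. This is a minimal illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the choice to always select at least one position are assumptions, and special tokens such as [CLS] and [SEP] are not treated separately.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, positions_to_predict) using the 80/10/10 rule."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    # Sample the positions to predict uniformly (assumed: at least one)
    n_select = max(1, round(mask_prob * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    for i in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[mask]"            # 80%: special mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random word
        # remaining 10%: the original word is kept unchanged
    return corrupted, positions

tokens = "I enjoyed the movie transformers and the visuals were amazing".split()
vocab = ["play", "watch", "go", "read"]        # toy vocabulary
corrupted, positions = mask_tokens(tokens, vocab, seed=0)
print(corrupted, positions)
```

All selected positions are predicted, including those that were kept unchanged or replaced by a random word, so the model cannot rely on seeing [mask] to know where a prediction is required.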
Next Sentence Prediction (NSP)
Now, let's extend the input to a pair of sentences (A, B) and a label that indicates whether sentence B naturally follows sentence A.
[Figure: the input "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing" (Sentence A, then Sentence B) passed through Self-Attention and a Feed Forward Network; Label: IsNext]
The two sentences are separated by a special token [SEP]
The hidden representation corresponding to the [CLS] token is used for final classification (prediction)
In 50% of the instances, sentence B is the natural next sentence that follows sentence A, labelled as IsNext
In the other 50% of the instances, sentence B is a random sentence from the corpus, labelled as NotNext
Pre-training with the NSP objective improves the performance on QA and similar tasks significantly
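The 50/50 construction of IsNext and NotNext pairs can be sketched as below. It is a simplified illustration: the function name `make_nsp_pairs` is hypothetical, the random sentence is drawn from the same small list rather than from a full corpus, and at least three sentences are assumed.

```python
import random

def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, label) examples from an ordered list of sentences."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the sentence that actually follows A
            b, label = sentences[i + 1], "IsNext"
        else:
            # Negative example: any sentence other than A and its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            b, label = rng.choice(candidates), "NotNext"
        examples.append((a, b, label))
    return examples

doc = [
    "I enjoyed the movie transformers",
    "The visuals were amazing",
    "The plot was easy to follow",
    "I would watch it again",
]
for a, b, label in make_nsp_pairs(doc, seed=0):
    print(f"[CLS] {a} [SEP] {b} -> {label}")
```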
To indicate whether a token belongs to sentence A or sentence B, a separate learnable segment embedding is used in addition to the token and positional embeddings
[Figure: for the input "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing", the Token Embeddings, Segment Embeddings and Position Embeddings are added and passed to Self-Attention and the Feed Forward Network]
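How the three embeddings combine into the encoder input can be sketched as an element-wise sum per position. The sizes, the token ids and the helper names (`build_input_embeddings`, `random_table`) below are toy assumptions; random tables stand in for the learnable embedding matrices.

```python
import random

def build_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb):
    """Sum token, segment and position embeddings element-wise at each position."""
    inputs = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + s + p for t, s, p in zip(tok_emb[tok], seg_emb[seg], pos_emb[pos])]
        inputs.append(vec)
    return inputs

def random_table(rows, cols, rng):
    """Stand-in for a learnable embedding matrix (randomly initialised)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
H, V, T = 4, 10, 8                  # toy sizes, not BERT's 768 / 30,000 / 512
tok_emb = random_table(V, H, rng)   # one row per vocabulary entry
seg_emb = random_table(2, H, rng)   # one row each for sentence A and sentence B
pos_emb = random_table(T, H, rng)   # one row per position

token_ids = [0, 3, 5, 1, 7, 2, 1]       # hypothetical ids; id 1 appears twice
segment_ids = [0, 0, 0, 0, 1, 1, 1]     # 0 = sentence A, 1 = sentence B
x = build_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb)
print(len(x), len(x[0]))
```

The same token id at two different positions or in two different segments ends up with different input vectors, which is what lets the encoder tell the occurrences apart.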
[Figure: the masked input "[CLS] [mask] enjoyed the [mask] transformers [SEP] The [mask] were amazing" passed through a stack of Encoder Layers]
Minimize the objective:
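One way to write this objective is sketched below, assuming (as described in the BERT paper) that the total loss is the sum of the masked-word loss, averaged over the masked set \(\mathcal{M}\), and the NSP classification loss computed from the [CLS] representation. The symbols \(\tilde{x}\) (the masked input pair) and \(c\) (the IsNext/NotNext label) are notation introduced here for illustration.

\[
\mathcal{L} = -\frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \log P(y_i = x_i \mid \tilde{x}) \;-\; \log P(c \mid \tilde{x})
\]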
Pre-training
Dataset: BookCorpus (800M words), Wikipedia (2500M words)
Vocabulary size: 30,000
Context length (T): \(\leq\)512 tokens
[Figure: the input "[CLS] Masked Sentence A [SEP] Masked Sentence B [PAD]" fed to the Bidirectional Transformer Encoder]
Model Architecture:
Num. of layers: 12 (base) / 24 (large)
Hidden size: 768 / 1024
Num. of attention heads: 12 / 16
Intermediate size: 4H = 3072 / 4096
Parameter Calculation
Embedding Layer: Token Embeddings + Segment Embeddings + Position Embeddings
Total: 23.4M
For a Single Layer: Self-Attention (ignoring bias) + FFN
Total: \(\approx\) 7M
For 12 layers: \(\approx\) 84M
Total number of parameters: 23.4 + 84 = 107.4M* (110M in the paper)
*The actual vocabulary size is 30522; parameters for layer normalization are excluded
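The totals above can be reproduced with a short calculation. The breakdown is an assumption consistent with the stated sizes: four \(H \times H\) attention matrices (query, key, value and output projection) without biases, and a two-layer FFN with biases; layer normalization and any output head are left out.

```python
V, T, H, L = 30000, 512, 768, 12    # vocabulary, context length, hidden size, layers
SEGMENTS = 2                         # sentence A / sentence B
FF = 4 * H                           # intermediate size (3072)

# Embedding layer: token + segment + position tables
embedding = V * H + SEGMENTS * H + T * H

# One encoder layer
attention = 4 * H * H                # assumed: W_Q, W_K, W_V, W_O, no biases
ffn = (H * FF + FF) + (FF * H + H)   # assumed: two linear layers with biases
per_layer = attention + ffn

total = embedding + L * per_layer

print(f"embedding : {embedding / 1e6:.1f}M")
print(f"per layer : {per_layer / 1e6:.1f}M")
print(f"{L} layers : {L * per_layer / 1e6:.1f}M")
print(f"total     : {total / 1e6:.1f}M")
```

Under these assumptions the embedding layer comes to about 23.4M and each layer to about 7.1M. The unrounded total is about 108.4M, slightly above 107.4M because the figure above multiplies the rounded 7M by 12.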
Module 3.2 : Adapting to Downstream tasks
Two Approaches
BERT as feature extractor
Fine-tuning BERT
Classification: Feature Based
[Figure: the input "[CLS] Sentence A [SEP] Sentence B [PAD]" fed to the Bidirectional Transformer Encoders, followed by a classifier: Logistic Regression, Naive Bayes, NN, ..]
Take a sentence of length less than 512 and feed it as an input to BERT (with appropriate padding, if required)
Take the final hidden representation (output of the last encoder) as a feature vector for the entire sentence.
This representation is superior to merely concatenating representations (say from word2vec) of individual words
Finally, we can use any ML model (called the head), such as logistic regression, Naive Bayes or a neural network, for classification
All the parameters of BERT are frozen and only the classification head is trained from scratch.
Classification: Fine-Tuning
[Figure: BERT (the body) followed by a classification head: Logistic Regression, Naive Bayes, NN, ..]
Take a sentence of length less than 512 and feed it as an input to BERT (with appropriate padding, if required)
Add a classification head (again, it could be any suitable ML model)
Initialize the parameters of the classification head randomly.
Now, train the entire model, including the parameters of the pre-trained BERT, on the new dataset.
It is observed that, with fine-tuning, the classification head converges quickly and with fewer labelled training samples than in the feature-based approach
Note, however, that we do not mask words in the input sequence (this is the reason why we replaced 10% of the masked words with random words during pre-training)
Extractive Question-Answering
Pre-training: unlabelled data (BookCorpus, 800M words; Wikipedia, 2500M words), with the input "[CLS] Masked Sentence A [SEP] Masked Sentence B [PAD]" fed to the Bidirectional Transformer Encoder
Fine-tuning: labelled data
Paragraph: What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface.
Question: What is unique about the mission?
Answer : role of artificial intelligence (AI) in guiding the spacecraft
Starting token: 9
[Figure: the input "[CLS] what is ... mission [SEP] what sets ... surface" (the question followed by the paragraph) fed to the Bidirectional Transformer Encoder]
We need to make use of these final representations (\(h_1,h_2,\cdots,h_{25}\)) to find the start and end tokens.
One approach is to pose this as a classification problem
The classification head takes in all these final representations and returns, for each token, the probability of it being the start token or the end token
Let \(S\) denote a start vector of the same size as \(h_i\)
The probability that the \(i\)-th word in the paragraph is the start token is \(s_i = \frac{\exp(S \cdot h_i)}{\sum_{j} \exp(S \cdot h_j)}\)
Let \(E\) denote an end vector of the same size as \(h_i\)
The probability that the \(i\)-th word in the paragraph is the end token is \(e_i = \frac{\exp(E \cdot h_i)}{\sum_{j} \exp(E \cdot h_j)}\)
Both \(S\) and \(E\) are learnable parameters
[Figure: probability distribution for the start token over the paragraph "What sets this mission apart is ... the moon's surface."]
[Figure: probability distribution for the end token over the same paragraph]
Span: role of artificial intelligence (AI) in guiding the spacecraft
It is possible that the end token index is less than the start token index.
In that case, return an empty string (this implies that the answer is not in the paragraph).
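The span selection described above can be sketched as follows, assuming the start and end probabilities are a softmax over the dot products \(S \cdot h_i\) and \(E \cdot h_i\) (as in the BERT paper) and that the most probable start and end tokens are picked independently. The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def extract_span(tokens, hidden, S, E):
    """Pick the most probable start and end tokens; return the answer span as text."""
    start_probs = softmax([dot(S, h) for h in hidden])
    end_probs = softmax([dot(E, h) for h in hidden])
    start = max(range(len(tokens)), key=start_probs.__getitem__)
    end = max(range(len(tokens)), key=end_probs.__getitem__)
    if end < start:
        return ""   # treated as "the answer is not in the paragraph"
    return " ".join(tokens[start:end + 1])

# Toy final representations h_i (made up), one per paragraph token
tokens = ["the", "pivotal", "role", "of", "AI"]
hidden = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
S, E = [5.0, 0.0], [0.0, 5.0]           # stand-ins for the learned vectors
print(extract_span(tokens, hidden, S, E))
```

Swapping `S` and `E` in this toy example puts the predicted end before the predicted start, which triggers the empty-string case.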
Module 3.3 : BERT Variations
Instead of masking tokens randomly, mask a contiguous span of random tokens, which is more natural for downstream tasks like Question-Answering
[Figure: Bidirectional Transformer Encoder with the input "[GO] ... [mask] [mask] [mask] ... [EOS]"; the span boundary tokens are \(x_3\) and \(x_7\)]
What should be the span length?
It is a random variable that follows the Geometric distribution (which is skewed towards shorter spans)
The NSP objective is dropped. Instead, an auxiliary objective called the Span Boundary Objective (SBO) is added, which predicts the entire masked span using the representations of the (unmasked) span boundary tokens (\(x_3\) and \(x_7\), in this case)
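Sampling a span length from a geometric distribution and masking a contiguous span can be sketched as below. The success probability `p=0.2`, the cap `max_len=10` and the function names are assumed values for illustration; they are not stated above.

```python
import random

def sample_span_length(p=0.2, max_len=10, rng=random):
    """Sample a span length from a geometric distribution, clipped at max_len."""
    length = 1
    # Stop with probability p at each step, so shorter spans are more likely
    while rng.random() >= p and length < max_len:
        length += 1
    return length

def mask_span(tokens, start, length):
    """Replace a contiguous span of tokens with [mask]."""
    out = list(tokens)
    for i in range(start, min(start + length, len(out))):
        out[i] = "[mask]"
    return out

rng = random.Random(0)
tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
# Masking x4..x6 leaves x3 and x7 as the span boundary tokens
print(mask_span(tokens, start=3, length=3))
print([sample_span_length(rng=rng) for _ in range(10)])
```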
RoBERTa stands for Robustly Optimized BERT Pre-training Approach
It proposes a set of robust design choices for BERT (because training is an expensive process):
- training the model longer, with bigger batches (10 to 30 fold), over more data (almost 10-fold)
- removing the next sentence prediction objective
- training on longer sequences
- dynamically changing the masking pattern applied to the training data
The other aspect is to reduce the number of parameters in BERT using factorization of the embedding matrix, weight sharing and knowledge distillation (KD) techniques.
This is helpful for storing and running BERT on resource-constrained devices
The model size is reduced from 108M to 12M parameters, with about a 5 to 9 times speedup in inference time
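The effect of factorizing the embedding matrix can be illustrated with a small calculation. The vocabulary and hidden sizes are the BERT base values used earlier; the smaller embedding size `E = 128` is an assumed value chosen for illustration.

```python
V, H = 30000, 768    # vocabulary size and hidden size (BERT base)
E = 128              # assumed smaller embedding size (illustrative choice)

full = V * H                  # a single V x H token embedding matrix
factorized = V * E + E * H    # a V x E lookup followed by an E x H projection

print(f"full       : {full / 1e6:.2f}M")
print(f"factorized : {factorized / 1e6:.2f}M")
print(f"reduction  : {full / factorized:.1f}x")
```

The saving grows with the gap between `E` and `H`, since the large vocabulary dimension is multiplied only by the smaller embedding size.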
Instead of masking tokens randomly, replace some tokens at random with plausible alternatives sampled from a small generator network
The discriminator network is then tasked with distinguishing the replaced tokens from the real tokens
This improves performance with comparatively less compute, even without increasing the training data size.
The performance is comparable to that of RoBERTa and ALBERT (A Lite BERT)
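The discriminator's training targets in this setup can be sketched as one binary label per position. The function name `replaced_token_labels` and the example replacement are hypothetical; the generator itself is not modelled here.

```python
def replaced_token_labels(original, corrupted):
    """Label each position 1 if the token was replaced, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["I", "enjoyed", "the", "movie", "transformers"]
# Suppose a generator replaced "enjoyed" with the plausible alternative "watched"
corrupted = ["I", "watched", "the", "movie", "transformers"]
labels = replaced_token_labels(original, corrupted)
print(list(zip(corrupted, labels)))
```

Unlike MLM, where only the masked positions contribute to the loss, every position receives a label here.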