Encoder block (BERT-style masked language modelling): Multi-Head Attention → Add&Norm → Feed forward NN → Add&Norm
Input: \(x_1,<mask>,\cdots,x_{T}\); predicts \(P(x_2=?)\)
Decoder block (GPT-style causal language modelling): Multi-Head Masked Attention → Add&Norm → Feed forward NN → Add&Norm
Input: \(x_1,x_2,\cdots,x_{i-1}\); predicts \(P(x_i)\)
Encoder-decoder (BART-style denoising): encoder block (Multi-Head Attention → Add&Norm → Feed forward NN → Add&Norm) feeding a decoder block (Multi-Head Masked Attention → Add&Norm → Multi-Head Cross Attention → Add&Norm → Feed forward NN → Add&Norm)
Encoder input: \(x_1,<mask>,\cdots,x_{T}\); decoder input: \(<go>\); predicts \(P(x_2 \mid x_1)\)
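The only difference between the encoder's Multi-Head Attention and the decoder's Multi-Head Masked Attention is a mask applied to the attention scores. A minimal NumPy sketch (single head, toy sizes; the function name and dimensions are illustrative, not taken from any particular library):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; causal=True masks out future positions."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # (T, T) attention scores
    if causal:
        # position i may only attend to positions j <= i
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                   # contextual representations

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
enc_out = attention(x, x, x, causal=False)   # encoder-style: sees the whole sequence
dec_out = attention(x, x, x, causal=True)    # decoder-style: sees only the past
```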
Datasets used to evaluate BERT, GPT, and BART:
Dataset | Task |
---|---|
Stanford Sentiment Treebank (SST) | Sentiment classification |
SNLI | Natural language inference |
LAMBADA | Predict the last word of a long sentence |
GPT with 110 million parameters: Embedding Matrix → Transformer Block 1 → Transformer Block 2 → \(\cdots\) → Transformer Block 12
Zero-shot prompting: the task is specified in the input text itself, e.g. the input "Watched the transformer movie" followed by the prompt "TL;DR:" to elicit a summary.
GPT-2 with 1.5 billion parameters: Embedding Matrix → Transformer Block 1 → Transformer Block 2 → \(\cdots\) → Transformer Block 48
Compared with GPT — Layers: \(4\times\); Parameters: \(10\times\)
Zero-shot translation is possible because of the "naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set."
Parameters | Layers | \(d_{model}\) |
---|---|---|
117 M | 12 | 768 |
345 M | 24 | 1024 |
762 M | 36 | 1280 |
1542 M | 48 | 1600 |
Zero-shot evaluation: the pre-trained GPT-2 (1.5 billion parameters, Embedding Matrix → Transformer Blocks 1–48) is evaluated on each benchmark without any fine-tuning:
Dataset | Measure |
---|---|
Wiki-text | Perplexity (PPL) |
text8 | Bits per character (BPC) |
LAMBADA | Accuracy |
1BW | Perplexity (PPL) |
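Perplexity (PPL) and bits-per-character (BPC) are both simple transforms of the model's average cross-entropy loss. A small sketch of the conversion; the loss value and token/character counts below are made up for illustration, not actual GPT-2 results:

```python
import math

def perplexity(nll_per_token):
    """nll_per_token: average negative log-likelihood in nats per token."""
    return math.exp(nll_per_token)

def bits_per_character(nll_per_token, tokens, characters):
    """Convert an average loss in nats/token into bits/character for a corpus."""
    total_bits = nll_per_token * tokens / math.log(2)   # nats -> bits
    return total_bits / characters

# Illustrative numbers only:
nll = 2.9                                                # average loss in nats/token
print(perplexity(nll))                                   # ~18.2
print(bits_per_character(nll, tokens=1000, characters=4200))   # ~1.0
```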
CLM vs MLM; GPT (117M) vs BERT Large (336M):
GPT (117M): Input text → linear classification layer learned from scratch
BERT base (117M): Input text with a [CLS] token → the linear layer is fine-tuned
Dimensions along which the pre-trained models (encoder vs. decoder) are compared: pre-training objective (e.g. MLM), scale, pre-training data, fine-tuning setup, and hyper-parameters.
Text-to-Text Transfer Transformer (T5): a single encoder-decoder model handles every task as text-to-text, with a task-specific prefix prepended to the input context.
"translate English to Tamil: I enjoyed the movie" → "Naan padathai rasithen"
"summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi…" → "six people hospitalized after a storm in attala county."
"stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field." → "3.8"
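In this text-to-text framing every example becomes an (input string, target string) pair with the task prefix prepended. A minimal sketch mirroring the examples above; the helper name and task keys are illustrative, not the actual T5 preprocessing API:

```python
def to_text_to_text(task, **fields):
    """Cast a task example into an (input, target) pair of plain strings."""
    if task == "translate_en_ta":
        return (f"translate English to Tamil: {fields['source']}", fields["target"])
    if task == "summarize":
        return (f"summarize: {fields['document']}", fields["summary"])
    if task == "stsb":
        inp = (f"stsb sentence1: {fields['sentence1']} "
               f"sentence2: {fields['sentence2']}")
        return (inp, f"{fields['score']:.1f}")   # regression target as a string
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("translate_en_ta",
                      source="I enjoyed the movie",
                      target="Naan padathai rasithen"))
print(to_text_to_text("stsb",
                      sentence1="The rhino grazed on the grass.",
                      sentence2="A rhino is grazing in a field.",
                      score=3.8))
```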
Pre-training data (C4 dataset): 750 GB, 300M documents, 156B tokens; characterised by its top 25 domains and top 25 websites.
T5 pre-training schematic: encoder (Multi-Head Attention → Add&Norm → Feed forward NN → Add&Norm) and decoder (Multi-Head Masked Attention → Add&Norm → Multi-Head Cross Attention → Add&Norm → Feed forward NN → Add&Norm).
Encoder input: \(I,<mask>, the, movie\); decoder input: \(<go>\); target: enjoyed
Original Text: What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface.
Input Text: What sets this <w> is the pivotal role of artificial <x> (AI) in guiding the spacecraft during its <y> to the moon's surface.
Target: <w> mission apart <x> intelligence <y> critical descent <z>
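A minimal sketch of how such an input/target pair can be built from a tokenized text and a list of spans to corrupt; the function and sentinel names are illustrative, not T5's actual implementation:

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>", "<W>")):
    """Replace each (start, end) span with a sentinel in the input;
    the target lists each sentinel followed by the tokens it hides."""
    inp, tgt = [], []
    prev = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inp.extend(tokens[prev:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])
        prev = end
    inp.extend(tokens[prev:])
    tgt.append(sentinels[len(spans)])      # closing sentinel ends the target
    return " ".join(inp), " ".join(tgt)

text = ("What sets this mission apart is the pivotal role of artificial "
        "intelligence (AI) in guiding the spacecraft during its critical "
        "descent to the moon's surface.").split()
# corrupt "mission apart", "intelligence", "critical descent"
inp, tgt = corrupt_spans(text, [(3, 5), (11, 12), (19, 21)])
print(inp)
print(tgt)
```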
Baseline: pre-train a \(BERT_{base}\)-like encoder-decoder with 220M parameters on the C4 dataset using the denoising objective for \(2^{19}\) steps (\(\approx 34B\) tokens), then fine-tune separately on each downstream task (GLUE, SuperGLUE, CNN/DM, SQuAD, WMT16 EnRo) for \(2^{18}\) steps* (\(\approx 17B\) tokens).
* During fine-tuning we use the same number of tokens, but keep \(B=512\), \(T=128\).
GLUE and SuperGLUE: collections of tasks (datasets) to train and test natural language understanding, e.g. CoLA, SST, STSB, MNLI, RTE.
CNN/DM: for text summarization. SQuAD: for question answering. WMT16 EnRo: for translation.
For GLUE and SuperGLUE we report an average score (as defined by the authors of GLUE).
* The number of fine-tuning steps is a trade-off between high-resource tasks (which benefit from longer fine-tuning) and low-resource tasks (which overfit).
Fine-tuning learning rate: \(\eta=0.001\)
Choose the best model parameters for reporting the performance, e.g. among checkpoints at steps 780000, 790000, 800000 and 810000. This setup denotes the baseline model in the subsequent slides.
Schematics of the attention mask patterns (dark cells: self-attention allowed; light cells: self-attention not allowed) and the architectural variants compared:
Encoder-decoder: 12 × encoder + 12 × decoder (also with a shared-parameter variant, and a smaller 6 × encoder + 6 × decoder variant)
Language model: 12 × decoder
Prefix LM: 12 × decoder
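The three mask patterns in the schematic can be written down as boolean matrices in which True corresponds to a dark cell (self-attention allowed). A small NumPy sketch; the sequence length and prefix length are arbitrary illustrative choices:

```python
import numpy as np

def fully_visible(T):
    """Encoder-style: every position attends to every position."""
    return np.ones((T, T), dtype=bool)

def causal(T):
    """Language-model-style: position i attends only to positions j <= i."""
    return np.tril(np.ones((T, T), dtype=bool))

def prefix_lm(T, prefix_len):
    """Prefix LM: fully visible over the prefix, causal afterwards."""
    mask = causal(T)
    mask[:, :prefix_len] = True          # every position can see the whole prefix
    return mask

T = 6
print(fully_visible(T).astype(int))
print(causal(T).astype(int))
print(prefix_lm(T, prefix_len=3).astype(int))
```

In the encoder-decoder variant the encoder uses the fully-visible pattern and the decoder the causal one (plus cross-attention to the encoder outputs).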
High-level objectives compared: Language modelling, BERT-style, Deshuffling.
Objective | Inputs | Targets |
---|---|---|
Prefix language modeling | Thank you for inviting | me to your party last week . |
BERT-style* | Thank you for <M> me to <M> party apple week . | (original text) |
Deshuffling | party me for your to . last fun you inviting week Thank | (original text) |
MASS-style (Masked Seq-to-Seq pretraining)** | Thank you <M> <M> me to your party <M> week . | (original text) |
Random spans | Thank you <X> to <Y> week . | <X> for inviting me <Y> your party last <Z> |
* This is slightly different from the original BERT objective, where only the masked tokens are predicted.
** MASS-style is similar to BERT-style but without random replacement of tokens.
The objective search proceeds hierarchically:
High-level objectives: Language modelling, BERT-style (e.g. Thank you <M> <M> me to your party apple week .), Deshuffling
Corruption strategies: Mask, Replace tokens (spans) (e.g. Thank you <X> to <Y> week . → <X> for inviting me <Y> your party last <Z>), Drop
Corruption rate: 10%, 15%, 25%, 50%
Corruption rate | Inputs | Targets |
---|---|---|
10% | Thank you for inviting <X> to your party last week . | <X> me <Z> |
15% | Thank you for inviting <X> to your party <Y> week . | <X> me <Y> last <Z> |
25% | Thank you for inviting <X> to your <Y> week . | <X> me <Y> party last <Z> |
50% | Thank you <X> to your <Y> week . | <X> for inviting me <Y> party last <Z> |
Corrupted span length: 2, 3, 5, 10
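For a given corruption rate and mean span length, the number of corrupted tokens and the number of spans follow by simple arithmetic. A quick sketch, assuming a 512-token input purely for illustration:

```python
def corruption_budget(seq_len, corruption_rate, mean_span_len):
    """How many tokens get corrupted, and across how many spans."""
    num_corrupted = round(seq_len * corruption_rate)
    num_spans = round(num_corrupted / mean_span_len)
    return num_corrupted, num_spans

for rate in (0.10, 0.15, 0.25, 0.50):
    print(rate, corruption_budget(512, rate, mean_span_len=3))
# e.g. 15% corruption with mean span length 3 on a 512-token input
# corrupts about 77 tokens spread over about 26 spans.
```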
T5 (Enc-Dec) on downstream tasks such as the Stanford Sentiment Treebank (SST) and SNLI.
" translate English to Tamil: I enjoyed the movie"
T5:
Encoder-Decoder Model
"summarize: state authorities
dispatched emergency crews tuesday to
survey the damage after an onslaught
of severe weather in mississippi…"
""stsb sentence1: The rhino grazed
on the grass. sentence2: A rhino
is grazing in a field.""
"3.8"
"six people hospitalized after
a storm in attala county."
" Naan padathai rasithen"
Multi-task training: instead of pre-training on C4 alone and then fine-tuning, train a single model on a mixture of the C4 denoising task and the supervised tasks (GLUE, SGLUE, CNN/DM, SQuAD, WMT16 EnRo), drawing examples for each task \(m\) through a sampler with mixing rate \(r_m\).
When training on the multi-task mixture (C4 denoising + GLUE, SGLUE, CNN/DM, SQuAD, WMT16 EnRo), take the best checkpoint for each task when reporting its performance.
Multi-task pre-training followed by fine-tuning: pre-train on the examples-proportional (EP) mixture of C4 and the supervised tasks (GLUE, SGLUE, CNN/DM, SQuAD, WMT16 EnRo), then fine-tune on each individual task (e.g. SQuAD, CNN/DM) before reporting its performance.
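With examples-proportional (EP) mixing, the sampling rate for task \(m\) is \(r_m = \min(e_m, K) / \sum_n \min(e_n, K)\), where \(e_m\) is the number of examples in task \(m\) and \(K\) is an artificial size limit that stops the huge unsupervised set from dominating. A sketch; the dataset sizes and the value of \(K\) below are illustrative, not the exact ones used:

```python
def examples_proportional_rates(sizes, K):
    """Mixing rate r_m = min(e_m, K) / sum_n min(e_n, K)."""
    capped = {name: min(n, K) for name, n in sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}

# Illustrative dataset sizes (number of examples), not the real ones:
sizes = {"C4": 300_000_000, "GLUE": 950_000, "SGLUE": 120_000,
         "CNN/DM": 290_000, "SQuAD": 88_000, "WMT16 EnRo": 610_000}
rates = examples_proportional_rates(sizes, K=2**19)
for name, r in rates.items():
    print(f"{name:12s} {r:.3f}")   # without the cap K, C4 would dominate the mixture
```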
Aggregate comparison of scaling strategies against the baseline: train for more steps, increase the batch size (\(128 \rightarrow 512\)), and use \(2\times\) or \(4\times\) larger models.
Model | Parameters | Layers | \(d_{model}\) | \(d_{ff}\) | \(d_{kv}\) | Heads |
---|---|---|---|---|---|---|
Small | 60M | 6 | 512 | 2048 | 64 | 8 |
Base | 220M | 12 | 768 | 3072 | 64 | 12 |
Large | 770M | 24 | 1024 | 4096 | 64 | 16 |
3B | 3B | 24 | 1024 | 16384 | 128 | 32 |
11B | 11B | 24 | 1024 | 65536 | 128 | 128 |
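The parameter counts in the table can be roughly reproduced from the other columns. A back-of-the-envelope sketch for an encoder-decoder Transformer; the vocabulary size (32,128) and the assumption of tied embeddings with no biases are mine, so real counts differ slightly:

```python
def t5_like_params(layers, d_model, d_ff, d_kv, heads, vocab=32128):
    """Rough parameter count for an encoder-decoder with tied embeddings."""
    attn = 3 * d_model * d_kv * heads + d_kv * heads * d_model  # Q, K, V + output proj
    ffn = 2 * d_model * d_ff                                    # two linear layers
    enc_layer = attn + ffn              # self-attention + feed-forward
    dec_layer = 2 * attn + ffn          # masked self-attention + cross-attention + FF
    embeddings = vocab * d_model        # shared input/output embedding matrix
    return layers * (enc_layer + dec_layer) + embeddings

# Base row of the table: 12 layers, d_model=768, d_ff=3072, d_kv=64, 12 heads
print(t5_like_params(12, 768, 3072, 64, 12))   # ~2.2e8, consistent with 220M
```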