[Figure: The encoder-decoder transformer for machine translation. The encoder (Multi-Head Attention, Add&Norm, Feed-Forward NN, Add&Norm) takes text in the source language; the decoder (Masked Multi-Head Attention, Add&Norm, Multi-Head Cross-Attention, Add&Norm, Feed-Forward NN, Add&Norm) produces the translated text in the target language.]
[Figure: The same model applied to different downstream tasks: input text → predict the class/sentiment; input text → summarize; question → answer.]
" Wow, India has now reached the moon"
An excerpt from business today "What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface."
He likes to stay
He likes to stray
He likes to sway
Language Modeling (Pre-training): raw text
Downstream tasks (Fine-tuning): samples and labels
a. An apple ate I
b. I ate an apple
c. I ate apple
d. an apple
e. ....
Definition
a. I enjoyed reading a book
b. I enjoyed reading a thermometer
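The definition these examples point to is the standard autoregressive factorization of the sequence probability (a well-known identity, stated here for completeness):
\[
P(x_1,x_2,\cdots,x_T) \;=\; \prod_{i=1}^{T} P(x_i \mid x_1,x_2,\cdots,x_{i-1}),
\]
so a good language model should assign \(P(\text{book} \mid \text{I enjoyed reading a}) > P(\text{thermometer} \mid \text{I enjoyed reading a})\), and, among the candidate sentences above, a higher probability to "I ate an apple" than to "An apple ate I".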
[Figure: Two language-modelling objectives. Causal language modelling predicts \(P(x_i)\) from \(x_1,x_2,\cdots,x_{i-1}\); masked language modelling predicts \(P(<mask>)\) from \(x_1,<mask>,\cdots,x_{T}\). Each is shown with the transformer sub-layers that implement it (multi-head attention / multi-head masked attention, Add&Norm, feed-forward NN).]
[Figure: The full encoder-decoder transformer (multi-head attention, multi-head cross-attention, masked multi-head attention, feed-forward NN, Add&Norm layers) annotated with the language-modelling inputs \(x_1,x_2,\cdots,x_{i-1}\) and \(x_1,<mask>,\cdots,x_{T}\), the \(<go>\) token, and the outputs \(P(x_i)\) and \(P(<mask>)\).]
[Figure: The transformer decoder block (masked multi-head self-attention, multi-head cross-attention, feed-forward network) alongside the decoder-only block used in GPT (masked multi-head self-attention, feed-forward network), stacked as Transformer Block 1–5.]
where \(h_n[i]\) is the \(i\)-th output vector of the \(n\)-th transformer block \(h_n\).
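A minimal sketch (not the actual GPT code) of how the stacked blocks are used: token and position embeddings give \(h_0\), each block maps \(h_{n-1}\) to \(h_n\), and \(h_n[i]\) is projected back onto the vocabulary to get next-token probabilities. `W_e`, `W_p`, and the `blocks` callables are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gpt_forward(token_ids, W_e, W_p, blocks):
    """token_ids: length-T list of ints; W_e: (V, d) token embeddings;
    W_p: (T_max, d) position embeddings; blocks: list of transformer-block callables."""
    T = len(token_ids)
    h = W_e[token_ids] + W_p[:T]      # h_0: token + position embeddings
    for block in blocks:              # h_n = transformer_block(h_{n-1})
        h = block(h)
    logits = h @ W_e.T                # project h_n[i] back onto the vocabulary (weight tying)
    return softmax(logits)            # row i approximates P(x_{i+1} | x_1, ..., x_i)

# Toy usage with identity blocks, for shapes only.
rng = np.random.default_rng(0)
V, d, n_blocks = 1000, 64, 12
W_e, W_p = rng.normal(size=(V, d)), rng.normal(size=(512, d))
print(gpt_forward([1, 5, 42], W_e, W_p, [lambda h: h] * n_blocks).shape)  # (3, 1000)
```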
[Figure: Pre-training GPT on BookCorpus: an input token sequence such as (<go>, at, the, bell, labs, hamming, bound, ..., new, a, devising, ..., <stop>) is fed through the stack of transformer blocks (Transformer Block 1, 2, 3, ..., 12).]
[Figure: Inside a single transformer block: multi-head masked attention (the per-head outputs are concatenated and passed through a linear layer), followed by a residual connection and layer norm, then a feed-forward neural network with another residual connection and layer norm, applied to the same token sequence (<go>, at, the, bell, labs, hamming, ...).]
[Figure: The GPT architecture: an embedding matrix followed by a stack of 12 transformer blocks (Transformer Block 1, 2, 3, ..., 12).]
Layer | Parameters (in Millions) |
---|---|
Embedding Layer | |
Attention layers | |
FFN Layers | |
Total | |
*Without rounding the number of parameters in each layer
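As a sanity check on the table, the counts can be approximated with a few lines of arithmetic. The configuration below (vocabulary of 40,478 BPE tokens, 512 positions, \(d_{model}=768\), 12 blocks, FFN width 3072) is the commonly reported GPT-1 setup, and layer-norm parameters are ignored, so treat the numbers as approximate rather than official.

```python
# Approximate GPT-1 parameter counts under the assumed configuration above.
vocab, positions, d_model, d_ff, n_blocks = 40478, 512, 768, 3072, 12

embedding = (vocab + positions) * d_model                      # token + position embeddings
attention = n_blocks * (4 * d_model * d_model + 4 * d_model)   # W_Q, W_K, W_V, W_O and their biases
ffn = n_blocks * (2 * d_model * d_ff + d_ff + d_model)         # two linear layers and biases

for name, n in [("Embedding Layer", embedding), ("Attention layers", attention),
                ("FFN Layers", ffn), ("Total", embedding + attention + ffn)]:
    print(f"{name:16s} {n/1e6:7.2f} M")
# Total comes out to roughly 117 M parameters, the usual figure quoted for GPT-1.
```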
[Figure: The pre-trained stack (embedding matrix followed by Transformer Blocks 1, 2, 3, ..., 12), repeated across the fine-tuning illustrations for the downstream tasks.]
[Figure: Fine-tuning GPT for multiple-choice question answering: each (Question, Choice-k) pair is passed through the shared embedding matrix and Transformer Blocks 1–12, a per-choice linear head (Linear-1, Linear-2, ...) scores it, and the scores are compared across the choices.]
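A minimal sketch of the multiple-choice head in the figure above, assuming a generic `gpt` callable that returns final hidden states and a hypothetical linear scoring vector `w`; each (question, choice) pair goes through the shared stack and the per-choice scores are compared with a softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def choose(gpt, w, question_ids, choices_ids):
    """Score each (question, choice) pair with the shared GPT stack plus a linear head."""
    scores = []
    for choice_ids in choices_ids:
        h = gpt(question_ids + choice_ids)   # (T, d) hidden states for this pair
        scores.append(float(h[-1] @ w))      # linear head on the last position
    probs = softmax(np.array(scores))        # distribution over the choices
    return int(np.argmax(probs)), probs
```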
Input:
Output:
Stopping Criteria:
Something to ponder about
GPT and its versions are pre-trained on a large corpus; however, what if we want the model to be domain-specific, say for finance?
Should we pre-train GPT on a new text corpus specific to finance, or directly fine-tune GPT on supervised tasks related to finance?
Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be domain-specific: take a language model trained over a very large corpus and then fine-tune it on a news dataset or on scientific papers, e.g. LysandreJik/arxiv-nlp.
Decoding the most likely output sequence involves searching over all possible output sequences by their likelihood.
For an output of length \(T\) there are \(|\mathcal{V}|^T\) candidates, so the search problem is exponential in the length of the output sequence and exhaustive search is intractable.
[Figure: Candidate output sequences, e.g. "I like cold water", "I like cold coffee", "coffee like cold coffee", "coffee coffee coffee coffee", "I like I like", "I like coffee", ...]
Outputs*: "I like cold water", "I like cold coffee"
* Assuming these sequences have the highest probability among all \(|\mathcal{V}|^5\) sequences
Next-token distributions at each time step (columns are time steps; each column sums to 1):

token | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
cold | 0.10 | 0.10 | 0.45 | 0.15 | 0.10 |
<stop> | 0.15 | 0.15 | 0.05 | 0.30 | 0.50 |
coffee | 0.25 | 0.25 | 0.10 | 0.35 | 0.20 |
I | 0.40 | 0.05 | 0.05 | 0.01 | 0.10 |
like | 0.05 | 0.35 | 0.15 | 0.09 | 0.05 |
water | 0.05 | 0.10 | 0.20 | 0.10 | 0.05 |

Selected sequence: I like cold coffee <stop> (the most probable token at each time step)
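A minimal sketch of greedy search over the table above; the five distributions are hard-coded from the table, whereas in a real model each one would be produced by the network conditioned on the tokens chosen so far.

```python
import numpy as np

vocab = ["cold", "<stop>", "coffee", "I", "like", "water"]
# Columns t=1..5 of the table above; each row here is one time step.
P = np.array([
    [0.10, 0.15, 0.25, 0.40, 0.05, 0.05],
    [0.10, 0.15, 0.25, 0.05, 0.35, 0.10],
    [0.45, 0.05, 0.10, 0.05, 0.15, 0.20],
    [0.15, 0.30, 0.35, 0.01, 0.09, 0.10],
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],
])

# Greedy search: take the most probable token at every time step.
print(" ".join(vocab[int(np.argmax(p))] for p in P))   # I like cold coffee <stop>
```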
Next-token distributions at each time step for a second example (columns are time steps; each column sums to 1):

token | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
cold | 0.10 | 0.65 | 0.04 | 0.10 | 0.15 |
<stop> | 0.15 | 0.05 | 0.01 | 0.50 | 0.10 |
coffee | 0.25 | 0.05 | 0.10 | 0.20 | 0.05 |
I | 0.40 | 0.05 | 0.03 | 0.10 | 0.05 |
like | 0.05 | 0.10 | 0.02 | 0.05 | 0.55 |
water | 0.05 | 0.10 | 0.80 | 0.05 | 0.10 |

Selected sequence: coffee like cold water <stop>
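For comparison, a minimal sketch of plain (ancestral) sampling: at each step a token is drawn from the full next-token distribution instead of taking the argmax, so low-probability tokens like the ones in the selected sequence above can be picked.

```python
import numpy as np

vocab = ["cold", "<stop>", "coffee", "I", "like", "water"]
rng = np.random.default_rng(0)

def sample_token(probs):
    """Draw a token index from the full next-token distribution."""
    return int(rng.choice(len(vocab), p=probs))

# Sampling at t=1 from the first column of the table above.
p_t1 = [0.10, 0.15, 0.25, 0.40, 0.05, 0.05]
print(vocab[sample_token(p_t1)])   # e.g. "I" (p=0.40) or "coffee" (p=0.25), depending on the draw
```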
Next-token distributions at each time step for the top-k sampling example (columns are time steps):

token | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
cold | 0.10 | 0.65 | 0.04 | 0.10 | 0.10 |
<stop> | 0.15 | 0.05 | 0.01 | 0.50 | 0.15 |
coffee | 0.25 | 0.05 | 0.10 | 0.20 | 0.05 |
I | 0.40 | 0.05 | 0.03 | 0.10 | 0.05 |
like | 0.05 | 0.09 | 0.02 | 0.05 | 0.55 |
water | 0.05 | 0.11 | 0.80 | 0.05 | 0.10 |
The probabilities of the top-k tokens are renormalized before sampling a token, e.g. for \(k=2\) at the first time step: \(P(I)=0.61, P(coffee)=0.39\).
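A minimal sketch of top-k sampling that reproduces the renormalization above: with \(k=2\) at the first time step, the top two tokens are I (0.40) and coffee (0.25), which renormalize to roughly 0.61 and 0.39.

```python
import numpy as np

vocab = ["cold", "<stop>", "coffee", "I", "like", "water"]

def top_k_sample(probs, k, rng):
    probs = np.asarray(probs)
    top = np.argsort(probs)[-k:]          # indices of the k most probable tokens
    p = probs[top] / probs[top].sum()     # renormalize over the top-k tokens
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(0)
p_t1 = [0.10, 0.15, 0.25, 0.40, 0.05, 0.05]
# Renormalized top-2 probabilities: P(I) = 0.40/0.65 ≈ 0.61, P(coffee) = 0.25/0.65 ≈ 0.39
print(vocab[top_k_sample(p_t1, k=2, rng=rng)])
```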
Sampled sequence: coffee like cold coffee <stop>
[Figure: Two contexts with very different next-token distributions. After she said , " I many continuations are plausible (never, thought, knew, had, saw, said, could, meant, ...), giving a flat distribution; after I ate the pizza while it was still only a few are (hot, cooling, warm, on, heating, going), giving a peaked distribution.]
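A minimal sketch of top-p (nucleus) sampling, which adapts the candidate set to the shape of the distribution instead of using a fixed \(k\): a flat distribution (as after she said , " I) keeps many tokens, a peaked one (as after I ate the pizza while it was still) keeps only a few. The two toy distributions below are made up for illustration.

```python
import numpy as np

def top_p_nucleus(probs, p):
    """Return the indices of the smallest set of tokens whose total probability >= p."""
    probs = np.asarray(probs)
    order = np.argsort(probs)[::-1]                 # most to least probable
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    return order[:cutoff]

def top_p_sample(probs, p, rng):
    keep = top_p_nucleus(probs, p)
    q = np.asarray(probs)[keep]
    return int(rng.choice(keep, p=q / q.sum()))     # sample from the renormalized nucleus

rng = np.random.default_rng(0)
flat = [0.16, 0.15, 0.15, 0.12, 0.11, 0.11, 0.10, 0.10]      # many plausible continuations
peaked = [0.75, 0.12, 0.05, 0.03, 0.03, 0.01, 0.005, 0.005]  # one dominant continuation
for name, probs in (("flat", flat), ("peaked", peaked)):
    print(name, "nucleus size:", len(top_p_nucleus(probs, 0.9)),
          "sample:", top_p_sample(probs, 0.9, rng))
```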
The various approaches that we have discussed so far focus on generating diverse, high-quality sentences, but not on reducing the time complexity (or latency) of generation.
All tokens are generated sequentially in autoregressive sampling. Therefore, the time complexity to generate \(N\) tokens is \(O(N \cdot t_{model})\).
This becomes a significant performance bottleneck for larger models.
That is what the speculative decoding strategy tries to address.
Assume a large language model with 500+ billion parameters.
Even assigning just 1 byte of memory per parameter requires 530 GB of GPU memory.
This requires us to distribute the model across multiple GPUs (model parallelism), which adds communication overhead to the final latency.
To generate a single token, we need to carry out at least 500+ billion operations.
Moreover, the sampling process is sequential.
(Somehow) parallelizing the process would reduce the latency. How? Let's see.
Suppose we want to sample \(\tilde{x}\) from a complex distribution \(q(x)\), called the target distribution.
Instead of directly sampling from the complex distribution \(q(x)\), we sample \(\tilde{x}\) from a simple distribution \(p(x)\), called the draft distribution.
We accept or reject the sample \(\tilde{x}\) based on the following rule:
\[
\mathbb{I} = \begin{cases} 1 & \text{if } r < \dfrac{q(\tilde{x})}{c\, p(\tilde{x})} \\ 0 & \text{otherwise} \end{cases}
\]
where \(\mathbb{I} \in \{0,1\}\) is an indicator variable, \(r \sim U(0,1)\), and \(c\) is a constant chosen so that \(c\,p(x) \ge q(x)\) for all \(x\). The sample is accepted if \(\mathbb{I}=1\).
Note that a lot of samples get rejected!
Out of 100 samples, only 15 were accepted: an acceptance rate of 0.15, i.e., 15 percent.
Increasing the efficiency requires a reasonably good draft distribution, one close to the target distribution!
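A minimal sketch of the accept/reject rule above on a toy discrete example (the distributions and \(c\) below are made up); the expected acceptance rate is \(1/c\), which is what makes a poor draft distribution so wasteful.

```python
import numpy as np

rng = np.random.default_rng(0)

q = np.array([0.50, 0.30, 0.15, 0.05])   # target distribution (assumed hard to sample from)
p = np.array([0.25, 0.25, 0.25, 0.25])   # draft distribution (easy to sample from)
c = float(np.max(q / p))                 # smallest c with c*p(x) >= q(x); here c = 2.0

accepted = 0
n = 100
for _ in range(n):
    x = rng.choice(len(p), p=p)          # sample from the draft distribution
    r = rng.uniform()                    # r ~ U(0, 1)
    if r < q[x] / (c * p[x]):            # accept with probability q(x) / (c * p(x))
        accepted += 1
print(f"accepted {accepted}/{n} samples (expected rate 1/c = {1/c:.2f})")
```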
We apply the same idea here.
Use a smaller but faster model (called the draft model, e.g. 7B parameters) that speculates (looks ahead) \(K\) future (candidate) tokens in an autoregressive manner, using greedy search for example.
[Figure: Given the prefix "The more you learn", the draft model proposes the candidate tokens: the, more, you, can, do.]
Use a larger but slower model (called the target model, e.g. 70B parameters) that accepts or rejects a subset of the \(K\) tokens using some rejection criterion.
In parallel, compute the target-model probabilities for all drafted prefixes:
[Figure: For the prefixes "The more you learn", "The more you learn the", "The more you learn the more", ..., "The more you learn the more you can", "The more you learn the more you can do", the target model computes \(q(\text{the}\mid\cdot)\), \(q(\text{more}\mid\cdot)\), \(q(\text{you}\mid\cdot)\), \(q(\text{can}\mid\cdot)\), \(q(\text{do}\mid\cdot)\), \(q(\cdot)\) in a single pass.]
Now we have probabilities from both the draft model \(p(\cdot \mid \cdot)\) and the target model \(q(\cdot \mid \cdot)\).
Remember that the probabilities for the \(K\) tokens were computed in parallel by the target model \(q(\cdot \mid \cdot)\).
for \(t=1:K\):  (here \(x\) denotes the \(t\)-th drafted token)
    sample \(r \sim U(0,1)\)
    if \(r < \min\big(1,\frac{q(x \mid x_1,\cdots,x_{n+t-1})}{p(x \mid x_1,\cdots,x_{n+t-1})}\big)\):
        accept the token: \(x_{n+t} \leftarrow x\)
    else:
        resample from the modified (normalized) distribution
        \(x_{n+t} \sim \big(q(x \mid x_1,\cdots,x_{n+t-1}) - p(x \mid x_1,\cdots,x_{n+t-1})\big)_+\)
        exit the for loop
If all \(K\) tokens are accepted, sample the \((K+1)\)-th token from \(q(\cdot \mid \cdot)\).
At each iteration, the procedure therefore emits at least 1 token (if the first candidate is rejected) and at most \(K+1\) tokens (if all are accepted).
It reduces the time complexity from \(O(N \cdot t_{target})\) to, in the best case, \(O\big(\frac{N}{K+1} \cdot (K\,t_{draft}+t_{target})\big)\).
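A minimal sketch of the accept/resample loop above. It assumes `draft_tokens` holds the \(K\) candidate tokens, and that `draft_probs[t]` and `target_probs[t]` are the full next-token distributions \(p(\cdot\mid\cdot)\) and \(q(\cdot\mid\cdot)\) for the \(t\)-th drafted position (0-indexed), with `target_probs` containing one extra entry for the bonus \((K+1)\)-th token.

```python
import numpy as np

def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """Accept a prefix of the K drafted tokens; on the first rejection, resample
    that position from norm((q - p)_+); if all are accepted, draw one extra token
    from the target distribution. Returns the tokens emitted this iteration."""
    out = []
    for t, x in enumerate(draft_tokens):
        p, q = draft_probs[t], target_probs[t]
        if rng.uniform() < min(1.0, q[x] / p[x]):        # accept with prob min(1, q/p)
            out.append(int(x))
        else:
            residual = np.maximum(q - p, 0.0)            # (q - p)_+
            out.append(int(rng.choice(len(q), p=residual / residual.sum())))
            return out                                    # stop at the first rejection
    q_extra = target_probs[len(draft_tokens)]            # all accepted: bonus (K+1)-th token
    out.append(int(rng.choice(len(q_extra), p=q_extra)))
    return out
```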
GPT-XL (expert) and GPT-small (amateur)
Prompt: Barack Obama was born in Honolulu, Hawaii. He was born in

GPT-XL (expert) next-token probabilities:
0.27 | Hawaii |
0.18 | the |
0.16 | Honolulu |
0.10 | 1961 |
0.02 | Washington |

GPT-small (amateur) next-token probabilities:
0.08 | Honolulu |
0.04 | washington |
0.04 | the |
0.001 | 1961 |

Greedy search: Barack Obama was born in Honolulu, Hawaii. He was born in Hawaii
Top-k/Top-p: Barack Obama was born in Honolulu, Hawaii. He was born in Washington
We are facing a problem of repetition and incoherence!
Consider only plausible tokens from the expert model using a plausibility constraint.
Contrastive Decoding (with the same expert and amateur probabilities as above):
4.6 | 1961 |
2.34 | Hawaii |
0.65 | Honolulu |
-0.73 | Washington |

Contrastive decoding: Barack Obama was born in Honolulu, Hawaii. He was born in 1961
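A minimal sketch of the contrastive score using the probabilities shown above; the score computed here is the log-ratio \(\log p_{expert}(x) - \log p_{amateur}(x)\) for the tokens whose amateur probability is listed (Hawaii's is not, so its score cannot be reproduced), and the small differences from the table (e.g. 0.69 vs 0.65) presumably come from rounding of the displayed probabilities. The plausibility constraint would additionally restrict this to tokens the expert itself rates sufficiently likely.

```python
import math

# Expert (GPT-XL) and amateur (GPT-small) next-token probabilities from the tables above.
expert = {"Hawaii": 0.27, "the": 0.18, "Honolulu": 0.16, "1961": 0.10, "Washington": 0.02}
amateur = {"Honolulu": 0.08, "Washington": 0.04, "the": 0.04, "1961": 0.001}

# Contrastive score: log p_expert(x) - log p_amateur(x).
scores = {tok: math.log(expert[tok]) - math.log(amateur[tok]) for tok in amateur}
for tok, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{s:6.2f}  {tok}")
# "1961" scores highest (~4.6), so contrastive decoding continues the prompt with "1961"
# instead of repeating "Hawaii"/"Honolulu" or drifting to "Washington".
```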
Controlled text generation summary: https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/