Introduction to Large Language Models
Position Encoding Methods and Length Generalization
Mitesh M. Khapra, Arun Prakash A
AI4Bharat, Department of Computer Science and Engineering, IIT Madras
Absolute Positional Encoding
Positional encoding helps to capture the order of tokens in the input sequences
One approach is to encode the absolute positions
We have done this by adding the position vector \(p_i\) to the word embedding \(x_i\), where the values of the position vector are fixed and come from a sinusoidal function
[Figure: the sentence "somewhere something incredible is waiting to be known" with absolute positions 0 to 7 assigned to its tokens]
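The addition of a fixed position vector to each word embedding can be sketched as follows. This is a minimal illustration, assuming the sinusoidal scheme of the vanilla Transformer with base 10000; the function name `sinusoidal_positions`, the random stand-in embeddings `X` and the sizes are hypothetical choices made only for this example.

```python
import numpy as np

def sinusoidal_positions(T: int, d_model: int) -> np.ndarray:
    """Return a (T, d_model) matrix of fixed sinusoidal position vectors (d_model even)."""
    positions = np.arange(T)[:, None]                          # (T, 1)
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)   # (d_model / 2,)
    angles = positions * freqs                                 # (T, d_model / 2)
    P = np.zeros((T, d_model))
    P[:, 0::2] = np.sin(angles)   # even dimensions
    P[:, 1::2] = np.cos(angles)   # odd dimensions
    return P

T, d_model = 8, 16                       # 8 tokens, as in the example sentence
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))        # stand-in word embeddings x_i (assumption)
P = sinusoidal_positions(T, d_model)     # position vectors p_i
H = X + P                                # input to the first layer: x_i + p_i
```

Because `P` is computed from a formula rather than learned, it can be evaluated for any position, which is what "fixed (non-parametric)" refers to.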
Question Everything
Recall: one approach is to encode the absolute positions by adding the position vector \(p_i\) to the word embedding \(x_i\), where the values of the position vector are fixed
Why not relative positions?
Why not multiply?
Why not learn it?
[Figure: absolute positions 0, 1, 2, 3 versus relative positions -2, -1, 0, 1]
Why not add PE to the attention score?
It is still an active area of research (alongside attention mechanisms)
We can broadly group existing methods under two categories: absolute and relative
[Figure: types of position encoding. Absolute, fixed (non-parametric): Vanilla Transformer. Absolute, learnable (parametric): BERT, GPT-x, OPT, BART. Relative, fixed (non-parametric) or learnable (parametric): LLaMA, GPT-J, PaLM, BLOOM, T5, Transformer-XL]
Module 1: Limitations of Absolute Position Encoding
Prominent models such as BERT, RoBERTa, GPT-x, BART and OPT all use learned APE
[Figure: the example sentence "somewhere something incredible is waiting to be known" with learned position embeddings at positions 0, 1, 2, ..., 7]
It is hypothesized that APE does learn to attend to tokens by relative position (Why is this important? We will come back to this later)
Does it? Let's see
[Figure: the input sentence with its tokens at the original positions 0 to 7]
Suppose we take a pre-trained model (say, GPT-2) and use it for a grammatical acceptability test (based on the perplexity score).
Ideally, we expect the model prediction to be consistent for the given input sentence with (learned) absolute position encoding
What if we shift the starting position of the input by \(k\) units to the right (i.e., the embedding of the first token is added to the position embedding at \(0+k\))?
[Figure: the same sentence with its positions shifted by \(k=100\), that is, positions 100 to 107]
If the model learned to attend to relative positions, then the prediction should not change. Therefore, the performance should not degrade.
However, it was observed that the performance degrades as \(k\) increases as shown in the figure [paper]
This motivated the need to explicitly encode relative positions
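The effect of shifting the positions can be seen at the level of a single pre-attention score. The sketch below is only an illustration under strong assumptions: the random table `pos_table` stands in for a learned APE table, and `W_Q`, `W_K`, `x_a`, `x_b` are random, untrained stand-ins. It shows that nothing in the architecture forces the score to depend only on the relative distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, max_pos = 16, 256
pos_table = rng.normal(size=(max_pos, d))   # stand-in for a learned APE table (assumption)
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
x_a, x_b = rng.normal(size=d), rng.normal(size=d)   # two token embeddings

def score(i: int, j: int) -> float:
    """Pre-attention score between token a at position i and token b at position j."""
    q = (x_a + pos_table[i]) @ W_Q
    k = (x_b + pos_table[j]) @ W_K
    return float(q @ k) / np.sqrt(d)

original = score(3, 5)
shifted = score(3 + 100, 5 + 100)   # same relative distance, shifted by k = 100
```

Here `original` and `shifted` differ even though the two tokens are the same distance apart in both cases; a trained model would have to learn shift invariance from data.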
The other limitation of APE is its inability to extrapolate to sequences longer than the context length of a model!
Let's take an example to understand this better
Suppose a GPT model (which uses APE) was pre-trained with a context length \(T\) equal to 512.
However, we want the model to generate a story of 1024 tokens.
Well, we do not have the learned embeddings for the positions 513 to 1024. This would require us to re-train the model with \(T=1024\)
However, re-training the same model with an increased context length is computationally expensive.
Therefore, enabling the model to extrapolate to sequences longer than \(T\) is important
Closely related to the above is "length generalization": the ability to extrapolate from shorter training sequences to longer test ones (within the context length of a model)
Now, consider the task of adding two integers using LLMs with a context length of 128
Say, all training examples demonstrate how to add two 4 digit integers
Can the model generalize the algorithm to add, say, two 12 digit integers?
Yes, the type of PE is a major factor in enabling this ability, but APE is not a good choice! [Paper]
APE also makes it difficult to introduce a recurrence mechanism (as in an RNN) that shares representations across segments. [To know more]
Moreover, what if the sequence itself has no absolute ordering (such as graphs)?
All these reasons motivate us to use relative position encoding!
Module 2: Relative Position Encoding Methods
We have only one absolute position for each token in the sequence
However, in the case of Relative Position Encoding (RPE), we have \(T\) relative positions for each token in a sequence of length \(T\)
[Figure: APE assigns one absolute position (0 to 7) to each token of the example sentence]
Therefore, RPE encodes pairwise relationships between tokens. (APE does not)
Let us understand this better
[Figure: APE versus RPE for the example sentence. APE: positions 0 to 7. RPE with respect to the word "is": -3, -2, -1, 0, 1, 2, 3, 4]
The relative position of the word "is" is 0 with respect to the position of the word itself and differs with respect to the positions of the words surrounding it
We have \(T\) relative positions for each token in a sequence of length \(T\) and, in general, \(2T-1\) embeddings to encode all relative positions (whereas APE requires only \(T\) embeddings)
Therefore, the relative positions for the word "is" are (-3,-2,-1,0,1,2,3,4)
Let us illustrate by displaying all relative positions for the above sequence
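[Figure: relative positions \(j-i\) for every pair of tokens; each row lists the values for the columns somewhere, something, incredible, is, waiting, to, be, known]
somewhere: 0, 1, 2, 3, 4, 5, 6, 7
something: -1, 0, 1, 2, 3, 4, 5, 6
incredible: -2, -1, 0, 1, 2, 3, 4, 5
is: -3, -2, -1, 0, 1, 2, 3, 4
waiting: -4, -3, -2, -1, 0, 1, 2, 3
to: -5, -4, -3, -2, -1, 0, 1, 2
be: -6, -5, -4, -3, -2, -1, 0, 1
known: -7, -6, -5, -4, -3, -2, -1, 0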
We can view the pairwise relationship of tokens in the sequence as a labelled, directed, fully-connected graph
For example, a directed edge from the word "somewhere" to "something" has the label 1, whereas a directed edge from the word "something" to "somewhere" has the label -1.
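The pairwise relative positions (the edge labels of the graph) can be computed in one broadcasted subtraction. A minimal sketch, assuming the convention used above that the label of the edge from token \(i\) to token \(j\) is \(j-i\); the variable names are illustrative.

```python
import numpy as np

tokens = "somewhere something incredible is waiting to be known".split()
T = len(tokens)
positions = np.arange(T)

# rel[i, j] = j - i: label of the directed edge from token i to token j
rel = positions[None, :] - positions[:, None]

row_is = rel[tokens.index("is")]      # relative positions seen from the word "is"
num_distinct = len(np.unique(rel))    # number of embeddings needed: 2T - 1
```

The matrix has \(T^2\) entries but only \(2T-1\) distinct values, which is why \(2T-1\) embeddings are enough to encode all relative positions.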
[Figure: for \(T=8\), the embedding matrix stores one embedding per relative position, from -7 to 7]
With a naive implementation, for each token embedding \(x_i\), we need a way to combine (say, by addition) \(T\) relative position embeddings.
So, if \(T=1024\), then we need to do 1024 additions for each token in the input sequence!
In general, it requires \(T^2\) additions instead of \(T\) additions as in APE.
Can we do better?
The Idea
Let's look at the decomposition of the pre-attention computation
Recall that, with APE, the position vector \(p_i\) is added to the word embedding \(x_i\) before attention. Therefore (omitting the \(1/\sqrt{d}\) scaling),
\(e_{ij} = (x_i+p_i)W_Q\,\big((x_j+p_j)W_K\big)^T\)
\(e_{ij} = x_iW_QW_K^Tx_j^T + x_iW_QW_K^Tp_j^T + p_iW_QW_K^Tx_j^T + p_iW_QW_K^Tp_j^T\)
The first term is the score for the given query and key without any position encoding
The remaining terms add a constant that comes from the position encoding; the second and third terms capture the correlation between a word and a position
PE just adds a constant to the pre-attention score. Therefore, we can inject position information in the attention layer directly by adjusting the pre-attention score \(e_{ij}\)
Visualization of the above equation in BERT [paper]. From left: correlation between word-to-word, word-to-position, position-to-word, position-to-position
[Figure annotation: uniformly distributed (that is, no correlation)]
Adding APE to the input embeddings introduces positional artefacts
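The split of the pre-attention score into word-to-word, word-to-position, position-to-word and position-to-position parts can be checked numerically. This is a sketch under assumptions: row-vector embeddings, random stand-ins for all parameters, and the \(1/\sqrt{d}\) scaling left out; the names `A` and `terms` are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x_i, x_j = rng.normal(size=d), rng.normal(size=d)   # word embeddings
p_i, p_j = rng.normal(size=d), rng.normal(size=d)   # position embeddings
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A = W_Q @ W_K.T                                     # bilinear form shared by all four terms

# Score computed the usual way, after adding APE to the inputs
full = ((x_i + p_i) @ W_Q) @ ((x_j + p_j) @ W_K)

# The same score written as four separate correlations
terms = {
    "word-to-word": x_i @ A @ x_j,
    "word-to-position": x_i @ A @ p_j,
    "position-to-word": p_i @ A @ x_j,
    "position-to-position": p_i @ A @ p_j,
}
```

Only the first entry of `terms` is free of position information; the other three are what the position encoding contributes to the score.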
Recall that the naive implementation of RPE requires \(T^2\) additions instead of the \(T\) additions in APE.
Can we do better?
Yes, by modifying the pre-attention score \(e_{ij}\) directly
There are multiple ways of modifying the attention score!
Adding a constant to the attention score is the simplest approach:
\(e_{ij} = \frac{x_iW_Q\,(x_jW_K + p_{ij}^K)^T}{\sqrt{d}}\)
\(p_{ij}^K\) is also called position bias (a bias \(b\) as in \(xW+b\))
Note that in this formulation, the dimension of \(p_{ij}\) is \(d\) (head dimension), whereas in APE, it is \(d_{model}\)
Does this approach help the model to generalize to sequence lengths not seen during training?
No, because the size of the position embedding matrix still depends on \(T\)!
Suppose the relative distance information is clipped beyond a certain distance \(k\) as follows
\(p_{ij}^K = w^K_{\mathrm{clip}(j-i,\,k)}, \quad \mathrm{clip}(x,k) = \max(-k, \min(k, x))\)
where \(w^K_{-k},\cdots,w^K_{k}\) are the learnable relative position embeddings
Then the size of the position embedding matrix reduces to \((2k+1)\), making it independent of \(T\)
It is also empirically observed that clipping the distance does not hurt the performance.
[Figure: clipped relative positions for \(T=8,k=4\); each row lists the values for the columns somewhere, something, incredible, is, waiting, to, be, known]
somewhere: 0, 1, 2, 3, 4, 4, 4, 4
something: -1, 0, 1, 2, 3, 4, 4, 4
incredible: -2, -1, 0, 1, 2, 3, 4, 4
is: -3, -2, -1, 0, 1, 2, 3, 4
waiting: -4, -3, -2, -1, 0, 1, 2, 3
to: -4, -4, -3, -2, -1, 0, 1, 2
be: -4, -4, -4, -3, -2, -1, 0, 1
known: -4, -4, -4, -4, -3, -2, -1, 0
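The clipped relative positions for \(T=8,k=4\) can be reproduced with a few lines of NumPy. This is a minimal sketch; the helper name `clipped_relative_positions` and the index shift by \(k\) (so that the values can address an embedding table with \(2k+1\) rows) are illustrative assumptions.

```python
import numpy as np

def clipped_relative_positions(T: int, k: int) -> np.ndarray:
    """Return the (T, T) matrix of relative positions j - i, clipped to [-k, k]."""
    positions = np.arange(T)
    rel = positions[None, :] - positions[:, None]
    return np.clip(rel, -k, k)

k = 4
rel = clipped_relative_positions(T=8, k=k)

# Shift from [-k, k] to [0, 2k] so the values can index a table with 2k + 1 rows
table_index = rel + k

# The number of distinct indices does not grow with the sequence length
distinct_for_long_input = len(np.unique(clipped_relative_positions(T=100, k=k)))
```

Whatever the sequence length, at most \(2k+1\) embeddings are ever looked up, which is what makes the table size independent of \(T\).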
We can compute the attention score as usual,
\(\alpha_{ij} = \mathrm{softmax}_j(e_{ij})\)
Finally, add the relative position information to the value vectors to get the output \(z_i\),
\(z_i = \sum_{j} \alpha_{ij}\,(x_jW_V + p_{ij}^{V})\)
The position embeddings \( p_{ij}^{\tiny{K}} , p_{ij}^{\tiny{V}} \) are shared across heads (not across layers)
Replacing the APE in the vanilla transformer architecture with RPE with \(k=16\) leads to better performance
Source: [paper]
Efficient Implementation
We can expand it further
\(e_{ij} = \frac{x_iW_Q(x_jW_K)^T + x_iW_Q(p_{ij}^K)^T}{\sqrt{d}}\)
Computing the first term can be parallelized. However, the second term requires a small tweak
In the second term, \(x_iW_Q \in \mathbb{R}^{1 \times d}\) gets multiplied with all relative positions (\(j=0,\cdots,T-1\)) given by the matrix \((p_{i:}^{K})^T \in \mathbb{R}^{d \times T}\)
Writing \(e_{ij}\) in vectorized form,
\(E = \frac{XW_Q(XW_K)^T + XW_Q(P^K)^T}{\sqrt{d}}\)
where \(P^K \in \mathbb{R}^{T \times T \times d}\) stacks all \(p_{ij}^K\)
The second term is computed by reshaping \(XW_Q \in \mathbb{R}^{T \times d }\) to \( \mathbb{R}^{T \times 1 \times d }\)
The transpose swaps the last two dimensions of \(P^K\), so that \((P^K)^T \in \mathbb{R}^{T \times d \times T}\)
The resulting product \(XW_Q (P^K)^T \) has dimension \(T \times 1 \times T\), which is reshaped to \(T \times T\)
The space complexity of \(P^K\) is \(O(T^2d)\)
How can we reduce the space complexity?
Note that we construct \(P\) from the embedding matrix, which leads to an additional memory requirement.
Is there a way to transform the embedding matrix directly instead of computing a separate relative position matrix \(P\)?
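The batched product described above, together with one possible way to avoid building \(P\), can be sketched in NumPy. This is an illustration under assumptions, not necessarily the variant the lecture has in mind: all parameters are random stand-ins, `emb` is a hypothetical table of \(2k+1\) relative position embeddings, and the second variant multiplies the queries with the table first and then gathers the needed entries.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d, k = 6, 12, 4, 2
X = rng.normal(size=(T, d_model))
W_Q, W_K = rng.normal(size=(d_model, d)), rng.normal(size=(d_model, d))
emb = rng.normal(size=(2 * k + 1, d))       # relative position embedding table (assumption)

pos = np.arange(T)
idx = np.clip(pos[None, :] - pos[:, None], -k, k) + k   # (T, T) indices into emb
P_K = emb[idx]                                          # (T, T, d), O(T^2 d) memory

Q, K = X @ W_Q, X @ W_K                                 # (T, d) each
content = Q @ K.T                                       # first term, (T, T)

# Second term: (T, 1, d) @ (T, d, T) -> (T, 1, T), then drop the middle axis
position = (Q[:, None, :] @ P_K.transpose(0, 2, 1))[:, 0, :]
E = (content + position) / np.sqrt(d)

# Variant without P_K: score every query against the 2k + 1 embeddings, then gather
Q_emb = Q @ emb.T                                          # (T, 2k + 1)
position_direct = np.take_along_axis(Q_emb, idx, axis=1)   # (T, T)
```

The gather-based variant needs \(O(Tk)\) extra memory for `Q_emb` instead of \(O(T^2d)\) for `P_K`, at the cost of an indexing step.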
RPE Variations
Transformer-XL (recursively extend context)
In the Transformer-XL attention score,
\({\color{red}{u,v}}\) are learnable vectors.
\({\color{blue}R_{i-j}^T}\) is the encoding of the relative distance between the positions \(i\) and \(j\), computed from a sinusoidal function (non-learnable)
T5-bias
All that is required is adding a constant, why not just do that?
\(e_{ij} = x_iW_Q(x_jW_K)^T + {\color{blue}r_{i-j}}\)
where \({\color{blue}r_{i-j}} \in \mathbb{R}\) are learnable scalars, shared across layers. The maximum relative distance is clipped to 128
It also greatly reduces the number of learnable parameters when the model size is scaled
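Adding one learnable scalar per relative distance to the score can be sketched as follows. This is a simplified illustration: it assumes plain clipping of the distance \(i-j\) (the released T5 code groups distances into buckets, which is not reproduced here), a small `max_dist` instead of 128 so the arrays stay readable, and random stand-ins for the queries, keys and scalars.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, max_dist = 6, 4, 3            # the lecture quotes a maximum distance of 128
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))   # projected queries and keys
r = rng.normal(size=2 * max_dist + 1)   # one scalar per clipped distance (assumption)

pos = np.arange(T)
rel = np.clip(pos[:, None] - pos[None, :], -max_dist, max_dist)   # i - j, clipped
bias = r[rel + max_dist]            # (T, T) matrix of scalars r_{i-j}
E = Q @ K.T + bias                  # position enters only as an additive constant
```

The position parameters are `2 * max_dist + 1` scalars per head, independent of the head dimension, which is where the parameter saving comes from.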
Recall the decomposition of the pre-attention score
RPE methods (Shaw et al., Transformer-XL, T5) inject the position information into the pre-attention score
What about attention mechanisms that do not directly compute or modify \(e_{ij}\), as in the case of kernel attention and low-rank approximations of attention?
In kernel and low-rank approximations of attention, the PEs are assumed to be injected into the token embeddings
Can we inject the relative position into the token embeddings (without combining \(T\) PEs per token)?
Encode the relative position in angles and multiply it with the token embeddings
We can generalize this by letting the query and key be functions of the token embedding and its absolute position, \(f_q(x_i,i)\) and \(f_k(x_j,j)\)
However, we want the inner product between token embeddings to encode the relative position information. That is, we want it to be
\(\langle f_q(x_i,i), f_k(x_j,j) \rangle = g(x_i,x_j,i-j)\)
Consider the following pair of words, "the" and "book", in two sentences
the book . . .
read the book . . .
[Figure: representations of "the" and "book" in both sentences after adding APE]
Adding absolute position embeddings changes the angle between the representations of the words "the" and "book" based on where they appear in a sentence, despite their relative distance being the same in both sentences
Can we preserve the angle between word representations so that the relative information can be encoded in angles?
The relative distance between the words "the" and "book" is 1 in all the sentences given below
the book . . . (\(m=0,n=1\))
read the book . . . (\(m=1,n=2\))
you read the book . . . (\(m=2,n=3\))
you must read the book . . . (\(m=3,n=4\))
he says you must read the book . . . (\(m=5,n=6\))
Let us assume the following functional form for the left-hand side of the above equation (keeping rotation in mind)
\(f_q(x_m,m) = (x_mW_Q)\,e^{im\theta}\)
(replaced \(i,j\) by \(m,n\) to avoid confusion with the imaginary unit)
Here, we multiply the affine-transformed word embedding by a (complex) constant, and \(\theta\) is a non-zero constant (why?).
Similarly for \(f_k(x_n,n)\),
\(f_k(x_n,n) = (x_nW_K)\,e^{in\theta}\)
Recall that the inner product between two complex vectors \(x,y \in \mathbb{C}^d\) is \(x \cdot y^H\), where \(H\) denotes the Hermitian transpose
Then the inner product results in
\(f_q(x_m,m) \cdot f_k(x_n,n)^H = (x_mW_Q)(x_nW_K)^H\,e^{i(m-n)\theta}\)
whose real part depends on the positions only through \(m-n\)
For efficient computation, for \(x_m \in \mathbb{R}^2\) (with \(x_mW_Q\) viewed as one complex number), we can write the multiplication by \(e^{im\theta}\) as a matrix as follows
\(f_q(x_m,m) = x_mW_Q \begin{bmatrix} \cos m\theta & \sin m\theta \\ -\sin m\theta & \cos m\theta \end{bmatrix}\)
Can you recognize this matrix?
Yes, a rotation matrix \(\mathbf{R}\)
Essentially, the transformation rotates the affine-transformed embeddings \(x_m,x_n\) by \(m\theta, n\theta\), respectively
[Figure: a 2D vector is first transformed by \(W_Q\) (say, \(W_Q=I\)) and then rotated by \(m\theta\)]
Since the absolute position information is encoded in the rotation, it is called Rotary Position Encoding (RoPE)
We can generalize this to \(x \in \mathbb{R}^{d}, d>2\). As shown in the figure, we can extend the idea to \(x_m \in \mathbb{R}^d\) as follows
\(f_q(x_m,m) = x_mW_Q\,\mathbf{R}_{\Theta,m}^d\)
where \(\mathbf{R}_{\Theta,m }^d\) is block diagonal, with one \(2 \times 2\) rotation block of angle \(m\theta_i\) for each pair of dimensions, and
\(\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i=1,\cdots,d/2\}\)
It is similar to the one we used in sinusoidal embedding
Then, the pre-attention score is computed by encoding the relative position as follows
\(e_{mn} = (x_mW_Q\,\mathbf{R}_{\Theta,m}^d)(x_nW_K\,\mathbf{R}_{\Theta,n}^d)^T = x_mW_Q\,\mathbf{R}_{\Theta,m-n}^d\,W_K^Tx_n^T\)
Here \(\mathbf{R}_{\Theta,m}^d\) and \(\mathbf{R}_{\Theta,n}^d\) carry the APE of \(m,n\), while their product \(\mathbf{R}_{\Theta,m-n}^d\) carries the RPE of \(m-n\)
Again, let's take a 2D example and assume \(W_Q=W_K=I\); the score then reduces to \(e_{mn} = x_m\,\mathbf{R}_{\Theta,m-n}^2\,x_n^T\)
Performance of RoPE during pre-training
Unlike sinusoidal APE, RoPE injects the positional information at every layer of the model and the position embedding is not injected into the value vectors.
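The key property of RoPE, that the score between a rotated query and a rotated key depends only on the relative distance, can be checked with a small sketch. It assumes the common pairing of consecutive dimensions and the base 10000 used in sinusoidal embeddings; the function name `rope` and the random vectors `q`, `k` (standing in for already projected embeddings) are hypothetical.

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... of x by m * theta_i."""
    d = x.shape[-1]                                  # d must be even
    theta = base ** (-np.arange(0, d, 2) / d)        # one frequency per pair
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)   # stand-ins for x_m W_Q and x_n W_K

score_a = rope(q, 1) @ rope(k, 2)   # "the book" at positions (1, 2)
score_b = rope(q, 5) @ rope(k, 6)   # the same words at positions (5, 6)
score_c = rope(q, 1) @ rope(k, 4)   # a different relative distance
```

`score_a` and `score_b` agree up to floating-point error because both pairs are one position apart, while `score_c` differs. Since rotations preserve norms, the rotated vectors keep the length of the originals.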
Module 3: PE and Length Generalization
So far, we talked about various PE methods which lead to better performance
How do they help in length generalization?
Let's define length generalization as the ability of a model to extrapolate to longer sequences than the one it has seen during training
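From the transformer paper: the sinusoidal version was chosen because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training
However, no experiments were conducted to validate this claim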
Which approach (APE or RPE) is good at extrapolation?
Suppose we train a causal language model with \(T=512\) tokens
During inference, we validate the model by allowing the model to predict next tokens up to \(T_{valid}\) tokens
Perplexity score is used to measure the quality of predictions.
\(T_{valid}\) is varied from 512 to 16000 tokens
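For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from the model; the function name `perplexity` and the toy inputs are illustrative.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 1/4 to every correct next token
uniform_quarter = [math.log(1 / 4)] * 10
ppl_uniform = perplexity(uniform_quarter)

# A model that is always certain and correct
ppl_perfect = perplexity([0.0] * 3)
```

A lower value is better: a model that always spreads its belief over four equally likely tokens scores 4, and a perfect model scores 1.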
We repeat the experiment, changing only the position encoding method and keeping all other aspects as is.
The figure below shows the performance of the sinusoidal, T5, RoPE and ALiBi position encoding methods
Does increasing the training context window \(T=1024\) help?
Yes, it helps T5-bias RPE more than RoPE and sinusoidal APE. For example, for \(T_{valid}=8000\), the model trained with \(T=512\) has a higher perplexity score than the model trained with \(T=1024\)
This suggests that modifying T5-bias might improve the performance.
Recall the T5-bias RPE, where \({\color{blue}r_{i-j}} \in \mathbb{R}\) are learnable scalars
Why not hard-code \({\color{blue}r_{i-j}}\), as it helps increase the inference speed?
How do we do that?
Just add a constant (negative bias) to the attention score based on the distance between the query and key positions, as shown in the figure. This is Attention with Linear Biases (ALiBi)
The attention score is computed as follows
\(e_{ij} = x_iW_Q(x_jW_K)^T - m\,(i-j), \quad j \le i\)
\(m\) is a head-specific scalar that follows a geometric progression.
For \(n=8\) heads, \(m_i=\frac{1}{2^i},i=1,\cdots,n\)
Since it is RPE, position information is added at every layer of the model.
It is called linear because the magnitude of the bias increases linearly with the distance
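The hard-coded linear bias can be sketched as follows. This is a minimal illustration for a causal model, assuming the \(n=8\) slopes \(m_i = 1/2^i\) quoted above (other head counts are not covered by this sketch); the function names `alibi_slopes` and `alibi_bias` are hypothetical.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Head-specific slopes m_i = 1 / 2^i for i = 1..n (the n = 8 setting above)."""
    return np.array([1.0 / 2 ** i for i in range(1, n_heads + 1)])

def alibi_bias(T: int, slopes: np.ndarray) -> np.ndarray:
    """Return an (n_heads, T, T) bias: -m * (i - j) for j <= i, -inf for future tokens."""
    pos = np.arange(T)
    distance = pos[:, None] - pos[None, :]        # i - j, non-negative for past tokens
    bias = -slopes[:, None, None] * distance      # grows linearly with the distance
    return np.where(distance >= 0, bias, -np.inf) # causal mask

slopes = alibi_slopes(8)
bias = alibi_bias(5, slopes)   # added to the query-key scores before the softmax
```

Nothing here is learned, and the bias for a longer sequence is obtained simply by calling `alibi_bias` with a larger `T`.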
ALiBi Acts Like a Local Windowed Attention
Why does this work remarkably well even for \(T_{valid}=16K\)?
We can observe that the constant added for nearby tokens is larger (0, -1, -2, ...) than the one added for distant tokens, say (-128, -129, ...)
This effectively acts as a local windowed attention that enables ALiBi to predict the next token effectively from the recent past \(k\) tokens (a natural structure in languages)
Subsequent studies generalized ALiBi, such as KERPLE (Kernelized Relative Positional Embedding for Length Extrapolation), and improved RoPE, such as xPOS.
The previous approach (ALiBi) used "perplexity" as the metric to measure length generalization ability during inference
[Figure: validation perplexity]
This may not be suitable for all downstream tasks like sorting, addition, summation, reversing...
A study [NoPos] conjectures that using a causal mask implicitly encodes the position information
Keeping these two observations in mind, one could extend the study of the length generalization ability of decoder-only models to other downstream tasks
One study considered sinusoidal APE, RoPE, T5-bias and No Position Encoding (NoPE)
Here is the summary of the findings
It is shown that NoPE (i.e., a decoder-only model without explicit position encoding)
- can theoretically represent both absolute and relative PEs
- learns to use relative PE in practice
Despite all these studies, length generalization remains a challenge for transformer based models
This raises a question
Is there an inherent limitation in Transformers’ design preventing effective length generalization?
A recent study (Feb 2024) suggests so
Summary
Source: [paper]
NoPE is not in the list for obvious reasons (but we can list it under RPE)