Introduction to Large Language Models

 Position Encoding Methods and Length Generalization 

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Absolute Positional Encoding


[Figure: \(h_i = x_i + p_i\), where \(x_i, p_i, h_i \in \mathbb{R}^{512}\)]

Positional encoding helps to capture the order of tokens in the input sequences

h_i=x_i+p_i

the values for the position vector are fixed and come from

PE_{(i,j)} = \begin{cases} \sin\left(\frac{i}{10000^{2j/d_{\text{model}}}}\right), \ j=0,2,\cdots \\ \cos\left(\frac{i}{10000^{2j/d_{\text{model}}}}\right), \ j=1,3,\cdots \end{cases}
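A quick NumPy sketch of this formula (using the standard pairing of sine and cosine dimensions; the maximum length of 8 here is just illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Fixed (non-parametric) sinusoidal position encodings, one row per position i."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                    # positions i = 0 ... max_len-1
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # 10000^(2j/d_model) per sin/cos pair
    pe[:, 0::2] = np.sin(pos / div)                      # even dimensions: sin
    pe[:, 1::2] = np.cos(pos / div)                      # odd dimensions:  cos
    return pe

P = sinusoidal_pe(max_len=8, d_model=512)   # h_i = x_i + P[i]
```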

One approach is to encode the absolute positions

[Figure: the example sentence "somewhere something incredible is waiting to be known", one embedding per token]

We have done this by adding the position vector \(p_i\) to the word embedding \(x_i\)

[Figure: each position vector has components \(j=0,1,2,\cdots,511\); the tokens occupy absolute positions \(0,1,\cdots,7\)]

Question Everything

One approach is to encode the absolute positions by adding the position vector \(p_i\) to the word embedding \(x_i\), where the values of the position vector are fixed and come from the sinusoidal formula above

Why not relative positions?

Why not multiply? 

Why not learn it? 

[Figure: absolute positions \(0,1,2,3\) vs. relative positions \(-2,-1,0,1\); additive combination \(x_i+p_i\) vs. multiplicative combination \(x_i \odot f(p_i)\)]

Why not add the PE to the attention score?

q_i k_j^T+f(p_i,p_j)
Why the particular choice \(PE_{(i,j)} = (\sin,\cos)\) and not some other function?

It is still an active area of research (alongside attention mechanisms)

We can broadly group existing methods under two categories

Types:
- Absolute
  - Fixed (non-parametric): Vanilla Transformer (sinusoidal)
  - Learnable (parametric): BERT, GPT-x, OPT, BART
- Relative
  - Fixed (non-parametric): LLaMA, GPT-J, PaLM, BLOOM
  - Learnable (parametric): T5, Transformer-XL

Module 1: Limitations of Absolute Position Encoding

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

All prominent models like BERT, RoBERTa, GPT-x, BART, and OPT use learned APE

[Figure: the example sentence with learned absolute position embeddings at positions \(0,1,2,\cdots,7\)]

It is hypothesized that APE does learn to attend to tokens by relative position (Why is this important? We will come back to this later) 

Does it? Let's see

[Figure: the example sentence with tokens at absolute positions \(0\) to \(7\)]

Suppose we take a pre-trained model (say, GPT-2) and use it for a grammatical acceptability test (based on the perplexity score).

Ideally, we expect the model prediction to be consistent for the given input sentence with (learned) absolute position encoding

What if we change the starting position of the input by shifting it \(k\) units to the right (i.e., the embedding of the first token is added to the position embedding at index \(0+k\))?
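A minimal sketch of this shift, using stand-in (random) token and position tables rather than the actual GPT-2 weights; only the position indices change, not the tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, T_max, d = 50257, 1024, 768           # assumed GPT-2-like sizes
tok_emb = rng.normal(size=(vocab, d))        # stand-in for the learned token embedding table
pos_emb = rng.normal(size=(T_max, d))        # stand-in for the learned position embedding table

def embed(input_ids, k=0):
    """Add learned absolute position embeddings, starting at position k instead of 0."""
    positions = np.arange(len(input_ids)) + k   # 0..T-1 becomes k..k+T-1
    return tok_emb[input_ids] + pos_emb[positions]

ids = np.array([11, 42, 7, 99, 3, 15, 8, 23])   # toy token ids for an 8-token sentence
h0   = embed(ids, k=0)                          # original placement
h100 = embed(ids, k=100)                        # shifted by k = 100
```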

[Figure: the same sentence with the starting position shifted by \(k=100\), i.e., tokens at positions \(100,101,\cdots,107\)]

If the model learned to attend to relative positions, then the prediction should not change. Therefore, the performance should not degrade.

However, it was observed that the performance degrades as \(k\) increases as shown in the figure  [paper]

This motivated the need to explicitly encode relative positions

The other limitation of APE is its inability to extrapolate to a longer sequence beyond the context length of a model!

Let's take an example to understand these better

Closely related to the above is "length generalization": the ability to extrapolate from shorter training sequences to longer test sequences (within the context length of the model)

Suppose a GPT model (which uses APE) was pre-trained with a context length \(T\) equal to 512.


Now suppose we want the model to generate a story of 1024 tokens.

Well, we do not have the learned embeddings for the positions 513 to 1024. This would require us to re-train the model with \(T=1024\)

However, re-training the same model with an increased context length is computationally expensive. Therefore, enabling the model to extrapolate to sequences longer than \(T\) is important.

Now, consider the task of adding two integers using LLMs with a context length of 128

Say, all training examples demonstrate how to add two 4-digit integers

9999 + 0001 = 10000

Can the model generalize the algorithm to add, say, two 12-digit integers?

Yes, the type of PE is a major factor in enabling this ability, but APE is not a good choice! [Paper]

APE also makes it difficult to introduce a recurrence mechanism (as in RNNs) by sharing representations across segments. [To know more]

Moreover, what if the sequence itself has no absolute ordering (such as graphs)? 

All these reasons motivate us to use relative position encoding!

Module 2 : Relative Position Encoding Methods

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

We have only one absolute position for each token in the sequence

However, in the case of Relative Position Encoding (RPE), we have \(T\) relative positions for each token in a sequence of length \(T\) 

[Figure: APE assigns a single absolute position \(0,1,\cdots,7\) to each token of the example sentence]

Therefore, RPE encodes pairwise relationships between tokens. (APE does not)

Let us understand this better

[Figure: APE positions \(0\) to \(7\) vs. RPE positions of all tokens relative to the word "is": \((-3,-2,-1,0,1,2,3,4)\)]

The relative position of the word "is" is 0 with respect to itself and differs with respect to the positions of the surrounding words

We have \(T\) relative positions for each token in a sequence of length \(T\) and, in general, \(2T-1\) embeddings to encode all relative positions (whereas APE requires only \(T\) embeddings)

Therefore, the relative positions for the word "is" are (-3,-2,-1,0,1,2,3,4)

Let us illustrate by displaying all relative positions for the above sequence

[Figure: the full \(8 \times 8\) matrix of relative positions for the example sentence; the entry in row \(i\), column \(j\) is \(j-i\), ranging from \(-7\) to \(7\)]

We can view the pairwise relationship of tokens in the sequence as a labelled, directed, fully-connected graph

For example, a directed edge from the word "somewhere" to "something" has the label 1, whereas a directed edge from "something" to "somewhere" has the label -1.
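A small sketch that constructs this matrix of relative positions \(j-i\) for a sequence of length \(T=8\):

```python
import numpy as np

T = 8
rel = np.arange(T)[None, :] - np.arange(T)[:, None]   # rel[i, j] = j - i
print(rel[3])                 # row for "is": [-3 -2 -1  0  1  2  3  4]
print(np.unique(rel).size)    # 2T - 1 = 15 distinct relative positions
```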

h_i=x_i+\sum \limits_{j=0}^{T-1}p_{j-i}
h_3=x_3+(p_{-3}+p_{-2}+\cdots+p_{4})

For \(T=8\)

[Figure: the relative position embedding matrix with \(2T-1\) rows, indexed from \(-(T-1)\) to \(T-1\); for \(T=8\), from \(-7\) to \(7\)]

With a naive implementation, for each token embedding \(x_i\), we need a way to combine (say, addition) \(T\) relative position embeddings. 


So, if \(T=1024\), then we need to do 1024 additions for each token in the input sequence!

In general, it requires \(T^2\) additions instead of \(T\) additions as in APE.

Can we do better?

The Idea

Let's look at the decomposition of the pre-attention computation. Recall that,

q_i=h_iW_Q=(x_i+p_i)W_Q, \\k_j=h_jW_K=(x_j+p_j)W_K

Therefore,

\begin{aligned} e_{ij} = q_ik_j^T \end{aligned}
e_{ij} = (x_i+p_i)W_Q W_K^T(x_j+p_j)^T\\
=(x_iW_Q+p_iW_Q)(W_K^Tx_j^T+W_K^Tp_j^T)
= x_iW_QW_K^Tx_j^T+x_iW_QW_K^Tp_j^T+p_iW_QW_K^Tx_j^T+p_iW_QW_K^Tp_j^T

The first term is the score for the given query and key without any position encoding, the middle terms capture the correlation between a word and a position, and the last term simply adds a constant that comes from the position encoding.

PE just adds a constant to the pre-attention score. Therefore, we can inject position information in the attention layer directly by adjusting the pre-attention score \(e_{ij}\)

Visualization of the above decomposition in BERT [paper]. From left: correlation between word-to-word, word-to-position, position-to-word, position-to-position

The word-position correlation terms appear uniformly distributed (that is, no correlation); adding APE to the input embeddings introduces a positional artefact


Yes, by modifying the pre-attention score \(e_{ij}\) directly

There are multiple ways of modifying the attention score!

Adding a constant to the attention score is the simplest approach!

e_{ij}=x_iW_Q(x_jW_K+p_{ij}^{\tiny{K}})^T
\text{where,}\ p_{ij}^{\tiny{K}}\in \mathbb{R}^{d}

\(p_{ij}^K\) is also called position bias (a bias \(b\) as in \(xW+b\))

Note that in this formulation, the dimension of \(p_{ij}^K\) is \(d\) (head dimension), whereas in APE, it is \(d_{model}\)

Does this approach help the model to generalize to sequence lengths not seen during training?

No, because the size of the position embedding matrix still depends on \(T\)!

Suppose the relative distance information is clipped beyond a certain distance \(k\) as follows

p_{ij}^{\tiny{K}}=w_{\text{clip}(j-i,k)}^{\tiny{K}}
\text{clip}(j-i,k) = \max(-k,\min(j-i,k))

Then, the size of the position embedding matrix reduces to \(2k+1\), making it independent of \(T\)

It is also empirically observed that clipping the distance does not hurt the performance.
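A minimal sketch of the clipped lookup (illustrative names): the embedding table needs only \(2k+1\) rows, independent of \(T\):

```python
import numpy as np

def clipped_rel_positions(T: int, k: int) -> np.ndarray:
    rel = np.arange(T)[None, :] - np.arange(T)[:, None]   # j - i
    return np.clip(rel, -k, k)                            # clip(j - i, k)

T, k, d = 8, 4, 64
w_K = np.random.randn(2 * k + 1, d)          # only 2k+1 learnable embeddings
idx = clipped_rel_positions(T, k) + k        # shift the range [-k, k] to [0, 2k]
P_K = w_K[idx]                               # p_ij^K, shape (T, T, d)
```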

[Figure: for \(T=8, k=4\), the matrix of clipped relative positions for the example sentence. For instance, the row for "somewhere" (\(i=0\)) becomes \((0,1,2,3,4,4,4,4)\) since \(\text{clip}(j-i,k)=\max(-k,\min(j-i,k))\), and the row for "known" (\(i=7\)) becomes \((-4,-4,-4,-4,-3,-2,-1,0)\)]

e_{ij}=x_iW_Q(x_jW_K+p_{ij}^{\tiny{K}})^T = x_iW_QW_K^Tx_j^T+x_iW_Q(p_{ij}^{\tiny{K}})^T
\text{where,}\ p_{ij}^{\tiny{K}}=w_{\text{clip}(j-i,k)}^{\tiny{K}},\quad \text{clip}(j-i,k) = \max(-k,\min(j-i,k))

We can compute the attention score as usual

\alpha_{ij}=\text{softmax}(e_{ij})

Finally, add the relative position information to the value vectors

z_{i}= \sum \limits_{j}\alpha_{ij}(x_jW_V+p_{ij}^{\tiny{V}})

The position embeddings \( p_{ij}^{\tiny{K}} , p_{ij}^{\tiny{V}} \) are shared across heads (not across layers)

Replacing the APE in the vanilla transformer architecture by RPE with \(k=16\) leads to better performance (Source: [paper])

Efficient Implementation

Writing \(e_{ij}\) in vectorized form,

E=XW_Q(XW_K+P^{\tiny{K}})^T
\text{where } X \in \mathbb{R}^{T \times d_{model}},\ W_Q,W_K \in \mathbb{R}^{d_{model} \times d},\ P^K \in \mathbb{R}^{T \times T \times d}

We can expand it further

E=XW_QW_K^TX^T+XW_Q(P^K)^T

The transpose happens along the last dimension, so \((P^K)^T \in \mathbb{R}^{T \times d \times T}\).

Computing the first term can be parallelized as usual. The second term requires a small tweak: in

e_{ij}= x_iW_QW_K^Tx_j^T+x_iW_Q(p_{ij}^{\tiny{K}})^T

each \(x_iW_Q \in \mathbb{R}^{1 \times d}\) gets multiplied with all relative positions (\(j=0,\cdots,T-1\)) given by the matrix \((p_{i:}^{K})^T \in \mathbb{R}^{d \times T}\). By reshaping \(XW_Q \in \mathbb{R}^{T \times d }\) to \( \mathbb{R}^{T \times 1 \times d }\), the resulting dimension of the product \(XW_Q (P^K)^T \) is \(T \times 1 \times T\), which can be squeezed to \(T \times T\).

The space complexity of \(P^K\) is \(O(T^2d)\). How can we reduce the space complexity? Note that we construct \(P^K\) from the embedding matrix, which leads to an additional memory requirement. Is there a way to transform the embedding matrix directly instead of computing a separate relative position matrix \(P^K\)? Yes. You can refer to [here]
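A sketch of the vectorized computation described above (illustrative shapes; here P_K stands for the already looked-up tensor of \(p_{ij}^K\) vectors):

```python
import numpy as np

T, d_model, d = 8, 512, 64
X   = np.random.randn(T, d_model)
W_Q = np.random.randn(d_model, d)
W_K = np.random.randn(d_model, d)
P_K = np.random.randn(T, T, d)            # p_ij^K, shape (T, T, d)

Q, K = X @ W_Q, X @ W_K                   # (T, d)
term1 = Q @ K.T                           # content-content term, (T, T)

# naive second term: for each i, multiply q_i with all T relative embeddings
naive = np.stack([Q[i] @ P_K[i].T for i in range(T)])       # (T, T)

# vectorized: reshape Q to (T, 1, d) and batch-multiply with (T, d, T)
fast = (Q[:, None, :] @ P_K.transpose(0, 2, 1)).squeeze(1)  # (T, T)

E = term1 + fast
assert np.allclose(naive, fast)
```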

RPE Variations

Transformer-XL (recursively extends the context)

Recall the decomposition with APE:

e_{ij} = x_iW_QW_K^Tx_j^T+x_iW_QW_K^Tp_j^T+p_iW_QW_K^Tx_j^T+p_iW_QW_K^Tp_j^T

Transformer-XL modifies it to

e_{ij} = x_iW_QW_K^Tx_j^T+x_iW_Q{\color{blue}R_{j-i}^T}+{\color{red}u}W_K^Tx_j^T+{\color{red}v}W_K^T{\color{blue}R_{j-i}^T}

where \({\color{red}{u,v}}\) are learnable vectors and \({\color{blue}R_{j-i}^T}\) encodes the relative distance between positions \(i\) and \(j\), computed from the sinusoidal function (non-learnable)

T5-bias

All that is required is adding a constant, so why not just do that?

e_{ij} = x_iW_QW_K^Tx_j^T+{\color{blue}r_{j-i}}

where \({\color{blue}r_{j-i}} \in \mathbb{R}\) are learnable scalars, shared across layers. The maximum relative distance is clipped to 128.

This also greatly reduces the number of learnable parameters when the model size is scaled
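A minimal sketch consistent with the description above, with one learnable scalar per head for each clipped relative distance (this simplifies away T5's actual distance bucketing; names are illustrative):

```python
import numpy as np

T, n_heads, max_dist = 16, 8, 128
rng = np.random.default_rng(0)
# one learnable scalar per head per clipped relative distance, shared across layers
r = rng.normal(size=(n_heads, 2 * max_dist + 1))     # stand-in for learned r_{j-i}

rel = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None], -max_dist, max_dist)
bias = r[:, rel + max_dist]                          # (n_heads, T, T)

# e_ij = x_i W_Q W_K^T x_j^T + r_{j-i}
scores = rng.normal(size=(n_heads, T, T))            # stand-in for the content term Q K^T
scores = scores + bias
```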

RPE injects the position information in the pre-attention score. Recall the decompositions:

e_{ij} = x_iW_QW_K^Tx_j^T+x_iW_Q(p_{ij}^{\tiny{K}})^T \quad \text{(Shaw)}
e_{ij} = x_iW_QW_K^Tx_j^T+x_iW_Q{\color{blue}R_{j-i}^T}+{\color{red}u}W_K^Tx_j^T+{\color{red}v}W_K^T{\color{blue}R_{j-i}^T} \quad \text{(T-XL)}
e_{ij} = x_iW_QW_K^Tx_j^T+{\color{blue}r_{j-i}} \quad \text{(T5)}

What about attention mechanisms that do not directly compute or modify \(e_{ij}\), as in the case of kernel attention and low-rank approximations of attention? There, the PEs are assumed to be injected into the token embeddings.

Can we inject the relative position into the token embeddings (without combining \(T\) PEs per token)? The idea: encode the relative position in angles and multiply it with the token embeddings.

We can generalize the query/key construction by letting

f_{q}(x_i,i)=x_iW_Q
f_{k}(x_j,j)=x_jW_K

so that

\langle f_{q}(x_i,i), f_{k}(x_j,j) \rangle = x_iW_QW_K^Tx_j^T

However, we want the inner product between token embeddings to encode the relative position information. That is, we want it to be

\langle f_{q}(x_i,i), f_{k}(x_j,j) \rangle=g(x_i,x_j,j-i)

Consider the following pair of words in a sentence

the book . . .

PE=\begin{bmatrix} 0 & 1 \\ 0.84 & 0.54 \\ 0.9 & -0.41 \\ 0.14 & -0.98 \end{bmatrix}

(2-d sinusoidal APE values for positions 0 to 3)

read the book . . .

Adding absolute position embedding changes the angle between the representation of words "the" and "book" based on where they appear in a sentence despite their relative distance being the same in both sentences

the' = the+p_0 \\ book'=book+p_1
the^{''} = the+p_1 \\ book^{''}=book+p_2
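A quick numerical illustration, using the 2-d APE values above and toy random vectors standing in for the word embeddings of "the" and "book":

```python
import numpy as np

def angle(u, v):
    return np.degrees(np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))

# 2-d sinusoidal positions p_0, p_1, p_2 from the table above
p = np.array([[0.0, 1.0], [0.84, 0.54], [0.9, -0.41]])

rng = np.random.default_rng(1)
the, book = rng.normal(size=2), rng.normal(size=2)     # toy 2-d word embeddings

print(angle(the + p[0], book + p[1]))   # "the book ..."      (positions 0, 1)
print(angle(the + p[1], book + p[2]))   # "read the book ..." (positions 1, 2)
# the two angles differ even though the relative distance is 1 in both cases
```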

Can we preserve the angle between word representations so that the relative information can be encoded in angles?


The relative distance between the words "the" and "book" is 1 in all the sentences below:

- "the book . . ." (\(m=0, n=1\))
- "read the book . . ." (\(m=1, n=2\))
- "you read the book . . ." (\(m=2, n=3\))
- "you must read the book . . ." (\(m=3, n=4\))
- "he says you must read the book . . ." (\(m=5, n=6\))

q \cdot k=|q||k|cos(\theta)

Let us assume the following functional form for the left-hand side of the above equation (keeping rotation in mind)

\langle f_{q}(x_m,m), f_{k}(x_n,n) \rangle=g(x_m,x_n,m-n)
f_{q}(x_m,m)=x_mW_Qe^{jm\theta}
f_{k}(x_n,n)=x_nW_Ke^{jn\theta}
g(x_m,x_n,m-n) = Re[(x_mW_Q)(x_nW_K)^Te^{j(m-n)\theta}]

(we replaced \(i,j\) by \(m,n\) to avoid confusion with the imaginary unit \(j\))

where \(\theta\) is a non-zero constant (why?). Here, we multiply the affine-transformed word embedding by a (complex) constant.

Then,

\langle f_{q}(x_m,m), f_{k}(x_n,n) \rangle=x_mW_Qe^{jm\theta}(x_nW_Ke^{jn\theta})^H
=x_mW_Q(x_nW_K)^Te^{j(m-n)\theta}

Recall that the inner product between two complex vectors \(x,y \in \mathbb{C}^d\) is \(x \cdot y^H\), where \(H\) denotes the Hermitian (conjugate) transpose

Using Euler's formula,

e^{jm\theta}=cos(m\theta)+jsin(m\theta)

we can write \(e^{jm\theta}\) in

f_{q}(x_m,m)=e^{jm\theta}W_Qx_m

as a matrix as follows

e^{jm\theta}= \begin{bmatrix} cos(m\theta) & -sin(m\theta) \\ sin(m\theta) & cos(m\theta) \end{bmatrix}

Can you recognize this matrix? Yeah, a rotation matrix \(\mathbf{R}\)

Therefore, for \(x_m \in \mathbb{R}^2\), we can write

f_q(x_m,m)
= \begin{bmatrix} cos(m\theta) & -sin(m\theta) \\ sin(m\theta) & cos(m\theta) \end{bmatrix}
\begin{bmatrix} W_q^{11} & W_q^{12} \\ \\ W_q^{21} & W_q^{22} \end{bmatrix}
\begin{bmatrix} x_m^1 \\ \\ x_m^2 \end{bmatrix}

that is,

f_{q}(x_m,m)=\mathbf{R}W_Qx_m

Essentially, it rotates the affine-transformed \(x_m\) by \(m\theta\)

Similarly, for \(f_k(x_n,n)\),

f_k(x_n,n)
= \begin{bmatrix} cos(n\theta) & -sin(n\theta) \\ sin(n\theta) & cos(n\theta) \end{bmatrix}
\begin{bmatrix} W_k^{11} & W_k^{12} \\ \\ W_k^{21} & W_k^{22} \end{bmatrix}
\begin{bmatrix} x_n^1 \\ \\ x_n^2 \end{bmatrix}
f_{k}(x_n,n)=\mathbf{R}W_Kx_n

Then the inner product \(\langle f_{q}(x_m,m), f_{k}(x_n,n) \rangle\) results in

=x_m^TW_Q^T\begin{bmatrix} cos(m\theta) & sin(m\theta) \\ -sin(m\theta) & cos(m\theta) \end{bmatrix} \begin{bmatrix} cos(n\theta) & -sin(n\theta) \\ sin(n\theta) & cos(n\theta) \end{bmatrix} W_Kx_n
q_m^Tk_n=x_m^TW_Q^T\begin{bmatrix} cos((m-n)\theta) & sin((m-n)\theta) \\ -sin((m-n)\theta) & cos((m-n)\theta) \end{bmatrix} W_Kx_n
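A quick numerical check of this result with toy 2-d vectors and random \(W_Q, W_K\): the score depends only on \(m-n\):

```python
import numpy as np

def R(a):
    """2-d rotation matrix by angle a."""
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

rng = np.random.default_rng(0)
x_m, x_n = rng.normal(size=2), rng.normal(size=2)
W_Q, W_K = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
theta = 0.3

def score(m, n):
    q = R(m * theta) @ W_Q @ x_m      # f_q(x_m, m)
    k = R(n * theta) @ W_K @ x_n      # f_k(x_n, n)
    return q @ k

print(score(0, 1), score(1, 2), score(5, 6))   # identical: depends only on m - n
```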

We can generalize this to \(x \in \mathbb{R}^{d}, d>2\)


The word vector is thus transformed by \(W_Q\) (say, \(W_Q=I\)) and then rotated by \(m\theta\).

Since the absolute position information is encoded in the rotation, it is called Rotary Position Encoding (RoPE)

As shown in the figure, we can extend the idea to \(x_m \in \mathbb{R}^d\) as follows

f_{q}(x_m,m)=\mathbf{R}_{\Theta,m }^dW_Qx_m

where \(\mathbf{R}_{\Theta,m }^d\) is given by  

\mathbf{R}_{\Theta,m }^d= \begin{bmatrix} cos(m\theta_1) & -sin(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ sin(m\theta_1) & cos(m\theta_1) & 0 & 0 & \cdots & 0 & 0\\ 0 & 0 & cos(m\theta_2) & -sin(m\theta_2) & \cdots & 0 & 0 \\ 0 & 0 & sin(m\theta_2) & cos(m\theta_2) & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & cos(m\theta_{d/2}) & -sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & \cdots & sin(m\theta_{d/2}) & cos(m\theta_{d/2}) \end{bmatrix}

Then, the pre-attention score  is computed by encoding the relative position as follows

q_m^Tk_n=(\mathbf{R}_{\Theta,m }^dW_Qx_m)^T(\mathbf{R}_{\Theta,n }^dW_Kx_n)
=x_m^TW_Q^T(\mathbf{R}_{\Theta,m }^d)^T(\mathbf{R}_{\Theta,n }^d)W_Kx_n
=x_m^TW_Q^T\mathbf{R}_{\Theta,n-m}^dW_Kx_n
\text{where,}\ \mathbf{R}_{\Theta,n-m}^d=(\mathbf{R}_{\Theta,m }^d)^T(\mathbf{R}_{\Theta,n }^d)
\Theta=\{\theta_i=10000^{-\frac{2(i-1)}{d}},i\in(1,2,\cdots,d/2)\}

It is similar to the one we used in sinusoidal embedding

The rotations \(\mathbf{R}_{\Theta,m}^d\) and \(\mathbf{R}_{\Theta,n}^d\) encode the absolute positions \(m\) and \(n\) (APE), while their product \(\mathbf{R}_{\Theta,n-m}^d\) in

q_m^Tk_n=x_m^TW_Q^T\mathbf{R}_{\Theta,n-m}^dW_Kx_n

encodes the relative position \(n-m\) (RPE)

Again, let's take 2D example

Let's assume \(W_Q=W_K=I\)

q_m^Tk_n=x_m^T\mathbf{R}_{\Theta,n-m}^dx_n
\text{where,}\ \mathbf{R}_{\Theta,n-m}^d=(\mathbf{R}_{\Theta,m }^d)^T(\mathbf{R}_{\Theta,n }^d)
=\begin{bmatrix} cos(m\theta) & sin(m\theta) \\ -sin(m\theta) & cos(m\theta) \end{bmatrix}\begin{bmatrix} cos(n\theta) & -sin(n\theta) \\ sin(n\theta) & cos(n\theta) \end{bmatrix}
=\begin{bmatrix} cos((n-m)\theta)& -sin((n-m)\theta) \\ sin((n-m)\theta) & cos((n-m)\theta) \end{bmatrix}
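A sketch of applying \(\mathbf{R}_{\Theta,m}^d\) without materializing the block-diagonal matrix, rotating adjacent pairs of dimensions as in the matrix above (minimal, illustrative NumPy):

```python
import numpy as np

def rope(x: np.ndarray, m: int) -> np.ndarray:
    """Rotate a d-dim vector x at position m; pairs (x_1,x_2), (x_3,x_4), ... share one angle."""
    d = x.shape[-1]
    theta = 10000 ** (-2 * np.arange(d // 2) / d)          # theta_i = 10000^{-2(i-1)/d}
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]                              # adjacent pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
# the score depends only on the relative position: m - n = -1 in both cases
print(rope(q, 3) @ rope(k, 4), rope(q, 10) @ rope(k, 11))
```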

Performance of RoPE during pre-training

Unlike sinusoidal APE, RoPE injects the positional information at every layer of the model and the position embedding is not injected into the value vectors.

Module 3 : PE and Length Generalization

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

So far, we have talked about various PE methods that lead to better performance

How do they help in length generalization?

Let's define length generalization as the ability of a model to extrapolate to longer sequences than the ones it has seen during training

From the transformer paper: the sinusoidal PE was chosen because it "may allow the model to extrapolate to sequence lengths longer than the ones encountered during training"

However, no experiments were conducted to validate the same

Which approach (APE or RPE) is good at extrapolation? 

Suppose we train a causal language model with \(T=512\) tokens

During inference, we validate the model by letting it predict next tokens for sequences of up to \(T_{valid}\) tokens

Perplexity score is used to measure the quality of predictions.

\(T_{valid}\) is varied from 512 to 16000 tokens

We repeat the experiment, changing only the position encoding method and keeping the other aspects as is.

The figure below shows the performance of the sinusoidal, T5-bias, RoPE, and ALiBi position encoding methods

Does increasing the training context window to \(T=1024\) help?

Yes, it helps T5 Bias RPE more than rotary and sinusoidal APE. For example, for \(T_{valid}=8000\), the model trained with \(T=512\) has a higher perplexity score than the model trained with \(T=1024\)

This suggests that modifying T5-bias might improve the performance. 

Recall the T5-bias RPE

e_{ij} = x_iW_QW_K^Tx_j^T+{\color{blue}r_{j-i}}

where \({\color{blue}r_{j-i}} \in \mathbb{R}\) are learnable scalars

Why not hard-code \({\color{blue}r_{j-i}}\)? It would help increase the inference speed.

How do we do that?

Just add a constant (negative bias) to the attention score based on the relative position, as shown in the figure

The attention score is computed as follows

\text{softmax}(q_iK^T+m\cdot(-(i-1),\cdots,-2,-1,0))

\(m\) is a head-specific scalar that follows a geometric progression.

For \(n=8\) heads, \(m_i=\frac{1}{2^i},i=1,\cdots,n\)

Since it is RPE, position information is added at every layer of the model.

It is called linear because the bias increases linearly in magnitude with the distance
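A minimal sketch of the ALiBi bias described above (causal setting; the causal mask itself is omitted and assumed to handle future positions):

```python
import numpy as np

def alibi_bias(T: int, n_heads: int = 8) -> np.ndarray:
    """Per-head linear biases added to q_i K^T before the softmax."""
    slopes = np.array([1 / 2 ** (h + 1) for h in range(n_heads)])   # m_i = 1/2^i
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]            # j - i
    dist = np.minimum(dist, 0)      # only past tokens matter; future ones are causally masked
    return slopes[:, None, None] * dist                             # (n_heads, T, T)

bias = alibi_bias(T=6)
print(bias[0, -1])   # slope 1/2, last query row: [-2.5, -2. , -1.5, -1. , -0.5,  0. ]
```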

ALiBi Acts Like a Local Windowed Attention

Why does this work remarkably well even for \(T_{valid}=16K\)?

We can observe that the constant added for nearby tokens is larger (0, -1, -2, ...) than the one added for distant tokens (say, -128, -129, ...)

This effectively acts as local windowed attention, which lets ALiBi predict the next token from the recent past \(k\) tokens (a natural structure in language)

Subsequent studies generalized ALiBi, such as KERPLE (Kernelized Relative Positional Embedding for Length Extrapolation), and improved RoPE, such as xPos

The previous approach (ALiBi) used "perplexity" as a metric to measure the length generalization ability during inference

This may not be suitable for all downstream tasks like sorting, addition, summation, reversing, ...

A study [NoPos] conjectures that using a causal mask implicitly encodes the position information

[Figure: validation perplexity]

Keeping these two observations in mind, one could extend the study of the length generalization ability of decoder-only models to other downstream tasks

One study considered sinusoidal APE, RoPE, T5-bias, and No Position Encoding (NoPE)

Here is the summary of the findings


It is shown theoretically that NoPE (i.e., decoder-only models without any explicit PE)

  1.  can represent both absolute and relative PEs 

  2. learns to use relative PE in practice

Despite all these studies, length generalization remains a challenge for transformer-based models

This raises a question

Is there an inherent limitation in Transformers’ design preventing effective length generalization? 

A recent study (Feb 2024) suggests so.

Summary

Source: [paper]

NoPE is not in the list for obvious reasons (but we can list it under RPE)

References