Introduction to Large Language Models

Lecture 1: Transformers: Multi-headed self-attention, cross attention

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Module 1.1: Limitations of sequential models


[Figure: An encoder RNN processes the source sentence "I enjoyed the movie transformers", producing hidden states \(h_1,\dots,h_5\) from the initial state \(h_0\), which are then passed to a decoder RNN. \(h_i\): hidden states of the encoder RNN.]

The final state \(h_5\) is called a concept vector, thought vector or annotation

Sequence to Sequence

[Figure: Sequence-to-sequence model: the encoder RNN maps "I enjoyed the movie transformers" to hidden states \(h_1,\dots,h_5\); the decoder RNN, initialized with \(s_0 = h_5\), produces "Naan transfarmar padaththai rasiththen", consuming the previously generated word at each step. \(h_i\): hidden states of the encoder RNN; \(s_i\): hidden states of the decoder RNN.]

The final state \(h_5\) is called a concept vector, thought vector, context vector or annotation

However, it does not take into account the alignment of words between the source and target sentences.

The attention mechanism resolves this.

Attention: A quick tour

[Figure: The RNN encoder computes hidden states \(h_1,\dots,h_5\) for the sentence "I enjoyed the film transformers".]

Suppose that we create a copy of these hidden state vectors \((h_1,\cdots,h_5)\) and make those available to the decoder.

Once all these vectors are available to the decoder we can  drop the encoder block (until decoding completes)

The attention mechanism consumes these vectors as one of the inputs.

[Figure: The attention weights \(\alpha_{11},\dots,\alpha_{15}\) over \(h_1,\dots,h_5\) are combined by a weighted sum \(\sum\) into the context vector \(c_1\), using the decoder state \(s_0\).]
\begin{aligned} \mathbf{c}_t &= \sum_{i=1}^n \alpha_{ti} \boldsymbol{h}_i & \small{\text{; Context vector for output }y_t} \end{aligned}

\(n\): number of words in the source sentence

\(t\): decoder time step

In matrix form, \(c_1 = \begin{bmatrix} |&|&|&|&|\\ h_1&h_2&h_3&h_4&h_5\\ |&|&|&|&| \end{bmatrix} \begin{bmatrix} \alpha_{11}\\ \alpha_{12}\\ \alpha_{13}\\ \alpha_{14}\\ \alpha_{15} \end{bmatrix}\)

[Figure: The decoder, starting from \(s_0\) and the \(\langle Go \rangle\) token, uses \(c_1\) to compute \(s_1\) and predict the first target word "Naan".]
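As a sanity check, here is a minimal numpy sketch (with toy, assumed dimensions) of this computation: dot-product scores against each \(h_i\), a softmax to get \(\alpha_{ti}\), and the weighted sum \(c_t = H\alpha_t\).

```python
import numpy as np

# A minimal sketch (toy dimensions assumed): computing the context vector c_t
# from the encoder states H = [h_1, ..., h_n] and the decoder state s_{t-1}.
np.random.seed(0)
n, d = 5, 8                      # 5 source words, hidden size 8 (illustrative)
H = np.random.randn(d, n)        # columns are h_1, ..., h_n
s_prev = np.random.randn(d)      # decoder state s_{t-1}

scores = H.T @ s_prev            # dot-product score(s_{t-1}, h_i) for each i
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()      # softmax -> attention weights alpha_{t,1..n}

c_t = H @ alpha                  # c_t = sum_i alpha_{ti} h_i  (matrix form above)
print(c_t.shape)                 # (8,)
```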

[Figure: At the second decoder step, the attention weights \(\alpha_{21},\dots,\alpha_{25}\) over \(h_1,\dots,h_5\) give the context vector \(c_2 = \begin{bmatrix} |&|&|&|&|\\ h_1&h_2&h_3&h_4&h_5\\ |&|&|&|&| \end{bmatrix} \begin{bmatrix} \alpha_{21}\\ \alpha_{22}\\ \alpha_{23}\\ \alpha_{24}\\ \alpha_{25} \end{bmatrix}\), which together with \(s_1\) produces \(s_2\) and the word "transfarmar".]

[Figure: Similarly, at the third step, \(\alpha_{31},\dots,\alpha_{35}\) give \(c_3\), which produces \(s_3\) and the word "padaththai".]

[Figure: At the fourth step, \(\alpha_{41},\dots,\alpha_{45}\) give \(c_4\), which produces \(s_4\) and the word "rasiththen".]


Alignment of words:

[Figure: Alignment between the target words (Naan, transfarmar, padaththai, rasiththen) and the source words (I, enjoyed, the, film, transformers).]

\(\alpha_{ti}=\text{align}(y_t,h_i) = \dfrac{\exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_i))}{\sum_{i'=1}^n \exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_{i'}))}\)

[Figure: The attention weights corresponding to the aligned word pairs, e.g. \(\alpha_{11}, \alpha_{25}, \alpha_{34}, \alpha_{42}\), take the largest values.]

\(\alpha_{ti}=\text{align}(y_t,h_i) = \dfrac{\exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_i))}{\sum_{i'=1}^n \exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_{i'}))}\)

QP: Can \(h_{i}\), for all \(i\), be computed in parallel?

No! The encoder RNN can compute \(h_i\) only after \(h_{i-1}\) has been computed.

QP: Can \(\alpha_{ti}\) be computed in parallel for all \(i\) at time step \(t\)?

Yes. \(h_i\) is available for all \(i\), and \(s_{t-1}\) is also available at time step \(t\).

Take away: Attention can be parallelized.

Approaches to implement the score function

Content-based attention: \(\text{score}(s_{t-1},h_i) = \text{cosine}(s_{t-1},h_i)\)

Additive (concat) attention: \(\text{score}(s_{t-1},h_i) = v_a^T \tanh(W_a[s_{t-1}:h_i])\)

Dot-product attention: \(\text{score}(s_{t-1},h_i) = s_{t-1}^Th_i\)

Scaled dot-product attention: \(\text{score}(s_{t-1},h_i) = \frac{s_{t-1}^Th_i}{\sqrt{n}}\)

All score functions take in two vectors and produce a scalar.
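For concreteness, here is a minimal numpy sketch of these four score functions; the vectors and the parameters \(W_a, v_a\) are random stand-ins, and the scaled variant divides by the square root of the vector dimension (written \(\sqrt{n}\) above).

```python
import numpy as np

# A minimal sketch of the four score functions listed above; s, h, W_a and v_a
# are illustrative random stand-ins for real states and learned parameters.
def content_based(s, h):
    return float(s @ h / (np.linalg.norm(s) * np.linalg.norm(h)))   # cosine

def additive(s, h, W_a, v_a):
    return float(v_a @ np.tanh(W_a @ np.concatenate([s, h])))       # concat

def dot_product(s, h):
    return float(s @ h)

def scaled_dot_product(s, h):
    return float(s @ h / np.sqrt(h.shape[0]))                       # scale by sqrt(dim)

rng = np.random.default_rng(0)
s, h = rng.standard_normal(4), rng.standard_normal(4)
W_a, v_a = rng.standard_normal((6, 8)), rng.standard_normal(6)
print(content_based(s, h), additive(s, h, W_a, v_a),
      dot_product(s, h), scaled_dot_product(s, h))   # each is a scalar
```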

Major Limitation

Everything about the RNN-based sequence-to-sequence model seems good so far.

It performs well for translation when using an attention mechanism.

However, there is a major limitation in training the model.

Given a training example, we can't parallelize the sequence of computations across time steps.

[Figure: The RNN seq2seq model again: both the encoder states \(h_1,\dots,h_5\) and the decoder states \(s_1,s_2,s_3,\dots\) must be computed one after another.]

Wishlist: come up with a new architecture that incorporates the attention mechanism and also allows parallelization (and of course, get rid of vanishing/exploding gradient problems)

Module 1.2: Attention is all you need


Transformers: Attention is all you need

[Figure: The RNN seq2seq model, shown again for reference.]

Transition to Transformers

[Figure: Transition from the RNN encoder-decoder (with attention) to the transformer: the encoder block is built from self-attention and feed-forward networks, and the decoder block from self-attention, encoder-decoder attention, and feed-forward networks.]

We will see each of these components (and a few more) one at a time in detail and connect them together to synthesize the final architecture

[Figure: The word embeddings \(h_1,\dots,h_5\) of "I enjoyed the movie transformers" enter the encoder's self-attention layer, which outputs the vectors \(z_1,\dots,z_5\).]

Self-Attention

Note that the inputs are vectors (word embeddings) and the outputs are also vectors


Well, we know what attention is. All it requires is a pair of vectors as input.

You can think of these word embeddings as the \(h_i\) in the RNN encoder (if you wish to compare).

Self-Attention 

Let's take another example sentence

"The animal didn't cross the street because it was too tired".

The word "it" in the sentence refers to "Animal" or "Street"?

We know "it" is referring to the word "Animal"

Let's modify the sentence:

"The animal didn't cross the street because it was congested".

Now the word "it" is referring to the word "Street"

Therefore, it is important to establish a strong connection between the word "it" and the word "street" or "animal", based on the context.

This calls for an attention mechanism over the sentence itself. That's why it is called self-attention (to distinguish it from cross-attention, which we will see later).

The goal

Given a word in a sentence, we want to compute the relational score between that word and the rest of the words in the sentence, such that the score is higher if the two words are related contextually.

            The   animal  didn't  cross  the   street  because  it
The         0.6   0.1     0.05    0.05   0.02  0.02    0.02     0.1
animal      0.02  0.5     0.06    0.15   0.02  0.05    0.01     0.12
didn't      0.01  0.35    0.45    0.1    0.01  0.02    0.01     0.03
cross       .
the         .
street      .
because     .
it          0.01  0.6     0.02    0.1    0.01  0.2     0.01     0.01


We can think of the headers in the first column as \(s_{i}\) and the headers in the first row as \(h_j\) (just for convenience).


However, now both sets of vectors, \(s_i\) and \(h_j\) for all \(i,j\), are available all the time (whereas in the seq2seq model, \(h_j\) for all \(j\) were available but \(s_i\) was obtained one step at a time).

Does this allow us to compute the values for all the rows in parallel (i.e., all at once)?

Choice of Attention function

Recall the score function \(e_{jt}\) used in the seq2seq model:

\(score(s_{t-1},h_j)=v_a^T \tanh(U_{att}s_{t-1}+W_{att}h_j)\)

  • Two linear transformations

  • One non-linearity

  • Finally, a dot product

There are three vectors \((s,h,v)\) involved in computing the score at each time step (of the decoder).


However, the input to the self-attention module is only the word embeddings \(h_j\) for all \(j\)

So, we need to get three vectors for each word embedding. How do we do it?

Matrix transformations! How many matrices do we need?

3

What should the values of the elements in these matrices be?


Transformation Matrices

\(q_j = W_Q h_j, \qquad k_j = W_K h_j, \qquad v_j = W_V h_j\)

\(q_j\) is called query vector for the word embedding \(h_j\)

\(k_j\) is called key vector for the word embedding \(h_j\)

\(v_j\) is called value vector for the word embedding \(h_j\)

\(W_Q, W_K,\) and \(W_V\) are the respective linear transformation (parameter) matrices.

[Figure: A toy illustration: 3-dimensional embeddings of "Animal" and "It" are projected to 2 dimensions by a matrix \(W \in \mathbb{R}^{2 \times 3}\).]

In the transformer, \(h_j \in \mathbb{R}^{512 \times 1}\) and \(W_Q, W_K, W_V \in \mathbb{R}^{64 \times 512}\), so \(q, k, v \in \mathbb{R}^{64 \times 1}\).
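A minimal numpy sketch of these projections with the dimensions quoted above (the random matrices are stand-ins for the learned parameters \(W_Q, W_K, W_V\)):

```python
import numpy as np

# A minimal sketch: 512-dimensional embeddings projected to 64 dimensions.
rng = np.random.default_rng(0)
d_model, d_k = 512, 64
W_Q = rng.standard_normal((d_k, d_model)) * 0.02   # stand-in for learned W_Q
W_K = rng.standard_normal((d_k, d_model)) * 0.02   # stand-in for learned W_K
W_V = rng.standard_normal((d_k, d_model)) * 0.02   # stand-in for learned W_V

h_j = rng.standard_normal((d_model, 1))            # one word embedding (column vector)
q_j, k_j, v_j = W_Q @ h_j, W_K @ h_j, W_V @ h_j
print(q_j.shape, k_j.shape, v_j.shape)             # (64, 1) each
```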


Let's first focus on calculating the first output \(z_1\) of the self-attention layer.

[Figure: Each input embedding \(h_j\) is multiplied by \(W_K, W_V, W_Q\) to obtain \(k_j, v_j, q_j\), for \(j = 1,\dots,5\).]

In the seq2seq model we computed \(\text{score}(s_{t-1},h_j)\) with \(s_{t-1}\) fixed and \(h_j\) varying. Analogously, here we compute \(\text{score}(q_1, k_j)\) with \(q_1\) fixed and \(k_j\) varying, using the dot product as the score function:

e_{1j}=[q_1 \cdot k_1, \quad q_1 \cdot k_2, \quad \cdots, \quad \quad \cdots \quad \quad q_1 \cdot k_5]
\alpha_{1j} = softmax(e_{1j})
z_{1} = \sum \limits_{j=1}^5 \alpha_{1j}v_j

What about \(z_2\)? We follow the same procedure, now with \(q_2\) fixed and \(k_j\) varying (again with the dot product as the score function):

score(q_2, \ \ k_j)
e_{2j}=[q_2 \cdot k_1, \quad q_2 \cdot k_2, \quad \cdots, \quad \quad \cdots \quad \quad q_2 \cdot k_5]
\alpha_{2j} = softmax(e_{2j})
z_{2} = \sum \limits_{j=1}^5 \alpha_{2j}v_j
Repeat the procedure for all the other outputs \(z_3,\dots,z_T\).

Wait, can we vectorize all these computations and compute the outputs (\(z_1,z_2,\cdots,z_T\)) in one go?

\(Q = W_Q[h_1, h_2, \cdots, h_T] = [q_1, q_2, \cdots, q_T], \quad Q \in \mathbb{R}^{64 \times T}\)

\(K = W_K[h_1, h_2, \cdots, h_T] = [k_1, k_2, \cdots, k_T], \quad K \in \mathbb{R}^{64 \times T}\)

\(V = W_V[h_1, h_2, \cdots, h_T] = [v_1, v_2, \cdots, v_T], \quad V \in \mathbb{R}^{64 \times T}\)
\begin{aligned} Z &= [z_1,z_2,\cdots,z_T] \\ &=softmax\big(\frac{Q^TK}{\sqrt{d_k}}\big)V^T \end{aligned}

where \(d_k\) is the dimension of the key vectors.

Since \(\frac{1}{\sqrt{d_k}}\) scales the values of \(Q^TK\), this is called the scaled dot-product.

\(\dim(Q^TK): (T \times 64)\,(64 \times T) = T \times T\)

\(\dim(Z): (T \times T)\,(T \times 64) = T \times 64\)

Vectorized Output 

[Figure: Scaled dot-product attention block: MatMul \(Q^TK\) → Scale by \(\frac{1}{\sqrt{d_k}}\) → Softmax → MatMul with \(V\).]

\(^*\)Note: The original paper follows \(softmax(\frac{QK^T}{\sqrt{d_k}})V\) as it represents embedding \(h_j\) as a row vector
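A minimal numpy sketch of this vectorized computation, following the slides' column-vector convention (\(H, Q, K, V\) hold one column per word); the weight matrices are random stand-ins:

```python
import numpy as np

# A minimal sketch of Z = softmax(Q^T K / sqrt(d_k)) V^T with columns as words.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[0]
    A = Q.T @ K / np.sqrt(d_k)                                   # (T, T) score matrix
    A = A - A.max(axis=-1, keepdims=True)                        # numerical stability
    alpha = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)    # row-wise softmax
    return alpha @ V.T                                           # (T, d_v): row t is z_t

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 512, 64
H = rng.standard_normal((d_model, T))                            # one column per word
W_Q, W_K, W_V = (rng.standard_normal((d_k, d_model)) * 0.02 for _ in range(3))
Z = scaled_dot_product_attention(W_Q @ H, W_K @ H, W_V @ H)
print(Z.shape)   # (5, 64)
```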

[Figure: A single attention head: \(H=\{h_1,h_2,\cdots,h_T\}\) is transformed by \(W_Q, W_K, W_V\) into \(Q, K, V\), which feed the scaled dot-product attention block.]

Two-head Attention

[Figure: Two-head attention: Head-1 and Head-2 each have their own \(W_Q, W_K, W_V\) and apply scaled dot-product attention to \(H=\{h_1,h_2,\cdots,h_T\}\) in parallel.]

Motivation for Multi-Head attention

What is the significance of having more than one filter/kernel in a CNN layer?

To learn more abstract representations, capture more meaningful interactions between inputs

Similarly, we can have more than one self-attention head, each with its own parameter matrices \((W_Q^i,W_K^i,W_V^i)\), in the hope that each head learns subtle contextual information.

This motivates "Multi-head Attention", which is a simple extension of single-head attention

Just as each kernel in a CNN independently learns its own feature, each head in a transformer independently computes its attention. (Parallel computation!)

Motivation 

Single-Head

The word "it" is strongly connected to the word "was".

Link to the colab: Tensor2Tensor

Motivation 

Two-Head

The word "it" is strongly connected to the word "was" in the first head

The word "it" is strongly connected to the word "animal" in the second head.

So it is evident (empirically) that adding more than one attention head helps in capturing different contextual information of the sentence.

So, the multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

\(q_j = W_Q h_j\)

A little more detail

h_j \in \mathbb{R}^{512 \times 1}
W_Q \in \mathbb{R}^{64 \times 512}
q_j \in \mathbb{R}^{64 \times 1}

The dimension of \(q_j\) is much less than \(h_j\). In fact,

The word embedding \(h_j \in \mathbb{R}^{512}\) is projected to a lower-dimensional representation subspace, \(q_j,k_j,v_j \in \mathbb{R}^{64}\), by the matrices \(W_Q\), \(W_K\), and \(W_V\).

[Figure: Two-head attention: each head applies scaled dot-product attention to the embeddings \(h_j\) using its own \(W_Q^i, W_K^i, W_V^i\); the two head outputs are concatenated and passed through a linear layer.]

How do we extend this to multi-head attention?

Recall how \(Q,K,V\) are obtained from \(H\).

Multi-Head Attention

[Figure: Multi-head attention with \(h=8\) heads: each head \(i\) transforms the embeddings \(h_j\) with its own \(W_Q^i, W_K^i, W_V^i\) and applies scaled dot-product attention; the 8 head outputs are concatenated (giving a \(T \times 512\) matrix) and passed through a linear layer.]
\(\text{MultiHead}(Q,K,V) = \text{Concatenate}(head_1,\cdots,head_8)\,W_O\)

\(head_i = \text{Attention}(Q^i,K^i,V^i)\)

The input is projected into \(h=8\) different representation subspaces.

So, the multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
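A minimal numpy sketch of multi-head attention as described above (8 heads of size 64 on 512-dimensional embeddings); for readability it uses a row-vector convention (one row per word), and all weights are random stand-ins:

```python
import numpy as np

# A minimal sketch of MultiHead(Q, K, V) = Concat(head_1..head_8) W_O.
def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(H, W_Q, W_K, W_V, W_O):
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):            # one (Wq, Wk, Wv) per head
        Q, K, V = H @ Wq, H @ Wk, H @ Wv             # (T, 64) each
        alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(alpha @ V)                      # (T, 64) per head
    return np.concatenate(heads, axis=-1) @ W_O      # (T, 512)

rng = np.random.default_rng(0)
T, d_model, d_k, n_heads = 5, 512, 64, 8
H = rng.standard_normal((T, d_model))                # one row per word
W_Q, W_K, W_V = (rng.standard_normal((n_heads, d_model, d_k)) * 0.02 for _ in range(3))
W_O = rng.standard_normal((n_heads * d_k, d_model)) * 0.02
print(multi_head_attention(H, W_Q, W_K, W_V, W_O).shape)   # (5, 512)
```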

Back to Basic Block

[Figure: The encoder block: the word embeddings of "I enjoyed the movie transformers" pass through a multi-head attention sublayer followed by a feed-forward network.]

A slight change in terminology. These two components form an encoder layer. The encoder is composed of a stack of \(N=6\) layers

Therefore, each layer in the encoder is composed of two sublayers, namely, multi-head attention and feed forward neural networks

[Figure: Position-wise FFN: each attention output \(z_i \in \mathbb{R}^{512}\) passes through an FFN with a hidden layer of size \(l = 2048\) to produce \(o_i \in \mathbb{R}^{512}\), at every position \(i = 1,\dots,5\).]
\text{FFN}(z)=max(0,W_1z+b_1)W_2+b_2

An identical network is applied at each position \(z_i\).
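A minimal numpy sketch of this position-wise FFN (512 → 2048 → 512 with ReLU), applied to every position at once; the weights are random stand-ins and a row-per-position layout is assumed:

```python
import numpy as np

# A minimal sketch of FFN(z) = max(0, W1 z + b1) W2 + b2, row convention.
def ffn(Z, W1, b1, W2, b2):
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2      # ReLU then second linear layer

rng = np.random.default_rng(0)
T, d_model, d_ff = 5, 512, 2048
Z = rng.standard_normal((T, d_model))                # one row per position
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(ffn(Z, W1, b1, W2, b2).shape)                  # (5, 512)
```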

Let's calculate the number of learnable parameters in the encoder layer

\(T=40, \quad d_k=d_q=d_v=64, \quad h=8\)

\(W_Q: 64 \times 512 = 32{,}768\) per head; over 8 heads: \(8 \times 64 \times 512 = 262{,}144\)

Similarly, \(W_K: 8 \times 64 \times 512 = 262{,}144\) and \(W_V: 8 \times 64 \times 512 = 262{,}144\)

\(\begin{aligned} FFN &=(512 \times 2048)+2048+(2048 \times 512)+512 \\ &\approx 2 \times 10^6 \end{aligned}\)

\(W_O: 512 \times 512 = 262{,}144\)

Considering 8 heads, the multi-head attention sublayer therefore has about 1 million parameters; together with the FFN sublayer, that is about 3 million parameters per layer.
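A quick check of this arithmetic (ignoring biases in the attention projections, as the slides do):

```python
# A minimal sketch of the per-layer parameter count described above.
d_model, d_k, heads, d_ff = 512, 64, 8, 2048

attn = 3 * heads * d_k * d_model      # W_Q, W_K, W_V over 8 heads: 786,432
attn += d_model * d_model             # W_O: 262,144
ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model   # 2,099,712

print(attn, ffn, attn + ffn)          # ~1.05M + ~2.1M ≈ 3.1M per encoder layer
```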


Encoder Stack

The encoder is composed of \(N\) identical layers and each layer is composed of 2  sub-layers.

Number of parameters: \(N \times 3 \times 10^6 = 6 \times 3\) million \(\approx\) 18 million parameters.

The computation is parallelized in the horizontal direction (i.e., within a training sample) of the encoder stack, not along the vertical direction.

[Figure: Encoder stack of layers 1 to 6; the input and the output of every layer is a \(T \times 512\) matrix.]

Let us denote the output sequence of vectors from the encoder as \(e_j\), for \(j=1,\cdots,?\)

Decoder

[Figure: The encoder output \(e_j\) (a \(T \times 512\) matrix) is fed to the decoder, which predicts the target words "Naan transfarmar padaththai rasiththen"; the figure shows shapes \(T \times 512\) for the encoder output and \(1 \times 37000\) for each predicted word distribution.]

What will be the output dimension?

Decoder Stack

[Figure: Decoder stack of layers 1 to 6 producing the target words.]

The decoder is a stack of \(N=6\) layers. However, each layer is composed of three sublayers:

  • Masked Multi-Head (Self-)Attention

  • Multi-Head (Cross-)Attention, which attends over the encoder output \(e\)

  • Feed Forward Network

Teacher Forcing

Why is the target sentence being fed as one of the inputs to the decoder?

Usually, we use only the decoder's previous prediction as input to make the next prediction in the sequence.

However, the drawback of this approach is that if the first prediction goes wrong then there is a high chance the rest of the predictions will go wrong (because of conditional probability)

Of course, the algorithm has to fix this as training progresses. But, it takes a long time to train the model.

The other approach is to use the so-called "Teacher Forcing" algorithm. Let's say the target language is English.

[Figure: Decoder (D) predictions drifting from the ground truth "I enjoyed the film transformer": a wrong early prediction (e.g., "the sunshine" instead of "the film") sends the rest of the sentence off course (e.g., "last night").]

This will lead to an accumulation of errors.

Masked (Self) Attention

[Figure: Scaled dot-product attention block, as in the encoder: MatMul \(Q^TK\) → Scale → Softmax → MatMul.]

With one important difference:

Recall that in self-attention we computed the query, key, and value vectors \(q, k\), and \(v\) by multiplying the word embeddings \(h_1,h_2,\cdots,h_T\) by the transformation matrices \(W_Q,W_K,W_V\), respectively.

The same is repeated in the decoder layers, except that this time \(h_1,h_2,\cdots,h_T\) are the word embeddings of the target sentence. But,

[Figure: The same attention block with an additional Mask step between the scaling and the softmax.]

With one important difference: Masking to implement the teacher-forcing approach during training.

Note: In practice, the encoder block also uses masking in its attention sublayer, to mask the padded tokens in sequences of length less than \(T\).

Of course, we can't use teacher forcing during inference. Instead, the decoder acts as an auto-regressor.


Masked Multi-Head Self Attention

How do we create the mask? Where should we incorporate it? At the input or output or somewhere in between?

\(Q_1 = W_{Q_1}H, \quad K_1 = W_{K_1}H, \quad V_1 = W_{V_1}H, \quad Q_1^TK_1 = A\)

[Figure: The rows \(\alpha_{1:}, \alpha_{2:}, \ldots, \alpha_{T:}\) of the attention matrix weight the value vectors \(v_{1:}, v_{2:}, \ldots, v_{T:}\).]

Assign zero weight (\(\alpha_{ij}=0\)) to the value vectors \(v_j\) that are to be masked in a sequence.

Let us denote the mask matrix by \(M\), then

Z = softmax(A+M)V^T

Masked Multi-Head (Self)Attention

[Figure: The target sequence "<Go> Naan transfarmar padaththai rasiththen": at each position, attention to the positions that follow it is masked.]

Masking is done by inserting negative infinity at the respective positions.

This forms a \(T \times T\) triangular matrix, with one half all zeros and the other half all negative infinity:

M=\begin{bmatrix} 0 & -\infty & -\infty & -\infty&-\infty\\ 0 & 0 & -\infty & -\infty & -\infty \\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 &0\\ \end{bmatrix}
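A minimal numpy sketch of constructing such a mask \(M\) and applying it before the softmax, so that every position gets zero weight on the positions after it:

```python
import numpy as np

# A minimal sketch of the causal mask M and its effect on the attention weights.
T = 5
M = np.triu(np.full((T, T), -np.inf), k=1)   # -inf strictly above the diagonal, 0 elsewhere
print(M)

A = np.random.default_rng(0).standard_normal((T, T))   # toy stand-in for the scores Q^T K
masked = A + M
alpha = np.exp(masked - masked.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)   # row t has zero weight on positions > t
print(np.round(alpha, 2))
```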

 Multi-Head Cross Attention

Now we have the vectors \(\{s_1,s_2,\cdots,s_T\}\) coming from the self-attention layer of the decoder.

We also have a set of vectors \(\{e_1,e_2,\cdots,e_T\}\) coming from the top layer of the encoder stack, which is shared with all layers of the decoder stack.

Again, we need to create query, key, and value vectors by applying the linear transformation matrices \(W_{Q_2},W_{K_2},\) and \(W_{V_2}\) to these vectors \(s_i\) and \(e_j\).

Therefore, it is called  Encoder-Decoder attention or cross attention.

We construct the query vectors from the output of the self-attention layer, \(S\), and the key and value vectors from the encoder output, \(E\):

Q_2=W_{Q_2}S
K_2=W_{K_2}E
V_2=W_{V_2}E
Z = softmax(Q_2^TK_2)V_2^T

We compute the multi-head attention using \(Q_2,K_2,V_2\) and concatenate the resulting vectors.

Finally, we pass the concatenated vectors through the feed-forward neural network to obtain the output vectors \(O\)
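A minimal numpy sketch of cross-attention as described above: queries from the decoder side \(S\), keys and values from the encoder output \(E\) (the usual \(1/\sqrt{d_k}\) scaling is included here); all weights are random stand-ins:

```python
import numpy as np

# A minimal sketch of cross-attention: Q from the decoder, K and V from the encoder.
def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T_src, T_tgt, d_model, d_k = 5, 4, 512, 64
E = rng.standard_normal((d_model, T_src))    # encoder outputs e_1..e_T (columns)
S = rng.standard_normal((d_model, T_tgt))    # decoder self-attention outputs s_1..s_T

W_Q2, W_K2, W_V2 = (rng.standard_normal((d_k, d_model)) * 0.02 for _ in range(3))
Q2, K2, V2 = W_Q2 @ S, W_K2 @ E, W_V2 @ E    # note: keys and values come from the encoder

Z = softmax_rows(Q2.T @ K2 / np.sqrt(d_k)) @ V2.T
print(Z.shape)   # (4, 64): one output row per target position
```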

[Figure: The first decoder layer: the target embeddings \(h_{1:T}\) (for "<Go> Naan transfarmar padaththai rasiththen") go through masked multi-head self-attention, \(f(Q_1,K_1,V_1;\text{Mask};W_{O_1})\) with \(\{Q_1,K_1,V_1\} = f\big(W_{Q_1},W_{K_1},W_{V_1},h_{1:T}\big)\), producing \(s_{1:T}\); then \(Q_2 = W_{Q_2}s_{1:T}\), while \(K_2, V_2\) are obtained from the encoder output \(e_{1:T}\); the cross-attention output \(f(Q_2,K_2,V_2;W_{O_2})\) is passed to the feed-forward network.]

Number of Parameters:

\(FFN = 2\times(512 \times 2048)+2048+512 \approx 2 \times 10^6\)

About 2 million parameters from FFN layer

About 1 million parameters from Masked-Multi Head Attention layer

About 1 million parameters from Multi Head Cross Attention layer

About 4 million parameters per decoder layer.

Decoder output

[Figure: The output of the top decoder layer is passed through a linear layer \(W_D\) followed by a softmax.]

The output from the topmost decoder layer is linearly transformed by the matrix \(W_D\) of size \(512 \times |V|\), where \(|V|\) is the size of the vocabulary.

The probability distribution over the predicted words is obtained by applying the softmax function.

This alone contributes about 19 million parameters of the total  65 million parameters of the architecture.

Module 1.3: Positional Encoding


Positional Encoding

The position of words in a sentence was encoded in the hidden states  of RNN based sequence to sequence models.

However, in transformers, no such information is available to either encoder or decoder. Moreover, the output from self-attention is permutation-invariant.

How do we embed positional information in the word embedding \(h_j\) (of size 512)?

[Figure: Each word ("I", "Enjoyed", ...) at position \(j\) has an embedding \(h_j^\prime \in \mathbb{R}^{512}\), with dimensions indexed by \(i = 0,1,\dots,511\).]

So, it is necessary to encode the positional information.

[Figure: The positional vector \(p_0\) is added to the embedding \(h_0^\prime \in \mathbb{R}^{512}\) to give \(h_0 \in \mathbb{R}^{512}\).]

How do we fill the elements of the positional vector \(p_0\)?

Could it be a constant vector (i.e., all elements of \(p_j\) set to the constant (position) value \(j\))?

Can we use one-hot encoding for the position \(j\), \(j=0,1,\cdots,T\)?

Or learn an embedding for all possible positions?

Not suitable if the sentence length is dynamic.

PE_{(j,i)} = \begin{cases} \sin\left(\frac{j}{10000^{2i/d_{\text{model}}}}\right) & i=0,1,\cdots,255 \\ \cos\left(\frac{j}{10000^{(2i)/d_{\text{model}}}}\right) & i=0,1,\cdots,255\\ \end{cases}

Sinusoidal encoding function

Hypothesis: Embed a unique pattern of features  for each position \(j\) and  the model will learn to attend by the relative position.

How do we generate the features?

where \(d_{model} = 512\).

For a fixed position \(j\), the value at dimension \(i\) is taken from \(\sin(\cdot)\) if \(i\) is even and from \(\cos(\cdot)\) if \(i\) is odd.

Let's evaluate the function \(PE_{(j,i)}\) for \(j={0,1,\cdots,8}\) and \(i={0,1,\cdots,63}\)

Then, we can visualize this matrix as a heat map.
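A minimal numpy sketch that evaluates this sinusoidal encoding for \(j = 0,\dots,8\) and \(i = 0,\dots,63\) (with \(d_{model}=512\)), which is exactly the matrix one would plot as the heat map:

```python
import numpy as np

# A minimal sketch of the sinusoidal positional encoding defined above.
def positional_encoding(num_positions, num_dims, d_model=512):
    PE = np.zeros((num_positions, num_dims))
    for j in range(num_positions):
        for i in range(0, num_dims, 2):               # i indexes sin/cos pairs
            angle = j / (10000 ** (i / d_model))       # frequency decreases with i
            PE[j, i] = np.sin(angle)
            if i + 1 < num_dims:
                PE[j, i + 1] = np.cos(angle)
    return PE

PE = positional_encoding(9, 64)
print(PE.shape)        # (9, 64)
print(PE[0, :6])       # position 0: alternating 0, 1, 0, 1, ...
```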

For \(j = 0\): since \(\sin(0)=0\) and \(\cos(0)=1\), \(p_0=\begin{bmatrix} 0\\1\\0\\1\\ 0\\\vdots\\0\\1 \end{bmatrix}\)

This alternating pattern of 0's and 1's will be added to the first word (embedding) of all sentences (sequences).


Let's ask some interesting questions

Distance matrix:

            I   Enjoyed   the   film   transformer
I           0      1       2      3        4
Enjoyed     1      0       1      2        3
the         2      1       0      1        2
film        3      2       1      0        1
transformer 4      3       2      1        0

The interesting observation is that the distance increases on the left and right of 0 (in all the rows) and is symmetric about the centre position of the sentence.

Does the PE function satisfy this property?

Let's verify it graphically.

Does one-hot encoding satisfy this property?

No.

The Euclidean distance between any two vectors (independent of their position) is always \(\sqrt{2}\).

[Figure: Heat map of \(PE_{(j,i)}\).]

At every even-indexed column \((i=0,2,4,\cdots,510)\), we have a sine function with decreasing frequency (or increasing wavelength) as \(i\) increases.

Similarly, at every odd-indexed column \((i=1,3,5,\cdots,511)\), we have a cosine function with decreasing frequency (or increasing wavelength) as \(i\) increases.

Interleaving these two as alternating columns creates the (name it: wavy, aurora?) pattern.

The wavelengths progress from \(2\pi\) to \(10000 \cdot 2\pi\).

Module 1.4: Training the Transformer


For a (rough) comparison, we may think of the transformer architecture as being composed of attention layers and hidden layers.

[Figure: A simplified view of the network: attention layers (with weights \(W_{atten}\)) and hidden layers (with weights \(W_1, W_2\)), followed by a linear layer and a softmax.]


Then there is one attention layer and two hidden layers in each encoder layer, and two attention layers and two hidden layers in each decoder layer. With \(N=6\) encoder and decoder layers, the network is deep, with 42 layers.

How do we ensure the gradient flow across the network? Residual connections!

How do we speed up the training process? Normalization!

Batch Normalization at the \(l^{th}\) layer

Let us associate an accumulator with the \(l^{th}\) layer that stores the activations of the batch inputs (the accumulated activations for \(m\) training samples).

Let \(x_i^j\) denote the activation of the \(i^{th}\) neuron for the \(j^{th}\) training sample.

\(\mu_{i} =\frac{1}{m} \sum \limits_{j=1}^{m}x_{i}^j, \quad \sigma_i^2 =\frac{1}{m} \sum \limits_{j=1}^{m}(x_{i}^j-\mu_i)^2, \quad \hat{x}_i = \frac{x_i-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}, \quad \hat{y}_i = \gamma \hat{x}_i+\beta\)

We have three variables \(l, i, j\) involved in the statistics computation. Let's visualize these as three axes that form a cube.


Can we apply batch normalization to transformers?

Of course, yes. However, there are some limitations to BN.

The accuracy of estimation of mean and variance depends on the size of \(m\). So using a smaller size of \(m\) results in high error.

Because of this, we can't use a batch size of 1 at all (i.e., it won't make any difference: \(\mu_i=x_i,\ \sigma_i=0\)).

Fortunately, we have another simple normalization technique called Layer Normalization that works well.

Other than this limitation, it was also empirically found that the naive use of BN leads to performance degradation in NLP tasks (source).

There was also a systematic study that validated the statement and proposed a new normalization technique (by modifying BN) called powerNorm.

Layer Normalization at \(l^{th}\) layer

The computation is simple. Take the average across the outputs of the hidden units in the layer. Therefore, the normalization is independent of the number of samples in a batch.

This allows us to work with a batch size of 1 (if needed as in the case of RNN)

\(\mu_l=\frac{1}{H}\sum \limits_{i=1}^H x_i, \quad \sigma_l=\sqrt{\frac{1}{H}\sum \limits_{i=1}^H (x_i-\mu_l)^2}\)

\(\hat{x}_i = \frac{x_i-\mu_l}{\sqrt{\sigma_l^2+\epsilon}} \quad \forall i, \qquad \hat{y}_i = \gamma \hat{x}_i+\beta\)
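A minimal numpy sketch of layer normalization as defined above; the statistics are computed over the \(H\) hidden units of a single sample, so it works even with a batch size of 1:

```python
import numpy as np

# A minimal sketch of layer normalization over the hidden units of one sample.
def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean()                                  # mean over the H hidden units
    sigma = np.sqrt(((x - mu) ** 2).mean())        # std over the H hidden units
    x_hat = (x - mu) / np.sqrt(sigma ** 2 + eps)
    return gamma * x_hat + beta

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])            # activations x_1..x_5 of one layer
print(layer_norm(x, gamma=1.0, beta=0.0))
```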

The complete Layer

[Figure: The complete encoder layer: a multi-head attention sublayer and a feed-forward network sublayer, each followed by an Add & Layer Norm block.]

A residual connection and a layer-norm block are added after every multi-head attention, feed-forward network, and cross-attention block.

The Transformer Architecture

The output from the top encoder layer is fed as input to all the decoder layers to compute multi-head cross attention.

The input embeddings for the words in a sequence are learned while training the model (no pretrained embedding model such as Word2Vec is used).

This amounts to an additional set of weights in the model.

The positional information is encoded and added to the input embedding (this function can also be parameterized).

Training the Transformer

\(m_t = \beta_1 m_{t-1}+(1-\beta_1)\nabla w_t, \qquad v_t = \beta_2 v_{t-1}+(1-\beta_2)(\nabla w_t)^2\)

\(\hat{m}_t =\frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t =\frac{v_t}{1-\beta_2^t}\)

\(w_{t+1}=w_t-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t\)

ADAM

Defaults: \(\beta_1=0.9, \quad \beta_2=0.999, \quad \eta=0.001\)

However, the learning rate \(\eta\) was changed across time steps.

Module 1.5: Warm-up strategy


Attempt-1: Using a decaying learning rate 

\eta=\frac{1}{\sqrt{steps}}=steps^{-0.5}

Let's assume that the model  is taking too long to converge (i.e., the error rate decays very slowly, especially for the first 4000 steps)

So we decide to see whether increasing the learning rate would be of any help.

Let's say we start off  training the model with mini-batches

Attempt-2: Using a growing learning rate 

\eta=steps \times 4000^{-1.5}

Let's say the error rate decays significantly well for the first 4000 steps and increases thereafter.

So, what will be our next attempt?

We increase the learning rate linearly by setting

Increase the learning rate for the first 4000 steps and decrease it thereafter.

How do we combine both?

\eta=\begin{cases} steps \cdot 4000^{-1.5},& \text{if } steps\leq 4000\\ steps^{-0.5}, & \text{otherwise} \end{cases}

This is counterintuitive. We usually start with a high learning rate and keep decreasing it.

However, here we do a 'warm-up' during initial steps and decrease the learning rate after "warm-up steps (4000)"

Well, "Warm-up steps" is another hyper-parameter.

\(\eta=d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})\)

Learning rate with Warmup Strategy

Notice that the red curve is  decreasing monotonically and the blue curve is increasing monotonically.

So, it is true that, after the warmup steps, the blue curve is always greater than the red curve.

This allows us to rewrite the learning rate schedule as follows

\(\eta=d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})\), where \(d_{model}^{-0.5}\) acts as a scaling factor.

Learning rate with Warm-Up Strategy

warmupSteps=4000
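A minimal sketch of this schedule in Python, with \(d_{model}=512\) and 4000 warm-up steps; the learning rate rises linearly during warm-up and then decays as \(1/\sqrt{step}\):

```python
# A minimal sketch of the warm-up learning-rate schedule given above.
def lr_schedule(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in [1, 1000, 4000, 8000, 40000]:
    print(step, lr_schedule(step))    # rises until step 4000, then decays as 1/sqrt(step)
```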

References