[Figure: an RNN encoder-decoder (seq2seq) model translating the English sentence "I enjoyed the movie transformers" into Tamil: "Naan transfarmar padaththai rasiththen".]
\(n\): number of words, \(t\): time step of the decoder.
[Figure: the decoder unrolled over time; starting from the <Go> token, each step is conditioned on the encoder's representation of "I enjoyed the film transformers" and on the previously generated words (Naan, transfarmar, padaththai, rasiththen).]
Alignment of words:
[Figure: alignment between the source words "I enjoyed the film transformers" and the target words "Naan transfarmar padaththai rasiththen".]
Content-based attention
Additive (concat) attention
Dot-product attention
Scaled dot-product attention
All score functions take in two vectors and produce a scalar.
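As a quick NumPy sketch (the function names and toy shapes are illustrative, not taken from the lecture code), the four variants differ only in how the scalar is computed from the two vectors:

```python
import numpy as np

def dot_score(s, h):
    # Dot-product attention: score(s, h) = s . h
    return np.dot(s, h)

def scaled_dot_score(s, h):
    # Scaled dot-product attention: score(s, h) = (s . h) / sqrt(d_k)
    return np.dot(s, h) / np.sqrt(h.shape[-1])

def additive_score(s, h, W1, W2, v):
    # Additive (concat) attention: score(s, h) = v^T tanh(W1 s + W2 h)
    return np.dot(v, np.tanh(W1 @ s + W2 @ h))

def content_based_score(s, h):
    # Content-based attention: cosine similarity between s and h
    return np.dot(s, h) / (np.linalg.norm(s) * np.linalg.norm(h) + 1e-9)

s, h = np.random.randn(4), np.random.randn(4)
W1, W2, v = np.random.randn(8, 4), np.random.randn(8, 4), np.random.randn(8)
# Every score function maps the two vectors to a single scalar.
print(dot_score(s, h), scaled_dot_score(s, h),
      additive_score(s, h, W1, W2, v), content_based_score(s, h))
```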
[Figure: RNN seq2seq models translating "I enjoyed the movie transformers" into "Naan transfarmar padaththai rasiththen".]
[Figure: with attention, the decoder attends over all the encoder's hidden states for "I enjoyed the movie transformers" while generating "Naan transfarmar padaththai rasiththen".]
[Figure: the Transformer encoder block consists of a self-attention sublayer and feed-forward networks; the decoder block additionally contains an encoder-decoder attention sublayer; both operate on word embeddings.]
[Figure: the words of "I enjoyed the movie transformers", represented here by toy 3-dimensional vectors, are fed in parallel to the self-attention layer.]
Self-attention weights for the sentence "The animal didn't cross the street because it ...":

| | The | animal | didn't | cross | the | street | because | it |
|---|---|---|---|---|---|---|---|---|
| The | 0.6 | 0.1 | 0.05 | 0.05 | 0.02 | 0.02 | 0.02 | 0.1 |
| animal | 0.02 | 0.5 | 0.06 | 0.15 | 0.02 | 0.05 | 0.01 | 0.12 |
| didn't | 0.01 | 0.35 | 0.45 | 0.1 | 0.01 | 0.02 | 0.01 | 0.03 |
| cross | ... | | | | | | | |
| the | ... | | | | | | | |
| street | ... | | | | | | | |
| because | ... | | | | | | | |
| it | 0.01 | 0.6 | 0.02 | 0.1 | 0.01 | 0.2 | 0.01 | 0.01 |

Note how the weight for "it" is highest (0.6) on "animal".
[Figure: the representations of "Animal" and "It" in \(\mathbb{R}^3\) and their projections into \(\mathbb{R}^2\).]
[Figure: zooming in on the toy vectors for "I" and "transformers" passed through the self-attention layer; the layer's parameter matrices are fixed, while the number of inputs (and therefore outputs) is variable.]
Let's first focus on calculating the first output \(z_1\) of the self-attention layer.
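A minimal NumPy sketch of that single-output computation, assuming learned projection matrices \(W_Q, W_K, W_V\) (the toy dimensions and random values are illustrative):

```python
import numpy as np

np.random.seed(0)
T, d_in, d_k, d_v = 5, 3, 2, 2          # 5 input words, toy dimensions
X = np.random.randn(T, d_in)            # one row per input word
W_Q = np.random.randn(d_in, d_k)        # fixed (learned) parameter matrices
W_K = np.random.randn(d_in, d_k)
W_V = np.random.randn(d_in, d_v)

q1 = X[0] @ W_Q                         # query for the first word
K = X @ W_K                             # keys of all T words
V = X @ W_V                             # values of all T words

scores = K @ q1 / np.sqrt(d_k)          # one scalar score per word
alphas = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
z1 = alphas @ V                         # first output: weighted sum of the values
print(z1)                               # shape (d_v,)
```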
Wait, can we vectorize all these computations and compute the outputs (\(z_1,z_2,\cdots,z_T\)) in one go?
\(^*\)Note: the original paper writes \(softmax(\frac{QK^T}{\sqrt{d_k}})V\) because it represents each embedding \(h_j\) as a row vector.
Link to the colab: Tensor2Tensor
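Here is a minimal NumPy sketch of the vectorized computation, following the row-vector convention \(softmax(\frac{QK^T}{\sqrt{d_k}})V\) from the note (the shapes are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (T, d_k), V: (T, d_v); returns all outputs z_1, ..., z_T at once
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (T, T) score matrix
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (T, d_v)

T, d_k, d_v = 5, 64, 64
Q, K, V = np.random.randn(T, d_k), np.random.randn(T, d_k), np.random.randn(T, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)           # (5, 64)
```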
[Figure: multi-head attention; \(h=8\) scaled dot-product attention heads run in parallel, their outputs are concatenated (\(T \times 512\)) and passed through a linear layer.]
The input is projected into \(h=8\) different representation subspaces.
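A sketch of how the \(h=8\) projections, the concatenation and the final linear layer fit together (with \(d_{model}=512\); the weight-matrix names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    # X: (T, d_model); W_Q/W_K/W_V: h per-head projections of shape (d_model, d_model/h)
    d_k = X.shape[-1] // h
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]        # project into subspace i
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # (T, d_model/h)
    concat = np.concatenate(heads, axis=-1)                 # (T, 512)
    return concat @ W_O                                     # final linear layer

T, d_model, h = 5, 512, 8
X = np.random.randn(T, d_model)
W_Q = [np.random.randn(d_model, d_model // h) for _ in range(h)]
W_K = [np.random.randn(d_model, d_model // h) for _ in range(h)]
W_V = [np.random.randn(d_model, d_model // h) for _ in range(h)]
W_O = np.random.randn(d_model, d_model)
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)    # (5, 512)
```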
[Figure: inside the encoder block, the multi-head attention outputs for "I enjoyed the movie transformers" are passed position-wise through feed-forward networks FFN 1 to FFN 5.]
The same (identical) feed-forward network is applied to each position \(z_i\).
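A sketch of this position-wise feed-forward network; the same weights are applied to every row (position) of \(Z\). The hidden size 2048 follows the original paper, the remaining names are illustrative:

```python
import numpy as np

def position_wise_ffn(Z, W1, b1, W2, b2):
    # Z: (T, d_model). The same W1, b1, W2, b2 are used at every position z_i,
    # so FFN(z_i) = max(0, z_i W1 + b1) W2 + b2 is applied row by row.
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2

T, d_model, d_ff = 5, 512, 2048
Z = np.random.randn(T, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(position_wise_ffn(Z, W1, b1, W2, b2).shape)   # (5, 512)
```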
[Figure: the encoder is a stack of six such layers (Layer-1 through Layer-6) over "I enjoyed the movie transformers"; the decoder is likewise a stack of six layers generating "Naan transfarmar padaththai rasiththen".]
[Figure: the decoder block contains a masked multi-head (self-)attention sublayer, a multi-head (cross-)attention sublayer, and a feed-forward network.]
[Figure: training the decoder (D): the model's predictions (e.g. "I enjoyed the sunshine ... last night") are compared against the ground truth "I enjoyed the film ... transformer".]
Note: in practice, the encoder block also uses masking in its attention sublayer, to mask out the padded tokens in sequences of length < T.
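For the padding case mentioned in this note, a minimal sketch of how such a mask could be built from the true sequence lengths (the batch size, lengths and function name are illustrative assumptions):

```python
import numpy as np

def padding_mask(lengths, T):
    # lengths[b] = true length of the b-th sequence in a batch padded to T tokens
    positions = np.arange(T)[None, :]                                      # (1, T)
    mask = np.where(positions < np.array(lengths)[:, None], 0.0, -np.inf)  # (B, T)
    return mask[:, None, :]   # broadcasts over query positions: (B, 1, T)

# Three sequences of lengths 5, 3 and 4, padded to T = 5. The mask is added to the
# (B, T, T) score matrix before the softmax, so padded key positions get weight 0.
print(padding_mask([5, 3, 4], T=5)[1])   # [[ 0.  0.  0. -inf -inf]]
```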
Masked Multi-Head Self Attention
How do we create the mask? Where should we incorporate it: at the input, at the output, or somewhere in between?
Assign zero weights \(\alpha_{ij}=0\) to the value vectors \(v_j\) that are to be masked in a sequence.
Let us denote the mask matrix by \(M\); then the attention weights become \(softmax(\frac{QK^T}{\sqrt{d_k}}+M)\), where \(M_{ij}=-\infty\) for the masked positions and \(0\) elsewhere.
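A sketch of building and applying such a mask for the decoder's (causal) self-attention; the mask is added to the scores before the softmax so that the masked weights become zero (names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(Q, K, V):
    T, d_k = Q.shape
    M = np.triu(np.full((T, T), -np.inf), k=1)   # M[i, j] = -inf for j > i, else 0
    scores = Q @ K.T / np.sqrt(d_k) + M          # add the mask to the scores ...
    return softmax(scores) @ V                   # ... so alpha_ij = 0 at masked spots

T, d_k = 4, 8
Q, K, V = np.random.randn(T, d_k), np.random.randn(T, d_k), np.random.randn(T, d_k)
Z = masked_self_attention(Q, K, V)
print(Z.shape)   # (4, 8); position i attends only to positions j <= i
```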
[Figure: the decoder block applied to the (right-shifted) target sequence "<Go> Naan transfarmar padaththai rasiththen": masked multi-head (self-)attention, multi-head (cross-)attention over the encoder output, and a feed-forward network.]
[Figure: the decoder output is passed through a linear layer \(W_D\) and a softmax to produce the next-word distribution (e.g. "I", "Enjoyed").]
Pairwise distances between the positions of the words in "I Enjoyed the film transformer":

| | I | Enjoyed | the | film | transformer |
|---|---|---|---|---|---|
| I | 0 | 1 | 2 | 3 | 4 |
| Enjoyed | 1 | 0 | 1 | 2 | 3 |
| the | 2 | 1 | 0 | 1 | 2 |
| film | 3 | 2 | 1 | 0 | 1 |
| transformer | 4 | 3 | 2 | 1 | 0 |
At every even-indexed column \((i=0,2,4,\cdots,510)\), we have a sine function whose frequency decreases (wavelength increases) as \(i\) increases.
Similarly, at every odd-indexed column \((i=1,3,5,\cdots,511)\), we have a cosine function whose frequency decreases (wavelength increases) as \(i\) increases.
Interleaving these two as alternating columns creates the (name it: wavy, aurora?) pattern.
The wavelengths form a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\).
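A sketch of this sinusoidal positional encoding (with \(d_{model}=512\); the function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                # positions 0, ..., max_len-1
    i = np.arange(0, d_model, 2)[None, :]            # even column indices 0, 2, ..., 510
    angles = pos / np.power(10000, i / d_model)      # wavelength grows with i
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2] = np.sin(angles)                     # sine at even-indexed columns
    PE[:, 1::2] = np.cos(angles)                     # cosine at odd-indexed columns
    return PE

PE = positional_encoding(max_len=10)                 # one row per position
print(PE.shape)   # (10, 512)
```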
[Figure: a stack of attention layers followed by a linear layer and a softmax.]
Batch Normalization at layer \(l\)
Accumulated activations for \(m\) training samples: let \(x_i^j\) denote the activation of the \(i^{th}\) neuron for the \(j^{th}\) training sample.
We have three variables \(l, i, j\) involved in the statistics computation. Let's visualize these as three axes that form a cube.
Let us associate an accumulator with the \(l^{th}\) layer that stores the activations of the batch inputs.
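To make the axes concrete, here is a small NumPy sketch contrasting the statistics batch normalization accumulates (over the \(m\) samples, per neuron) with those layer normalization uses (over the neurons, per sample), the latter being what the Add & Layer Norm step relies on (shapes and names are illustrative):

```python
import numpy as np

m, n_neurons = 32, 512                 # m training samples, neurons of layer l
X = np.random.randn(m, n_neurons)      # X[j, i] = activation x_i^j

# Batch norm: statistics of each neuron i, accumulated over the m samples (axis 0)
bn_mean, bn_var = X.mean(axis=0), X.var(axis=0)                                 # (512,)
X_bn = (X - bn_mean) / np.sqrt(bn_var + 1e-5)

# Layer norm: statistics over the neurons of each individual sample (axis 1)
ln_mean, ln_var = X.mean(axis=1, keepdims=True), X.var(axis=1, keepdims=True)   # (32, 1)
X_ln = (X - ln_mean) / np.sqrt(ln_var + 1e-5)

print(bn_mean.shape, ln_mean.shape)    # (512,) (32, 1)
```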
[Figure: the encoder block with residual connections; each multi-head attention and feed-forward sublayer is followed by an Add & Layer Norm step.]
Defaults
However, the learning rate \(\eta\) was changed across time steps.
Attempt-1: Using a decaying learning rate
Attempt-2: Using a growing learning rate
Scaling factor: the final schedule combines both attempts, \(\eta = d_{model}^{-0.5} \cdot \min(stepNum^{-0.5},\ stepNum \cdot warmupSteps^{-1.5})\) with warmupSteps = 4000; the learning rate grows linearly for the first warmupSteps steps and then decays as \(stepNum^{-0.5}\).
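A sketch of that schedule as a function of the step number (with \(d_{model}=512\) and warmupSteps = 4000 as above):

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    # Grows linearly for the first warmup_steps, then decays as step_num^-0.5,
    # everything scaled by d_model^-0.5.
    step_num = max(step_num, 1)
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 10000, 100000):
    print(step, round(transformer_lr(step), 6))
```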