[Figure: An encoder RNN reads the source sentence "I enjoyed the movie transformers"; a decoder RNN then generates the Tamil translation "Naan transfarmar padaththai rasiththen" ("I enjoyed the movie Transformers").]
[Figure: The RNN encoder processing the source sentence one word at a time.]
Notation: \(n\) is the number of words in the source sentence and \(t\) is the decoder time step.
[Figure: Decoding step by step. Starting from the <Go> token, the decoder emits "Naan", then "transfarmar", then "padaththai", then "rasiththen", feeding each generated word back in as the input at the next time step.]
Alignment of words:
[Figure: Alignment between the source "I enjoyed the movie transformers" and the target "Naan transfarmar padaththai rasiththen". The alignment is not monotonic: "Naan" aligns with "I", "transfarmar" with "transformers", "padaththai" with "the movie", and "rasiththen" with "enjoyed", so the decoder needs to look at different source positions at different steps.]
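To make the figure precise: at decoder step \(t\), attention assigns a weight to each of the \(n\) encoder states and uses their weighted sum as the context for that step. In the standard formulation (the symbols \(s_{t-1}\) for the previous decoder state and \(h_j\) for the \(j\)-th encoder state are assumed here, not taken from the slides):

\[
e_{tj} = \operatorname{score}(s_{t-1}, h_j), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})}, \qquad
c_t = \sum_{j=1}^{n} \alpha_{tj}\, h_j .
\]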
The score function can be chosen in several ways:
- Content-based attention
- Additive (concat) attention
- Dot-product attention
- Scaled dot-product attention

Each score function takes two vectors and produces a single scalar.
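As a minimal sketch (not from the slides; the argument names `s` for a decoder state and `h` for an encoder state, and the weight shapes, are assumptions for illustration), the four variants can be written as:

```python
import numpy as np

def content_based(s, h):
    # Cosine similarity between the two vectors.
    return (s @ h) / (np.linalg.norm(s) * np.linalg.norm(h))

def additive(s, h, W1, W2, v):
    # Bahdanau-style: project both vectors, add, tanh, then dot with v.
    return v @ np.tanh(W1 @ s + W2 @ h)

def dot_product(s, h):
    # Luong-style: plain inner product (the two vectors must share a dimension).
    return s @ h

def scaled_dot_product(s, h):
    # Dot product divided by sqrt(d) so scores stay well-scaled as d grows.
    return (s @ h) / np.sqrt(h.shape[0])

# Toy check: every variant maps two vectors to one scalar.
d = 4
rng = np.random.default_rng(0)
s, h = rng.normal(size=d), rng.normal(size=d)
W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
for f in (content_based, dot_product, scaled_dot_product):
    print(f.__name__, float(f(s, h)))
print("additive", float(additive(s, h, W1, W2, v)))
```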
[Figure: RNN seq2seq models. Without attention, the decoder must generate "Naan transfarmar padaththai rasiththen" from a single fixed-length summary of "I enjoyed the movie transformers".]
[Figure: With attention, every decoding step can look back at all of the encoder's hidden states and weight them by relevance.]
[Figure: The Transformer encoder-decoder. An encoder block is self-attention followed by a feed-forward network; a decoder block is self-attention, then encoder-decoder attention, then a feed-forward network. The input "I enjoyed the movie transformers" enters through a word-embedding layer.]
[Figure: Zooming in on the encoder's self-attention layer, with the five embedded words as its inputs.]
Suppose the five words are given the toy 3-dimensional embeddings

\[
X = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 1 & 1
\end{pmatrix},
\]

one row per word, from "I" down to "transformers".
How much should each word attend to every other word in a sentence? For "The animal didn't cross the street because it …", we want to fill in a matrix of pairwise attention weights:

|         | The | animal | didn't | cross | the | street | because | it |
|---------|-----|--------|--------|-------|-----|--------|---------|----|
| The     |     |        |        |       |     |        |         |    |
| animal  |     |        |        |       |     |        |         |    |
| didn't  |     |        |        |       |     |        |         |    |
| cross   |     |        |        |       |     |        |         |    |
| the     |     |        |        |       |     |        |         |    |
| street  |     |        |        |       |     |        |         |    |
| because |     |        |        |       |     |        |         |    |
| it      |     |        |        |       |     |        |         |    |
After training, the weights might look like this (rows for "cross" through "because" elided):

|         | The  | animal | didn't | cross | the  | street | because | it   |
|---------|------|--------|--------|-------|------|--------|---------|------|
| The     | 0.6  | 0.1    | 0.05   | 0.05  | 0.02 | 0.02   | 0.02    | 0.1  |
| animal  | 0.02 | 0.5    | 0.06   | 0.15  | 0.02 | 0.05   | 0.01    | 0.12 |
| didn't  | 0.01 | 0.35   | 0.45   | 0.1   | 0.01 | 0.02   | 0.01    | 0.03 |
| ⋮       |      |        |        |       |      |        |         |      |
| it      | 0.01 | 0.6    | 0.02   | 0.1   | 0.01 | 0.2    | 0.01    | 0.01 |

Note the row for "it": most of its weight (0.6) falls on "animal", i.e. attention resolves the pronoun by pointing at its referent.
[Figure: The representations of "animal" and "it" shown in \(\mathbb{R}^3\) and in \(\mathbb{R}^2\).]
[Figure: Two of the inputs, "I" \(=(1,0,0)\) and "transformers" \(=(1,1,1)\), entering the self-attention layer alongside two \(3 \times 2\) matrices, \(\begin{pmatrix} 0.3 & 0.2 \\ 0.1 & 0.5 \\ -0.1 & 0.25 \end{pmatrix}\) and \(\begin{pmatrix} 0.11 & 0.89 \\ 0 & 0.4 \\ 0.2 & 0.7 \end{pmatrix}\), one labelled "fixed" and the other "variable" (presumably the learned parameters are fixed once trained, while activations vary with the input).]
Let's first focus on calculating the first output, \(z_1\), of the self-attention layer.
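As a minimal sketch of that computation, assuming the standard query/key/value parameterization with scaled dot-product scores (the projection matrices below are random stand-ins for learned parameters, not the values from the figure):

```python
import numpy as np

# Toy inputs: one row per word ("I" ... "transformers"), 3-dim embeddings.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
d_k = 2
W_Q, W_K, W_V = (rng.normal(size=(3, d_k)) for _ in range(3))

# Step 1: project the first word to a query; project every word to keys/values.
q1 = X[0] @ W_Q          # (d_k,)
K  = X @ W_K             # (5, d_k)
V  = X @ W_V             # (5, d_k)

# Step 2: score q1 against every key, scale, and softmax-normalize.
scores = (K @ q1) / np.sqrt(d_k)               # (5,)
alpha  = np.exp(scores) / np.exp(scores).sum()

# Step 3: z1 is the attention-weighted sum of the value vectors.
z1 = alpha @ V
print(z1)
```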
Wait, can we vectorize all these computations and compute the outputs (\(z_1, z_2, \cdots, z_T\)) in one go?
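Yes: stack the inputs as rows of \(X\); then queries, keys, and values each come from one matrix product, and all outputs from one more, \(Z = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d_k}\big)V\). Continuing the sketch above (reusing `X`, `W_Q`, `K`, `V`, and `z1`):

```python
# Vectorized self-attention: all T outputs in one shot.
Q = X @ W_Q                                           # (T, d_k) queries
A = Q @ K.T / np.sqrt(d_k)                            # (T, T) score matrix
A = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # row-wise softmax
Z = A @ V                                             # (T, d_k); row i is z_i
print(np.allclose(Z[0], z1))                          # True: matches the step-by-step z1
```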
Link to the colab: Tensor2Tensor
[Figure: Multi-head attention. The input passes through \(h\) parallel scaled dot-product attention blocks; their outputs are concatenated into a \(T \times 512\) matrix and fed through a final linear layer.]
The input is projected into \(h=8\) different representation subspaces.
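A minimal sketch of that projection scheme, using the original paper's dimensions (\(d_{\text{model}} = 512\), \(h = 8\), so \(d_k = 512/8 = 64\) per head); all weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(A):
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    return A / A.sum(axis=-1, keepdims=True)

T, d_model, h = 5, 512, 8
d_k = d_model // h                       # 64 dims per head
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))        # one row per input token

heads = []
for _ in range(h):
    # Each head projects the input into its own d_k-dim subspace.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # (T, d_k)

# Concatenate the h head outputs (T x 512), then mix with a final linear layer.
W_O = rng.normal(size=(d_model, d_model))
Z = np.concatenate(heads, axis=1) @ W_O               # (T, d_model)
print(Z.shape)                                        # (5, 512)
```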
[Figure: One encoder layer applied to "I enjoyed the movie transformers": self-attention followed by a feed-forward network.]