Figure: an encoder-only transformer block. Multi-Head Attention and a feed-forward NN, each followed by Add & Norm; input \(x_1,<mask>,\cdots,x_{T}\), output \(P(<mask>)\) at the masked position.
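A minimal sketch of one such encoder layer in PyTorch; layer names and sizes are illustrative, not taken from the original figure:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention and a feed-forward
    network, each wrapped in a residual connection plus layer norm
    (the "Add & Norm" boxes in the figure)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)       # Add & Norm
        x = self.norm2(x + self.ffn(x))    # Add & Norm
        return x

x = torch.randn(2, 7, 512)                # (batch, tokens, d_model)
y = EncoderLayer()(x)                      # same shape out
```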
Figure: a decoder-only transformer block. Multi-Head masked attention and a feed-forward NN, each followed by Add & Norm; input \(x_1,x_2,\cdots,x_{i-1}\), output \(P(x_i)\), the next-token distribution.
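The only structural difference from the encoder block is the mask: position \(i\) may attend to positions \(\le i\) only. A sketch of building such a causal mask (sizes illustrative):

```python
import torch

T = 5  # sequence length (illustrative)
# True above the diagonal marks positions that must NOT be attended to
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
# Applied to raw attention scores before the softmax:
scores = torch.randn(T, T)                        # stand-in for Q K^T / sqrt(d)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)           # row i only covers x_1..x_i
```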
Figure: an encoder-decoder transformer. The encoder stacks Multi-Head Attention and a feed-forward NN, each with Add & Norm, over \(x_1,<mask>,\cdots,x_{T}\). The decoder stacks Multi-Head masked attention, Multi-Head cross-attention over the encoder output, and a feed-forward NN, each with Add & Norm; decoding starts from \(<go>\) and produces \(P(<mask>)\).
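In the decoder half of this figure, cross-attention takes its queries from the decoder states and its keys and values from the encoder output. A minimal sketch, with all names and sizes illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 10, d_model)   # encoder output over x_1..x_T
dec_x   = torch.randn(1, 4, d_model)    # decoder states, starting from <go>

# Queries come from the decoder; keys/values from the encoder output.
ctx, _ = cross_attn(query=dec_x, key=enc_out, value=enc_out)
```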
Figure: candidate next words (play, watch, go, read) ranked by the language model. Example of context dependence: in "Nothing has shipped its new OS to Nothing Phone 2", "Nothing" names a company.
Figure: masked language modelling with the encoder-only transformer. Some input positions are replaced by [mask] (e.g. \(x_1,\cdots,[mask],\cdots,x_{T}\)), and at each masked position \(i\) the model outputs \(P(y_i = x_i)\), the probability of the original token (e.g. \(P(y_2 = x_2)\) when position 2 is masked).
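A minimal sketch of this objective: the cross-entropy loss is computed only at masked positions, where the model must recover the original token (PyTorch, illustrative sizes and token ids):

```python
import torch
import torch.nn.functional as F

vocab, T = 30522, 8
logits = torch.randn(1, T, vocab)   # model output P(y_i) at each position
labels = torch.full((1, T), -100)   # -100 = ignored by cross_entropy
labels[0, 3] = 2054                 # only masked positions keep the true x_i
loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1),
                       ignore_index=-100)
```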
Figure: self-attention over the sentence "I enjoyed the movie transformers" with a 0/1 attention-mask matrix; entry \((i, j) = 1\) means token \(i\) may attend to token \(j\).
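A sketch of the computation this figure depicts: single-head self-attention with a 0/1 mask, zeros becoming \(-\infty\) in the score matrix so those tokens receive no attention. The projection matrices for Q, K, V are omitted for brevity; all values are illustrative:

```python
import math
import torch

def self_attention(x, mask):
    """Single-head self-attention with a 0/1 mask like the figure's;
    Q/K/V weight matrices omitted for brevity (illustrative sketch)."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d)   # Q K^T / sqrt(d)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x          # weighted sum of values

x = torch.randn(5, 16)      # 5 tokens: "I enjoyed the movie transformers"
mask = torch.ones(5, 5)     # fully visible mask, as in the encoder figure
out = self_attention(x, mask)
```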
Figure: the same self-attention applied to the masked input "[mask] enjoyed the [mask] transformers"; the outputs at the masked positions are used to recover the original tokens.
Figure (repeated with example numbers): the self-attention computation on the masked input, showing a binary mask matrix and illustrative real-valued score entries (0.3, 0.2, 0.1, 0.5, ...) feeding the prediction at each [mask] position.
Figure: one encoder layer, Self-Attention followed by a Feed Forward Network, applied to "[mask] enjoyed the [mask] transformers".
Figure: next sentence prediction. A stack of encoder layers (attention, FFN, normalization, residual connection) processes the pair "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing". Input: Sentence A and Sentence B; label: IsNext.
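A sketch of the next-sentence-prediction head: the final hidden state at [CLS] goes through a binary classifier (IsNext vs. NotNext). Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model = 768
nsp_head = nn.Linear(d_model, 2)      # 2 classes: IsNext, NotNext

hidden = torch.randn(1, 12, d_model)  # encoder output for the sentence pair
cls_state = hidden[:, 0]              # position 0 is [CLS]
logits = nsp_head(cls_state)
is_next_prob = torch.softmax(logits, dim=-1)[:, 0]
```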
Figure: BERT's input representation. Each token of "[CLS] I enjoyed the movie transformers [SEP] The visuals were amazing" enters the encoder (Self-Attention + Feed Forward Network) as the sum of its Token Embedding, Segment Embedding, and Position Embedding.
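A sketch of how the three embeddings combine: each token's input vector is the elementwise sum of its token, segment, and position embeddings. Sizes follow BERT-base; the token ids are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab, max_len = 768, 30522, 512
tok_emb = nn.Embedding(vocab, d_model)     # Token Embeddings
seg_emb = nn.Embedding(2, d_model)         # Segment Embeddings (A=0, B=1)
pos_emb = nn.Embedding(max_len, d_model)   # Position Embeddings

ids  = torch.tensor([[101, 1045, 5632, 1996, 3185, 102]])  # illustrative ids
segs = torch.zeros_like(ids)                               # all sentence A
pos  = torch.arange(ids.size(1)).unsqueeze(0)

x = tok_emb(ids) + seg_emb(segs) + pos_emb(pos)  # input to encoder layer 1
```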
Figure: the masked pair "[CLS] [mask] enjoyed the [mask] transformers [SEP] The [mask] were amazing" passing through a stack of Encoder Layers.
Figure: BERT pretraining. The input "[CLS] Masked Sentence A [SEP] Masked Sentence B [PAD]" is fed to a bidirectional transformer encoder, a stack of 12 layers (Encoder Layer-1 through Encoder Layer-12); the pretraining corpora are BooksCorpus (800M words) and English Wikipedia (2,500M words).
*Actual vocabulary size is 30,522; parameters for layer normalization are excluded.
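The footnote's vocabulary size makes the embedding parameter count easy to check; with BERT-base sizes (hidden 768, maximum length 512, 2 segments) the embedding tables alone account for roughly 23.8M parameters:

```python
vocab, d_model, max_len, segments = 30522, 768, 512, 2
token_params    = vocab * d_model        # 23,440,896
position_params = max_len * d_model      #    393,216
segment_params  = segments * d_model     #      1,536
print(token_params + position_params + segment_params)  # 23,835,648
```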
Figure: the general input format "[CLS] Sentence A [SEP] Sentence B [PAD]" to the bidirectional transformer encoders (the masked pretraining variant and corpus sizes are as above).
Paragraph: What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface.
Question: What is unique about the mission?
Answer: role of artificial intelligence (AI) in guiding the spacecraft
Starting token: 9
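A sketch of extractive span prediction: two linear heads score every token as a possible answer start or end, and the answer is the span between the highest-scoring positions (the example's starting token index 9 would come out of the start argmax). Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, T = 768, 40
span_head = nn.Linear(d_model, 2)     # start and end logits per token

hidden = torch.randn(1, T, d_model)   # encoder output over [CLS] Q [SEP] P
start_logits, end_logits = span_head(hidden).split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=-1)   # e.g. token 9 ("role")
end   = end_logits.squeeze(-1).argmax(dim=-1)
# answer = tokens[start : end + 1]
# e.g. "role of artificial intelligence (AI) in guiding the spacecraft"
```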
Figure: extractive question answering with BERT. The question and paragraph are packed as "[CLS] what is unique about the mission [SEP] What sets this mission apart is the pivotal role of artificial intelligence (AI) in guiding the spacecraft during its critical descent to the moon's surface." and fed to the bidirectional transformer encoder; the model marks a start and an end token in the paragraph, and the span between them, "role of artificial intelligence (AI) in guiding the spacecraft", is returned as the answer.
Return an empty string (implying that the answer is not in the paragraph).
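For unanswerable questions, a common convention (used for SQuAD 2.0-style data) is that both heads pointing at [CLS] (index 0) encodes "no answer": the null score is compared against the best non-null span. A hedged, self-contained sketch of that check:

```python
import torch

start_logits = torch.randn(40)   # per-token start scores (illustrative)
end_logits   = torch.randn(40)

best = (start_logits[1:].argmax() + 1, end_logits[1:].argmax() + 1)
null_score = start_logits[0] + end_logits[0]         # span = [CLS] only
best_score = start_logits[best[0]] + end_logits[best[1]]
answer_is_empty = bool(null_score > best_score)      # -> return ""
```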
Figure: a bidirectional transformer encoder over a sequence delimited by [GO] and [EOS], with several positions replaced by [mask]; the encoder predicts the original tokens (e.g. \(x_3\) and \(x_7\)) at the masked positions.