[Figure: the example sentence "somewhere something incredible is waiting to be known" with position indices 0 to 7, together with the index ranges 0 to 3 and -2 to 1. Panel labels: (non-parametric), (parametric), (parametric), (non-parametric).]
[Figure: the example sentence at absolute positions 0 to 7, and the same sentence shifted to positions 100 to 107.]
[Figure: APE vs. RPE for the example sentence. APE: absolute positions 0 to 7. RPE: positions relative to the token "is": -3, -2, -1, 0, 1, 2, 3, 4.]
[Figure: relative position matrix for the example sentence. Rows and columns are the eight tokens; each row lists the offsets of all tokens from one reference token, running from 0 to 7 for "somewhere" down to -7 to 0 for "known". The offsets -7 to 7 index the rows of an embedding matrix.]
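As a minimal sketch of how such a matrix of offsets can index an embedding matrix, the snippet below builds the offsets for the eight-token example and looks up one vector per (row, column) pair. The sign convention `rel[i, j] = j - i`, the width `d`, and the random table `E` are assumptions made for illustration, not details taken from the figure.

```python
import numpy as np

tokens = "somewhere something incredible is waiting to be known".split()
T = len(tokens)                      # 8 tokens

pos = np.arange(T)
# Assumed convention: rel[i, j] = j - i, so values lie in [-(T-1), T-1].
rel = pos[None, :] - pos[:, None]

d = 4                                # illustrative embedding width
rng = np.random.default_rng(0)
# Hypothetical embedding matrix: one row per offset -7, ..., 0, ..., 7.
E = rng.normal(size=(2 * T - 1, d))

# Shift offsets by T-1 so that offset -7 maps to row 0 of E.
R = E[rel + (T - 1)]                 # shape (T, T, d)

print(rel[0].tolist())               # offsets seen from "somewhere"
print(rel[3].tolist())               # offsets seen from "is"
print(R.shape)
```

Pairs with the same offset share the same vector (for instance `R[0, 1]` and `R[5, 6]`), which is what makes the encoding independent of absolute position.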
score for the given query and key without any position encoding
Adding a constant that comes from position encoding
Correlation between a word and a position
Visualization of the above equation in BERT [paper]. From left: Correlation between word-to-word, word-to-position, position-to-word, position-to-position
Uniformly distributed (that is, no correlation)
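The four correlations named in the caption can be checked numerically. The sketch below assumes that position embeddings are added to word embeddings before the query and key projections, so that the unnormalised score expands into word-to-word, word-to-position, position-to-word, and position-to-position terms. The arrays `X`, `P`, `W_Q`, `W_K` are random stand-ins, not values from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                     # illustrative sequence length and width
X = rng.normal(size=(T, d))      # stand-in word embeddings
P = rng.normal(size=(T, d))      # stand-in absolute position embeddings
W_Q = rng.normal(size=(d, d))    # stand-in query projection
W_K = rng.normal(size=(d, d))    # stand-in key projection


def score(A, B):
    """Unnormalised attention logits (A W_Q)(B W_K)^T."""
    return (A @ W_Q) @ (B @ W_K).T


full = score(X + P, X + P)       # score with position encoding added
word_word = score(X, X)          # score without any position encoding
word_pos = score(X, P)
pos_word = score(P, X)
pos_pos = score(P, P)

# Bilinearity: the full score is exactly the sum of the four terms.
assert np.allclose(full, word_word + word_pos + pos_word + pos_pos)
```

With random inputs the cross terms are just noise; the point of the sketch is only that the decomposition is exact, so each term can be visualized separately.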
[Figure: relative positions clipped to \([-k,k]\) for \(T=8,k=4\). Relative to "somewhere": 0, 1, 2, 3, 4, 4, 4, 4. Relative to "is": -3, -2, -1, 0, 1, 2, 3, 4.]
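A minimal sketch of the clipping for \(T=8,k=4\), assuming offsets are computed as column index minus row index and then clamped to \([-k,k]\). The function name `clipped_relative_positions` is hypothetical.

```python
import numpy as np


def clipped_relative_positions(T, k):
    """Return a (T, T) matrix of offsets j - i clamped to [-k, k]."""
    pos = np.arange(T)
    rel = pos[None, :] - pos[:, None]   # assumed convention: j - i
    return np.clip(rel, -k, k)


clipped = clipped_relative_positions(T=8, k=4)
print(clipped[0].tolist())   # [0, 1, 2, 3, 4, 4, 4, 4]
print(clipped[3].tolist())   # [-3, -2, -1, 0, 1, 2, 3, 4]
print(clipped[7].tolist())   # [-4, -4, -4, -4, -3, -2, -1, 0]
```

After clipping only \(2k+1=9\) distinct offsets remain, so the embedding matrix needs 9 rows regardless of \(T\).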
[Figure: the relative position matrix for the example sentence with offsets clipped to the range -4 to 4.]
Source: [paper]
the book . . . : \(m=0,n=1\)
read the book . . . : \(m=1,n=2\)
you read the book . . . : \(m=2,n=3\)
you must read the book . . . : \(m=3,n=4\)
he says you must read the book . . . : \(m=5,n=6\)
APE of \(m,n\)
RPE of \(m-n\)
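The example sentences can be checked directly. This sketch assumes that \(m\) and \(n\) are the 0-based positions of "the" and "book"; the trailing ". . ." of each sentence is dropped because it does not affect those positions.

```python
sentences = [
    "the book",
    "read the book",
    "you read the book",
    "you must read the book",
    "he says you must read the book",
]

pairs = []
for s in sentences:
    words = s.split()
    # Assumed reading: m is the position of "the", n the position of "book".
    m, n = words.index("the"), words.index("book")
    pairs.append((m, n))
    print(f"{s!r}: m={m}, n={n}, m-n={m - n}")

# The absolute pair (m, n) changes with the prefix, the difference does not.
offsets = [m - n for m, n in pairs]
```

The absolute positions grow as words are prepended, while \(m-n\) stays at -1 in every sentence.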
Again, let's take a 2D example.
Let's assume \(W_Q=W_K=I\)
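One possible 2D illustration, under a loudly stated assumption: the position is injected by rotating the 2D query and key by an angle proportional to their position. The source does not spell out the construction, so the rotation, the step `theta`, and the vectors `q` and `k` are all hypothetical. With \(W_Q=W_K=I\) the query and key are just the input vectors themselves.

```python
import numpy as np


def rotate(x, pos, theta=0.1):
    """Rotate a 2D vector by the angle pos * theta (assumed encoding)."""
    angle = pos * theta
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return R @ x


# With W_Q = W_K = I, query and key are the raw 2D inputs.
q = np.array([1.0, 2.0])     # hypothetical vector at position m
k = np.array([0.5, -1.0])    # hypothetical vector at position n


def score(m, n):
    """Dot product of the rotated query (position m) and key (position n)."""
    return rotate(q, m) @ rotate(k, n)


# Same difference m - n gives the same score, whatever the absolute positions.
assert np.isclose(score(0, 1), score(5, 6))
assert np.isclose(score(3, 4), score(100, 101))
# A different difference gives a different score.
assert not np.isclose(score(0, 1), score(0, 2))
```

The reason is that the transpose of a rotation by \(m\theta\) composed with a rotation by \(n\theta\) is a rotation by \((n-m)\theta\), so only the difference of the positions survives in the dot product.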
Validation perplexity
Source: [paper]