Lecture 10: Learning Vectorial Representations Of Words
Mitesh M. Khapra
AI4Bharat, Department of Computer Science and Engineering, IIT Madras
CS6910: Fundamentals of Deep Learning
References/Acknowledgments
‘word2vec Parameter Learning Explained’ by Xin Rong
Sebastian Ruder’s blogs on word embeddings
‘word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method’ by Yoav Goldberg and Omer Levy
Module 10.1: One-hot representations of words
Let us start with a very simple motivation for why we are interested in vectorial representations of words
Suppose we are given an input stream of words (sentence, document, etc.) and we are interested in learning some function of it (say, \( \hat{y} = sentiments(words) \))
Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (\( \hat{y} = f(x) \))
We first need a way of converting the input stream (or each word in the stream) to a vector \(x\) (a mathematical quantity)
[Figure: the review "This is by far AAMIR KHAN’s best one. Finest casting and terrific acting by all." is converted to a vector \([5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7]\), which is fed to the model]
Given a corpus, consider the set \(V\) of all unique words across all input streams (\(i.e.\), all sentences or documents)
\(V\) is called the vocabulary of the corpus
We need a representation for every word in \(V\)
One very simple way of doing this is to use one-hot vectors of size \(|V|\)
The representation of the \(i\)-th word will have a \(1\) in the \(i\)-th position and a \(0\) in the remaining \(|V| - 1\) positions
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time
\(V\) = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]
machine: \([0, 1, 0, \ldots, 0, 0, 0]\)
Problems:
\(|V|\) tends to be very large (for example, \(50K\) for \(PTB\), \(13M\) for Google \(1T\) corpus)
These representations do not capture any notion of similarity
Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck
However, with \(1\)-hot representations, the Euclidean distance between any two distinct words in the vocabulary is \(\sqrt 2\)
And the cosine similarity between any two distinct words in the vocabulary is \(0\)
cat: \([0, 0, 0, 0, 0, 1, 0]\)
dog: \([0, 1, 0, 0, 0, 0, 0]\)
truck: \([0, 0, 0, 1, 0, 0, 0]\)
\(euclid\_dist\)(cat, dog) = \(\sqrt 2\)
\(euclid\_dist\)(dog, truck) = \(\sqrt 2\)
\(cosine\_sim\)(cat, dog) = \(0\)
\(cosine\_sim\)(dog, truck) = \(0\)
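The distances quoted for one-hot vectors can be checked with a few lines of Python. This is a minimal sketch: the helper names `one_hot`, `euclid_dist` and `cosine_sim` are illustrative, and the positions of the 1s simply follow the 7-dimensional illustration used for cat, dog and truck.

```python
import math

def one_hot(index, size):
    """Return a list with a 1 at `index` and 0 everywhere else."""
    vec = [0] * size
    vec[index] = 1
    return vec

def euclid_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Index choices mirror the illustration; any three distinct indices behave the same.
cat, dog, truck = one_hot(5, 7), one_hot(1, 7), one_hot(3, 7)

print(euclid_dist(cat, dog), euclid_dist(dog, truck))  # both equal sqrt(2)
print(cosine_sim(cat, dog), cosine_sim(dog, truck))    # both equal 0
```

Whatever pair of distinct words is chosen, the two numbers never change, which is exactly why one-hot vectors carry no notion of similarity.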
Module 10.2: Distributed Representations of words
“You shall know a word by the company it keeps” - Firth, J. R. 1957:11
Distributional similarity based representations
This leads us to the idea of a co-occurrence matrix
A bank is a financial institution that accepts deposits from the public and creates credit.
The idea is to use the accompanying words (financial, deposits, credit) to represent bank
A co-occurrence matrix is a terms \(\times\) terms matrix which captures the number of times a term appears in the context of another term
The context is defined as a window of \(k\) words around the terms
Let us build a co-occurrence matrix for this toy corpus with \(k = 4\)
This is also known as a word \(\times\) context matrix
You could choose the set of words and contexts to be the same or different
Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context)
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time
| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 0 | 1 | 0 | 1 | ... | 0 |
| machine | 1 | 0 | 0 | 1 | ... | 0 |
| system | 0 | 0 | 0 | 1 | ... | 2 |
| for | 1 | 1 | 1 | 0 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0 | 0 | 2 | 0 | ... | 0 |

Co-occurrence Matrix
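The counting procedure can be sketched in Python as follows. The assumptions here are mine, not the lecture's: the window covers \(k\) positions on each side of a term, words are lowercased, windows do not cross sentence boundaries, and the function name `cooccurrence_counts` is illustrative.

```python
from collections import defaultdict

corpus = [
    "Human machine interface for computer applications",
    "User opinion of computer system response time",
    "User interface management system",
    "System engineering for improved response time",
]

def cooccurrence_counts(sentences, k):
    """Count how often each (word, context) pair occurs within k positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
            for j in range(lo, hi):
                if j != i:  # a term is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(corpus, k=4)
print(counts[("human", "machine")], counts[("system", "user")])
```

Because the window is symmetric, `counts[(a, b)]` always equals `counts[(b, a)]`, so the resulting matrix is symmetric when words and contexts are drawn from the same set.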
Some (fixable) problems
Stop words (a, the, for, etc.) are very frequent → these counts will be very high
Solution \(1\): Ignore very frequent words
| | human | machine | system | ... | user |
|---|---|---|---|---|---|
| human | 0 | 1 | 0 | ... | 0 |
| machine | 1 | 0 | 0 | ... | 0 |
| system | 0 | 0 | 0 | ... | 2 |
| ... | ... | ... | ... | ... | ... |
| user | 0 | 0 | 2 | ... | 0 |
Solution \(2\): Use a threshold \(t\) (say, \(t = 100\))
\(X_{ij} = min(count(w_i , c_j ), t),\)
where \(w\) is word and \(c\) is context.
| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 0 | 1 | 0 | x | ... | 0 |
| machine | 1 | 0 | 0 | x | ... | 0 |
| system | 0 | 0 | 0 | x | ... | 2 |
| for | x | x | x | x | ... | x |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0 | 0 | 2 | x | ... | 0 |
Solution \(3\): Instead of \(count(w, c)\) use \(PMI(w, c)\)
\(PMI(w, c) = \log \frac{p(c|w)}{p(c)} = \log \frac{count(w, c) \cdot N}{count(c) \cdot count(w)}\)
where \(N\) is the total number of words
If \(count(w, c) = 0\), \(PMI(w, c) = -\infty\)
Instead use,
\(PMI_0(w, c) = \begin{cases} PMI(w, c) & \text{if } count(w, c) > 0 \\ 0 & \text{otherwise} \end{cases}\)
or
\(PPMI(w, c) = \begin{cases} PMI(w, c) & \text{if } PMI(w, c) > 0 \\ 0 & \text{otherwise} \end{cases}\)
| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 0 | 2.944 | 0 | 2.25 | ... | 0 |
| machine | 2.944 | 0 | 0 | 2.25 | ... | 0 |
| system | 0 | 0 | 0 | 1.15 | ... | 1.84 |
| for | 2.25 | 2.25 | 1.15 | 0 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0 | 0 | 1.84 | 0 | ... | 0 |
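A minimal sketch of turning a count matrix into PPMI values. Several choices here are assumptions rather than the lecture's exact recipe: \(count(w)\) and \(count(c)\) are taken as row and column sums of the count matrix, \(N\) as the sum of all its entries, and the logarithm is natural. The toy counts are made up, so the output is not meant to reproduce the table above.

```python
import math

def ppmi_matrix(counts):
    """counts[i][j] = count(w_i, c_j); returns the matrix of PPMI values."""
    total = sum(sum(row) for row in counts)              # assumed N
    word_totals = [sum(row) for row in counts]           # assumed count(w)
    context_totals = [sum(col) for col in zip(*counts)]  # assumed count(c)
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c == 0:
                out_row.append(0.0)  # avoid PMI = -infinity
            else:
                pmi = math.log(c * total / (word_totals[i] * context_totals[j]))
                out_row.append(max(pmi, 0.0))  # clip negative PMI to 0
        result.append(out_row)
    return result

toy_counts = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
toy_ppmi = ppmi_matrix(toy_counts)
for row in toy_ppmi:
    print([round(value, 3) for value in row])
```

Dropping the `max(pmi, 0.0)` clip while keeping the zero-count branch gives \(PMI_0\) instead of \(PPMI\).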
Some (severe) problems
Very high dimensional \((|V |)\)
Very sparse
Grows with the size of the vocabulary
Solution: Use dimensionality reduction (SVD)
Module 10.3: SVD for learning word representations
Singular Value Decomposition gives a rank \(k\) approximation of the original matrix
\(X_{PPMI}\) (simplifying notation to \(X\)) is the co-occurrence matrix with PPMI values
SVD gives the best rank-\(k\) approximation of the original data \((X)\)
Discovers latent semantics in the corpus (let us examine this with the help of an example)
Notice that the product \(U \Sigma V^T\) can be written as a sum of \(k\) rank-\(1\) matrices
\(U \Sigma V^T = \sigma_1 u_1 {v_1}^T + \sigma_2 u_2 {v_2}^T + \ldots + \sigma_k u_k {v_k}^T\)
Each \(\sigma_i u_i {v_i}^T \in \R^{m \times n}\) because it is a product of an \(m \times 1\) vector with a \(1 \times n\) vector
If we truncate the sum at \(\sigma_1 u_1 {v_1}^T\) then we get the best rank-\(1\) approximation of \(X\) (by the SVD theorem! But what does this mean? We will see on the next slide)
If we truncate the sum at \(\sigma_1 u_1 {v_1}^T + \sigma_2 u_2 {v_2}^T\) then we get the best rank-\(2\) approximation of \(X\) and so on
What do we mean by approximation here?
Notice that \(X\) has \(m \times n\) entries
When we use the rank-\(1\) approximation, we are using only \(n + m + 1\) entries to reconstruct \(X\) \([u \in \R^m, v \in \R^n, \sigma \in \R^1]\)
But the SVD theorem tells us that \(u_1, v_1\) and \(\sigma_1\) store the most information in \(X\) (akin to the principal components in \(X\))
Each subsequent term \((\sigma_2 u_2 {v_2}^T, \sigma_3 u_3 {v_3}^T, \ldots)\) stores less and less important information
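The effect of truncating the sum of rank-\(1\) terms can be observed numerically. This is a sketch using NumPy's `np.linalg.svd`; the matrix `X` is an arbitrary random stand-in rather than the lecture's PPMI matrix, and `rank_k_approximation` is an illustrative helper name.

```python
import numpy as np

def rank_k_approximation(X, k):
    """Sum of the first k terms sigma_i * u_i * v_i^T of the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

rng = np.random.default_rng(0)
X = rng.random((6, 5))  # toy matrix, assumed only for illustration

# Frobenius-norm reconstruction error for k = 1, ..., 5
errors = [np.linalg.norm(X - rank_k_approximation(X, k)) for k in range(1, 6)]
print(np.round(errors, 4))
```

Each additional term lowers the reconstruction error, and once all terms are kept the error vanishes up to floating-point precision.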
As an analogy consider the case when we are using \(8\) bits to represent colors
The representation of very light, light, dark and very dark green would look different
But now what if we were asked to compress this into \(4\) bits? (akin to compressing \(m \times n\) values into \(m + n + 1\) values on the previous slide)
| shade | bits shared by all shades (green) | bits specific to the shade |
|---|---|---|
| very light | 1 0 1 1 | 0 0 0 1 |
| light | 1 0 1 1 | 0 0 1 0 |
| dark | 1 0 1 1 | 0 1 0 0 |
| very dark | 1 0 1 1 | 1 0 0 0 |
We will retain the most important \(4\) bits, and the previously (slightly) latent similarity between the colors becomes very obvious
Something similar is guaranteed by SVD (retain the most important information and discover the latent similarities between words)
Co-occurrence Matrix \((X)\)

| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 0 | 2.944 | 0 | 2.25 | ... | 0 |
| machine | 2.944 | 0 | 0 | 2.25 | ... | 0 |
| system | 0 | 0 | 0 | 1.15 | ... | 1.84 |
| for | 2.25 | 2.25 | 1.15 | 0 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0 | 0 | 1.84 | 0 | ... | 0 |

Low rank \(\hat{X}\)

| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 2.01 | 2.01 | 0.23 | 2.14 | ... | 0.43 |
| machine | 2.01 | 2.01 | 0.23 | 2.14 | ... | 0.43 |
| system | 0.23 | 0.23 | 1.17 | 0.96 | ... | 1.29 |
| for | 2.14 | 2.14 | 0.96 | 1.87 | ... | -0.13 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0.43 | 0.43 | 1.29 | -0.13 | ... | 1.71 |
Notice that after low rank reconstruction with SVD, the latent co-occurrence between \(\{system, machine\}\) and \(\{human, user\}\) has become visible
Recall that earlier each row of the original matrix \(X\) served as the representation of a word
Then \(XX^T\) is a matrix whose \(ij\)-th entry is the dot product between the representation of word \(i\) \((X[i, :])\) and word \(j\) \((X[j, :])\)
The \(ij\)-th entry of \(XX^T\) thus (roughly) captures the cosine similarity between \(word_i\) and \(word_j\) (it equals the cosine similarity exactly only when the rows are normalized to unit length)
\(XX^T =\)

| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 32.5 | 23.9 | 7.78 | 20.25 | ... | 7.01 |
| machine | 23.9 | 32.5 | 7.78 | 20.25 | ... | 7.01 |
| system | 7.78 | 7.78 | 0 | 17.65 | ... | 21.84 |
| for | 20.25 | 20.25 | 17.65 | 36.3 | ... | 11.8 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 7.01 | 7.01 | 21.84 | 11.8 | ... | 28.3 |

\(cosine\_sim(human, user) = 0.21\)
Once we do an SVD, what is a good choice for the representation of \(word_i\)?
Taking the \(i\)-th row of the reconstructed matrix does not make sense because it is still high dimensional
But we saw that the reconstructed matrix \(\hat{X} = U \Sigma V^T\) discovers latent semantics and its word representations are more meaningful
Wishlist: We would want representations of words \((i, j)\) to be of smaller dimensions but still have the same similarity (dot product) as the corresponding rows of \(\hat{X}\)
\(\hat{X} =\)

| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 2.01 | 2.01 | 0.23 | 2.14 | ... | 0.43 |
| machine | 2.01 | 2.01 | 0.23 | 2.14 | ... | 0.43 |
| system | 0.23 | 0.23 | 1.17 | 0.96 | ... | 1.29 |
| for | 2.14 | 2.14 | 0.96 | 1.87 | ... | -0.13 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0.43 | 0.43 | 1.29 | -0.13 | ... | 1.71 |

\(\hat{X}\hat{X}^T =\)

| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 25.4 | 25.4 | 7.6 | 21.9 | ... | 6.84 |
| machine | 25.4 | 25.4 | 7.6 | 21.9 | ... | 6.84 |
| system | 7.6 | 7.6 | 24.8 | 0.96 | ... | 20.6 |
| for | 21.9 | 21.9 | 18.03 | 24.6 | ... | 15.32 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 6.84 | 6.84 | 20.6 | 15.32 | ... | 17.11 |

\(cosine\_sim(human, user) = 0.33\)
Notice that the dot product between the rows of the matrix \(W_{word} = U\Sigma\) is the same as the dot product between the rows of \(\hat{X}\)
\(\hat{X}\hat{X}^T = (U \Sigma V^T)(U \Sigma V^T)^T\)
\(= (U \Sigma V^T)(V \Sigma^T U^T)\)
\(= U \Sigma \Sigma^T U^T\) \((\because V^T V = I)\)
\(= U \Sigma (U \Sigma)^T = W_{word} W_{word}^T\)
Conventionally, \(W_{word} = U \Sigma \in \R^{m \times k}\) is taken as the representation of the \(m\) words in the vocabulary and \(W_{context} = V\) is taken as the representation of the context words
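The identity \(\hat{X}\hat{X}^T = W_{word}W_{word}^T\) can be verified numerically. This is a sketch on a random stand-in matrix (not the lecture's data), with \(k = 2\) chosen arbitrarily and variable names such as `U_k` introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 6))  # toy stand-in for the PPMI matrix
k = 2                   # assumed number of retained dimensions

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X_hat = U_k @ S_k @ Vt_k   # rank-k reconstruction, still 6 x 6
W_word = U_k @ S_k         # 6 x k word representations
W_context = Vt_k.T         # 6 x k context representations

# Row-wise dot products agree although W_word has only k columns.
print(np.allclose(X_hat @ X_hat.T, W_word @ W_word.T))
```

The check succeeds because the retained columns of \(V\) are orthonormal, so \(V^T V = I\) also holds for the truncated matrix.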
Module 10.4: Continuous bag of words model
The methods that we have seen so far are called count based models because they use the co-occurrence counts of words
We will now see methods which directly learn word representations (these are called (direct) prediction based models)
The story ahead ...
- Continuous bag of words model
- Skip gram model with negative sampling (the famous word2vec)
- GloVe word embeddings
- Evaluating word embeddings
- Good old SVD does just fine!!
Consider this task: predict the \(n\)-th word given the previous \(n-1\) words
Example: he sat on a chair
Training data: All \(n\)-word windows in your corpus
Training data for this task is easily available (take all \(n\)-word windows from the whole of Wikipedia)
For ease of illustration, we will first focus on the case when \(n = 2\) (\(i.e.,\) predict the second word based on the first word)
Some sample 4 word windows from a corpus
Sometime in the 21st century, Joseph Cooper, a widowed former engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. It is post-truth society (Cooper is reprimanded for telling Murphy that the Apollo missions did indeed happen) and a series of crop blights threatens humanity’s survival. Murphy believes her bedroom is haunted by a poltergeist. When a pattern is created out of dust on the floor, Cooper realizes that gravity is behind its formation, not a ”ghost”. He interprets the pattern as a set of geographic coordinates formed into binary code. Cooper and Murphy follow the coordinates to a secret NASA facility, where they are met by Cooper’s former professor, Dr. Brand.
We will now try to answer these two questions:
- How do you model this task?
- What is the connection between this task and learning word representations?
We will model this problem using a feedforward neural network
Input: One-hot representation of the context word
Output: There are \(|V |\) words (classes) possible and we want to predict a probability distribution over these \(|V |\) classes (multi-class classification problem)
Parameters: \(\textbf{W}_{context} \in \R^{k \times |V|}\) and \(\textbf{W}_{word} \in \R^{|V| \times k}\)
(we are assuming that the set of words and context words is the same: each of size \(|V|\))
[Figure: feedforward network for predicting the next word. The input is the one-hot vector \(x \in \R^{|V|}\) of the context word sat; \(W_{context} \in \R^{k \times |V|}\) maps it to the hidden layer \(h \in \R^k\); \(W_{word} \in \R^{|V| \times k}\) maps \(h\) to the outputs \(P(he|sat)\), \(P(chair|sat)\), \(P(man|sat)\), \(P(on|sat)\), ...]
What is the product \(\textbf{W}_{context} \ \mathbf{x}\) given that \(\mathbf{x}\) is a one-hot vector?
It is simply the \(i\)-th column of \(\textbf{W}_{context}\)
So when the \(i^{th}\) word is present, the \(i^{th}\) element in the one-hot vector is ON and the \(i^{th}\) column of \(\textbf{W}_{context}\) gets selected
In other words, there is a one-to-one correspondence between the words and the columns of \(\textbf{W}_{context}\)
More specifically, we can treat the \(i\)-th column of \(\textbf{W}_{context}\) as the representation of context \(i\)
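The column-selection effect is easy to confirm on a small example. In this sketch the entries of `W_context`, the sizes \(k = 3\) and \(|V| = 4\), and the helper `matvec` are all made up for illustration.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Toy W_context with k = 3 rows and |V| = 4 columns (values are arbitrary).
W_context = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

i = 1                 # index of the word that is present
x = [0, 0, 0, 0]
x[i] = 1              # one-hot vector for word i

h = matvec(W_context, x)
column_i = [row[i] for row in W_context]
print(h, column_i)    # the two lists are identical
```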
How do we obtain \(P(on|sat)\)? For this multiclass classification problem what is an appropriate output function? (softmax)
\(P(on|sat) = \frac{e^{(W_{word} h)[i]}}{\sum_{l} e^{(W_{word} h)[l]}}\)
Therefore, \(P(on|sat)\) is proportional to the exponential of the dot product between the \(j^{th}\) column of \(\textbf{W}_{context}\) and the \(i^{th}\) row of \(\textbf{W}_{word}\) (here sat is the \(j^{th}\) word and on is the \(i^{th}\) word in the vocabulary)
\(P(word = i|sat)\) thus depends on the \(i^{th}\) row of \(\textbf{W}_{word}\)
We thus treat the \(i\)-th row of \(\textbf{W}_{word}\) as the representation of word \(i\)
Hope you see an analogy with SVD! (there we had a different way of learning \(\textbf{W}_{context}\) and \(\textbf{W}_{word}\) but we saw that the \(i^{th}\) row of \(\textbf{W}_{word}\) corresponded to the representation of the \(i^{th}\) word)
Now that we understand the interpretation of \(\textbf{W}_{context}\) and \(\textbf{W}_{word}\), our aim is to learn these parameters
We denote the context word (sat) by the index \(c\) and the correct output word (on) by the index \(w\)
For this multiclass classification problem what is an appropriate output function \((\hat{y} = f(x))\)? (softmax)
What is an appropriate loss function? (cross entropy)
\(\mathscr{L}(\theta) = -\log \hat{y}_w = -\log P(w|c)\)
\(h = W_{context} \cdot x_c = u_c\)
\(\hat{y}_w = \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})}\)
\(u_c\) is the column of \(W_{context}\) corresponding to context \(c\) and \(v_w\) is the row of \(W_{word}\) corresponding to word \(w\)
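The forward pass and the loss can be sketched in a few lines. The numbers in `u_c` and `W_word`, the vocabulary size of 3, and the function names are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(u_c, W_word):
    """Distribution over the vocabulary given the context representation u_c."""
    scores = [sum(a * b for a, b in zip(u_c, v_w)) for v_w in W_word]
    return softmax(scores)

u_c = [0.5, -1.0]        # column of W_context for the context word (toy values)
W_word = [               # one row v_w per word, so |V| = 3 and k = 2
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

y_hat = predict(u_c, W_word)
w = 0                             # index of the correct output word
loss = -math.log(y_hat[w])        # cross entropy for one (c, w) pair
print(y_hat, loss)
```

Note that computing even a single \(\hat{y}_w\) touches every row of `W_word`, since the denominator sums over the whole vocabulary.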
How do we train this simple feed forward neural network\(?\)
Let us consider one input-output pair \((c, w)\) and see the update rule for \(v_w\)
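\(\mathscr{L}(\theta) = -\log \hat{y}_w\)
\(= -\log \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})}\)
\(= -\left(u_c \cdot v_w - \log \sum_{w' \in V} \exp(u_c \cdot v_{w'})\right)\)
\(\nabla_{v_w} = -\left(u_c - \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})} \cdot u_c\right)\)
\(= -u_c(1 - \hat{y}_w)\)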
And the update rule would be
\(v_w = v_w - \eta \nabla_{v_w} \)
\(= v_w + \eta u_c( 1 - \hat{y}_w)\)
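One step of this update can be simulated to see its effect on \(\hat{y}_w\). This sketch only updates the row \(v_w\) of the correct word, as in the rule above (a full gradient step would also touch the other rows and \(u_c\)); the toy values, the learning rate and the function names are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prob_of_word(u_c, W_word, w):
    """Softmax probability y_hat_w of word w given the context vector u_c."""
    scores = [dot(u_c, v) for v in W_word]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[w] / sum(exps)

def update_v_w(u_c, W_word, w, eta):
    """Apply v_w <- v_w + eta * u_c * (1 - y_hat_w) to row w of W_word."""
    scale = eta * (1.0 - prob_of_word(u_c, W_word, w))
    W_word[w] = [v + scale * u for v, u in zip(W_word[w], u_c)]

u_c = [0.5, -1.0]                               # toy context representation
W_word = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # toy word representations
w = 2                                           # index of the correct word

before = prob_of_word(u_c, W_word, w)
update_v_w(u_c, W_word, w, eta=0.1)
after = prob_of_word(u_c, W_word, w)
print(before, after)
```

With all rows initialised to zero the model starts from a uniform prediction, and a single step moves \(v_w\) towards \(u_c\), raising the probability of the correct word.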
This update rule has a nice interpretation
\( v_w = v_w + \eta u_c( 1 - \hat{y}_w)\)
If \(\hat{y}_w \rightarrow 1\) then we are already predicting the right word and \(v_w\) will not be updated
If \(\hat{y}_w \rightarrow 0\) then \(v_w\) gets updated by adding a fraction of \(u_c\) to it
This increases the cosine similarity between \(v_w\) and \(u_c\) (How? Refer to slide \(38\) of Lecture \(2\))
The training objective ensures that the cosine similarity between word \((v_w)\) and context word \((u_c)\) is maximized
What happens to the representations of two words \(w\) and \(w'\) which tend to appear in similar contexts \((c)\)?
The training ensures that both \(v_w\) and \(v_{w'}\) have a high cosine similarity with \(u_c\) and hence transitively (intuitively) ensures that \(v_w\) and \(v_{w'}\) have a high cosine similarity with each other
This is only an intuition (a reasonable one)
Haven’t come across a formal proof for this!
In practice, instead of a window size of \(1\) it is common to use a window size of \(d\)
So now,
\(h = \displaystyle \sum_{i=1}^{d-1} u_{c_i}\)
\([W_{context}, W_{context}]\) just means that we are stacking \(2\) copies of the \(W_{context}\) matrix
The resultant product would simply be the sum of the columns corresponding to ‘sat’ and ‘he’
[Figure: network with two context words. The input \(x \in \R^{2|V|}\) stacks the one-hot vectors of sat and he; \([W_{context}, W_{context}] \in \R^{k \times 2|V|}\) maps it to \(h \in \R^k\); \(W_{word}\) maps \(h\) to the outputs \(P(he|sat, he)\), \(P(chair|sat, he)\), \(P(man|sat, he)\), \(P(on|sat, he)\), ...]
- Of course in practice we will not do this expensive matrix multiplication
- If ‘he’ is the \(i^{th}\) word in the vocabulary and ‘sat’ is the \(j^{th}\) word then we will simply access the columns \(\textbf{W}_{context}[:, i]\) and \(\textbf{W}_{context}[:, j]\) and add them
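The equivalence between the stacked matrix product and a plain sum of two columns can be checked directly. In this sketch the matrix entries, the vocabulary size of 4 and the indices assigned to sat and he are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sum_of_columns(W, indices):
    """Add up the selected columns of W without any matrix multiplication."""
    return [sum(row[i] for i in indices) for row in W]

W_context = [                 # toy values, k = 2 and |V| = 4
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]
vocab_size = 4
sat, he = 1, 3                # hypothetical vocabulary indices

# Expensive route: stack two copies of W_context and two one-hot vectors.
stacked_W = [row + row for row in W_context]
x = [0] * (2 * vocab_size)
x[sat] = 1
x[vocab_size + he] = 1
h_expensive = matvec(stacked_W, x)

# Cheap route: look up the two columns and add them.
h_cheap = sum_of_columns(W_context, [sat, he])
print(h_expensive, h_cheap)
```

The lookup costs \(O(k)\) per context word, whereas the explicit product costs \(O(k|V|)\) per context word.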
Now what happens during backpropagation?
Recall that
\(h = \displaystyle \sum_{i=1}^{d-1} u_{c_i}\)
and
\(P(on|sat, he) = \frac{e^{(W_{word} h)[k]}}{\sum_{j} e^{(W_{word} h)[j]}}\)
where ‘k’ is the index of the word ‘on’
The loss function depends on \(\{W_{word}, u_{c_1}, u_{c_2}, \ldots, u_{c_{d-1}}\}\) and all these parameters will get updated during backpropagation
Try deriving the update rule for \(v_w\) now and see how it differs from the one we derived before
Some problems:
Notice that the softmax function at the output is computationally very expensive
\(\hat{y}_w = \frac{\exp(u_c \cdot v_w)}{\sum_{w' \in V} \exp(u_c \cdot v_{w'})}\)
The denominator requires a summation over all words in the vocabulary
We will revisit this issue soon
Module 10.5: Skip-gram model
The model that we just saw is called the continuous bag of words model (it predicts an output word given a bag of context words)
We will now see the skip gram model (which predicts context words given an input word)
Notice that the role of \(context\) and \(word\) has changed now
In the simple case when there is only one \(context\) word, we will arrive at the same update rule for \(u_c\) as we did for \(v_w\) earlier
Notice that even when we have multiple context words the loss function would just be a summation of many cross entropy errors
\(\mathscr {L}(\theta)=\) \(-\) \(\displaystyle \sum_{i=1}^{d-1}\) \(log\) \(\hat{y}_{w_{i}}\)
Typically, we predict context words on both sides of the given word
[Figure: skip-gram network. The one-hot input \(x \in \R^{|V|}\) is mapped to the hidden layer \(h \in \R^k\), from which the context words he, sat, a, chair are predicted; the parameters are \(W_{context}\) and \(W_{word}\)]
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
Let \(D\) be the set of all correct \((w, c)\) pairs in the corpus
Let \(D'\) be the set of all incorrect \((w, r)\) pairs in the corpus
\(D'\) can be constructed by randomly sampling a context word \(r\) which has never appeared with \(w\) and creating a pair \((w, r)\)
As before let \(v_w\) be the representation of the word \(w\) and \(u_c\) be the representation of the context word \(c\)
\(D = [\)(sat, on), (sat, a), (sat, chair), (on, a), (on,chair), (a,chair), (on,sat), (a, sat), (chair,sat), (a, on), (chair, on), (chair, a) \(]\)
\(D' = [\)(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)\(]\)
For a given \((w, c) \in D\) we are interested in maximizing
\(p(z = 1|w, c)\)
Let us model this probability by
\(p(z = 1|w, c) = \sigma(u_c^T v_w) = \frac{1}{1 + e^{-u_c^T v_w}}\)
Considering all \((w, c) \in D\), we are interested in
\(\underset{\theta}{maximize} \displaystyle \prod_{(w,c) \in D} p(z = 1|w, c)\)
where \(\theta\) is the word representation \((v_w)\) and context representation \((u_c)\) for all words in our corpus
For \((w, r) \in D'\) we are interested in maximizing
\(p(z = 0|w, r)\)
Again we model this as
\(p(z = 0|w, r) = 1 - \sigma(u_r^T v_w)\)
\(= 1 - \frac{1}{1 + e^{-u_r^T v_w}}\)
\(= \frac{1}{1 + e^{u_r^T v_w}} = \sigma(-u_r^T v_w)\)
Considering all \((w, r) \in D'\), we are interested in
\(\underset{\theta}{maximize} \displaystyle \prod_{(w,r) \in D'} p(z = 0|w, r)\)
Combining the two we get:
\(\underset{\theta}{maximize} \displaystyle \prod_{(w,c) \in D} p(z = 1|w, c) \prod_{(w,r) \in D'} p(z = 0|w, r)\)
\(= \underset{\theta}{maximize} \displaystyle \prod_{(w,c) \in D} p(z = 1|w, c) \prod_{(w,r) \in D'} (1 - p(z = 1|w, r))\)
\(= \underset{\theta}{maximize} \displaystyle \sum_{(w,c) \in D} \log p(z = 1|w, c) + \sum_{(w,r) \in D'} \log (1 - p(z = 1|w, r))\)
\(= \underset{\theta}{maximize} \displaystyle \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-u_c^T v_w}} + \sum_{(w,r) \in D'} \log \frac{1}{1 + e^{u_r^T v_w}}\)
\(= \underset{\theta}{maximize} \displaystyle \sum_{(w,c) \in D} \log \sigma(u_c^T v_w) + \sum_{(w,r) \in D'} \log \sigma(-u_r^T v_w)\)
where \(\sigma(x) = \frac{1}{1+e^{-x}}\)
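The combined objective can be evaluated for a toy set of representations. This is only a sketch: the two-dimensional vectors, the dictionaries `u` and `v`, and the function name `neg_sampling_objective` are assumptions made for illustration, and the pairs are borrowed from the examples of \(D\) and \(D'\).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_objective(D, D_neg, u, v):
    """Sum of log sigma(u_c . v_w) over D plus log sigma(-u_r . v_w) over D'."""
    positive = sum(math.log(sigmoid(dot(u[c], v[w]))) for w, c in D)
    negative = sum(math.log(sigmoid(-dot(u[r], v[w]))) for w, r in D_neg)
    return positive + negative

# Toy word (v) and context (u) representations.
v = {"sat": [1.0, 0.0]}
u = {"on": [2.0, 0.0], "oxygen": [-2.0, 0.0]}

good = neg_sampling_objective([("sat", "on")], [("sat", "oxygen")], u, v)
bad = neg_sampling_objective([("sat", "oxygen")], [("sat", "on")], u, v)
print(good, bad)  # the objective is higher when true pairs are aligned
```

Each term involves a single dot product, so the cost of one update no longer depends on the vocabulary size.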
In the original paper, Mikolov et al. sample \(k\) negative \((w, r)\) pairs for every positive \((w, c)\) pair
The size of \(D'\) is thus \(k\) times the size of \(D\)
The random context word is drawn from a modified unigram distribution
\(r ∼ p(r)^{\frac{3}{4}}\)
\(r ∼ \dfrac{count(r)^{\frac{3}{4}}}{N}\)
\(N =\) total number of words in the corpus
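A minimal sketch of this sampling step is given below. One assumption is made explicit: instead of dividing by \(N\), the powered counts are renormalized by their own sum so that the probabilities add up to 1. The function names and the toy token list are hypothetical.

```python
import random
from collections import Counter

def negative_sampling_distribution(tokens, power=0.75):
    """Return {word: probability} with probability proportional to count(word)**power."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())  # renormalize so the probabilities sum to 1
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(dist, k, rng):
    """Draw k random context words r from the modified unigram distribution."""
    words = list(dist)
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)

tokens = "the cat sat on the mat the end".split()
dist = negative_sampling_distribution(tokens)
negatives = sample_negatives(dist, k=5, rng=random.Random(0))
```

Raising counts to the power 3/4 flattens the distribution: here the frequent word "the" gets less than its unigram share of 3/8, and each rare word gets slightly more than 1/8.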
Module 10.6: Contrastive Estimation
Some problems
Same as bag of words
The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
[Figure: network with one-hot input \(x \in \R^{|V|}\), hidden layer \(h \in \R^{k}\), weight matrices \(W_{context} \in \R^{k \times |V|}\) and \(W_{word} \in \R^{k \times |V|}\), and output words he, sat, a, chair]
Positive: He sat on a chair
Negative: He sat abracadabra a chair
[Figure: two copies of the same network, each taking \(v_c\) and \(v_w\) as input, with hidden layer \(W_h \in \R^{2d \times h}\) and output layer \(W_{out} \in \R^{h \times 1}\). The inputs (sat, on) produce the score \(s\); the inputs (sat, abracadabra) produce the score \(s_c\)]
We would like \(s\) to be greater than \(s_c\)
Okay, so let us try to maximize \(s − s_c\)
But we would like the difference to be at least \(m\)
So we can maximize \(s − (s_c + m)\)
What if \(s > s_c + m\)? (don't do anything)
So we minimize \(max(0, s_c + m − s)\), which is zero once \(s\) exceeds \(s_c\) by at least the margin \(m\)
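The margin idea can be written as a hinge loss that is minimized: it is positive while \(s\) fails to beat \(s_c\) by the margin \(m\), and zero (no update) once it does. The sketch below assumes the scores are already computed; the function name `margin_loss` and the default margin are hypothetical.

```python
def margin_loss(s, s_c, m=1.0):
    """Hinge loss max(0, s_c + m - s).

    s   : score of the correct window (e.g. "He sat on a chair")
    s_c : score of the corrupted window (e.g. "He sat abracadabra a chair")
    m   : required margin between the two scores
    """
    return max(0.0, s_c + m - s)

satisfied = margin_loss(s=3.0, s_c=1.0)   # margin already met
violated = margin_loss(s=1.2, s_c=1.0)    # s is ahead, but by less than m
```

In the first call the margin is already met, so the loss is zero and nothing needs to change; in the second the loss is positive even though \(s > s_c\).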
Module 10.7: Hierarchical softmax
[Figure: one-hot input for the context word sat, hidden representation \(h = v_c\), and a binary tree over the vocabulary with the word on at one of its leaves]
Construct a binary tree such that there are \(|V|\) leaf nodes, each corresponding to one word in the vocabulary
There exists a unique path from the root node to each leaf node
Let \(l(w_1), l(w_2), \dots, l(w_p)\) be the nodes on the path from the root to \(w\)
Let \(π(w)\) be a binary vector such that:
\(π(w)_k = 1\) if the path branches left at node \(l(w_k)\)
\(π(w)_k = 0\) otherwise
Finally, each internal node is associated with a vector \(u_i\)
So the parameters of the module are \(\textbf W_{context}\) and \(u_1, u_2, \dots, u_V\) (in effect, we have the same number of parameters as before)
[Figure: binary tree with internal-node vectors \(u_1, u_2, \dots, u_V\); the path from the root to the leaf on has \(π(on)_1 = 1\), \(π(on)_2 = 0\), \(π(on)_3 = 0\)]
For a given pair \((w, c)\) we are interested in the probability \(p(w|v_c)\)
We model this probability as
\(p(w|v_c) = \displaystyle \prod_{k} P(π(w)_k|v_c)\)
For example
\(P(on|v_{sat}) = P(π(on)_1 = 1|v_{sat}) \cdot P(π(on)_2 = 0|v_{sat}) \cdot P(π(on)_3 = 0|v_{sat})\)
In effect, we are saying that the probability of predicting a word is the same as predicting the correct unique path from the root node to that word
We model
\(P(π(on)_i = 1) = \dfrac{1}{1 + e^{-v_c^T u_i}}\)
\(P(π(on)_i = 0) = 1 − P(π(on)_i = 1) = \dfrac{1}{1 + e^{v_c^T u_i}}\)
The above model ensures that the representation of a context word \(v_c\) will have a high (low) similarity with the representation of the node \(u_i\) if \(u_i\) appears on the path and the path branches to the left (right) at \(u_i\)
Again, transitively, the representations of contexts which appear with the same words will have high similarity
\(P(w|v_c) = \displaystyle \prod_{k=1}^{|π(w)|} P(π(w)_k|v_c)\)
Note that \(p(w|v_c)\) can now be computed using \(|π(w)|\) computations instead of the \(|V|\) required by softmax
How do we construct the binary tree?
It turns out that even a random arrangement of the words on the leaf nodes does well in practice
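The path-product can be sketched in a few lines. The example below assumes a sigmoid at every internal node (probability of branching left) and uses a hypothetical 4-word tree of depth 2 with made-up vectors; none of these values come from the lecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def path_probability(v_c, path_nodes, path_bits):
    """P(w | v_c) as a product of binary decisions along the root-to-leaf path.

    path_nodes : vectors u_i of the internal nodes on the path to w
    path_bits  : pi(w), with 1 = branch left and 0 = branch right
    """
    prob = 1.0
    for u_i, bit in zip(path_nodes, path_bits):
        p_left = sigmoid(dot(v_c, u_i))
        prob *= p_left if bit == 1 else 1.0 - p_left
    return prob

# Hypothetical context vector and internal-node vectors.
v_c = [0.5, -1.0]
u_root, u_left, u_right = [1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]

# Hypothetical tree: each word maps to (nodes on its path, branching bits).
paths = {
    "on":    ([u_root, u_left],  [1, 1]),
    "a":     ([u_root, u_left],  [1, 0]),
    "chair": ([u_root, u_right], [0, 1]),
    "he":    ([u_root, u_right], [0, 0]),
}
probs = {w: path_probability(v_c, nodes, bits) for w, (nodes, bits) in paths.items()}
```

Each word needs only as many sigmoid evaluations as its path is long, and the leaf probabilities still sum to 1 without any explicit normalization over the vocabulary.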
Module 10.8: GloVe Representations
Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations
Predict based methods learn word representations using co-occurrence information
Why not combine the two (count and learn)?
\(X_{ij}\) encodes important global information about the co-occurrence between \(i\) and \(j\) (global: because it is computed for the entire corpus)
Why not learn word vectors which are faithful to this information?
For example, enforce
\(v_i^T v_j = \log P(j|i) = \log X_{ij} − \log X_i\)
Similarly,
\(v_j^T v_i = \log P(i|j) = \log X_{ij} − \log X_j \quad (X_{ij} = X_{ji})\)
Essentially we are saying that we want word vectors \(v_i\) and \(v_j\) such that \(v_i^T v_j\) is faithful to the globally computed \(P(j|i)\)
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time
\(X = \)
| | human | machine | system | for | ... | user |
|---|---|---|---|---|---|---|
| human | 2.01 | 2.01 | 0.23 | 2.14 | ... | 0.43 |
| machine | 2.01 | 2.01 | 0.23 | 2.14 | ... | 0.43 |
| system | 0.23 | 0.23 | 1.17 | 0.96 | ... | 1.29 |
| for | 2.14 | 2.14 | 0.96 | 1.87 | ... | -0.13 |
| ... | ... | ... | ... | ... | ... | ... |
| user | 0.43 | 0.43 | 1.29 | -0.13 | ... | 1.71 |
\(P(j|i) = \dfrac{X_{ij}}{\sum_j X_{ij}} = \dfrac{X_{ij}}{X_i}\)
\(X_{ij} = X_{ji}\)
Adding the two equations we get
\(2 v_i^T v_j = 2 \log X_{ij} - \log X_i - \log X_j\)
\(v_i^T v_j = \log X_{ij} - \frac{1}{2} \log X_i - \frac{1}{2} \log X_j\)
Note that \(\log X_i\) and \(\log X_j\) depend only on the words \(i\) and \(j\), and we can think of them as word-specific biases which will be learned
\(v_i^T v_j = \log X_{ij} - b_i - b_j\)
\(v_i^T v_j + b_i + b_j = \log X_{ij}\)
We can then formulate this as the following optimization problem
\(\displaystyle \min_{v_i, v_j, b_i, b_j} \sum_{i, j} (\underbrace{v_i^T v_j + b_i + b_j}_{\text{predicted value using model parameters}} - \underbrace{\log X_{ij}}_{\text{actual value computed from the given corpus}})^2\)
Drawback: weighs all co-occurrences equally
\(\displaystyle \min_{v_i, v_j, b_i, b_j} \sum_{i, j} (v_i^T v_j + b_i + b_j - \log X_{ij})^2\)
Solution: add a weighting function
\(\displaystyle \min_{v_i, v_j, b_i, b_j} \sum_{i, j} f(X_{ij}) (v_i^T v_j + b_i + b_j - \log X_{ij})^2\)
Wishlist: \(f(X_{ij})\) should be such that neither rare nor frequent words are overweighted.
\(f(x) = \begin{cases} \left( \dfrac{x}{x_{max}} \right)^{\alpha}, & \text{if } x < x_{max} \\ 1, & \text{otherwise} \end{cases}\)
where \(α\) can be tuned for a given dataset
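A small sketch of the weighted objective follows. It uses a single set of vectors \(v_i\) and biases \(b_i\), as in the equations above. The defaults `x_max=100.0` and `alpha=0.75` are assumed values chosen for illustration, and skipping zero counts is also an assumption (log 0 is undefined and \(f(0) = 0\) anyway).

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)**alpha if x < x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, V, b, x_max=100.0, alpha=0.75):
    """Sum over i, j of f(X_ij) * (v_i . v_j + b_i + b_j - log X_ij)**2.

    X : square co-occurrence matrix (list of lists)
    V : list of word vectors v_i
    b : list of word-specific biases b_i
    """
    total = 0.0
    n = len(X)
    for i in range(n):
        for j in range(n):
            if X[i][j] > 0:  # skip empty cells
                predicted = sum(p * q for p, q in zip(V[i], V[j])) + b[i] + b[j]
                error = predicted - math.log(X[i][j])
                total += glove_weight(X[i][j], x_max, alpha) * error ** 2
    return total

# Toy check: with all counts equal to 1, log X_ij = 0, so zero vectors and
# zero biases fit perfectly.
X = [[1.0, 1.0], [1.0, 1.0]]
perfect = glove_loss(X, V=[[0.0], [0.0]], b=[0.0, 0.0])
imperfect = glove_loss(X, V=[[0.0], [0.0]], b=[1.0, 0.0])
```

The weight grows with the count but is capped at 1, so very frequent pairs cannot dominate the sum while very rare pairs contribute little.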
Module 10.9: Evaluating word representations
How do we evaluate the learned word representations?
Semantic Relatedness
Ask humans to judge the relatedness between a pair of words
Compute the cosine similarity between the corresponding word vectors learned by the model
Given a large number of such word pairs, compute the correlation between \(S_{model}\) and \(S_{human}\), and compare different models
Model \(1\) is better than Model \(2\) if
\(correlation(S_{model1}, S_{human}) > correlation(S_{model2}, S_{human})\)
\(S_{human}(cat, dog) = 0.8\)
\(S_{model}(cat, dog) = \dfrac{v_{cat}^T v_{dog}}{\| v_{cat} \| \| v_{dog}\|} = 0.7\)
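The relatedness evaluation can be sketched as below. Pearson correlation is used here as an assumption, since only "correlation" is specified; the word vectors and human scores are hypothetical toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity v_a . v_b / (||v_a|| ||v_b||)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical word vectors and human relatedness judgements.
vectors = {"cat": [1.0, 0.2], "dog": [0.9, 0.4], "car": [0.1, 1.0]}
human = {("cat", "dog"): 0.8, ("cat", "car"): 0.1, ("dog", "car"): 0.2}

pairs = list(human)
s_model = [cosine(vectors[a], vectors[b]) for a, b in pairs]
s_human = [human[p] for p in pairs]
score = pearson(s_model, s_human)
```

Running the same procedure with the vectors of a second model gives a second `score`; the model with the higher value agrees better with the human judgements.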
Synonym Detection
Given: a term and four candidate synonyms
Pick the candidate which has the largest cosine similarity with the term
Compute the accuracy of different models and compare
Term: levied
Candidates: \(\{\)imposed, believed, requested, correlated\(\}\)
Synonym \(= \underset{c \in C}{argmax}\ cosine(v_{term}, v_c)\)
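The argmax rule is a one-liner once cosine similarity is available. The sketch below uses the levied example, with imposed as the intended synonym; the 3-dimensional vectors are hypothetical and chosen only so that the example runs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def pick_synonym(term, candidates, vectors):
    """Return the candidate c in C with the largest cosine(v_term, v_c)."""
    return max(candidates, key=lambda c: cosine(vectors[term], vectors[c]))

# Hypothetical word vectors.
vectors = {
    "levied":     [0.9, 0.1, 0.3],
    "imposed":    [0.8, 0.2, 0.4],
    "believed":   [0.1, 0.9, 0.2],
    "requested":  [0.3, 0.5, 0.8],
    "correlated": [0.2, 0.1, 0.9],
}
candidates = ["imposed", "believed", "requested", "correlated"]
answer = pick_synonym("levied", candidates, vectors)
```

Accuracy for a model is then the fraction of test items for which `answer` matches the gold synonym.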
Analogy
Semantic Analogy: Find nearest neighbour of \(v_{sister} − v_{brother} + v_{grandson}\)
Syntactic Analogy: Find nearest neighbour of \(v_{works} − v_{work} + v_{speak}\)
brother : sister :: grandson : \(?\)
work : works :: speak : \(?\)
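The analogy test can be sketched as a nearest-neighbour search around the offset vector. Two assumptions are made here: cosine similarity is the distance measure, and the three query words are excluded from the candidates. The 2-dimensional vectors are hypothetical (first coordinate roughly "gender", second roughly "generation").

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? via the nearest neighbour of v_b - v_a + v_c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude query words
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical word vectors.
vectors = {
    "brother":       [1.0, 1.0],
    "sister":        [-1.0, 1.0],
    "grandson":      [1.0, -1.0],
    "granddaughter": [-1.0, -1.0],
    "chair":         [0.2, 0.3],
}
answer = analogy("brother", "sister", "grandson", vectors)
```

With these toy vectors the offset \(v_{sister} − v_{brother} + v_{grandson}\) lands exactly on the vector for granddaughter.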
So which algorithm gives the best result?
Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.
Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction-based models on similarity tasks but not on analogy tasks.
Module 10.10: Relation between SVD and word2vec
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
Recall that SVD does a matrix factorization of the co-occurrence matrix
Levy et al. [2015] show that word2vec also implicitly does a matrix factorization
What does this mean?
Recall that word2vec gives us \(W_{context}\) and \(W_{word}\).
Turns out that we can also show that
\(M = W_{word}^T W_{context}\)
where
\(M_{ij} = PMI(w_i, c_j) − \log(k)\)
\(k =\) number of negative samples
So essentially, word2vec factorizes a matrix \(M\) which is related to the PMI-based co-occurrence matrix (very similar to what SVD does)
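The matrix \(M\) can be computed directly from a word-by-context count table. The sketch below assumes the usual definition \(PMI(w, c) = \log \frac{p(w, c)}{p(w) p(c)}\) estimated from counts, and leaves cells with zero count as `None` (their PMI is \(-\infty\)); the function name and toy counts are hypothetical.

```python
import math

def shifted_pmi(counts, k):
    """Return M with M[i][j] = PMI(w_i, c_j) - log(k).

    counts[i][j] : number of times word w_i occurs with context c_j
    k            : number of negative samples
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]        # counts of each word
    col_sums = [sum(col) for col in zip(*counts)]  # counts of each context
    M = []
    for i, row in enumerate(counts):
        M.append([
            math.log(n * total / (row_sums[i] * col_sums[j])) - math.log(k)
            if n > 0 else None
            for j, n in enumerate(row)
        ])
    return M

# Toy counts: each word co-occurs with only one context.
M = shifted_pmi([[4, 0], [0, 4]], k=2)
# Toy counts: words and contexts are independent, so PMI = 0 everywhere.
M_indep = shifted_pmi([[1, 1], [1, 1]], k=5)
```

In the first example the diagonal PMI is \(\log 2\), which the shift by \(\log k\) with \(k = 2\) brings to 0; in the second every entry is just \(−\log k\).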