LSTM Model
A stunning story with a lot of twists and..
Word Representation
Learnable Components
LSTM Model
The trimmer is noisy, not customizable
Word Representation
Learnable Components
(millions of samples)
(thousands of samples)
Transfer Learning
(as weight initializer)
A stunning story with a lot of twists and..
LSTM Model
Word2Vec Embedding
Image of Bert
he sat on the bank of the river and watched the currents
he went to the bank to check balance in his current account
The mouse is too bad and the response is very poor
Task Specific Model
Word2Vec
(computer) mouse
(lab) mouse
keyboard
cat
A mouse chases a cat
A cat chases a mouse
Conv Layers as Feature Extractor
Fine-tune for any downstream task
Supervised Pre-training
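A minimal sketch of this supervised pre-training / fine-tuning idea, assuming PyTorch with a torchvision ResNet-18 as the pretrained network and a hypothetical 5-class downstream task: the conv layers are frozen as a feature extractor and only a new head is trained.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse conv layers pretrained on a large labelled dataset (supervised pre-training)
# as a feature extractor (downloads ImageNet weights on first use).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained conv layers so they act as a fixed feature extractor.
for p in backbone.parameters():
    p.requires_grad = False

# Replace the classifier head for the downstream task (num_classes is assumed).
num_classes = 5
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)

x = torch.randn(2, 3, 224, 224)   # dummy batch of images
logits = backbone(x)              # frozen conv features -> new task head
print(logits.shape)               # torch.Size([2, 5])
```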
Word2Vec or GloVe
Task-specific model
Unsupervised Pre-training
Bert is a mature character in the comic series, whereas Elmo is mischievous
Features
Embeddings
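A minimal sketch of using pretrained word vectors as features / as a weight initializer for a task-specific model; the toy vocabulary, the random stand-in vectors, and the `SentimentClassifier` module are illustrative assumptions, not the actual Word2Vec/GloVe pipeline.

```python
import torch
import torch.nn as nn

# Toy vocabulary; in practice these ids come from the Word2Vec/GloVe vocabulary.
vocab = {"a": 0, "stunning": 1, "story": 2, "with": 3, "lot": 4, "of": 5, "twists": 6}
emb_dim = 50

# Stand-in for pretrained Word2Vec/GloVe vectors (shape: vocab_size x emb_dim).
pretrained_vectors = torch.randn(len(vocab), emb_dim)

class SentimentClassifier(nn.Module):
    """Task-specific model that uses pretrained embeddings as initialization."""
    def __init__(self, vectors, freeze=False):
        super().__init__()
        # Embedding layer initialized from the pretrained vectors (weight initializer).
        self.embedding = nn.Embedding.from_pretrained(vectors, freeze=freeze)
        self.lstm = nn.LSTM(vectors.size(1), 64, batch_first=True)
        self.out = nn.Linear(64, 2)            # binary sentiment

    def forward(self, token_ids):
        emb = self.embedding(token_ids)        # (batch, seq, emb_dim)
        _, (h_n, _) = self.lstm(emb)           # final hidden state
        return self.out(h_n[-1])               # (batch, 2)

model = SentimentClassifier(pretrained_vectors, freeze=False)
ids = torch.tensor([[1, 2, 3, 4, 5, 6]])       # "stunning story with lot of twists"
print(model(ids).shape)                        # torch.Size([1, 2])
```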
Encoder input (Tamil): Naan transfarmar padaththai rasiththen ("I enjoyed the film Transformers")
Decoder input: <Go> I enjoyed the film transformers
Decoder output: I enjoyed the film transformers
CoVe
A stunning story with a lot of twists and..
Task Specific Model
GloVe Embedding
biLSTM
I am ___ excited to watch the film as it is directed by Rajamouli
I went to the bank
Assume these two LSTMs compute the hidden states (with initial states \(\overrightarrow{h_0}, \overleftarrow{h_0}\)) independently, on different machines, for the same input sequence
We can use two independent LSTM units to build a bi-LM
For convenience, we can group these two independent LSTM units into a block and call it a single layer
I went to the bank
We can build as many bi-LSTM layers as we want to get deeper representations :-)
I went to the bank
Concatenate the hidden states of the forward and backward LSTMs at time step \(t\), that is, \([\overrightarrow{h_t}; \overleftarrow{h_t}]\)
Softmax
went to the bank
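A minimal sketch of this bi-LM block in PyTorch, following the figure: two independent LSTMs per layer, layers stacked for depth, hidden states concatenated as \([\overrightarrow{h_t}; \overleftarrow{h_t}]\), and a softmax head over the vocabulary. All sizes, the toy token ids, and the `BiLMLayer` name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLMLayer(nn.Module):
    """One bi-LM layer: two independent LSTM units, one per direction."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fwd = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.bwd = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        h_fwd, _ = self.fwd(x)                        # reads left-to-right
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))  # reads right-to-left
        h_bwd = torch.flip(h_bwd, dims=[1])           # re-align to original order
        # Concatenate forward and backward states at each time step t.
        return torch.cat([h_fwd, h_bwd], dim=-1)

vocab_size, emb_dim, hidden = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
# Stack as many bi-LM layers as we want for deeper representations.
layers = nn.ModuleList([BiLMLayer(emb_dim, hidden), BiLMLayer(2 * hidden, hidden)])
softmax_head = nn.Linear(2 * hidden, vocab_size)  # word scores fed to the softmax

tokens = torch.tensor([[4, 11, 7, 2, 9]])         # "I went to the bank" (toy ids)
h = embed(tokens)
for layer in layers:
    h = layer(h)                                  # (batch, seq, 2*hidden)
logits = softmax_head(h)                          # per-step scores before softmax
print(logits.shape)                               # torch.Size([1, 5, 100])
```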
A stunning story with a lot of twists and..
Task Specific Model
CNN over Characters
biLM
A stunning story with a lot of twists and..
Task Specific Model
CNN over Characters
biLM
A mouse chases a cat
A cat chases a mouse
Context
Question
Recall the examples we used earlier
he sat on the bank of the river and watched the currents
he went to the bank to check balance in his current account
Now, we expect the embedding from ELMo for the word bank to differ based on the context
he sat on the bank of the river and watched the currents
he went to the bank to check balance in his current account
How different are these representations?
he sat on the bank of the river and watched the currents
he went to the bank to check balance in his current account
The figure on the right shows two possible representations for the word bank appearing in different contexts
The top one gives different representations for the word bank that are close to each other
The bottom one, in contrast, gives representations for the word bank that are far apart from each other
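To make the comparison concrete, here is a sketch of the procedure: extract the contextual vector for bank from each sentence and measure their cosine similarity. A randomly initialized bi-LSTM stands in for a trained ELMo biLM, so the printed number is meaningless; only the steps matter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

sent_river = "he sat on the bank of the river and watched the currents".split()
sent_money = "he went to the bank to check balance in his current account".split()

# Toy vocabulary and a toy contextual encoder standing in for a trained ELMo biLM.
vocab = {w: i for i, w in enumerate(sorted(set(sent_river + sent_money)))}
embed = nn.Embedding(len(vocab), 32)
encoder = nn.LSTM(32, 32, batch_first=True, bidirectional=True)

def contextual_vector(sentence, word):
    """Return the encoder's output vector at the position of `word`."""
    ids = torch.tensor([[vocab[w] for w in sentence]])
    outputs, _ = encoder(embed(ids))          # (1, seq_len, 64) contextual states
    return outputs[0, sentence.index(word)]   # vector for this occurrence of `word`

v_river = contextual_vector(sent_river, "bank")
v_money = contextual_vector(sent_money, "bank")

# With a trained biLM the two vectors differ according to context;
# cosine similarity quantifies how close they are.
print(F.cosine_similarity(v_river, v_money, dim=0).item())
```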
The distribution of words in the embedding space of the word2vec model is shown on the left.
We can clearly see that all (10K) words are distributed in all directions (it looks almost spherical)
The distribution is isotropic
What happens if we contextualize words? Does the representation still remain isotropic?
Computing the cosine similarity between any two random words results in a low value (on average)
A higher cosine similarity between two words implies that they are semantically closer.
Therefore, a set of semantically close words forms a narrow cone within the sphere
Surprisingly, contextual embeddings for all words lie within such a cone, and therefore they are anisotropic!
Moreover, the top layer captures contextual information better than the bottom layers (the cone becomes narrower in the top layer) (Ref)
That is, two random words on average have high cosine similarity!
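A sketch of this isotropy check: average the cosine similarity over random pairs of vectors. The two random matrices below merely stand in for static (word2vec-like) and contextual embeddings; the shared offset added to the second matrix is an artificial way to produce the narrow-cone (anisotropic) geometry described above.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(vectors, n_pairs=10_000):
    """Average cosine similarity over random pairs of rows of `vectors`."""
    n = vectors.size(0)
    i = torch.randint(n, (n_pairs,))
    j = torch.randint(n, (n_pairs,))
    return F.cosine_similarity(vectors[i], vectors[j], dim=-1).mean().item()

torch.manual_seed(0)

# Stand-in for static (word2vec-like) embeddings: directions spread isotropically.
static = torch.randn(10_000, 300)

# Stand-in for contextual embeddings: a shared offset pushes all vectors
# into a narrow cone, which is what anisotropy looks like geometrically.
contextual = torch.randn(10_000, 300) + 5.0

print("static    :", round(mean_pairwise_cosine(static), 3))      # ~0.0 (isotropic)
print("contextual:", round(mean_pairwise_cosine(contextual), 3))  # high (anisotropic)
```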
LM-Pretraining
There are three stages of training
WikiText
The first stage is the usual general-domain pre-training on a large text corpus.
The LM contains multiple LSTM layers
The learning rate is the same for all parameters in the model
LM-Pretraining
There are three stages of training
WikiText
LM-Finetuning
IMDB
In the second stage, the LM is fine-tuned on the target task's corpus
This is required as data from the target task may come from a different distribution
The crucial step is to use different learning rates across layers, as different layers capture different types of information
This is termed discriminative fine-tuning
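A minimal sketch of discriminative fine-tuning with PyTorch parameter groups, assuming a toy LM with an embedding layer and three LSTM layers; the base learning rate is illustrative, and the per-layer factor of 2.6 follows the ULMFiT paper.

```python
import torch
import torch.nn as nn

# Toy LM: embedding layer followed by three LSTM layers.
model = nn.ModuleDict({
    "embedding": nn.Embedding(1000, 64),
    "lstm1": nn.LSTM(64, 128, batch_first=True),
    "lstm2": nn.LSTM(128, 128, batch_first=True),
    "lstm3": nn.LSTM(128, 64, batch_first=True),
})

base_lr = 1e-3
layer_order = ["lstm3", "lstm2", "lstm1", "embedding"]  # top layer first

# Discriminative fine-tuning: lower layers get smaller learning rates
# (here lr / 2.6 per layer, the factor suggested in the ULMFiT paper).
param_groups = [
    {"params": model[name].parameters(), "lr": base_lr / (2.6 ** depth)}
    for depth, name in enumerate(layer_order)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)

for group, name in zip(optimizer.param_groups, layer_order):
    print(name, group["lr"])
```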
LM-Pretraining
LM-Finetuning
Classifier-Finetuning
There are three stages of training
WikiText
IMDB
IMDB Labels
Two linear layers with a ReLU activation function are added on top of the last LSTM layer
Instead of fine-tuning the entire model all at once, the layers are gradually unfrozen from the last to the first layer and fine-tuned.
With this approach, the performance of the model with 100 labelled samples is comparable to that of a model trained from scratch with 100x as many samples
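A sketch of this classifier stage with gradual unfreezing, assuming the same toy layer groups as above; the two linear layers (the first with ReLU) form the classifier head, and one additional group is unfrozen per epoch from the last layer down to the embedding layer (the actual training step is omitted).

```python
import torch.nn as nn

# Layer groups of the fine-tuned LM plus the new classifier head.
groups = nn.ModuleList([
    nn.Embedding(1000, 64),                 # group 0: embedding layer
    nn.LSTM(64, 128, batch_first=True),     # group 1: LSTM-1
    nn.LSTM(128, 128, batch_first=True),    # group 2: LSTM-2
    nn.LSTM(128, 64, batch_first=True),     # group 3: LSTM-3
    nn.Sequential(                          # group 4: classifier head
        nn.Linear(64, 32), nn.ReLU(),       # first linear layer with ReLU
        nn.Linear(32, 2),                   # second linear layer -> softmax over labels
    ),
])

def set_trainable(last_unfrozen):
    """Freeze every group below `last_unfrozen`; unfreeze it and everything above."""
    for idx, module in enumerate(groups):
        for p in module.parameters():
            p.requires_grad = idx >= last_unfrozen

# Gradual unfreezing: start with only the head, then unfreeze one more group
# per epoch, from the last LSTM layer down to the embedding layer.
for epoch, last_unfrozen in enumerate(range(len(groups) - 1, -1, -1)):
    set_trainable(last_unfrozen)
    trainable = [i for i, m in enumerate(groups)
                 if any(p.requires_grad for p in m.parameters())]
    print(f"epoch {epoch}: trainable groups {trainable}")
    # ... run one epoch of classifier fine-tuning here ...
```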
Figure: the three stages side by side. Each stage has an Embedding layer followed by LSTM-1, LSTM-2 and LSTM-3; the classifier stage adds FFN + Softmax on top. LM-Pretraining uses WikiText, LM-Finetuning uses IMDB, and Classifier-Finetuning uses IMDB Labels.
Figure: the same three-stage pipeline with Amazon Review as the target task: LM-Pretraining (WikiText), LM-Finetuning (Amazon Review) and Classifier-Finetuning (Amazon Review Labels); each stage again has an Embedding layer and LSTM-1/2/3, with FFN + Softmax for the classifier.