Transfer Learning: Pre-training and Fine-tuning
Arun Prakash A
Movie Sentiment Classification Task
LSTM
Model
"A stunning story with a lot of twists and ..."
Supervised learning: train the entire model and learn the word representations, starting from randomly initialized weights.
It is a legitimate attempt. However, what could be drawbacks of this approach?
Word Representation
Learnable Components
Product Sentiment Classification Task
LSTM
Model
"The trimmer is noisy, not customizable"
The knowledge gained in the previous task is not transferable, even though the underlying task remains the same!
We have to train the model from scratch again (that is, randomly initialize the weights) using the new dataset.
Word Representation
Learnable Components
Moreover, the vocabulary may not be the same (say, product names, customization, ...).
Is transfer learning possible?
That is, can we (somehow) use the parameters from a model trained on a similar task
and fine-tune those parameters on the new task?
Motivation from vision
Suppose we train a CNN model to recognize a cat
ConvNet
We can collect thousands of samples for cat images and thousands for not-cat images and train the model.
Suppose we change the task to object detection, that is, detecting the boundary of the cat in an image.
In other words, can we transfer the learning?
This is a related problem; the only difference is that the label is now the coordinates of the bounding box.
Can we make use of the ConvNet trained to classify the cat image?
Classification: Feature Extractor → Discriminator
Object detection: Feature Extractor → Regressor
Transfer learning: reuse the same feature extractor for both tasks
Fine-tune the parameters of the (multiple) conv layers;
randomly initialize the parameters of the new layers.
(Supervised) Pre-training: a CNN trained on ImageNet (millions of samples).
Fine-tuning: fine-tune it for other downstream vision tasks (thousands of samples).
Transfer learning (as a weight initializer).
Outputs from the conv layers act as feature representations.
These can be further fine-tuned for similar tasks like classification, object detection, ...
These feature representations are learned in a supervised setting
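As a rough illustration (not from the lecture), here is a minimal PyTorch/torchvision sketch of this recipe; the weight name and the 2-class head are assumptions for the cat vs. not-cat example:

```python
# Minimal sketch: reuse a CNN pre-trained on ImageNet (supervised pre-training)
# and fine-tune it for a downstream task.
import torch.nn as nn
from torchvision import models

# Conv layers come with pre-trained (ImageNet) weights.
model = models.resnet18(weights="IMAGENET1K_V1")

# Option 1: freeze the conv layers so they act as a fixed feature extractor.
for p in model.parameters():
    p.requires_grad = False

# Randomly initialize the parameters of the new (task-specific) layer,
# here a hypothetical 2-class head (cat vs. not-cat).
model.fc = nn.Linear(model.fc.in_features, 2)

# Option 2: instead of freezing, leave requires_grad=True to fine-tune
# (multiple) conv layers along with the new head.
```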
Does it work for text data?
One solution we can think of is to use word2vec embeddings and fine-tune them as follows.
Fine Tune
Train from scratch
Learnable Components
"A stunning story with a lot of twists and ..."
LSTM
Model
Word2Vec Embedding
Yes. It (almost) solves the problem of change in vocabulary in downstream tasks.
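A minimal sketch of this setup in PyTorch, assuming a pre-trained word2vec matrix `w2v_vectors` (shape: vocab_size x 300) is already loaded as a tensor; the class name and sizes are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, w2v_vectors, hidden=128, num_classes=2):
        super().__init__()
        # Embedding layer initialized from word2vec and fine-tuned (freeze=False);
        # the LSTM and the output layer are trained from scratch.
        self.emb = nn.Embedding.from_pretrained(w2v_vectors, freeze=False)
        self.lstm = nn.LSTM(w2v_vectors.size(1), hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.emb(token_ids))
        return self.out(h_n[-1])               # sentiment logits per sentence
```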
How does self-supervised pre-training help?
It makes use of a large scale unlabelled data.
Getting labelled data is costly and time-consuming.
- 100 billion words in the training corpus
- 3 million words in the vocabulary
Limitations:
The context of a word in a sentence is not captured in the representation
The transfer learning happens only in the embedding (first) layer.
A common thread
Image of Bert
An Image is composed of pixels at an elementary level.
An Image is composed of features at a higher level
A sentence is composed of characters at an elementary level.
A sentence is composed of words at a higher level
The feature representations learned by CNN can be transferred and fine-tuned for other tasks
The word representation learned by word2vec can also be transferred and fine-tuned
However, there are some important differences.
he sat on the bank of the river and watched the currents
he went to the bank to check balance in his current account
The 30 nearest neighbours of the word bank in the word2vec embedding space are shown in the figure.
Let's consider example sentences to illustrate the need for contextual representations.
The vector representation of the word bank doesn't change based on the context.
Therefore, the model has to take on the responsibility of understanding each word in its context.
Learning the context in a supervised setting is once again a limitation.
Limitations of Word2Vec
The mouse is too bad and the response is very poor
Task Specific Model
Word2Vec
(Computer)mouse
(lab)mouse
keyboard
cat
word2vec can't have two representations for a word. However, a word can have multiple meanings based on its sentence context.
Recall that word2vec learns representations using the context of a word w.r.t. its surrounding words within a fixed-size window (independent of the order of the words).
A mouse chases a cat
Moreover, it doesn't include positional information. Think of a question-answering system, where the position of words is important.
A cat chases a mouse
Who is chasing?
A quick comparison of pre-training
Vision: (supervised) pre-training; the conv layers act as a feature extractor that can be fine-tuned for any downstream task. What gets transferred: features.
Text: Word2Vec or GloVe; unsupervised pre-training feeding a task-specific model. What gets transferred: embeddings.
Bert is a mature character in the comic series, whereas Elmo is mischievous.
Wishlist
A model that learns a representation for a word in its context
Learned in Translation: Contextualized Word Vectors (CoVe)
Encoder input: "I enjoyed the film transformers"
Decoder output (Tamil): "<Go> Naan transfarmar padaththai rasiththen" ("I enjoyed the transformers film")
Recall the encoder-decoder models used in MT
The entire input sentence is embedded in the hidden states of the encoder.
Can we make use of those hidden states to create a context vector?
However, \(h_1\) has encoded only the first word of the sentence; it has no idea about the rest of the words.
How do we fix it?
"I enjoyed the film transformers"
Use a bidirectional LSTM model (biLSTM).
CoVe
The vector \(h_i\) is called CoVe, a contextual word vector.
Capturing the context of a word w.r.t. the sentence is enabled by the two hidden states computed in the forward \((\overrightarrow{h_i})\) and backward \((\overleftarrow{h_i})\) directions.
After training the model, we can use the encoder part alone to get contextual embeddings for the words.
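A minimal sketch of that idea (not the authors' implementation): a biLSTM encoder over GloVe embeddings whose per-position hidden states serve as the contextual vectors; `glove_vectors` is an assumed pre-loaded embedding tensor:

```python
import torch
import torch.nn as nn

class CoVeEncoder(nn.Module):
    def __init__(self, glove_vectors, hidden=300):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
        self.bilstm = nn.LSTM(glove_vectors.size(1), hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # Each position's output concatenates the forward and backward states,
        # so every word vector depends on the whole sentence.
        h, _ = self.bilstm(self.emb(token_ids))   # (batch, seq_len, 2*hidden)
        return h                                  # CoVe-style contextual vectors
```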
Learnable Components
"A stunning story with a lot of twists and ..."
Task Specific
Model
Using it in the downstream task
Here, \(x\) is the sequence of words and \(v\) is the contextual embedding of those words.
However, it is still a supervised pre-training approach
GloVe
Embedding
biLSTM
Wishlist
A model that learns a representation for a word in its context
An approach that makes use of large-scale unlabelled data (that is, learns the context in an unsupervised setting)
Language modelling
Given a sequence of words/tokens (\(x_1,x_2,\cdots,x_T\)), the language model predicts the probability of the sequence
For example
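Since the slide's formula is not reproduced here, the standard chain-rule factorization it refers to is sketched below, using the sentence from the MT example:

\[
P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})
\]

so \(P(\text{I enjoyed the film}) = P(\text{I})\, P(\text{enjoyed} \mid \text{I})\, P(\text{the} \mid \text{I, enjoyed})\, P(\text{film} \mid \text{I, enjoyed, the})\).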
Recall, while training the word2vec model, we computed the conditional probabilities for a word.
Is that different from the language modelling then?
Yes, there we computed the probability of a word based on co-occurrence, not on the context in which it appeared.
Approximation: Bigram LMs
N-gram (approximations of) LMs are not a good choice for contextual embeddings, as we know words can have long-range dependencies.
Therefore, language models predict a word given the entire history of words in the sentence in which it appears.
The history can be either in forward direction or backward direction
Forward
Backward
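Written out explicitly (a reconstruction of the two factorizations the slide labels Forward and Backward):

\[
\text{Forward: } P(x_1,\ldots,x_T)=\prod_{t=1}^{T} P(x_t \mid x_1,\ldots,x_{t-1}), \qquad
\text{Backward: } P(x_1,\ldots,x_T)=\prod_{t=1}^{T} P(x_t \mid x_{t+1},\ldots,x_T)
\]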
Bi-Directional Language modelling
The history can be in both directions (forward and backward), and often it is helpful to look into the future.
I am ___ excited to watch the film as it is directed by Rajamouli
If we use only unidirectional models, then the prediction could be either not or very.
However, in the case of bi-LMs, the word not is less likely and the word very is highly likely.
Once again we can use LSTMs for the modelling, and we refer to these as bi-LMs hereafter.
A forward LSTM and a backward LSTM, each reading the input "I went to the bank".
Assume these two LSTMs compute the hidden states (with initial states \(\overrightarrow{h_0},\overleftarrow{h_0}\)) independently, on different machines, for the same input sequence.
We can use two such independent LSTM units for bi-LMs.
For convenience, we can group these two independent LSTM units into a block and call it a single layer (a minimal sketch follows).
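A minimal sketch of such a block, assuming already-embedded inputs `x` of shape (batch, seq_len, dim); this illustrates the grouping, not ELMO's actual code:

```python
import torch
import torch.nn as nn

class BiLMLayer(nn.Module):
    """Two independent LSTMs reading the same sequence in opposite directions."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fwd = nn.LSTM(dim, hidden, batch_first=True)   # left-to-right
        self.bwd = nn.LSTM(dim, hidden, batch_first=True)   # right-to-left

    def forward(self, x):
        h_f, _ = self.fwd(x)                      # forward hidden states
        h_b, _ = self.bwd(torch.flip(x, [1]))     # run on the reversed sequence
        h_b = torch.flip(h_b, [1])                # re-align to the original order
        return h_f, h_b                           # kept separate, as in the figure
```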
We can build as many bi-LSTM layers as we want to get deeper representations :-)
We just use a single layer (instead of two) to illustrate the core idea of ELMO
Embeddings from Language Models (ELMO)
The representation of the inputs is learned by a CNN over characters (a character-level CNN), instead of word2vec embeddings.
Concatenate the hidden states of the forward and backward LSTMs at time step \(t\), that is, \([\overrightarrow{h_t};\overleftarrow{h_t}]\).
Softmax outputs at each time step: went, to, the, bank, ...
The Softmax layer is shared across time steps
The cross-entropy loss is accumulated across time steps, and the loss is minimized in both directions.
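Concretely, the accumulated objective looks like the following (a simplified reconstruction; the ELMO paper additionally shares the token representation and Softmax parameters between the two directions):

\[
\mathcal{L} = -\sum_{t=1}^{T} \Big( \log P(x_t \mid x_1,\ldots,x_{t-1}) + \log P(x_t \mid x_{t+1},\ldots,x_T) \Big)
\]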
How do we use ELMO embeddings in downstream tasks?
Learnable Components
"A stunning story with a lot of twists and ..."
Task Specific
Model
CNN over Characters
biLM
The representation for the \(i^{th}\) word:
The input to a downstream task:
What if we have multiple bi-LSTM layers?
Take a combination of the representations across layers (sketched below).
Freeze the states of the biLM layers.
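A reconstruction of that combination, following the notation of the ELMO paper (with the input treated as layer \(0\) and \(L\) bi-LSTM layers):

\[
\text{ELMO}_i^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{i,j}, \qquad h_{i,j} = [\overrightarrow{h_{i,j}};\overleftarrow{h_{i,j}}] \text{ for } j \ge 1
\]

where \(h_{i,0}\) is the (CNN-over-characters) token representation, \(s_j^{task}\) are softmax-normalized task-specific weights, and \(\gamma^{task}\) is a scalar that scales the whole vector.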
Learnable Components
"A stunning story with a lot of twists and ..."
Task Specific
Model
CNN over Characters
biLM
Considering the input as layer zero; \(s_j^{task}\) are the task-specific softmax-normalized weights in that combination.
How do we know that contextual embeddings are helping?
One way is to look at whether it improves the performance of the models used in downstream tasks
For example, Question-Answering system
Context: "A mouse chases a cat" / "A cat chases a mouse"
Question: "Who is chasing?"
ELMO embeddings improved performance not only in the QA task but also in named entity recognition, sentiment classification, and a few more NLP tasks.
How Contextual are Contextualized Word Representations?
Recall the examples we used earlier
he sat on the bank of the river and watched the currents
he went to the bank to check balance in his current account
Now, we expect that the embedding from ELMO for the word bank will be different based on the context.
How different are these representations?
The figure on the right shows two possible sets of representations for the word bank as it appears in different contexts.
In the top one, the representations of bank in the two contexts are close to each other.
In the bottom one, the representations of bank in the two contexts are far from each other.
The distribution of words in the embedding space of word2vec model is shown on the left.
We can clearly see that all (10K) words are distributed in all directions (it looks almost spherical)
The distribution is isotropic
What happens if we contextualize words? Does the representation still remain isotropic?
Static Embeddings
Computing the cosine similarity between any two random words will result in a low value (on average).
A higher cosine similarity between two words implies they are semantically closer.
Therefore, a set of close words forms a cone within the sphere.
Contextual Embeddings
Surprisingly, contextual embeddings lie in a cone for all words, and therefore they are anisotropic!
Moreover, the top layer captures contextual information better than the bottom layers (the cone becomes narrower in the top layer) (Ref).
That is, two random words, on average, have a high cosine similarity!
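A small sketch of how one could probe this (my own illustration, not the referenced paper's code): average cosine similarity over randomly sampled pairs of embedding vectors, where `vecs` is an assumed (num_words, dim) NumPy array:

```python
import numpy as np

def avg_random_cosine(vecs, n_pairs=10_000, seed=0):
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(vecs), n_pairs)
    j = rng.integers(0, len(vecs), n_pairs)
    a, b = vecs[i], vecs[j]
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    # Near zero => roughly isotropic (word2vec); clearly positive => anisotropic
    # (contextual embeddings concentrated in a cone).
    return cos.mean()
```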
Wishlist
A model that learns a representation for a word in its context
An approach that makes use of large-scale unlabelled data (that is, learns the context in an unsupervised setting)
Fine-tuning the language model for the end task (instead of training a task-specific model from scratch)
Universal Language Model Fine-tuning for Text Classification
LM-Pretraining
There are three stages of training
ULMFiT
WikiText
The first stage is the usual general-domain pre-training with a large text corpus.
The LM contains multiple LSTM layers.
The learning rate is the same for all parameters in the model.
LM-Pretraining
There are three stages of training
ULMFiT
WikiText
LM-Finetuning
IMDB
In the second stage, the LM is fine-tuned on the target-task data and vocabulary.
This is required as data from the target task may come from a different distribution.
The crucial step is to use different learning rates across layers, as different layers capture different types of information.
This is termed discriminative fine-tuning (a minimal sketch follows).
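A minimal sketch of discriminative fine-tuning in PyTorch: the model below is a hypothetical ULMFiT-style stack (not the paper's code), and the per-layer decay factor of 2.6 follows the paper's suggestion:

```python
import torch
import torch.nn as nn

# A hypothetical ULMFiT-style stack (sizes are illustrative).
class SmallLM(nn.Module):
    def __init__(self, vocab=10_000, dim=100, hidden=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab, dim)
        self.lstm1 = nn.LSTM(dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm3 = nn.LSTM(hidden, hidden, batch_first=True)

model = SmallLM()
base_lr = 1e-3                                                      # top layer's lr
layers = [model.lstm3, model.lstm2, model.lstm1, model.embedding]   # last -> first

# Lower layers get geometrically smaller learning rates (divide by 2.6 per layer).
param_groups = [{"params": l.parameters(), "lr": base_lr / (2.6 ** d)}
                for d, l in enumerate(layers)]
optimizer = torch.optim.SGD(param_groups)
```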
LM-Pretraining
LM-Finetuning
Classifier-Finetuning
There are three stages of training
ULMFiT
WikiText
IMDB
IMDB
Labels
Two linear layers with ReLU activation are added on top of the last LSTM layer.
Instead of fine-tuning the entire model all at once, the layers are gradually unfrozen from the last to the first layer and fine-tuned (sketched below).
With this approach, the performance of the model with 100 labelled samples is comparable to that of a model trained from scratch on 100x as many samples.
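A minimal sketch of gradual unfreezing, reusing the hypothetical stack and optimizer from the previous sketch and an assumed `train_one_epoch(model, optimizer)` helper (in ULMFiT the new classifier head would be unfrozen first; it is omitted here since SmallLM has no head):

```python
# Unfreezing order: from the last layer towards the first.
unfreeze_order = [model.lstm3, model.lstm2, model.lstm1, model.embedding]

# Start with every layer frozen.
for layer in unfreeze_order:
    for p in layer.parameters():
        p.requires_grad = False

# At each epoch, unfreeze one more layer and fine-tune the unfrozen part.
for layer in unfreeze_order:
    for p in layer.parameters():
        p.requires_grad = True
    train_one_epoch(model, optimizer)   # assumed training loop over labelled data
```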
ULMFiT pipeline: LM-Pretraining on WikiText → LM-Finetuning on IMDB → Classifier-Finetuning on IMDB with labels. Each stage uses the same stack (embedding layer, LSTM-1, LSTM-2, LSTM-3), with FFN + Softmax added in the classifier stage.
The same recipe for another target task: LM-Pretraining on WikiText → LM-Finetuning on Amazon Review → Classifier-Finetuning on Amazon Review with labels.