Transfer learning: Pre-training and Fine tuning

Arun Prakash A

Movie Sentiment Classification Task

[Figure: the review "A stunning story with lots of twists and.." is encoded word-by-word as one-hot vectors and fed into an LSTM model, which predicts \(\hat{y}=+\).]

Supervised learning: train the entire model, learning the word representations by randomly initializing the weights.

It is a legitimate attempt. However, what could be the drawbacks of this approach?

Learnable components: the word representations and the LSTM model.

Product Sentiment Classification Task

[Figure: the review "The trimmer is noisy, not customizable" is encoded word-by-word as one-hot vectors and fed into an LSTM model, which predicts \(\hat{y}=-\).]

The knowledge gained in the previous task is not transferable, even though the underlying task remains the same!

We have to train the model from scratch again (that is, randomly initialize the weights) using the new dataset.


Moreover, the vocabulary may not be the same (say, product names, customization, ..)

Is transfer learning possible?

That is, can we (somehow) reuse the parameters of a model trained on a similar task and fine-tune those parameters on the new task?

Motivation from vision

Suppose we train a CNN model to recognize a cat 

[Figure: a ConvNet classifies an input image as 98% cat, 2% not-cat.]

We can collect thousands of cat images and thousands of not-cat images and train the model.

Suppose we change the task to object detection, that is, detecting the bounding box of the cat in the image.

In other words, can we transfer the learning?

This is a related problem; the only difference is that the label is now the coordinates of the bounding box.

Can we make use of the ConvNet trained to classify the cat image?

[Figure: transfer learning. The same conv layers act as a feature extractor; for classification they feed a discriminator (98% cat, 2% not-cat), while for detection they feed a regressor that outputs the bounding-box coordinates \((x_1,y_1)\) and \((x_2,y_2)\).]

Fine-tune the parameters of the (multiple) conv layers; randomly initialize the parameters of the new layers.
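As a rough sketch of this recipe in PyTorch (the resnet18 backbone and layer names here are illustrative assumptions, not the lecture's exact setup): keep the pre-trained conv layers, replace the classification head with a randomly initialized regression head for the four box coordinates, and decide which parameters to fine-tune.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained ConvNet: conv layers already learned on ImageNet.
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Replace the classifier head with a randomly initialized regressor
# that predicts the bounding box (x1, y1, x2, y2).
backbone.fc = nn.Linear(backbone.fc.in_features, 4)

# Option A: fine-tune everything (conv layers start from pre-trained weights).
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

# Option B: freeze the feature extractor and train only the new head.
# for p in backbone.parameters():
#     p.requires_grad = False
# for p in backbone.fc.parameters():
#     p.requires_grad = True

criterion = nn.SmoothL1Loss()  # a common regression loss for box coordinates
```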

[Diagram: a CNN trained on ImageNet (millions of samples) — the (supervised) pre-training — is reused as a weight initializer (transfer learning) and fine-tuned for other downstream vision tasks (thousands of samples).]

Outputs from Conv layers act as feature representations.

These can be further fine-tuned for similar tasks like classification and object detection.

These feature representations are learned in a supervised setting 

Does it work for text data?

One solution we can think of is to use word2vec embeddings and fine-tune them as follows.

[Figure: the movie-review sentence, encoded as one-hot vectors, passes through a Word2Vec embedding layer (fine-tuned) into an LSTM model trained from scratch, which predicts \(\hat{y}=+\). The learnable components are the embeddings and the model.]

Yes. It (almost) solves the problem of a change in vocabulary across downstream tasks.
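A minimal sketch of this setup, assuming we already have a word2vec matrix aligned with the task vocabulary (an assumption for illustration): the embedding layer is initialized from the pre-trained vectors and fine-tuned, while the LSTM classifier is trained from scratch.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, w2v_matrix, hidden_size=128, num_classes=2):
        super().__init__()
        emb = torch.as_tensor(w2v_matrix, dtype=torch.float32)
        # Pre-trained word2vec vectors as initial weights; freeze=False
        # means the embeddings are fine-tuned on the new task.
        self.embedding = nn.Embedding.from_pretrained(emb, freeze=False)
        # The LSTM and classifier are trained from scratch (random init).
        self.lstm = nn.LSTM(emb.size(1), hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(x)
        return self.classifier(h_n[-1])      # logits for +/- sentiment
```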

How does self-supervised pre-training help?

It makes use of large-scale unlabelled data.

Getting labelled data is costly and time consuming

  • 100 billion words

  • 3 million words in the vocabulary

Limitations:

The context of a word in a sentence is not captured in the representation

The transfer learning happens only in the embedding (first) layer.

A common thread

[Figure: a 256 × 256 image of Bert.]

An image is composed of pixels at an elementary level, and of features at a higher level.

A sentence is composed of characters at an elementary level, and of words at a higher level.

The feature representations learned by CNN can be transferred and fine-tuned for other tasks

The word representation learned by word2vec can also be transferred and fine-tuned

However, there are some important differences.

Let's consider example sentences to illustrate the need for contextual representations:

he sat on the bank of the river and watched the currents

he went to the bank to check the balance in his current account

The 30 nearest neighbours of the word bank in the word2vec embedding space are shown in the figure.

The vector representation for the word bank doesn't change based on the context

Therefore, the model should take the responsibility to understand the words in its context.

Learning the context in the supervised setting is once again a limitation.

Limitations of Word2Vec

[Figure: the sentence "The mouse is too bad and the response is very poor" is passed through Word2Vec into a task-specific model; in the embedding space, mouse sits near both (computer) mouse / keyboard and (lab) mouse / cat.]

Word2vec can't have two representations for a word. However, a word can have multiple meanings based on its sentence context.

Recall that word2vec learns representations using the context of a word w.r.t. its surrounding words within a fixed-size window (independent of the order of the words).

Moreover, it doesn't include positional information. Think of a question-answering system, where the position of words is important:

A mouse chases a cat
A cat chases a mouse

Who is chasing?

A quick comparison of pre-training

(Supervised) pre-training in vision: the conv layers act as a feature extractor, and the learned features can be fine-tuned for any downstream task.

Unsupervised pre-training in text: Word2Vec or GloVe embeddings feed a task-specific model.

(Bert is a mature character in the comic series, whereas Elmo is mischievous.)

Wishlist

A model that learns a representation for a word in its context

Learned in Translation: Contextualized Word Vectors (CoVe)

[Figure: an attention-based encoder-decoder MT model. The encoder produces hidden states \(h_1,\dots,h_5\) over the input "I enjoyed the film transformers"; the context vector \(c_4=\sum_j \alpha_{4j} h_j\) combines them with attention weights \(\alpha_{41},\dots,\alpha_{45}\); the decoder states \(s_0,\dots,s_4\), starting from <Go>, emit the Tamil translation "Naan transfarmar padaththai rasiththen".]

Recall the encoder-decoder models used in MT

The entire input sentence is embedded in the hidden states of the encoder.

Can we make use of those hidden states to create a context vector?

However, \(h_1\) has encoded only the first word of the sentence; it has no idea about the rest of the words.

How do we fix it?

[Figure: hidden states \(h_1,\dots,h_5\) computed in both the forward and backward directions over "I enjoyed the film transformers".]

Use a bidirectional LSTM model (biLSTM).

h_i=[\overrightarrow{h_i}:\overleftarrow{h_i}]

CoVe

The vector \(h_i\) is called CoVe, contextual word vector.

Capturing the context of a word w.r.t. the sentence is enabled by the two hidden states computed in the forward \((\overrightarrow{h_i})\) and backward \((\overleftarrow{h_i})\) directions.

After training the model, we can use the encoder alone to get contextual embeddings for the words.

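Here is a small sketch of that idea (not the exact CoVe architecture, which is the encoder of an MT model): a bidirectional LSTM over GloVe-like input vectors whose per-token output is the concatenation \([\overrightarrow{h_i}:\overleftarrow{h_i}]\). The sizes are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 300-d input word vectors, 300-d hidden state per direction.
encoder = nn.LSTM(input_size=300, hidden_size=300, num_layers=2,
                  batch_first=True, bidirectional=True)

x = torch.randn(1, 5, 300)      # one 5-word sentence of (GloVe-like) vectors
outputs, _ = encoder(x)         # shape (1, 5, 600): forward and backward states concatenated
h_3 = outputs[0, 2]             # contextual vector for the 3rd word, [h3_forward : h3_backward]
```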

[Figure: the movie-review sentence, encoded as one-hot vectors, passes through the pre-trained encoder into a task-specific model; the task-specific model is the learnable component.]

Using it in the downstream task

Here, \(x\) is a sequence of words and \(v\) is the contextual embedding of those words:

v = [GloVe(x):CoVe(x)]

where GloVe(\(x\)) comes from the GloVe embedding layer and CoVe(\(x\)) from the pre-trained biLSTM encoder.

However, it is still a supervised pre-training approach.

Wishlist

A model that learns a representation for a word in its context

An approach that makes use of large unlabelled data (that is, learns the context in an unsupervised setting)

Language modelling

Given a sequence of words/tokens  (\(x_1,x_2,\cdots,x_T\)), the language model predicts the probability of the sequence

P(x_1,x_2,\cdots,x_T)=\prod \limits_{i=1}^T P(x_i|x_1,\cdots,x_{i-1})

For example

P(the,story,so,far,is,good)=P(the)P(story|the)\cdots P(good|the,story,so,far,is)
P(the,story,so,far,is,good)= 0.99
P(story,the,far,so,good,is)= 0.00001
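To make the chain rule concrete, here is a tiny hedged sketch: given any model of the conditional probability \(P(x_i|x_1,\cdots,x_{i-1})\), the sequence probability is the product of the conditionals (in practice we sum log-probabilities for numerical stability). The `cond_prob` callback is a placeholder assumption, not a specific library API.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """log P(x_1..x_T) = sum_i log P(x_i | x_1..x_{i-1}).

    cond_prob(history, token) can be any conditional model
    (an n-gram table, an LSTM, ...); it is assumed, not defined here.
    """
    log_p = 0.0
    for i, token in enumerate(tokens):
        log_p += math.log(cond_prob(tokens[:i], token))
    return log_p

# e.g. sequence_log_prob(["the", "story", "so", "far", "is", "good"], cond_prob)
```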

Recall, while training the word2vec model, we computed the conditional probabilities for a word.

Is that different from the language modelling then?

Yes; there we computed the probability of a word based on co-occurrence, not based on the context in which it appeared.

Approximation: Bigram LMs

P(the,story,so,far,is,good)=P(the)P(story|the)P(so|story)\cdots P(good|is)

N-gram (approximations of) LMs are not a good choice for contextual embeddings, as we know words can have long-range dependencies.

Therefore, language models predict a word given the entire history of words in the sentence it appears in.

The history can be either in forward direction or backward direction

P(x_1,x_2,\cdots,x_T)=\prod \limits_{i=1}^T P(x_i|x_1,\cdots,x_{i-1})
P(x_1,x_2,\cdots,x_T)=\prod \limits_{i=1}^T P(x_i|x_{i+1},\cdots,x_{T})

Forward

Backward

Bi-Directional Language modelling

The history (?) can be in both directions (forward and backward), and often it is helpful to look into the future.

I am ___ excited to watch the film as it is directed by Rajamouli

If we use only a unidirectional model, then the prediction could be either not or very.

However, in the case of bi-LMs, the word not is less likely and the word very is highly likely.

Once again, we can use LSTMs for the modelling; we refer to these as bi-LMs hereafter.

[Figure: a forward LSTM and a backward LSTM, each reading the sentence "I went to the bank".]

Assume these two LSTMs compute the hidden states (with initial states \(\overrightarrow{h_0},\overleftarrow{h_0}\)) independently, on different machines, for the same input sequence,

producing \(\overrightarrow{h_1},\dots,\overrightarrow{h_5}\) and \(\overleftarrow{h_1},\dots,\overleftarrow{h_5}\) respectively.

We can use two independent LSTM units   for bi-LMs

For convenience we can group these two independent LSTM units into a block and call it a single Layer

[Figure: the two independent LSTMs over "I went to the bank" are grouped into a single bi-LSTM layer producing \(\overrightarrow{h_1},\dots,\overrightarrow{h_5}\) and \(\overleftarrow{h_1},\dots,\overleftarrow{h_5}\).]

[Figure: two stacked bi-LSTM layers over "I went to the bank", each producing its own forward and backward hidden states.]

We can stack as many bi-LSTM layers as we want to get deeper representations :-)


We use just a single layer (instead of two) to illustrate the core idea of ELMO.

Embeddings from Language Models (ELMO)

The representations for the inputs are learned by a character-level CNN-based LM instead of word2vec.

[Figure: a single bi-LSTM layer over "I went to the bank", producing \(\overrightarrow{h_1},\dots,\overrightarrow{h_5}\) and \(\overleftarrow{h_1},\dots,\overleftarrow{h_5}\).]


Concatenate the hidden states of the forward and backward LSTMs at time step \(t\), that is, \([\overrightarrow{h_t}:\overleftarrow{h_t}]\).

[Figure: a Softmax layer on top of the hidden states predicts the words went, to, the, bank, \(\cdots\) at each time step.]

The Softmax layer is shared across time steps

The cross-entropy loss is accumulated across time steps, and the loss is minimized in both directions.
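A hedged sketch of such a loss, assuming we already have per-time-step logits from the forward and backward LSTMs through the shared softmax/projection layer: the forward direction is scored against next-word targets, the backward direction against previous-word targets, and the two cross-entropy sums are added.

```python
import torch
import torch.nn.functional as F

def bilm_loss(fwd_logits, bwd_logits, token_ids):
    """fwd_logits, bwd_logits: (batch, seq_len, vocab) from the shared softmax layer.
    token_ids: (batch, seq_len). Boundary positions are simply dropped here."""
    vocab = fwd_logits.size(-1)
    # Forward LM: state at position t predicts token t+1.
    fwd = F.cross_entropy(fwd_logits[:, :-1].reshape(-1, vocab),
                          token_ids[:, 1:].reshape(-1), reduction="sum")
    # Backward LM: state at position t predicts token t-1.
    bwd = F.cross_entropy(bwd_logits[:, 1:].reshape(-1, vocab),
                          token_ids[:, :-1].reshape(-1), reduction="sum")
    return fwd + bwd   # accumulated across time steps, minimized in both directions
```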

How do we use ELMO embeddings in downstream tasks?

[Figure: the movie-review sentence, encoded as one-hot vectors, passes through the ELMO pipeline into a task-specific model; the task-specific model is the learnable component.]

[Diagram: \(x \rightarrow\) CNN over characters \(\rightarrow \hat{x} \rightarrow\) biLM \(\rightarrow (\overrightarrow{h},\overleftarrow{h}) \rightarrow v\)]

The representation for the \(i^{th}\) word: \(R_{i}=\{\hat{x_i},\overrightarrow{h_{i}},\overleftarrow{h_{i}}\}\)

The input to the downstream task: \(v_i=\gamma^{task}R_i\)

What if we have multiple bi-LSTM layers? Take a combination of the representations across layers.

Freeze the biLM layers while training the downstream task.

[Figure: the same downstream setup with a multi-layer biLM: \(x \rightarrow\) CNN over characters \(\rightarrow \hat{x} \rightarrow\) biLM \(\rightarrow v \rightarrow\) task-specific model.]

With \(L\) bi-LSTM layers, the biLM exposes \(\overrightarrow{h_{1:L}}\) and \(\overleftarrow{h_{1:L}}\), and the representation for the \(i^{th}\) word collects all of them (considering the input as layer zero):

R_{i}=\{\hat{x_{ij}},\overrightarrow{h_{ij}},\overleftarrow{h_{ij}}\}=\{h_{ij}\}, \quad j=0,1,\cdots,L

v_i=\gamma^{task}\sum \limits_{j=0}^L s_j^{task}h_{ij}

where \(s_j^{task}\) are task-specific softmax-normalized weights and \(\gamma^{task}\) is a task-specific scalar.
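A small sketch of that weighted combination (the ELMO "scalar mix"), with illustrative shapes: softmax-normalized weights over the \(L+1\) layers and a global scale \(\gamma\), both learned jointly with the downstream task.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """v_i = gamma * sum_j softmax(s)_j * h_{ij}, over L+1 layers (input = layer 0)."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # task-specific layer weights
        self.gamma = nn.Parameter(torch.ones(1))         # task-specific scale

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq_len, dim) stacked biLM representations
        weights = torch.softmax(self.s, dim=0).view(-1, 1, 1, 1)
        return self.gamma * (weights * layer_states).sum(dim=0)

# e.g. mix = ScalarMix(num_layers=3); v = mix(torch.randn(3, 1, 5, 1024))
```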

How do we know that contextual embeddings are helping?

One way is to look at whether it improves the performance of the models used in downstream tasks

For example, a question-answering system:

Context: A mouse chases a cat. A cat chases a mouse.
Question: Who is chasing?

ELMO embeddings improved performance not only on the QA task but also on named entity recognition, sentiment classification, and a few more NLP tasks.

How Contextual are Contextualized Word Representations?

Recall the examples we used earlier

he sat on the bank of the river and watched the currents

he went  to the bank  to check balance in his current account

Now, we expect that the  embedding from ELMO for the word bank will be different based on the context


How different are these representations?


The figure on the right shows two possible sets of representations for the word bank appearing in different contexts.

The top one gives different representations for the word bank which are closer to each other

Whereas the bottom one gives different representations for the word bank which are far from each other

The distribution of words in the embedding space of word2vec model is shown on the left.

We can clearly see that  all (10K) words are distributed in all directions (it looks almost spherical)

The distribution is isotropic

What happens if we contextualize words? Does the representation still remain isotropic?

Static Embeddings

Computing the cosine similarity between any two random words results in a low value (on average).

Higher cosine similarity between any two words implies they are semantically closer.

Therefore, a set of semantically close words forms a narrow cone in the embedding space.

Contextual Embeddings

Surprisingly, contextual embeddings occupy such a cone for all words, and therefore they are anisotropic!

Moreover, the top layer captures contextual information better than the bottom layers (the cone becomes narrower in the top layer). (Ref)

That is, two random words on average have high cosine similarity!
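One hedged way to check this numerically (a rough sketch, not the exact analysis in the reference): sample random pairs of word vectors and average their cosine similarity; values near zero suggest an isotropic space, while consistently high values suggest an anisotropic cone.

```python
import numpy as np

def mean_pairwise_cosine(vectors, n_pairs=10000, seed=0):
    """Average cosine similarity between randomly chosen pairs of distinct vectors."""
    rng = np.random.default_rng(seed)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), n_pairs)
    j = rng.integers(0, len(unit), n_pairs)
    keep = i != j                                      # drop self-pairs
    return float(np.mean(np.sum(unit[i[keep]] * unit[j[keep]], axis=1)))

# e.g. compare mean_pairwise_cosine(word2vec_vectors) with
# mean_pairwise_cosine(elmo_top_layer_vectors): the latter is expected to be higher.
```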

Wishlist

A model that learns a representation for a word in its context

An approach that makes use of large unlabelled data (that is, learns the context in an unsupervised setting)

Fine-tune the language model for the end task (instead of training a task-specific model from scratch)

Universal Language Model Fine-tuning for Text Classification

ULMFiT: there are three stages of training.

Stage 1: LM pre-training (WikiText, learning rate \(\eta\))

The first stage is the usual general-domain pre-training on a large text corpus.

The LM contains multiple LSTM layers.

The learning rate is the same for all parameters in the model.


Stage 2: LM fine-tuning (IMDB, per-layer learning rates \(\eta^l_t\))

In the second stage, the LM is fine-tuned on the data of the target task.

This is required as the data from the target task may come from a different distribution.

The crucial step is to use a different learning rate for each layer, as different layers capture different types of information.

This is termed discriminative fine-tuning.
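A minimal sketch of discriminative fine-tuning using PyTorch parameter groups; the stand-in model, layer names, and the decay factor of 2.6 (the value suggested in the ULMFiT paper) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A stand-in LM with an embedding layer and three stacked LSTM layers, as in the figure.
class ToyLM(nn.Module):
    def __init__(self, vocab=10000, emb=400, hid=1150):
        super().__init__()
        self.embedding = nn.Embedding(vocab, emb)
        self.lstm1 = nn.LSTM(emb, hid, batch_first=True)
        self.lstm2 = nn.LSTM(hid, hid, batch_first=True)
        self.lstm3 = nn.LSTM(hid, emb, batch_first=True)

model = ToyLM()
layers = [model.embedding, model.lstm1, model.lstm2, model.lstm3]   # bottom -> top

base_lr, decay = 1e-3, 2.6     # ULMFiT suggests eta^{l-1} = eta^l / 2.6
param_groups = [{"params": layer.parameters(), "lr": base_lr / decay ** depth}
                for depth, layer in enumerate(reversed(layers))]    # top layer gets base_lr
optimizer = torch.optim.SGD(param_groups, lr=base_lr)
```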

Stage 3: Classifier fine-tuning (IMDB with labels, per-layer learning rates \(\eta^l_t\))

Two linear layers with a ReLU activation are added on top of the last LSTM layer.

Instead of fine-tuning the entire model all at once, the layers are gradually unfrozen from the last layer to the first and fine-tuned.

With this approach, the performance of a model trained with 100 labelled samples is comparable to that of a model trained from scratch with 100x as many samples.
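A hedged sketch of gradual unfreezing: start with every layer group frozen, then, one epoch at a time, unfreeze the next group from the top and continue fine-tuning. The `train_one_epoch` callback and the bottom-to-top ordering of `layer_groups` are assumptions for illustration.

```python
def gradual_unfreeze(layer_groups, train_one_epoch):
    """layer_groups: modules ordered bottom -> top (e.g. [embedding, lstm1, lstm2, lstm3, head]).
    train_one_epoch: callback that runs one epoch of classifier fine-tuning."""
    for group in layer_groups:                 # start with the whole network frozen
        for p in group.parameters():
            p.requires_grad = False
    for group in reversed(layer_groups):       # unfreeze one more group per epoch, top first
        for p in group.parameters():
            p.requires_grad = True
        train_one_epoch()                      # only the unfrozen layers are updated
```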

[Diagram: the ULMFiT pipeline. LM pre-training on WikiText (embedding layer + LSTM-1/2/3, single learning rate \(\eta\)) \(\rightarrow\) LM fine-tuning on IMDB (per-layer rates \(\eta^1_t,\dots,\eta^4_t\)) \(\rightarrow\) classifier fine-tuning on labelled IMDB data (FFN + Softmax head, rates \(\eta^1_t,\dots,\eta^5_t\)).]

[Diagram: the same ULMFiT pipeline with a different target domain: LM pre-training on WikiText, LM fine-tuning on Amazon Review data, and classifier fine-tuning on labelled Amazon Review data.]