Deep Learning Practice - NLP

Adapting to Downstream Tasks:

Fine-tuning, Prompting, Instruction Tuning and Preference tuning

Mitesh M. Khapra, Arun Prakash A

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Pre-Training

%\mathscr{L}=-\sum \limits_{\mathbb{x}\in V:X \in \mathcal{X}} \sum \limits_{i=1}^T y_i\log (\hat{y_i})) \mathscr{L}=-\sum \limits_{\mathbb{D}} \sum \limits_{i=1}^T y_i\log (\hat{y_i}))

Minimize

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

\hat{y_1}
\hat{y_4}
\cdots
h_1
h_2
h_3
h_{12}
W_v
W_v
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

x_1
x_2
x_3
\langle \ go \rangle

We trained the GPT-2  model with the CLM (Causal Language Modelling) training objective

Text generation

Classification

Conversation

However, how can we adapt it to different downstream tasks ?

sentiment, NER,..

QA, translate, summarize

Chat bot

One approach for using the pre-trained models for downstream tasks is to independently fine-tune the parameters of the model for each task 

BERT, GPT,BART

 Dataset

BERT, GPT,BART

BERT, GPT,BART

BERT, GPT,BART

That is, we make a copy of the pre-trained model for each task and fine-tune it on the dataset specific to that task

Pre-training

Fine-tuning

 Standford Sentiment Tree Bank (SST)

\lbrace+,- \rbrace

SNLI

\lbrace \ entail,\\ contradict, \\neutral\rbrace

LAMBADA

Predict the last word of a Long sentence

\nabla_\theta L
\nabla_\theta L
\nabla_\theta L

Fine-tuning for Text Classfication

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

\hat{y}
h_1
h_2
h_3
h_{12}
W_y
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

x_1
x_2
\langle s \rangle
\langle e \rangle

Each sample in a labelled data set \(\mathcal{C}\) consists of a sequence of tokens \(x_1,x_2,\cdots,x_m\) with the label \(y\)

Initialize the parameters with the parameters learned by solving the pre-trianing objective

At the input side, add additional tokens based on the type of downstream task. For example, start  \(\langle s \rangle\) and end \(\langle e \rangle\) tokens for classification tasks

At the output side, replace the pre-training LM head with the classification head (a linear layer \(W_y\))

Fine-tuning involves adapting a model for various downstream tasks (with a minimal or no change in the architecture) 

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

\hat{y}
h_1
h_2
h_3
h_{12}
W_y
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

Now our objective is to predict the label of the input sequence

\mathscr{L}=-\sum \limits_{(x,y)} \log (\hat{y_i})
\hat{y}=P(y|x_1,\cdots,x_m)
=softmax(W_yh_{l}^m)

Note that we take the output representation at the last time step of the last layer \(h_l^m\).

It makes sense as the entire sentence is encoded only at the last time step due to causal masking.

Then we can minimize the following objective 

Note that \(W_y\) is randomly intialized. Padding or truncation is applied if the length of input sequence is less or greater than the context length

h_{_{12}}^m
x_1
x_2
\langle s \rangle
\langle e \rangle

Fine-tuning for Text Classfication

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

\hat{y}
h_1
h_2
h_3
h_{12}
W_y
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

We can freeze the pre-trained model parameters and randomly initialize the weights of the classification head (\(W_y\)) while training the model 

h_{_{12}}^m
x_1
x_2
\langle s \rangle
\langle e \rangle

In this case, the pre-trained model acts as a feature extractor and the classification head act as a simple classifier.

Fine-tuning for Text Classfication

The other option is to train all the parameters of the model which is called full fine-tuning

In general, the latter approach provides better performance on downstream tasks than the former.

However, there is a catch.

Fruit Fly

Honey Bee

Mouse

Cat

Brain

>10^6
10^9
10^{12}
10^{13}
10^{15}

#  Synapses

124M

GPT-2 Small

We have trained GPT-2 small that has about 124 million parameters

It requires at least 3 to 4 GB of GPU memory to train the model with a batch of size 1 (assuming 4 bytes per parameter and adam optimizer)

(T4 has 16 GB of Memory)

Fruit Fly

Honey Bee

Mouse

Cat

Brain

>10^6
10^9
10^{12}
10^{13}
10^{15}

#  Synapses

124M

GPT-2 Small

What about the memory required to fine-tune GPT-2 Extra Large?

It requires at least 22 to 24 GB of GPU memory to train the model with a batch of size 1 (assuming 4 bytes per parameter and adam optimizer)

(V100 has 32 GB of Memory)

1.5G

GPT-2 Extra Large

Fruit Fly

Honey Bee

Mouse

Cat

Brain

>10^6
10^9
10^{12}
10^{13}
10^{15}

#  Synapses

1.5B

GPT-2

10B

Megatron LM

175B

GPT-3

124M

GPT-2 Small

90B

Llama-3.2

What about the memory required to fine-tune huge models?

A single node with multiple GPUs for models having a few Billion parameters

Multi node GPU clusters for models having  Billions of  parameters

Fruit Fly

Honey Bee

Mouse

Cat

Brain

>10^6
10^9
10^{12}
10^{13}
10^{15}

#  Synapses

1.5B

GPT-2

3B
175B

GPT-3

124M

GPT-2 Small

90B

Llama-3.2

Suppose that we have a single V100 GPU 

Llama-3.2 light

Can we at least fine-tune a 3B parameters model which requires about 48 GB of Memory?

Is gradient-based fine-tuning the only approach?

If yes, what if we do not have enough samples for supervised fine-tuning?

(V100 has 32 GB of Memory)

How do we finetune the model for text interactions (like chatbot)?

Can we at least fine-tune 3B parameters models which require about 48 GB of Memory?

Many approaches have been developed to tackle the memory requirement to fine-tune large models. 

Of these Parameter Efficient Fine Tuning (PEFT) techniques, LoRa, QLoRa and AdaLoRa are most commonly used (optionally, in combination with quantization)

Emerging Abilities

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

\hat{y}
h_1
h_2
h_3
h_{12}
W_y
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

x_1
x_{2}
x_5
\langle s \rangle
\langle e \rangle
x_3
x_4

Fine-tuning large models is costly, as it still requires thousands of samples to perform well in a downstream task [Ref]

Moreover, this is not how humans adapt to different tasks once they understand the language.

We can give them a book to read and  "prompt" them to summarize it or find an answer for a specific question.

That is, we do not need "supervised fine-tuning" at all (for most of the tasks)

In a nutshell, we want a single model that learns to do multiple tasks with zero or few examples (instead of thousands)! 

Some tasks may not have enough labelled samples

Can we adapt a pre-trained language model for downstream tasks without any explicit supervision (called zero-shot transfer)?

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

h_1
h_2
h_3
h_{12}
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

\langle s \rangle
\langle e \rangle
\cdots
sentence:task
task\ specific \ output

Note that, the prompts (or instructions) are words. Therefore,we just need to change the single task (LM) formulation

p(output|input)

to  multi task

p(output|input,task)

where the task is just an instruction in plain  words that is prepended (appended) to the input sequence during inference

Yes, with a simple tweak to the input text!

Surprisingly, this induces a model to output a task specific response for the same input

For example,

Input text: I enjoyed watching the movie transformer along with my .....

Task:  summarize (or TL;DR:)

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 12

h_1
h_2
h_3
h_{12}
\vdots
\vdots
\vdots
\vdots

Embedding Matrix

\langle s \rangle
\langle e \rangle
\cdots

summarize: ip text

Watched the transformer movie

To get a good performance, we need to scale up both the model size and the data size

Transformer Block 1

Transformer Block 2

Transformer Block 12

\vdots
\vdots
\vdots
\vdots

Embedding Matrix

\langle e \rangle
\cdots
\langle s \rangle

GPT with 110 Million Parameters

GPT-2 with 1.5 Billion Parameters

Layers: \(4X\)

Parameters: \(10X\)

Transformer Block 1

Transformer Block 2

Transformer Block 3

Transformer Block 48

\vdots
\vdots
\vdots
\vdots

Embedding Matrix

\langle e \rangle
\cdots
\langle s \rangle

Pushing the limits: 1.5B to  175B

The ability to learn from a few examples improves as model size increases

For certain tasks, the performance is comparable to that achieved through full task-specific fine-tuning.

Since the model learns a new task from samples within the context window, this approach is called 'in-context' learning.

This remarkable ability enables the adaptation of the model to downstream tasks without the need for gradient-based fine-tuning.

Adaptation happens on-the-fly in inference mode (which consumes far less memory)

Prompting

This new ability paved the way for fine-tuning the model for specific tasks by Prompting

There are many ways of prompting the model. For example,

However, there is a catch

  1. Zero-shot

  2. Few-shot (in-context)

  3. Chain of Thought 

  4. Prompt Chaining

Since adaptation occurs during inference, there is no need to share model weights for fine-tuning.

This approach enables the model's deployment across a variety of use cases through simple API calls (making it more accessible) 

guide the user to reach the destination 

Text:how to reach Marina Beach from IIT Madras
by train
Classify the text into neutral, negative or positive. 

Text: I enjoyed watching the transformers movie.
Sentiment:
positive

The model is unable to follow the user's intent and instead just completes the text coherently

Instruction Tuning

Zero-shot learning performance is often poor, despite the model size (say, GPT 3 175B), especially in following the user intent (instructions).

Refer to the FLAN paper for more details

How do we improve the zero-shot learning performance?

Fine-tune the model on the instructions (one approach is to reformat the available datasets into instruction sets)

Preference Tuning via RLHF

"Making language models bigger does not inherently make them better at following a user’s intent." -[Instruct GPT paper]

This requires  human labelled demonstrations for a collection of prompts (a labour intensive task)

Language Modelling objective is "misaligned" with user intent

Therefore, we must fine-tune the model to align with the user intent.

Use these collections to fine-tune (Supervised Fine Tuning , SFT ) the model using Reinforcement Leanring from Human Feedback (RLHF)

Preference Tuning via RLHF

An imagined example 

Text: Guide me on how to reach Marina Beach from IIT Madras

GPT 

by train

Text: Guide me on how to reach Marina Beach from IIT Madras

Instruct GPT 

*Actual response from ChatGPT

Fruit Fly

Honey Bee

Mouse

Cat

Brain

>10^6
10^9
10^{12}
10^{13}
10^{15}

#  Synapses

1.5B

GPT-2

3B
175B

GPT-3

124M

GPT-2 Small

90B

Llama-3.2

Suppose that we have a single V100 GPU 

Llama-3.2 light

Can we at least fine-tune 3B parameters models which require about 48 GB of Memory?

Is gradient-based fine-tuning the only approach?

If yes, what if we do not have enough samples for supervised fine-tuning?

(V100 has 32 GB of Memory)

How do we finetune the model for text interactions (like chatbot)?

Yes

Prompting

No. Prompting works during inference

Instruction tuning and preference tuning (DPO,RLHF,IPO,..)

Fine-tuning 

Supervised Fine-tuning (SFT)

(task agnostic) Instruction-tuning 

Task-specific Full Fine-tuning 

1. PEFT (Parameter Efficient Fine-tuning): LoRA, QLoRA, AdaLora

Prompt tuning

Zero-shot,

Memory efficient fine-tuning*

few-shot

Chain of Thought

Prompt chaining

Meta prompting

\vdots

2. Quantization

Now we try grouping the approaches broadly into two categories

Preference Tuning: RLHF, DPO, IPO, KTO..

*general techniques  to reduce memory requirement, suitable for any fine-tuning schemes 

 Dataset

Initialize Model

Tokenizer

from datasets import load_dataset
from transformers import AutoTokenizer

DataLoader

The list of modules we used so far

Train the model

from transformers import DataCollatorForLanguageModeling
from transformers import GPT2Config, GPT2LMHeadModel
from transformers import TrainingArguments, Trainer

 Dataset

Initialize Model with pre-trained weights

Tokenizer

DataLoader

Finetune the model

from transformers import GPT2ForSequenceClassification
model = GPT2ForSequenceClassification.from_pretrained()
from transformers import TrainingArguments, Trainer
from peft import LoraConfig

Let's dive in

Additional Modules:

  1. peft,

  2. trl,SFTTrainer (for preference tuning)

  3. bitsandbytes (quantization)

  4. Unsloth (for single-gpu, 2.5x faster training)

Notebooks: Github Link

Experimental Setup

Outline

Supervised Fine Tuning

Continual PreTraining

Instruction Fine Tuning

How does LLM generate a factual answer for a question?

How does LLM generate a factual answer for a question?

InstructGPT

RAG

Agent

Indexing

Get the learned embedding