Large Language Models:

Challenges and Opportunities

Mitesh M. Khapra, Arun Prakash A

Evolution of How Information is Stored and Retrieved!

- Stone/Iron Age: Carved in stones
- Industrial Age: Written on paper
- Digital Age: Digitized
- The Age of AI [has begun]*: Parameterized

*From Gatesnotes

Store and Retrieve

Store and Generate!

Magic Box

"Any sufficiently advanced technology is indistinguishable from magic."

- Creative Text Generation
- Simple Sentiment Classification
- Logical Reasoning
- Doing arithmetic

Who is inside the Magic box?

I'm sure there must be a few expert dwarves in the box!

That's why we get convincing responses for all questions

Well, it generates an image from a textual description!

Image: Children in 1970s India suspecting that there are people inside a radio or television set.

There must be a dwarf inside the box...

Magic Box

Figure: Inside the magic box is a stack of N × Multi-head Masked Attention layers. The prompt tokens "tell", "joke", "about", "idli" go in, and the response tokens "why", "did", "the", ... come out one at a time through a block labelled W_v. Each generated token is fed back in as input for the next step.

The Magic:

Train the models to predict next word given all previous words
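
A minimal sketch of what "predict the next word given all previous words" means for the training data: every position in a sentence becomes one example. The function name and the sample sentence are made up for illustration, and real models work on sub-word tokens rather than whitespace-separated words.

```python
def next_word_examples(sentence):
    """Split a sentence into (previous words, next word) training pairs."""
    words = sentence.split()
    # The word at position i is predicted from words[0:i].
    # The first word has no context, so we start at position 1.
    return [(words[:i], words[i]) for i in range(1, len(words))]


# Illustrative sentence, not taken from any dataset
pairs = next_word_examples("why did the idli cross the road")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sentence of seven words yields six examples, which is why raw text alone, with no human labels, is enough to train on.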


“The magician takes the ordinary something and makes it do something extraordinary.”

Traditional NLP Models: one model per task, each trained on its own labelled data

- Labelled data for task-1, Model-1: Input text to predicted class/sentiment
- Labelled data for task-2, Model-2: Input text to summary
- Labelled data for task-3, Model-3: Question and input text to answer

Large Language Models: one model trained on raw text data (cleaned)

- Prompt: Input text
- Output: response conditioned on the prompt
- Prompt: Predict sentiment, summarize, fill in the blank, generate story
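
The contrast between one model per task and one prompted model can be sketched as follows. The template wording and the task names are hypothetical; the point is only that the task is chosen by the prompt text, not by switching models.

```python
# Hypothetical prompt templates: one per task, all intended for the same LLM.
TEMPLATES = {
    "sentiment": "Predict the sentiment (positive or negative) of this text:\n{text}",
    "summarize": "Summarize this text in one sentence:\n{text}",
    "fill_blank": "Fill in the blank:\n{text}",
    "story": "Write a short story that begins with:\n{text}",
}


def build_prompt(task, text):
    """Return the prompt that asks a single model to perform `task` on `text`."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    return TEMPLATES[task].format(text=text)


print(build_prompt("sentiment", "The idli was soft and delicious."))
print(build_prompt("summarize", "The idli was soft and delicious."))
```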

Trident of LLMs

- Trillions of Tokens
- Billions of Parameters
- Zetta FLOPS of Compute

Three Stages

- Pre-training
- Fine-tuning
- Inference

Figure: Trillions of tokens are fed to the LLM, which is trained to output the next token.

“The magician takes the ordinary something and makes it do something extraordinary.”

Something Ordinary: Predict next token

To Extraordinary: and next token, next token, ...

Sourcing billions of tokens from the Internet is a massive engineering effort!!

Pre-Training

By doing this, the model eventually learns language structure, grammar, and world knowledge!
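
As a toy illustration of learning from next-token prediction and then generating "next token, next token, ...", here is a sketch of a bigram model built by counting. This is a deliberate simplification: it conditions only on the single previous word, whereas an LLM conditions on all previous tokens with a neural network. The three-sentence corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_tokens=5):
    """Greedy generation: repeatedly append the most frequent follower."""
    out = [start]
    for _ in range(max_tokens):
        followers = counts.get(out[-1])
        if not followers:  # no continuation was ever observed
            break
        out.append(followers.most_common(1)[0][0])
    return out


# Tiny illustrative corpus
corpus = ["the idli is soft", "the idli is hot", "the dosa is hot"]
model = train_bigram(corpus)
print(" ".join(generate(model, "the")))  # prints "the idli is hot"
```

Even this counting model picks up a little "structure" (an article is followed by a noun, a noun by "is"); scaling the same objective to trillions of tokens and billions of parameters is what the slides call pre-training.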

Trillions of Tokens

Figure: Growth of pre-training dataset sizes, 2015 to 2024 (size axis from 1 GB to 10 TB)

- BookCorpus: 5 GB
- Wikipedia: 21 GB
- WebText (closed): 40 GB
- RealNews: 110 GB
- The Pile: 800 GB
- ROOTS: 1.6 TB
- Falcon: 2.6 TB
- RedPajama: 5 TB
- DOLMA: 11 TB
- C4

Opportunity: Build one

Challenge: Inadequate quality datasets for Indic languages

Dataset Name | # of tokens | Diversity | Languages
C4 | ~156 Billion | Webpage | English
The Pile | ~170 Billion | 22 sources | English
Stack | > 1 Trillion | 380 programming languages | Code
RefinedWeb (Falcon) | 5 Trillion (600B in public) | Webpage | English
RedPajama-v1/v2 | 1.2/30 Trillion | Webpage, Books, Arxiv, Wiki, StackExch | English/Multi
DOLMA | 3 Trillion | Webpage, Books, Wiki, The Stack, STEM | English
mC4 | ~418 Billion | Webpage | Multi
ROOTS | ~341 Billion | Natural and programming languages | Multi
Sangraha | 251 Billion | Web, videos, digitized PDF, synthetic | Multi

Effort by AI4Bharat

English data

Capture all India-specific knowledge in all Indian languages!

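Billions of Parameters

Figure: Parameter counts of language models compared with synapse counts in biological brains

# Synapses

- Fruit Fly: >10^{6}
- Honey Bee: 10^{9}
- Mouse: 10^{12}
- Cat: 10^{13}
- Brain: 10^{15}

# Parameters

- Transformer: 400M
- GPT-2: 1.5B
- Megatron LM: 10B
- GPT-3: 175B
- GShard: 1.1T
- Switch: 1.6T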

Zetta FLOPS of Compute

Training LLMs with more than 70 billion parameters is affordable for only a few organizations around the world.

It requires a cluster of A100 or H100 GPUs costing millions of dollars.

Then, how do we adapt those models for diverse Indian cultures and languages?
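
To make "Zetta FLOPS of Compute" concrete, here is a back-of-the-envelope sketch. It uses the widely quoted approximation that training costs about 6 floating-point operations per parameter per token. The token count (2 trillion), the 40% utilization, and the cluster size (1024 GPUs) are assumptions chosen for illustration, not figures from the slides; 312 TFLOPS is the A100's advertised dense BF16 peak.

```python
import math


def training_flops(n_params, n_tokens):
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def gpu_days(total_flops, peak_flops_per_gpu, utilization, n_gpus):
    """Wall-clock days if every GPU sustains `utilization` of its peak rate."""
    seconds = total_flops / (peak_flops_per_gpu * utilization * n_gpus)
    return seconds / 86400


# Assumed workload: 70B parameters trained on 2T tokens
flops = training_flops(70e9, 2e12)
print(f"{flops / 1e21:.0f} zettaFLOPs")  # 1 zetta = 10^21

# Assumed hardware: 1024 A100s at 40% of their 312 TFLOPS peak
days = gpu_days(flops, peak_flops_per_gpu=312e12, utilization=0.4, n_gpus=1024)
print(f"about {days:.0f} days on 1024 GPUs")
```

Under these assumptions the run needs hundreds of zettaFLOPs and keeps a thousand-GPU cluster busy for months, which is why so few organizations can afford it.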

Way to go: Language Adaptation?

Pre-trained open-sourced LLM

- Trillions of Tokens
- Billions of Parameters
- Zetta FLOPS of Compute

Fully fine-tuned (Sangraha)

- Billions of Tokens
- Billions of Parameters
- Peta FLOPS of Compute

Billions of Parameters

Figure: The same comparison of model parameter counts (Transformer 400M to Switch 1.6T) and synapse counts (Fruit Fly to Brain) shown earlier, now annotated "Affordable for inference".

Opportunity: Use instruction fine-tuning and build datasets for the same

Challenge: Full fine-tuning of LLMs on Indic datasets still requires a lot of compute and is expensive

Way to go: Instruction Fine-Tuning?


LLMs exhibit a remarkable learning ability called "in-context learning".

It means we can instruct them to respond in a certain way by giving them a set of examples of the task during inference [the cheapest option].
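
A minimal sketch of in-context learning as prompt construction: the examples of the task are placed in the prompt itself and the model's weights are never updated. The "Text:" / "Sentiment:" layout and the example sentences are assumptions for illustration; the resulting string would be sent to whichever LLM is in use.

```python
def few_shot_prompt(instruction, examples, query):
    """Build a prompt that shows solved examples before the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Sentiment: {label}", ""]
    # The final entry is left unanswered for the model to complete.
    lines += [f"Text: {query}", "Sentiment:"]
    return "\n".join(lines)


# Illustrative demonstrations of the task
examples = [
    ("The idli was soft and fresh.", "positive"),
    ("The sambar was cold and bland.", "negative"),
]
prompt = few_shot_prompt(
    "Classify the sentiment of each text as positive or negative.",
    examples,
    "The chutney was excellent.",
)
print(prompt)
```

Changing the task only requires swapping the instruction and the examples, which is why this is the cheapest way to adapt a model.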

Way to go: Instruction Fine-Tuning

Instruction-tuned (Indic-Align)

- Millions of Tokens
- Billions of Parameters
- Tera FLOPS of Compute

Goal: Improve the model's ability to understand and follow human instructions, and ensure that its responses are aligned with human expectations and values.

How it works: Train the model on a relatively small set of high-quality, diverse instruction-answer pairs.
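
One common way to turn an instruction-answer pair into a training example is sketched below. The "### Instruction / ### Response" template, the whitespace tokenization, and the choice to compute the loss only on the answer tokens are all assumptions for illustration; the slides do not specify how the pairs are formatted.

```python
def format_example(instruction, answer):
    """Wrap a pair in a fixed template (hypothetical, Alpaca-like layout)."""
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    return prompt, answer


def build_training_example(instruction, answer):
    """Return tokens plus a mask marking which tokens contribute to the loss."""
    prompt, answer = format_example(instruction, answer)
    # Whitespace splitting stands in for a real sub-word tokenizer.
    prompt_tokens = prompt.split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    # 0 = instruction part (ignored by the loss), 1 = answer part (trained on)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask


tokens, mask = build_training_example(
    "Name one South Indian breakfast dish.", "Idli is a popular choice."
)
for tok, m in zip(tokens, mask):
    print(m, tok)
```

The training objective is still next-token prediction; only the data (and here the mask) changes, which is why far fewer tokens are needed than in pre-training.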

How do we source the data?

From all the places where a conversation happens!

What should be the size?

Significantly smaller than the pre-training data, but of much higher quality!

Figure: Sizes of instruction-tuning datasets (axis from <10K to 10M)

- Alpaca: 52K
- Unnatural: 64K
- Self-Instruct: 84K
- Evolved Instruct: 143K
- Guanaco: 534K
- Natural Inst: 620K
- Ultra Chat: 1.5M
- P3: 12M
- FLAN: 15M

For Indian languages?

Indic-Align: existing English data plus synthetic India-centric conversations

Capture all different ways in which people can ask!!

Evaluation

How do we compare the performance of one model to the other?

How good is the model at solving a given task?

Are there any more hidden skills we don't know about?

Still a lot of open questions to explore here.

How good is the model in other languages?

Is the model biased? Is it toxic? Is it harmful?

There are hundreds of models in the market (ChatGPT, Llama, Gemma, Sutra, ...)
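
The simplest way to compare two models on a given task is to run both on the same benchmark and score the answers. The sketch below uses exact-match accuracy; the three-question benchmark and the two stand-in "models" (plain lookup functions) are made up for illustration, and real evaluations need far larger benchmarks and often more forgiving metrics.

```python
def exact_match_accuracy(model, benchmark):
    """Fraction of questions where the model's answer equals the reference."""
    correct = sum(
        model(question).strip().lower() == answer.strip().lower()
        for question, answer in benchmark
    )
    return correct / len(benchmark)


# Tiny illustrative benchmark of (question, reference answer) pairs
benchmark = [
    ("2 + 2 = ?", "4"),
    ("Capital of India?", "New Delhi"),
    ("Opposite of hot?", "cold"),
]


# Stand-ins for real models: each maps a question string to an answer string.
def model_a(question):
    return {"2 + 2 = ?": "4", "Capital of India?": "new delhi"}.get(question, "")


def model_b(question):
    return {"2 + 2 = ?": "4"}.get(question, "")


for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(exact_match_accuracy(model, benchmark), 2))
```

Exact match says nothing about bias, toxicity, or performance in other languages, so each of the questions above needs its own benchmark and metric.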

What is the next big direction?

If a single architecture works for text, image, sound, and video, then why not train the architecture on all these modalities?

...and it is already happening

That is called a multimodal LLM

Again, what about the data for the Indian context?

It is both a challenge and an opportunity!

Opportunities are Plenty

We now know that

- The cost of pre-training is prohibitive!
- The cost of fine-tuning is still high!

but running the model in inference mode is cheaper

- It requires far less compute than pre-training and fine-tuning
- One can access models via APIs [no need to set up anything]

In inference mode, we can

- Adapt a model to a new task via in-context learning (few-shot prompting)
- Build apps that combine the power of LLMs with other tools to solve other problems!

We can build numerous applications!

Workshop Activity Details

What's next?

Building an end-to-end voice-enabled chatbot
- Integrate the ASR, NMT and TTS models together with ChatGPT.
- RAG-based pipeline to build a Govt. Scheme bot.
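
The two items above can be sketched as a chain of components. Everything here is a stand-in: the ASR, NMT, LLM, and TTS steps are passed in as plain functions, retrieval is a simple word-overlap ranking rather than a real vector search, the order of the steps is an assumption, and the scheme descriptions are invented placeholders. The workshop's actual models and APIs are not shown in the slides.

```python
def retrieve(query, documents, k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query, documents, k=1):
    """Put the retrieved text in front of the question (the 'RAG' step)."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"


def voice_chatbot(audio, asr, to_english, llm, from_english, tts, documents):
    """Speech in, speech out: ASR, then NMT, then LLM with RAG, then NMT, then TTS."""
    text = asr(audio)                 # speech to text in the user's language
    english = to_english(text)        # NMT into English
    reply = llm(build_rag_prompt(english, documents))
    return tts(from_english(reply))   # NMT back, then text to speech


# Invented placeholder documents, not real scheme descriptions
docs = [
    "Scheme A gives crop insurance to farmers",
    "Scheme B gives scholarships to students",
]
print(build_rag_prompt("which scheme helps farmers", docs))
```

Because each stage is just a function argument, any one of them can be swapped (a different ASR model, a different LLM API) without touching the rest of the pipeline.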

Help us build a real-world benchmark for evaluating LLMs in the Indian context and languages.
- Take-home activity
- Just 25 prompts per person.

Certificates for the workshop will be given to attendees upon successfully
1. Completing the assigned 25 prompts
2. Building your own voice-enabled chatbot.

Image Credits

flaticon.com

https://designer.microsoft.com/image-creator