Stone/Iron Age: Carved in stone. Store and retrieve.
Industrial Age: Written on paper. Store and retrieve.
Digital Age: Digitized. Store and retrieve.
The Age of AI [has begun]*: Parameterized. Store and generate!
Image: In 1970s India, children suspected that people were inside the radio or television set.
Figure: Multi-head Masked Attention. Prompt tokens: tell me a joke about idli. Tokens generated one at a time, each fed back as input: why did the idli
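The masking idea in the figure can be sketched in a few lines. This is a minimal, single-head illustration in plain Python, not the implementation of any particular model: the function name `causal_attention_weights` and the all-zero raw scores are assumptions made for the example. A multi-head layer would run several such heads in parallel on different learned projections.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Token i may only look at tokens 0..i (itself and the past).
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

tokens = ["tell", "me", "a", "joke", "about", "idli"]

# Hypothetical raw scores: all zeros, so each token spreads its
# attention evenly over the tokens it is allowed to see.
scores = [[0.0] * len(tokens) for _ in tokens]
weights = causal_attention_weights(scores)

for token, row in zip(tokens, weights):
    print(f"{token:>6}: " + " ".join(f"{w:.2f}" for w in row))
```

Because every row is zero to the right of the diagonal, the prediction at each position depends only on earlier tokens, which is what lets the model generate one token at a time.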
Input text -> Predict the class/sentiment
Input text -> Summarize
Question -> Answer
Input text
Prompt: Input text
Output response conditioned on prompt
Prompt: Predict sentiment, summarize, fill in the blank, generate story
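As a rough sketch of how one model can cover all of these tasks, the task can be written into the prompt itself. The template wording, the task keys, and the `build_prompt` helper below are hypothetical and chosen only for illustration; real systems use their own prompt formats.

```python
# Hypothetical prompt templates: one per task, all served by the same model.
TEMPLATES = {
    "sentiment": "Predict the sentiment of the following text.\n\nText: {text}\nSentiment:",
    "summarize": "Summarize the following text.\n\nText: {text}\nSummary:",
    "fill_blank": "Fill in the blank in the following text.\n\nText: {text}\nAnswer:",
    "story": "Write a short story based on the following idea.\n\nIdea: {text}\nStory:",
}

def build_prompt(task, text):
    """Turn (task, input text) into a single prompt string for a generative model."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return TEMPLATES[task].format(text=text)

# The same input text, routed to two different tasks purely through the prompt.
review = "The idli was soft and the chutney was fresh."
print(build_prompt("sentiment", review))
print()
print(build_prompt("summarize", review))
```

The model's output is then whatever text it generates as a continuation of the prompt, so no task-specific output layer is needed.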
Labelled data for task-1
Labelled data for task-2
Labelled data for task-3
Raw text data (cleaned)
Build one
Inadequate quality datasets for Indic Languages
Dataset Name | # of tokens | Diversity | Languages
 | ~156 Billion | Webpage | English
 | ~170 Billion | 22 sources | English
 | > 1 Trillion | 380 programming languages | Code
 | 5 Trillion (600B in public) | Webpage | English
 | 1.2/30 Trillion | Webpage, Books, Arxiv, Wiki, StackExch | English/Multi
 | 3 Trillion | Webpage, Books, Wiki, The Stack, STEM | English
 | ~418 Billion | Webpage | Multi
 | ~341 Billion | Natural and programming languages | Multi
 | 251 Billion | Web, videos, digitized PDF, synthetic | Multi
English data
Use Instruction Fine-tuning and build datasets for the same
Full fine-tuning of LLMs on Indic datasets still requires a lot of compute and is expensive
Existing English Data
Synthetic India-centric conversations
Indic-Align
Capture all the different ways in which people can ask!
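One way to picture an instruction fine-tuning dataset that covers many phrasings of the same request is sketched below. The field names `instruction` and `response`, the `make_examples` helper, and the sample sentences are assumptions for illustration only; they are not the schema of Indic-Align or of any specific dataset.

```python
import json

def make_examples(paraphrases, response):
    """Pair every phrasing of a request with the same target response."""
    return [{"instruction": p, "response": response} for p in paraphrases]

# Hypothetical paraphrases of one user intent.
paraphrases = [
    "Tell me a joke about idli.",
    "Can you share something funny about idli?",
    "I need a joke about idli, please.",
]
response = "Why did the idli go to school? To become a little more well-rounded."

examples = make_examples(paraphrases, response)

# Serialize as JSON Lines: one training example per line.
jsonl = "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
print(jsonl)
```

Keeping `ensure_ascii=False` leaves non-Latin scripts readable in the file, which matters when the instructions are written in Indic languages.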