Introduction to Large Language Models

Lecture 7: Scaling data size, Data sources, Data Pre-processing pipeline, Data mixing

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

We know that the performance of DL models scales with data. The datasets that we have seen so far are listed in the table below.

Model  Dataset  Size
GPT  BookCorpus  0.8 Billion words
BERT  BookCorpus, English Wikipedia  3.3 Billion words
GPT-2  WebText  ~19 Billion tokens
T5  C4  ~175 Billion tokens

GPT-2 (1.5 Billion parameters) and T5 (11 Billion parameters) were trained on ~19 Billion and ~175 Billion tokens, respectively.

Is it necessary to scale the size of the dataset even further?

If yes, how do we source the data?

What are all the aspects (duplication, toxicity, bias) that we need to consider while building a larger dataset?

Module 1: Motivation for Scaling the data size

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Scaling Law: The more the data, the better the model

The other factor that motivated scaling the data size was the study on Scaling Laws (published a few months after T5).

Given the number of non-embedding parameters \(N\) and the size of the dataset \(D\) (in tokens), the test loss is

L(N,D)= \bigg[\big(\frac{N_c}{N}\big)^{\frac{\alpha_N}{\alpha_D}}+\frac{D_c}{D}\bigg]^{\alpha_D}

where,

\(N_c = 8.8 \times 10^{13}\) : 88 Trillion parameters

\(D_c = 5.4 \times 10^{13}\) : 54 Trillion tokens

\(\alpha_N = 0.076\)

\(\alpha_D = 0.095\)

The study on T5 demonstrated that the model overfits if the data is repeated for many epochs during training. This suggests that we should scale up the data size.

Suppose we have a model with 100 million (non-embedding) parameters

and a dataset with 10 million tokens

What will be the training loss?

Of course, the model will drive the training loss to zero!

However, the test loss changes according to the scaling law

L(N,D)= \bigg[\big(\frac{N_c}{N}\big)^{\frac{\alpha_N}{\alpha_D}}+\frac{D_c}{D}\bigg]^{\alpha_D}
L(N,D)= \bigg[\big(\frac{88 \times 10^{12}}{100 \times 10^{6}}\big)^{0.8}+\frac{54 \times 10^{12}}{10 \times 10^6}\bigg]^{0.095}
L(N,D) \approx (56962+54 \times 10^5)^{0.095}
L(N,D) \approx 4.36
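The same calculation can be scripted to see how the predicted test loss falls as \(D\) grows. A minimal Python sketch using the constants above (the loop values are arbitrary illustrative choices):

# Test loss predicted by the scaling law with the constants above
N_C, D_C = 8.8e13, 5.4e13          # critical parameter / token counts
ALPHA_N, ALPHA_D = 0.076, 0.095

def scaling_law_loss(n_params, n_tokens):
    # L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# The worked example: 100M non-embedding parameters, 10M tokens -> ~4.36
print(round(scaling_law_loss(100e6, 10e6), 2))

# Increasing D reduces the predicted test loss while N stays fixed
for d in [10e6, 20e6, 50e6, 100e6, 1e9]:
    print(f"D = {d:.0e} tokens -> L = {scaling_law_loss(100e6, d):.3f}")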

What happens if we increase the size of the dataset? Let's see

[Interactive plot: test loss \(L(N,D)\) as \(D\) is increased in steps of 10 million tokens]

Increasing the number of tokens reduces the model's test loss 

This motivated the community to scale both the model and data size [Ref]

[Figure: number of synapses in biological brains (fruit fly >10^6, honey bee ~10^9, mouse ~10^{12}, cat ~10^{13}, human brain ~10^{15}) alongside the parameter counts of models: Transformer (400M), GPT-2 (1.5B), Megatron-LM (10B), GPT-3 (175B), GShard (1.1T to 1.6T)]

Scaling the parameters of the model requires scaling the dataset proportionately.

[Figure: growth of pre-training dataset sizes over time: BookCorpus ~5 GB (2015), Wikipedia ~21 GB, WebText (closed) ~40 GB (2019), RealNews ~110 GB, C4 ~800 GB (2020)]

While data has been growing...

we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data

we propose to explore how web data can be better processed to significantly improve its quality, resulting in models as capable, if not more capable, than models trained on curated corpora

Open source data is good enough

Data quality is super important

Emphasizing diversity

we explore the improvement that can be obtained along a different axis: the quality of the data.

[Figure: growth of open pre-training datasets over time: BookCorpus ~5 GB (2015), Wikipedia ~21 GB, WebText (closed) ~40 GB (2019), RealNews ~110 GB, C4 ~750–800 GB, The Pile ~800 GB (2020), ROOTS ~1.6 TB, Falcon/RefinedWeb ~2.6 TB, RedPajama ~5 TB (2023), DOLMA ~11 TB (2024)]

Three axes of data: Scale, Quality, Diversity

The Pipeline

Sourcing data (webpages, code, Arxiv, ..): the web holds greater than 100 petabytes (PB), of which on the order of 100 TB is sourced.

Cleaning data (deduplication, quality filtering, ..): this yields datasets of roughly 1 to 11 TB, i.e., about 1 to 3 trillion tokens, which are fed to the LLM.

Resulting datasets: C4, mC4, Stack, The Pile, DOLMA, WebText, MassiveWeb (not public), ..

Module 2: Data Sourcing and Cleaning Pipeline

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Disclaimer

Detailed information about the data-cleaning pipeline is available publicly for the following datasets

  • C4 from T5 (2019)

  • The Pile (2020)

  • Falcon (2023)

  • ROOTS (2023)

  • Redpajama-v1/v2 (2023)

  • DOLMA (2024)

  • Sangraha (2024)

Often, datasets used in pre-training large models lack proper documentation; technical reports typically provide only very basic information.

For example, MassiveWeb, used for training the Gopher model, was built with a custom HTML scraper and uses Google's SafeSearch signal for filtering.

In this lecture, we attempt to give you a comprehensive idea about various data processing pipelines.

Almost all the pre-training datasets were constructed by scraping/crawling the web

One can write a custom HTML scraper or use Common Crawl (CC) for fetching the data 

Each month, Common Crawl releases a snapshot of the web that contains about 3 to 5 billion web pages.

The raw data contains duplicate, toxic, and low-quality content, as well as personally identifiable information.

Recall the saying: Garbage in, Garbage out (GIGO).

So it is important to clean the raw text.

However, the web contains more than 100 petabytes of data. Therefore, it is a challenging task to build an automatic pipeline that takes in the raw text and outputs clean text.

Data Format

Let us understand the data format of the CC snapshots.

The raw CC data is available in three formats:

  • WARC (Web ARChive, 86 TiB)

  • WAT (Web Archive Transformation, 21 TiB)

  • WET (WARC Encapsulated Text, plain text, 8.69 TB); informally, Web Extracted Text

  • To know more
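As an aside, a WET snapshot can be streamed record by record with the warcio library; a minimal sketch (the file name below is a placeholder, real paths come from the Common Crawl index):

# Iterate over the plain-text records of a Common Crawl WET file
from warcio.archiveiterator import ArchiveIterator   # pip install warcio

with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":           # WET text records have type 'conversion'
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        # hand `url` and `text` over to the cleaning pipeline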

The pipeline takes in the raw data in any one of these formats and produces  clean text

To build a large dataset, researchers often take CC snapshots of multiple months or years

It is also challenging to directly measure how clean the output text is.

Pre-processing pipeline

To get high-quality text from the raw data, the pipeline requires mechanisms

to detect the toxicity level (bad words, hate speech, stereotypes, ..) of the content in the pages (simple heuristics, or an LM-based scorer such as Perspective API)

to detect the language of the web pages (fastText, langdetect, pycld2, IndicLID, or a combination of these)

to detect the quality of the content in the pages (an LM trained on reference text like Wikipedia or books, or simple heuristics)

to detect and remove Personally Identifiable Information (PII) in the pages (regex patterns)

to deduplicate the content at the line, paragraph and document level (similarity measures using hashing/ML techniques)
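For concreteness, here is a minimal sketch of two of these mechanisms: language detection with the public fastText LID model and PII masking with regex patterns. The patterns, threshold and example strings are illustrative, not the ones used by any particular pipeline:

import re
import fasttext                                  # pip install fasttext

lid_model = fasttext.load_model("lid.176.bin")   # public fastText language-ID model

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")   # crude, illustrative phone pattern

def detect_language(text):
    """Return (language code, confidence) for a piece of text."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])

def scrub_pii(text):
    """Mask e-mail addresses and phone-number-like spans."""
    return PHONE_RE.sub("<PHONE>", EMAIL_RE.sub("<EMAIL>", text))

lang, conf = detect_language("The orbit of Aditya-L1 is a periodic Halo orbit.")
if lang == "en" and conf > 0.9:                  # keep only confident English pages
    clean = scrub_pii("Contact me at someone@example.com or +91 98765 43210.")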

Types of Deduplication

Exact duplication

webpage-1: The orbit of Aditya-L1 spacecraft is a periodic Halo orbit

webpage-2: The orbit of Aditya-L1 spacecraft is a periodic Halo orbit

Fuzzy (near) duplication

The orbit of Aditya-L1 spacecraft is a periodic Halo orbit which is located roughly 1.5 million km from earth on the continuously moving Sun  

The orbit of Aditya-L1 spacecraft is a periodic Halo orbit . The orbit is located roughly 1.5 million km from earth on the continuously moving Sun  

Exact deduplication algorithms:

  • Suffix array

Fuzzy (near) deduplication algorithms:

  • MinHash

  • SimHash

  • Locality Sensitive Hashing (LSH)
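A minimal sketch of fuzzy deduplication with MinHash + LSH using the datasketch library (the shingle size and similarity threshold are illustrative choices, not those of any specific pipeline):

from datasketch import MinHash, MinHashLSH       # pip install datasketch

def minhash_of(text, num_perm=128):
    """Build a MinHash signature from word 3-grams (shingles) of the text."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

docs = {
    "webpage-1": "The orbit of Aditya-L1 spacecraft is a periodic Halo orbit which is located roughly 1.5 million km from earth",
    "webpage-2": "The orbit of Aditya-L1 spacecraft is a periodic Halo orbit . The orbit is located roughly 1.5 million km from earth",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)    # approximate Jaccard-similarity threshold
kept = []
for doc_id, text in docs.items():
    sig = minhash_of(text)
    if lsh.query(sig):                           # a near-duplicate is already kept -> drop
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)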

A Basic Pipeline (fastText)

.WET (8.6 TiB, compressed) → Language Detection (fastText) → language-wise buckets (hi, ta, en, ..) → Deduplication (using hashing)

It satisfies only two of the requirements: language detection and (line-level) deduplication.

The filtered text could still contain toxic content, low-quality content and PII.

CCNet Pipeline

Designed for extracting high-quality monolingual pre-training data

This is especially helpful for low-resource languages, as curated datasets like BookCorpus and English Wikipedia cover only English.

Even if we wish to use Wikipedia dumps for other languages, the size won't be sufficient.

For example, Hindi has only 1.2 million pages and Tamil has only 0.5 million pages, whereas English has 59 million pages [source].

Therefore, we need to use CC dumps (years of dumps).

This is a compute-intensive task and requires massive parallelization with carefully written code.

CCNet Pipeline

Uses paragraph level deduplication

Divide each webpage into three portions: head, middle and tail. Obtain a quality score for each portion using a language model pre-trained on a targeted domain (say, Wikipedia).

[Illustration: a webpage (boiler-plate header "Please enable JavaScript to use our site.", menu items Home / Products / Shipping / Contact / FAQ, a product listing "Dried Lemons, $3.59/pound" with its description, and an encyclopedia-style paragraph about the lemon) split into head, middle and tail, each passed through the ML classifier to obtain a quality score]

.WET → Deduplication (using hashing; paragraph level) → Language Detection → Targeted Quality Filtering → .JSON

Remove low-quality pages based on a threshold on the quality scores described above.

It satisfies three of the requirements.

By following this pipeline, they extracted documents for 130 languages (those with at least 1,000 documents).
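A sketch of the targeted quality-filtering idea: score text with a language model trained on the reference domain and keep only pages whose perplexity is low enough. The KenLM model path and threshold below are placeholders, not artefacts shipped with CCNet:

import kenlm                                      # pip install kenlm

lm = kenlm.Model("wiki.en.arpa.bin")              # placeholder: n-gram LM trained on Wikipedia

def split_head_middle_tail(lines):
    """Divide a page into three roughly equal portions, as described above."""
    k = max(len(lines) // 3, 1)
    return [" ".join(lines[:k]), " ".join(lines[k:2 * k]), " ".join(lines[2 * k:])]

PPL_THRESHOLD = 1000.0                            # illustrative; real thresholds are tuned per language

def keep_page(lines):
    scores = [lm.perplexity(p) for p in split_head_middle_tail(lines) if p.strip()]
    return bool(scores) and min(scores) < PPL_THRESHOLD   # keep if any portion looks Wikipedia-like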

CCNet Pipeline

Number of tokens (using the SentencePiece tokenizer) in each language after deduplication (of the Feb 2019 snapshot): \(\approx\) 550 Billion tokens for English versus \(\approx 1.5\) Billion tokens for Tamil.

Recall that Tamil has 0.5 million Wikipedia pages, whereas by running the CCNet pipeline on the Feb 2019 snapshot of Common Crawl, one could get 0.9 million documents for the language (note that there may be an overlap).

Colossal Clean Crawled Corpus (C4) Pipeline

It is an extension of CCNet with additional filtering heuristics.

CC (.WET, ~20 TB) → English-only filtering ("C3", ~6 TB) → heuristic cleaning → C4 (~750 GB) → domain filtering → RealNews-like (~35 GB), WebText-like (~17 GB)

We can get diverse data such as news, stories, webtext by domain filtering

They released the clean dataset (named C4) in the public domain

Common Crawl "Web Extracted Text"

Menu

 

Lemon

 

Introduction

 

The lemon, Citrus Limon (l.) Osbeck, is a species of small evergreen tree in the flowering plant family rutaceae. The tree's ellipsoidal yellow fruit is used for culinary and non-culinary purposes throughout the world, primarily for its juice, which has both culinary and cleaning uses. The juice of the lemon is about 5% to 6% citric acid, with a ph of around 2.2, giving it a sour taste.

 

Article

 

The origin of the lemon is unknown, though lemons are thought to have first grown in Assam (a region in northeast India), northern Burma or China. A genomic study of the lemon indicated it was a hybrid between bitter orange (sour orange) and citron.

Please enable JavaScript to use our site.

 

Home

Products

Shipping

Contact

FAQ

 

Dried Lemons, $3.59/pound

 

Organic dried lemons from our farm in California.

Lemons are harvested and sun-dried for maximum flavor.

Good in soups and on popcorn.

 

The lemon, Citrus Limon (l.) Osbeck, is a species of small evergreen tree in the flowering plant family rutaceae. The tree's ellipsoidal yellow fruit is used for culinary and non-culinary purposes throughout the world, primarily for its juice, which has both culinary and cleaning uses. The juice of the lemon is about 5% to 6% citric acid, with a ph of around 2.2, giving it a sour taste

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Curabitur in tempus quam. In mollis et ante at consectetur.

Aliquam erat volutpat. Donec at lacinia est.

Duis semper, magna tempor interdum suscipit, ante elit molestie urna, eget efficitur risus nunc ac elit.

Fusce quis blandit lectus. Mauris at mauris a turpis tristique lacinia at nec ante.

Aenean in scelerisque tellus, a efficitur ipsum.

Integer justo enim, ornare vitae sem non, mollis fermentum lectus.

Mauris ultrices nisl at libero porta sodales in ac orci.

 

function Ball(r){

this.radius = r;

this.area = pi * r ** 2;

this.show = function()

{ drawCircle(r);

}

}

The raw WET text has a lot of gibberish: boiler-plate text like menus, offensive language, placeholder text, source code, etc.

Make a clean corpus that is suitable for NLP tasks by filtering out such low-quality text.

Let us illustrate the process with an example.


Remove any page that contains "lorem ipsum" or a curly bracket.


Remove lines which do not have terminal punctuation marks



Deduplicate the text


Remove bad words by using a manual list of bad words (any page that contains a bad word will be removed)
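These filtering steps can be written as a handful of line- and page-level heuristics. A simplified sketch (the bad-word set is a stand-in for the actual blocklist used by C4, and deduplication is left to the hashing techniques discussed earlier):

BAD_WORDS = {"badword1", "badword2"}              # placeholder for the C4 bad-word blocklist
TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_page(page):
    """Apply C4-style heuristics; return cleaned text, or None to drop the page."""
    lowered = page.lower()
    if "lorem ipsum" in lowered or "{" in page:   # placeholder text / source code
        return None
    if any(w in lowered.split() for w in BAD_WORDS):
        return None                               # page-level bad-word filter
    kept_lines = [line for line in page.splitlines()
                  if line.strip().endswith(TERMINAL_PUNCT)]   # keep sentence-like lines only
    return "\n".join(kept_lines) if kept_lines else None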

In general, all pre-processing data pipelines follow the CCNet pipeline, with modifications to its components.

For example, Pile-CC uses the WARC files (instead of WET), which contain raw HTML pages, and then applies jusText to extract the text content.

.WARC → jusText (extract text) → Language Detection (pycld2) → Deduplication and Targeted Quality Filtering → Pile-CC (227 GiB)

It uses fuzzy deduplication instead of the exact deduplication algorithm used in C4.

A significant fraction of the duplicates in CC are exact duplicates [Ref]. Therefore, one can apply fuzzy deduplication first, followed by exact deduplication.

RefinedWeb (RW)

.WARC → Trafilatura + LID → RW-Raw → Quality Filtering → RW-Filtered → Deduplication (fuzzy, then exact) → RefinedWeb

50% of CC was removed after language identification.

24% of RW-Raw was removed after quality filtering.

12% of RW-Filtered was removed after deduplication.

It used both fuzzy and exact deduplication techniques.

In total, this stringent approach removes almost 90% of the data from a single snapshot of Common Crawl.

What is the impact of having duplicate content?
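For reference, the text-extraction step (RefinedWeb uses Trafilatura for this) takes only a couple of calls; the URL below is just an example:

import trafilatura                                # pip install trafilatura

# Fetch a page and strip menus, ads and other boilerplate, keeping the main text
html = trafilatura.fetch_url("https://example.com/article")
text = trafilatura.extract(html)                  # returns None when no main content is found
if text:
    print(text[:200])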

Here is a quick comparison of the pipelines we have seen so far

So far, we have seen that the pre-processing pipeline takes in different forms of text data. Why not also use audio (via ASR) and scanned documents (via OCR)?

Using these modalities is helpful for low-resource languages.

SETU Pipeline

Data Sources: 

  • Web (Curated and manually verified)
  • Videos 
  • Digitized PDF

Language Detection: 

  • IndicLID, CLD3, NLLB

Scraping: 

  • Trafilatura 

Deduplication: 

  • Fuzzy (MinHash)

Name of the dataset:

Sangraha Verified

Deduplication deserves some attention!

From the RefinedWeb Dataset for Falcon LLM paper:

"the Pythia suite of models found that deduplicating The Pile had a limited impact on zero-shot performance, questioning whether deduplication is as relevant for curated corpora as it is for predominantly web-based datasets."

Perfect Deduplication is Difficult

For example, it was found that a single sentence of about 61 words was repeated over 60,000 times in C4 [Ref].

This happens not only in C4 but also in other datasets such as RealNews [Ref].

Therefore, the natural question is: does duplicate content degrade the performance of the model?

Including the Pile: "We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing" [Ref]*

* however, the Pythia suite of models found that deduplicating The Pile had a limited impact on zero-shot performance

Why is deduplication important?

Duplication

  • allows the model to memorize the training data

  • gives a wrong picture of test performance

  • makes the model susceptible to privacy risks

The graph on the right shows the number of (exact) duplicates of 400-character sequences.

We can see that millions of sequences were repeated at least 10 times in the dataset.

For instance, a sequence that is present 10 times in the training data is, on average, generated 1000x more often than a sequence that is present only once [Ref].

For a 1B-parameter model, a hundred duplicates are harmful; at 175B, even a few duplicates could have a disproportionate effect [Ref].
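A duplicate count like the one in the graph can be approximated by hashing fixed-length character windows of the corpus and counting repeats; a rough sketch (the stride and in-memory counting are simplifications):

from collections import Counter
import hashlib

WINDOW = 400                                      # characters, as in the analysis above

def duplicate_histogram(documents, stride=400):
    """Count how many times each 400-character window appears across the corpus."""
    counts = Counter()
    for doc in documents:
        for start in range(0, max(len(doc) - WINDOW + 1, 0), stride):
            window = doc[start:start + WINDOW]
            counts[hashlib.sha1(window.encode("utf-8")).hexdigest()] += 1
    return counts

# e.g. number of sequences repeated at least 10 times:
# sum(1 for c in duplicate_histogram(docs).values() if c >= 10)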

A Pretrainer’s Guide

A "high-quality" Reddit post does not look like a "high-quality" academic paper; and even within academic papers, quality measured by peer review has high variance.

All pipelines are leaky! 

Insights from training Gopher:

We do not attempt to filter out low quality documents by training a classifier based on a "gold" set of text, such as English Wikipedia or pages linked from Reddit, as this could inadvertently bias towards a certain demographic or erase certain dialects or sociolects from representation

Open Pre-trained Transformer LM

We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.

 

Learning from Repeated Data:

performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique

Module 3: Pre-training Datasets

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Pre-training Datasets

Monolingual (en)

Multilingual (en, gu, hi, sa, ta, te, ..)

We can also group the datasets based on the diversity of data sources (web, books, wiki, code, ..).

For convenience, the source code of 300+ programming languages (code) can be considered as a monolingual (English) dataset.

Some publicly released pre-training datasets (for example, BookCorpus) are no longer available.

[Figure: dataset sizes over time: BookCorpus ~5 GB (2015), Wikipedia ~21 GB, WebText ~40 GB (2019), RealNews ~110 GB, C4 ~750 GB, The Pile ~800 GB (2020), ROOTS ~1.6 TB, Falcon/RefinedWeb ~2.6 TB, RedPajama ~5 TB (2023)]

Some datasets are private, for example, WebText.

So researchers build their own versions (say, OpenWebText) by scraping the web and following steps similar to those in the original paper.

Pre-training Datasets

Monolingual (en / code): BookCorpus, Wikipedia, OpenWebText, RealNews, C4, The Pile, RefinedWeb (Falcon), RedPajama-v1/v2, DOLMA, Stack

Multilingual: mC4, ROOTS, Sangraha

Of these, we will be focusing on: C4, mC4, Stack, The Pile, ROOTS, RefinedWeb (Falcon), RedPajama-v1/v2, DOLMA, Sangraha

One way to compare the size of the datasets is by their (uncompressed) storage size.

The other useful way is to compare them by the number of tokens (which relates directly to the scaling laws).

Datasets such as The Pile and DOLMA emphasize diversity in the pre-training data.

Datasets such as RefinedWeb suggest that one could get good performance with web-only data.

Dataset name, number of tokens, curation, diversity and languages:

  • C4: ~156 Billion tokens; not curated; Webpages; English

  • The Pile: ~170 Billion tokens; curated; 22 sources; English

  • Stack: > 1 Trillion tokens; curated; 380 programming languages; Code

  • RefinedWeb: 5 Trillion tokens (600B in public); not curated; Webpages; English

  • RedPajama-v1/v2: 1.2 / 30 Trillion tokens; curated; Webpages, Books, Arxiv, Wiki, StackExchange; English/Multi

  • DOLMA: 3 Trillion tokens; curated; Webpages, Books, Wiki, The Stack, STEM; English

  • mC4: ~418 Billion tokens; not curated; Webpages; Multi

  • ROOTS: ~341 Billion tokens; curated; natural and programming languages; Multi

  • Sangraha: 251 Billion tokens; curated; Web, videos, digitized PDFs, synthetic; Multi

Composite Datasets

The Pile (2020): ~170 Billion tokens

RedPajama-V1* (2023): ~1.2 Trillion tokens

DOLMA (2024): ~3 Trillion tokens

Observe the increase in token counts each year.

*After extensive deduplication, the number of tokens reduces to 670B. The resulting dataset is called SlimPajama.

Data Diversity

Note, however, that the recent RefinedWeb pipeline demonstrates that sourcing large-scale, high-quality data from the web alone is sufficient to obtain strong performance [Ref].

Suppose we decided to train a model for 15 million training steps  (\(2^{16}\) tokens per training step), resulting in 1 Trillion tokens

C4 (40%), OpenWebText* (15%), Books (10%), Wikipedia (10%), RealNews (10%), Code (10%), Stories (5%) → Sampling → Model

Then, at each training step, we draw these \(2^{16}\) tokens by sampling from all the available data sources according to this pre-defined distribution.

Utilizing high-quality and diverse datasets is crucial for achieving optimal downstream performance


*OpenWebText is an open version of WebText
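A minimal sketch of such mixture sampling, using the proportions above: each training step's \(2^{16}\) tokens are allocated across sources with a multinomial draw (real pipelines usually sample at the document or sequence level, but the idea is the same):

import numpy as np

MIXTURE = {                                       # sampling proportions from the figure above
    "C4": 0.40, "OpenWebText": 0.15, "Books": 0.10, "Wikipedia": 0.10,
    "RealNews": 0.10, "Code": 0.10, "Stories": 0.05,
}

rng = np.random.default_rng(0)
sources, weights = zip(*MIXTURE.items())

def sample_step_sources(tokens_per_step=2 ** 16):
    """How many tokens of one training step come from each source."""
    counts = rng.multinomial(tokens_per_step, weights)
    return dict(zip(sources, counts.tolist()))

step_alloc = sample_step_sources()                # allocates 65,536 tokens across the sources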

Almost all LLMs predominantly use web data (from The Pile, C4, CC, or custom scraping).

Wikipedia and books are the next most common sources.

Models like AlphaCode use 100% code data.

Models like LaMDA and PaLM gave more weight to conversation data.

Some of these models were trained on multilingual datasets, as shown in the next slide.

Source: A Pretrainer's Guide

M-L: Multi-Lingual

H: Heuristics

C: Classifiers

Source: A Pretrainer's Guide

Some Questions to Ponder

Are we reaching the limit on the size of the data?

Could we use LLMs like GPT-4 to generate synthetic data?

Would small (30B to 100B) but high-quality datasets ever be helpful in building a good model?

Is mixing code data (non-ambiguous) and text data (ambiguous) helpful or harmful?

Synthetic Dataset

Textbooks Are All You Need

"Textbook quality" refers to content that has educational value (for teaching purposes).

At Which Training Stage Does Code Data Help LLMs Reasoning? (ICLR 2024)

References

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages (will be published soon)

Gopher: MassiveText

Data source  Data size  Documents  Tokens  Sampling proportion
MassiveWeb  1.9 TB  604 M  506 B  48%
Books  2.1 TB  4 M  560 B  27%
C4  0.75 TB  361 M  182 B  10%
News  2.7 TB  1.1 B  676 B  10%
GitHub  3.1 TB  142 M  422 B  3%
Wikipedia  0.001 TB  6 M  4 B  2%

Private; used a custom scraper

Number of training tokens: 300 Billion tokens

Diversity of data sources
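Given these sampling proportions and the 300 billion training tokens, a quick back-of-the-envelope sketch shows how many passes (epochs) training makes over each subset; the numbers come from the table above:

TRAIN_TOKENS = 300e9
SUBSETS = {                                       # name: (tokens available, sampling proportion)
    "MassiveWeb": (506e9, 0.48), "Books": (560e9, 0.27), "C4": (182e9, 0.10),
    "News": (676e9, 0.10), "GitHub": (422e9, 0.03), "Wikipedia": (4e9, 0.02),
}

for name, (available, prop) in SUBSETS.items():
    seen = TRAIN_TOKENS * prop                    # tokens drawn from this subset
    print(f"{name:>10}: {seen / 1e9:6.1f}B tokens seen -> {seen / available:.2f} epochs")

For example, Wikipedia contributes 2% of 300B = 6B tokens while only 4B are available, so it is seen roughly 1.5 times, whereas Books is seen for only a fraction of an epoch.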

LLaMA (use only open data)

Data source  Data size (TB)  Documents  Tokens  Sampling proportion
Common Crawl  3.3  -  -  67%
C4  0.78  361 M  182 B  15%
GitHub  0.3  -  -  4.5%
Wikipedia  0.08  6 M  4 B  4.5%
Books  0.08  -  -  4.5%
Arxiv  0.09  -  -  2.5%
StackExchange  0.07  -  -  2%

Data Quality Measure

Fix the model and change only the input data source; then see how much the performance increases on the downstream tasks.

Information Finetuning

Benchmarking Evaluation Tasks


Challenges

Test-set leakage

Scaling: Macro vs Micro
