Introduction to Large Language Models

Lecture 6: The Bigger Picture and the road ahead

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

In the previous lecture, we dived deep into T5 and the choices behind it

Many Large Language Models have been released since then, and the number is growing

In this lecture, we attempt to group some of these prominent models and see how they differ in their pre-training datasets and in how they modify various components of the architecture

One way is to group the models based on the type of architecture (Encoder-only, Encoder-Decoder, Decoder-only) used.

 Image source: Mooler

Encoder-only models (BERT and its variants) dominated the field for a short time after BERT's introduction

Decoder-only models (GPT) soon emerged for the task of language modeling


This dominance did not last long: GPT-2 and T5 advocated for decoder-only and encoder-decoder models, respectively.

Model and Dataset

  • Scale the model, the dataset, or both?

  • In what proportion should each be scaled? (a back-of-the-envelope example follows this list)
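One influential answer is the compute-optimal ("Chinchilla") recipe: with training compute of roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, N and D should grow in roughly equal proportion, which works out to on the order of 20 tokens per parameter. A minimal back-of-the-envelope sketch in Python (the constants 6 and 20 are rules of thumb, not exact values from the paper):

```python
# Rough compute-optimal sizing following the Chinchilla rule of thumb.
# Assumptions (approximate): training cost C ~ 6 * N * D FLOPs,
# and about 20 training tokens per parameter.

def compute_optimal(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into model size N (params) and data size D (tokens)."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 1e25):  # small, medium, large compute budgets
        n, d = compute_optimal(c)
        print(f"C={c:.0e} FLOPs -> N~{n / 1e9:.1f}B params, D~{d / 1e9:.0f}B tokens")
```

The point of the exercise: under this recipe the dataset grows as fast as the model, rather than the model alone being scaled up.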


"Scaling laws" lead to a race to build larger and larger models!

Scaling laws


GPT-3 with 175 Billion parameters (proprietary) 


GPT-J with 6 Billion parameters (open sourced)

Jurassic-1 with 178 Billion parameters (proprietary) 

Gopher with 280 Billion parameters (proprietary) 

GLaM with 1200 Billion parameters (proprietary) 

InstructGPT proposed an RLHF-based alignment objective for conversational models

Decoder-only models dominated the field

Meta released OPT with 175 Billion parameters (open sourced)

ChatGPT was built on top of the ideas in InstructGPT

The number of (open and closed) LLMs is still growing

LLaMA-3 is planned for 2024!


Even if we select only the decoder branch in the evolutionary tree,

there would still be dozens of models.

So, how do they differ despite following the same architecture?


Position Encoding

  • Absolute
  • Relative
  • RoPE, NoPE (a minimal RoPE sketch follows this list)
  • ALiBi
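To make one of these choices concrete, here is a simplified NumPy sketch of rotary position embeddings (RoPE): each query/key vector is split into 2-D pairs and each pair is rotated by an angle proportional to the token position, so the dot product between a query and a key depends only on their relative offset. This is an illustrative sketch, not the implementation used by any particular model.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to a single query/key vector.

    x   : vector of even dimension d
    pos : token position
    Pairs (x[2i], x[2i+1]) are rotated by angle pos * base^(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE expects an even head dimension"
    half = x.reshape(-1, 2)                          # (d/2, 2) pairs
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    rotated = np.stack(
        [half[:, 0] * cos - half[:, 1] * sin,
         half[:, 0] * sin + half[:, 1] * cos], axis=-1)
    return rotated.reshape(d)

# Key property: the rotated dot product depends only on the offset (m - n),
# not on the absolute positions m and n.
q, k = np.random.randn(64), np.random.randn(64)
s1 = rope(q, 5) @ rope(k, 3)        # offset 2
s2 = rope(q, 105) @ rope(k, 103)    # offset 2, shifted by 100
print(np.allclose(s1, s2))          # True (up to floating point error)
```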

Attention

  • Full / Sparse
  • Multi-query / Grouped-query (a GQA sketch follows this list)
  • FlashAttention
  • PagedAttention
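A rough sketch of grouped-query attention (GQA): several query heads share one key/value head, which shrinks the KV cache. Multi-query attention (MQA) is the special case of a single KV head, and standard multi-head attention uses one KV head per query head. Names and shapes below are illustrative, not taken from any specific codebase.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (h_q, T, d); k, v: (h_kv, T, d) with h_q a multiple of h_kv."""
    h_q, T, d = q.shape
    h_kv = k.shape[0]
    group = h_q // h_kv                      # query heads per shared KV head
    causal = np.tril(np.ones((T, T), dtype=bool))
    out = np.empty_like(q)
    for i in range(h_q):
        j = i // group                       # KV head shared by this query head
        scores = q[i] @ k[j].T / np.sqrt(d)  # (T, T) attention scores
        scores = np.where(causal, scores, -1e9)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # softmax over allowed positions
        out[i] = w @ v[j]
    return out

# 8 query heads sharing 2 KV heads (GQA); h_kv=1 would be MQA, h_kv=8 plain MHA.
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)   # (8, 16, 64)
```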

Normalization

  • LayerNorm
  • RMSNorm (compared with LayerNorm in the sketch after this list)
  • DeepNorm
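The first two variants differ only in which statistics they use: LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm skips the mean subtraction and rescales by the root-mean-square, saving a little compute. A minimal sketch (the gain/bias parameters and epsilon are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: normalize by mean and variance over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction, rescale by the root-mean-square only."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

d = 8
x = np.random.randn(4, d)  # a batch of 4 token vectors
print(layer_norm(x, np.ones(d), np.zeros(d)).shape, rms_norm(x, np.ones(d)).shape)
```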

Activation

  • ReLU
  • GeLU
  • Swish
  • SwiGLU (sketched after this list)
  • GeGLU

Feed-forward layer

  • FFN
  • MoE

Norm placement

  • Post-Norm
  • Pre-Norm
  • Sandwich
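Gated activations such as SwiGLU and GeGLU replace the plain FFN's single projection plus nonlinearity with an element-wise product of a gated branch and a linear branch. A rough sketch of a SwiGLU feed-forward layer follows; the weight shapes are illustrative, and many real implementations also drop the biases and shrink the hidden width to keep the parameter count comparable to a plain FFN.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish / SiLU activation: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: (swish(x W_gate) * (x W_up)) W_down.

    GeGLU has the same structure with GeLU in place of Swish;
    a plain FFN would be act(x W_up) W_down with no gate.
    """
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 16, 64                    # illustrative sizes
x = np.random.randn(4, d_model)           # 4 token vectors
W_gate = np.random.randn(d_model, d_ff)
W_up   = np.random.randn(d_model, d_ff)
W_down = np.random.randn(d_ff, d_model)
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)   # (4, 16)
```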

[Decoder block diagram: Multi-Head Masked Attention → Add & Norm → Feed-forward NN → Add & Norm]

Training choices

  • Objective: CLM, MLM, denoising (a toy CLM loss follows this list)
  • Optimizer: AdamW, Adafactor, LION
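The decoder-only models discussed here are trained with the causal language modeling (CLM) objective: predict token t+1 from tokens up to t and minimize the average cross-entropy. A toy sketch with made-up logits (MLM and denoising differ mainly in which positions are predicted and how the input is corrupted):

```python
import numpy as np

def clm_loss(logits, token_ids):
    """Causal LM loss: cross-entropy of each next token given its prefix.

    logits    : (T, V) unnormalized scores produced after reading tokens 0..t
    token_ids : (T+1,) the sequence; position t's logits predict token_ids[t+1]
    """
    targets = token_ids[1:]                                   # shift by one
    logits = logits - logits.max(axis=-1, keepdims=True)      # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

V, T = 100, 8                                  # toy vocabulary and length
tokens = np.random.randint(0, V, size=T + 1)
logits = np.random.randn(T, V)                 # stand-in for model outputs
print(clm_loss(logits, tokens))
```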


  • The dataset used to train the model (Common Crawl being the source for the majority of the datasets): BookCorpus, OpenWebText, C4, CC-Stories, Wikipedia, Pushshift.io, BigQuery, The Pile, ROOTS, REALNEWS, CC-NEWS, StarCoder, RedPajama, DOLMA, ...

  • Various design choices to reduce the computational budget and increase performance

  • Training steps

Unifying Diagram

[One diagram summarizes the design space: a pre-training corpus (BookCorpus, OpenWebText, C4, CC-Stories, Wikipedia, Pushshift.io, BigQuery, The Pile, ROOTS, REALNEWS, CC-NEWS, StarCoder, RedPajama, ...) feeds a stack of N decoder blocks. Each block picks a position encoding (Absolute, Relative, RoPE, NoPE, ALiBi), an attention variant (MHA, MQA, GQA, Flash, Paged, Sparse), a normalization (RMS, Layer, Deep) for its Add & Norm layers, and an MLP type (FFNN, MoE) with an activation (ReLU, GeLU, Swish, SwiGLU, GeGLU); the whole model picks an objective (CLM, MLM, denoising) and an optimizer (AdamW, Adafactor, LION). N varies per model.]

[The unifying diagram instantiated for GPT-1, with N = 12 layers.]

[The unifying diagram instantiated for GPT-3 175B, with N = 96 layers.]

[The unifying diagram instantiated for PaLM 540B, with N = 118 layers.]

Road ahead

Data Sources, Data Pipelines, Datasets

[Diagram: raw sources (WebText, BookCorpus, OpenWebText, C4/mC4, CC-Stories, Wikipedia, Pushshift.io, BigQuery, The Pile, ROOTS, REALNEWS, CC-NEWS, StarCoder, RedPajama, DOLMA, ...) pass through filtering rules to produce the pre-training datasets.]

[The unifying diagram is repeated on the remaining "Road ahead" slides, covering the datasets and the components of the stack (position encoding, attention, normalization, MLP/activation, objective, optimizer).]

[Figure: causal attention unrolled over decoding steps. At step $i$, the query $q_i$ is scored against the keys $k_0, \dots, k_i$, giving $q_i k_j^\top$ for $j \le i$, and the output is a weighted sum of the values $v_0, \dots, v_i$:
$i = 0$: scores $q_0 k_0^\top$; values $v_0$
$i = 1$: scores $q_1 k_0^\top,\ q_1 k_1^\top$ (the $q_0$ row is unchanged); values $v_0, v_1$
$i = 2$: scores $q_2 k_0^\top,\ q_2 k_1^\top,\ q_2 k_2^\top$; values $v_0, v_1, v_2$
Only the new row involving $q_i$ needs to be computed at each step; a code sketch of this pattern follows.]
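A minimal sketch of this incremental decoding pattern, using a per-head cache of previous keys and values (a "KV cache") so that step i only computes the new row of scores. Names and shapes are illustrative, not from any particular implementation.

```python
import numpy as np

def decode_step(q_i, k_i, v_i, k_cache, v_cache):
    """One decoding step with a KV cache (single head, no batching).

    q_i, k_i, v_i    : (d,) query/key/value for the current token i
    k_cache, v_cache : lists holding the keys/values of tokens 0..i-1
    """
    k_cache.append(k_i)                        # cache grows by one entry per step
    v_cache.append(v_i)
    K = np.stack(k_cache)                      # (i+1, d)
    V = np.stack(v_cache)                      # (i+1, d)
    scores = K @ q_i / np.sqrt(q_i.shape[-1])  # q_i k_j^T for j <= i
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax over positions 0..i
    return w @ V                               # weighted sum of v_0..v_i

d = 64
k_cache, v_cache = [], []
for i in range(3):                             # steps i = 0, 1, 2 as in the figure
    q_i, k_i, v_i = (np.random.randn(d) for _ in range(3))
    out = decode_step(q_i, k_i, v_i, k_cache, v_cache)
print(out.shape)                               # (64,)
```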

Lecture-6-BigPicture

By Arun Prakash