Introduction to Large Language Models

Lecture 6: The Bigger Picture and the road ahead

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

In the previous lecture, we dived deep into T5 and the choices behind it

Many Large Language Models have been released since then, and the number is growing

In this lecture, we attempt to group some of these prominent models and see how they differ in their pre-training datasets and in how they modify various components of the architecture

One way is to group the models based on the type of architecture (Encoder-only, Encoder-Decoder, Decoder-only) used.

 Image source: Mooler

Encoder-only models (BERT and its variants) dominated the field for a short time after BERT's introduction

Decoder-only models (GPT) soon emerged for the task of language modeling


This dominance did not last long: GPT-2 and T5 advocated for decoder-only and encoder-decoder models, respectively.

Model and Dataset

  • Scale the model, the dataset, or both?

  • In what proportion should each be scaled? (a back-of-the-envelope example follows this list)
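One influential answer is the compute-optimal ("Chinchilla") recipe: with training compute of roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, N and D should grow in roughly equal proportion, which works out to on the order of 20 tokens per parameter. A minimal back-of-the-envelope sketch in Python (the constants 6 and 20 are rules of thumb, not exact values from the paper):

```python
# Rough compute-optimal sizing following the Chinchilla rule of thumb.
# Assumptions (approximate): training cost C ~ 6 * N * D FLOPs,
# and about 20 training tokens per parameter.

def compute_optimal(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into model size N (params) and data size D (tokens)."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 1e25):  # small, medium, large compute budgets
        n, d = compute_optimal(c)
        print(f"C={c:.0e} FLOPs -> N~{n / 1e9:.1f}B params, D~{d / 1e9:.0f}B tokens")
```

The point of the exercise: under this recipe the dataset grows as fast as the model, rather than the model alone being scaled up.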


"Scaling laws" lead to a race to build larger and larger models!

Scaling laws


GPT-3 with 175 Billion parameters (proprietary) 


GPT-J with 6 Billion parameters (open sourced)

Jurassic-1 with 178 Billion parameters (proprietary) 

Gopher with 280 Billion parameters (proprietary) 

GLaM with 1200 Billion parameters (proprietary) 

InstructGPT proposed an RLHF-based alignment objective for conversational models

Decoder-only models dominated the field

Meta released OPT with 175 Billion parameters (open sourced)

ChatGPT was built on top of the ideas in InstructGPT

The number of (open and closed) LLMs is still growing

LLaMA-3 is planned for 2024!


Even if we select only the decoder branch in the evolutionary tree,

there would still be dozens of models.

So, how do they differ despite following the same architecture?


Position Encoding

  • Absolute
  • Relative
  • RoPE, NoPE (a minimal RoPE sketch follows this list)
  • ALiBi
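To make one of these choices concrete, here is a simplified NumPy sketch of rotary position embeddings (RoPE): each query/key vector is split into 2-D pairs and each pair is rotated by an angle proportional to the token position, so the dot product between a query and a key depends only on their relative offset. This is an illustrative sketch, not the implementation used by any particular model.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to a single query/key vector.

    x   : vector of even dimension d
    pos : token position
    Pairs (x[2i], x[2i+1]) are rotated by angle pos * base^(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE expects an even head dimension"
    half = x.reshape(-1, 2)                          # (d/2, 2) pairs
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    rotated = np.stack(
        [half[:, 0] * cos - half[:, 1] * sin,
         half[:, 0] * sin + half[:, 1] * cos], axis=-1)
    return rotated.reshape(d)

# Key property: the rotated dot product depends only on the offset (m - n),
# not on the absolute positions m and n.
q, k = np.random.randn(64), np.random.randn(64)
s1 = rope(q, 5) @ rope(k, 3)        # offset 2
s2 = rope(q, 105) @ rope(k, 103)    # offset 2, shifted by 100
print(np.allclose(s1, s2))          # True (up to floating point error)
```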

Attention

  • Full / Sparse
  • Multi-query / Grouped-query (a GQA sketch follows this list)
  • FlashAttention
  • PagedAttention
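A rough sketch of grouped-query attention (GQA): several query heads share one key/value head, which shrinks the KV cache. Multi-query attention (MQA) is the special case of a single KV head, and standard multi-head attention uses one KV head per query head. Names and shapes below are illustrative, not taken from any specific codebase.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (h_q, T, d); k, v: (h_kv, T, d) with h_q a multiple of h_kv."""
    h_q, T, d = q.shape
    h_kv = k.shape[0]
    group = h_q // h_kv                      # query heads per shared KV head
    causal = np.tril(np.ones((T, T), dtype=bool))
    out = np.empty_like(q)
    for i in range(h_q):
        j = i // group                       # KV head shared by this query head
        scores = q[i] @ k[j].T / np.sqrt(d)  # (T, T) attention scores
        scores = np.where(causal, scores, -1e9)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # softmax over allowed positions
        out[i] = w @ v[j]
    return out

# 8 query heads sharing 2 KV heads (GQA); h_kv=1 would be MQA, h_kv=8 plain MHA.
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)   # (8, 16, 64)
```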

Normalization

  • LayerNorm
  • RMSNorm (compared with LayerNorm in the sketch after this list)
  • DeepNorm
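The first two variants differ only in which statistics they use: LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm skips the mean subtraction and rescales by the root-mean-square, saving a little compute. A minimal sketch (the gain/bias parameters and epsilon are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: normalize by mean and variance over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction, rescale by the root-mean-square only."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

d = 8
x = np.random.randn(4, d)  # a batch of 4 token vectors
print(layer_norm(x, np.ones(d), np.zeros(d)).shape, rms_norm(x, np.ones(d)).shape)
```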

Activation

  • ReLU
  • GeLU
  • Swish
  • SwiGLU (sketched after this list)
  • GeGLU

Feed-forward layer

  • FFN
  • MoE

Norm placement

  • Post-Norm
  • Pre-Norm
  • Sandwich
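Gated activations such as SwiGLU and GeGLU replace the plain FFN's single projection plus nonlinearity with an element-wise product of a gated branch and a linear branch. A rough sketch of a SwiGLU feed-forward layer follows; the weight shapes are illustrative, and many real implementations also drop the biases and shrink the hidden width to keep the parameter count comparable to a plain FFN.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish / SiLU activation: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: (swish(x W_gate) * (x W_up)) W_down.

    GeGLU has the same structure with GeLU in place of Swish;
    a plain FFN would be act(x W_up) W_down with no gate.
    """
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 16, 64                    # illustrative sizes
x = np.random.randn(4, d_model)           # 4 token vectors
W_gate = np.random.randn(d_model, d_ff)
W_up   = np.random.randn(d_model, d_ff)
W_down = np.random.randn(d_ff, d_model)
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)   # (4, 16)
```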

[Decoder block diagram: Multi-Head Masked Attention → Add & Norm → Feed-forward NN → Add & Norm]

Training choices

  • Objective: CLM, MLM, denoising (a toy CLM loss follows this list)
  • Optimizer: AdamW, Adafactor, LION
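The decoder-only models discussed here are trained with the causal language modeling (CLM) objective: predict token t+1 from tokens up to t and minimize the average cross-entropy. A toy sketch with made-up logits (MLM and denoising differ mainly in which positions are predicted and how the input is corrupted):

```python
import numpy as np

def clm_loss(logits, token_ids):
    """Causal LM loss: cross-entropy of each next token given its prefix.

    logits    : (T, V) unnormalized scores produced after reading tokens 0..t
    token_ids : (T+1,) the sequence; position t's logits predict token_ids[t+1]
    """
    targets = token_ids[1:]                                   # shift by one
    logits = logits - logits.max(axis=-1, keepdims=True)      # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

V, T = 100, 8                                  # toy vocabulary and length
tokens = np.random.randint(0, V, size=T + 1)
logits = np.random.randn(T, V)                 # stand-in for model outputs
print(clm_loss(logits, tokens))
```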


  • The dataset used to train the model (Common Crawl being the source for the majority of the datasets): BookCorpus, OpenWebText, C4, CC-Stories, Wikipedia, Pushshift.io, BigQuery, The Pile, ROOTS, REALNEWS, CC-NEWS, StarCoder, RedPajama, DOLMA, ...

  • Various design choices to reduce the computational budget and increase performance

  • Training steps

Unifying Diagram

[One diagram summarizes the design space: a pre-training corpus (BookCorpus, OpenWebText, C4, CC-Stories, Wikipedia, Pushshift.io, BigQuery, The Pile, ROOTS, REALNEWS, CC-NEWS, StarCoder, RedPajama, ...) feeds a stack of N decoder blocks. Each block picks a position encoding (Absolute, Relative, RoPE, NoPE, ALiBi), an attention variant (MHA, MQA, GQA, Flash, Paged, Sparse), a normalization (RMS, Layer, Deep) for its Add & Norm layers, and an MLP type (FFNN, MoE) with an activation (ReLU, GeLU, Swish, SwiGLU, GeGLU); the whole model picks an objective (CLM, MLM, denoising) and an optimizer (AdamW, Adafactor, LION). N varies per model.]

[The unifying diagram instantiated for GPT-1, with N = 12 layers.]

[The unifying diagram instantiated for GPT-3 175B, with N = 96 layers.]

[The unifying diagram instantiated for PaLM 540B, with N = 118 layers.]

Road ahead

Data Sources, Data Pipelines, Datasets

[Diagram: raw sources (WebText, BookCorpus, OpenWebText, C4/mC4, CC-Stories, Wikipedia, Pushshift.io, BigQuery, The Pile, ROOTS, REALNEWS, CC-NEWS, StarCoder, RedPajama, DOLMA, ...) pass through filtering rules to produce the pre-training datasets.]

[The unifying diagram is repeated on the remaining "Road ahead" slides, covering the datasets and the components of the stack (position encoding, attention, normalization, MLP/activation, objective, optimizer).]

[Figure: causal attention unrolled over decoding steps. At step $i$, the query $q_i$ is scored against the keys $k_0, \dots, k_i$, giving $q_i k_j^\top$ for $j \le i$, and the output is a weighted sum of the values $v_0, \dots, v_i$:
$i = 0$: scores $q_0 k_0^\top$; values $v_0$
$i = 1$: scores $q_1 k_0^\top,\ q_1 k_1^\top$ (the $q_0$ row is unchanged); values $v_0, v_1$
$i = 2$: scores $q_2 k_0^\top,\ q_2 k_1^\top,\ q_2 k_2^\top$; values $v_0, v_1, v_2$
Only the new row involving $q_i$ needs to be computed at each step; a code sketch of this pattern follows.]
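A minimal sketch of this incremental decoding pattern, using a per-head cache of previous keys and values (a "KV cache") so that step i only computes the new row of scores. Names and shapes are illustrative, not from any particular implementation.

```python
import numpy as np

def decode_step(q_i, k_i, v_i, k_cache, v_cache):
    """One decoding step with a KV cache (single head, no batching).

    q_i, k_i, v_i    : (d,) query/key/value for the current token i
    k_cache, v_cache : lists holding the keys/values of tokens 0..i-1
    """
    k_cache.append(k_i)                        # cache grows by one entry per step
    v_cache.append(v_i)
    K = np.stack(k_cache)                      # (i+1, d)
    V = np.stack(v_cache)                      # (i+1, d)
    scores = K @ q_i / np.sqrt(q_i.shape[-1])  # q_i k_j^T for j <= i
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax over positions 0..i
    return w @ V                               # weighted sum of v_0..v_i

d = 64
k_cache, v_cache = [], []
for i in range(3):                             # steps i = 0, 1, 2 as in the figure
    q_i, k_i, v_i = (np.random.randn(d) for _ in range(3))
    out = decode_step(q_i, k_i, v_i, k_cache, v_cache)
print(out.shape)                               # (64,)
```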

Lecture-6-BigPicture

By Arun Prakash