Lecture 4: Tokenizers: BPE, WordPiece and SentencePiece

Mitesh M. Khapra

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Introduction to Large Language Models

Introduction

Recall that language modelling involves computing probabilities over a sequence of tokens

P(x_1)\prod \limits_{i=2}^T P(x_i|x_1,\cdots,x_{i-1})
=P(x_1)P(x_2|x_1) \cdots P(x_T|x_{T-1},\cdots,x_1)

This requires us to split the input text into tokens in a consistent way.

Let's see where a tokenizer fits in the training and inference pipeline.

[Figure: where the tokenizer fits in the pipeline]
"How are you?" → Tokenizer → tokens {How, are, you} → token ids {How: 1, are: 2, you: 3} → embeddings {1: [0.1, 0.35, ..., 0.08], 2: [0.01, 1.00, ..., 0.8], 3: [0.12, 0.00, ..., 0.46]} → Language Model → ids to tokens

 A simple approach is to use whitespace for splitting the text.

The first step is to build (learn) the vocabulary \(\mathcal{V}\) that contains unique tokens (\(x_i\)).

Therefore, the fundamental question is how do we build (learn) a vocabulary \(\mathcal{V}\) from a large corpus that contains billions of sentences?

We also add special tokens such as <go>, <stop>, <mask>, <sep>, <cls> and others to the vocabulary, based on the type of downstream task and the choice of architecture (GPT/BERT).

[Figure: a Transformer block (Multi-Head Attention + Feed Forward Network) processing the token sequence <go> I enjoyed the movie transformers <stop>]

We can split the text into words using whitespace (called pre-tokenization) and add all unique words in the corpus to the vocabulary.

Some Questions

Is splitting the input text into words using whitespace a good approach?

What about languages like Japanese, which do not use any word delimiters like space? For example: 映画『トランスフォーマー』を楽しく見ました ("I enjoyed watching the movie Transformers").

Why not treat each individual character in a language as a token of the vocabulary?

In that case, do we treat the words "enjoy" and "enjoyed" as separate tokens?

Finally, what are good tokens?

Challenges in building a vocabulary

What should be the size of vocabulary?

The larger the vocabulary, the larger the embedding matrix and the more expensive the softmax computation. What is the optimal size?

Out-of-vocabulary

If we limit the size of the vocabulary (say, from 250K to 50K), then we need a mechanism to handle out-of-vocabulary (OOV) words. How do we handle them?

Handling misspelled words in the corpus

Often, the corpus is built by scraping the web, so there is a chance of typos and spelling errors. Such erroneous words should not be treated as unique words.

Open Vocabulary problem 

A new word can be constructed (say, in agglutinative languages) by combining existing words. The vocabulary is, in principle, infinite (think of names, numbers, ...), which makes a task like machine translation challenging.

Module 4.1 : Types of tokenization

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

It would be interesting to measure the vocabulary size of native (and non-native) speakers.

If we treat two words as different whenever they differ in meaning, then words like "bank" and "air", which have multiple meanings, each count more than once.

If we treat words as groups of characters arranged in order, then the words in the list ["learn", "learns", "learning", "learnt"] are all different.

This increases the size of vocabulary which in turn increases the computational requirement.

In practice, the size of the vocabulary varies based on the corpus. It is often on the order of tens of thousands.

Source: lemongrad

What constitutes a word? 

One way to reduce the size of the vocabulary is to consider the individual characters in the input text as tokens.

Character Level Tokenization

Input: Hmm.. I know, I don't know
Tokens: [H, m, m, ., ., I, k, n, o, w, ,, I, d, o, n, ', t, k, n, o, w]
Number of tokens: 21
\(\mathcal{V}\): {H, m, ., I, k, n, o, w, d, ', t}
\(|\mathcal{V}|\): 11

The size of the vocabulary is small and finite even for a large corpus of text.

It solves both open vocabulary and out-of-vocabulary problems 

However, it loses the semantic meaning of words.

The number of characters in a sentence is, on average, about five times the number of words.

This increases the computational complexity of the models, as the context window needs to be about five times longer.

On the other hand, since the vocabulary is small, computing softmax probabilities is not a bottleneck.

Let's look at the word-level tokenizer once again.

Word Level Tokenization

Input: Hmm.. I know, I don't know
Tokens: [Hmm.., I, know,, I, don't, know]
Number of tokens: 6
\(\mathcal{V}\): {Hmm.., I, know, don't}
\(|\mathcal{V}|\): 4

In general, the size of the vocabulary grows based on the number of unique words in a corpus of text.

In practice, it could range anywhere from 30,000 to 250,000 depending on the size of the corpus.

Therefore, computing softmax probabilities to predict the token at each time step becomes expensive 

One approach to limit the size of the vocabulary is to consider the words with a minimum frequency of occurrence, say, at least 100 times in the corpus.


The words which are not in the vocabulary (Out-of-vocabulary problem) are substituted by a special token: <unk> during training and inference. Doing this gives suboptimal performance for tasks like text generation and translation

Moreover, it is difficult to deal with misspelled words in a corpus (i.e., each misspelled word may be treated as a new word in the vocabulary).
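A minimal sketch of this word-level approach, assuming whitespace pre-tokenization; the cutoff value and the helper names (build_word_vocab, tokenize_words) are ours for illustration:

from collections import Counter

def build_word_vocab(corpus, min_freq=100):
    # Keep only words that occur at least `min_freq` times in the corpus.
    counts = Counter(word for sentence in corpus for word in sentence.split())
    vocab = {word for word, freq in counts.items() if freq >= min_freq}
    vocab.add("<unk>")          # special token for everything else
    return vocab

def tokenize_words(sentence, vocab):
    # OOV (and misspelled) words all collapse to the single token <unk>.
    return [word if word in vocab else "<unk>" for word in sentence.split()]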

So we need a tokenizer that does not blow up the size of the vocabulary as in a character level tokenizer and also preserves the meaning of words as in a word level tokenizer.


Sub-Word Level Tokenization

Input: Hmm.. I know, I don't know
Tokens: [Hmm.., I, know,, I, do, n't, know]
Number of tokens: 7
\(\mathcal{V}\): {Hmm.., I, know, do, n't}
\(|\mathcal{V}|\): 5

The size of the vocabulary is carefully built based on the frequency of occurrence of subwords

For example, the subword-level tokenizer breaks "don't" into "do" and "n't" after learning that "n't" occurs frequently in the corpus.

We then have a representation for "do" and a representation for "n't".

Therefore, subword level tokenizers are  preferred for LLMs.

The size of the vocabulary is moderate

A middle ground between character level and word level tokenizers

The most frequently occuring words are preserved as is and rare words are split into subword units.

Categories

c h a r a c t e r  l e v e l

word level

sub-word level

* There are many categories; these three are the most commonly used.

WordPiece

SentencePiece

BytePairEncoding (BPE)

Wishlist 

Moderate-sized Vocabulary 

Efficiently handle unknown words during inference

Be language agnostic

Module 4.2 : Byte Pair Encoding

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

General Pre-processing steps

Input text: Hmm.., I know I Don't know
→ normalization → hmm.., i know i don't know
→ pre-tokenization → [hmm.., i, know, i, don't, know]
→ subword tokenization → [hmm.., i, know, i, do, #n't, know]

Splitting the text by whitespace was traditionally called tokenization. However, when it is used with a sub-word tokenization algorithm, it is called pre-tokenization.

First, the text is normalized, which involves operations such as handling letter case, removing accents, eliminating multiple whitespaces, handling HTML tags, etc.

The input text corpus is often built by scraping web pages and ebooks.
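A minimal sketch of such a normalization and pre-tokenization step, using only the Python standard library (the exact set of operations varies from tokenizer to tokenizer; the function names are ours):

import re
import unicodedata

def normalize(text):
    text = text.lower()                                   # handle case: "Don't" -> "don't"
    text = unicodedata.normalize("NFD", text)             # decompose accented characters
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) != "Mn")   # drop the accent marks
    text = re.sub(r"\s+", " ", text).strip()              # collapse multiple whitespaces
    return text

def pre_tokenize(text):
    # Whitespace splitting; called pre-tokenization when a subword tokenizer follows.
    return text.split()

print(pre_tokenize(normalize("Hmm..,  I know I Don't know")))
# ['hmm..,', 'i', 'know', 'i', "don't", 'know']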

Learn the vocabulary (called training) using these words

The tokenization schemes follow a sequence of steps to learn the vocabulary.

[Figure: Text Corpus → Pre-process → Learn the vocabulary → \(\mathcal{V}\). At inference, an input such as "I enjoyed the movie." is pre-processed to "I enjoyed the movie", tokenized as [I, enjoy, #ed, the, movie], post-processed to [I, enjoy, #ed, the, movie, .], and fed to the Model.]

In translation, one has to deal with an open vocabulary (names, numbers, units, ...), and therefore it is more challenging to handle unknown tokens.

Motivation for subwords

[Figure: encoder-decoder machine translation model; tokens from the source language enter the encoder, tokens from the target language are produced by the decoder]

Consider the problem of machine translation.

Restricting the size of the vocabulary introduces more unknown tokens in both the source and target languages, and therefore results in poor translations.

Often, OOV tokens are the rare words (below the minimum frequency of occurrence) that were excluded from the vocabulary while constructing it. For example, the word "plural" may be present in the vocabulary but not "plurify".

An effective approach would relate such a word to existing words in the vocabulary by breaking it into subword units (for example, "plur + ify").


This is motivated by observing how unknown tokens, such as names, are translated.

Observation

Various word classes (for example, names) are translatable via smaller units than words.

English: Chandrayan
Tamil: rendered as a sequence of smaller subword units: ந்தி, ரா, ய, ன்

Why not apply the same idea to rare words in the corpus? Let's see how.

Algorithm

import re, collections

def get_stats(vocab):
    # Count the frequency of each pair of adjacent symbols, weighted by word frequency.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the pair (as separate symbols) with the merged symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Dictionary of words (with </w> appended) and their counts.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10   # number of merges: a hyperparameter

for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # pair with the maximum occurrence
    vocab = merge_vocab(best, vocab)
    print(best)                        # the merge learned at this step

Start with a dictionary that contains words and their counts.

Append a special symbol </w> at the end of each word in the dictionary.

Initialize the character-frequency table (the base vocabulary).

Set the required number of merges (a hyperparameter).

Get the frequency count for every pair of adjacent symbols (get_stats).

Merge the pair with the maximum occurrence (merge_vocab).
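As a sanity check, the two functions above can be run on the example corpus used next; under the frequencies below, the first two merges they learn are ('i', 'n') and then ('in', 'g'), matching the walk-through that follows:

# Word counts (with </w> appended) for the corpus
# "knowing the name of something is different from knowing something.
#  knowing something about everything isn't bad"
example_vocab = {
    'k n o w i n g </w>': 3, 't h e </w>': 1, 'n a m e </w>': 1,
    'o f </w>': 1, 's o m e t h i n g </w>': 2, 'i s </w>': 1,
    'd i f f e r e n t </w>': 1, 'f r o m </w>': 1,
    's o m e t h i n g . </w>': 1, 'a b o u t </w>': 1,
    'e v e r y t h i n g </w>': 1, "i s n ' t </w>": 1, 'b a d </w>': 1,
}

for step in range(2):                    # first two merges only
    pairs = get_stats(example_vocab)
    best = max(pairs, key=pairs.get)     # most frequent byte-pair
    example_vocab = merge_vocab(best, example_vocab)
    print(step + 1, best)
# 1 ('i', 'n')
# 2 ('in', 'g')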

Example

Corpus: "knowing the name of something is different from knowing something. knowing something about everything isn't bad"

First, let's count the word frequencies, appending </w> to the end of each word (</w> identifies the word boundary):

Word                          Frequency
'k n o w i n g </w>'          3
't h e </w>'                  1
'n a m e </w>'                1
'o f </w>'                    1
's o m e t h i n g </w>'      2
'i s </w>'                    1
'd i f f e r e n t </w>'      1
'f r o m </w>'                1
's o m e t h i n g . </w>'    1
'a b o u t </w>'              1
'e v e r y t h i n g </w>'    1
"i s n ' t </w>"              1
'b a d </w>'                  1

Objective: find the most frequently occurring byte-pair.


We can count character frequencies from the word-frequency table. For instance, we can infer that 'k' has occurred three times: the only word in this corpus that contains 'k' is "knowing", and it has occurred three times.

Initial tokens and counts (the corpus has 13 unique words; the initial character vocabulary has 22 tokens):

Character   Frequency
'k'         3
'n'         13   (= 2×3 for "knowing" + 1 for "name" + 1×2 for "something" + 1 each for "different", "something.", "everything", "isn't")
'o'         9
'</w>'      16   (= 3 + 1 + 1 + 1 + 2 + ... + 1)
"'"         1
...         ...

Next, count the frequency of every byte-pair (pair of adjacent tokens), again weighted by word frequency:

Byte-pair        Frequency
('k', 'n')       3
('n', 'o')       3
('o', 'w')       3
('w', 'i')       3
('i', 'n')       7   (= 3 + 2 + 1 + 1)
('n', 'g')       7
('g', '</w>')    6
('t', 'h')       5
('h', 'e')       1
('e', '</w>')    2
('a', 'd')       1
('d', '</w>')    1
...              ...

Merge the most commonly occurring pair: \((i, n) \rightarrow in\).

Add the new token "in" (count 7) to the vocabulary and update the counts of the tokens it absorbs:

Character   Frequency
'k'         3
'n'         13 - 7 = 6
'o'         9
'i'         10 - 7 = 3
'</w>'      16
"'"         1
'in'        7    (new token)
...         ...

After the merge, the words in the corpus are re-segmented with "in" as a single token:

Word                          Frequency
'k n o w in g </w>'           3
't h e </w>'                  1
'n a m e </w>'                1
'o f </w>'                    1
's o m e t h in g </w>'       2
'i s </w>'                    1
'd i f f e r e n t </w>'      1
'f r o m </w>'                1
's o m e t h in g . </w>'     1
'a b o u t </w>'              1
'e v e r y t h in g </w>'     1
"i s n ' t </w>"              1
'b a d </w>'                  1

Updated vocabulary (excerpt): 'k' 3, 'n' 6, 'o' 9, 'i' 3, '</w>' 16, "'" 1, 'g' 7, 'in' 7, ...

Now, treat "in" as a single token and repeat the steps.

Recomputing the byte-pair counts, the pairs involving "in" change: we now count (w, in) instead of (w, i), and the new pairs are (w, in): 3, (in, g): 7 and (h, in): 4.

Byte-pair        Frequency
('k', 'n')       3
('n', 'o')       3
('o', 'w')       3
('w', 'in')      3
('h', 'in')      4
('in', 'g')      7
('g', '</w>')    6
('t', 'h')       5
('h', 'e')       1
('e', '</w>')    2
('a', 'd')       1
('d', '</w>')    1
...              ...

Of all these pairs, the most frequently occurring byte-pair is (in, g), so it is merged into the new token "ing" (count 7) and added to the vocabulary.

Now, treat "ing" as a single token and repeat the steps.

After 45 merges

The final vocabulary contains the initial vocabulary and all the merges (in order), for example: 'k', 'n', 'o', 'i', '</w>', 'in', 'ing', ... The rare words are broken down into two or more subwords.

At test time, the input word is split into a sequence of characters, and the characters are merged into larger, known symbols. For example:

everything → tokenizer → everyth, ing

Here, the pair ('i', 'n') is merged first, followed by the pair ('in', 'g').

bad → tokenizer → b, a, d → merged back into the known token "bad"

For a large corpus, we often end up with a vocabulary that is smaller than one that treats individual words as tokens.

The algorithm offers a way to adjust the size of the vocabulary as a function of the number of merges.

BPE for non-segmented languages

Languages such as Japanese and Korean are non-segmented 

However, BPE requires space-separated words

How do we apply BPE for non-segmented languages then?

In practice, we can use language-specific, morphology-based word segmenters such as Juman++ for Japanese (as shown below).

Input text: 映画『トランスフォーマー』を楽しく見ました
→ pre-tokenization with a word-level segmenter (Juman++ or Mecab): [Eiga, toransufōmā, o, tanoshiku, mimashita]
→ run BPE: [Eiga, toransufōmā, o, tanoshi, #ku, mi, #mashita]

However, in the case of multilingual translation, having a language-agnostic tokenizer is desirable.

Example:

\(\mathcal{V}=\{a,b,c,\cdots,z,lo,he\}\)

Tokenize the text "hello lol"

[hello], [lol]

Split each word into characters, then search for the byte-pair 'lo'; if present, merge.

Yes, so merge:

[h e l lo], [lo l]

Search for the byte-pair 'he'; if present, merge.

Yes, so merge:

[he l lo], [lo l]

Return the text after the merges:

he #l #lo, lo #l

def tokenize(text):
    # `tokenizer` is assumed to be a pre-trained Hugging Face tokenizer (used here only
    # for its pre-tokenizer) and `merges` the ordered dict of learned merge rules,
    # e.g. {('l', 'o'): 'lo', ('h', 'e'): 'he', ...}.
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    # Split every pre-token into individual characters.
    splits = [[l for l in word] for word in pre_tokenized_text]
    # Apply the merges in the order they were learned.
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split
    # Concatenate the per-word token lists into one list.
    return sum(splits, [])

Search for each byte-pair through the entire input text, in the order the merges were inserted into the vocabulary, and merge wherever the pair is found: first into 2-grams, then 3-grams, and so on.
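For illustration, here is a self-contained variant of the same procedure; it assumes plain whitespace pre-tokenization instead of the Hugging Face pre-tokenizer, and uses the toy merge rules from the "hello lol" example above:

def bpe_tokenize(text, merges):
    # Start from individual characters of each whitespace-separated word.
    splits = [list(word) for word in text.split()]
    # Apply the learned merges in the order they were learned.
    for pair, merged in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if (split[i], split[i + 1]) == pair:
                    split = split[:i] + [merged] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split
    return [token for split in splits for token in split]

merges = {('l', 'o'): 'lo', ('h', 'e'): 'he'}      # from the vocabulary {a, ..., z, lo, he}
print(bpe_tokenize("hello lol", merges))           # ['he', 'l', 'lo', 'lo', 'l']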

Module 4.3 : WordPiece

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

In BPE, we merged the pair of tokens with the highest frequency of occurrence.

WordPiece, instead, also takes the frequency of occurrence of the individual tokens in the pair into account.

Byte-pair        Frequency
('k', 'n')       3
('n', 'o')       3
('o', 'w')       3
('w', 'i')       3
('i', 'n')       7
('n', 'g')       7
('g', '.')       1
('t', 'h')       5
('h', 'e')       1
('e', '</w>')    2
('a', 'd')       1
...              ...

What if more than one pair occurs with the same frequency, for example ('i', 'n') and ('n', 'g')?

score=\frac{count(\alpha,\beta)}{count(\alpha)count(\beta)}

Now we can select a pair of tokens where the individual tokens are less frequent in the vocabulary

The WordPiece algorithm uses this score to merge the pairs.

Word count

Word                          Frequency
'k n o w i n g'               3
't h e'                       1
'n a m e'                     1
'o f'                         1
's o m e t h i n g'           2
'i s'                         1
'd i f f e r e n t'           1
'f r o m'                     1
's o m e t h i n g .'         1
'a b o u t'                   1
'e v e r y t h i n g'         1
"i s n ' t"                   1
'b a d'                       1

Initial vocabulary (excerpt):

Character   Frequency
'k'         3
'#n'        13
'#o'        9
't'         16
'#h'        5
"'"         1
...         ...

Initial vocabulary size: 22

For example, the word "knowing" is split as: k #n #o #w #i #n #g

Subwords are identified by the prefix ## (we use a single # for illustration).


Ignoring the prefix #, the scores for a few pairs are:

Pair          count(α)   count(β)   score
('k', 'n')    3          13         0.076
('n', 'o')    13         9          0.02
('o', 'w')    9          3          0.111
('i', 'n')    10         13         0.05
('n', 'g')    13         7          0.076
('t', 'h')    8          5          0.125
('a', 'd')    3          2          0.16

score=\frac{count(\alpha,\beta)}{count(\alpha)count(\beta)}

Now, merging is based on the score of each byte-pair: merge the pair with the highest score.
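A small sketch of this score-based selection in Python, using the pair counts and token counts from the tables above (the function name is ours):

# Pair counts and token counts (ignoring the # prefix) from the example above.
pair_counts  = {('k', 'n'): 3, ('n', 'o'): 3, ('o', 'w'): 3, ('i', 'n'): 7,
                ('n', 'g'): 7, ('t', 'h'): 5, ('a', 'd'): 1}
token_counts = {'k': 3, 'n': 13, 'o': 9, 'w': 3, 'i': 10,
                'g': 7, 't': 8, 'h': 5, 'a': 3, 'd': 2}

def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (token_counts[a] * token_counts[b])

for pair in pair_counts:
    print(pair, round(wordpiece_score(pair), 3))

best = max(pair_counts, key=wordpiece_score)
print("merge:", best)   # ('a', 'd') has the highest score (~0.167) among the pairs shown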

In BPE and WordPiece, we start with a character-level vocabulary (roughly 1-112 tokens) and keep merging until a desired vocabulary size is reached: after N merges we have a medium vocabulary (30K-50K); with more merges, a larger vocabulary (100K-250K).

Well, we can also do the reverse: start with a large word-level vocabulary (100K-250K) and keep eliminating tokens, N eliminations at a time, until the desired vocabulary size (say 30K-50K) is reached.

That's what we see next.

Module 4.4 : SentencePiece

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Mitesh M. Khapra

A given word can be segmented into subwords in numerous ways. For instance, the word \(X\)="hello" can be segmented in multiple ways (by BPE), even with the same vocabulary:

\(\mathbf{x_1}=(he, ll, o)\), \(\mathbf{x_2}=(h, el, lo)\), \(\mathbf{x_3}=(he, l, lo)\), \(\mathbf{x_4}=(hell, o)\)

If \(\mathcal{V}\)={h, e, l, l, o, he, el, lo, ll, hell}, then, following the merge rules in that order, BPE outputs (he, l, lo).

On the other hand, if \(\mathcal{V}\)={h, e, l, l, o, el, he, lo, ll, hell}, then BPE outputs (h, el, lo).

Therefore, we say BPE is greedy and deterministic (we can use BPE-Dropout [Ref] to make it stochastic).

The probabilistic approach is to find the subword sequence \(\mathbf{x^*} \in \{\mathbf{x_1},\mathbf{x_2},\cdots,\mathbf{x_k}\}\) that maximizes the likelihood of the word \(X\)

The word \(X\) in SentencePiece means a sequence of characters or words (without spaces).

Therefore, it can be applied to languages (like Chinese and Japanese)  that do not use any word delimiters in a sentence.

[Figure: \(X\) is observed; the segmentation of \(X\) into subwords is hidden (latent).]

Let \(\mathbf{x}=(x_1, x_2, \dots, x_n)\) denote a subword sequence of length \(n\).

Then the probability of the subword sequence (with a unigram LM) is simply

P(\mathbf{x}) = \prod\limits_{i=1}^{n} P(x_i), \qquad \sum \limits_{x \in \mathcal{V}} p(x)=1

The objective is to find, for the input sequence \(X\), the subword sequence (from all possible segmentation candidates \(S(X)\)) that maximizes the (log-)likelihood of the sequence:

\mathbf{x}^* = \argmax_{\mathbf{x} \in S(X)} P(\mathbf{x})

Recall that the subwords \(x_i\) are hidden (latent) variables.

Then, for all the sequences  in the dataset \(D\), we define the likelihood function as

\mathcal{L}= \sum \limits_{s=1}^{|D|}\log(P(X^s))
= \sum \limits_{s=1}^{|D|}\log \Big(\sum \limits_{\mathbf{x} \in S(X^s)} P(\mathbf{x}) \Big)

Therefore, given the vocabulary \(\mathcal{V}\), the Expectation-Maximization (EM) algorithm can be used to maximize the likelihood.

We can use Viterbi decoding to find \(\mathbf{x}^*\).

Recall the word-frequency table of the example corpus used earlier (16 word tokens in total).

Let \(X=\) "knowing"  and a few segmentation candidates be \(S(X)=\{`k,now,ing`,`know,ing`,`knowing`\}\)

p(\mathbf{x_1}=k,now,ing)
=p(k)p(now)p(ing)
=\frac{3}{16} \times \frac{3}{16} \times \frac{7}{16}
=\frac{63}{4096}
p(\mathbf{x_2}=know,ing)
=p(know)p(ing)
=\frac{21}{256}=\frac{336}{4096}
p(\mathbf{x_3}=knowing)
=p(knowing)
=\frac{3}{16}=\frac{768}{4096}

Unigram model favours the segmentation with least number of subwords

Given the unigram language model we can calculate the probabilities of the segments as follows

\mathbf{x}^* = \argmax_{\mathbf{x} \in S(X)} P(\mathbf{x})=\mathbf{x_3}

In practice, we use Viterbi decoding to find \(\mathbf{x}^*\) instead of enumerating all possible segments
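For this small candidate set we can simply enumerate; a brute-force sketch under the unigram probabilities used above (p(x) = count(x)/16 in this toy corpus):

import math

p = {'k': 3/16, 'now': 3/16, 'know': 3/16, 'ing': 7/16, 'knowing': 3/16}
candidates = [('k', 'now', 'ing'), ('know', 'ing'), ('knowing',)]

def seg_prob(segmentation):
    # Unigram LM: the probability of a segmentation is the product of its token probabilities.
    return math.prod(p[token] for token in segmentation)

for seg in candidates:
    print(seg, seg_prob(seg))

print("x* =", max(candidates, key=seg_prob))   # ('knowing',): fewest subwords wins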

Algorithm

Set the desired vocabulary size.

  1. Construct a reasonably large seed vocabulary using BPE or the Extended Suffix Array algorithm.

  2. E-Step:

    Estimate the probability of every token in the given vocabulary using frequency counts in the training corpus.

  3. M-Step:

    Use the Viterbi algorithm to segment the corpus and return the optimal segments that maximize the (log-)likelihood.

  4. Compute the likelihood of each new subword from the optimal segments.

  5. Shrink the vocabulary by removing the top \(x\%\) of subwords that have the smallest likelihood.

  6. Repeat steps 2 to 5 until the desired vocabulary size is reached.

Let us consider segmenting the word "whereby" using Viterbi decoding

\(\mathcal{V}\) (a sample vocabulary): {k, a, no, b, e, d, f, now, bow, bo, in, om, ro, ry, ad, ng, out, eve, win, some, bad, owi, ing, hing, thing, g, z, er, ...}
The log-probabilities of the tokens relevant to "whereby" are:

Token    log p(x)
b        -4.7
e        -2.7
h        -3.34
r        -3.36
w        -4.29
wh       -5.99
er       -5.34
where    -8.21
by       -7.34
he       -6.02
ere      -6.83
here     -7.84
her      -7.38
re       -6.13

At \(t=1\), the only candidate segmentation of the prefix "w" is (w), with log-likelihood -4.29.

Forward algorithm: iterate over every position in the given word and, at each position, keep the segmentation of the prefix with the highest log-likelihood; finally, output the segmentation with the highest likelihood.

At \(t=2\), the possible segmentations of the slice "wh" are (w, h) and (wh). Compute the log-likelihood of both and keep the best one:

(w, h): -4.29 - 3.34 = -7.63
(wh): -5.99   ← best

At \(t=3\) (slice "whe"), the candidates are:

(wh, e): -5.99 - 2.70 = -8.69   ← best
(w, he): -4.29 - 6.02 = -10.31
(w, h, e): -7.63 - 2.70 = -10.33 (we do not actually need to compute this, since (w, h) was already ruled out in favour of (wh); it is shown for completeness)
(whe): -∞ (not in the vocabulary)

Of these, (wh, e) is the best segmentation that maximizes the likelihood.

At \(t=4\) (slice "wher"), the candidates are:

(wh, er): -5.99 - 5.34 = -11.33   ← best
(w, her): -4.29 - 7.38 = -11.67
(wh, e, r): -8.69 - 3.36 = -12.05
(wher): -∞

At \(t=5\) (slice "where"), the candidates are:

(where): -8.21   ← best
(w, here): -4.29 - 7.84 = -12.13
(wh, ere): -5.99 - 6.83 = -12.82
(wh, er, e): -11.33 - 2.70 = -14.03
At \(t=6\) (slice "whereb"), the candidates are:

(where, b): -8.21 - 4.70 = -12.91   ← best
(whereb): -∞

At \(t=7\) (the full word "whereby"), the candidates are:

(where, by): -8.21 - 7.34 = -15.55   ← best
all other candidates end in a token that is not in the vocabulary (e.g. 'y'), so they get -∞
Backtrack

Backtracking through the kept segments, the best segmentation of the word "whereby" that maximizes the likelihood is (where, by).
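A minimal Viterbi sketch (forward pass plus backtracking) for this unigram segmentation, using the log-probability table above; any slice that is not in the table gets log-probability -∞:

import math

log_p = {'b': -4.7, 'e': -2.7, 'h': -3.34, 'r': -3.36, 'w': -4.29,
         'wh': -5.99, 'er': -5.34, 'where': -8.21, 'by': -7.34,
         'he': -6.02, 'ere': -6.83, 'here': -7.84, 'her': -7.38, 're': -6.13}

def viterbi_segment(word, log_p):
    n = len(word)
    best = [-math.inf] * (n + 1)    # best[i]: best log-likelihood of word[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last token in that best split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            token = word[start:end]
            if token in log_p and best[start] + log_p[token] > best[end]:
                best[end] = best[start] + log_p[token]
                back[end] = start
    # Backtrack from the end of the word to recover the segments.
    segments, end = [], n
    while end > 0:
        segments.append(word[back[end]:end])
        end = back[end]
    return segments[::-1], best[n]

print(viterbi_segment("whereby", log_p))   # (['where', 'by'], ≈ -15.55)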


We can follow the same procedure for languages that do not use any word delimiters in a sentence.

Morfessor

For example, consider the word

Sesquipedalophobia (fear of long words)

and suppose SentencePiece returns the single best segmentation

Ses-qui-ped-alo-phobia

However, for some reason, we want the segmentation to be

Ses-qui-pedalo-phobia

Recall that the unigram model uses Maximum Likelihood Estimation (MLE) as its objective function.

Another algorithm related to SentencePiece is Morfessor [Ref], which uses Maximum a Posteriori (MAP) estimation as its objective function, allowing us to incorporate a prior.

In general, Morfessor allows us to incorporate linguistically motivated prior knowledge into the tokenization procedure.

It also allows us to add a small amount of annotated data while training the model.

It has been used for large-vocabulary continuous speech recognition.

References

The WordPiece model uses a likelihood instead of a frequency.

SentencePiece is called SentencePiece because it treats the entire sentence (corpus) as a single input and finds the subwords for it.

Substrings and subwords are not one and the same.

The suffix array A of S is defined to be an array of integers providing the starting positions of the suffixes of S in lexicographical order.

Counting subwords: https://math.stackexchange.com/questions/104258/counting-subwords
