Lecture 4: Tokenizers: BPE, WordPiece and SentencePiece

Mitesh M. Khapra, Arun Prakash A

AI4Bharat, Department of Computer Science and Engineering, IIT Madras

Introduction to Large Language Models

Introduction

Recall that language modelling involves computing probabilities over a sequence of tokens

P(x_1,x_2,\cdots,x_T) = P(x_1)\prod \limits_{i=2}^T P(x_i|x_1,\cdots,x_{i-1})
= P(x_1)P(x_2|x_1) \cdots P(x_T|x_1,\cdots,x_{T-1})

This requires us to split the input text into tokens in a consistent way.

Let's see where the tokenizer fits in the training and inference pipeline

[Figure: the tokenizer in the pipeline. Input text "How are you?" → Tokenizer → tokens \{How, are, you\} → token ids \{How: 1, are: 2, you: 3\} → embeddings → Language Model → ids to tokens. Example embeddings: 1: [0.1, 0.35, \cdots, 0.08], 2: [0.01, 1.00, \cdots, 0.8], 3: [0.12, 0.00, \cdots, 0.46]]

 A simple approach is to use whitespace for splitting the text.

The first step is to build (learn) the vocabulary \(\mathcal{V}\) that contains unique tokens (\(x_i\)).

Therefore, the fundamental question is how do we build (learn) a vocabulary \(\mathcal{V}\) from a large corpus that contains billions of sentences?

We also add special tokens such as <go>, <stop>, <mask>, <sep>, <cls> and others to the vocabulary, depending on the type of downstream tasks and the choice of architecture (GPT/BERT)

[Figure: the tokens <go>, I, enjoyed, the, movie, transformers, <stop> fed to a model built from Multi-Head Attention and Feed Forward Network blocks]

We can split the text into words using whitespace (called pre-tokenization) and add all unique words in the corpus to the vocabulary.
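A minimal sketch of this whitespace approach is given below. The toy corpus, the helper name build_vocab and the choice to give special tokens the lowest ids are assumptions made for illustration, not part of the lecture.

```python
# Hypothetical toy corpus; a real corpus would contain billions of sentences.
corpus = ["I enjoyed the movie", "I enjoyed the book"]

# Special tokens are placed first so that they receive the lowest ids (an assumption).
special_tokens = ["<go>", "<stop>", "<unk>"]

def build_vocab(sentences, specials):
    """Map every unique whitespace-separated word to an integer id."""
    token_to_id = {tok: idx for idx, tok in enumerate(specials)}
    for sentence in sentences:
        for word in sentence.split():  # whitespace pre-tokenization
            if word not in token_to_id:
                token_to_id[word] = len(token_to_id)
    return token_to_id

vocab = build_vocab(corpus, special_tokens)
ids = [vocab[w] for w in "I enjoyed the movie".split()]
print(vocab)
print(ids)
```

Every distinct surface form gets its own id here, which is exactly why the questions that follow (inflected forms, languages without spaces) matter.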

Is splitting the input text into words using whitespace a good approach?

What about languages like Japanese, which do not use any word delimiters like space?

Why not treat each individual character in a language as a token in the vocabulary?

Some Questions

In that case, do we treat the words "enjoy" and "enjoyed" as separate tokens?

Finally, what are good tokens?


映画『トランスフォーマー』を楽しく見ました (Japanese: "I enjoyed watching the movie Transformers")

Challenges in building a vocabulary

What should be the size of vocabulary?

The larger the vocabulary, the larger the embedding matrix and the greater the cost of computing softmax probabilities. What is the optimal size?

Out-of-vocabulary

If we limit the size of the vocabulary (say, from 250K to 50K), then we need a mechanism to handle out-of-vocabulary (OOV) words. How do we handle them?

 Handling misspelled words in corpus

Often, the corpus is built by scraping the web. There are chances of typo/spelling errors. Such erroneous words should not be considered as unique words.

Open Vocabulary problem 

A new word can be constructed (say, in agglutinative languages) by combining existing words. The vocabulary, in principle, is infinite (names, numbers, ...), which makes a task like machine translation challenging

Module 4.1 : Types of tokenization


It would be interesting to measure the size of vocabulary of native (non-native) speakers

If we treat two words as different when they differ in meaning, then words like "bank" and "air" have multiple meanings

If we treat words as group of characters arranged in order, then the words in the list [ "learn", "learns", "learning", "learnt" ] are all different. 

This increases the size of vocabulary which in turn increases the computational requirement.

In practice, the size of the vocabulary varies based on the corpus. It is often in the order of tens of thousands

Source: lemongrad

What constitutes a word? 

One way to reduce the size of the vocabulary is to consider individual characters in the input text as tokens

Hmm.. I know, I don't know

[H, m, m, ., ., I, k, n, o, w, ,, I, d, o, n, ', t, k, n, o, w]   

21

Number of tokens

Character Level Tokenization

Encoder

 \(\mathcal{V}\): {H, m, ., I, k, n, o, w, ,, d, ', t}

 \(|\mathcal{V}|\): 12

The size of the vocabulary is small and finite even for a large corpus of text.

It solves both open vocabulary and out-of-vocabulary problems 

However, it loses semantic meaning of words

The number of characters in a sentence is, on average, about 5 times higher than the number of words in a sentence.

This increases the computational complexity of the models as the context window size increases 5 times (on average)

On the other hand, because the vocabulary is small, computing softmax probabilities is not a bottleneck.
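The trade-off can be checked on the running sentence with a short sketch. It assumes that whitespace is dropped for the character-level view, which is what the 21-token count on the slide implies.

```python
text = "Hmm.. I know, I don't know"

# Character-level tokenization: every non-space character is a token.
char_tokens = [ch for ch in text if not ch.isspace()]

# Word-level tokenization for comparison: split on whitespace.
word_tokens = text.split()

print(len(char_tokens))          # number of character tokens
print(len(word_tokens))          # number of word tokens
print(sorted(set(char_tokens)))  # character vocabulary of this sentence
```

The character sequence is several times longer than the word sequence, while the set of distinct characters stays small.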

Let's look at the word level tokenizer once again

[Hmm.., I, know,, I, don't, know]   

6

Number of tokens

Word Level Tokenization

\(\mathcal{V}\): {Hmm..,I, know, don't}

\(|\mathcal{V}|\): 4

In general, the size of the vocabulary grows based on the number of unique words in a corpus of text.

In practice, it could range anywhere from 30,000 to 250,000 based on the size of the corpus

Therefore, computing softmax probabilities to predict the token at each time step becomes expensive 

One approach to limit the size of the vocabulary is to consider the words with a minimum frequency of occurrence, say, at least 100 times in the corpus.


Words that are not in the vocabulary (the out-of-vocabulary problem) are replaced by a special token, <unk>, during training and inference. Doing this gives suboptimal performance for tasks like text generation and translation.

Moreover, it is difficult to deal with misspelled words in a corpus (i.e., each misspelled word may be treated as a new word in the vocabulary)
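A small sketch of the frequency cut-off and the <unk> substitution follows. The corpus, the threshold of 2 and the function names are hypothetical and chosen only to keep the example tiny.

```python
from collections import Counter

def build_vocab(corpus, min_freq):
    """Keep only words that occur at least `min_freq` times."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    vocab = {"<unk>": 0}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace every out-of-vocabulary word with the id of <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

corpus = ["i know", "i don't know", "i knwo"]  # "knwo" is a deliberate typo
vocab = build_vocab(corpus, min_freq=2)
print(vocab)
print(encode("i knwo plurify", vocab))
```

Both the typo and the unseen word collapse to the same <unk> id, so the model cannot tell them apart.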

So we need a tokenizer that does not blow up the size of the vocabulary as a word-level tokenizer does, does not blow up the sequence length as a character-level tokenizer does, and still preserves the meaning of words.


Hmm.. I know, I don't know

[Hmm.., I, know,, I, do, n't, know]   

7

Number of tokens

Sub-Word Level Tokenization

\(\mathcal{V}\): {Hmm..,I, know, do, n't}

\(|\mathcal{V}|\): 5

The vocabulary is carefully built based on the frequency of occurrence of subwords

For example, the subword-level tokenizer breaks "don't" into "do" and "n't" by learning that "n't" occurs frequently in a corpus.

We then have a representation for "do" and a representation for "n't"

Therefore, subword level tokenizers are  preferred for LLMs.

The size of the vocabulary is moderate

A middle ground between character level and word level tokenizers

The most frequently occurring words are preserved as is and rare words are split into subword units.

Categories

character level

word level

sub-word level

* There are many categories, these three are most commonly used

WordPiece

SentencePiece

Byte Pair Encoding (BPE)

Wishlist 

Moderate-sized Vocabulary 

Efficiently handle unknown words during inference

Be language agnostic

Module 4.2 : Byte Pair Encoding


General Pre-processing steps

Input text: Hmm.., I know I Don't know

Normalization: hmm.., i know i don't know

Pre-tokenization: [hmm.., i, know, i, don't, know]

Subword Tokenization: [hmm.., i, know, i, do, #n't, know]

Splitting the text by whitespace was traditionally called tokenization. However, when it is used with a sub-word tokenization algorithm, it is called pre-tokenization.

First the text is normalized, which involves operations such as case folding, removing accents, collapsing multiple whitespaces, handling HTML tags, etc.
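The normalization step could look roughly like the sketch below. The exact set and order of operations is an assumption (real tokenizers differ), and the HTML handling is deliberately crude.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, strip accents, drop HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # crude HTML tag removal
    text = unicodedata.normalize("NFD", text)  # split base characters from accents
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hmm..,   I know I <b>Don't</b> know"))
print(normalize("Café"))
```

Stripping accents and case loses information, so whether to apply each operation depends on the language and the task.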

The input text corpus is often built by scraping web pages and ebooks.

Learn the vocabulary (called training) using these words

The tokenization schemes follow a sequence of steps to learn the vocabulary.

[Figure: Text Corpus → Pre-process → Learn the vocabulary → \(\mathcal{V}\)]

[Figure: "I enjoyed the movie." → Preprocess → Model → Post-process, with the intermediate outputs "I enjoyed the movie", [I enjoy #ed the movie] and [I enjoy #ed the movie .]]

In translation one has to deal with an open vocabulary (names, numbers, units, ...) and therefore it is more challenging to handle unknown tokens

Motivation for subwords

[Figure: Encoder (tokens from source language) and Decoder (tokens from target language)]

Consider a problem of machine translation 

Restricting the size of the vocabulary introduces more unknown tokens in both the source and target languages and therefore results in poor translation.

Often, OOV tokens are the rare words (those below a minimum frequency of occurrence) that were excluded from the vocabulary while constructing it. For example, the word "plural" may be present in the vocabulary but not the word "plurify".

An effective approach would relate this word to the existing words in the vocabulary by breaking the word into subword units (for example, "plur+ify").


This is motivated by observing how the unknown tokens like names are translated 

Observation: various word classes (for example, names) are translatable via smaller units than words

English: Chandrayan

Tamil: ந்தி ரா ய ன்

Why not apply the same idea to rare words in the corpus?

Let's see how


Algorithm

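import re
import collections

def get_stats(vocab):
  """Count adjacent symbol pairs, weighted by the frequency of each word."""
  pairs = collections.defaultdict(int)
  for word, freq in vocab.items():
    symbols = word.split()
    for i in range(len(symbols)-1):
      pairs[symbols[i],symbols[i+1]] += freq
  return pairs

def merge_vocab(pair, v_in):
  """Replace every occurrence of the symbol pair with the merged symbol."""
  v_out = {}
  bigram = re.escape(' '.join(pair))
  # match the pair only when both symbols are whole (delimited by whitespace)
  p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
  for word in v_in:
    w_out = p.sub(''.join(pair), word)
    v_out[w_out] = v_in[word]
  return v_out

# words are written as space-separated symbols ending in </w>, with their counts
vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,
'n e w e s t </w>':6, 'w i d e s t </w>':3}
num_merges = 10  # hyperparameter

for i in range(num_merges):
  pairs = get_stats(vocab)
  best = max(pairs, key=pairs.get)  # most frequent pair
  vocab = merge_vocab(best, vocab)
  print(best)  # print the pair merged in this iteration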

Start with a dictionary that contains words and their count

Append a special symbol </w> at the end of each word in the dictionary

Set required number of merges (a hyperparameter)

Get the frequency count for a pair of characters

Merge pairs with maximum occurrence

Initialize the character-frequency table (a base vocabulary)

Example

knowing the name of something is different from knowing something. knowing something about everything isn't bad

Word Frequency
'k n o w i n g </w>' 3
't h e </w>' 1
'n a m e </w>' 1
'o f </w>' 1
's o m e t h i n g </w>' 2
'i s </w>' 1
'd i f f e r e n t </w>' 1
'f r o m </w>' 1
's o m e t h i n g . </w>' 1
'a b o u t </w>' 1
'e v e r y t h i n g </w>' 1
"i s n ' t </w>" 1
'b a d </w>' 1

</w> identifies the word boundary.

Objective: Find most frequently occurring byte-pair 
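The word-frequency table above can be produced from the example sentence with a few lines. This is a sketch that assumes plain whitespace pre-tokenization and no normalization.

```python
from collections import Counter

text = ("knowing the name of something is different from knowing something. "
        "knowing something about everything isn't bad")

# Split on whitespace, then write each word as space-separated characters
# followed by the end-of-word marker </w>.
word_counts = Counter(text.split())
vocab = {" ".join(word) + " </w>": freq for word, freq in word_counts.items()}

for entry, freq in vocab.items():
    print(repr(entry), freq)
```

Note that "something" and "something." end up as two different entries because the full stop stays attached to the word.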


Let's count the word frequencies first. 

We can count character frequencies from the table

Initial token count

Character Frequency
'k' 3
\vdots

We could infer that "k" has occurred three times by counting the frequency of occurrence of words having the character "k" in it.

In this corpus, the only word that contains "k" is the word "knowing" and it has occurred three times.

Similarly, 'n' occurs 2*3 + 1*1 + 1*2 + 1*1 + 1*1 + 1*1 + 1*1 = 13 times.

Character Frequency
'k' 3
'n' 13
'o' 9
'</w>' 3+1+1+1+2+...+1 = 16
"'" 1
\vdots

Word count (vocab size): 13

Initial tokens and count (initial vocab size): 22
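The character counts can be reproduced mechanically from the word-frequency table. The sketch below simply weights each symbol by the frequency of the word it appears in; the variable names are illustrative.

```python
from collections import Counter

# Word-frequency table from the example (characters separated by spaces).
vocab = {
    "k n o w i n g </w>": 3, "t h e </w>": 1, "n a m e </w>": 1,
    "o f </w>": 1, "s o m e t h i n g </w>": 2, "i s </w>": 1,
    "d i f f e r e n t </w>": 1, "f r o m </w>": 1,
    "s o m e t h i n g . </w>": 1, "a b o u t </w>": 1,
    "e v e r y t h i n g </w>": 1, "i s n ' t </w>": 1, "b a d </w>": 1,
}

token_counts = Counter()
for word, freq in vocab.items():
    for symbol in word.split():
        # every occurrence of a symbol counts once per occurrence of the word
        token_counts[symbol] += freq

print(token_counts["k"], token_counts["n"], token_counts["o"], token_counts["</w>"])
print(len(token_counts))  # size of the initial (character-level) vocabulary
```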

Byte-Pair count

Pair Frequency
('k', 'n') 3
('n', 'o') 3
('o', 'w') 3
('w', 'i') 3
('i', 'n') 3+2+1+1 = 7
('n', 'g') 7
('g', '</w>') 6
('t', 'h') 5
('h', 'e') 1
('e', '</w>') 2
\vdots
('a', 'd') 1
('d', '</w>') 1
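The pair counts follow the same weighting idea as the character counts. Here is a minimal sketch that rebuilds the table from the example sentence; get_pair_counts is a hypothetical name for the counting step.

```python
from collections import Counter

text = ("knowing the name of something is different from knowing something. "
        "knowing something about everything isn't bad")
vocab = {" ".join(w) + " </w>": c for w, c in Counter(text.split()).items()}

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

pairs = get_pair_counts(vocab)
for pair in [("i", "n"), ("n", "g"), ("g", "</w>"), ("t", "h")]:
    print(pair, pairs[pair])
```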

Initial Vocabulary

Character Frequency
'k' 3
'n' 13
'o' 9
'i' 10
'</w>' 16
"'" 1
\vdots

Merge the most commonly occurring pair: \((i,n) \rightarrow in\)

Updated Vocabulary (update the token counts and add the new token)

Character Frequency
'k' 3
'n' 13 - 7 = 6
'o' 9
'i' 10 - 7 = 3
'</w>' 16
"'" 1
\vdots
"in" 7

Word Frequency
'k n o w in g </w>' 3
't h e </w>' 1
'n a m e </w>' 1
'o f </w>' 1
's o m e t h in g </w>' 2
'i s </w>' 1
'd i f f e r e n t </w>' 1
'f r o m </w>' 1
's o m e t h in g . </w>' 1
'a b o u t </w>' 1
'e v e r y t h in g </w>' 1
"i s n ' t </w>" 1
'b a d </w>' 1

Updated vocabulary

Character Frequency
'k' 3
'n' 6
'o' 9
'i' 3
'</w>' 16
"'" 1
'g' 7
\vdots
"in" 7

Now, treat "in" as a single token and repeat the steps.

Byte-Pair count (after merging "in")

Pair Frequency
('k', 'n') 3
('n', 'o') 3
('o', 'w') 3
('w', 'in') 3
('h', 'in') 4
('in', 'g') 7
('g', '</w>') 6
('t', 'h') 5
('h', 'e') 1
('e', '</w>') 2
\vdots
('a', 'd') 1
('d', '</w>') 1

Therefore, the new byte pairs are (w,in): 3, (in,g): 7, (h,in): 4

Note, at iteration 4, we treat (w,in) as a pair instead of (w,i)

Of all these pairs, merge the most frequently occurring byte pair, which turns out to be ('in', 'g') → "ing"

Updated vocabulary

\vdots
"in" 7
"ing" 7

Now, treat "ing" as a single token and repeat the steps
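The two merges traced above can be replayed with a short loop. This sketch stores each word as a tuple of symbols instead of a space-separated string (a design choice made here to avoid regular expressions), and ties between equally frequent pairs go to the pair that was seen first.

```python
from collections import Counter

text = ("knowing the name of something is different from knowing something. "
        "knowing something about everything isn't bad")
# Each word is a tuple of symbols ending in the </w> marker.
words = {tuple(w) + ("</w>",): c for w, c in Counter(text.split()).items()}

def pair_counts(words):
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, words):
    """Rewrite every word with the given adjacent pair fused into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(2):
    counts = pair_counts(words)
    best = max(counts, key=counts.get)  # on ties, the first pair seen wins
    words = merge_pair(best, words)
    merges.append(best)

print(merges)
```

With this tie-breaking, ('i', 'n') is merged before ('n', 'g'), so the second merge is ('in', 'g').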

After 45 merges

'k n o w i n g </w>'
't h e </w>
'n a m e </w>
'o f </w>'
's o m e t h i n g </w>
'i s </w>'
'd i f f e r e n t </w>'
'f r o m </w>'
's o m e t h i n g . </w>'
'a b o u t </w>'
'e v e r y t h'
"i s n ' t </w>
'b a d </w>'

The final vocabulary contains initial vocabulary and all the merges (in order). The rare words are broken down into two or more subwords

At test time, the input word is split into a sequence of characters, and the characters are merged into larger, known symbols

everything → tokenizer → everyth, ing

The pair ('i','n') is merged first, followed by the pair ('in','g')

For a larger corpus, we often end up with a vocabulary that is smaller than the one obtained by treating individual words as tokens

bad → tokenizer → b, a, d

Tokens
'k'
'n'
'o'
'i'
'</w>'
'in'
'ing'
\vdots

The algorithm offers a way to adjust the size of vocabulary as a function of number of merges.

\vdots

BPE for non-segmented languages

Languages such as Japanese and Korean are non-segmented 

However, BPE requires space-separated words

How do we apply BPE for non-segmented languages then?

In practice, we can use language specific morphology based word segmenters such as Juman++ for Japanese (as shown in the figure on the right)

Input text: 映画『トランスフォーマー』を楽しく見ました

Pre-tokenization with a word-level segmenter (Juman++ or MeCab): [Eiga, toransufōmā, o, tanoshiku, mimashita]

Run BPE: [Eiga, toransufōmā, o, tanoshi, #ku, mi, #mashita]

However, in the case of multi-lingual translation, having a language agnostic tokenizer is desirable. 

Example:

\(\mathcal{V}=\{a,b,c,\cdots,z,lo,he\}\)

Tokenize the text "hello lol"

[hello],[lol]

search for the byte-pair 'lo', if present merge

Yes. Therefore, Merge

[h e l lo], [lo l]

search for the byte-pair 'he', if present merge

Yes. Therefore, Merge

[he l lo], [lo l]

return the text after merge

he #l #lo, lo #l

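# `tokenizer` and `merges` are assumed to be defined elsewhere: `tokenizer`
# exposes a pre-tokenizer, and `merges` maps each learned pair to its merged
# token, in the order in which the pairs were learned.
def tokenize(text):
    # pre-tokenize into words, then split every word into characters
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [list(word) for word in pre_tokenized_text]
    # apply the merges in the order in which they were learned
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split

    # flatten the per-word splits into a single list of tokens
    return sum(splits, [])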

Search for byte-pair in order inserted in the vocab through entire input text

Merge if found

first, 2-grams, 3-grams,...
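For the "hello lol" example, the same procedure can be written without any library dependency. The sketch below assumes whitespace pre-tokenization and the merge order lo, then he, as in the example; apply_merges is a hypothetical helper name.

```python
def apply_merges(word, merges):
    """Split a word into characters and apply the learned merges in order."""
    symbols = list(word)
    for left, right in merges:  # merges are tried in the order they were learned
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("h", "e")]
tokens = [apply_merges(word, merges) for word in "hello lol".split()]
print(tokens)
```

The continuation prefix (#) shown on the slide is a display convention and is left out here.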

Module 4.3 : WordPiece


In BPE we merged the pair of tokens which has the highest frequency of occurrence.

Take the frequency of occurrence of the individual tokens in the pair into account

Word Frequency
('k', 'n') 3
('n', 'o') 3
('o', 'w') 3
('w', 'i') 3
('i', 'n') 7
('n', 'g') 7
('g', '.') 1
('t', 'h') 5
('h', 'e') 1
('e', '</w>') 2
('a', 'd') 1
\vdots
\vdots

What if more than one pair occurs with the same frequency, for example ('i','n') and ('n','g')?

score=\frac{count(\alpha,\beta)}{count(\alpha)count(\beta)}

Now we can select a pair of tokens where the individual tokens are less frequent in the vocabulary

The WordPiece algorithm uses this score to merge the pairs.

Word Frequency
'k n o w i n g' 3
't h e' 1
'n a m e' 1
'o f' 1
's o m e t h i n g' 2
'i s' 1
'd i f f e r e n t' 1
'f r o m' 1
's o m e t h i n g .' 1
'a b o u t' 1
'e v e r y t h i n g' 1
"i s n ' t" 1
'b a d' 1

Word count

Character Frequency
'k' 3
'#n' 13
'#o' 9
't' 8
'#h' 5
"'" 1

Initial Vocab Size :22

\vdots
\vdots
\vdots
\vdots

knowing

k  #n  #o  #w  #i  #n  #g

Subwords are identified by prefix ## (we use single # for illustration)

Word Frequency
('k', 'n') 3
('n', 'o') 3
('o', 'w') 3
('w', 'i') 3
('i', 'n') 7
('n', 'g') 7
('g', '.') 1
('t', 'h') 5
('h', 'e') 1
('e', '</w>') 2
('a', 'd') 1
\vdots
\vdots

Ignoring the prefix #:

Pair         Pair count   Freq of 1st element   Freq of 2nd element   score
('k', 'n')   3            'k': 3                'n': 13               0.076
('n', 'o')   3            'n': 13               'o': 9                0.02
('o', 'w')   3            'o': 9                'w': 3                0.111
('i', 'n')   7            'i': 10               'n': 13               0.05
('n', 'g')   7            'n': 13               'g': 7                0.076
('t', 'h')   5            't': 8                'h': 5                0.125
('a', 'd')   1            'a': 3                'd': 2                0.16

score=\frac{count(\alpha,\beta)}{count(\alpha)count(\beta)}

Now, merging is based on the score of each byte pair.

Merge the pair with highest score
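The scores in the table can be recomputed directly from the listed counts. The sketch below is restricted to the pairs shown on the slide; over the full corpus other pairs may score higher.

```python
# (pair count, count of first token, count of second token), copied from the tables
stats = {
    ("k", "n"): (3, 3, 13),
    ("n", "o"): (3, 13, 9),
    ("o", "w"): (3, 9, 3),
    ("i", "n"): (7, 10, 13),
    ("n", "g"): (7, 13, 7),
    ("t", "h"): (5, 8, 5),
    ("a", "d"): (1, 3, 2),
}

def score(pair_count, first_count, second_count):
    """WordPiece-style score: pair count normalized by the token counts."""
    return pair_count / (first_count * second_count)

scores = {pair: score(*counts) for pair, counts in stats.items()}
best = max(scores, key=scores.get)

for pair, value in scores.items():
    print(pair, round(value, 3))
print("merge:", best)
```

Although ('i', 'n') and ('n', 'g') are tied on raw frequency, their scores differ because 'i' is more frequent than 'g', so the tie from BPE disappears.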

We start with a character-level vocab and keep merging until a desired vocabulary size is reached

[Figure: Small Vocab (1-112) → N merges → Medium Vocab (30K-50K) → more merges → Larger Vocab (100K-250K)]

Well, we can do the reverse as well.

We start with a word-level vocab and keep eliminating words until a desired vocabulary size is reached

[Figure: Larger Vocab (100K-250K) → N eliminations → Medium Vocab (30K-50K) → more eliminations → Small Vocab (1-112)]

That's what we see next.

Module 4.4 : SentencePiece


A given word can have numerous subwords.

For instance, the word "\(X\)=hello" can be segmented in multiple ways (by BPE) even with the same vocabulary

\mathbf{x_1}=he,ll,o
\mathbf{x_2}=h,el,lo
\mathbf{x_3}=he,l,lo
\mathbf{x_4}=hell,o

\(\mathcal{V}\)={h,e,l,o,he,el,lo,ll,hell}

however, following the merge rule, BPE outputs: he,l,lo

On the other hand, if

\(\mathcal{V}\)={h,e,l,o,el,he,lo,ll,hell}

then BPE outputs: h,el,lo

Therefore, we say BPE is greedy and deterministic (we can use BPE-Dropout [Ref] to make it stochastic)

The probabilistic approach is to find the subword sequence \(\mathbf{x^*} \in \{\mathbf{x_1},\mathbf{x_2},\cdots,\mathbf{x_k}\}\) that maximizes the likelihood of the word \(X\)

The word \(X\) in SentencePiece means a sequence of characters or words (without spaces)

Therefore, it can be applied to languages (like Chinese and Japanese)  that do not use any word delimiters in a sentence.

[Figure: \(X\) (observed) and all possible subwords of \(X\) (hidden)]

Let \(\mathbf{x}=(x_1, x_2, \dots, x_n)\) denote a subword sequence of length \(n\). Then the probability of the subword sequence (with unigram LM) is simply

P(\mathbf{x}) = \prod\limits_{i=1}^{n} p(x_i), \quad \sum \limits_{x \in \mathcal{V}} p(x)=1

The objective is to find the subword sequence for the input sequence \(X\) (from all possible segmentation candidates \(S(X)\)) that maximizes the (log) likelihood of the sequence

\mathbf{x}^* = \argmax_{\mathbf{x} \in S(X)} P(\mathbf{x})

Recall that the subword probabilities \(p(x_i)\) are hidden (latent) variables.

Then, for all the sequences  in the dataset \(D\), we define the likelihood function as

\mathcal{L}= \sum \limits_{s=1}^{|D|}\log(P(X^s))
= \sum \limits_{s=1}^{|D|}\log \Big(\sum \limits_{\mathbf{x} \in S(X^s)} P(\mathbf{x}) \Big)

Therefore, given the vocabulary \(\mathcal{V}\), Expectation-Maximization (EM) algorithm could be used to maximize the likelihood

We can use Viterbi decoding to find \(\mathbf{x}^*\).

Word Frequency
'k n o w i n g' 3
't h e' 1
'n a m e' 1
'o f' 1
's o m e t h i n g' 2
'i s' 1
'd i f f e r e n t' 1
'f r o m' 1
's o m e t h i n g .' 1
'a b o u t' 1
'e v e r y t h i n g' 1
"i s n ' t" 1
'b a d' 1

Let \(X=\) "knowing"  and a few segmentation candidates be \(S(X)=\{`k,now,ing`,`know,ing`,`knowing`\}\)

Given the unigram language model we can calculate the probabilities of the segments as follows

p(\mathbf{x_1}=k,now,ing) = p(k)p(now)p(ing) = \frac{3}{16} \times \frac{3}{16} \times \frac{7}{16} = \frac{63}{4096}
p(\mathbf{x_2}=know,ing) = p(know)p(ing) = \frac{3}{16} \times \frac{7}{16} = \frac{21}{256} = \frac{336}{4096}
p(\mathbf{x_3}=knowing) = p(knowing) = \frac{3}{16} = \frac{768}{4096}

\mathbf{x}^* = \argmax_{\mathbf{x} \in S(X)} P(\mathbf{x})=\mathbf{x_3}

The unigram model favours the segmentation with the least number of subwords
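The three candidate probabilities can be checked with exact fractions. The token probabilities below are read off the worked example; p(know) = 3/16 is inferred from the product 21/256 rather than stated explicitly.

```python
from fractions import Fraction
from math import prod

# Unigram probabilities used in the worked example (p(know) is inferred).
p = {
    "k": Fraction(3, 16),
    "now": Fraction(3, 16),
    "ing": Fraction(7, 16),
    "know": Fraction(3, 16),
    "knowing": Fraction(3, 16),
}

candidates = [("k", "now", "ing"), ("know", "ing"), ("knowing",)]

def sequence_probability(segmentation):
    """Unigram LM: the probability of a segmentation is the product of its tokens."""
    return prod(p[token] for token in segmentation)

probabilities = {seg: sequence_probability(seg) for seg in candidates}
best = max(probabilities, key=probabilities.get)

for seg, value in probabilities.items():
    print(seg, value)
print("best:", best)
```

Each additional token multiplies in another factor below one, which is why the shortest segmentation wins here.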

In practice, we use Viterbi decoding to find \(\mathbf{x}^*\) instead of enumerating all possible segments

Algorithm

Set the desired vocabulary size

  1. Construct a reasonably large seed vocabulary using BPE or the Extended Suffix Array algorithm.

  2. E-Step:

    Estimate the probability of every token in the given vocabulary using frequency counts in the training corpus

  3. M-Step:

    Use the Viterbi algorithm to segment the corpus and return the optimal segments that maximize the (log) likelihood.

  4. Compute the likelihood for each new subword from the optimal segments

  5. Shrink the vocabulary by removing the \(x\%\) of subwords that have the smallest likelihood.

  6. Repeat steps 2 to 5 until the desired vocabulary size is reached

Let us consider segmenting the word "whereby" using Viterbi decoding

[Figure: a seed vocabulary of candidate subwords: k, a, no, b, e, d, f, now, bow, bo, in, om, ro, ry, ad, ng, out, eve, win, some, bad, owi, ing, hing, thing, g, z, er]

\mathcal{V}
Token log(p(x))
b -4.7
e -2.7
h -3.34
r -3.36
w -4.29
wh -5.99
er -5.34
where -8.21
by -7.34
he -6.02
ere -6.83
here -7.84
her -7.38
re -6.13
Forward algorithm

Iterate over every position in the given word and output the segment which has the highest likelihood

t=1, slice "w":
(w) = -4.29

t=2, slice "wh":
At this position, the possible segmentations of the slice "wh" are (w,h) and (wh). Compute the log-likelihood for both and output the best one.
(w,h) = -7.63
(wh) = -5.99 (best)

t=3, slice "whe":
(wh,e) = -8.69 (best)
(w,he) = -10.31
(w,h,e) = -10.33
(whe) = -\infty
We do not need to compute the likelihood of (w,h,e) as we already ruled out (w,h) in favour of (wh). We display it for completeness. Of these, (wh,e) is the best segmentation that maximizes the likelihood.

t=4, slice "wher":
(wh,er) = -11.33 (best)
(w,her) = -11.67
(wh,e,r) = -12.05
(wher) = -\infty

t=5, slice "where":
(where) = -8.21 (best)
(w,here) = -12.13
(wh,ere) = -12.82
(wh,er,e) = -14.03

t=6, slice "whereb":
(where,b) = -12.91 (best)
(whereb) = -\infty

t=7, slice "whereby":
(where,by) = -15.55 (best)

Backtrack

The best segmentation of the word "whereby" that maximizes the likelihood is "where,by"
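The forward pass and the backtracking step can be sketched in a few lines using the log-probability table from the slides. The function name viterbi_segment and the two bookkeeping arrays are illustrative choices, not taken from the SentencePiece implementation.

```python
import math

# log p(x) for every token in the vocabulary, as listed in the table
log_p = {
    "b": -4.7, "e": -2.7, "h": -3.34, "r": -3.36, "w": -4.29,
    "wh": -5.99, "er": -5.34, "where": -8.21, "by": -7.34, "he": -6.02,
    "ere": -6.83, "here": -7.84, "her": -7.38, "re": -6.13,
}

def viterbi_segment(word, log_p):
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-likelihood of word[:t]
    best_start = [0] * (n + 1)            # start index of the last token on that path
    # Forward pass: for every end position, try every token that ends there.
    for end in range(1, n + 1):
        for start in range(end):
            token = word[start:end]
            if token in log_p:
                candidate = best_score[start] + log_p[token]
                if candidate > best_score[end]:
                    best_score[end] = candidate
                    best_start[end] = start
    # Backtrack from the end of the word to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best_start[end]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best_score[n]

tokens, score = viterbi_segment("whereby", log_p)
print(tokens, round(score, 2))
```

Only the best score per position is kept, so the cost grows with the square of the word length rather than with the number of possible segmentations. If a word cannot be segmented with the given vocabulary, the returned score is minus infinity and the tokens should be ignored.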

We can follow the same procedure for languages that do not use any word delimiters in a sentence.

Morfessor

Another algorithm that is related to SentencePiece is Morfessor [Ref], which uses Maximum a Posteriori (MAP) estimation as an objective function; this allows us to incorporate a prior

Recall that the unigram model used Maximum Likelihood Estimation (MLE) as an objective function.

For example, consider the word

Sesquipedalophobia (fear of long words)

and suppose SentencePiece returns the one best segmentation

Ses-qui-ped-alo-phobia

However, for some reason, we want the segmentation to be

Ses-qui-pedalo-phobia

In general, Morfessor allows us to incorporate linguistically motivated prior knowledge into the tokenization procedure.

It also allows us to add a small amount of annotated data while training the model

It has been used for large vocabulary continuous speech recognition

References

The WordPiece model uses a likelihood instead of frequency.

SentencePiece is called SentencePiece because it treats the entire sentence (corpus) as a single input and finds the subwords for it.

Substrings and subwords are not one and the same

The suffix array A of S is defined to be an array of integers providing the starting positions of the suffixes of S in lexicographical order

Counting subwords: https://math.stackexchange.com/questions/104258/counting-subwords