Lecture 4: Tokenizers: BPE, WordPiece and SentencePiece
Mitesh M. Khapra, Arun Prakash A
AI4Bharat, Department of Computer Science and Engineering, IIT Madras
Introduction to Large Language Models
Introduction
Recall that language modelling involves computing probabilities over a sequence of tokens
This requires us to split the input text into tokens in a consistent way.
Let's see where the tokenizer fits in the training and inference pipeline
[Figure: "How are you?" → Tokenizer → Token Ids → embeddings → Language Model → ids to tokens]
A simple approach is to use whitespace for splitting the text.
The first step is to build (learn) the vocabulary \(\mathcal{V}\) that contains unique tokens (\(x_i\)).
Therefore, the fundamental question is how do we build (learn) a vocabulary \(\mathcal{V}\) from a large corpus that contains billions of sentences?
We also add special tokens such as <go>, <stop>, <mask>, <sep>, <cls> and others to the vocabulary, depending on the type of downstream task and the choice of architecture (GPT/BERT)
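A minimal sketch of building a word-level vocabulary with special tokens and mapping a sentence to token ids. The function names `build_vocab` and `encode` are hypothetical, and the `<unk>` token is an assumption added here so that unseen words can still be encoded.

```python
# Hypothetical helpers for illustration only.
SPECIAL_TOKENS = ["<go>", "<stop>", "<unk>"]

def build_vocab(corpus):
    """Map every unique whitespace-separated word to an integer id."""
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to token ids wrapped in <go> ... <stop>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(word, unk) for word in sentence.split()]
    return [vocab["<go>"]] + ids + [vocab["<stop>"]]

corpus = ["I enjoyed the movie", "I enjoyed the book"]
vocab = build_vocab(corpus)
ids = encode("I enjoyed the movie", vocab)
print(ids)  # [0, 3, 4, 5, 6, 1]
```

A word that never appeared in the corpus maps to the id of `<unk>`, which is the out-of-vocabulary situation discussed later in the lecture.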
[Figure: the tokens "<go> I enjoyed the movie transformers <stop>" fed to a transformer block (Multi-Head Attention, Feed Forward Network)]
We can split the text into words using whitespace (called pre-tokenization) and add all unique words in the corpus to the vocabulary.
Some Questions
Is splitting the input text into words using whitespace a good approach?
In that case, do we treat the words "enjoy" and "enjoyed" as separate tokens?
What about languages like Japanese, which do not use any word delimiters like space?
Why not treat each individual character in a language as a token in the vocabulary?
Finally, what are good tokens?
映画『トランスフォーマー』を楽しく見ました (Japanese: "I enjoyed watching the movie Transformers")
Challenges in building a vocabulary
What should be the size of vocabulary?
The larger the vocabulary, the larger the embedding matrix and the greater the cost of computing softmax probabilities. What is the optimal size?
Out-of-vocabulary
If we limit the size of the vocabulary (say, from 250K to 50K), then we need a mechanism to handle out-of-vocabulary (OOV) words. How do we handle them?
Handling misspelled words in corpus
Often, the corpus is built by scraping the web, so typos and spelling errors are likely. Such erroneous words should not be considered unique words.
Open Vocabulary problem
A new word can be constructed (say, in agglutinative languages) by combining existing words. The vocabulary is, in principle, infinite (for example, names, numbers, ...), which makes a task like machine translation challenging
Module 4.1 : Types of tokenization
What constitutes a word?
If we treat two words as different when they differ in meaning, then words like "bank" and "air", which have multiple meanings, each count as several words
If we treat words as groups of characters arranged in order, then the words in the list [ "learn", "learns", "learning", "learnt" ] are all different.
This increases the size of the vocabulary, which in turn increases the computational requirement.
In practice, the size of the vocabulary varies with the corpus. It is often in the order of tens of thousands
It would be interesting to measure the size of the vocabulary of native (and non-native) speakers
Source: lemongrad
Character Level Tokenization
One way to reduce the size of the vocabulary is to consider individual characters in the input text as tokens
Input: Hmm.. I know, I don't know
Tokens: [H, m, m, ., ., I, k, n, o, w, ,, I, d, o, n, ', t, k, n, o, w]
Number of tokens: 21
\(\mathcal{V}\): {H, m, ., I, k, n, o, w, ,, d, ', t}
\(|\mathcal{V}|\): 12
The size of the vocabulary is small and finite even for a large corpus of text
It solves both the open-vocabulary and out-of-vocabulary problems
Since the vocabulary is small, computing softmax probabilities is not a bottleneck
However, it loses the semantic meaning of words
The number of characters in a sentence is, on average, \(5\times\) the number of words in the sentence
This increases the computational cost of the model, as the context window must be about 5 times longer (on average)
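The trade-off above can be checked on the running example. This is a small sketch, assuming spaces are dropped and every other character (punctuation included) counts as a token; the variable names are illustrative.

```python
text = "Hmm.. I know, I don't know"

# Character level: every non-space character is a token.
char_tokens = [ch for ch in text if not ch.isspace()]
char_vocab = sorted(set(char_tokens))

# Word level (whitespace split), to compare sequence lengths.
word_tokens = text.split()

print(len(char_tokens), len(word_tokens))  # 21 6
print(char_vocab)
```

The character sequence is several times longer than the word sequence, while the character vocabulary stays tiny.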
Let's look at the word level tokenizer once again
Word Level Tokenization
Input: Hmm.. I know, I don't know
Tokens: [Hmm.., I, know,, I, don't, know]
Number of tokens: 6
\(\mathcal{V}\): {Hmm.., I, know, don't}
\(|\mathcal{V}|\): 4
In general, the size of the vocabulary grows with the number of unique words in the corpus.
In practice, it could range anywhere from 30,000 to 250,000 depending on the size of the corpus
Therefore, computing softmax probabilities to predict the token at each time step becomes expensive
One approach to limit the size of the vocabulary is to keep only the words with a minimum frequency of occurrence, say, at least 100 times in the corpus.
The words which are not in the vocabulary (the out-of-vocabulary problem) are substituted by a special token, <unk>, during training and inference. Doing this gives suboptimal performance for tasks like text generation and translation
Moreover, it is difficult to deal with misspelled words in a corpus (i.e., each misspelled word may be treated as a new word in the vocabulary)
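A short sketch of the minimum-frequency cut-off and the `<unk>` substitution described above. The toy corpus, the threshold of 2, and the helper names are assumptions chosen only to keep the example small.

```python
from collections import Counter

def build_vocab(words, min_freq):
    """Keep only the words that occur at least `min_freq` times."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_freq}

def replace_oov(words, vocab, unk="<unk>"):
    """Substitute every out-of-vocabulary word with the <unk> token."""
    return [w if w in vocab else unk for w in words]

train_words = "i know i know you know i knew".split()
vocab = build_vocab(train_words, min_freq=2)   # keeps 'i' and 'know'

print(replace_oov("i knwo you know".split(), vocab))
# ['i', '<unk>', '<unk>', 'know']
```

Both the rare word "you" and the misspelling "knwo" collapse to the same `<unk>` token, which is why this approach loses information.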
So we need a tokenizer that keeps the vocabulary small, as a character level tokenizer does, and also preserves the meaning of words, as a word level tokenizer does.
Sub-Word Level Tokenization
Input: Hmm.. I know, I don't know
Tokens: [Hmm.., I, know,, I, do, n't, know]
Number of tokens: 7
\(\mathcal{V}\): {Hmm.., I, know, do, n't}
\(|\mathcal{V}|\): 5
A middle ground between character level and word level tokenizers
The vocabulary is carefully built based on the frequency of occurrence of subwords
For example, the subword level tokenizer breaks "don't" into "do" and "n't" by learning that "n't" occurs frequently in a corpus.
We then have a representation for "do" and a representation for "n't"
The most frequently occurring words are preserved as is and rare words are split into subword units.
The size of the vocabulary is moderate
Therefore, subword level tokenizers are preferred for LLMs.
Categories
character level
word level
sub-word level: BytePairEncoding (BPE), WordPiece, SentencePiece
* There are many categories; these three are the most commonly used
Wishlist
Moderate-sized Vocabulary
Efficiently handle unknown words during inference
Be language agnostic
Module 4.2 : Byte Pair Encoding
General Pre-processing steps
Input text: Hmm.., I know I Don't know
Normalization: hmm.., i know i don't know
Pre-tokenization: [hmm.., i, know, i, don't, know]
Subword Tokenization: [hmm.., i, know, i, do, #n't, know]
The tokenization schemes follow a sequence of steps to learn the vocabulary.
The input text corpus is often built by scraping web pages and ebooks.
First, the text is normalized, which involves operations such as handling case, removing accents, eliminating multiple whitespaces, handling HTML tags, etc.
Splitting the text by whitespace was traditionally called tokenization. However, when it is used with a sub-word tokenization algorithm, it is called pre-tokenization.
Learn the vocabulary (called training) using these words
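The normalization and pre-tokenization steps can be sketched with the standard library. This is one possible set of normalization rules (lowercasing, accent stripping, whitespace collapsing), not a prescribed recipe; the function names are hypothetical.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, strip accents, and collapse repeated whitespace."""
    text = text.lower()
    # NFD splits an accented letter into base letter + combining mark (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return re.sub(r"\s+", " ", text).strip()

def pre_tokenize(text):
    """Split the normalized text on whitespace."""
    return text.split()

words = pre_tokenize(normalize("Hmm..,  I know I Don't know"))
print(words)  # ['hmm..,', 'i', 'know', 'i', "don't", 'know']
```

The resulting word list is what a subword algorithm such as BPE would be trained on.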
[Figure: pipeline with the stages Text Corpus, Pre-process, Learn the vocabulary, Model and Post-process, illustrated with "I enjoyed the movie." → [I enjoy #ed the movie .]]
Motivation for subwords
Consider the problem of machine translation
[Figure: Encoder (tokens from source language) → Decoder (tokens from target language)]
In translation one has to deal with an open vocabulary (names, numbers, units, ...), which makes handling unknown tokens more challenging
Restricting the size of the vocabulary introduces more unknown tokens in both the source and target languages and therefore results in poor translation.
Often, OOV tokens are rare words (below the minimum frequency of occurrence) that were excluded from the vocabulary while constructing it. For example, the word "plural" may be present in the vocabulary but not "plurify".
An effective approach would relate such a word to existing words in the vocabulary by breaking it into subword units (for example, "plur+ify").
This is motivated by observing how unknown tokens such as names are translated
Observation
English: Chandrayan
Tamil: ?
Various word classes (for example, names) are translatable via smaller units than words
English: Chandrayan
Tamil: ச ந்தி ரா ய ன்
Why not apply the same idea to rare words in the corpus?
Let's see how
Algorithm
import re, collections

def get_stats(vocab):
    # Count each adjacent pair of symbols, weighted by the word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the pair (as whole symbols) by the merged symbol
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Words split into characters, ending in </w>, with their counts
vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,
         'n e w e s t </w>':6, 'w i d e s t </w>':3}
num_merges = 10  # hyperparameter
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair
    vocab = merge_vocab(best, vocab)
    print(best)
Start with a dictionary that contains words and their count
Append a special symbol </w> at the end of each word in the dictionary
Initialize the character-frequency table (a base vocabulary)
Set required number of merges (a hyperparameter)
Get the frequency count for each pair of adjacent characters
Merge the pair with maximum occurrence
Example
knowing the name of something is different from knowing something. knowing something about everything isn't bad
Word | Frequency |
---|---|
'k n o w i n g </w>' | 3 |
't h e </w>' | 1 |
'n a m e </w>' | 1 |
'o f </w>' | 1 |
's o m e t h i n g </w>' | 2 |
'i s </w>' | 1 |
'd i f f e r e n t </w>' | 1 |
'f r o m </w>' | 1 |
's o m e t h i n g . </w>' | 1 |
'a b o u t </w>' | 1 |
'e v e r y t h i n g </w>' | 1 |
"i s n ' t </w>" | 1 |
'b a d </w>' | 1 |
</w> identifies the word boundary.
Objective: Find most frequently occurring byte-pair
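The word table above can be reproduced from the example sentence. A minimal sketch, assuming whitespace pre-tokenization and no other normalization:

```python
from collections import Counter

text = ("knowing the name of something is different from knowing something. "
        "knowing something about everything isn't bad")

# Pre-tokenize on whitespace and count each word.
word_counts = Counter(text.split())

# Represent each word as space-separated characters ending in </w>.
vocab = {" ".join(word) + " </w>": freq for word, freq in word_counts.items()}

for word, freq in vocab.items():
    print(repr(word), freq)
```

Note that "something" and "something." are counted as different words because the full stop stays attached after whitespace splitting.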
Let's count the word frequencies first.
We can count character frequencies from the table
We could infer that "k" has occurred three times by counting the frequency of occurrence of words having the character "k" in it.
In this corpus, the only word that contains "k" is the word "knowing" and it has occurred three times.
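The same counting can be done for every character at once. A sketch that weights each character by the frequency of the word it appears in; `symbol_counts` is a hypothetical helper name.

```python
from collections import Counter

# Word table from the example: symbols of each word -> word frequency.
vocab = {
    "k n o w i n g </w>": 3, "t h e </w>": 1, "n a m e </w>": 1,
    "o f </w>": 1, "s o m e t h i n g </w>": 2, "i s </w>": 1,
    "d i f f e r e n t </w>": 1, "f r o m </w>": 1,
    "s o m e t h i n g . </w>": 1, "a b o u t </w>": 1,
    "e v e r y t h i n g </w>": 1, "i s n ' t </w>": 1, "b a d </w>": 1,
}

def symbol_counts(vocab):
    """Count every symbol, weighted by the frequency of its word."""
    counts = Counter()
    for word, freq in vocab.items():
        for symbol in word.split():
            counts[symbol] += freq
    return counts

counts = symbol_counts(vocab)
print(counts["k"], counts["n"], counts["o"], counts["</w>"], len(counts))
# 3 13 9 16 22
```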
Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 |
'o' | 9 |
'</w>' | 3+1+1+1+2+...+1=16 |
"'" | 1 |
Initial tokens and count
Word-level vocab size: 13
Initial Vocab Size: 22
Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'i') | 3 |
('i', 'n') | 7 |
('n', 'g') | 7 |
('g', '</w>') | 6 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
('d', '</w>') | 1 |
Byte-Pair count
Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 |
'o' | 9 |
'i' | 10 |
'</w>' | 16 |
"'" | 1 |
Initial Vocabulary
Merge the most commonly occurring pair : \((i,n) \rightarrow in\)
"in" | 7 |
Merge the most commonly occurring pair, add the new token and update the token counts
Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 - 7 = 6 |
'o' | 9 |
'i' | 10 - 7 = 3 |
'</w>' | 16 |
"'" | 1 |
"in" | 7 |
Updated Vocabulary
Word | Frequency |
---|---|
'k n o w in g </w>' | 3 |
't h e </w>' | 1 |
'n a m e </w>' | 1 |
'o f </w>' | 1 |
's o m e t h in g </w>' | 2 |
'i s </w>' | 1 |
'd i f f e r e n t </w>' | 1 |
'f r o m </w>' | 1 |
's o m e t h in g . </w>' | 1 |
'a b o u t </w>' | 1 |
'e v e r y t h in g </w>' | 1 |
"i s n ' t </w>" | 1 |
'b a d </w>' | 1 |
Character | Frequency |
---|---|
'k' | 3 |
'n' | 6 |
'o' | 9 |
'i' | 3 |
'</w>' | 16 |
"'" | 1 |
'g' | 7 |
"in" | 7 |
Updated vocabulary
Now, treat "in" as a single token and repeat the steps.
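One full BPE step on this example can be sketched as follows. The two helpers restate the pair-counting and merging logic of the algorithm in compact form; ties between equally frequent pairs are broken by whichever pair was counted first, which is an implementation detail of `max` over a dictionary rather than part of the algorithm.

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge every whole-symbol occurrence of `pair` in the word table."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in v_in.items()}

vocab = {
    "k n o w i n g </w>": 3, "t h e </w>": 1, "n a m e </w>": 1,
    "o f </w>": 1, "s o m e t h i n g </w>": 2, "i s </w>": 1,
    "d i f f e r e n t </w>": 1, "f r o m </w>": 1,
    "s o m e t h i n g . </w>": 1, "a b o u t </w>": 1,
    "e v e r y t h i n g </w>": 1, "i s n ' t </w>": 1, "b a d </w>": 1,
}

pairs = get_stats(vocab)
best = max(pairs, key=pairs.get)      # ('i', 'n'), counted 7 times
vocab = merge_vocab(best, vocab)
new_pairs = get_stats(vocab)          # pairs now involve the token 'in'

print(best, new_pairs["in", "g"], new_pairs["h", "in"], new_pairs["w", "in"])
# ('i', 'n') 7 4 3
```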
Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'in') | 3 |
('h', 'in') | 4 |
('in', 'g') | 7 |
('g', '</w>') | 6 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
('d', '</w>') | 1 |
Byte-Pair count
Therefore, the new byte pairs are (w,in):3, (in,g):7, (h,in):4
Note that at iteration 4 we now treat (w,in) as a pair instead of (w,i)
Of all these pairs, merge the most frequently occurring byte-pair, which turns out to be ('in','g') → "ing"
Character | Frequency |
---|---|
'k' | 3 |
'n' | 6 |
'o' | 9 |
'i' | 3 |
'</w>' | 16 |
"'" | 1 |
'g' | 7 |
"in" | 7 |
"ing" | 7 |
Updated vocabulary
Now, treat "ing" as a single token and repeat the steps
After 45 merges
'k n o w i n g </w>' |
't h e </w> |
'n a m e </w> |
'o f </w>' |
's o m e t h i n g </w> |
'i s </w>' |
'd i f f e r e n t </w>' |
'f r o m </w>' |
's o m e t h i n g . </w>' |
'a b o u t </w>' |
'e v e r y t h' |
"i s n ' t </w> |
'b a d </w>' |
The final vocabulary contains the initial vocabulary and all the merges (in order). Rare words are broken down into two or more subwords
At test time, the input word is split into a sequence of characters, and the characters are merged into larger, known symbols
everything → tokenizer → [everyth, ing]
The pair ('i','n') is merged first, followed by the pair ('in','g')
[Figure: the tokenizer applied to "bad"; labels shown: b, a, d, bad]
For a larger corpus, we often end up with a vocabulary that is smaller than one that treats individual words as tokens
Tokens |
---|
'k' |
'n' |
'o' |
'i' |
'</w>' |
'in' |
'ing' |
The algorithm offers a way to adjust the size of vocabulary as a function of number of merges.
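To make the link between the number of merges and the vocabulary size concrete, here is a sketch of a training loop that returns both the token set and the ordered merge list. `learn_bpe` is a hypothetical name; each merge adds at most one new token to the 22 base symbols of the running example.

```python
import collections
import re
from collections import Counter

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def merge_vocab(pair, v_in):
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in v_in.items()}

def learn_bpe(vocab, num_merges):
    """Return (token set, ordered merges) after up to `num_merges` merges."""
    tokens = {s for word in vocab for s in word.split()}   # base vocabulary
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:                  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
        tokens.add("".join(best))
    return tokens, merges

text = ("knowing the name of something is different from knowing something. "
        "knowing something about everything isn't bad")
word_table = {" ".join(w) + " </w>": f for w, f in Counter(text.split()).items()}

tokens, merges = learn_bpe(word_table, num_merges=10)
print(merges[:3])   # [('i', 'n'), ('in', 'g'), ('ing', '</w>')]
print(len(tokens))  # at most 22 + 10
```

Choosing `num_merges` is therefore the knob that sets the final vocabulary size.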
BPE for non-segmented languages
Languages such as Japanese and Korean are non-segmented
However, BPE requires space-separated words
How do we apply BPE for non-segmented languages then?
In practice, we can use language specific morphology based word segmenters such as Juman++ for Japanese (as shown in the figure on the right)
Input text: 映画『トランスフォーマー』を楽しく見ました
Pre-tokenization with a word-level segmenter (Juman++ or MeCab): [Eiga, toransufōmā, o, tanoshiku, mimashita]
Run BPE: [Eiga, toransufōmā, o, tanoshi, #ku, mi, #mashita]
However, in the case of multi-lingual translation, having a language agnostic tokenizer is desirable.
Example:
\(\mathcal{V}=\{a,b,c,\cdots,z,lo,he\}\)
Tokenize the text "hello lol"
[hello], [lol]
Search for the byte-pair 'lo'; if present, merge
Yes. Therefore, merge
[h e l lo], [lo l]
Search for the byte-pair 'he'; if present, merge
Yes. Therefore, merge
[he l lo], [lo l]
Return the text after the merges
he #l #lo, lo #l
# `tokenizer` and `merges` (pair -> merged token, in the order learned)
# are assumed to be defined earlier.
def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    # Start from individual characters
    splits = [[l for l in word] for word in pre_tokenized_text]
    # Apply every merge, in order, to every word
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split
    return sum(splits, [])
Search for byte-pair in order inserted in the vocab through entire input text
Merge if found
first, 2-grams, 3-grams,...
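A self-contained version of the same idea, applied to the "hello lol" example, may be easier to run. It assumes whitespace pre-tokenization and takes the merges as an ordered list of pairs; `bpe_encode` is a hypothetical name.

```python
def bpe_encode(word, merges):
    """Split `word` into characters and apply the merges in learned order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]   # merge in place
            else:
                i += 1
    return symbols

# 'lo' was added to the vocabulary before 'he', so it is tried first.
merges = [("l", "o"), ("h", "e")]
encoded = [bpe_encode(word, merges) for word in "hello lol".split()]
print(encoded)  # [['he', 'l', 'lo'], ['lo', 'l']]
```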
Module 4.3 : WordPiece
In BPE we merged the pair of tokens with the highest frequency of occurrence.
Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'i') | 3 |
('i', 'n') | 7 |
('n', 'g') | 7 |
('g', '.') | 1 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
What if more than one pair occurs with the same frequency, for example ('i','n') and ('n','g')?
Take the frequency of occurrence of the individual tokens in the pair into account: \(score=\frac{count(\alpha,\beta)}{count(\alpha)\,count(\beta)}\)
Now we can select a pair of tokens where the individual tokens are less frequent in the vocabulary
The WordPiece algorithm uses this score to merge the pairs.
Word | Frequency |
---|---|
'k n o w i n g' | 3 |
't h e' | 1 |
'n a m e' | 1 |
'o f' | 1 |
's o m e t h i n g' | 2 |
'i s' | 1 |
'd i f f e r e n t' | 1 |
'f r o m' | 1 |
's o m e t h i n g .' | 1 |
'a b o u t' | 1 |
'e v e r y t h i n g' | 1 |
"i s n ' t" | 1 |
'b a d' | 1 |
Word count
Character | Frequency |
---|---|
'k' | 3 |
'#n' | 13 |
'#o' | 9 |
't' | 8 |
'#h' | 5 |
"'" | 1 |
Initial Vocab Size: 22
knowing → k #n #o #w #i #n #g
Subwords are identified by the prefix ## (we use a single # for illustration)
Pair | Pair frequency | Freq of 1st element | Freq of 2nd element | score |
---|---|---|---|---|
('k', 'n') | 3 | 'k':3 | 'n':13 | 0.076 |
('n', 'o') | 3 | 'n':13 | 'o':9 | 0.02 |
('o', 'w') | 3 | 'o':9 | 'w':3 | 0.111 |
('i', 'n') | 7 | 'i':10 | 'n':13 | 0.05 |
('n', 'g') | 7 | 'n':13 | 'g':7 | 0.076 |
('t', 'h') | 5 | 't':8 | 'h':5 | 0.125 |
('a', 'd') | 1 | 'a':3 | 'd':2 | 0.16 |
(ignoring the prefix #)
Now, merging is based on the score of each byte pair.
Merge the pair with highest score
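A sketch of the scoring step, assuming the score of a pair is its frequency divided by the product of the frequencies of its two tokens (this assumption reproduces the values in the table above). Only the pairs listed in the table are scored, and the `#` prefix is ignored as in the slides.

```python
# Frequencies copied from the tables of the running example.
token_freq = {"k": 3, "n": 13, "o": 9, "w": 3, "i": 10, "g": 7,
              "t": 8, "h": 5, "a": 3, "d": 2}
pair_freq = {("k", "n"): 3, ("n", "o"): 3, ("o", "w"): 3, ("i", "n"): 7,
             ("n", "g"): 7, ("t", "h"): 5, ("a", "d"): 1}

def wordpiece_score(pair):
    """Pair frequency normalized by the frequencies of its two tokens."""
    left, right = pair
    return pair_freq[pair] / (token_freq[left] * token_freq[right])

scores = {pair: wordpiece_score(pair) for pair in pair_freq}
best = max(scores, key=scores.get)

for pair, score in scores.items():
    print(pair, round(score, 3))
print("merge:", best)  # ('a', 'd')
```

BPE would pick ('i', 'n') here because it is the most frequent pair, whereas this score prefers ('a', 'd'), whose individual tokens are rare.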
[Figure: Small Vocab → Medium Vocab ← Larger Vocab]
We start with a character level vocab and keep merging until the desired vocabulary size is reached
Well, we can do the reverse as well.
We start with a word level vocab and keep eliminating words until the desired vocabulary size is reached
That's what we see next.
Module 4.4 : SentencePiece
A given word can have numerous subwords.
For instance, the word \(X\)="hello" can be segmented in multiple ways (by BPE) even with the same vocabulary
If \(\mathcal{V}\)={h,e,l,l,o,he,el,lo,ll,hell}, then, following the merge rule, BPE outputs: he, l, lo
On the other hand, if \(\mathcal{V}\)={h,e,l,l,o,el,he,lo,ll,hell}, then BPE outputs: h, el, lo
Therefore, we say BPE is greedy and deterministic (we can use BPE-Dropout [Ref] to make it stochastic)
The probabilistic approach is to find the subword sequence \(\mathbf{x^*} \in \{\mathbf{x_1},\mathbf{x_2},\cdots,\mathbf{x_k}\}\) that maximizes the likelihood of the word \(X\)
The word \(X\) in SentencePiece means a sequence of characters or words (without spaces)
Therefore, it can be applied to languages (like Chinese and Japanese) that do not use any word delimiters in a sentence.
[Figure: \(X\) is observed; all possible subwords of \(X\) are hidden]
Let \(\mathbf{x}=(x_1,x_2,\cdots,x_n)\) denote a subword sequence of length \(n\).
Then the probability of the subword sequence (with a unigram LM) is simply \(P(\mathbf{x})=\prod_{i=1}^{n}p(x_i)\), with \(\sum_{x \in \mathcal{V}}p(x)=1\)
The objective is to find the subword sequence for the input sequence \(X\) (from all possible segmentation candidates \(S(X)\)) that maximizes the (log) likelihood of the sequence: \(\mathbf{x}^*=\arg\max_{\mathbf{x} \in S(X)}P(\mathbf{x})\)
We can use Viterbi decoding to find \(\mathbf{x}^*\).
Recall that the subword probabilities \(p(x_i)\) are hidden (latent) variables.
Then, for all the sequences in the dataset \(D\), we define the likelihood function as \(\mathcal{L}=\sum_{s=1}^{|D|}\log P(X^{(s)})=\sum_{s=1}^{|D|}\log\Big(\sum_{\mathbf{x} \in S(X^{(s)})}P(\mathbf{x})\Big)\)
Therefore, given the vocabulary \(\mathcal{V}\), the Expectation-Maximization (EM) algorithm could be used to maximize the likelihood
Let \(X=\) "knowing" and a few segmentation candidates be \(S(X)=\{(k,now,ing),(know,ing),(knowing)\}\)
Given the unigram language model, we can calculate the probabilities of the segments as follows: \(P(k,now,ing)=p(k)\,p(now)\,p(ing)\), \(P(know,ing)=p(know)\,p(ing)\), \(P(knowing)=p(knowing)\)
The unigram model favours the segmentation with the least number of subwords
In practice, we use Viterbi decoding to find \(\mathbf{x}^*\) instead of enumerating all possible segments
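To see how candidates are compared under a unigram model, here is a small sketch. The probabilities below are hypothetical, chosen only for illustration; they are not estimated from the example corpus.

```python
import math

# Hypothetical unigram probabilities (illustrative values, not from the lecture).
p = {"k": 0.05, "now": 0.02, "ing": 0.10, "know": 0.03, "knowing": 0.004}

candidates = [["k", "now", "ing"], ["know", "ing"], ["knowing"]]

def log_likelihood(segmentation):
    """Sum of log p(x_i): the log of the product of unigram probabilities."""
    return sum(math.log(p[token]) for token in segmentation)

best = max(candidates, key=log_likelihood)

for seg in candidates:
    print(seg, round(log_likelihood(seg), 2))
print("best:", best)  # ['knowing']
```

Every extra subword multiplies the likelihood by another factor below one, which is why segmentations with fewer subwords tend to score higher.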
Algorithm
Set the desired vocabulary size
1. Construct a reasonably large seed vocabulary using BPE or the Extended Suffix Array algorithm.
2. E-Step: Estimate the probability for every token in the given vocabulary using frequency counts in the training corpus
3. M-Step: Use the Viterbi algorithm to segment the corpus and return the optimal segments that maximize the (log) likelihood.
4. Compute the likelihood for each new subword from the optimal segments
5. Shrink the vocabulary size by removing the \(x\%\) of subwords that have the smallest likelihood.
6. Repeat steps 2 to 5 until the desired vocabulary size is reached
Let us consider segmenting the word "whereby" using Viterbi decoding
[Figure: a sample of subword tokens from the vocabulary: k, a, no, b, e, d, f, now, bow, bo, in, om, ro, ry, ad, ng, out, eve, win, some, bad, owi, ing, hing, thing, g, z, er]
Token | log(p(x)) |
---|---|
b | -4.7 |
e | -2.7 |
h | -3.34 |
r | -3.36 |
w | -4.29 |
wh | -5.99 |
er | -5.34 |
where | -8.21 |
by | -7.34 |
he | -6.02 |
ere | -6.83 |
here | -7.84 |
her | -7.38 |
re | -6.13 |
Forward algorithm
Iterate over every position in the given word
output the segment which has highest likelihood
At this position, the possible segmentations of the slice "wh" are (w,h) and (wh)
Compute the log-likelihood for both and output the best one.
We do not need to compute the likelihood of (w,h,e), as we already ruled out (w,h) in favour of (wh). We display it for completeness
Of these, (wh,e) is the best segmentation that maximizes the likelihood.
Backtrack
The best segmentation of the word "whereby" that maximizes the likelihood is "where,by"
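The forward pass and backtracking can be sketched in a few lines using the log-probabilities from the table. `viterbi_segment` is a hypothetical name; the function assumes every token's score is a log-probability and that the word can be covered by tokens in the table.

```python
import math

log_p = {"b": -4.7, "e": -2.7, "h": -3.34, "r": -3.36, "w": -4.29,
         "wh": -5.99, "er": -5.34, "where": -8.21, "by": -7.34, "he": -6.02,
         "ere": -6.83, "here": -7.84, "her": -7.38, "re": -6.13}

def viterbi_segment(word, log_p):
    """Return the segmentation of `word` with the highest log-likelihood."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n   # best_score[j]: best score of word[:j]
    back = [0] * (n + 1)                   # back[j]: start of the last token ending at j
    # Forward pass: try every token that ends at position `end`.
    for end in range(1, n + 1):
        for start in range(end):
            token = word[start:end]
            if token in log_p and best_score[start] + log_p[token] > best_score[end]:
                best_score[end] = best_score[start] + log_p[token]
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Backtrack from the end of the word to recover the tokens.
    tokens, end = [], n
    while end > 0:
        tokens.append(word[back[end]:end])
        end = back[end]
    return tokens[::-1], best_score[n]

tokens, score = viterbi_segment("whereby", log_p)
print(tokens, round(score, 2))  # ['where', 'by'] -15.55
```

Calling `viterbi_segment("whe", log_p)` returns `['wh', 'e']`, matching the intermediate step discussed above.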
We can follow the same procedure for languages that do not use any word delimiters in a sentence.
Morfessor
Another algorithm related to SentencePiece is Morfessor [Ref], which uses Maximum a Posteriori (MAP) estimation as its objective function, allowing us to incorporate a prior
Recall that the unigram model used Maximum Likelihood Estimation (MLE) as its objective function.
For example, consider the word
Sesquipedalophobia (fear of long words)
and suppose SentencePiece returns the single best segmentation
Ses-qui-ped-alo-phobia
However, for some reason, we want the segmentation to be
Ses-qui-pedalo-phobia
In general, Morfessor allows us to incorporate linguistically motivated prior knowledge into the tokenization procedure.
It also allows us to add a small amount of annotated data while training the model
It has been used for large-vocabulary continuous speech recognition
References
The WordPiece model uses a likelihood instead of frequency.
SentencePiece is so called because it treats the entire sentence (corpus) as a single input and finds the subwords for it.
A substring and a subword are not one and the same
The suffix array A of S is defined to be an array of integers providing the starting positions of the suffixes of S in lexicographical order.
Counting subwords: https://math.stackexchange.com/questions/104258/counting-subwords
Lecture-4-Tokenizers
By Arun Prakash