[Figure: the Transformer encoder-decoder. The source input sequence enters the encoder (Multi-Head Attention → Add&Norm → Feed forward NN → Add&Norm); the target input sequence enters the decoder (Multi-Head Masked Attention → Add&Norm → Multi-Head Cross Attention → Add&Norm → Feed forward NN → Add&Norm), which produces the predicted output sequence.]
Example (English → Tamil): "I am reading a book" → "Naan oru puthagathai padiththu kondirukiren"
[Figure: the processing pipeline around a language model, applied on both the Encoder and Decoder sides: input text (e.g. "How are you?") → Tokenizer → Token to ID → embeddings → Language Model → ID to token → token to words.]
[Figure: the input "I enjoyed the movie transformers" flows through Multi-Head Attention and Feed Forward Network blocks; decoding starts from <go> and ends at <stop>, producing 映画『トランスフォーマー』を楽しく見ました ("I enjoyed the movie Transformers").]
Subword tokenization algorithms: Byte Pair Encoding (BPE), WordPiece, SentencePiece. Each learns a vocabulary of a chosen size (vocab size).
Input text: Hmm.., I know I Don't know
Normalization: hmm.., i know i don't know
Pre-tokenization: [hmm.., i, know, i, don't, know]
Subword tokenization: [hmm.., i, know, i, do, #n't, know]
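As a rough sketch of the first two stages (assuming plain lowercasing for normalization and whitespace splitting for pre-tokenization; the subword step needs a learned vocabulary, which the rest of this section builds up):

```python
# Minimal sketch of normalization + pre-tokenization (assumed: lowercasing
# and whitespace splitting; real tokenizers use configurable components).
text = "Hmm.., I know I Don't know"
normalized = text.lower()          # normalization
pre_tokens = normalized.split()    # pre-tokenization
print(pre_tokens)                  # ['hmm..,', 'i', 'know', 'i', "don't", 'know']
```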
"I enjoyed the movie" → [I enjoy #ed the movie]
[I enjoy #ed the movie .] → "I enjoyed the movie."
import re, collections

def get_stats(vocab):
  # count the frequency of every adjacent symbol pair in the vocabulary
  pairs = collections.defaultdict(int)
  for word, freq in vocab.items():
    symbols = word.split()
    for i in range(len(symbols)-1):
      pairs[symbols[i], symbols[i+1]] += freq
  return pairs

def merge_vocab(pair, v_in):
  # replace every occurrence of the pair with the merged symbol
  v_out = {}
  bigram = re.escape(' '.join(pair))
  p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
  for word in v_in:
    w_out = p.sub(''.join(pair), word)
    v_out[w_out] = v_in[word]
  return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
  pairs = get_stats(vocab)
  best = max(pairs, key=pairs.get)  # most frequent pair
  vocab = merge_vocab(best, vocab)
  print(best)                       # the pair merged at this step
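On this toy vocabulary the loop prints one merged pair per step; with Python's insertion-ordered dicts (ties broken by first occurrence) the earliest merges come out as ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o') and ('lo', 'w'), each adding a new symbol that later merges can build on.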
Word count:
| Word | Frequency |
|---|---|
| 'k n o w i n g </w>' | 3 |
| 't h e </w>' | 1 |
| 'n a m e </w>' | 1 |
| 'o f </w>' | 1 |
| 's o m e t h i n g </w>' | 2 |
| 'i s </w>' | 1 |
| 'd i f f e r e n t </w>' | 1 |
| 'f r o m </w>' | 1 |
| 's o m e t h i n g . </w>' | 1 |
| 'a b o u t </w>' | 1 |
| 'e v e r y t h i n g </w>' | 1 |
| "i s n ' t </w>" | 1 |
| 'b a d </w>' | 1 |
Initial token count (character-level vocabulary; subset of the 22 entries shown):
| Character | Frequency |
|---|---|
| 'k' | 3 |
| 'n' | 13 |
| 'o' | 9 |
| 'i' | 10 |
| '</w>' | 3+1+1+1+2+...+1 = 16 |
| "'" | 1 |
| ... | ... |

Word-level vocab size: 13
Initial (character-level) vocab size: 22
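The two tables above can be reproduced directly; a minimal sketch that takes the word counts from the table as given (the underlying corpus sentence is not shown here):

```python
import collections

# Word counts copied from the table above (the lecture's toy corpus).
word_freq = {
    'knowing': 3, 'the': 1, 'name': 1, 'of': 1, 'something': 2,
    'is': 1, 'different': 1, 'from': 1, 'something.': 1,
    'about': 1, 'everything': 1, "isn't": 1, 'bad': 1,
}

# Represent each word as space-separated characters plus the end-of-word
# marker </w> -- the format the BPE training code above expects.
vocab = {' '.join(word) + ' </w>': freq for word, freq in word_freq.items()}

# Initial token count: character-level frequencies.
char_freq = collections.defaultdict(int)
for symbols, freq in vocab.items():
    for ch in symbols.split():
        char_freq[ch] += freq

print(len(word_freq))                     # 13 distinct words
print(len(char_freq))                     # 22 initial tokens
print(char_freq['n'], char_freq['</w>'])  # 13 16, matching the table
```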
Byte-pair count:
| Byte pair | Frequency |
|---|---|
| ('k', 'n') | 3 |
| ('n', 'o') | 3 |
| ('o', 'w') | 3 |
| ('w', 'i') | 3 |
| ('i', 'n') | 7 |
| ('n', 'g') | 7 |
| ('g', '</w>') | 6 |
| ('t', 'h') | 5 |
| ('h', 'e') | 1 |
| ('e', '</w>') | 2 |
| ('a', 'd') | 1 |
| ('d', '</w>') | 1 |
The most frequent byte pair is ('i', 'n') with frequency 7; it is merged into a new token "in" (frequency 7), which is added to the vocabulary.
Updated vocabulary (the counts of 'i' and 'n' are reduced by the frequency of the merged pair; "in" is the newly added token):
| Character | Frequency |
|---|---|
| 'k' | 3 |
| 'n' | 13 - 7 = 6 |
| 'o' | 9 |
| 'i' | 10 - 7 = 3 |
| '</w>' | 16 |
| "'" | 1 |
| "in" | 7 |
Word count after merging "in":
| Word | Frequency |
|---|---|
| 'k n o w in g </w>' | 3 |
| 't h e </w>' | 1 |
| 'n a m e </w>' | 1 |
| 'o f </w>' | 1 |
| 's o m e t h in g </w>' | 2 |
| 'i s </w>' | 1 |
| 'd i f f e r e n t </w>' | 1 |
| 'f r o m </w>' | 1 |
| 's o m e t h in g . </w>' | 1 |
| 'a b o u t </w>' | 1 |
| 'e v e r y t h in g </w>' | 1 |
| "i s n ' t </w>" | 1 |
| 'b a d </w>' | 1 |
Byte-pair count (recomputed after the merge):
| Byte pair | Frequency |
|---|---|
| ('k', 'n') | 3 |
| ('n', 'o') | 3 |
| ('o', 'w') | 3 |
| ('w', 'in') | 3 |
| ('h', 'in') | 4 |
| ('in', 'g') | 7 |
| ('g', '</w>') | 6 |
| ('t', 'h') | 5 |
| ('h', 'e') | 1 |
| ('e', '</w>') | 2 |
| ('a', 'd') | 1 |
| ('d', '</w>') | 1 |
Updated vocabulary (the most frequent pair, ('in', 'g') with frequency 7, is merged into the new token "ing"):
| Character | Frequency |
|---|---|
| 'k' | 3 |
| 'n' | 6 |
| 'o' | 9 |
| 'i' | 3 |
| '</w>' | 16 |
| "'" | 1 |
| 'g' | 7 |
| "in" | 7 |
| "ing" | 7 |
Tokenizing new text with the learned vocabulary:
everything → tokenizer → [everyth, ing] → everything
bad → tokenizer → [b, a, d] → bad
Learned vocabulary (subset of tokens shown):
| Tokens |
|---|
| 'k' |
| 'n' |
| 'o' |
| 'i' |
| '</w>' |
| 'in' |
| 'ing' |
| ... |
映画『トランスフォーマー』を楽しく見ました ("I enjoyed the movie Transformers")
[Eiga, toransufōmā, o, tanoshi, #ku, mi, #mashita]
[Eiga, toransufōmā, o, tanoshiku, mimashita]

Example:
\(\mathcal{V}=\{a,b,c,\cdots,z,lo,he\}\)
Tokenize the text "hello lol":
[hello], [lol]
Search for the byte-pair 'lo'; if present, merge.
Yes, therefore merge:
[h e l lo], [lo l]
Search for the byte-pair 'he'; if present, merge.
Yes, therefore merge:
[he l lo], [lo l]
Return the text after the merges:
he #l #lo, lo #l
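A minimal sketch (assumed, not the lecture's code) of this inference procedure: apply the learned merges, in the order they were learned, to each word of the new text.

```python
# BPE inference: greedily apply the learned merges in learned order.
def bpe_tokenize(word, merges):
    """word: a string; merges: list of (left, right) pairs in learned order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]   # merge the pair in place
            else:
                i += 1
    return symbols

# Vocabulary from the example: characters a-z plus the merges 'lo' and 'he'.
merges = [('l', 'o'), ('h', 'e')]
print(bpe_tokenize('hello', merges))  # ['he', 'l', 'lo']
print(bpe_tokenize('lol', merges))    # ['lo', 'l']
```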
Word count:
| Word | Frequency |
|---|---|
| 'k n o w i n g' | 3 |
| 't h e' | 1 |
| 'n a m e' | 1 |
| 'o f' | 1 |
| 's o m e t h i n g' | 2 |
| 'i s' | 1 |
| 'd i f f e r e n t' | 1 |
| 'f r o m' | 1 |
| 's o m e t h i n g .' | 1 |
| 'a b o u t' | 1 |
| 'e v e r y t h i n g' | 1 |
| "i s n ' t" | 1 |
| 'b a d' | 1 |
Initial token count (subset shown):
| Character | Frequency |
|---|---|
| 'k' | 3 |
| '#n' | 13 |
| '#o' | 9 |
| 't' | 16 |
| '#h' | 5 |
| "'" | 1 |
| ... | ... |

Initial vocab size: 22
For example, the word "knowing" is represented as k #n #o #w #i #n #g (word-internal characters carry the # prefix).
Byte-pair count (pairs are counted ignoring the prefix #):
| Byte pair | Frequency |
|---|---|
| ('k', 'n') | 3 |
| ('n', 'o') | 3 |
| ('o', 'w') | 3 |
| ('w', 'i') | 3 |
| ('i', 'n') | 7 |
| ('n', 'g') | 7 |
| ('g', '.') | 1 |
| ('t', 'h') | 5 |
| ('h', 'e') | 1 |
| ('e', '</w>') | 2 |
| ('a', 'd') | 1 |
WordPiece scores each candidate merge as score = freq(pair) / (freq(1st element) × freq(2nd element)), with pair frequencies taken from the byte-pair table above:
| Freq of 1st element | Freq of 2nd element | Score |
|---|---|---|
| 'k': 3 | 'n': 13 | 0.076 |
| 'n': 13 | 'o': 9 | 0.02 |
| 'o': 9 | 'w': 3 | 0.111 |
| 'i': 10 | 'n': 13 | 0.05 |
| 'n': 13 | 'g': 7 | 0.076 |
| 't': 8 | 'h': 5 | 0.125 |
| 'a': 3 | 'd': 2 | 0.16 |
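For concreteness, a small sketch (assumed, using the unit and pair frequencies from the tables above) of how these scores are computed and the best merge selected. Note that the rare pair ('a', 'd') wins under this score, even though plain BPE, which uses raw pair counts, would pick ('i', 'n'):

```python
# WordPiece picks the merge maximizing
#   score(a, b) = freq(a, b) / (freq(a) * freq(b))
# Unit and pair frequencies below are copied from the lecture's tables.
unit_freq = {'k': 3, 'n': 13, 'o': 9, 'w': 3, 'i': 10, 'g': 7,
             't': 8, 'h': 5, 'a': 3, 'd': 2}
pair_freq = {('k', 'n'): 3, ('n', 'o'): 3, ('o', 'w'): 3, ('i', 'n'): 7,
             ('n', 'g'): 7, ('t', 'h'): 5, ('a', 'd'): 1}

scores = {pair: freq / (unit_freq[pair[0]] * unit_freq[pair[1]])
          for pair, freq in pair_freq.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('a', 'd') 0.167: rare units beat raw counts
```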
[Figure: the vocabulary-size trade-off: small vocab vs. medium vocab vs. larger vocab.]
\(\mathcal{V}=\{h,e,l,l,o,he,el,lo,ll,hell\}\)
All possible subwords of \(X\) are treated as hidden (latent) variables; only the text \(X\) is observed.
Example candidate subwords (seed vocabulary for the unigram model): k, a, no, b, e, d, f, now, bow, bo, in, om, ro, ry, ad, ng, out, eve, win, some, bad, owi, ing, hing, thing, g, z, er
Unigram token log-probabilities:
| Token | log(p(x)) |
|---|---|
| b | -4.7 | 
| e | -2.7 | 
| h | -3.34 | 
| r | -3.36 | 
| w | -4.29 | 
| wh | -5.99 | 
| er | -5.34 | 
| where | -8.21 | 
| by | -7.34 | 
| he | -6.02 | 
| ere | -6.83 | 
| here | -7.84 | 
| her | -7.38 | 
| re | -6.13 | 
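To see how these log-probabilities are used, here is a minimal sketch (assumed, not the lecture's code) of unigram tokenization: among all segmentations of a word into in-vocabulary tokens, pick the one with the highest total log-probability.

```python
import math

# Log-probabilities copied from the table above.
log_p = {'b': -4.7, 'e': -2.7, 'h': -3.34, 'r': -3.36, 'w': -4.29,
         'wh': -5.99, 'er': -5.34, 'where': -8.21, 'by': -7.34,
         'he': -6.02, 'ere': -6.83, 'here': -7.84, 'her': -7.38, 're': -6.13}

def best_segmentation(word):
    # best[i] holds (score, tokens) for the best segmentation of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in log_p and best[j][1] is not None:
                score = best[j][0] + log_p[piece]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[len(word)]

print(best_segmentation('whereby'))  # approximately (-15.55, ['where', 'by'])
```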
Input text: Hmm.., I know I Don't know
Normalization: hmm.., i know i don't know
Pre-tokenization: [hmm.., i, know, i, don't, know]
Tokenization using learned vocab: [hmm.., i, know, i, do, #n't, know]
Post-processing (add model-specific tokens): [<go>, hmm.., i, know, i, do, #n't, know, <end>]
These stages correspond to the components of the Hugging Face tokenizers library:

from tokenizers import Tokenizer

A Tokenizer is built from:
- a normalizer (choices: LowerCase, StripAccents, ...)
- a pre-tokenizer (choices: WhiteSpace, Regex, BERT-like)
- the algorithm / model (choices: BPE, WordPiece, ...)
- a post-processor (inserts model-specific tokens)

Each component is exposed as a property (getter and setter, del) of Tokenizer:
- model
- normalizer
- pre_tokenizer
- post_processor
- padding (no setter)
- truncation (no setter)
- decoder
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Lowercase

tokenizer = Tokenizer(BPE())
# it doesn't matter in which order the properties are set
tokenizer.pre_tokenizer = Whitespace()
tokenizer.normalizer = Lowercase()

For example: normalizer = Lowercase, pre-tokenizer = Whitespace, model = BPE, and the post-processor inserts model-specific tokens such as [CLS] and [SEP].

Key methods of Tokenizer:
- add_special_tokens(str, AddedToken), add_tokens() (no token_ids are assigned)
- enable_padding(), enable_truncation()
- encode(seq, pair, is_pretokenized), encode_batch()
- decode(), decode_batch()
- from_file(.json)  # load a serialized tokenizer from a local JSON file
- from_pretrained()  # load from the Hugging Face Hub
- get_vocab(), get_vocab_size()
- id_to_token(), token_to_id()
- post_process()
- train(files), train_from_iterator(dataset)
Tokenizer is the core class. Training is done with tokenizers.Tokenizer.train_from_iterator(dataset, trainer), where dataset is a list of strings (text) and the trainer object carries the options (vocab_size, special_tokens, prefix, ...).
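Putting the pieces together, a minimal training sketch (the corpus and parameter values below are assumed, not from the lecture):

```python
# A minimal training sketch with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Lowercase

corpus = ["Knowing the name of something is different from knowing something.",
          "Knowing about everything isn't bad."]   # assumed toy corpus

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.get_vocab_size())
print(tokenizer.encode("knowing everything").tokens)
```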
encode() encodes a single input string and encode_batch() encodes a list of strings, e.g. ['This is great', 'tell me a joke']. Both return Encoding objects whose fields include {input_ids, attention_mask, type_ids, ...}.
The output contains not only input_ids but also optional fields such as attention_mask and type_ids (which we will learn about in the next lecture). The post_process() method of Tokenizer is used internally; we simply set the desired format through the optional parameters (such as "pair") of encode().

encoding = Encoding()
tokenizer.post_process(encoding, pair=True)
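And a minimal encoding/decoding sketch (the pretrained identifier below is an assumption, used only to make the example self-contained; any trained Tokenizer, such as the one from the previous sketch, works the same way):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # assumed identifier

output = tokenizer.encode("This is great")
print(output.tokens)           # subword tokens, with model-specific tokens added
print(output.ids)              # their integer ids
print(output.attention_mask)   # mask of real tokens

batch = tokenizer.encode_batch(["This is great", "tell me a joke"])
print(tokenizer.decode(batch[1].ids))  # map ids back to text
```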