(Figure) Transformer encoder-decoder: the source input sequence passes through the encoder (Multi-Head Attention → Add & Norm → Feed-Forward NN → Add & Norm); the target input sequence passes through the decoder (Masked Multi-Head Attention → Add & Norm → Multi-Head Cross-Attention → Add & Norm → Feed-Forward NN → Add & Norm), which produces the predicted output sequence.
Example translation task: "I am reading a book" → (Tamil) "Naan oru puthagathai padiththu kondirukiren".
(Figure) End-to-end text pipeline: input text (e.g., "How are you?") → Tokenizer → token to ID → embeddings → Language Model (Encoder/Decoder) → ID to token → tokens to words.
(Figure) Machine translation with a Transformer: the encoder processes the input tokens [I, enjoyed, the, movie, transformers] (Multi-Head Attention → Feed Forward Network); decoding starts from <go> and ends at <stop>, emitting 映画『トランスフォーマー』を楽しく見ました ("I enjoyed watching the movie Transformers").
Three widely used subword tokenization algorithms, each trained up to a chosen vocab size: WordPiece, SentencePiece, and Byte Pair Encoding (BPE).
Tokenization pipeline example:
Input text: "Hmm.., I know I Don't know"
Normalization → "hmm.., i know i don't know"
Pre-tokenization → [hmm.., i, know, i, don't, know]
Subword tokenization → [hmm.., i, know, i, do, #n't, know]
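As a side note, the first two stages can be reproduced with the Hugging Face tokenizers library (covered at the end of this lecture). This is a minimal sketch, assuming Lowercase and WhitespaceSplit as one possible choice per stage; the printed outputs are abbreviated:

```python
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import WhitespaceSplit

# Normalization: case-folding (other normalizers strip accents, etc.)
print(Lowercase().normalize_str("Hmm.., I know I Don't know"))
# hmm.., i know i don't know

# Pre-tokenization: split on whitespace, keeping character offsets
print(WhitespaceSplit().pre_tokenize_str("hmm.., i know i don't know"))
# [('hmm..,', (0, 6)), ('i', (7, 8)), ('know', (9, 13)), ...]
```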
More examples:
"I enjoyed the movie" → [I enjoy #ed the movie]
"I enjoyed the movie." → [I enjoy #ed the movie .]
```python
import re, collections

def get_stats(vocab):
    """Count the frequency of every adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge every occurrence of `pair` into a single new symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Toy corpus: words pre-split into characters, with end-of-word marker </w>
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair
    vocab = merge_vocab(best, vocab)
    print(best)
```
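This is the BPE learning routine from Sennrich et al. (2016). On the toy counts above, the first merges printed should be ('e', 's'), ('es', 't'), and ('est', '</w>'), each occurring 9 times (6 from newest, 3 from widest).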
BPE walkthrough. Training corpus word counts:

Word | Frequency |
---|---|
'k n o w i n g </w>' | 3 |
't h e </w>' | 1 |
'n a m e </w>' | 1 |
'o f </w>' | 1 |
's o m e t h i n g </w>' | 2 |
'i s </w>' | 1 |
'd i f f e r e n t </w>' | 1 |
'f r o m </w>' | 1 |
's o m e t h i n g . </w>' | 1 |
'a b o u t </w>' | 1 |
'e v e r y t h i n g </w>' | 1 |
"i s n ' t </w>" | 1 |
'b a d </w>' | 1 |

Word count: 13.
Initial token counts: every character in the corpus plus the end-of-word marker '</w>', tallied over the word frequencies:

Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 |
'o' | 9 |
'i' | 10 |
'g' | 7 |
't' | 8 |
'h' | 5 |
'a' | 3 |
'd' | 2 |
"'" | 1 |
'</w>' | 3+1+1+1+2+...+1 = 16 |
... | ... |

'</w>' occurs once per word occurrence, hence 16. With 19 distinct letters plus '.', the apostrophe, and '</w>', the Initial Vocab Size is 22.
Byte-pair counts over the corpus (selected pairs shown):

Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'i') | 3 |
('i', 'n') | 7 |
('n', 'g') | 7 |
('g', '</w>') | 6 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
('d', '</w>') | 1 |
Merge step 1: the most frequent pair is ('i', 'n') with count 7 (('n', 'g') is also 7; max() keeps the pair encountered first). A new token 'in' is added, and the counts of 'i' and 'n' drop by the 7 occurrences absorbed into it:

Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 - 7 = 6 |
'o' | 9 |
'i' | 10 - 7 = 3 |
'</w>' | 16 |
"'" | 1 |
'in' | 7 (added new token) |
The corpus is rewritten with the merged token 'in':

Word | Frequency |
---|---|
'k n o w in g </w>' | 3 |
't h e </w>' | 1 |
'n a m e </w>' | 1 |
'o f </w>' | 1 |
's o m e t h in g </w>' | 2 |
'i s </w>' | 1 |
'd i f f e r e n t </w>' | 1 |
'f r o m </w>' | 1 |
's o m e t h in g . </w>' | 1 |
'a b o u t </w>' | 1 |
'e v e r y t h in g </w>' | 1 |
"i s n ' t </w>" | 1 |
'b a d </w>' | 1 |
Merge step 2: pair counts are recomputed over the rewritten corpus; pairs involving 'in' replace the old character pairs:

Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'in') | 3 |
('h', 'in') | 4 |
('in', 'g') | 7 |
('g', '</w>') | 6 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
('d', '</w>') | 1 |

The most frequent pair is now ('in', 'g'), so 'ing' is added to the vocabulary with count 7.
With a trained vocabulary, the tokenizer splits 'everything' into [everyth, ing] (the learned suffix 'ing' is split off), while 'bad' falls back to characters: [b, a, d].
Learned vocabulary after the merges (selected tokens):

Tokens |
---|
'k' |
'n' |
'o' |
'i' |
'</w>' |
'in' |
'ing' |
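As a sanity check, a short sketch that reuses get_stats and merge_vocab from the BPE code above on this corpus (word counts copied from the table) should reproduce the two merges traced here:

```python
corpus_vocab = {
    'k n o w i n g </w>': 3, 't h e </w>': 1, 'n a m e </w>': 1,
    'o f </w>': 1, 's o m e t h i n g </w>': 2, 'i s </w>': 1,
    'd i f f e r e n t </w>': 1, 'f r o m </w>': 1,
    's o m e t h i n g . </w>': 1, 'a b o u t </w>': 1,
    'e v e r y t h i n g </w>': 1, "i s n ' t </w>": 1, 'b a d </w>': 1,
}
for _ in range(2):
    pairs = get_stats(corpus_vocab)
    best = max(pairs, key=pairs.get)
    corpus_vocab = merge_vocab(best, corpus_vocab)
    print(best)   # expected: ('i', 'n'), then ('in', 'g')
```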
Japanese example: 映画『トランスフォーマー』を楽しく見ました ("I enjoyed watching the movie Transformers") is tokenized as [Eiga, toransufōmā, o, tanoshi, #ku, mi, #mashita]; merging the '#' continuations back gives [Eiga, toransufōmā, o, tanoshiku, mimashita].
Example:
\(\mathcal{V}=\{a,b,c,\cdots,z,lo,he\}\)
Tokenize the text "hello lol".
Start from characters: [h e l l o], [l o l]
Search for the byte-pair 'lo'; it is present, therefore merge: [h e l lo], [lo l]
Search for the byte-pair 'he'; it is present, therefore merge: [he l lo], [lo l]
Return the text after the merges: he #l #lo, lo #l
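A minimal sketch of this inference procedure (not the lecture's code): apply the learned merges, in the order they were learned, to each pre-tokenized word. The two-entry merge list is an assumption matching the example vocabulary, with 'lo' learned before 'he':

```python
def bpe_tokenize(word, merges):
    # merges: list of (left, right) pairs, in the order they were learned
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge in place
            else:
                i += 1
    return symbols

merges = [('l', 'o'), ('h', 'e')]       # assumed learned order
print(bpe_tokenize('hello', merges))    # ['he', 'l', 'lo']
print(bpe_tokenize('lol', merges))      # ['lo', 'l']
```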
WordPiece. The same corpus, now without the '</w>' marker; WordPiece instead marks word-internal continuation symbols with a prefix '#' (Hugging Face uses '##'):

Word | Frequency |
---|---|
'k n o w i n g' | 3 |
't h e' | 1 |
'n a m e' | 1 |
'o f' | 1 |
's o m e t h i n g' | 2 |
'i s' | 1 |
'd i f f e r e n t' | 1 |
'f r o m' | 1 |
's o m e t h i n g .' | 1 |
'a b o u t' | 1 |
'e v e r y t h i n g' | 1 |
"i s n ' t" | 1 |
'b a d' | 1 |
Initial token counts with the '#' continuation prefix (word-initial characters are unprefixed):

Character | Frequency |
---|---|
'k' | 3 |
'#n' | 13 |
'#o' | 9 |
'#h' | 5 |
"'" | 1 |
... | ... |

Initial Vocab Size: 22. For example, 'knowing' is pre-split as: k #n #o #w #i #n #g.
Pair counts are computed as before:

Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'i') | 3 |
('i', 'n') | 7 |
('n', 'g') | 7 |
('g', '.') | 1 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('a', 'd') | 1 |
Unlike BPE, WordPiece ranks pairs by the score freq(pair) / (freq(1st) × freq(2nd)), ignoring the prefix '#':

Pair | Freq of pair | Freq of 1st element | Freq of 2nd element | Score |
---|---|---|---|---|
('k', 'n') | 3 | 'k': 3 | 'n': 13 | 0.077 |
('n', 'o') | 3 | 'n': 13 | 'o': 9 | 0.026 |
('o', 'w') | 3 | 'o': 9 | 'w': 3 | 0.111 |
('i', 'n') | 7 | 'i': 10 | 'n': 13 | 0.054 |
('n', 'g') | 7 | 'n': 13 | 'g': 7 | 0.077 |
('t', 'h') | 5 | 't': 8 | 'h': 5 | 0.125 |
('a', 'd') | 1 | 'a': 3 | 'd': 2 | 0.167 |

The highest-scoring pair ('a', 'd') is merged first, even though its raw count is the lowest.
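A minimal sketch of this scoring rule (the values are copied from the table above; the formula is the standard WordPiece criterion):

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    # score(a, b) = freq(a, b) / (freq(a) * freq(b)):
    # a pair of rare units can outrank a pair of frequent ones.
    return pair_freq / (first_freq * second_freq)

print(round(wordpiece_score(3, 3, 13), 3))   # ('k', 'n') -> 0.077
print(round(wordpiece_score(7, 10, 13), 3))  # ('i', 'n') -> 0.054
print(round(wordpiece_score(1, 3, 2), 3))    # ('a', 'd') -> 0.167
```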
(Figure) Trade-offs between a small, medium, and larger vocab size.
SentencePiece / Unigram LM view: the observed variable is the text \(X\); its segmentation into subwords is hidden. All possible subwords of \(X\) form the candidate space, e.g. \(\mathcal{V}=\{h,e,l,l,o,he,el,lo,ll,hell\}\).
Unigram training starts from the same corpus word counts (table above), with a large seed vocabulary of candidate subwords drawn from the words, e.g.:

k, a, no, b, e, d, f, now, bow, bo, in, om, ro, ry, ad, ng, out, eve, win, some, bad, owi, ing, hing, thing, g, z, er
Each candidate token x is assigned a log-probability log(p(x)); tokenization then picks the segmentation of the input that maximizes the total log-probability:

Token | log(p(x)) |
---|---|
b | -4.7 |
e | -2.7 |
h | -3.34 |
r | -3.36 |
w | -4.29 |
wh | -5.99 |
er | -5.34 |
where | -8.21 |
by | -7.34 |
he | -6.02 |
ere | -6.83 |
here | -7.84 |
her | -7.38 |
re | -6.13 |
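A minimal sketch of this maximization, assuming the string being tokenized is "whereby": a Viterbi-style dynamic program over the log-probabilities above picks the best-scoring segmentation.

```python
import math

log_p = {'b': -4.7, 'e': -2.7, 'h': -3.34, 'r': -3.36, 'w': -4.29,
         'wh': -5.99, 'er': -5.34, 'where': -8.21, 'by': -7.34,
         'he': -6.02, 'ere': -6.83, 'here': -7.84, 'her': -7.38,
         're': -6.13}

def viterbi_segment(text):
    # best[i] = (score, tokens) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in log_p:
                score = best[j][0] + log_p[piece]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[-1]

print(viterbi_segment('whereby'))  # (≈ -15.55, ['where', 'by'])
```

Here ['where', 'by'] (-8.21 + -7.34 = -15.55) beats finer splits such as ['wh', 'ere', 'by'] (-20.16) or ['w', 'here', 'by'] (-19.47).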
The full pipeline, end to end:
Input text: "Hmm.., I know I Don't know"
Normalization → "hmm.., i know i don't know"
Pre-tokenization → [hmm.., i, know, i, don't, know]
Tokenization using learned vocab → [hmm.., i, know, i, do, #n't, know]
Post-processing (add model-specific tokens) → [<go>, hmm.., i, know, i, do, #n't, know, <end>]
The Hugging Face tokenizers library implements exactly this pipeline. A Tokenizer is composed of four components, each with several choices:

normalizer: LowerCase, StripAccents, ...
pre-tokenizer: WhiteSpace, Regex, BERT-like, ...
Tokenizer algorithm (model): BPE, WordPiece, ...
post-processor: inserts model-specific tokens

Each component is a property of Tokenizer (with getter, setter, and del): model, normalizer, pre_tokenizer, post_processor, decoder. padding and truncation are read-only properties (no setter).
For example, with a BPE model, Whitespace pre-tokenizer, Lowercase normalizer, and [CLS]/[SEP] as the post-processing tokens:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Lowercase

tokenizer = Tokenizer(BPE())
# the order in which the properties are set doesn't matter
tokenizer.pre_tokenizer = Whitespace()
tokenizer.normalizer = Lowercase()
```
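A fuller sketch, under the assumption that corpus.txt is a local training file (the trainer arguments are illustrative, not prescribed by the lecture):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Lowercase

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000,
                     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical file

enc = tokenizer.encode("Hmm.., I know I Don't know")
print(enc.tokens)   # subword strings
print(enc.ids)      # their token ids
```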
Key methods of Tokenizer:

add_special_tokens(str, AddedToken) # no token_ids are assigned
add_tokens()
enable_padding(), enable_truncation()
encode(seq, pair, is_pretokenized), encode_batch()
decode(), decode_batch()
from_file(.json) # serialized local json file
from_pretrained(.json) # from hub
get_vocab(), get_vocab_size()
id_to_token(), token_to_id()
post_process()
train(files), train_from_iterator(dataset)
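A short usage sketch of a few of these methods, continuing from the tokenizer trained above:

```python
# Pad batches to equal length and cap the sequence length
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"),
                         pad_token="[PAD]")
tokenizer.enable_truncation(max_length=16)

batch = tokenizer.encode_batch(["This is great", "tell me a joke"])
print(batch[0].tokens, batch[0].attention_mask)
print(tokenizer.decode(batch[0].ids))

print(tokenizer.get_vocab_size())
print(tokenizer.token_to_id("[CLS]"), tokenizer.id_to_token(0))
```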
Core class: tokenizers.Tokenizer.

train_from_iterator(dataset, trainer) trains a Tokenizer object from an iterator over strings (the text), given a trainer object holding options such as vocab_size, special_tokens, prefix, and so on.

encode() encodes a single input string; encode_batch() encodes a list, e.g. ['This is great', 'tell me a joke']. The returned Encoding object holds input_ids, but also optional outputs like attention_mask and type_ids (which we will learn about in the next lecture).

post_process() is a method of Tokenizer that is used internally; we can just set the desired format via optional parameters like pair in encode():

```python
encoding = Encoding()
tokenizer.post_process(encoding, pair=True)
```
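For instance, a sketch of wiring a post-processor so that encode() inserts the model-specific tokens automatically (continuing the trained tokenizer from above; the template strings are one common choice, not the lecture's):

```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", tokenizer.token_to_id("[CLS]")),
                    ("[SEP]", tokenizer.token_to_id("[SEP]"))],
)

enc = tokenizer.encode("This is great", "tell me a joke")  # a sequence pair
print(enc.tokens)    # [CLS] ... [SEP] ... [SEP]
print(enc.type_ids)  # 0s for the first segment, 1s for the second
```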