Tokenizer
Token IDs
Embeddings
Language Model
IDs to tokens
How are you?
I
enjoyed
the
movie
transformers
Feed Forward Network
Multi-Head Attention
<go>
<stop>
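The pipeline above (text → tokenizer → token IDs → embeddings → language model, and IDs back to tokens) can be sketched with a hypothetical toy vocabulary; the vocab and its ID assignments are invented for illustration:

```python
# Hypothetical toy vocabulary for illustration only
vocab = {'<go>': 0, '<stop>': 1, 'I': 2, 'enjoyed': 3, 'the': 4,
         'movie': 5, 'transformers': 6}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    # text -> token IDs (whitespace tokenizer for simplicity)
    return [vocab[tok] for tok in text.split()]

def decode(ids):
    # token IDs -> tokens -> text
    return ' '.join(id_to_token[i] for i in ids)

ids = encode("I enjoyed the movie transformers")
print(ids)          # [2, 3, 4, 5, 6]
print(decode(ids))  # I enjoyed the movie transformers
```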
映画『トランスフォーマー』を楽しく見ました ("I enjoyed watching the movie Transformers")
Source: lemongrad
Hmm.. I know, I don't know
[H, m, m, ., ., I, k, n, o, w, ,, I, d, o, n, ', t, k, n, o, w]
21 |
Number of tokens
Encoder
\(\mathcal{V}\): {H, m, ., ',', I, k, n, o, w, d, "'", t}
\(|\mathcal{V}|\): 12
[Hmm.., I, know,, I, don't, know]
6 |
Number of tokens
\(\mathcal{V}\): {Hmm.., I, know, don't}
\(|\mathcal{V}|\): 4
Hmm.. I know, I don't know
[Hmm.., I, know,, I, do, n't, know]
7 |
Number of tokens
\(\mathcal{V}\): {Hmm.., I, know, do, n't}
\(|\mathcal{V}|\): 5
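The character-level and word-level counts above can be checked with a few lines of Python (a sketch; whitespace splitting stands in for word tokenization, so punctuation-attached forms like "know," count as types distinct from "know"):

```python
text = "Hmm.. I know, I don't know"

# Character-level: every non-space character is a token
char_tokens = [c for c in text if c != ' ']
char_vocab = set(char_tokens)

# Word-level: split on whitespace (punctuation stays attached)
word_tokens = text.split()
word_vocab = set(word_tokens)

print(len(char_tokens), len(char_vocab))  # 21 tokens, 12 unique characters
print(len(word_tokens), len(word_vocab))  # 6 tokens, 5 unique words
```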
* There are many subword tokenization algorithms; these three are the most commonly used:
WordPiece
SentencePiece
Byte Pair Encoding (BPE)
Hmm.., I know I Don't know
Input text
hmm.., i know i don't know
normalization
Pre-tokenization
[hmm.., i, know, i, don't, know]
Subword Tokenization
[hmm.., i, know, i, do, #n't, know]
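The stages above can be sketched as follows (a minimal sketch; the lowercasing normalizer and whitespace pre-tokenizer are assumptions mirroring the slide, so punctuation stays attached to words):

```python
def normalize(text):
    # normalization: lowercase (real normalizers also handle unicode, accents, ...)
    return text.lower()

def pre_tokenize(text):
    # pre-tokenization: split on whitespace
    return text.split()

text = "Hmm.., I know I Don't know"
normalized = normalize(text)       # "hmm.., i know i don't know"
words = pre_tokenize(normalized)   # ['hmm..,', 'i', 'know', 'i', "don't", 'know']
```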
I enjoyed the movie
[I enjoy #ed the movie]
[I enjoy #ed the movie .]
I enjoyed the movie.
Encoder
Decoder
Tokens from Source language
Tokens from Target language
English: Chandrayan
Tamil: ?
Chandrayan
[ச, ந்தி, ரா, ய, ன்]
import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
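Running the same learner on the "knowing the name of something…" corpus used in the walkthrough below reproduces its first merges; a self-contained sketch with the word frequencies taken from the table:

```python
import re, collections

def get_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # replace every occurrence of the pair with the merged symbol
    p = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in v_in.items()}

# word frequencies from the walkthrough's table
vocab = {'k n o w i n g </w>': 3, 't h e </w>': 1, 'n a m e </w>': 1,
         'o f </w>': 1, 's o m e t h i n g </w>': 2, 'i s </w>': 1,
         'd i f f e r e n t </w>': 1, 'f r o m </w>': 1,
         's o m e t h i n g . </w>': 1, 'a b o u t </w>': 1,
         'e v e r y t h i n g </w>': 1, "i s n ' t </w>": 1, 'b a d </w>': 1}

merges = []
for _ in range(3):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # ties broken by insertion order
    merges.append(best)
    vocab = merge_vocab(best, vocab)

print(merges)  # [('i', 'n'), ('in', 'g'), ('ing', '</w>')]
```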
Word | Frequency |
---|---|
'k n o w i n g </w>' | 3 |
't h e </w>' | 1 |
'n a m e </w>' | 1 |
'o f </w>' | 1 |
's o m e t h i n g </w>' | 2 |
'i s </w>' | 1 |
'd i f f e r e n t </w>' | 1 |
'f r o m </w>' | 1 |
's o m e t h i n g . </w>' | 1 |
'a b o u t </w>' | 1 |
'e v e r y t h i n g </w>' | 1 |
"i s n ' t </w>" | 1 |
'b a d </w>' | 1 |
Word count
Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 |
'o' | 9 |
'i' | 10 |
'</w>' | 3+1+1+1+2+...+1 = 16 |
"'" | 1 |
... | ... |
Initial tokens and count
Initial Vocab Size: 22
Byte-Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'i') | 3 |
('i', 'n') | 7 |
('n', 'g') | 7 |
('g', '</w>') | 6 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
('d', '</w>') | 1 |
Byte-Pair count
Character | Frequency |
---|---|
'k' | 3 |
'n' | 13 - 7 = 6 |
'o' | 9 |
'i' | 10 - 7 = 3 |
'</w>' | 16 |
"'" | 1 |
"in" | 7 |
Updated Vocabulary
Added new token "in"
Word | Frequency |
---|---|
'k n o w in g </w>' | 3 |
't h e </w>' | 1 |
'n a m e </w>' | 1 |
'o f </w>' | 1 |
's o m e t h in g </w>' | 2 |
'i s </w>' | 1 |
'd i f f e r e n t </w>' | 1 |
'f r o m </w>' | 1 |
's o m e t h in g . </w>' | 1 |
'a b o u t </w>' | 1 |
'e v e r y t h in g </w>' | 1 |
"i s n ' t </w>" | 1 |
'b a d </w>' | 1 |
Word count (after merging "in")
Byte-Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'in') | 3 |
('h', 'in') | 4 |
('in', 'g') | 7 |
('g', '</w>') | 6 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
('d', '</w>') | 1 |
Byte-Pair count (recomputed after merging "in")
Character | Frequency |
---|---|
'k' | 3 |
'n' | 6 |
'o' | 9 |
'i' | 3 |
'</w>' | 16 |
"'" | 1 |
'g' | 7 |
"in" | 7 |
"ing" | 7 |
Updated vocabulary
everything → tokenizer → [everyth, ing]
bad → tokenizer → [b, a, d]
Tokens |
---|
'k' |
'n' |
'o' |
'i' |
'</w>' |
'in' |
'ing' |
映画『トランスフォーマー』を楽しく見ました
[Eiga, toransufōmā, o, tanoshi, #ku, mi, #mashita]
[Eiga, toransufōmā, o, tanoshiku, mimashita]
Example:
\(\mathcal{V}=\{a,b,c,\cdots,z,lo,he\}\)
Tokenize the text "hello lol"
[h e l l o], [l o l]
search for the byte-pair 'lo', if present merge
Yes. Therefore, Merge
[h e l lo], [lo l]
search for the byte-pair 'he', if present merge
Yes. Therefore, Merge
[he l lo], [lo l]
return the text after merge
he #l #lo, lo #l
def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split
    return sum(splits, [])
Search for each byte-pair, in the order the merges were inserted into the vocab, through the entire input text
Merge wherever found
Merges build progressively longer units: first 2-grams, then 3-grams, ...
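The merge loop above can be sketched without the Hugging Face wrapper; applying the example's two merges ('l','o') → 'lo' and ('h','e') → 'he' to "hello lol" reproduces the segmentation shown earlier (a sketch; the ordered-dict merges format is an assumption):

```python
# Ordered merges as learned by BPE (order matters: 'lo' was learned before 'he')
merges = {('l', 'o'): 'lo', ('h', 'e'): 'he'}

def bpe_tokenize(text, merges):
    # split each whitespace word into characters, then apply merges in order
    splits = [list(word) for word in text.split()]
    for pair, merged in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if (split[i], split[i + 1]) == pair:
                    split = split[:i] + [merged] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split
    return [tok for split in splits for tok in split]

print(bpe_tokenize("hello lol", merges))  # ['he', 'l', 'lo', 'lo', 'l']
```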
Word | Frequency |
---|---|
'k n o w i n g' | 3 |
't h e' | 1 |
'n a m e' | 1 |
'o f' | 1 |
's o m e t h i n g' | 2 |
'i s' | 1 |
'd i f f e r e n t' | 1 |
'f r o m' | 1 |
's o m e t h i n g .' | 1 |
'a b o u t' | 1 |
'e v e r y t h i n g' | 1 |
"i s n ' t" | 1 |
'b a d' | 1 |
Word count
Character | Frequency |
---|---|
'k' | 3 |
'#n' | 13 |
'#o' | 9 |
't' | 16 |
'#h' | 5 |
"'" | 1 |
Initial Vocab Size: 22
knowing
k #n #o #w #i #n #g
Byte-Pair | Frequency |
---|---|
('k', 'n') | 3 |
('n', 'o') | 3 |
('o', 'w') | 3 |
('w', 'i') | 3 |
('i', 'n') | 7 |
('n', 'g') | 7 |
('g', '.') | 1 |
('t', 'h') | 5 |
('h', 'e') | 1 |
('e', '</w>') | 2 |
('a', 'd') | 1 |
(frequencies below are computed ignoring the '#' prefix)
Freq of 1st element | Freq of 2nd element | score |
---|---|---|
'k':3 | 'n':13 | 0.076 |
'n':13 | 'o':9 | 0.02 |
'o':9 | 'w':3 | 0.111 |
'i':10 | 'n':13 | 0.05 |
'n':13 | 'g':7 | 0.076 |
't':8 | 'h':5 | 0.125 |
'a':3 | 'd':2 | 0.16 |
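The scores above follow WordPiece's pair-scoring rule: the frequency of the pair divided by the product of its parts' frequencies. For example, for ('k','n'):

```latex
\[
\mathrm{score}(a,b) = \frac{\mathrm{freq}(ab)}{\mathrm{freq}(a)\,\mathrm{freq}(b)},
\qquad
\mathrm{score}(k,n) = \frac{3}{3 \times 13} \approx 0.076
\]
```

Dividing by the parts' frequencies favors merging pairs whose parts rarely occur apart, rather than pairs that are merely frequent.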
Small Vocab
Larger Vocab
Medium
Vocab
\(\mathcal{V}\)={h,e,l,l,o,he,el,lo,ll,hell}
\(\mathcal{V}\)={h,e,l,l,o,el,he,lo,ll,hell}
All possible subwords of \(X\)
hidden: the segmentation into subwords
observed: the text \(X\)
Word | Frequency |
---|---|
'k n o w i n g' | 3 |
't h e' | 1 |
'n a m e' | 1 |
'o f' | 1 |
's o m e t h i n g' | 2 |
'i s' | 1 |
'd i f f e r e n t' | 1 |
'f r o m' | 1 |
's o m e t h i n g .' | 1 |
'a b o u t' | 1 |
'e v e r y t h i n g' | 1 |
"i s n ' t" | 1 |
'b a d' | 1 |
k
a
no
b
e
d
f
now
bow
bo
in
om
ro
ry
ad
ng
out
eve
win
some
bad
owi
ing
hing
thing
g
z
er
Token | log(p(x)) |
---|---|
b | -4.7 |
e | -2.7 |
h | -3.34 |
r | -3.36 |
w | -4.29 |
wh | -5.99 |
er | -5.34 |
where | -8.21 |
by | -7.34 |
he | -6.02 |
ere | -6.83 |
here | -7.84 |
her | -7.38 |
re | -6.13 |
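Given per-token log-probabilities like these, the most likely segmentation can be found with Viterbi decoding over all splits; a sketch (assuming the table is meant for segmenting the word "whereby"):

```python
import math

# token log-probabilities copied from the table above
logp = {'b': -4.7, 'e': -2.7, 'h': -3.34, 'r': -3.36, 'w': -4.29,
        'wh': -5.99, 'er': -5.34, 'where': -8.21, 'by': -7.34,
        'he': -6.02, 'ere': -6.83, 'here': -7.84, 'her': -7.38, 're': -6.13}

def viterbi_segment(text, logp):
    # best[i] holds (log-prob, tokens) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    max_len = max(len(tok) for tok in logp)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in logp and best[j][0] + logp[piece] > best[i][0]:
                best[i] = (best[j][0] + logp[piece], best[j][1] + [piece])
    return best[len(text)]

score, tokens = viterbi_segment("whereby", logp)
print(tokens)  # ['where', 'by']
```

Here log p(where) + log p(by) = -8.21 - 7.34 = -15.55 beats every finer-grained split, e.g. wh + ere + by at -20.16.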
The WordPiece model uses a likelihood-based score instead of raw frequency.
SentencePiece is so called because it treats the entire sentence (corpus) as a single input and finds the subwords for it.
Substrings and subwords are not one and the same.
The suffix array \(A\) of \(S\) is defined to be an array of integers providing the starting positions of the suffixes of \(S\) in lexicographical order.
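The suffix-array definition above in two lines of Python (a naive O(n² log n) sketch for illustration; practical implementations use linear-time constructions):

```python
def suffix_array(s):
    # starting positions of the suffixes of s, sorted lexicographically
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

For "banana" the sorted suffixes are: a (5), ana (3), anana (1), banana (0), na (4), nana (2).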
Counting subwords: https://math.stackexchange.com/questions/104258/counting-subwords