
# Subword Chunking

Tokenization: BPE, Unigram, and more

## There is more than one way to tokenize a sentence

  • Word-level chunks/tokens

    • Requires a large vocabulary.
    • Compound expressions: what exactly constitutes a word? Is “bachelor of science” one word or three isolated words?
    • Abbreviations: are “LOL” and “IMO” collections of words or new words in their own right?
    • Some languages (e.g., Chinese, Japanese) don’t segment words by spaces.
  • Character-level chunks/tokens

    • Lack of meaning: unlike words, characters carry little inherent meaning, so the model may lose the semantic information that words provide.
    • Longer input sequences increase computation.
    • Limits network choices: architectures that process the input sequentially are harder to use, since the input sequences become much longer.
  • Subword-level chunks/tokens: a middle ground between the two extremes (see the sketch after this list).
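
To make the contrast concrete, here is a minimal Python sketch that tokenizes one sentence at all three levels. The subword vocabulary and the greedy longest-match rule are hand-written illustrations (`VOCAB` and `subword_tokenize` are made-up names, not a real library API); BPE and Unigram learn their vocabularies from a corpus and segment with different algorithms.

```python
# A toy greedy longest-match segmenter over a hand-written subword vocabulary.
# Real BPE/Unigram vocabularies are learned from data; this one is made up.
VOCAB = {"token", "ization", "matter", "s"}

def subword_tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:                              # no match: fall back to a character
            pieces.append(word[i])
            i += 1
    return pieces

sentence = "tokenization matters"
print(sentence.split())   # word-level: ['tokenization', 'matters']
print(list(sentence))     # character-level: one symbol per character
print([p for w in sentence.split() for p in subword_tokenize(w)])
# subword-level: ['token', 'ization', 'matter', 's']
```

Note how the subword split keeps recognizable units like “token” and “matter” while covering rare or unseen words by composition, which is exactly the trade-off BPE and Unigram are designed to exploit.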