
# Subword Chunking

Tokenization: BPE, Unigram, and more

## There is more than one way to tokenize a sentence

  • Word-level chunks/tokens

    • Requires a large vocabulary.
    • Compound expressions: what exactly constitutes a word? Is “bachelor of science” one word or three isolated words?
    • Abbreviations: are “LOL” and “IMO” collections of words or new words in their own right?
    • Some languages (e.g., Chinese, Japanese) don’t segment words by spaces.
  • Character-level chunks/tokens

    • Lack of meaning: unlike words, characters carry little inherent meaning, so the model may lose the semantic information that words provide.
    • Longer input sequences increase computation.
    • Limits network choices: architectures that process the input sequentially are harder to use, since the input sequences become much longer.
  • Subword-level chunks/tokens: a middle ground between the two extremes (see the sketch after this list).
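
To make the contrast concrete, here is a minimal Python sketch that tokenizes one sentence at all three levels. The subword vocabulary and the greedy longest-match rule are hand-written illustrations (`VOCAB` and `subword_tokenize` are made-up names, not a real library API); BPE and Unigram learn their vocabularies from a corpus and segment with different algorithms.

```python
# A toy greedy longest-match segmenter over a hand-written subword vocabulary.
# Real BPE/Unigram vocabularies are learned from data; this one is made up.
VOCAB = {"token", "ization", "matter", "s"}

def subword_tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:                              # no match: fall back to a character
            pieces.append(word[i])
            i += 1
    return pieces

sentence = "tokenization matters"
print(sentence.split())   # word-level: ['tokenization', 'matters']
print(list(sentence))     # character-level: one symbol per character
print([p for w in sentence.split() for p in subword_tokenize(w)])
# subword-level: ['token', 'ization', 'matter', 's']
```

Note how the subword split keeps recognizable units like “token” and “matter” while covering rare or unseen words by composition, which is exactly the trade-off BPE and Unigram are designed to exploit.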