fgg blog

Tokenization: BPE, Unigram and more

## There is more than one way to tokenize a sentence

  • word-level chunks/tokens

    • A big vocabulary is needed
    • Combined words: what exactly constitutes a word? Is “bachelor of science” a single unit or three isolated words?
    • Abbreviated words: “LOL”, “IMO”, are these collections of words or new words?
    • Languages that don’t segment by spaces
  • character-level chunks/tokens

    • Lack of meaning: Unlike words, characters don’t carry any inherent meaning, so the model may lose the word-level semantic information.
    • Increased input computation
    • Limits network choices: It’s difficult to use architectures that process input sequentially, since the input sequences will be much longer.
  • Subword-level chunks/tokens

    • We want a tokenization scheme that deals with a potentially infinite vocabulary via a finite list of known subwords, e.g. making up the word “unfortunately” from “un” + “for” + “tun” + “ate” + “ly”.
    • Subword tokenization breaks the text into chunks based on word frequency. In practice, common words will generally be tokenized as whole words, e.g. “the”, “at”, “and”, etc., while rarer words will be broken into smaller chunks that can be reused to build the rest of the words in the relevant dataset.
    • BPE (Byte Pair Encoding): One popular algorithm for subword tokenization which follows the above approach is BPE. BPE was originally used to help compress data by finding common byte pair combinations. It can also be applied to NLP to find the most efficient way of representing text.
      • What is merging? The main goal of the BPE subword algorithm is to find a way to represent your entire text dataset with the least amount of tokens. Similar to a compression algorithm, you want to find the best way to represent your image, text or whatever you are encoding, which uses the least amount of data, or in our case tokens. In the BPE algorithm, merging is the way we try to “compress” the text into subword units.
      • There are a few steps to these merging actions (a toy sketch follows after this list):
        1. Get the word count frequency
        2. Get the initial token count and frequency (i.e., how many times each character occurs)
        3. Merge the most common byte pairing
        4. Add this to the list of tokens and recalculate the frequency count for each token (this will change with each merging step)
        5. Rinse and repeat until you reach a pre-defined token limit (vocab size) or a set number of iterations
    • Greedy algorithm: BPE ensures that the most common words will be represented in the new vocabulary as a single token, while less common words will be broken down into two or more subword tokens. To achieve this, BPE goes through every potential option at each step and picks the pair to merge based on the highest frequency. One downside of BPE’s greedy approach is that it can result in a potentially ambiguous final token vocabulary. For instance, GPT has a vocabulary size of 40,478: 478 base characters plus the tokens learned from 40,000 merges.
    • BBPE (byte-level BPE): A base vocabulary that includes all possible base characters can be quite large if, e.g., all Unicode characters are considered as base characters. To have a better base vocabulary, GPT-2 uses bytes as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary (see the byte-level illustration after this list). With some additional rules to deal with punctuation, GPT-2’s tokenizer can tokenize every text without the need for the `<unk>` symbol. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 byte base tokens, a special end-of-text token, and the symbols learned with 50,000 merges (from the Hugging Face docs).
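
To make those merging steps concrete, here is a minimal sketch of the BPE training loop in plain Python. It is not any particular library’s implementation; the toy corpus and the `num_merges` budget are made up for illustration.

```python
from collections import Counter

# Step 1: word count frequencies for a toy corpus (made up for the example).
word_freqs = Counter(["low", "low", "lower", "newest", "newest", "newest", "widest"])

# Step 2: the initial tokens are single characters.
splits = {word: list(word) for word in word_freqs}

def pair_counts(splits, word_freqs):
    """Count how often each adjacent token pair occurs across the corpus."""
    counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

num_merges = 10  # pre-defined budget (stands in for a target vocab size)
merges = []
for _ in range(num_merges):
    counts = pair_counts(splits, word_freqs)
    if not counts:
        break
    best = max(counts, key=counts.get)  # step 3: most frequent pair
    merges.append(best)
    for word, symbols in splits.items():  # step 4: apply the merge everywhere
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

print(merges)            # the learned merge rules, in order
print(splits["newest"])  # frequent words end up as a single token, e.g. ['newest']
```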
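
And a quick illustration of why a byte-level base vocabulary never grows past 256 (the sample string is arbitrary): the text is first reduced to its UTF-8 bytes, and merges are then learned over those byte values.

```python
text = "héllo 🙂"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
# [104, 195, 169, 108, 108, 111, 32, 240, 159, 153, 130]
# Every value falls in 0..255, so 256 base tokens cover any character;
# 'é' and the emoji simply become multi-byte sequences that later merges
# can learn to recombine into single tokens.
```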

## Probabilistic Subword Tokenization

Using the frequency of subword patterns for tokenization can result in ambiguous final encodings. The problem is that we have no way to predict which particular tokenization is more likely to be the best one when encoding any new input text. Luckily, needing to predict the most likely sequence of text is not a problem unique to tokenization: language modelling does exactly this, and we can leverage that knowledge to build a better tokenizer.

  • Unigram Subword Tokenization

    • The goal for a subword model, however, is different from that of an LM trying to predict a full sentence: we only want something that generates an unambiguous tokenization.
    • The unigram approach differs from BPE in that it attempts to choose the most likely option rather than simply the best (most frequent) option at each iteration. To generate a unigram subword token set you need to first define the desired final size of your token set and also a starting seed subword token set.
    • You can choose the seed subword token set in a similar way to BPE, picking the most frequently occurring substrings. Once you have this in place, you then need to:
      1. Work out the probability for each subword token
      2. Work out a loss value which would result if each subword token were to be dropped. The loss is worked out via the Expectation-Maximization (EM) algorithm.
      3. Drop the tokens which have the largest loss value (e.g., the bottom 10% or 20% of subword tokens based on their loss calculations).
      4. Repeat these steps until you reach the desired final vocabulary size or there is no change in token numbers after successive iterations (a small segmentation sketch follows after this list).
  • WordPiece (a greedy-approach tokenizer, used with BERT): Think of WordPiece as an intermediary between the BPE approach and the unigram approach.

    • BPE, if you remember, looks at the frequency of each adjacent token pair and then merges the pair that has the highest combined frequency count. It only considers the most frequent pair combination at each step, nothing else.
    • An alternative approach is to check the potential impact of merging a particular pair. You can do this using the probabilistic LM approach: at each iterative step, choose the pair which will result in the largest increase in likelihood once merged. This is the difference between the (log-)probability of the new merged pair occurring and the (log-)probabilities of the two tokens occurring individually (a small scoring sketch appears below).
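
Returning to the unigram steps above, here is a rough sketch of the scoring side of the model: given per-token probabilities (the tiny vocabulary and the probability values below are invented for illustration), a Viterbi-style dynamic program picks the segmentation with the highest total log-probability.

```python
import math

# Hypothetical unigram probabilities for a handful of subword tokens.
token_logp = {t: math.log(p) for t, p in {
    "un": 0.10, "for": 0.04, "tun": 0.02, "ate": 0.05,
    "ly": 0.07, "u": 0.01, "n": 0.01,
}.items()}

def best_segmentation(word, token_logp):
    """best[i] holds (score, tokens) for the best segmentation of word[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_logp and best[start][1] is not None:
                score = best[start][0] + token_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

print(best_segmentation("unfortunately", token_logp))
# -> (total log-probability, ['un', 'for', 'tun', 'ate', 'ly'])
```

Dropping a token from the seed vocabulary and measuring how much the corpus-level likelihood falls under this kind of scoring is what the EM-based loss in step 2 estimates.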

The main difference is that WordPiece is still a greedy approach. It builds a tokenizer from the bottom up, picking the best pair to merge at each iteration; it uses likelihood rather than raw frequency counts, but otherwise it is similar to BPE. Unigram, in contrast, is a fully probabilistic approach: it uses the model likelihood both to score candidate tokenizations and to decide which tokens to keep, removing the tokens that add the least to the overall likelihood of the unigram model.
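
As a small sketch of that scoring rule (the token and pair counts below are invented for illustration), the WordPiece-style merge score is commonly written as the pair count normalised by the counts of its parts, which corresponds to the likelihood gain described above:

```python
# Hypothetical counts gathered from a corpus (made up for the example).
token_counts = {"h": 50, "u": 40, "g": 60}
pair_counts = {("h", "u"): 30, ("u", "g"): 35}

def wordpiece_score(pair):
    """score = count(pair) / (count(left) * count(right)); higher is better.
    Up to constants, this is the log-likelihood gain of merging the pair:
    log P(pair) - log P(left) - log P(right)."""
    left, right = pair
    return pair_counts[pair] / (token_counts[left] * token_counts[right])

best_pair = max(pair_counts, key=wordpiece_score)
print(best_pair)  # ('h', 'u')
# Note: plain frequency-based BPE would merge ('u', 'g') here (35 > 30),
# but the normalised score favours ('h', 'u'), since its parts are
# individually less frequent.
```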

## To briefly summarize

  • BPE: Just uses the frequency of occurrences to identify the best match at every iteration until it reaches the predefined vocabulary size.

  • WordPiece: Similar to BPE and uses frequency occurrences to identify potential merges, but makes the final decision based on the likelihood of the merged token.

  • Unigram: A fully probabilistic model which does not rely on frequency occurrences alone. Instead, it trains a probabilistic language model, removes the token which improves the overall likelihood the least, and repeats until it reaches the final token limit.

## SentencePiece

SentencePiece basically tries to bring all the subword tokenization tools and techniques under one banner. It’s kind of like the Swiss Army knife for subword tokenization. To be a Swiss Army-like tool, something has to be capable of solving multiple problems. So what problems is SentencePiece addressing?

  1. All other models assume input is already tokenized: BPE and Unigram are great models, but they share one big disadvantage: they both need their input to be pre-tokenized. SentencePiece deals with this by simply taking raw text as input and then doing everything needed on that input to perform subword tokenization.
  2. Language agnostic: Since all other subword algorithms need their input pre-tokenized, their applicability to many languages (e.g. those that don’t segment words with spaces) is limited.
  3. Decoding is difficult: Another problem caused by models like BPE and Unigram requiring already-tokenized inputs is that you do not know what encoding rules were used. For example, how were spaces encoded in the tokens? So you cannot decode the tokens and return them to their original format.
  4. No end-to-end solution: You cannot just plug a raw input into BPE (or Unigram) and get an output.

Some of the techniques SentencePiece uses to address the above shortcomings:

  1. Encode everything as Unicode: SentencePiece first converts all the input into Unicode characters. This makes it a language-agnostic tool.
  2. “space” is encoded as “▁” (U+2581): To get around the word-segmentation issues.
  3. And it’s faster: One of the issues preventing other subword algorithms from being used to tokenize raw sentences as part of model training was their lack of speed. If you processed input in real time and performed your tokenization on the raw input, it would be too slow. SentencePiece addresses this by using a priority queue for the BPE algorithm to speed it up, so that you can use it as part of an end-to-end solution.
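
As a rough end-to-end illustration of that workflow, assuming the `sentencepiece` Python package is installed (the file names, vocabulary size, and sample sentence below are placeholders):

```python
import sentencepiece as spm

# Train directly on a raw-text file; no pre-tokenization step is needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder path to raw training text
    model_prefix="demo_sp",  # writes demo_sp.model and demo_sp.vocab
    vocab_size=8000,
    model_type="unigram",    # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="demo_sp.model")

pieces = sp.encode("This is a test.", out_type=str)
print(pieces)             # e.g. ['▁This', '▁is', '▁a', '▁test', '.']
print(sp.decode(pieces))  # the '▁' space marker makes decoding lossless: "This is a test."
```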

https://www.openteams.com/tokenizers-how-machines-read/

TODO: add Chinese content