wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??

Published September 27, 2024

TL;DR: train your own custom tokenizer!

Even though tokenizer choices have significant impacts on model performance, tokenization is a relatively understudied aspect of language model research. Every once in a while, a tokenizer-adjacent story will go around, like how ChatGPT can’t tell you how many r’s are in ‘strawberry’ or that multimodal models are case-sensitive. But more attention is paid to other research areas, as evidenced by the number of times ‘tokenization’ appears in this year’s ACL abstracts (25, versus 63 for ‘adapter’, 476 for ‘multimodal’, and 1,103 for ‘benchmark’).

As a result of this disinterest in tokenizers, it’s relatively common to see models trained with tokenizers that were developed for other models. A notable example of tokenizer reuse is the Llama 3(.1) tokenizer, which is adapted from the tiktoken tokenizer used for OpenAI models like GPT-4. At the same time, there is a growing standardization of LLM design, with Llama fast emerging as the leading approach for training. This has led many open models, as well as training and inference frameworks, to simply reuse the Llama tokenizer without giving it much attention. For example, vLLM only supports Llama-style tokenizers.

In my own tests, I’ve found the best results when I train tokenizers from scratch on a representative sample of the model training data and design the tokenizer for the desired languages and domains. Doing so leads to better compression, which in turn reduces the time and cost of model training. The over-reliance on a particular type of tokenizer or the practice of re-using tokenizers is likely harming performance in ways we don’t fully understand.

Tokenizer Training Data

The data that the tokenizer is trained on affects the tokenization quality. Without any other optimizations, I get better compression across all languages represented in our data when using a tokenizer trained on that data. Existing tokenizers in some cases have extremely bad compression rates on our data.

Training on representative data ensures the tokenizer vocabulary will be the most relevant, without having to manually add vocabulary items to an existing tokenizer. Manually extending a vocabulary requires determining in advance which terms are relevant for the domain, and it is unlikely that anyone would manually add the relevant subwords, which are just as important for getting good compression in the desired domain.

This is especially relevant in domains with highly specialized vocabulary, like clinical NLP, or when working across different languages.
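
To make this concrete, here is a minimal sketch of training a tokenizer from scratch with the Hugging Face tokenizers library. The corpus path, vocabulary size, and special tokens are illustrative placeholders rather than our exact setup; the point is simply that the training files should mirror the languages and domains the model will actually see.

```python
# Minimal sketch: train a BPE tokenizer on a representative corpus sample.
# The file name, vocab size, and special tokens below are illustrative only.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# Train on a sample drawn from the actual pretraining mix
# (all target languages and domains), not generic English web text.
tokenizer.train(files=["corpus_sample.txt"], trainer=trainer)
tokenizer.save("custom_tokenizer.json")
```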

Pre-Tokenizer

Pre-tokenization is an even more understudied topic than tokenization. This step entails splitting texts into discrete units, which can then further be split into subwords. One of the most common pre-tokenization methods is whitespace pre-tokenization. OLMo, Mistral, Gemma, and Claude all use some variant of whitespace pre-tokenization.

One exception is the pre-tokenizer used by GPT-4 and Llama 3.1. Both use regex-based pre-tokenization, which splits texts according to several different criteria.

There are two parts of this pre-tokenizer that could be problematic.

  • (?i:'s|'t|'re|'ve|'m|'ll|'d): common English contractions, e.g. 's, 've, 'll
  • \p{N}{1,3}: any sequence of 1 to 3 numeric characters

Segmenting along English contractions overly optimizes for English and offers little benefit for other languages. It may also have unintended consequences for other languages or for code. For example, apostrophes delimit strings in programming languages such as Python, so if a string’s content begins with the letters of one of the listed contractions, the opening apostrophe gets grouped with those letters during pre-tokenization. As a result, text inside single quotation marks can be tokenized differently depending on how it starts.
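
As a rough illustration (not from the original analysis), the snippet below uses tiktoken’s cl100k_base encoding, the GPT-4 encoding that the Llama 3 tokenizer is adapted from, to compare a Python string that starts with one of the contraction letters against one that doesn’t.

```python
# Hedged illustration with tiktoken's cl100k_base (GPT-4) encoding:
# the contraction rule can group an opening apostrophe with the letters
# that follow it, so strings starting with 's', 're', 'll', etc. may be
# segmented differently from otherwise similar strings.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for code in ["x = 'section'", "x = 'portion'"]:
    pieces = [enc.decode_single_token_bytes(t) for t in enc.encode(code)]
    print(code, "->", pieces)
```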

Segmenting sequences of numeric digits is also an important consideration. This method allows sequences of up to three digits, which departs from the original Llama method, where every digit was separated in pre-tokenization. This may affect arithmetic and other numeric reasoning tasks, such as multiplication or judging whether 9.11 is bigger than 9.9.
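
A quick way to see the 1-to-3-digit rule in action, again with cl100k_base: long digit strings are pre-tokenized into runs of at most three digits, whereas the original Llama tokenizer would have produced one token per digit.

```python
# Sketch: the \p{N}{1,3} rule limits digit runs to three digits each.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for number in ["7", "1234", "1234567", "123456789012"]:
    pieces = [enc.decode_single_token_bytes(t).decode() for t in enc.encode(number)]
    print(number, "->", pieces)
```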

An analysis of multiplication capabilities for GPT-4o1-mini shows a steep dropoff in ability when one of the two factors is greater than three digits long. Then there is a further drop off after six digits, which would correspond to a string more than 2 tokens long. After this point, accuracy falls to zero very quickly as the number of digits increases. This could potentially be a result of this pre-tokenization step.

Next Steps

At PleIAs, we’ve been working on some new ways to do efficient tokenization, such as by modifying the BPE algorithm. One of our biggest struggles has been in figuring out how to evaluate our tokenizers. Compression is a good metric if you want to optimize for the amount of information the model sees in each training batch or for minimizing inference latency. Compression has been shown by some to be predictive of model performance, but there is also evidence to the contrary.
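
For reference, one simple way to measure compression is UTF-8 bytes per token on a held-out sample, as in the sketch below. The tokenizer names and file path are placeholders, and this is just one of several possible compression metrics (others normalize by characters or words).

```python
# Sketch of a bytes-per-token compression comparison (higher = better
# compression). Tokenizer names and the sample path are placeholders.
from transformers import AutoTokenizer

def bytes_per_token(tokenizer_name: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return len(text.encode("utf-8")) / n_tokens

with open("held_out_sample.txt", encoding="utf-8") as f:
    sample = f.read()

for name in ["gpt2", "EleutherAI/gpt-neox-20b"]:
    print(name, round(bytes_per_token(name, sample), 2))
```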

Another popular approach treats how ‘meaningful’ tokens are as a measure of tokenizer quality. On the one hand, this seems intuitively correct: there are many studies showing that explicitly making tokenizers more morphologically aware, and thus producing more meaningful tokens, improves performance. I have a pre-print coming out soon showing some possible counterevidence to this assumption.

In the end, the intuition that meaningful tokens are better tokens may not hold up. Making a tokenizer more meaningful for one language likely makes it less meaningful for another language or another domain, and explicitly requiring morphological tokenization may harm the generalizability of your tokenizer. We are excited to work on this problem in the near future.

After all of this, why don’t we just abandon tokenization altogether? One of the most popular approaches to tokenizer-free language modeling, character- and byte-based tokenization, has some major drawbacks. Because sequences are split into individual bytes or characters, sequence lengths become very long, which drives up compute requirements for training and inference. Byte- and character-based tokenization also introduces disparities between languages: because of differences in word length and in how different writing systems are rendered in UTF-8, some languages need as many as 5x more bytes to convey the same amount of information as English. So, on top of long sequence lengths in general, some languages may need sequences several times longer than others. While subword tokenization has been shown to have similar problems, it at least doesn’t come with the same high compute demands.
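
The byte disparity is easy to check directly: counting UTF-8 bytes for roughly equivalent sentences in different scripts shows how much longer byte-level sequences get outside of Latin-script languages. The translations below are just illustrative examples.

```python
# Illustrative UTF-8 byte counts for roughly equivalent sentences.
samples = {
    "English": "The cat sat on the mat.",
    "Hindi": "बिल्ली चटाई पर बैठी।",
    "Japanese": "猫はマットの上に座った。",
}
for lang, text in samples.items():
    print(f"{lang}: {len(text)} chars, {len(text.encode('utf-8'))} bytes")
```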

Even though tokenizer-free language modeling has not been terribly popular, there is some recent work on this. Plus, advancements in State Space Models and long-context language modeling may soon mean long sequences are not so much of a problem.

Tokenization is a very interesting area of work, especially for small models. We believe there’s a lot of room for improvement in this area.

Thanks to the PleIAs Research team for their feedback, especially Pierre-Carl Langlais and Ivan Yamshchikov.