use_fast=False tokenizer broken

#7 · opened by sanderland

The slow (use_fast=False) tokenizer does not round-trip token ids correctly. Potentially something in the pre-tokenization splitting; cf. similar issues reported with Gemma and other models.

from transformers import AutoTokenizer

for use_fast in [False, True]:
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407", use_fast=use_fast)
    s = tok.decode([28804, 123144, 78962, 91615])  # '.wizards.NOT.RELEASE.APPLICATION'
    reencoded = [(i, tok.decode([i])) for i in tok.encode(s, add_special_tokens=False)]
    print(f"{use_fast=} {reencoded=}")
use_fast=False reencoded=[(1046, '.'), (1119, 'w'), (24591, 'izards'), (1046, '.'), (41173, 'NOT'), (1046, '.'), (3609, 'RE'), (52892, 'LEASE'), (1046, '.'), (112617, 'APPL'), (67805, 'ICATION')]
use_fast=True reencoded=[(28804, '.wizards'), (123144, '.NOT'), (78962, '.RELEASE'), (91615, '.APPLICATION')]
Mistral AI_ (org) · edited Jul 19

Hi there! This is because the slow tokenizer uses an old pre-tokenization regex.
Here's one way you can override this:

import regex as re
from transformers import AutoTokenizer

tokenizer_id = 'Xenova/Mistral-Nemo-Instruct-Tokenizer'
text = '.wizards.NOT.RELEASE.APPLICATION'
tokens = [28804, 123144, 78962, 91615]

slow_tokenizer = AutoTokenizer.from_pretrained(tokenizer_id, use_fast=False)
# Override the slow tokenizer's pre-tokenization regex with the one the fast tokenizer uses
slow_tokenizer.pat = re.compile("[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+")

print(slow_tokenizer.encode(text, add_special_tokens=False))  # [28804, 123144, 78962, 91615]
print(slow_tokenizer.decode(tokens))  # .wizards.NOT.RELEASE.APPLICATION
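
For intuition, here is a small side-by-side of how the pre-tokenization split differs. The first pattern is the fast tokenizer's regex from the override above; the second is a plain GPT-2-style pattern, shown only to illustrate a coarser split and not necessarily the exact pattern the slow tokenizer shipped with:

import regex as re

text = '.wizards.NOT.RELEASE.APPLICATION'

# Regex used by the fast tokenizer (same pattern as the override above)
fast_pat = re.compile("[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+")
# GPT-2-style pattern, purely illustrative of an older, coarser split
old_style_pat = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+")

print(fast_pat.findall(text))       # ['.wizards', '.NOT', '.RELEASE', '.APPLICATION']
print(old_style_pat.findall(text))  # ['.', 'wizards', '.', 'NOT', '.', 'RELEASE', '.', 'APPLICATION']

Splitting the punctuation off into separate pre-tokens is what prevents merges like '.NOT' and produces ids such as 1046 ('.') and 41173 ('NOT') in the slow tokenizer's output above.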

To ensure 100% compatibility, though, we encourage you to use the fast tokenizer; this is now the behaviour even if you load the tokenizer with use_fast=False.
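
For completeness, a quick round-trip check with the fast tokenizer, using the base model id from the original report:

from transformers import AutoTokenizer

# Sanity check with the fast tokenizer (the default), model id taken from the report above
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
tokens = [28804, 123144, 78962, 91615]
text = tok.decode(tokens)
print(text)                                                  # .wizards.NOT.RELEASE.APPLICATION
print(tok.encode(text, add_special_tokens=False) == tokens)  # True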
