Align tokenizer with mistral-common

#45
by Rocketknight1 - opened

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

# Reference tokenizer from mistral-common, and the HF tokenizer from this PR branch.
mistral_tok = MistralTokenizer.v3()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Codestral-22B-v0.1", revision="pr/45")

# Render the conversation with the HF chat template, as text and as token ids.
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

# Encode the same conversation with mistral-common.
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Both comparisons should print True. mistral-common's debug text keeps the
# SentencePiece markers "▁" (space) and "<0x0A>" (newline), so normalize
# them before comparing the rendered strings.
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))

Note that we're just using a standard chat template for v3, so we don't support fill-in-the-middle (FIM) yet. However, this PR should still align all other aspects of tokenization and make standard completion requests possible.
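Since completion requests are mentioned, a similar spot check can be run on raw text encoding. A minimal sketch, assuming the script above has already run; note that reaching the underlying tokenizer through instruct_tokenizer.tokenizer relies on mistral-common internals rather than a documented API, and the prompt string is just an arbitrary example:

# Hypothetical spot check for plain completion encoding.
prompt = "def fibonacci(n):"

# The HF tokenizer prepends BOS by default for this model family.
hf_ids = hf_tokenizer.encode(prompt)

# instruct_tokenizer.tokenizer is an internal mistral-common attribute
# (an assumption here, not a public API); bos/eos are passed explicitly.
mc_ids = mistral_tok.instruct_tokenizer.tokenizer.encode(prompt, bos=True, eos=False)

print(hf_ids == mc_ids)  # should print True if raw encodings are aligned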
