croissantllm/CroissantLLMBase · Tokenizer model in SentencePiece format

25 days ago

Hello,

I would like to use the CroissantLLM tokenizer, which is described as a SentencePiece tokenizer. I think the tokenizer is an important part of the Croissant project. Unfortunately, the repo contains no file in the SentencePiece format, so we cannot use the tokenizer outside of the Hugging Face libraries.

In Llama 2, the SentencePiece version of the tokenizer is saved as tokenizer.model.
There is another open discussion about the availability of tokenizer.model in CroissantLLM: https://huggingface.co/croissantllm/CroissantLLMBase/discussions/4

Would it be possible to release the tokenizer in the SentencePiece format?

manu

CroissantLLM org 24 days ago

Hello!
I think we only ever kept the fast version of the tokenizer (use_fast=True) and never had to rely on the original SentencePiece tokenizer.model file...

This is similar to what is done in https://huggingface.co/meta-llama/Meta-Llama-3-8B/.

Sadly, I don't have any more files than you do...

https://github.com/huggingface/transformers/issues/21289
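
For anyone who wants to check which tokenizer files a local copy of a model repo actually contains, here is a minimal sketch using only the Python standard library. It assumes the repo has been downloaded to a local directory and that tokenizer.json follows the usual layout written by the fast tokenizer (a top-level "model" object with "type" and "vocab" entries). The function name describe_tokenizer_files is hypothetical, not part of any library.

```python
import json
from pathlib import Path


def describe_tokenizer_files(repo_dir):
    """Report which tokenizer serializations exist in a local model directory.

    Hypothetical helper: assumes the conventional file names
    tokenizer.model (SentencePiece) and tokenizer.json (fast tokenizer).
    """
    repo_dir = Path(repo_dir)
    report = {
        # SentencePiece model file, as shipped with Llama 2
        "sentencepiece_model": (repo_dir / "tokenizer.model").is_file(),
        # JSON file saved by the fast tokenizer
        "fast_tokenizer_json": (repo_dir / "tokenizer.json").is_file(),
        "model_type": None,
        "vocab_size": None,
    }
    if report["fast_tokenizer_json"]:
        with open(repo_dir / "tokenizer.json", encoding="utf-8") as f:
            data = json.load(f)
        # Assumed layout: {"model": {"type": ..., "vocab": ...}, ...}
        model = data.get("model", {})
        report["model_type"] = model.get("type")
        vocab = model.get("vocab")
        if vocab is not None:
            report["vocab_size"] = len(vocab)
    return report
```

If "sentencepiece_model" comes back False and "fast_tokenizer_json" comes back True, the directory only contains the fast tokenizer, which matches the situation described above.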

manu changed discussion status to closed 24 days ago