Slow tokenizer problem.

#22 opened by bradhutchings

I'm trying to make this work with PrivateGPT:
https://github.com/zylon-ai/private-gpt

Their default local LLM is Mistral-7B-Instruct-v0.2.

With the new tokenizer, I'm getting this error.

Downloading tokenizer mistralai/Mistral-7B-Instruct-v0.3
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Traceback (most recent call last):
File "/home/linux/privateGPT/scripts/setup", line 43, in
AutoTokenizer.from_pretrained(
File "/home/linux/.cache/pypoetry/virtualenvs/private-gpt-AnyLGiqx-py3.11/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linux/.cache/pypoetry/virtualenvs/private-gpt-AnyLGiqx-py3.11/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
return cls._from_pretrained(
^^^^^^^^^^^^^^^^^^^^^
File "/home/linux/.cache/pypoetry/virtualenvs/private-gpt-AnyLGiqx-py3.11/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linux/.cache/pypoetry/virtualenvs/private-gpt-AnyLGiqx-py3.11/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 133, in init
super().init(
File "/home/linux/.cache/pypoetry/virtualenvs/private-gpt-AnyLGiqx-py3.11/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 102, in init
raise ValueError(
ValueError: Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure you have sentencepiece installed.

I'm not very knowledgeable about how these things work, but am wondering if there's something that can be done about it.

Thanks!
-Brad

try pip install sentencepiece
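
If that alone doesn't do it, installing protobuf as well might be worth a try, since converting from the slow (sentencepiece-based) tokenizer can also depend on it; that part is an assumption on my side, not something I've verified for this exact model. A quick way to check whether the fast tokenizer loads afterwards:

# Quick check, run from the same virtualenv that PrivateGPT's setup script uses
# (the poetry env in the traceback above), so the packages land in the right place.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
print(tok.is_fast)  # True means the fast (Rust-backed) tokenizer was built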

Yep, but at the same time, for a fast tokenizer this should not be necessary; we'll update this.
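
For what it's worth, the fast tokenizer on its own can be built straight from the repo's tokenizer.json without sentencepiece, which is why the conversion shouldn't be required here. A small illustration with the tokenizers library (just to show the point, not a replacement for what PrivateGPT's setup script expects; it also assumes the same Hub access/credentials that the download in the traceback already used):

# Downloads tokenizer.json from the Hub and builds the Rust tokenizer directly;
# sentencepiece is not involved at any point.
from tokenizers import Tokenizer

tk = Tokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
print(tk.encode("Hello world").tokens)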

I immediately tried pip install sentencepiece when I first encountered that, and that didn't help.

After Googling the error message, I found suggestions for other things to install. They didn't solve the problem either.

I did a deep dive (as deep as I could anyway) into add_prefix_space and came up empty there too.

I am encountering the exact same problem when I try to use the Mistral-7B-Instruct-v0.3 model in a Kaggle competition.
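
In case it's useful to anyone else landing here, one thing worth ruling out (an assumption on my part, not a confirmed fix) is that pip installed the packages into a different environment than the one actually running the script, e.g. the poetry virtualenv shown in the traceback above. A quick sanity check from inside that environment:

# Confirms that sentencepiece and protobuf are visible to the interpreter
# that actually runs the failing script, and shows the transformers version.
import importlib.util
import transformers

def installed(name):
    # find_spec raises ModuleNotFoundError for dotted names whose parent package is missing
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

for name in ("sentencepiece", "google.protobuf"):
    print(name, "->", installed(name))
print("transformers", transformers.__version__)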
