Warning message: "GENERATION QUALITY WILL BE DEGRADED! CONSIDER REGENERATING THE MODEL"

#1
by yumemio - opened

@grapevine-AI Hello, and awesome work releasing the GGUF model! I've been struggling for a while to load the original model (here's the issue page), so it's great to see that you made it.

By the way, I want to let you know that I got the warning message below when I loaded the model with llama-cpp-python, which I suspect was emitted from the llama.cpp backend:

> llm = Llama(
>     model_path="./stockmark-100b-instruct-v0.1-Q4_K_M/stockmark-100b-instruct-v0.1-Q4_K_M-00001-of-00007.gguf",
>     n_gpu_layers=48,
>     use_mmap=False,
>     seed=42,
>     n_ctx=2048,
>     n_threads=11,
> )
> ...
> llm_load_vocab: missing pre-tokenizer type, using: 'default'
> llm_load_vocab:
> llm_load_vocab: ************************************
> llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
> llm_load_vocab: CONSIDER REGENERATING THE MODEL
> llm_load_vocab: ************************************
> llm_load_vocab:
> llm_load_vocab: special tokens cache size = 384
> llm_load_vocab: token to piece cache size = 0.4788 MB
> ...

According to this HF post, llama.cpp updated its BPE pre-tokenization logic about a month ago (PR), which requires models to be re-converted with the latest version of convert-hf-to-gguf.py. To quote ggerganov:

> Old GGUF models using BPE tokenizers, generated before this change, will fallback to the "default" pre-tokenization, which in almost all cases is wrong.

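For reference, the pre-tokenizer type that llama.cpp looks for is stored as a plain string value under the `tokenizer.ggml.pre` key in the GGUF metadata header; GGUFs converted before that PR simply lack the key, which is what triggers the "missing pre-tokenizer type" fallback. Below is a minimal, standard-library-only sketch of how such a key/value pair is laid out per the GGUF v3 spec — it builds and parses a toy header with a single metadata entry, it is not a full GGUF parser, and the `llama-bpe` value is just an example:

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_STRING = 8  # value-type tag for strings in the GGUF spec

def build_minimal_gguf(pre_type: str) -> bytes:
    """Build a toy GGUF v3 header with a single metadata key/value pair."""
    key = b"tokenizer.ggml.pre"
    val = pre_type.encode()
    out = GGUF_MAGIC
    out += struct.pack("<I", 3)               # GGUF version
    out += struct.pack("<Q", 0)               # tensor count
    out += struct.pack("<Q", 1)               # metadata KV count
    out += struct.pack("<Q", len(key)) + key  # key: length-prefixed string
    out += struct.pack("<I", GGUF_STRING)     # value type tag
    out += struct.pack("<Q", len(val)) + val  # value: length-prefixed string
    return out

def read_pre_tokenizer(data: bytes) -> str:
    """Parse the single string KV back out of the toy header."""
    assert data[:4] == GGUF_MAGIC
    off = 4 + 4 + 8 + 8  # skip magic, version, tensor count, KV count
    (klen,) = struct.unpack_from("<Q", data, off); off += 8
    key = data[off:off + klen].decode(); off += klen
    (vtype,) = struct.unpack_from("<I", data, off); off += 4
    assert key == "tokenizer.ggml.pre" and vtype == GGUF_STRING
    (vlen,) = struct.unpack_from("<Q", data, off); off += 8
    return data[off:off + vlen].decode()

print(read_pre_tokenizer(build_minimal_gguf("llama-bpe")))  # -> llama-bpe
```

In a real file this key sits among many other metadata entries; the `gguf` Python package's `GGUFReader` can be used to inspect actual model files rather than hand-parsing them like this.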
Contrary to what the message said, the model's output seems coherent:

Q. 仕事の熱意を取り戻すためのアイデアを5つ挙げてください。
A. 熱意を取り戻すためには、まずその原因を特定することが重要です。以下に、仕事に熱が入らない原因とそれに対する対処法の一例を示します。
1.プライベートで問題がある...仕事以外の人生が充実していない可能性があります。まずは、仕事とプライベートのバランスを見直しましょう。
2.将来に不安がある...キャリアプランが明確でない、会社の将来性に不安があるなどの場合は、それらを明確にして対応する必要があります。
3.仕事が合っていない...自分のスキルや興味に合わない仕事をしている可能性があります。部署移動を希望してみるのも良いでしょう。
4.仕事に興味が持てない...今の仕事にやりがいを感じられない可能性があります。その場合、まずは上司に相談しましょう。
5.人間関係に問題がある...職場の人間関係にストレスを感じている場合、上司に相談するなどしましょう。

Q. Please list five ideas for regaining enthusiasm for work.
A. To regain enthusiasm, it is important to first identify the cause. Below are some examples of reasons for losing enthusiasm for work and ways to address them.
1. Problems in your private life... Your life outside of work may be unfulfilling. Start by reviewing your work-life balance.
2. Anxiety about the future... If your career plan is unclear, or you are worried about the company's prospects, you need to clarify those concerns and address them.
3. The work doesn't suit you... You may be doing work that doesn't match your skills or interests. Requesting a transfer to another department is one option.
4. You can't get interested in the work... You may not find your current work rewarding. In that case, start by talking to your manager.
5. Problems with workplace relationships... If relationships at work are causing you stress, consult your manager or take similar steps.

I'm not an expert in the GGUF quantization pipeline, so I'm not sure whether the performance is affected or not. Do you think the output quality is indeed impacted, and if so, would you mind addressing the issue by updating the model with the latest conversion script?

Thanks in advance!

Thanks for reporting!
This is very important information, so I spent a lot of time researching it. I'm sorry to have kept you waiting.
Unfortunately, I've concluded that this problem is hard to fix.
I tried to re-convert the model, but that did not solve the problem, because llama.cpp does not support Stockmark's pre-tokenization.

Fortunately, the model's output does not seem to be bad. Please continue to use the current model as it is.

Hi @grapevine-AI 👋, and thanks a lot for looking into this!

Sorry that I can't offer any advice on fixing the pre-tokenization config myself. I did find a Zenn post, though, which suggests that a model-specific implementation may be necessary to address the issue:

> これを解決するには実際にそのモデル固有の前処理を実装する必要があるため、完全解決は困難と思われる。
>
> To resolve this [the error message], it would be necessary to implement pre-processing specific to that model, so a complete solution seems difficult.
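For context on why a model-specific implementation is needed: convert-hf-to-gguf.py identifies the pre-tokenizer by encoding a fixed probe text with the model's tokenizer and hashing the result (the `chkhsh` value), then mapping known hashes to named pre-tokenizer configs implemented inside llama.cpp. A tokenizer the script has never seen produces an unrecognized hash and falls through. Here is a rough sketch of that idea only — the hash-table entry is made up, the probe string differs from the real one, and a trivial dummy tokenizer stands in for the real Stockmark tokenizer:

```python
import hashlib

# Made-up fingerprint table for illustration; the real table in
# convert-hf-to-gguf.py is built from hashes of actual model tokenizers.
KNOWN_CHKHSH = {
    "0000000000000000000000000000000000000000000000000000000000000000": "llama-bpe",
}

def detect_pre_tokenizer(tokenize, probe: str = "Hello, world! 123") -> str:
    """Fingerprint a tokenizer by hashing its output on a fixed probe string."""
    ids = tokenize(probe)
    chkhsh = hashlib.sha256(str(ids).encode()).hexdigest()
    # An unknown fingerprint means no matching pre-tokenizer implementation
    # exists in llama.cpp, so conversion can only fall back to "default"
    # (which is what produces the warning above).
    return KNOWN_CHKHSH.get(chkhsh, "default")

# Dummy tokenizer: maps each whitespace-separated word to its length as a fake id.
dummy_tokenize = lambda text: [len(w) for w in text.split()]
print(detect_pre_tokenizer(dummy_tokenize))  # -> default
```

So even re-running the latest conversion script can't help until someone adds Stockmark's hash and a matching pre-tokenization regex to llama.cpp itself.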

On the other hand, as you mentioned, the quality of the generated text looks good despite the scary warning. FYI, we're collecting the output of various Japanese-language models in this Google Spreadsheet and, as far as I can tell, the Stockmark model produces coherent Japanese text (although the benchmark score is a bit lower than expected😅).

Once again, I appreciate your effort very much!

Closing this issue as it won't be fixed.

yumemio changed discussion status to closed
