Response quality much lower than with 2K version

#1
by mikolodz - opened

Hi,

Is it normal that I'm getting really poor responses from the SuperHOT models I tested?
E.g. they stop in the middle of a sentence (Wizard-Vicuna 13B 8K) or add strange formatting like ```makefile (Vicuna 1.3 13B 8K), etc.

Maybe I'm missing something? I use ooba + AutoGPTQ, nothing fancy.

@mikolodz I think it's supposed to be used with ExLlama under "load model" (make sure you have the latest oobabooga update so you can choose the loader from the drop-down menu), and if there's still an issue, maybe lower the token count to 6000 (Aitrepreneur's latest video shows how on YouTube: https://www.youtube.com/watch?v=199h5XxUEOY&ab_channel=Aitrepreneur)

Ah yeah, this model (and all the SuperHOT-8K models) isn't supposed to be loaded with AutoGPTQ; you use ExLlama.
Otherwise it'll probably be buggy lol

@Renegadesoffun's linked video is really noob-friendly; it's how I figured out how to use it too!
Good luck :D

It will work with AutoGPTQ, as long as you set trust_remote_code = True and set the desired context length by editing config.json. I explained this in the Readme.
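For illustration, here's a minimal sketch of that config.json edit; the field name (max_position_embeddings) and the local path are assumptions on my part, so check the repo's README and config.json for the exact key and value:

```python
# Sketch only: bump the context length in the model's config.json before
# loading with AutoGPTQ. The key name and path below are assumptions --
# check the repo's README/config.json for the exact field.
import json

cfg_path = "models/Wizard-Vicuna-13B-SuperHOT-8K-GPTQ/config.json"  # hypothetical local path

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["max_position_embeddings"] = 8192  # desired context length, e.g. 8192 or 4096

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```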

But ExLlama is definitely preferable, as it's faster, uses less VRAM, and the controls are directly integrated into the UI. I'd only suggest using AutoGPTQ if you want to use the model from Python code, or outside text-generation-webui.
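If you do go the Python route, here's a rough sketch of loading with AutoGPTQ; the repo name is just an example and the exact arguments may differ between AutoGPTQ versions:

```python
# Rough sketch of loading a SuperHOT GPTQ model with AutoGPTQ from Python.
# The repo name is an example; argument details may vary by AutoGPTQ version.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ"  # example repo

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    use_safetensors=True,
    trust_remote_code=True,  # needed so the extended-context code in the repo is used
    device="cuda:0",
)

prompt = "USER: Hello, who are you?\nASSISTANT:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```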

Thank you, mate! Indeed, lowering the token count to 6144 seems to help :)

BTW, I'd tried both AutoGPTQ and ExLlama before, but only the above helped. Is it due to the response token count? I'd been using max_new_tokens of 2048 or 1024 and max_seq_len = 8192 with no luck on a 3090. It gave me either short responses or pure nonsense.
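One hedged guess at why 6144 helps, just arithmetic, assuming the prompt and the reply have to share the same max_seq_len window:

```python
# Back-of-the-envelope check: with an 8192-token window and 2048 tokens reserved
# for the reply, only 6144 tokens remain for the prompt/history. If the prompt is
# allowed to grow past that budget, older context gets cut off or generation degrades.
max_seq_len = 8192
max_new_tokens = 2048
prompt_budget = max_seq_len - max_new_tokens
print(prompt_budget)  # 6144
```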
