Are TheBloke's models usually slow on Kaggle?

#4
by fahim9778 - opened

Hello, I have been testing this GPTQ model since this morning, and it is quite slow compared to the original model. For example, running just this cell:

    %%time
    # Tokenize one prompt, move it to the GPU, and time the full generation
    input_text = "Which entrepreneur wants to go to Mars?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.cuda()
    outputs = model.generate(input_ids, max_new_tokens=500)
    print(tokenizer.decode(outputs[0]))

It took almost 6.19 minutes to answer, which is roughly 1.3 tokens/second at best (assuming the full 500 new tokens were generated). Is this normal, or am I missing something? I loaded the model with the following:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",          # let accelerate spread layers over available devices
        trust_remote_code=True,
        torch_dtype=torch.float16,
        cache_dir="./cache",
        revision="main",
    )
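
In case it helps with diagnosing: a common cause of very slow generation with device_map="auto" is that some layers get silently offloaded to CPU (or disk) when GPU memory runs short. A quick check, assuming the model was loaded as above:

    # Inspect where accelerate actually placed each module; any "cpu" or
    # "disk" entries mean those layers run off-GPU and generation will crawl.
    print(model.hf_device_map)

    # Quick overall sanity check: where do the weights live?
    print(next(model.parameters()).device)  # expect cuda:0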

Can anyone please help?
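
For context: TheBloke's GPTQ repos generally recommend loading the quantized weights through the auto-gptq package rather than plain AutoModelForCausalLM, since without the GPTQ CUDA kernels generation can be very slow. Below is a minimal sketch of that route; model_basename is a placeholder that should match the weight file name in the repo:

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    # Load the already-quantized checkpoint with auto-gptq's CUDA kernels
    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        model_basename=model_basename,  # placeholder: the .safetensors file name without extension
        use_safetensors=True,
        device="cuda:0",
        use_triton=False,               # stick to the CUDA kernel path unless triton is set up
        trust_remote_code=True,
    )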
