Bug: Generate method fails with a dtype-mismatch RuntimeError for falcon-7b and falcon-40b in int8 mode.

#22
by avacaondata - opened

System Info

  • transformers version: 4.30.0.dev0
  • Platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker @younes

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Import the modules, load the model and tokenizer:
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer

model_path = "tiiuae/falcon-40b"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, trust_remote_code=True, load_in_8bit=True, device_map="auto"
)
model.eval()
model.config.eos_token_id = 0
model.config.forced_eos_token_id = 0
model.config.pad_token_id = 0
  2. Tokenize a text:
text = "Hola qué tal estás Íñigo? ¿Qué vas a hacer hoy?"
inpts = tokenizer(text, return_tensors="pt").to("cuda")
  3. Try to generate text:
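# drop token_type_ids, which the remote Falcon forward() does not accept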
out = model.generate(**{k: v for k, v in inpts.items() if "token_type" not in k})

You will receive the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[13], line 1
----> 1 out = model.generate(**{k: v for k, v in inpts.items() if "token_type" not in k})

File ~/miniconda3/envs/int4/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/miniconda3/envs/int4/lib/python3.9/site-packages/transformers/generation/utils.py:1518, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1512         raise ValueError(
   1513             "num_return_sequences has to be 1 when doing greedy search, "
   1514             f"but is {generation_config.num_return_sequences}."
   1515         )
   1517     # 11. run greedy search
-> 1518     return self.greedy_search(
   1519         input_ids,
   1520         logits_processor=logits_processor,
   1521         stopping_criteria=stopping_criteria,
   1522         pad_token_id=generation_config.pad_token_id,
   1523         eos_token_id=generation_config.eos_token_id,
   1524         output_scores=generation_config.output_scores,
   1525         return_dict_in_generate=generation_config.return_dict_in_generate,
...
    291 )
    293 x = attn_output.view(batch_size, self.num_heads, q_length, self.head_dim)
    294 x = x.permute(0, 2, 1, 3)

RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
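As a possible workaround, pinning the non-quantized modules to a single dtype at load time may keep mixed float/half tensors from reaching the attention. This is only a sketch, assuming the mismatch comes from fp32 buffers mixing with fp16 weights under load_in_8bit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Assumption: torch_dtype=torch.float16 keeps the non-quantized modules in half
# precision, so query, key, and value share one dtype inside the attention.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)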

Expected behavior

The falcon-40b model should be able to generate in int8 mode as well; otherwise inference is not feasible even on an 80GB A100. Other models have no problem with 8-bit inference.
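For comparison, a minimal 8-bit generation run with a different architecture completes without the dtype error. The model choice below is illustrative, not the exact one tested:

from transformers import AutoModelForCausalLM, AutoTokenizer

ref_path = "bigscience/bloom-560m"  # illustrative; any non-Falcon causal LM
tok = AutoTokenizer.from_pretrained(ref_path)
ref_model = AutoModelForCausalLM.from_pretrained(ref_path, load_in_8bit=True, device_map="auto")

inputs = tok("Hello, how are you?", return_tensors="pt").to("cuda")
out = ref_model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))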
