Finetuned Falcon-40B is not working with pipeline (text-generation)

#103
by chelouche9 - opened

Hi,

First of all, thanks for the great work! I really love the Falcon models, and for my task it performs better than Llama 2 70B!

I have finetuned Falcon-40B (not the Instruct version) on my task, using QLoRA and PEFT.

I am now in the process of deploying it with AWS SageMaker. There are several problems, but I would like to focus on one you might be able to help me with.

When I load the model straight from the Hub, create a pipeline, and run inference, I get a response to a query in 150 seconds. It works great!

model_name = "tiiuae/falcon-40b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device_map="auto")
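For reference, a minimal way to invoke this pipeline looks something like the following (the prompt and generation settings are just placeholders):

# Placeholder prompt and generation settings, only to illustrate the call
prompt = "Write a short summary of the following text: ..."
output = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])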

The problem is when I try to use my finetuned model along with the pipeline.

I tried two options:

  1. Passing the PEFT model to the pipeline
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

PEFT_MODEL = 'models/falcon40_ft_sft'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Read the adapter config to find the base model it was trained on
config = PeftConfig.from_pretrained(PEFT_MODEL)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")
tokenizer.pad_token = tokenizer.eos_token

# Attach the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, PEFT_MODEL)
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device_map="auto")

I get an error: The model 'PeftModel' is not supported for text-generation. Supported models are [...]
And the same query now takes about twice as long.
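As a point of comparison, here is a minimal sketch of bypassing the pipeline and calling generate() directly on the PeftModel (the prompt and generation settings are placeholders, reusing the model and tokenizer loaded above):

# Placeholder prompt; the PeftModel exposes generate() from the underlying base model
inputs = tokenizer("Write a short summary of the following text: ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))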

  2. Merging the PEFT model
    Then I tried to merge the PEFT model using:
# Merge the LoRA weights into the base model and save the result
merged_model = model.merge_and_unload()
merged_model.save_pretrained('models/merged_ft_sft_falcon40')
tokenizer.save_pretrained('models/merged_ft_sft_falcon40')

I copied all of the configuration files, including the modeling file that defines the RWForCausalLM class, as well as the config.json with the correct auto_map entries.
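For reference, a sketch of how the merge step could look if the base model is first reloaded in float16 rather than 4-bit (an assumption on my part, so that the LoRA deltas are merged into full-precision weights); the paths match the ones used above:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the base model in half precision (not 4-bit) so that
# merge_and_unload() folds the LoRA weights into regular fp16 tensors.
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)
merged_model = PeftModel.from_pretrained(base_model, PEFT_MODEL).merge_and_unload()
merged_model.save_pretrained('models/merged_ft_sft_falcon40')
tokenizer.save_pretrained('models/merged_ft_sft_falcon40')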

When I run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

MERGED_MODEL = 'models/merged_ft_sft_falcon40'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the merged checkpoint with the same 4-bit settings
model = AutoModelForCausalLM.from_pretrained(
    MERGED_MODEL,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MERGED_MODEL)
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device_map="auto")

I get an error: The model 'RWForCausalLM' is not supported for text-generation. Supported models are [...]
And the same query again takes about twice as long.
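One variant that might be worth trying (a sketch, not something I have verified on this checkpoint): pass the merged model directory to the pipeline directly with trust_remote_code=True, so the pipeline resolves the custom RWForCausalLM class itself:

generator = pipeline(
    task="text-generation",
    model=MERGED_MODEL,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"quantization_config": bnb_config},
)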

My question is: how can I make my finetuned model benefit from all of the pipeline features? Why doesn't it behave the same as the Hub model, given that I have all of the same files (my assumption is that only the weight files have changed slightly)?
