Request: YeungNLP/firefly-gemma-7b

#8 opened by Cran-May

Model name: firefly-gemma-7b
Model link: https://huggingface.co/YeungNLP/firefly-gemma-7b
Brief description:
The chat template of our chat models is similar to the official gemma-7b-it:

<bos><start_of_turn>user
hello, who are you?<end_of_turn>
<start_of_turn>model
I am an AI program developed by Firefly<eos>

An image/direct image link to represent the model (square shaped):
firefly_logo.png

[Optional] Additional quants (if you want any):
IQ2 series, Q3_K_M (is its performance better than IQ3_M? idk.)

Performance

We evaluate our models on the Open LLM Leaderboard, where they achieve good performance.

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| firefly-gemma-7b | 62.93 | 62.12 | 79.77 | 61.57 | 49.41 | 75.45 | 49.28 |
| zephyr-7b-gemma-v0.1 | 62.41 | 58.45 | 83.48 | 60.68 | 52.07 | 74.19 | 45.56 |
| firefly-qwen1.5-en-7b-dpo-v0.1 | 62.36 | 54.35 | 76.04 | 61.21 | 56.4 | 72.06 | 54.13 |
| zephyr-7b-beta | 61.95 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 29.04 |
| firefly-qwen1.5-en-7b | 61.44 | 53.41 | 75.51 | 61.67 | 51.96 | 70.72 | 55.34 |
| vicuna-13b-v1.5 | 55.41 | 57.08 | 81.24 | 56.67 | 51.51 | 74.66 | 11.3 |
| Xwin-LM-13B-V0.1 | 55.29 | 62.54 | 82.8 | 56.53 | 45.96 | 74.27 | 9.63 |
| Qwen1.5-7B-Chat | 55.15 | 55.89 | 78.56 | 61.65 | 53.54 | 67.72 | 13.57 |
| gemma-7b-it | 53.56 | 51.45 | 71.96 | 53.52 | 47.29 | 67.96 | 29.19 |

Usage

The chat template of our chat models is similar to the official gemma-7b-it:

<bos><start_of_turn>user
hello, who are you?<end_of_turn>
<start_of_turn>model
I am an AI program developed by Firefly<eos>

You can use the script in Firefly to run inference.

You can also use the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-gemma-7b"
# Load the model in half precision and let device_map='auto' place it on available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. "
text = f"""
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
""".strip()
model_inputs = tokenizer([text], return_tensors="pt").to('cuda')

# Sample up to 1500 new tokens, stopping when the <eos> token is generated
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1500,
    top_p = 0.9,
    temperature = 0.35,
    repetition_penalty = 1.0,
    eos_token_id=tokenizer.encode('<eos>', add_special_tokens=False)
)
# Strip the prompt tokens so only the newly generated response is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
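
If the repository's tokenizer ships a Gemma-style chat_template in its tokenizer_config.json (an assumption, not verified here), the same prompt can also be built with tokenizer.apply_chat_template instead of writing the special tokens by hand. A minimal sketch, reusing tokenizer and prompt from the snippet above:

# Sketch: build the Gemma-style prompt via the tokenizer's chat template.
# If no chat_template is defined, use the manual formatting shown above instead.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string rather than token ids
    add_generation_prompt=True,  # append the model turn header so generation starts the reply
)
model_inputs = tokenizer([text], return_tensors="pt").to('cuda')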

IQ2 series, Q3_K_M (is its performance better than IQ3_M? idk.)


You can refer to these for a general overview of the current sizes.

@Cran-May Can you add what the intended use cases/applications for this particular model would be?

(I am thinking general simple/summary assistant work?)

I ask just because it doesn't seem like what I usually focus on: unaligned and unsafe models with few restrictions.

Cropped and upscaled card image: (square, 1:1)

firefly.jpg

    quantization_options = [
        "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M", "Q3_K_M",
        "Q4_K_M", "Q4_K_S", "IQ4_XS", "Q5_K_M", "Q5_K_S",
        "Q6_K", "Q8_0", "IQ3_M", "IQ3_S", "IQ3_XXS"
    ]
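
For reference, a rough sketch of how a list like this can be fed to llama.cpp's quantize tool together with an imatrix file; the binary path and the FP16/imatrix file names below are assumptions, not the exact script used for these quants:

import subprocess

# Assumed file names; adjust to your llama.cpp build and working directory.
fp16_gguf = "firefly-gemma-7b-FP16.gguf"
imatrix_file = "imatrix.dat"

for quant in quantization_options:
    out_file = f"firefly-gemma-7b-{quant}-imatrix.gguf"
    # quantize takes an optional --imatrix file, then the input GGUF,
    # the output GGUF, and the quantization type (e.g. "IQ2_XS").
    subprocess.run(
        ["./quantize", "--imatrix", imatrix_file, fp16_gguf, out_file, quant],
        check=True,
    )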

IQ2's will take a good while longer.

@Cran-May - Heya! Will wait for your inputs.

I will need assistance to continue in this case as I am unable to load the initial FP16 GGUF created from your repo:

llama_model_load: error loading model: create_tensor: tensor 'blk.0.attn_q.weight' has wrong shape; expected  3072,  3072, got  3072,  4096,     1,     1
llama_load_model_from_file: failed to load model
main: error: unable to load model

Can't say I ever ran a gemma model.

Lewdiculous changed discussion title from Request: YeungNLP/firefly-gemma-7b to Request: YeungNLP/firefly-gemma-7b (help-needed)

@Virt-io Just in case you haven't seen this: do you know if I can sort this out on my own?

I can't seem to load the FP16 GGUF converted from the linked repo using llama.cpp/convert.py. Maybe you know something about Gemma and llama.cpp that I am not aware of...

@Lewdiculous

Use convert-hf-to-gguf.py instead of convert.py

I am unsure why convert.py doesn't work as I have not been paying attention to gemma models.

Warning: convert-hf-to-gguf.py gave me an FP32 GGUF, so you will need to requantize it to FP16 before running imatrix. In addition, convert-hf-to-gguf.py needs more RAM, or a page file.
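
A minimal sketch of that workflow, driving the llama.cpp tools from Python; the file names, calibration text, and binary locations are assumptions, not the exact commands used here:

import subprocess

repo_dir = "firefly-gemma-7b"            # local clone of the HF repo (assumed path)
fp32_gguf = "firefly-gemma-7b-FP32.gguf"
fp16_gguf = "firefly-gemma-7b-FP16.gguf"

# 1. Convert the HF checkpoint; as noted above, this may produce an FP32 GGUF.
subprocess.run(
    ["python", "convert-hf-to-gguf.py", repo_dir, "--outfile", fp32_gguf],
    check=True,
)

# 2. Requantize the FP32 GGUF down to FP16 before running imatrix.
subprocess.run(["./quantize", fp32_gguf, fp16_gguf, "F16"], check=True)

# 3. Generate the importance matrix from the FP16 model and a calibration text file.
subprocess.run(
    ["./imatrix", "-m", fp16_gguf, "-f", "calibration.txt", "-o", "imatrix.dat"],
    check=True,
)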

convert-hf-to-gguf.py gave me an FP32 GGUF

Love it xD

I'll try, ty! I've never touched a Gemma model :')

Maybe I should have just been using convert-hf-to-gguf.py all along.

I consider 30/32GB RAM usage optimal. LOL. But yea, pagefile is there.

I think this is what they call a Skill Issue. Thank you, Virt.

@Cran-May - Things seem fine, quanting now. I'd really like to see your tests for the quants, as I'm curious about these Gemma IQ2-imatrix quants.

Lewdiculous changed discussion title from Request: YeungNLP/firefly-gemma-7b (help-needed) to Request: YeungNLP/firefly-gemma-7b
This comment has been hidden
Lewdiculous changed discussion status to closed
