🎉 Congrats 🎉
This is a huge jump in performance!
What was the difference to the other Calme models, as they all use the same dataset?
Thank you! Honestly, I didn't expect it to be this high. I ran 7 experiments and started releasing them publicly for evaluation one after another, so 2.3 and 2.4 were still under evaluation.
I had a quick look now to see what differentiates this one from the other two that we know (2.1 and 2.2):
- I used MaziyarPanahi/calme-2.1-rys-78b as the base for fine-tuning (since it had higher MMLU-Pro and GPQA scores on average).
- This run went on for a long time! Compared to the others, which usually run for fewer than 1000 steps, this one was trained for more than 3300 steps.
- The dataset is a mix of what I usually use, like TruthfulQA and Orca, but for the first time I included my own synthetically generated dataset, which I hoped would help with chain-of-thought and multi-step reasoning. (Something I did for LegalKit CoT; I thought I could build DPO datasets the same way.)
- At the same time, there is another DPO dataset that tries to improve MMLU by introducing diverse multi-task understanding, and I used CLAIR to finalize the DPO (https://github.com/ContextualAI/CLAIR_and_APO).
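For anyone unfamiliar with the format behind these datasets: a DPO preference dataset is just a set of prompts, each paired with a preferred ("chosen") and a dispreferred ("rejected") response, and CLAIR's idea is to make the chosen side a minimal revision of the rejected one. A minimal sketch of one such record (the field names follow the common TRL convention, and the example content is invented, not from my actual datasets):

```python
# Sketch of a single DPO preference record in the common
# prompt/chosen/rejected layout used by libraries like TRL.
# The example content below is purely illustrative.

def make_dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    """Package one preference pair for DPO-style training."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# In a CLAIR-style pipeline, `chosen` is a minimal revision of
# `rejected` (e.g., produced by a stronger model), so the two stay
# close in style and the preference signal targets correctness
# rather than surface form.
record = make_dpo_record(
    prompt="What is 17 * 24?",
    rejected="17 * 24 = 398.",
    chosen="17 * 24 = 408, since 17 * 20 + 17 * 4 = 340 + 68.",
)

print(record["chosen"])
```

The point of keeping chosen/rejected pairs minimally different is exactly why I reach for CLAIR: it reduces the chance the model learns stylistic quirks instead of the actual reasoning improvement.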
This is my quick assessment. I will stop submitting new experiments and instead dig into the details to make sure everything here is above board. I'm happy to see the CoT and multi-step reasoning DPO datasets working, but I will go through the datasets to verify the model actually improved and didn't just learn how to answer certain questions.
Will come back with more once I find something interesting.
UPDATE: pinning the post so people can follow any updates