🎉 Congrats 🎉
This is a huge jump in performance!
What was the difference to the other Calme models, as they all use the same dataset?
Thank you! Honestly, I didn't expect it to be this high. I ran 7 experiments and started releasing them publicly for evaluation one after another, so 2.3 and 2.4 were still under evaluation.
I had a quick look now to see what differentiates this one from the other two that we know (2.1 and 2.2):
- I used MaziyarPanahi/calme-2.1-rys-78b as the base for fine-tuning (since it had higher MMLU-Pro and GPQA scores on average).
- This run went on for a long time! Compared to the others, which usually run for fewer than 1000 steps, this one was trained for more than 3300 steps.
- The dataset is a mix of what I usually use, like TruthfulQA and Orca, but for the first time I included my own synthetically generated dataset, which I hoped would help with chain-of-thought and multi-step reasoning. (Something I did for LegalKit CoT; I thought I could build DPO datasets the same way.)
- At the same time, there is another DPO dataset that tries to improve MMLU by introducing diverse multi-task understanding, and I used CLAIR to finalize the DPO (https://github.com/ContextualAI/CLAIR_and_APO).
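For anyone unfamiliar with the format behind these datasets: a DPO preference dataset is just a set of prompts, each paired with a preferred ("chosen") and a dispreferred ("rejected") response, and CLAIR's idea is to make the chosen side a minimal revision of the rejected one. A minimal sketch of one such record (the field names follow the common TRL convention, and the example content is invented, not from my actual datasets):

```python
# Sketch of a single DPO preference record in the common
# prompt/chosen/rejected layout used by libraries like TRL.
# The example content below is purely illustrative.

def make_dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    """Package one preference pair for DPO-style training."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# In a CLAIR-style pipeline, `chosen` is a minimal revision of
# `rejected` (e.g., produced by a stronger model), so the two stay
# close in style and the preference signal targets correctness
# rather than surface form.
record = make_dpo_record(
    prompt="What is 17 * 24?",
    rejected="17 * 24 = 398.",
    chosen="17 * 24 = 408, since 17 * 20 + 17 * 4 = 340 + 68.",
)

print(record["chosen"])
```

The point of keeping chosen/rejected pairs minimally different is exactly why I reach for CLAIR: it reduces the chance the model learns stylistic quirks instead of the actual reasoning improvement.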
This is my quick assessment. I will stop submitting new experiments and instead dig into the details to make sure everything here is above board. I'm happy to see the CoT and multi-step reasoning DPO datasets working, but I will go through the datasets to verify the model actually improved and didn't just learn how to answer certain questions.
Will come back with more once I find something interesting.
UPDATE: pinning the post so people can follow any updates