Average column values

#821
by Stark2008 - opened

Hi,

How is the Average column being calculated? When I calculate it manually, I get a slightly different result than the values in that column...

Open LLM Leaderboard org

We average the full normalized scores - but since we only display a couple of decimals, I expect you could get small differences due to rounding errors.
You can check the contents dataset for the full results.
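
As an illustration of the rounding point, here is a minimal sketch using the full-precision scores quoted later in this thread: recomputing the average from the two-decimal values the table displays drifts slightly from the true average.

from statistics import mean

# Full-precision normalized scores (quoted later in this thread)
full = [73.20221456845613, 28.554603909240623, 8.685800604229607,
        1.4541387024608499, 1.6773437499999992, 30.093823877068555]
# What the table displays: the same scores rounded to two decimals
displayed = [round(s, 2) for s in full]

print(mean(full))       # -> 23.944654235242627
print(mean(displayed))  # -> ~23.9433, a slightly different value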

clefourrier changed discussion status to closed

@clefourrier

I already tried averaging the full normalized scores, but the result is fairly far off.

Taking "vicgalle/Roleplay-Llama-3-8B" for example:

eval = {'eval_name': 'vicgalle_Roleplay-Llama-3-8B_float16', 'Average ⬆️': 24.3287788759506,
        'IFEval': 73.20221456845613, 'BBH': 28.554603909240623, 'MATH Lvl 5': 8.685800604229607,
        'GPQA': 1.4541387024608499, 'MUSR': 1.6773437499999992, 'MMLU-PRO': 30.093823877068555,
        "Maintainer's Highlight": False}

from statistics import mean

mean(
    [
        eval["IFEval"],
        eval["BBH"],
        eval["MATH Lvl 5"],
        eval["GPQA"],
        eval["MUSR"],
        eval["MMLU-PRO"],
    ]
)

I get: 23.944654235242627
The average column reads: 24.33 (24.3287788759506)

What am I missing?

Open LLM Leaderboard org

This is extremely weird indeed, tagging @alozowski for reference - we're investigating asap.
Thanks for reporting!

alozowski changed discussion status to open
Open LLM Leaderboard org

Hi @Stark2008 ,

Thanks a lot for noticing this! We were accidentally calculating the average over all the values, including the raw ones. I've fixed this, so now all the scores are correct.
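
For reference, a minimal sketch of the kind of mistake described above (the raw scores here are invented placeholders, not this model's actual values):

from statistics import mean

# The six normalized scores quoted earlier in this thread
normalized = [73.20221456845613, 28.554603909240623, 8.685800604229607,
              1.4541387024608499, 1.6773437499999992, 30.093823877068555]
# Hypothetical raw scores, invented purely for illustration
raw = [0.73, 0.47, 0.09, 0.26, 0.38, 0.30]

correct = mean(normalized)      # average over the normalized scores only
buggy = mean(normalized + raw)  # bug: raw values folded into the average
print(correct, buggy)           # the buggy average drifts from the correct one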

[Screenshot of the corrected leaderboard scores, 2024-07-04]

alozowski changed discussion status to closed

Happy to help, @alozowski :)

Funny and ironic that this happened right after it was explicitly announced that the average would no longer be calculated from raw output scores 😅
