Question about the overall score on the BBH and GPQA datasets

#842
by Amigozyq - opened

Hi! Thank you very much for your helpful and outstanding work!

The BBH and GPQA datasets both have multiple subsets, but the open-llm-leaderboard displays a single overall score for each model on BBH and on GPQA. How are these overall scores obtained? Are they simply the average of the scores the model achieves on each subset? Or are they the score computed over the concatenation of all subsets?
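To make the two options concrete, here is a minimal sketch with made-up numbers. The subset names, counts, and function names are hypothetical and are not taken from the leaderboard code; the sketch only shows how the two aggregations can differ, not which one the leaderboard uses.

```python
# Hypothetical per-subset results: (number correct, number of examples).
# These numbers are invented for illustration, not real BBH/GPQA results.
subsets = {
    "subset_a": (90, 100),
    "subset_b": (10, 50),
}

def macro_average(results):
    """Mean of per-subset accuracies: every subset counts equally."""
    accuracies = [correct / total for correct, total in results.values()]
    return sum(accuracies) / len(accuracies)

def micro_average(results):
    """Accuracy over the concatenation of all subsets: every example counts equally."""
    total_correct = sum(correct for correct, _ in results.values())
    total_examples = sum(total for _, total in results.values())
    return total_correct / total_examples

print("average of subset scores:", macro_average(subsets))
print("score on concatenated subsets:", micro_average(subsets))
```

With these numbers the average of subset scores is 0.55, while the score on the concatenation is about 0.667, so the two only agree when all subsets have the same size.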
Thank you very much!

Open LLM Leaderboard org

Hi @Amigozyq ,

Here is the new page about Scores Normalization in our documentation. I think it will be helpful.

I'm closing this discussion; please open a new one if you have any further questions!

alozowski changed discussion status to closed
