The BBH results are inconsistent with the official Qwen2 results

#827
by peels7877 - opened
  • BBH official
    Qwen2-72B: 82.4
  • BBH open_llm_leaderboard
    Qwen2-72B: 57.48
    Qwen2-72B raw: 0.7
Open LLM Leaderboard org

Hi @peels7877 ,

The inconsistency between the official Qwen2-72B BBH results and the Leaderboard ones may be due to several factors. Firstly, it appears you've checked results for the instruct model, not the base model: the Leaderboard results for Qwen2-72B (base) on BBH are 51.86, and 0.66 raw. Additionally, even though both evaluations are done in a 3-shot setting, there may be differences in which BBH subsets are included (BBH contains many subsets), in the metrics used (we use acc_norm), and in the prompts. Each of these elements can significantly influence the final score.
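
For reference, here is a minimal sketch of how a Leaderboard-style BBH run can be reproduced with lm-evaluation-harness, which the Leaderboard builds on. The task name `leaderboard_bbh`, the `acc_norm,none` metric key, and the dtype setting are assumptions that may vary between harness versions, so treat this as illustrative rather than the exact Leaderboard pipeline:

```python
# Hypothetical reproduction of a Leaderboard-style BBH run with
# lm-evaluation-harness (pip install lm-eval). Task and metric names
# are assumptions and may differ between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Base model, not Qwen/Qwen2-72B-Instruct -- evaluating the instruct
    # variant is one common source of mismatched numbers.
    model_args="pretrained=Qwen/Qwen2-72B,dtype=bfloat16",
    tasks=["leaderboard_bbh"],  # Leaderboard BBH task group (3-shot prompts)
    batch_size="auto",
)

# BBH is reported as an aggregate over many subsets; a different subset
# selection or metric (acc vs acc_norm) alone can shift the headline score.
for task, metrics in sorted(results["results"].items()):
    print(task, metrics.get("acc_norm,none"))
```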

alozowski changed discussion status to closed
