State of Open LLM Leaderboard v2 evals and Reproducibility Issues.

#829
by pankajmathur - opened

Hi Team,
Huge thanks for your effort in upgrading the Open LLM LB to v2. I understand all migrations are hard and require immense hard work from people behind the scenes so that users don't see even an iota of disruption on the front end.

This discussion is meant to help understand the current state of Open LLM LB v2 evals and the reproducibility issues.

TL;DR: IMHO, choosing "Use chat template" for most of the models during UI submission is misguided. For all the current submissions (FINISHED and PENDING), there should be an option to resubmit the models without losing the vote count.

First, in order to reproduce the steps mentioned in the official doc https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about, there are issues with the official 0.4.3 release of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), such as missing required Python packages, gated datasets on HF, and broken code (make_table).
I also used the PR (https://github.com/EleutherAI/lm-evaluation-harness/pull/2058) and the steps mentioned in the official doc.
It still does not fix the main issue: models submitted with a chat_template are being evaluated incorrectly.
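
For anyone trying to follow along, here is a rough sketch of the setup I mean (illustrative only; the extras names and the need for an HF token depend on the harness version):

# Illustrative only: install the harness from that PR's branch head, with the
# extras used by the leaderboard tasks (extras names may differ by version).
pip install "lm_eval[math,ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@refs/pull/2058/head"

# Some leaderboard datasets (e.g. GPQA) are gated, so an authenticated HF login is needed:
huggingface-cli login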

As an example:
Here is the requests log for "orca_mini_v6_8b":
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/pankajmathur/orca_mini_v6_8b_eval_request_False_bfloat16_Original.json

On the LB UI, "orca_mini_v6_8b" shows up as the crappiest model on the LB, because all 6 evaluation scores are in the single digits. There are literal 0 scores for GPQA and Math Lvl 5.
When I finally managed to set up and run the evals described in the official doc => https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/leaderboard/README.md

The results were much better.
Let's take the single case of MMLU-Pro.
For orca_mini_v6_8b, the LB shows a score of 1.38 out of 100.
When I ran it on 1xA100 using the command below, it shows 0.3564 out of 1.
Here is the command:

pretrained_model="pankajmathur/orca_mini_v6_8b"
lm_eval --model hf --model_args pretrained=$pretrained_model,dtype=bfloat16 --tasks leaderboard_mmlu_pro --num_fewshot 5 --batch_size=auto --device cuda:0 --output_path hf_open_llm_lb_2/leaderboard_mmlu_pro --use_cache cache_$pretrained_model/leaderboard_mmlu_pro

Here is a screenshot of the result:

[Screenshot: leaderboard_mmlu_pro result, 2024-07-09]

Here is the output log:
https://huggingface.co/pankajmathur/orca_mini_v6_8b/blob/main/open-llm-lb-2-evals-log/leaderboard_mmlu_pro/results_2024-07-09T17-41-22.805410.json

The difference, I think, comes from my not using the externally provided chat_template while running these evals, which I believe is the right approach.

Here are the other logs output while running all the other Open LLM LB task groups with the official 0.4.3 release of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
https://huggingface.co/pankajmathur/orca_mini_v6_8b/tree/main/open-llm-lb-2-evals-log

So, please advise on next steps. If possible, can we resubmit the model without the "Use chat template" option while preserving the vote count? It was very hard to get these votes just to be evaluated in the first place.

Sorry for the long post, and thanks again for v2. Happy to answer any further questions and provide more details.

pankajmathur changed discussion title from State of Open LLM Leaderboard 2 evals and Reproducibility Issues. to State of Open LLM Leaderboard v2 evals and Reproducibility Issues.
Open LLM Leaderboard org

Hi @pankajmathur !

Thanks for your appreciation of the leaderboard!

1) Reproducibility

@SaylorTwift will be able to give you the full command to reproduce our results; I can already see that yours is missing the "few-shot-as-multiturn" argument. It's quite possible that our doc is not perfectly up to date at the moment, for which we apologize - hopefully all our PRs will be merged soon, which will make everything easier to run. You also should not manually set the few-shot sample size, as it's hardcoded for all our tasks.
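
In the meantime, here is a rough sketch of what such a command could look like with the harness's --apply_chat_template and --fewshot_as_multiturn flags (illustrative only; the exact invocation is the one given in our reproducibility docs):

# Illustrative sketch, not the official command - see the reproducibility docs.
# Few-shot counts come from the task configs, so --num_fewshot is not passed.
pretrained_model="pankajmathur/orca_mini_v6_8b"
lm_eval --model hf --model_args pretrained=$pretrained_model,dtype=bfloat16 --tasks leaderboard_mmlu_pro --apply_chat_template --fewshot_as_multiturn --batch_size auto --device cuda:0 --output_path hf_open_llm_lb_2/leaderboard_mmlu_pro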

I think you could open issues on the harness (to report the bugs you found with the release, if it's the latest) in order to help them with the next version - it would be very helpful :)

2) "For all the current submission (FINISHED and PENDING) there should be option to resubmit the models without loosing vote count."

I don't think we will do this, as we expect users to be careful when submitting their models so that we don't waste compute on improper evaluations. Losing the associated votes forces users to be careful, and prevents people from submitting the same model 10 times at different precisions / with and without chat templates, etc. We do not want people to abuse this system.

At the moment, the leaderboard assumes that chat models require chat templates, but if you know your model does not require them, you are free to untoggle the option at submission time.

Thank you @clefourrier for the prompt response and for taking the time to go through this discussion; it's appreciated.

  1. Reproducibility
    Hi @SaylorTwift, nice to meet you. When you get a chance, could you please share the full command to reproduce the v2 LB evals? I'm a bit confused about what impact the "few-shot-as-multiturn" argument has on a zero-shot task group like leaderboard_ifeval. Why would this argument cause a model to be evaluated incorrectly on a zero-shot task? Hence, I believe the issue remains the same: orca_mini_v6_8b is evaluated incorrectly on the LB.

Here are the ifeval results and logs without the "few-shot-as-multiturn" argument:
For orca_mini_v6_8b, the LB shows a score of 1.11 out of 100.
When I ran this on 1xA100 using the command below, it shows:
Instance-Level Strict Accuracy: 0.32532347504621073 out of 1
Prompt-Level Strict Accuracy: 0.4580335731414868 out of 1

Here is the command:

pretrained_model="pankajmathur/orca_mini_v6_8b"
lm_eval --model hf --model_args pretrained=$pretrained_model,dtype=bfloat16 --tasks leaderboard_ifeval --num_fewshot 0 --batch_size=auto --device cuda:0 --output_path hf_open_llm_lb_2/leaderboard_ifeval --use_cache cache_$pretrained_model/leaderboard_ifeval

Here is the screenshot of the result:

[Screenshot: leaderboard_ifeval result, 2024-07-09]

Here is the output log:
https://huggingface.co/pankajmathur/orca_mini_v6_8b/blob/main/open-llm-lb-2-evals-log/leaderboard_ifeval/results_2024-07-09T18-34-22.230254.json

Opening issues with lm-evaluation-harness: yup, good idea. I'll do that for the issues I found while running the latest (v0.4.3) release.

  1. "For all the current submissions (FINISHED and PENDING), there should be an option to resubmit the models without losing vote count."
    @clefourrier : Understood, However, How do you expect any random user (including anonymous ones), who can submit models for evaluation, to understand which model is best to evaluate with chat_template and which are not?

As a matter of fact, the model in discussion, "orca_mini_v6_8b", was not submitted by me, so it got evaluated with the chat_template flag set to True, which was the wrong way to evaluate this model.
So the policy of not removing wrongly submitted models and not preserving vote counts works against the author of the model, not against people who want to abuse the system.

It has always been possible for anyone, knowingly or unknowingly, to submit a model incorrectly for evaluation, including with the wrong precision, wrong model type, and now, with v2, the wrong chat_template flag.
I know we do not want people to abuse this system, but the whole voting system for prioritizing the evaluation of a wrongly submitted model needs to be thought through a bit more, maybe with more edge cases in mind.

One very quick idea could be to send a confirmation notification to the original author of the model ("your model has been submitted for evaluation with these arguments, please approve before it goes to voting or evaluation") and let the author decide on the best arguments for evaluating the model.
Again, there could be many other ways to handle this.

Thanks again for all the hard work.
Pankaj

Open LLM Leaderboard org

Thanks for your answer!

The confirmation email is an interesting idea! We'll think about it and see how feasible it is.

Thanks @clefourrier

@SaylorTwift: I just submitted multiple PRs: https://huggingface.co/datasets/open-llm-leaderboard/requests/discussions

for many of my models with:

 "use_chat_template": false

Let me know if you have questions or if there are any issues accepting these PRs.

https://huggingface.co/datasets/open-llm-leaderboard/requests/tree/main/pankajmathur
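
For reference, here is a rough sketch of one way such an edit could be prepared locally before opening a PR on the requests dataset (illustrative only; the file name comes from the request linked earlier, and the exact JSON formatting may differ):

# Illustrative only: download the request file and flip the flag locally,
# then open a PR against the requests dataset with the edited file.
huggingface-cli download open-llm-leaderboard/requests \
  pankajmathur/orca_mini_v6_8b_eval_request_False_bfloat16_Original.json \
  --repo-type dataset --local-dir .
sed -i 's/"use_chat_template": true/"use_chat_template": false/' \
  pankajmathur/orca_mini_v6_8b_eval_request_False_bfloat16_Original.json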

Open LLM Leaderboard org
edited Jul 12

Hi @pankajmathur ,
We'll be merging your PRs for the models which have not already started evaluating.
For the models which have already been launched, this will be an issue, as we would need to cancel the jobs and relaunch them, which would be a waste of compute.
I think we'll indeed add a login system to force people to be connected and only submit their own models, to avoid this kind of issue (cc @alozowski since you wanted to add this).

Open LLM Leaderboard org

Btw, you opened several duplicate PRs; please try to open a single PR next time.
I'm also not accepting your PRs for finished models: we already evaluated them, and changing the request file would mean showing incorrect information to users.

Thanks @clefourrier, truly appreciated. Yup, I will open a single PR next time (hopefully not many will be needed once the login system is added).
Also, understood about the already evaluated models, makes sense. I will submit them again with the proper arguments.
Let me know if you have any questions in the meantime.

@SaylorTwift : Could you please share the full command to reproduce the v2 LB evals?

Open LLM Leaderboard org

Hi @pankajmathur ,

I think I can help you with the command to reproduce the evals – here is the Reproducibility section in our documentation; I hope you will find it useful.

Let me close this discussion now, feel free to open a new one in case of any other questions!

alozowski changed discussion status to closed
