GSM8K evaluation has a serious bug/oversight that is negatively impacting the scores of all Llama 3 models. Please consider updating to the latest commit of lm-evaluation-harness, which fixes it.

#770
by ArkaAbacus - opened

Hello,

We have found a bug/oversight in the GSM8K task in the commit of lm-evaluation-harness that is used for the LLM Leaderboard. In https://github.com/EleutherAI/lm-evaluation-harness/blob/b281b09/lm_eval/tasks/gsm8k.py#L82, the stop tokens for GSM8K generation are defined as:
[":", "Question:", "Question"]
Llama-3 and its finetunes, and potentially other models, sometimes emit ":" in the middle of their chain of thought when answering GSM8K, which truncates the generation early. Here is an example completion we saw from Llama-3-70B-Instruct:

Question: A three-ounce box of flavored jello makes 10 small jello cups.  Greg wants to make small jello cups for his son's outdoor birthday party.  There will be 30 kids and he wants to have enough so that each kid can have 4 jello cups.  Jello is currently on sale for $1.25.  How much will he spend on jello?
Answer: First, find out how many jello cups Greg needs
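
What happens is that generation is cut at the first stop sequence it hits, and with ":" in the list the chain of thought is truncated before the model ever produces a number. Here is a minimal sketch of the effect (this is not the harness's actual generation code, and the continuation after "needs" is invented for illustration):

```python
# Minimal sketch of the truncation, NOT the harness's generation code.
# The text after "needs" is a hypothetical continuation, shown for illustration only.
STOP_SEQUENCES = [":", "Question:", "Question"]

full_completion = (
    "First, find out how many jello cups Greg needs: 30 kids * 4 cups = 120 cups. "
    "120 cups / 10 cups per box = 12 boxes. 12 boxes * $1.25 = $15. "
    "The answer is 15."
)

def truncate_at_stop(text, stop_sequences):
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop(full_completion, STOP_SEQUENCES))
# -> "First, find out how many jello cups Greg needs"
# The numeric answer is never generated, so GSM8K answer extraction marks it wrong.
```

With ":" removed from the stop sequences, the completion runs on to the final number and the usual answer extraction can find it.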

Rerunning with ":" removed from the list of stop tokens increases Llama-3-70B-Instruct's score from 85.8% to 90.9%. In internal testing we found the effect on other models, e.g. Smaug-Llama-3-70B, to be even more severe: there the fix increased the score from 67.8% to 88.9%.

Here is a fork with our fix: (https://github.com/abacusai/lm-evaluation-harness-hf/commit/c5269f4794aa2bdedd1f298657e4e1add6bfdebf)
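
Conceptually the change is tiny; a simplified sketch of what it amounts to (see the commit above for the exact diff):

```python
# Stop sequences used for GSM8K generation at commit b281b09:
until = [":", "Question:", "Question"]

# With the fix, ":" is dropped so colons inside the chain of thought
# no longer terminate generation early:
until = ["Question:", "Question"]
```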

More generally, the commit of lm-evaluation-harness that the LLM Leaderboard uses is a year out of date at this point. A significant amount of work has been done on the repo since then, including fixing the above issue and improving other elements such as answer parsing (at HEAD, flexible matching on GSM8K gives an additional 2% to Smaug, for example). Would you consider updating the HF LLM Leaderboard to track a more up-to-date version of the repo? We feel this would give a much better indication of current model performance.
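
To illustrate the answer-parsing point (a rough sketch only, not the harness's exact regexes): newer versions of the harness score GSM8K both with a strict match on the canonical "#### <number>" format and with a more flexible extraction that falls back to the last number in the completion, which rescues correct answers that are not formatted canonically:

```python
import re

def strict_extract(completion):
    """Only accept answers written in the canonical GSM8K '#### <number>' format."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    return m.group(1).replace(",", "") if m else None

def flexible_extract(completion):
    """Fall back to the last number appearing anywhere in the completion."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else None

completion = "Greg needs 120 cups, i.e. 12 boxes, so he spends 12 * $1.25 = $15."
print(strict_extract(completion))    # None -- no '#### 15' marker, scored as wrong
print(flexible_extract(completion))  # '15' -- the correct answer is recovered
```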

Thanks!

@clefourrier

Open LLM Leaderboard org

Hi!
We're aware of the issue you're mentioning, which generally tends to affect more verbose models. We know it's a limitation, but we froze the implementation of the harness a year ago to make sure all models are evaluated in exactly the same setup (which, from what we've seen, is advantageous on some tasks and disadvantageous on others, so the overall ranking should be fair).

However, we very much agree that an upgrade is needed!
We have been working on this for the last two months, but it requires double-checking the task implementations in detail to avoid issues like the one above (among others; see our blog on DROP for another evaluation where the implementation was lacking), re-running models, adding a couple of requested features, etc.
We are also adding several new features to the harness code base for the community in the meantime (such as a better logging system, chat templates, etc.), with the support of their team.
It will still take a bit of time to finish, but we should get there before the end of the month (though we're a bit behind schedule at this point, tbh).

Open LLM Leaderboard org

I'm going to close the issue (so I can keep track of the other issues we're managing atm), but feel free to comment if needed

clefourrier changed discussion status to closed

Thanks for your response! Great to hear you're planning to update to the latest version of the harness; we've used it internally and it does seem much nicer to use, including some of the features you mentioned.

Curious how you guys will handle rerunning all the models when you do finish the update.
