Thanks for making this leaderboard!

#2, opened by SanjiWatsuki

I'm definitely going to be using your autoeval colab going forward. I ran it on my most recent model and got a pretty solid result [0]. It definitely helped show the impact of DPO on that model; [1] has the old results (a quick sketch of diffing two runs like this follows the links).

I personally find TruthfulQA to be a very dubious benchmark but the other ones seem fine ;)

[0] https://gist.github.com/DHNishi/704d79949c53897db3900c3126fd1fed
[1] https://gist.github.com/DHNishi/ddbad251975d6e6024f65fbfbc0842df
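
For anyone doing the same before/after comparison, here's a minimal sketch of diffing two runs. The benchmark names follow the AutoEval summary tables, and the scores are placeholders to fill in from your own results, not the numbers from the gists above.

```python
# Placeholder scores: fill these in from your own before/after runs.
# Benchmark names follow the AutoEval summary tables (adjust if yours differ).
before = {"AGIEval": 0.0, "GPT4All": 0.0, "TruthfulQA": 0.0, "Bigbench": 0.0}
after = {"AGIEval": 0.0, "GPT4All": 0.0, "TruthfulQA": 0.0, "Bigbench": 0.0}

for bench in before:
    delta = after[bench] - before[bench]
    print(f"{bench:>10}: {before[bench]:5.2f} -> {after[bench]:5.2f} ({delta:+.2f})")
```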

That's really cool, thanks for the feedback!

I agree about TruthfulQA, but I can't remove it from this benchmark suite. I'm talking with @gblazex about creating a new suite that would be 1/ highly correlated with human evaluation, 2/ cheap to compute, and 3/ not contaminated. Let me know if you have good candidates.

I think EQ-Bench is a good candidate that meets all 3 of the criteria. Nobody is trying to contaminate their datasets to win that benchmark and it has shown good correlation so far.

My north stars for model eval right now are MT-Bench (expensive to compute), MMLU, EQ-Bench, and LLM logic tests (not aware of a good machine-gradable one yet). I think the other metrics I'd consider are already inside of GPT4All.

I'm a bit skeptical that some of the models ranked on the EQ-Bench leaderboard are better than GPT-4 (https://eqbench.com/). Do you believe it?

Yeah, MT-Bench is really good. The search continues...

Hey Sanji. Thank you for the suggestions. Here are some thoughts on them, organized.

1. Conversational skill

MT-bench
+ rates helpfulness, relevance, accuracy, and detail
+ tests multi-turn capabilities, which most other tests don't
+ high correlation with human judgement/preference
- probably doesn't capture breadth of knowledge well (need ~MMLU)
- expensive, as you mentioned (a sketch of what a single judge call looks like follows below)
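
To make the cost point concrete, here's a rough sketch of MT-bench-style single-answer grading: one GPT-4 call per question turn. The prompt is paraphrased rather than the verbatim MT-bench template, and the OpenAI client setup is assumed.

```python
import re
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def judge(question: str, answer: str) -> float:
    """Ask GPT-4 to grade one answer on a 1-10 scale (paraphrased MT-bench-style prompt)."""
    prompt = (
        "Please act as an impartial judge and evaluate the quality of the response to "
        "the user question below. Consider helpfulness, relevance, accuracy, and level "
        "of detail. After a brief explanation, output a rating from 1 to 10 strictly "
        "in the format [[rating]].\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else float("nan")
```

With 80 questions at two turns each per model, these judge calls are where the cost adds up.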

2. Breadth of knowledge

MMLU
+ shows good correlations (and I believe it's one of the ungamed ones)
+ a subset of it might be enough (let's say 5-6 topics out of 57; see the harness sketch at the end of this section)
+ MT-bench doesn't capture it completely, so it seems necessary

[Screenshot: figure from the MT-bench paper, https://browse.arxiv.org/html/2306.05685v4]

Good MT-bench results can be obtained (with a small, high-quality conversation dataset) without improving on MMLU (breadth of knowledge). Conversely, MMLU can be good while MT-bench (conversational skill) isn't optimal.

Their conclusion: no single benchmark can determine model quality, meaning that a comprehensive evaluation is needed.
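
On the subset idea, here's a minimal sketch using the lm-evaluation-harness Python entry point. The five task names are an arbitrary illustrative pick, the model ID is a placeholder, and the exact task names and metric keys depend on your harness version (check `lm_eval --tasks list`).

```python
import lm_eval  # EleutherAI lm-evaluation-harness (v0.4-style API assumed)

# Arbitrary 5-topic slice of MMLU, purely for illustration; swap in whichever
# subset turns out to correlate best.
tasks = [
    "mmlu_abstract_algebra",
    "mmlu_formal_logic",
    "mmlu_college_medicine",
    "mmlu_philosophy",
    "mmlu_world_religions",
]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOUR_MODEL_ID,dtype=bfloat16",  # placeholder model ID
    tasks=tasks,
    num_fewshot=5,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```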

3. Logic

LLM logic tests
- Not sure which one this is going to be; AGI Eval has math tests, so maybe start with that.
- there's LLMonitor, which is only 20 questions, but it requires a GPT-4 judge (like MT-bench)
+ MT-bench doesn't capture it completely, so it seems necessary

GPT-4 is not good at grading math/coding answers (a machine-gradable alternative is sketched at the end of this section)

We discover that GPT-4 can produce not only relatively consistent scores
but also detailed explanations on why such scores are given (detailed examples link).
However, we also notice that GPT-4 is not very good at judging coding/math tasks.
https://lmsys.org/blog/2023-03-30-vicuna/

Even when it knows the answer itself

...more intriguing is that it also shows limitations in grading basic math problems which it is capable of solving*.
https://browse.arxiv.org/html/2306.05685v4
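
One way around the judge problem for math is exact-match grading against reference answers, GSM8K-style, which needs no LLM judge at all. A minimal sketch; the answer-extraction regex and the "last number wins" convention are assumptions for illustration, not any particular harness's implementation.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Pull the last number out of a model's answer, e.g. after 'so the answer is 24.'"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def grade(model_answer: str, gold: str) -> bool:
    """Machine-gradable check: compare extracted numbers, no LLM judge needed."""
    predicted = extract_final_number(model_answer)
    return predicted is not None and float(predicted) == float(gold)

print(grade("48 / 2 = 24, so the answer is 24.", "24"))  # True
```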

4. Emotional intelligence

EQ bench
+ it's only 60 questions (see the toy scoring sketch below)
+ should be cheap to run on a lot of models and figure out the correlations
+ 16 results for Arena leaderboard models are already available
- I don't have deep intuitions about how hard it will be to game in the future
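
For a sense of why 60 questions stay cheap, here is a toy sketch in the spirit of EQ-Bench's scoring: the model rates the intensity of a few emotions and is scored by its distance from reference ratings. The emotions, numbers, and normalization below are made up for illustration and do not match the benchmark's actual formula.

```python
def score_item(predicted: dict[str, float], reference: dict[str, float]) -> float:
    """Toy distance-based score on a 0-10 scale: 10 = perfect match with the reference.
    The normalization is illustrative only, not EQ-Bench's real scoring."""
    total_error = sum(abs(predicted[e] - reference[e]) for e in reference)
    worst_case = 10.0 * len(reference)
    return 10.0 * (1.0 - total_error / worst_case)

# Hypothetical single item: four emotions rated 0-10 by the model vs. a reference key.
reference = {"anger": 7, "surprise": 3, "hurt": 8, "amusement": 0}
predicted = {"anger": 6, "surprise": 4, "hurt": 8, "amusement": 1}
print(round(score_item(predicted, reference), 2))  # 9.25 with these toy numbers
```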

5. Others

+ HELM Lite: interesting starting point, as it has many tests, but not too many. It has 5 MMLU topics, though maybe not the best ones.
https://twitter.com/tianle_cai/status/1745261204120170687
(@tianle_cai tried to find the best-correlating subset of MMLU, but it could be overfitting)
+ AGI Eval seems good too, with good correlations so far. I need to run it on more Arena models.
+ Big-Bench Hard (CoT, 3-shot): good correlations, not gamed yet I believe.
+ HumanEval (or something coding-related is needed)

Wow, excellent summary!

> I'm a bit skeptical that some of the models ranked on the EQ-Bench leaderboard are better than GPT-4 (https://eqbench.com/). Do you believe it?
>
> Yeah, MT-Bench is really good. The search continues...

I do think there are some odd outliers on that benchmark but, by and large, it seems to correlate shockingly well with Elo and it seems remarkably sturdy. For instance, the 7B merges of merges that crush most benchmarks don't do exceptionally well on EQ-Bench; I suspect that's because it correlates really well with MMLU, and it's tough to get MMLU improvements.

Yes, that's a really good sign. I think these merges are good overall, but their performance is probably overrated. I'd love to see an evaluation of some of my models to get a good intuition of this EQ score.

EQ correlation was 0.82 looking at 16 models (which is close to HELM Lite, LLMonitor, BBH@3, HumanEval), so it is promising!
Needs more data, though.
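
For anyone who wants to reproduce that kind of number once more results land, a small sketch, assuming you've lined up benchmark scores and Arena Elo for the same models; the values below are dummy placeholders, not the actual 16 data points.

```python
from scipy.stats import pearsonr, spearmanr

# Dummy values for illustration only (NOT the real data points):
# one entry per model, same ordering in both lists.
eq_bench_scores = [55.1, 62.3, 70.8, 74.2, 80.5]
arena_elo = [1020, 1060, 1110, 1140, 1180]

r, _ = pearsonr(eq_bench_scores, arena_elo)
rho, _ = spearmanr(eq_bench_scores, arena_elo)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```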

I always wanted to have human feedback (like arena) on merges & contaminated models to see if/where they fall apart.

But actually, if we find automated tests where we can see regressions, that might be a good start too.

As an aside, what GPU do you typically spin up to test 7Bs with AutoEval?

I usually use an RTX 3090. The evaluation typically costs a little less than $1 (rough math below).
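
For the curious, the rough math with assumed numbers; rental rate and runtime vary by provider and by model, so treat both as placeholders.

```python
# Assumed numbers for illustration only; actual rental prices and runtimes vary.
hourly_rate_usd = 0.35   # assumed cloud RTX 3090 rental rate
runtime_hours = 2.5      # assumed wall-clock time for a 7B evaluation run
print(f"~${hourly_rate_usd * runtime_hours:.2f}")  # ~$0.88
```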
