Nice Leaderboard :)

#1
by Venkman42 - opened

Nice to see some more leaderboards. I feel like you can't really trust the Open LLM Leaderboard at this point, and they don't add any phi-2 models except the Microsoft one because of remote code.
Could you add the following models?
Phi-2-dpo
Openchat v1+v2
Mistral v1+v2
Starling Alpha

I would be really interested to see how phi-2-dpo stacks up against dolphin-phi, and the other models would be great reference points since they have been very popular.

Owner

Thanks! I added Starling Alpha and Openchat thanks to @gblazex. Working on uploading more phi-2 models.

Thanks, don't forget the new phixtral models 😉

Oh and btw, gobruins 2.1.1 was flagged as contaminated on the Open LLM Leaderboard because its training data contains TruthfulQA:
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657e6a221e3e9c41a4a8ae23

Owner

Haha phixtral is cooking! Performance won't be that great with this version but it's a start.

Yes, I noticed that. I think @gblazex wanted to compare the performance on the Open LLM Leaderboard vs. Nous benchmark suite. I'll probably remove it.

I'll run these:
Openchat-1210 ("v2")
Mistral v1+v2

> Yes, I noticed that. I think @gblazex wanted to compare the performance on the Open LLM Leaderboard vs. Nous benchmark suite. I'll probably remove it.

Yes, TruthfulQA is part of the Nous suite. I wanted to see how it does on the rest.
No need for it to be on this leaderboard.

(This was the only flagged one that was interesting to me, because a relatively large number of people liked the model card. It actually does well on other benchmarks like AGIEval.)

๐Ÿ‘ Cool leaderboard!

I'm glad to see dolphin-2_6-phi-2 up here; it feels capable and it's cool to see it compared to phi-2. 3B models are pretty underrepresented on the Open LLM board, with a big gap between phi-2 and MiniChat-2, so I'd also like to request stabilityai/stablelm-zephyr-3b. It isn't on the other board due to remote code; it doesn't feel as good as phi-2, but it is decent.

Also, Weyaxi/OpenHermes-2.5-neural-chat-v3-3-Slerp is really good and my go-to 7B, so I'd like to see how it performs here too.

New Openchat model just dropped, would be a great addition ;)
openchat/openchat-3.5-0106

Nice to see a new leaderboard and thanks for your very useful Colab AutoEval as well!

Another one to add:
rhysjones/phi-2-orange

llm-autoeval for it has just completed; the output is in the following gist:
https://gist.github.com/rhys101/90704633aee67d7325fc9b599be27fa2

Thanks @gblazex for all your contributions, I'll add you in the About section!

@b-mc2 Cool, I added all the models you mentioned.

@Venkman42 Added! It looks good but Mistral Instruct v0.2 is really impressive.

@rhysjones For fairness, I reran the benchmark for your model (it gained +0.04 on average :)). It looks really strong, congrats! I'll probably use it for phixtral.

Could you please make the link to the leaderboard clickable?

What do you mean, exactly?

Sorry, I meant the link column, where the link to the model is.

The URLs are already clickable. You can't do better than this in Streamlit unfortunately (I can't insert the links in the model ids).

I ran openchat-1210; you can decide whether you want to add it or not. It was before the newest one got released :)
Maybe the oldest and newest ones are enough to keep the leaderboard less polluted.
The oldest one is nice to see as it's on Arena and the most widely used (also, Starling is based on it).

https://gist.github.com/gblazex/8c39e043f13cbbfc4ab1fa68faf2cedc

Added, thanks again @gblazex !

> The URLs are already clickable. You can't do better than this in Streamlit unfortunately (I can't insert the links in the model ids).

Oh okay, on mobile I have to click twice in the cell to make the link clickable; I just figured that out.

> @Venkman42 Added! It looks good but Mistral Instruct v0.2 is really impressive.

What about a NeuralHermes-2.5-Mistral-7B-v0.2-Laser? 😉

I'll try playing with the HTML render and add buttons to sort it to fix this issue.
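
For reference, here is a rough sketch of both approaches (not the Space's actual code; it assumes a recent Streamlit version with st.column_config.LinkColumn, and the rows are made up):

```python
import pandas as pd
import streamlit as st

# Made-up rows, just to illustrate the two rendering options.
df = pd.DataFrame({
    "Model": ["example/model-a", "example/model-b"],
    "URL": ["https://huggingface.co/example/model-a",
            "https://huggingface.co/example/model-b"],
    "Average": [58.9, 55.2],
})

# Option 1: recent Streamlit versions can render a column as clickable links,
# while keeping the built-in column sorting.
st.dataframe(
    df,
    column_config={"URL": st.column_config.LinkColumn("Model page")},
    hide_index=True,
)

# Option 2: render the table as HTML so the model id itself is the link.
# This loses the built-in sorting, hence the idea of adding sort buttons.
linked = df.copy()
linked["Model"] = [f'<a href="{u}">{m}</a>' for m, u in zip(df["Model"], df["URL"])]
st.markdown(
    linked.drop(columns=["URL"]).to_html(escape=False, index=False),
    unsafe_allow_html=True,
)
```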

Haha yes that'd be interesting, especially using the new preference dataset made by @argilla

And I don't know if it's possible, but if someone could add benchmarks for some OpenAI models for reference, that would be great too :)

It's a great point, I'll definitely add that (writing it down for now)!

You would have to run it from scratch.

TruthfulQA is maybe the only one where you can find the same few-shot results, but all the others on this leaderboard are not really standard or widely used in academia in this form.
- GPT4All: despite the name, I don't know of any results for closed-source models
- BigBench: is not the BigBench results you see elsewhere (this one is a cherry-picked subset by Nous)
- AGIEval: isn't the complete AGIEval either (it's missing the 1,000-question math part). Some online sources might also report the whole test, which is English + Chinese tests (Nous only tests English). The OpenCompass leaderboard has English-only results, but again, it ran the full suite, including the 1,000-question math test that is missing from this leaderboard's results.

So the only way to get comparable results for closed source models on this exact benchmark is to run it on them directly.
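
For illustration, a from-scratch run on a closed-source model would look roughly like this (just a sketch: the questions are made up and this is not the exact few-shot prompting the Nous suite uses; it assumes the openai Python package and an OPENAI_API_KEY):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Made-up multiple-choice items, only to show the shape of a from-scratch run.
items = [
    {"question": "Which planet is known as the Red Planet?",
     "choices": ["A. Venus", "B. Mars", "C. Jupiter", "D. Mercury"],
     "answer": "B"},
    {"question": "What is 7 * 6?",
     "choices": ["A. 36", "B. 48", "C. 42", "D. 40"],
     "answer": "C"},
]

correct = 0
for item in items:
    prompt = (item["question"] + "\n" + "\n".join(item["choices"])
              + "\nAnswer with a single letter.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    prediction = response.choices[0].message.content.strip().upper()
    correct += int(prediction.startswith(item["answer"]))

print(f"Accuracy: {correct / len(items):.2f}")
```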

New leader (distilabeled-Marcoro14-7B-slerp):
https://gist.github.com/gblazex/38f0f222f43c629f4e5fd18b596b889b

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| distilabeled-Marcoro14-7B-slerp | 45.38 | 76.48 | 65.68 | 48.18 | 58.93 |

Actually, I just saw that @argilla released their version too (the one I tested I found through your 'like', Maxime):
https://huggingface.co/argilla/distilabeled-Marcoro14-7B-slerp

Maybe you want to test the argilla one, but they are probably the same.

Edit: yes, I checked on the Hugging Face leaderboard and they are practically the same. I suggest linking to the argilla one because they made the dataset.

Thanks @gblazex , and congrats to @dvilasuero and team for this excellent model! Top of the YALL leaderboard :))

@mlabonne have you checked out functionary v2.2 small/medium? They claim their function calling is better than GPT-3.5 Turbo. It would be interesting to see how they stack up against non-function-calling models on general tasks.

Not sure it's the right benchmark to evaluate this model as it would significantly underperform general-purpose LLMs.

> Not sure it's the right benchmark to evaluate this model as it would significantly underperform general-purpose LLMs.

That's exactly what I would like to know: whether it's a generally good model or only good for function calling. Like, does it require a RAG setup, or can it answer general questions and/or topics on its own?

Unfortunately the model card doesn't say much.

On a side note, would it be possible to add function-calling abilities to NeuralHermes-2.5-Mistral-7B-Laser? It can kinda do it when explained in the system prompt, but a lot of the time it keeps generating text (mostly hallucinations) after generating the function call.

Ok ok I'm adding meetkai/functionary-small-v2.2 to the benchmark (currently evaluating it).

It's possible in theory but no plan to add that at the moment.
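
In the meantime, a generic post-processing workaround (just a sketch, not specific to any model) is to cut the raw output at the first balanced JSON object:

```python
import json

def extract_first_json(text):
    """Return the first balanced {...} block in `text` that parses as JSON, else None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    candidate = text[start:i + 1]
                    try:
                        json.loads(candidate)
                        return candidate
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON: try the next "{"
        start = text.find("{", start + 1)
    return None

# Example: drop the hallucinated tail after the function call.
raw = '{"name": "get_weather", "arguments": {"city": "Paris"}} Sure! Here is some more text...'
print(extract_first_json(raw))
```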

Here you go: https://gist.github.com/mlabonne/59e885173061ec5d94aeb00c539bf903

Thanks for taking the time 😊

Could someone add
g-ronimo/phi-2-OpenHermes-2.5
CognitiveComputation/Openchat-01-06-LASER

Thanks in advance :)

Owner

No problem, running the evaluations now.

@mlabonne Hi, me again. I hope I'm not annoying you yet, haha.

I've made an "incest model" of my Phiter model by creating ReversePhiter (switched the two models in the config) and remerging it with Phiter, creating PhiPhiter.

I'm not sure if that makes much sense; it was just an experiment. It doesn't feel much smarter, but it would be nice to see if there was even a slight change in performance.

Could you run an eval on PhiPhiter for me or tell me if it doesn't make any sense?
https://huggingface.co/Venkman42/PhiPhiter

Thanks in advance :)

No problem, added it to the list 🫡

[image: screenshot of the list]

> No problem, added it to the list 🫡
>
> [image: screenshot of the list]

Thanks, but ReversePhiter is just the intermediate model, PhiPhiter is the one I'm curious about :)

Could you also add the new Gemma models sometime? I don't have high hopes for them, but I'm curious where they rank

@mlabonne what is this list? Your private scheduled runs?

@Venkman42 Oops, you'll have both! Yeah the Gemma-7b models crashed last time but I think it's because they didn't have enough VRAM. Will try again with an A6000.

@gblazex Haha yeah, it's getting out of control.

@mlabonne No worries, I'll gladly take an extra freebie eval haha

@mlabonne btw, nice addition with the compare model feature :)

Yes, once it's available it will be relocating back.

@Venkman42 Thanks, this was actually implemented by @CultriX in a PR so all credit to him :)

It's updated: Phiter is still the king of the non-MoE phi-2 leaderboard. On the other hand, Gemma-7b underperforms Gemma-2b. Probably the weirdest thing that ever happened to this leaderboard, but it means that my Gemmalpaca-7B fine-tune was actually successful.

Damn, my rookie merge beats Gemma 😂 Not surprised though, it already looked pretty low from first impressions and is kinda what I expected from Google at this point...
Thanks for running the evals on my models :) I already thought they weren't better than Phiter, but it was still an interesting experiment for me 😊

> @Venkman42 Thanks, this was actually implemented by @CultriX in a PR so all credit to him :)
>
> It's updated: Phiter is still the king of the non-MoE phi-2 leaderboard. On the other hand, Gemma-7b underperforms Gemma-2b. Probably the weirdest thing that ever happened to this leaderboard, but it means that my Gemmalpaca-7B fine-tune was actually successful.

:)

Btw, feel free to fork some of my evals from my GitHub Gists (I forked some of yours so they would show up on my leaderboard, no shame in that haha :p).

Also, you still haven't added my MonaTrix-v4 eval (which would top your leaderboard by a whopping 0.03 points). Just reminding you in case you forgot.
(Jokes aside, thanks for the amazing leaderboard code @mlabonne! I have been using it a lot to test out my new models!)

Owner

@CultriX Will do! I need more samples for my next project, thanks for your evals (and your PRs).

Do you mean NeuralMaxime-7B-slerp instead of MonaTrix-v4? Obviously, it is the superior model.

> @CultriX Will do! I need more samples for my next project, thanks for your evals (and your PRs).
>
> Do you mean NeuralMaxime-7B-slerp instead of MonaTrix-v4? Obviously, it is the superior model.

Should've named it NeuralMaxime-7B-derp tbh 0.o

@mlabonne A new version of phi-2-orange is out to be tested.

It uses the latest version of Microsoft's phi-2 model format, so it can be used directly from within the Hugging Face Transformers library without having to trust remote code.

The new version is rhysjones/phi-2-orange-v2
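
For reference, a minimal loading sketch (assuming transformers >= 4.37, which includes native phi support, so no trust_remote_code flag is needed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rhysjones/phi-2-orange-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)   # no trust_remote_code needed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                                # requires accelerate
)

prompt = "Explain the difference between a list and a tuple in Python."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```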

If it could be added to the leaderboard, that would be great.
Thanks!

Owner

Cool, currently evaluating it

Owner

@rhysjones Wow, the new version of your model is significantly better on AGIEval and Bigbench (and a lot better on TruthfulQA, but I don't trust that one). Can you share how you changed the training process?

@mlabonne it's a training pipeline system that we (AxonZeta) are building, allowing a range of hyperparameters to be explored across multiple fine-tuning runs (the usual suspects: learning rate, (Q)LoRA rank and alpha, number of epochs, etc.) and also the sub-selection of the training data.

For example, does a selected subset train better than the entire dataset? Especially with something like the multi-dataset used in this phi2-orange model. How heavily should each dataset be weighted in the training group? Then at each stage, look at each layer and test whether a blended / self-merged model of different layers from different runs over the hyperparameters gives a better model for that stage. Then on to the next stage (DPO datasets in this case) and repeat, etc.

The theory is that this should give an improved and more balanced model - it seems to be true-ish in the evals on this and the open_llm leaderboard, but the real test will be in its actual use across different scenarios in the real world. So who knows!

Still haven't got round to writing this up properly (and it's still under active development, with more training runs / models underway), but I will share more details when I do.
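
To make the idea concrete, here's a toy sketch of that kind of sweep (purely illustrative: the search space and the fine-tune/eval call are placeholders, not our actual pipeline):

```python
import itertools
import random

# Placeholder search space; the real pipeline's ranges and stages are not public.
search_space = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "lora_rank": [16, 32, 64],
    "epochs": [1, 2, 3],
    "data_fraction": [0.5, 1.0],   # sub-selection of the training data
}

def finetune_and_eval(config):
    """Placeholder: fine-tune with `config` and return an average benchmark score."""
    return random.random()         # stand-in for a real evaluation

best_score, best_config = float("-inf"), None
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space, values))
    score = finetune_and_eval(config)
    if score > best_score:
        best_score, best_config = score, config

print("best config:", best_config, "score:", round(best_score, 3))
```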

@rhysjones Nice update :) Did you ever think about making a TinyLlama orange, by any chance? It would be interesting to see it on an even smaller model.

@Venkman42 Yes, very interested in smaller models - thinking of exploring TinyLlama and maybe Cosmo-1B for a bit of a challenge...

> @Venkman42 Yes, very interested in smaller models - thinking of exploring TinyLlama and maybe Cosmo-1B for a bit of a challenge...

Nice! Looking forward to seeing the results :)

@mlabonne Hey, could you please add Microsoft's Phi-3-instruct?

@Venkman42

> @mlabonne Hey, could you please add Microsoft's Phi-3-instruct?

I did this for you, results can be found here: https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard?logs=build

Or in more detail: https://gist.github.com/CultriX-Github/63952ac9317c80e241c0337c31e53a13

Owner

Sorry @Venkman42, I forgot to add it. Thanks @CultriX, I'm forking your gist :)

> @Venkman42
>
> @mlabonne Hey, could you please add Microsoft's Phi-3-instruct?
>
> I did this for you, results can be found here: https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard?logs=build
>
> Or in more detail: https://gist.github.com/CultriX-Github/63952ac9317c80e241c0337c31e53a13

Thanks :)

> Sorry @Venkman42, I forgot to add it. Thanks @CultriX, I'm forking your gist :)

Don't tell anyone but I think I forked about every single evaluation you ever ran on your leaderboard so I didn't have to run it myself. So all good :p!

Edit: Hey, btw, what happened to your leaderboard? It looks like all your scores are now way lower than they used to be (compare your board with mine). Did you change something? The tests you run, maybe? (Also, you have quite a few models added twice, which is absolutely not a big deal, except it ruins the pretty graphs at the bottom, which for me personally actually is a little bit of a big deal, but that's probably just me lol.)

Yeah, I removed TruthfulQA from the average score to make it more accurate. Where do you see duplicated models? Is it in Automerger's version?

No, on this one. And ah, okay, that explains a lot! Any reason why you took that one out in particular?

[image: screenshot of the duplicated entries]

Ah, it's because of the "Other" category; just fixed it, thanks.

TruthfulQA is an awful benchmark; I believe HF is also thinking of removing it.

I liked your idea of being able to remove certain benchmarks or tests and have the table recalculate a new filtered average. But instead of making the choice for the user, I decided to implement a way so that people can choose which benchmarks they want to use in the table. Take a look and feel free to copy the idea if you like it :)!

https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard
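
Roughly, the core of it is just a multiselect plus an average recomputed on the fly (a simplified sketch with made-up scores, not the exact code in the Space):

```python
import pandas as pd
import streamlit as st

# Illustrative scores, not real leaderboard data.
df = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "AGIEval": [45.4, 44.1],
    "GPT4All": [76.5, 75.9],
    "TruthfulQA": [65.7, 70.2],
    "Bigbench": [48.2, 47.3],
})

benchmarks = ["AGIEval", "GPT4All", "TruthfulQA", "Bigbench"]
selected = st.multiselect("Benchmarks to include", benchmarks, default=benchmarks)

if selected:
    view = df[["Model"] + selected].copy()
    view["Average"] = view[selected].mean(axis=1)   # recomputed on the fly
    st.dataframe(view.sort_values("Average", ascending=False), hide_index=True)
```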

Yeah that's a good idea, might steal it haha

> might

Edit: nvm, I stole your entire leaderboard so I really can't say sh*t lol
