Update README.md
So you think we should remove them for now @stellaathena ?
Yes, that was the conclusion we reached on today’s Eval WG call.
I don't think we should remove them.
I added a disclaimer above the results that these are not final, as your working group is working on visualizations and different ways to represent the data. As far as I followed your working group's call, that was an acceptable solution. So I'd suggest replacing the table in the PR that adds a better visualization of the evaluation results 😊
@Muennighoff We’ve tried really hard to be polite, but since that’s not working I’ll try being blunt instead: these evaluation results should have never been released. They are untrustworthy, unverified, and actively misleading. They have already caused substantial confusion, and will continue to do so. The evaluation WG in no way supports them, and their release is a violation of BigScience’s guiding principles.
Additionally, the disclaimer you added (“WARNING: These are intermediate results”) is false. The problem is not that these results were done on intermediate checkpoints. A more appropriate disclaimer would be:
WARNING: these evaluation results were carried out by people unfamiliar with the evaluation code. Some of them are known to be incorrect, and the rest are largely invalidated. They were released without the approval or consent of the Evaluation WG. The Evaluation WG disowns them and wishes that they had never been released in the first place.
Hey @stellaathena! I don't think @Muennighoff meant any harm at all, as he wasn't there at the end of the meeting. I'm okay with removing them and letting you guys handle the evaluation. I think we should keep the original dump though (I think some of the ongoing work is being done on that) and the human eval evaluation done by @loubnabnl on a separate codebase. Does that work for you?
Nit: They did run on the final checkpoint.
I spoke with @TimeRobber one-on-one and we agreed to go ahead and remove the evaluation results. I'm not sure who has the permissions to merge this PR, but please do so ASAP.
I still think we should keep human eval and training/validation loss/perplexity. If you can update the PR, I can merge it.