MMLU-Pro benchmark


In Meta's announcement I noticed they showed MMLU scores for the 1B and 3B models, but not MMLU-Pro scores as they did for the 11B and 90B. Here are my test results, with Llama 3.1 8B and Qwen2.5 3B included for comparison:

| Models                | Data Source   | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health  | History | Law   | Math  | Philosophy | Physics | Psychology | Other |
|-----------------------|---------------|---------|---------|----------|-----------|------------------|-----------|-------------|---------|---------|-------|-------|------------|---------|------------|-------|
| Llama-3.1-8B-Instruct | TIGER-Lab     | 0.443   | 0.630   | 0.493    | 0.376     | 0.483            | 0.551     | 0.297       | 0.507   | 0.423   | 0.273 | 0.438 | 0.445      | 0.403   | 0.600      | 0.448 |
| Qwen2.5-3B            | Self-Reported | 0.437   | 0.545   | 0.541    | 0.407     | 0.432            | 0.530     | 0.292       | 0.440   | 0.391   | 0.223 | 0.545 | 0.371      | 0.440   | 0.555      | 0.415 |
| Llama-3.2-3B-Instruct | Self-Reported | 0.365   | 0.552   | 0.399    | 0.264     | 0.371            | 0.480     | 0.260       | 0.461   | 0.336   | 0.227 | 0.378 | 0.349      | 0.302   | 0.514      | 0.358 |

You can view the full leaderboard here: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
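If you want to reproduce the self-reported rows, here's a minimal sketch of how one could score a model on MMLU-Pro using the `TIGER-Lab/MMLU-Pro` dataset on the Hub. The model ID, zero-shot prompt format, and regex answer extraction below are my own assumptions for illustration; the leaderboard itself uses a 5-shot CoT setup from the TIGER-Lab evaluation scripts, so absolute numbers from this sketch will differ somewhat.

```python
# Minimal sketch: zero-shot MMLU-Pro scoring with a local HF model.
# Assumptions: TIGER-Lab/MMLU-Pro "test" split with fields
# question / options / answer / category; prompt format is simplified.
from collections import defaultdict
import re

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # swap in the model you want to test
LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have up to 10 options

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def build_prompt(row):
    # Zero-shot multiple-choice prompt; the leaderboard uses 5-shot CoT.
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["options"]))
    return (
        f"Question: {row['question']}\n{options}\n"
        "Answer with the letter of the correct option only."
    )

def predict(row):
    messages = [{"role": "user", "content": build_prompt(row)}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=8, do_sample=False)
    text = tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
    match = re.search(r"[A-J]", text)  # take the first option letter in the reply
    return match.group(0) if match else None

correct, total = defaultdict(int), defaultdict(int)
for row in ds:
    cat = row["category"]
    total[cat] += 1
    if predict(row) == row["answer"]:  # "answer" holds the gold letter, e.g. "B"
        correct[cat] += 1

for cat in sorted(total):
    print(f"{cat:20s} {correct[cat] / total[cat]:.3f}")
print(f"{'Overall':20s} {sum(correct.values()) / sum(total.values()):.3f}")
```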
