MMLU-Pro benchmark


In Meta's announcement I noticed they showed MMLU scores for the 1B and 3B models, but not MMLU-Pro scores as they did for the 11B and 90B models. Here are my test results, with Qwen2.5 included for comparison:

| Models                  | Data Source    | Overall  | Biology  | Business  | Chemistry  | Computer Science | Economics  | Engineering  | Health   | History   | Law    | Math    | Philosophy  | Physics  | Psychology  | Other   |
|-------------------------|----------------|----------|----------|-----------|------------|------------------|------------|--------------|----------|-----------|--------|---------|-------------|----------|-------------|---------|
| Qwen2.5-1.5B            | Self-Reported  |   0.321  |   0.435  |   0.374   |   0.256    |     0.351        |   0.389    |    0.190     |   0.336  |   0.278   |  0.148 |  0.430  |    0.279    |   0.286  |    0.469    |  0.325  |
| Llama-3.2-1B-Instruct   | Self-Reported  |   0.226  |   0.406  |   0.219   |   0.155    |     0.239        |   0.274    |    0.125     |   0.260  |   0.213   |  0.173 |  0.234  |    0.200    |   0.180  |    0.346    |  0.242  |
| Qwen2.5-0.5B            | Self-Reported  |   0.149  |   0.208  |   0.146   |   0.116    |     0.137        |   0.225    |    0.110     |   0.169  |   0.131   |  0.134 |  0.133  |    0.132    |   0.122  |    0.212    |  0.150  |

You can view the full leaderboard here: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
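If anyone wants to run a quick check of their own, below is a minimal sketch that loads the TIGER-Lab/MMLU-Pro dataset and scores a model by comparing the likelihood of each answer letter. This is a simplification, not the leaderboard protocol: the official evaluation uses 5-shot chain-of-thought prompting with answer extraction, so numbers from this sketch will not match the table above. The dataset field names (`question`, `options`, `answer_index`) are assumptions based on the current Hugging Face dataset card.

```python
# Minimal zero-shot likelihood scoring on MMLU-Pro (sketch, not the official protocol).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # or a Qwen2.5 checkpoint

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Assumed fields: question (str), options (list of str), answer_index (int).
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have up to 10 options

def is_correct(example):
    prompt = example["question"] + "\n"
    for i, opt in enumerate(example["options"]):
        prompt += f"{LETTERS[i]}. {opt}\n"
    prompt += "Answer:"
    # Score each candidate letter by the model's loss on prompt + letter.
    # The prompt is identical across candidates, so the loss difference
    # comes from the answer token alone.
    losses = []
    for i in range(len(example["options"])):
        enc = tok(prompt + " " + LETTERS[i], return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    pred = int(torch.tensor(losses).argmin())
    return pred == example["answer_index"]

n = 200  # small subset for a quick sanity check
correct = sum(is_correct(ex) for ex in ds.select(range(n)))
print(f"accuracy on first {n} questions: {correct / n:.3f}")
```

For results comparable to the leaderboard, it's better to use the evaluation script in the TIGER-Lab/MMLU-Pro repository, which implements the 5-shot CoT setup.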
