Losin94 committed on
Commit 0546a58
Parent(s): 9b039ab

Update README.md

Files changed (1)
  1. README.md +51 -0
README.md CHANGED
@@ -34,6 +34,57 @@ KeyError: 'qwen2'
  We do not advise you to use base language models for text generation. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.

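Since the paragraph above recommends post-training rather than direct generation, a minimal loading sketch may help orient readers. It assumes the `Qwen/Qwen2-7B` repo id (substitute your actual checkpoint) and `transformers>=4.37.0`; older releases raise the `KeyError: 'qwen2'` shown in the hunk header because the architecture is not yet registered.

```python
# Minimal sketch: load the base model for post-training or inspection,
# not for direct chat-style generation.
# Assumes transformers>=4.37.0 (older releases raise KeyError: 'qwen2').
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"  # assumed repo id; adjust to your checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native dtype
    device_map="auto",    # spread layers across available devices
)
print(model.config.model_type)  # -> "qwen2"
```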
+ ### Performance
+
+ The evaluation of base models mainly covers natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capability; a sketch of launching one such few-shot run follows the dataset list.
+
+ The datasets for evaluation include:
+
+ **English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)
+
+ **Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)
+
+ **Math Tasks**: GSM8K (4-shot), MATH (4-shot)
+
+ **Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)
+
+ **Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
+
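The README does not say which framework produced the numbers below. As one hedged illustration, a 5-shot MMLU run could be launched with EleutherAI's lm-evaluation-harness (v0.4+) via its `simple_evaluate` entry point; the repo id, dtype, and batch size here are assumptions, not settings taken from this README.

```python
# Illustrative only: one way to run a 5-shot MMLU evaluation with
# EleutherAI's lm-evaluation-harness (pip install lm-eval). The README
# does not state which harness produced the reported numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                             # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2-7B,dtype=bfloat16",   # assumed repo id
    tasks=["mmlu"],        # one of the 5-shot English tasks listed above
    num_fewshot=5,
    batch_size=8,          # tune to your hardware
)
print(results["results"]["mmlu"])
```

The other benchmarks would be run the same way, swapping in each task name and its shot count from the list above.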
+ #### Qwen2-7B performance
+
+ | Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
+ | :-------- | :---------: | :------------: | :------------: | :------------: | :------------: |
+ | # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
+ | # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
+ | ***English*** | | | | | |
+ | MMLU | 64.2 | 64.6 | 66.6 | 61.0 | **70.3** |
+ | MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | **40.0** |
+ | GPQA | 24.7 | 25.7 | 25.8 | 26.7 | **31.8** |
+ | Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | **31.1** |
+ | BBH | 56.1 | 55.1 | 57.7 | 40.2 | **62.6** |
+ | HellaSwag | **83.2** | 82.2 | 82.1 | 78.5 | 80.7 |
+ | Winogrande | 78.4 | **79.0** | 77.4 | 71.3 | 77.0 |
+ | ARC-C | 60.0 | **61.1** | 59.3 | 54.2 | 60.6 |
+ | TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | **54.2** |
+ | ***Coding*** | | | | | |
+ | HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | **51.2** |
+ | MBPP | 51.1 | 50.6 | 53.9 | 51.6 | **65.9** |
+ | EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | **54.2** |
+ | MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | **46.3** |
+ | ***Mathematics*** | | | | | |
+ | GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | **79.9** |
+ | MATH | 13.1 | 24.3 | 20.5 | 20.3 | **44.2** |
+ | ***Chinese*** | | | | | |
+ | C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | **83.2** |
+ | CMMLU | - | - | 50.8 | 73.1 | **83.9** |
+ | ***Multilingual*** | | | | | |
+ | Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | **59.2** |
+ | Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | **72.0** |
+ | Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | **57.5** |
+ | Multi-Translation | 23.3 | 31.2 | **31.9** | 28.4 | 31.5 |
+
  ## Citation

  If you find our work helpful, feel free to cite our work.