yuvraj17 committed
Commit 72c0667
1 Parent(s): b0c88be

Added Nous Eval-Scores

Files changed (1)
1. README.md +69 -1
README.md CHANGED
@@ -61,4 +61,72 @@ pipeline = transformers.pipeline(
 
 outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
 print(outputs[0]["generated_text"])
-```
+```
+
+# 🏆 Evaluation Scores
+
+## Nous
+
+| Model | AGIEval | TruthfulQA | BigBench |
+|---------------------------------------------------------------------------------------------|------:|---------:|-------:|
+| [yuvraj17/Llama3-8B-Instruct-Slerp](https://huggingface.co/yuvraj17/Llama3-8B-Instruct-Slerp) | 38.32 | 57.15 | 43.91 |
+
+
+### AGIEval
+
+| Task | Version | Metric | Value | | Stderr |
+|--------------------------------|------:|----------|------:|---|-------:|
+| agieval_aqua_rat | 0 | acc | 23.62 | ± | 2.67 |
+| | | acc_norm | 22.05 | ± | 2.61 |
+| agieval_logiqa_en | 0 | acc | 27.50 | ± | 1.75 |
+| | | acc_norm | 31.80 | ± | 1.83 |
+| agieval_lsat_ar | 0 | acc | 21.30 | ± | 2.71 |
+| | | acc_norm | 20.87 | ± | 2.69 |
+| agieval_lsat_lr | 0 | acc | 35.29 | ± | 2.12 |
+| | | acc_norm | 37.65 | ± | 2.15 |
+| agieval_lsat_rc | 0 | acc | 42.01 | ± | 3.01 |
+| | | acc_norm | 39.78 | ± | 2.99 |
+| agieval_sat_en | 0 | acc | 55.83 | ± | 3.47 |
+| | | acc_norm | 50.49 | ± | 3.49 |
+| agieval_sat_en_without_passage | 0 | acc | 36.89 | ± | 3.37 |
+| | | acc_norm | 34.95 | ± | 3.33 |
+| agieval_sat_math | 0 | acc | 29.55 | ± | 3.08 |
+| | | acc_norm | 28.64 | ± | 3.05 |
+
+**Average score**: 33.28%
+
+### TruthfulQA
+
+| Task | Version | Metric | Value | | Stderr |
+|---------------------|------:|--------|------:|---|-------:|
+| truthfulqa_mc | 1 | mc1 | 33.54 | ± | 1.65 |
+| | | mc2 | 49.78 | ± | 1.53 |
+
+**Average score**: 49.78%
+
+### BigBench
+
+| Task | Version | Metric | Value | | Stderr |
+|--------------------------------------------------|------:|-----------------------|------:|---|-------:|
+| bigbench_causal_judgement | 0 | multiple_choice_grade | 47.89 | ± | 3.63 |
+| bigbench_date_understanding | 0 | multiple_choice_grade | 39.02 | ± | 2.54 |
+| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 33.72 | ± | 2.95 |
+| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.61 | ± | 2.14 |
+| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 31.40 | ± | 2.08 |
+| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.71 | ± | 1.61 |
+| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.00 | ± | 2.89 |
+| bigbench_movie_recommendation | 0 | multiple_choice_grade | 27.40 | ± | 1.99 |
+| bigbench_navigate | 0 | multiple_choice_grade | 50.10 | ± | 1.58 |
+| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 38.40 | ± | 1.09 |
+| bigbench_ruin_names | 0 | multiple_choice_grade | 27.23 | ± | 2.11 |
+| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 25.45 | ± | 1.38 |
+| bigbench_snarks | 0 | multiple_choice_grade | 46.41 | ± | 3.72 |
+| bigbench_sports_understanding | 0 | multiple_choice_grade | 50.30 | ± | 1.59 |
+| bigbench_temporal_sequences | 0 | multiple_choice_grade | 37.30 | ± | 1.53 |
+| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.36 | ± | 1.16 |
+| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.14 | ± | 0.90 |
+| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.00 | ± | 2.89 |
+
+**Average score**: 35.38%
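A note on how the summary numbers relate to the tables: the AGIEval average of 33.28 is the plain mean of the eight `acc_norm` values, and the TruthfulQA average of 49.78 is simply the `mc2` score. A quick Python sketch to verify that arithmetic, with values copied from the tables above:

```python
# Verify the reported per-suite averages against the table values above.

# AGIEval: mean of the acc_norm column, one value per task.
agieval_acc_norm = [22.05, 31.80, 20.87, 37.65, 39.78, 50.49, 34.95, 28.64]
print(f"AGIEval: {sum(agieval_acc_norm) / len(agieval_acc_norm):.2f}")  # -> 33.28

# TruthfulQA: the reported average is the mc2 score itself.
truthfulqa_mc2 = 49.78
print(f"TruthfulQA: {truthfulqa_mc2:.2f}")  # -> 49.78
```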
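For anyone wanting to reproduce scores in this style, tasks like these are commonly run with EleutherAI's lm-evaluation-harness. Below is a minimal sketch, not the exact setup used for the numbers above: it assumes `pip install lm-eval` (version 0.4 or later) and that the task name exists in your installed harness; the Nous suite was originally run on a fork of the harness, so task names and exact scores may differ.

```python
# A minimal sketch of scoring the merged model on one task with
# EleutherAI's lm-evaluation-harness (assumed: lm-eval >= 0.4 installed).
# Task names vary across harness versions; adjust to what yours exposes.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=yuvraj17/Llama3-8B-Instruct-Slerp",
    tasks=["truthfulqa_mc2"],  # one of the tasks reported above
)
print(results["results"])  # per-task metrics dict
```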