Lin-K76 committed on
Commit 3aed33c
1 Parent(s): fcc898a

Update README.md

Files changed (1):
  1. README.md +46 -25
README.md CHANGED
@@ -33,7 +33,7 @@ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
-It achieves an average score of 73.67 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 74.17.
+It achieves an average score of 73.44 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.79.
 
 ### Model Optimizations
 
@@ -117,11 +117,11 @@ model_stub = "meta-llama/Meta-Llama-3.1-8B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16
+    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
-    model_stub, torch_dtype=torch.float16, device_map=device_map
+    model_stub, torch_dtype="auto", device_map=device_map
 )
 tokenizer = AutoTokenizer.from_pretrained(model_stub)
 
@@ -171,7 +171,7 @@ oneshot(
 
 The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
+This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
 ### Accuracy
 
@@ -190,71 +190,81 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 <tr>
 <td>MMLU (5-shot)
 </td>
-<td>67.94
+<td>67.95
 </td>
-<td>68.00
+<td>67.97
 </td>
 <td>100.0%
 </td>
 </tr>
+<tr>
+<td>MMLU-cot (0-shot)
+</td>
+<td>71.24
+</td>
+<td>71.12
+</td>
+<td>99.83%
+</td>
+</tr>
 <tr>
 <td>ARC Challenge (0-shot)
 </td>
-<td>83.11
+<td>82.00
 </td>
-<td>82.25
+<td>81.66
 </td>
-<td>98.97%
+<td>99.59%
 </td>
 </tr>
 <tr>
-<td>GSM-8K (CoT, 8-shot, strict-match)
+<td>GSM-8K-cot (8-shot, strict-match)
 </td>
-<td>82.03
+<td>81.96
 </td>
-<td>81.80
+<td>81.12
 </td>
-<td>99.72%
+<td>98.98%
 </td>
 </tr>
 <tr>
 <td>Hellaswag (10-shot)
 </td>
-<td>80.01
+<td>80.46
 </td>
-<td>79.56
+<td>80.4
 </td>
-<td>99.44%
+<td>99.93%
 </td>
 </tr>
 <tr>
 <td>Winogrande (5-shot)
 </td>
-<td>77.90
+<td>78.45
 </td>
-<td>77.58
+<td>77.90
 </td>
-<td>99.59%
+<td>99.30%
 </td>
 </tr>
 <tr>
 <td>TruthfulQA (0-shot, mc2)
 </td>
-<td>54.04
+<td>54.50
 </td>
-<td>52.84
+<td>53.92
 </td>
-<td>97.78%
+<td>98.94%
 </td>
 </tr>
 <tr>
 <td><strong>Average</strong>
 </td>
-<td><strong>74.17</strong>
+<td><strong>73.79</strong>
 </td>
-<td><strong>73.67</strong>
+<td><strong>73.44</strong>
 </td>
-<td><strong>99.33%</strong>
+<td><strong>99.52%</strong>
 </td>
 </tr>
 </table>
@@ -273,6 +283,17 @@ lm_eval \
   --batch_size auto
 ```
 
+#### MMLU-cot
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
 #### ARC-Challenge
 ```
 lm_eval \
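
The substantive code change in this commit swaps `torch_dtype=torch.float16` for `torch_dtype="auto"`, which defers to the dtype recorded in the checkpoint config rather than forcing a float16 cast. The snippet below is an illustration of that flag only, using plain `transformers` instead of the `SparseAutoModelForCausalLM` wrapper used in the card; it is not part of the commit.

```
# Illustration of the torch_dtype="auto" behavior this commit switches to,
# using plain transformers rather than the card's SparseAutoModelForCausalLM.
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# "auto" loads the dtype stored in the checkpoint config instead of
# forcing a cast to float16 as the previous snippet did.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
print(next(model.parameters()).dtype)  # expected: torch.bfloat16 for this checkpoint
```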
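For reference, the recovery column in the updated table is the FP8 score expressed as a percentage of the unquantized score, and the averages quoted in the README text (73.79 unquantized vs. 73.44 quantized) are plain means over the seven benchmark rows. The sketch below, illustrative only and not part of the commit, reproduces the updated numbers; note the card reports MMLU recovery as 100.0% even though the FP8 score is marginally above the baseline.

```
# Illustrative only: recompute the recovery column and the averages
# from the updated table values (unquantized baseline, FP8 quantized).
scores = {
    "MMLU (5-shot)": (67.95, 67.97),
    "MMLU-cot (0-shot)": (71.24, 71.12),
    "ARC Challenge (0-shot)": (82.00, 81.66),
    "GSM-8K-cot (8-shot, strict-match)": (81.96, 81.12),
    "Hellaswag (10-shot)": (80.46, 80.40),
    "Winogrande (5-shot)": (78.45, 77.90),
    "TruthfulQA (0-shot, mc2)": (54.50, 53.92),
}

for name, (baseline, fp8) in scores.items():
    print(f"{name}: {100 * fp8 / baseline:.2f}% recovery")

baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
fp8_avg = sum(q for _, q in scores.values()) / len(scores)
print(f"Average: {baseline_avg:.2f} vs {fp8_avg:.2f} "
      f"({100 * fp8_avg / baseline_avg:.2f}% recovery)")
```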
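Since the evaluation commands load the quantized checkpoint through vLLM, a minimal inference sketch along the same lines may also be useful. It assumes the checkpoint name used in those commands (neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8) and a vLLM build with FP8 weight support; the prompt and sampling settings are arbitrary placeholders.

```
# Minimal sketch: run the FP8 checkpoint referenced in the evaluation
# commands through vLLM. Prompt and sampling settings are placeholders.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"

# Build a chat-formatted prompt with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Who are you?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# max_model_len mirrors the value used in the lm_eval commands above.
llm = LLM(model=model_id, max_model_len=4096)
outputs = llm.generate(
    [prompt], SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
)
print(outputs[0].outputs[0].text)
```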