Zhiyu Wu committed on
Commit
862fdcc
1 Parent(s): d846882

Add Pegasus scripts for running NLP evaluation (#9)

Files changed (2)
  1. pegasus/README.md +26 -0
  2. pegasus/nlp-eval.yaml +68 -0
pegasus/README.md CHANGED
@@ -58,3 +58,29 @@ $ pegasus q
  ```

  `q` stands for queue. Each command is run once on the next available (`hostname`, `gpu`) combination.
+
+ ## NLP-eval
+
+ Now use Pegasus to run benchmarks for all the models across all nodes.
+
+ ```console
+ $ cd pegasus
+ $ cp nlp-eval.yaml queue.yaml
+ $ pegasus q
+ ```
+
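For reference, `nlp-eval.yaml` (added in full below) is a list of Pegasus queue entries. Each entry pairs a templated command with a `model` list; Pegasus runs the command once per listed model, filling `{{model}}` with the model path and `{{ gpu }}` with (presumably) the GPU index of the next free (`hostname`, `gpu`) slot. A minimal sketch of one entry, with the model list shortened here for illustration:

```yaml
- command:
    - docker exec leaderboard{{ gpu }} python lm-evaluation-harness/main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained={{model}},trust_remote_code=True,use_accelerate=True --tasks arc_challenge --num_fewshot 25
  model:
    - /data/leaderboard/weights/metaai/llama-7B
    - lmsys/fastchat-t5-3b-v1.0
```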
+ For some tasks, if the CUDA memory of a single GPU is not enough, you can use more GPUs as follows:
+
+ 1. Create a larger Docker container with more GPUs, e.g. two GPUs:
+
+ ```console
+ $ docker run -dit --name leaderboard_nlp_tasks --gpus '"device=0,1"' -v /data/leaderboard:/data/leaderboard -v $HOME/workspace/leaderboard:/workspace/leaderboard ml-energy:latest bash
+ ```
+
+ 2. Then run the specific task with Pegasus, or run it directly with:
+
+ ```console
+ $ docker exec leaderboard_nlp_tasks python lm-evaluation-harness/main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained={{model}},trust_remote_code=True,use_accelerate=True --tasks {{task}} --num_fewshot {{shot}}
+ ```
+
+ Change `{{model}}`, `{{task}}`, and `{{shot}}` to the specific model, task, and number of few-shot examples.
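As a concrete illustration (not part of this commit), here is the step-2 command with the placeholders filled in for one of the models and the ARC-Challenge settings from `nlp-eval.yaml` (25-shot):

```console
$ docker exec leaderboard_nlp_tasks python lm-evaluation-harness/main.py \
    --device cuda --no_cache \
    --model hf-causal-experimental \
    --model_args pretrained=/data/leaderboard/weights/metaai/llama-13B,trust_remote_code=True,use_accelerate=True \
    --tasks arc_challenge \
    --num_fewshot 25
```

With `--gpus '"device=0,1"'` and `use_accelerate=True`, the model should be sharded across both visible GPUs. You can check that both GPUs are visible inside the container with `docker exec leaderboard_nlp_tasks nvidia-smi`, assuming the NVIDIA container toolkit is installed (it is required for `--gpus` in any case).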
pegasus/nlp-eval.yaml ADDED
@@ -0,0 +1,68 @@
+ - command:
+     - docker exec leaderboard{{ gpu }} python lm-evaluation-harness/main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained={{model}},trust_remote_code=True,use_accelerate=True --tasks arc_challenge --num_fewshot 25
+   model:
+     - /data/leaderboard/weights/metaai/llama-7B
+     - /data/leaderboard/weights/metaai/llama-13B
+     - /data/leaderboard/weights/lmsys/vicuna-7B
+     - /data/leaderboard/weights/lmsys/vicuna-13B
+     - /data/leaderboard/weights/tatsu-lab/alpaca-7B
+     - /data/leaderboard/weights/BAIR/koala-7b
+     - /data/leaderboard/weights/BAIR/koala-13b
+     - camel-ai/CAMEL-13B-Combined-Data
+     - databricks/dolly-v2-12b
+     - FreedomIntelligence/phoenix-inst-chat-7b
+     - h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2
+     - lmsys/fastchat-t5-3b-v1.0
+     - Neutralzz/BiLLa-7B-SFT
+     - nomic-ai/gpt4all-13b-snoozy
+     - openaccess-ai-collective/manticore-13b-chat-pyg
+     - OpenAssistant/oasst-sft-1-pythia-12b
+     - project-baize/baize-v2-7B
+     - StabilityAI/stablelm-tuned-alpha-7b
+     - togethercomputer/RedPajama-INCITE-7B-Chat
+
+ - command:
+     - docker exec leaderboard{{ gpu }} python lm-evaluation-harness/main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained={{model}},trust_remote_code=True,use_accelerate=True --tasks hellaswag --num_fewshot 10
+   model:
+     - /data/leaderboard/weights/metaai/llama-7B
+     - /data/leaderboard/weights/metaai/llama-13B
+     - /data/leaderboard/weights/lmsys/vicuna-7B
+     - /data/leaderboard/weights/lmsys/vicuna-13B
+     - /data/leaderboard/weights/tatsu-lab/alpaca-7B
+     - /data/leaderboard/weights/BAIR/koala-7b
+     - /data/leaderboard/weights/BAIR/koala-13b
+     - camel-ai/CAMEL-13B-Combined-Data
+     - databricks/dolly-v2-12b
+     - FreedomIntelligence/phoenix-inst-chat-7b
+     - h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2
+     - lmsys/fastchat-t5-3b-v1.0
+     - Neutralzz/BiLLa-7B-SFT
+     - nomic-ai/gpt4all-13b-snoozy
+     - openaccess-ai-collective/manticore-13b-chat-pyg
+     - OpenAssistant/oasst-sft-1-pythia-12b
+     - project-baize/baize-v2-7B
+     - StabilityAI/stablelm-tuned-alpha-7b
+     - togethercomputer/RedPajama-INCITE-7B-Chat
+
+ - command:
+     - docker exec leaderboard{{ gpu }} python lm-evaluation-harness/main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained={{model}},trust_remote_code=True,use_accelerate=True --tasks truthfulqa_mc --num_fewshot 0
+   model:
+     - /data/leaderboard/weights/metaai/llama-7B
+     - /data/leaderboard/weights/metaai/llama-13B
+     - /data/leaderboard/weights/lmsys/vicuna-7B
+     - /data/leaderboard/weights/lmsys/vicuna-13B
+     - /data/leaderboard/weights/tatsu-lab/alpaca-7B
+     - /data/leaderboard/weights/BAIR/koala-7b
+     - /data/leaderboard/weights/BAIR/koala-13b
+     - camel-ai/CAMEL-13B-Combined-Data
+     - databricks/dolly-v2-12b
+     - FreedomIntelligence/phoenix-inst-chat-7b
+     - h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2
+     - lmsys/fastchat-t5-3b-v1.0
+     - Neutralzz/BiLLa-7B-SFT
+     - nomic-ai/gpt4all-13b-snoozy
+     - openaccess-ai-collective/manticore-13b-chat-pyg
+     - OpenAssistant/oasst-sft-1-pythia-12b
+     - project-baize/baize-v2-7B
+     - StabilityAI/stablelm-tuned-alpha-7b
+     - togethercomputer/RedPajama-INCITE-7B-Chat