File size: 3,911 Bytes
4e9ddf9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b5a071f
4e9ddf9
b5a071f
4e9ddf9
 
 
 
 
 
 
 
 
 
b5a071f
4e9ddf9
 
 
b5a071f
4e9ddf9
 
 
 
 
4a385c8
4e9ddf9
4a385c8
 
 
4e9ddf9
 
 
 
 
 
 
 
862fdcc
4a385c8
 
 
 
 
862fdcc
4a385c8
862fdcc
 
 
 
 
 
 
4a385c8
862fdcc
4a385c8
 
 
 
 
862fdcc
 
4a385c8
 
 
 
 
 
862fdcc
 
4a385c8
862fdcc
 
4a385c8
 
 
 
 
 
 
 
862fdcc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
# Running benchmarks on multiple GPU nodes with Pegasus

[Pegasus](https://github.com/jaywonchung/pegasus) is an SSH-based multi-node command runner.
Different models have different verbosity, and benchmarking takes vastly different amounts of time.
Therefore, we want an automated piece of software that drains a queue of benchmarking jobs (one job per model) on a set of GPUs.

## Setup

### Install Pegasus

Pegasus needs to keep SSH connections with all the nodes in order to queue up and run jobs over SSH.
So you should install and run Pegasus on a computer that you can keep awake.

If you already have Rust set up:

```console
$ cargo install pegasus-ssh
```

Otherwise, you can set up Rust [here](https://www.rust-lang.org/tools/install), or just download Pegasus release binaries [here](https://github.com/jaywonchung/pegasus/releases/latest).

### Necessary setup for each node

Every node must have two things:

1. This repository cloned under `~/workspace/leaderboard`.
  - If you want a different path, search and replace in `spawn-containers.yaml`.
2. Model weights under `/data/leaderboard/weights`.
  - If you want a different path, search and replace in `setupspawn-containers.yaml` and `benchmark.yaml`.

### Specify node names for Pegasus

Modify `hosts.yaml` with nodes. See the file for an example.

- `hostname`: List the hostnames you would use in order to `ssh` into the node, e.g. `jaywonchung@gpunode01`.
- `gpu`: We want to create one Docker container for each GPU. List the indices of the GPUs you would like to use for the hosts.

### Set up Docker containers on your nodes with Pegasus

This spawns one container per GPU (named `leaderboard%d`), for every node.

```console
$ cd pegasus
$ cp spawn-containers.yaml queue.yaml
$ pegasus b
```

`b` stands for broadcast. Every command is run once on all (`hostname`, `gpu`) combinations.

## System benchmark

This will benchmark each model and get you data for the columns `energy`, `throughput`, `latency`, and `response_length`.

Use Pegasus to run benchmarks for all the models across all nodes.

```console
$ cd pegasus
$ cp benchmark.yaml queue.yaml
$ pegasus q
```

`q` stands for queue. Each command is run once on the next available (`hostname`, `gpu`) combination.

After all the tasks finish, aggregate all the data into one node and run [`compute_system_metrics.py`](../scripts/compute_system_metrics.py) to generate CSV files that the leaderboard can display.

## NLP benchmark

We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).

Use Pegasus to run benchmarks for all the models across all nodes.

```console
$ cd pegasus
$ cp nlp-eval.yaml queue.yaml
$ pegasus q
```

After all the tasks finish, aggregate all the data into one node and run [`aggregate_nlp_metrics.py`](../scripts/aggregate_nlp_metrics.py) to generate a single `score.csv` that the leaderboard can display.

### Dealing with OOM

Some tasks might run out of memory, in which case you should create a container with more GPUs:

1. Create a container with two GPUs, for example:

```console
$ docker run -dit \
    --name leaderboard01 \
    --gpus '"device=0,1"' \
    -v /data/leaderboard:/data/leaderboard \
    -v $HOME/workspace/leaderboard:/workspace/leaderboard \
    mlenergy/leaderboard:latest bash
```

2. Revise `nlp-eval.yaml` and run with Pegasus, or run directly like this on LLaMA 7B and ARC, for example:

```console
$ docker exec leaderboard01 \
    python lm-evaluation-harness/main.py \
    --device cuda \
    --no_cache \
    --model hf-causal-experimental \
    --model_args pretrained=/data/leaderboard/weights/metaai/llama-7B,trust_remote_code=True,use_accelerate=True \
    --tasks arc_challenge \
    --num_fewshot 25
```