Commit 9f16d04

chore: Move Benchmarking to Top Level
This change moves the examples/llm benchmarking code to benchmarks/llm. Includes corrections and style changes to the README as well.
1 parent 3d49970 commit 9f16d04

File tree

4 files changed: +139 -99 lines changed


benchmarks/llm/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../../examples/llm/benchmarks/README.md
File renamed without changes.
File renamed without changes.

examples/llm/benchmarks/README.md

Lines changed: 138 additions & 99 deletions
@@ -24,35 +24,47 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs)

## Prerequisites

-H100 80GB x8 node(s) are required for benchmarking.
+> [!Important] At least one 8xH100-80GB node is required for benchmarking.
+
+1. Build benchmarking image
+
+```bash
+./container/build.sh
+```
+
+2. Download model
+
+```bash
+huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+```
+
+3. Start NATS and ETCD
+
+```bash
+docker compose -f deploy/docker_compose.yml up -d
+```
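
A quick way to confirm the prerequisite services came up is to list the compose services (a minimal check; the service names come from the provided deploy/docker_compose.yml):

```bash
# Confirm the NATS and ETCD containers started by the compose file are up.
docker compose -f deploy/docker_compose.yml ps
```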

> [!NOTE]
> This guide was tested on node(s) with the following hardware configuration:
-> * **GPUs**: 8xH100 80GB HBM3 (GPU Memory Bandwidth 3.2 TBs)
-> * **CPU**: 2x Intel Saphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 Ghz (Max boost), PCIe Gen5
-> * **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
-> * **InfiniBand**: 8X400Gbit/s (Compute Links), 2X400Gbit/s (Storage Links)
+>
+> * **GPUs**:
+>   8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TBs)
+>
+> * **CPU**:
+>   2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 Ghz (Max boost), PCIe Gen5
+>
+> * **NVLink**:
+>   NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
+>
+> * **InfiniBand**:
+>   8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
>
> Benchmarking with a different hardware configuration may yield suboptimal results.

-1\. Build benchmarking image
-```bash
-./container/build.sh
-```
-
-2\. Download model
-```bash
-huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-```
-
-3\. Start NATS and ETCD
-```bash
-docker compose -f deploy/docker_compose.yml up -d
-```

## Disaggregated Single Node Benchmarking

-One H100 80GB x8 node is required for this setup.
+> [!Important] One 8xH100-80GB node is required for this setup.

In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
@@ -64,24 +76,30 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:

-1\. Run benchmarking container
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+1. Run benchmarking container

-2\. Start disaggregated services
-```bash
-cd /workspace/examples/llm
-dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
-```
-Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
+```bash
+./container/run.sh --mount-workspace
+```
+
+> [!Tip]
+> The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
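
For example, to reuse an existing Hugging Face cache from the host when launching the container (the path shown is the common default; adjust it to your environment):

```bash
# Mount the workspace and point the container at the host's Hugging Face cache.
./container/run.sh --mount-workspace --hf-cache ~/.cache/huggingface
```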
+
+2. Start disaggregated services
+
+```bash
+cd /workspace/examples/llm
+dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
+```
+
+> [!Tip]
+> Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
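
One way to check is to watch the log and, once it reports the frontend is up, probe the OpenAI-compatible endpoint (a sketch; the port 8000 and the /v1/models route are assumptions and should match your frontend configuration):

```bash
# Watch the service log, then probe the frontend once it reports readiness.
tail -n 50 disagg.log
curl -s http://localhost:8000/v1/models   # assumed frontend port and route
```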

-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
+3. Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Disaggregated Multi Node Benchmarking

-Two H100 80GB x8 nodes are required for this setup.
+> [!Important] Two 8xH100-80GB nodes are required for this setup.

In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
@@ -93,87 +111,108 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

-1\. Run benchmarking container (node 0 & 1)
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+1. Run benchmarking container (node 0 & 1)

-2\. Config NATS and ETCD (node 1)
-```bash
-export NATS_SERVER="nats://<node_0_ip_addr>"
-export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
-```
-Note: Node 1 must be able to reach Node 0 over the network for the above services.
+```bash
+./container/run.sh --mount-workspace
+```

-3\. Start workers (node 0)
-```bash
-cd /workspace/examples/llm
-dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
-```
-Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
+> [!Tip]
+> The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

-4\. Start workers (node 1)
-```bash
-cd /workspace/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
-```
-Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
+2. Config NATS and ETCD (node 1)

-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section above.
+```bash
+export NATS_SERVER="nats://<node_0_ip_addr>"
+export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
+```

-## vLLM Aggregated Baseline Benchmarking
+> [!Important]
+> Node 1 must be able to reach Node 0 over the network for the above services.
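
A quick reachability check from node 1 might look like the following sketch (the ports assume the NATS and etcd defaults, 4222 and 2379):

```bash
# From node 1: verify the NATS client port and the etcd health endpoint on node 0.
nc -zv <node_0_ip_addr> 4222
curl -s http://<node_0_ip_addr>:2379/health
```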

-One (or two) H100 80GB x8 nodes are required for this setup.
+3. Start workers (node 0)

-With the Dynamo repository and the benchmarking image available, perform the following steps:
+```bash
+cd /workspace/examples/llm
+dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
+```

-1\. Run benchmarking container
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+> [!Tip]
+> Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.

-2\. Start vLLM serve
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
---block-size 128 \
---max-model-len 3500 \
---max-num-batched-tokens 3500 \
---tensor-parallel-size 4 \
---gpu-memory-utilization 0.95 \
---disable-log-requests \
---port 8001 1> vllm_0.log 2>&1 &
-CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
---block-size 128 \
---max-model-len 3500 \
---max-num-batched-tokens 3500 \
---tensor-parallel-size 4 \
---gpu-memory-utilization 0.95 \
---disable-log-requests \
---port 8002 1> vllm_1.log 2>&1 &
-```
-Notes:
-* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
-* If benchmarking over 2 nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
+4. Start workers (node 1)

-3\. Use NGINX as load balancer
-```bash
-apt update && apt install -y nginx
-cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
-service nginx restart
-```
-Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
+```bash
+cd /workspace/examples/llm
+dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
+```
+
+> [!Tip]
+> Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
+
+5. Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section above.
+
+## vLLM Aggregated Baseline Benchmarking
+
+> [!Important] One (or two) 8xH100-80GB nodes are required for this setup.
+
+With the Dynamo repository and the benchmarking image available, perform the following steps:

-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
+1. Run benchmarking container
+
+```bash
+./container/run.sh --mount-workspace
+```
+
+> [!Tip]
+> The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+
+2. Start vLLM serve
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+--block-size 128 \
+--max-model-len 3500 \
+--max-num-batched-tokens 3500 \
+--tensor-parallel-size 4 \
+--gpu-memory-utilization 0.95 \
+--disable-log-requests \
+--port 8001 1> vllm_0.log 2>&1 &
+CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+--block-size 128 \
+--max-model-len 3500 \
+--max-num-batched-tokens 3500 \
+--tensor-parallel-size 4 \
+--gpu-memory-utilization 0.95 \
+--disable-log-requests \
+--port 8002 1> vllm_1.log 2>&1 &
+```
+
+> [!Tip]
+> * Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
+> * If benchmarking over 2 nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
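
Once both instances have finished loading, each should answer the OpenAI-compatible model listing on its own port, for example:

```bash
# Probe both vLLM instances; each returns its model list once ready to serve.
curl -s http://localhost:8001/v1/models
curl -s http://localhost:8002/v1/models
```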
+
+3. Use NGINX as load balancer
+
+```bash
+apt update && apt install -y nginx
+cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
+service nginx restart
+```
+
+> [!Note]
+> If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
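
As an illustration only, a two-node `upstream` block might look roughly like this (hypothetical sketch; the upstream name and server entries must match the provided nginx.conf and the ports your `vllm serve` instances actually use):

```nginx
# Hypothetical two-node upstream: one vllm serve instance per node with --tensor-parallel-size 8.
upstream vllm_servers {
    server localhost:8001;          # vllm serve on this node
    server <node_1_ip_addr>:8001;   # vllm serve on the second node
}
```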
+
+4. Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Collecting Performance Numbers

Run the benchmarking script
+
```bash
-bash -x /workspace/examples/llm/benchmarks/perf.sh
+bash -x /workspace/benchmarks/llm/perf.sh
```

-## Future Roadmap
-
-* Results Interpretation
+> [!Tip]
+> See [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
+> for additional information about how to run GenAI-Perf and how to interpret results.
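
For reference, a hand-run GenAI-Perf invocation against the frontend might look roughly like the sketch below; flag names vary between GenAI-Perf releases, the URL and concurrency are assumptions, and the perf.sh script above remains the supported entry point:

```bash
# Hypothetical single-point measurement; see the GenAI-Perf tutorial for the full option set.
genai-perf profile \
  -m neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
  --endpoint-type chat \
  --streaming \
  --url http://localhost:8000 \
  --concurrency 16
```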
