## Prerequisites

> [!IMPORTANT]
> At least one 8xH100-80GB node is required for benchmarking.

1. Build benchmarking image

   ```bash
   ./container/build.sh
   ```

2. Download model

   ```bash
   huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
   ```

3. Start NATS and ETCD

   ```bash
   docker compose -f deploy/docker_compose.yml up -d
   ```
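
You can optionally confirm that the NATS and ETCD containers are up before moving on. This is a minimal check, assuming the services are defined in `deploy/docker_compose.yml` as shown above:

```bash
# List the services started by the compose file along with their current status.
docker compose -f deploy/docker_compose.yml ps
```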

> [!NOTE]
> This guide was tested on node(s) with the following hardware configuration:
>
> * **GPUs**:
>   8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
>
> * **CPU**:
>   2x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max Boost), PCIe Gen5
>
> * **NVLink**:
>   NVLink 4th Generation, 900 GB/s (GPU-to-GPU NVLink bidirectional bandwidth), 18 Links per GPU
>
> * **InfiniBand**:
>   8x400 Gbit/s (Compute Links), 2x400 Gbit/s (Storage Links)
>
> Benchmarking with a different hardware configuration may yield suboptimal results.

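If you want to verify that your node's GPU interconnect matches the configuration above before benchmarking, one quick check (assuming the NVIDIA driver utilities are installed) is:

```bash
# Print the GPU-to-GPU interconnect matrix; NVLink connections appear as NV<n> entries.
nvidia-smi topo -m
```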

## Disaggregated Single Node Benchmarking

> [!IMPORTANT]
> One 8xH100-80GB node is required for this setup.

In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:

1. Run benchmarking container

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!TIP]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Start disaggregated services

   ```bash
   cd /workspace/examples/llm
   dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
   ```

   > [!TIP]
   > Check the `disagg.log` to make sure the service is fully started before collecting performance numbers. A hypothetical smoke-test request is sketched after this list.

3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
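
Before running the full benchmark, you can optionally send a single request through the frontend to confirm the disaggregated deployment responds end to end. This is a hypothetical smoke test; it assumes the frontend exposes an OpenAI-compatible endpoint on port 8000, so check `disagg.yaml` for the actual host and port used by your deployment:

```bash
# Hypothetical smoke test; adjust the URL to match the port configured in disagg.yaml.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 16
      }'
```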

## Disaggregated Multi Node Benchmarking

> [!IMPORTANT]
> Two 8xH100-80GB nodes are required for this setup.

In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

1. Run benchmarking container (node 0 & 1)

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!TIP]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Configure NATS and ETCD (node 1)

   ```bash
   export NATS_SERVER="nats://<node_0_ip_addr>"
   export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
   ```

   > [!IMPORTANT]
   > Node 1 must be able to reach Node 0 over the network for the above services. A connectivity check is sketched after this list.

3. Start workers (node 0)

   ```bash
   cd /workspace/examples/llm
   dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
   ```

   > [!TIP]
   > Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.

4. Start workers (node 1)

   ```bash
   cd /workspace/examples/llm
   dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
   ```

   > [!TIP]
   > Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.

5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
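
Before starting the workers (steps 3 and 4 above), you can verify from node 1 that the NATS and ETCD endpoints on node 0 are reachable. This is a minimal sketch; it assumes the default NATS client port 4222 and the ETCD client port 2379 used in the exports above:

```bash
# Run on node 1. Both checks should succeed before starting the workers.
nc -zv <node_0_ip_addr> 4222                     # NATS (default client port, assumed)
curl -s http://<node_0_ip_addr>:2379/health      # ETCD health endpoint
```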

## vLLM Aggregated Baseline Benchmarking

> [!IMPORTANT]
> One (or two) 8xH100-80GB nodes are required for this setup.

With the Dynamo repository and the benchmarking image available, perform the following steps:

1. Run benchmarking container

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!TIP]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Start `vllm serve`

   ```bash
   CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
     --block-size 128 \
     --max-model-len 3500 \
     --max-num-batched-tokens 3500 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.95 \
     --disable-log-requests \
     --port 8001 1> vllm_0.log 2>&1 &
   CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
     --block-size 128 \
     --max-model-len 3500 \
     --max-num-batched-tokens 3500 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.95 \
     --disable-log-requests \
     --port 8002 1> vllm_1.log 2>&1 &
   ```

   > [!TIP]
   > * Check the `vllm_0.log` and `vllm_1.log` to make sure the services are fully started before collecting performance numbers.
   > * If benchmarking over 2 nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.

3. Use NGINX as load balancer

   ```bash
   apt update && apt install -y nginx
   cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
   service nginx restart
   ```

   > [!NOTE]
   > If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to point to the `vllm serve` instance on the second node. A reachability check is sketched after this list.

4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
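
When running the 2-node variant, the `upstream` block in `nginx.conf` must list both vLLM instances. Before restarting NGINX, you can confirm that each backend responds. This is a minimal sketch assuming the ports from the `vllm serve` commands above; `<node_1_ip_addr>` is a placeholder for the second node:

```bash
# Each request should return the served model list if the backend is up.
curl -s http://localhost:8001/v1/models      # single-node setup: first local instance
curl -s http://localhost:8002/v1/models      # single-node setup: second local instance
# 2-node setup only: the single instance on the second node.
# curl -s http://<node_1_ip_addr>:8001/v1/models
```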

## Collecting Performance Numbers

Run the benchmarking script:

```bash
bash -x /workspace/benchmarks/llm/perf.sh
```

> [!TIP]
> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
> for additional information about how to run GenAI-Perf and how to interpret results.
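
If you want to experiment outside of `perf.sh`, a direct GenAI-Perf invocation along these lines can be used. This is a hypothetical sketch: flag names can vary between GenAI-Perf releases, and the URL, concurrency, and token-count values are placeholders, so consult the tutorial linked above for the authoritative options:

```bash
# Hypothetical example; verify flags against your installed GenAI-Perf version.
genai-perf profile \
  -m neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
  --endpoint-type chat \
  --streaming \
  --url http://localhost:8000 \
  --concurrency 8 \
  --synthetic-input-tokens-mean 3000 \
  --output-tokens-mean 150
```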