## Prerequisites

> [!IMPORTANT]
> At least one 8xH100-80GB node is required for benchmarking.

1. Build benchmarking image

   ```bash
   ./container/build.sh
   ```

2. Download model

   ```bash
   huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
   ```

3. Start NATS and ETCD

   ```bash
   docker compose -f deploy/docker_compose.yml up -d
   ```
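
You can optionally confirm that the NATS and ETCD containers are up before moving on. This is a minimal check, assuming the services are defined in `deploy/docker_compose.yml` as shown above:

```bash
# List the services started by the compose file along with their current status.
docker compose -f deploy/docker_compose.yml ps
```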

> [!NOTE]
> This guide was tested on node(s) with the following hardware configuration:
>
> * **GPUs**:
>   8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
>
> * **CPU**:
>   2x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max Boost), PCIe Gen5
>
> * **NVLink**:
>   NVLink 4th Generation, 900 GB/s (GPU-to-GPU NVLink bidirectional bandwidth), 18 Links per GPU
>
> * **InfiniBand**:
>   8x400 Gbit/s (Compute Links), 2x400 Gbit/s (Storage Links)
>
> Benchmarking with a different hardware configuration may yield suboptimal results.

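If you want to verify that your node's GPU interconnect matches the configuration above before benchmarking, one quick check (assuming the NVIDIA driver utilities are installed) is:

```bash
# Print the GPU-to-GPU interconnect matrix; NVLink connections appear as NV<n> entries.
nvidia-smi topo -m
```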

## Disaggregated Single Node Benchmarking

> [!IMPORTANT]
> One 8xH100-80GB node is required for this setup.

In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:

1. Run benchmarking container

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!TIP]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Start disaggregated services

   ```bash
   cd /workspace/examples/llm
   dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
   ```

   > [!TIP]
   > Check the `disagg.log` to make sure the service is fully started before collecting performance numbers. A hypothetical smoke-test request is sketched after this list.

3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
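
Before running the full benchmark, you can optionally send a single request through the frontend to confirm the disaggregated deployment responds end to end. This is a hypothetical smoke test; it assumes the frontend exposes an OpenAI-compatible endpoint on port 8000, so check `disagg.yaml` for the actual host and port used by your deployment:

```bash
# Hypothetical smoke test; adjust the URL to match the port configured in disagg.yaml.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 16
      }'
```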

## Disaggregated Multi Node Benchmarking

> [!IMPORTANT]
> Two 8xH100-80GB nodes are required for this setup.

In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

1. Run benchmarking container (node 0 & 1)

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!TIP]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Configure NATS and ETCD (node 1)

   ```bash
   export NATS_SERVER="nats://<node_0_ip_addr>"
   export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
   ```

   > [!IMPORTANT]
   > Node 1 must be able to reach Node 0 over the network for the above services. A connectivity check is sketched after this list.

3. Start workers (node 0)

   ```bash
   cd /workspace/examples/llm
   dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
   ```

   > [!TIP]
   > Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.

4. Start workers (node 1)

   ```bash
   cd /workspace/examples/llm
   dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
   ```

   > [!TIP]
   > Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.

5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
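
Before starting the workers (steps 3 and 4 above), you can verify from node 1 that the NATS and ETCD endpoints on node 0 are reachable. This is a minimal sketch; it assumes the default NATS client port 4222 and the ETCD client port 2379 used in the exports above:

```bash
# Run on node 1. Both checks should succeed before starting the workers.
nc -zv <node_0_ip_addr> 4222                     # NATS (default client port, assumed)
curl -s http://<node_0_ip_addr>:2379/health      # ETCD health endpoint
```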

## vLLM Aggregated Baseline Benchmarking

> [!IMPORTANT]
> One (or two) 8xH100-80GB nodes are required for this setup.

With the Dynamo repository and the benchmarking image available, perform the following steps:

1. Run benchmarking container

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!TIP]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Start `vllm serve`

   ```bash
   CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
     --block-size 128 \
     --max-model-len 3500 \
     --max-num-batched-tokens 3500 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.95 \
     --disable-log-requests \
     --port 8001 1> vllm_0.log 2>&1 &
   CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
     --block-size 128 \
     --max-model-len 3500 \
     --max-num-batched-tokens 3500 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.95 \
     --disable-log-requests \
     --port 8002 1> vllm_1.log 2>&1 &
   ```

   > [!TIP]
   > * Check the `vllm_0.log` and `vllm_1.log` to make sure the services are fully started before collecting performance numbers.
   > * If benchmarking over 2 nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.

3. Use NGINX as load balancer

   ```bash
   apt update && apt install -y nginx
   cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
   service nginx restart
   ```

   > [!NOTE]
   > If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to point to the `vllm serve` instance on the second node. A reachability check is sketched after this list.

4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
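
When running the 2-node variant, the `upstream` block in `nginx.conf` must list both vLLM instances. Before restarting NGINX, you can confirm that each backend responds. This is a minimal sketch assuming the ports from the `vllm serve` commands above; `<node_1_ip_addr>` is a placeholder for the second node:

```bash
# Each request should return the served model list if the backend is up.
curl -s http://localhost:8001/v1/models      # single-node setup: first local instance
curl -s http://localhost:8002/v1/models      # single-node setup: second local instance
# 2-node setup only: the single instance on the second node.
# curl -s http://<node_1_ip_addr>:8001/v1/models
```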

## Collecting Performance Numbers

Run the benchmarking script:

```bash
bash -x /workspace/benchmarks/llm/perf.sh
```

> [!TIP]
> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
> for additional information about how to run GenAI-Perf and how to interpret results.
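
If you want to experiment outside of `perf.sh`, a direct GenAI-Perf invocation along these lines can be used. This is a hypothetical sketch: flag names can vary between GenAI-Perf releases, and the URL, concurrency, and token-count values are placeholders, so consult the tutorial linked above for the authoritative options:

```bash
# Hypothetical example; verify flags against your installed GenAI-Perf version.
genai-perf profile \
  -m neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
  --endpoint-type chat \
  --streaming \
  --url http://localhost:8000 \
  --concurrency 8 \
  --synthetic-input-tokens-mean 3000 \
  --output-tokens-mean 150
```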