
Commit d937608

Merge branch 'main' into rupei/load-metrics-push
2 parents 107ca09 + dbb4caa commit d937608

27 files changed: +1651 -347 lines

components/backends/llama_cpp/README.md

Lines changed: 2 additions & 8 deletions
@@ -13,16 +13,10 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
 
-The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
-
-For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
-
-For example,
 ```bash
 python3 -m dynamo.llama_cpp ... --migration-limit=3
 ```
-indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
 
-The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
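
For context, here is a minimal sketch of how the new flag fits into a full launch (the model path is the placeholder used at the top of this README, and the frontend port is illustrative): the worker advertises its migration limit, and the Frontend applies it if it loses connectivity to that worker.

```bash
# Minimal sketch: OpenAI-compatible frontend plus a llama.cpp worker whose
# in-flight requests may each be migrated up to 3 times before failing.
python3 -m dynamo.frontend --http-port=8000 &
python3 -m dynamo.llama_cpp \
  --model-path /data/models/Qwen3-0.6B-Q8_0.gguf \
  --migration-limit=3
```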

components/backends/sglang/README.md

Lines changed: 8 additions & 20 deletions
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 ### Large Scale P/D and WideEP Features
 
-| Feature | SGLang | Notes |
-|--------------------|--------|-----------------------------------------------------------------------|
-| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
-| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
-| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
+| Feature | SGLang | Notes |
+|---------------------|--------|--------------------------------------------------------------|
+| **WideEP** | ✅ | Full support on H100s/GB200 |
+| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP worker |
+| **GB200 Support** | ✅ | |
 
 
 ## Quick Start
@@ -94,12 +94,6 @@ cd $DYNAMO_ROOT/components/backends/sglang
 
 ### Aggregated Serving with KV Routing
 
-> [!NOTE]
-> The current implementation of `components/backends/sglang/src/dynamo/sglang/worker/main.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
-> • Dynamo: [ai-dynamo/dynamo #1465](https://github.com/ai-dynamo/dynamo/pull/1465)_feat: receive kvmetrics from sglang scheduler_.
->
-> After these are in, the TODOs in `main.py` will be resolved and the placeholder logic removed.
-
 ```bash
 cd $DYNAMO_ROOT/components/backends/sglang
 ./launch/agg_router.sh
@@ -143,25 +137,19 @@ When using MoE models, you can also use the our implementation of the native SGL
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
-
-The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
-
-For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
+You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
 
-For example,
 ```bash
 python3 -m dynamo.sglang ... --migration-limit=3
 ```
-indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
 
-The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
 
 ## Advanced Examples
 
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 
-### Run on multi-node
+### Run a multi-node sized model
 - **[Run a multi-node model](docs/multinode-examples.md)**
 
 ### Large scale P/D disaggregation with WideEP
New file · Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Running DeepSeek-R1 Disaggregated with WideEP on GB200s

Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end to end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we will run 1 prefill worker on 2 GB200 nodes (4 GPUs each) and 1 decode worker on 12 GB200 nodes, for a total of 56 GPUs.

## Instructions

1. Build the Dynamo container

```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-wideep-gb200 \
  --build-arg MODE=blackwell \
  --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
  --build-arg ARCH=arm64 \
  --build-arg ARCH_ALT=aarch64 \
  .
```

2. You can run this container on each 4xGB200 node using the following command.

> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)

```bash
docker run \
  --gpus all \
  -it \
  --rm \
  --network host \
  --volume /PATH_TO_DSR1_MODEL/:/model/ \
  --shm-size=10G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --ulimit nofile=65536:65536 \
  --cap-add CAP_SYS_PTRACE \
  --ipc host \
  dynamo-wideep-gb200:latest
```

3. On the head prefill node, run the provided helper script to generate the commands for starting `nats-server` and `etcd`. The script will also tell you which environment variables to export on each node to make deployment easier; an illustrative sketch of those variables follows the command below.

```bash
./utils/gen_env_vars.sh
```
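
The exact output is cluster-specific; the sketch below only illustrates the kind of variables the script asks you to export (the IPs are placeholders, and the `NATS_SERVER`/`ETCD_ENDPOINTS` names and ports are assumptions based on typical NATS/etcd defaults).

```bash
# Illustrative only -- export the actual values printed by gen_env_vars.sh on every node.
export HEAD_PREFILL_NODE_IP=10.0.0.1   # head node of the 2-node prefill worker
export HEAD_DECODE_NODE_IP=10.0.0.3    # head node of the 12-node decode worker
export NATS_SERVER="nats://${HEAD_PREFILL_NODE_IP}:4222"     # assumed variable name and default NATS port
export ETCD_ENDPOINTS="http://${HEAD_PREFILL_NODE_IP}:2379"  # assumed variable name and default etcd port
```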

4. Run the ingress and prefill worker

```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/worker.py \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --disaggregation-transfer-backend nixl \
  --nnodes 2 \
  --node-rank 0 \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 6144 \
  --context-length 2716 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --disable-cuda-graph \
  --chunked-prefill-size 16384 \
  --max-total-tokens 32768 \
  --mem-fraction-static 0.8 \
  --log-level debug
```

On the other prefill node (the prefill worker in this example spans 2 nodes), run the same command but change `--node-rank` to 1.

5. Run the decode worker on the head decode node

```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/decode_worker.py \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode decode \
  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --nnodes 12 \
  --node-rank 0 \
  --tp-size 48 \
  --dp-size 48 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 36864 \
  --context-length 2716 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --cuda-graph-bs 768 \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --chunked-prefill-size 36864 \
  --mem-fraction-static 0.82 \
  --log-level debug
```

On the other decode nodes (this example has 12 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11.
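
Once all prefill and decode workers have started, a quick sanity check is to send an OpenAI-style request to the frontend launched in step 4 (this assumes the standard `/v1/chat/completions` route of the OpenAI-compatible frontend; run it on the head prefill node or substitute that node's IP for `localhost`):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32,
        "stream": false
      }'
```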

components/backends/sglang/docs/dsr1-wideep-h100.md

Lines changed: 15 additions & 17 deletions
@@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
 
 ## Instructions
 
-1. Pull the SGLang release `v0.4.8.post1` container. We are actively working on validating newer releases.
-
-```bash
-docker pull lmsysorg/sglang:v0.4.8.post1-cu126
-```
-
-You can also pull a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags)
-
-2. Build the Dynamo container
+1. Build the Dynamo container
 
 ```bash
 cd $DYNAMO_ROOT
 docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
 ```
 
-3. You can run this container on each 8xH100 node using the following command.
+You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
+
+2. You can run this container on each 8xH100 node using the following command.
 
 > [!IMPORTANT]
 > We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
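
As a concrete example of the build-arg note added above, pinning the SGLang base image that the old instructions pulled explicitly might look like this (the tag is taken from the removed `docker pull` line; substitute whichever tag you need):

```bash
cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep \
  --build-arg SGLANG_IMAGE_TAG=v0.4.8.post1-cu126 --no-cache
```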
@@ -47,17 +41,17 @@ docker run \
 
 In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
 
-4. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
+3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
 
 ```bash
 ./utils/gen_env_vars.sh
 ```
 
-5. Run the ingress and prefill worker
+4. Run the ingress and prefill worker
 
 ```bash
 # run ingress
-dynamo run in=http out=dyn &
+python3 -m dynamo.frontend --http-port=8000 &
 # optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
 python3 utils/sgl_http_server.py --ns dynamo &
 # run prefill worker
@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \
 
 On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
 
-7. Run the decode worker on the head decode node
+5. Run the decode worker on the head decode node
 
 ```bash
 python3 -m dynamo.sglang.decode_worker \
@@ -121,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \
 --deepep-mode low_latency \
 --mem-fraction-static 0.835 \
 --ep-num-redundant-experts 32 \
---cuda-graph-bs 256
+--cuda-graph-bs 128
 ```
 
 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
@@ -131,6 +125,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same
 In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
 
 prefill:
+
 ```bash
 ...
 --max-running-requests 8192 \
@@ -142,6 +137,7 @@ prefill:
 ```
 
 decode:
+
 ```bash
 ...
 --max-running-requests 18432 \
@@ -152,9 +148,10 @@ decode:
 We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
 
 1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
-We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
+We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
 
 Example usage:
+
 ```bash
 # warmup
 ./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
@@ -165,9 +162,10 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
 ```
 
 2. **GenAI Perf to benchmark completions with custom dataset**
-We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAIPerf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
+We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAI Perf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
 
 Example usage:
+
 ```bash
 # generate data
 python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
