<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Running DeepSeek-R1 Disaggregated with WideEP on GB200s

Dynamo supports SGLang's GB200 implementation of wide expert parallelism (WideEP) and large-scale P/D disaggregation for DeepSeek-R1. You can read the SGLang team's [blog post](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end-to-end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we run 1 prefill worker across 2 GB200 nodes (4 GPUs each, 8 GPUs total) and 1 decode worker across 12 GB200 nodes (48 GPUs), for 56 GPUs in all.

## Instructions

1. Build the Dynamo container:

```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-wideep-gb200 \
  --build-arg MODE=blackwell \
  --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
  --build-arg ARCH=arm64 \
  --build-arg ARCH_ALT=aarch64 \
  .
```

2. Run this container on each 4xGB200 node using the following command.

> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and mounting it into the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1).

```bash
docker run \
  --gpus all \
  -it \
  --rm \
  --network host \
  --volume /PATH_TO_DSR1_MODEL/:/model/ \
  --shm-size=10G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --ulimit nofile=65536:65536 \
  --cap-add CAP_SYS_PTRACE \
  --ipc host \
  dynamo-wideep-gb200:latest
```
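
Once inside the container, it is worth confirming that all 4 GPUs on the node are visible before proceeding. This is a hedged sanity check, not part of the deployment itself; the fallback keeps it harmless on machines without `nvidia-smi`:

```bash
# List the GPUs visible inside the container; on a GB200 node this
# should print 4 devices. Falls back to a message if the tool is absent.
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "nvidia-smi not found"
```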
| 58 | + |
| 59 | +3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier. |
| 60 | + |
| 61 | +```bash |
| 62 | +./utils/gen_env_vars.sh |
| 63 | +``` |
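
If you prefer to wire things up by hand, the variables the later commands depend on look roughly like this. The values below are illustrative only (assumed IPs and default `nats`/`etcd` ports); the authoritative names and values come from the `gen_env_vars.sh` output:

```bash
# Illustrative values only; substitute the real IPs printed by gen_env_vars.sh.
export HEAD_PREFILL_NODE_IP=10.0.0.1   # rank-0 prefill node
export HEAD_DECODE_NODE_IP=10.0.0.3    # rank-0 decode node
export NATS_SERVER="nats://${HEAD_PREFILL_NODE_IP}:4222"
export ETCD_ENDPOINTS="http://${HEAD_PREFILL_NODE_IP}:2379"
```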

4. Run the ingress and prefill worker:

```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/worker.py \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --disaggregation-transfer-backend nixl \
  --nnodes 2 \
  --node-rank 0 \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 6144 \
  --context-length 2716 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --disable-cuda-graph \
  --chunked-prefill-size 16384 \
  --max-total-tokens 32768 \
  --mem-fraction-static 0.8 \
  --log-level debug
```
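
Because the prefill worker spans two nodes (`--nnodes 2`), the same command must also be launched on the second prefill node, changing only the rank. A minimal sketch of the per-node difference (the `NODE_RANK` variable is hypothetical, used here only for illustration):

```bash
# On the second prefill node, reuse the exact environment variables and
# flags above; only the rank changes (rank 0 runs on the head prefill node).
NODE_RANK=1
echo "second prefill node launches with --node-rank ${NODE_RANK}"
```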

5. Run the decode worker on the head decode node:

```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/decode_worker.py \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode decode \
  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --nnodes 12 \
  --node-rank 0 \
  --tp-size 48 \
  --dp-size 48 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 36864 \
  --context-length 2716 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --cuda-graph-bs 768 \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --chunked-prefill-size 36864 \
  --mem-fraction-static 0.82 \
  --log-level debug
```

On the other decode nodes (this example has 12 decode nodes in total), run the same command but change `--node-rank` to 1 through 11.
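
Once all workers have registered, you can send a smoke-test request through the frontend. This sketch assumes the frontend exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8000 (the port set in step 4) and that you run it on a node that can reach the ingress:

```bash
# Hypothetical smoke test; adjust the host if running from another node.
PAYLOAD='{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 32, "stream": false}'
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "frontend not reachable"
```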