Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Introduction
We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and provide access to our code, we have released this pull request in the DeepSeek-R1 repository.
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
docker pull ghcr.io/antgroup/sglang:h20-blog-release
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
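As a quick sanity check before launching, you can compare the variables declared in env.sh against what your shell actually exports. Below is a minimal sketch; the parser is illustrative, and the sample variables (taken from the launch command later in this guide) stand in for the real contents of /root/env.sh:

```python
import os

def parse_env_file(text):
    """Parse simple KEY=value lines from an env.sh-style file,
    ignoring comments, blank lines, and a leading 'export '."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

# Illustrative contents only; the real /root/env.sh is tuned for the authors' H20 setup.
sample = """
# H20 tuning (example values)
export SGL_ENABLE_JIT_DEEPGEMM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
"""

for key, value in parse_env_file(sample).items():
    current = os.environ.get(key)
    if current != value:
        print(f"{key}: file says {value!r}, environment has {current!r}")
```

This catches the common failure mode where a variable was edited in the file but the running shell never re-sourced it.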
Launching SGLang
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
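Conceptually, each request does its prefill on one of the two Prefill nodes (picked round-robin by the router described later in this guide) and then streams tokens from the Decode group. A toy model of that routing, with made-up endpoint URLs and none of the actual sglang_router logic:

```python
from itertools import cycle

class MiniPDRouter:
    """Toy model of PD-disaggregated routing: prefill workers are chosen
    round-robin, while decode always targets the decode group's master."""

    def __init__(self, prefill_urls, decode_url):
        self._prefill = cycle(prefill_urls)
        self._decode = decode_url

    def route(self):
        # Returns the (prefill, decode) pair a new request would use.
        return next(self._prefill), self._decode

router = MiniPDRouter(
    ["http://prefill-0:61001", "http://prefill-1:61001"],
    "http://decode-master:61001",
)
print(router.route())  # alternates between the two prefill endpoints
```

The real router additionally proxies the request bodies and handles failures; this sketch only shows why two Prefill containers can be addressed symmetrically while the Decode pair is reached through a single master endpoint.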
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
Both Prefill nodes use the same launch parameters.
This script is designed for observing peak performance in the logs. Since --request-rate is set to inf, all requests are sent at once, making the TTFT and TPOT figures less meaningful.
Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
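To see why --request-rate inf skews TTFT and TPOT: at a finite rate the benchmark spaces requests out with random inter-arrival gaps, while at an infinite rate every request lands in the queue at once, so queueing delay dominates time-to-first-token. A small sketch of that arrival model (a simplification of what sglang.bench_serving does, not its actual code):

```python
import random

def arrival_gaps(num_requests, request_rate):
    """Inter-arrival gaps in seconds. With a finite rate, gaps are drawn
    from an exponential distribution with mean 1/rate; with rate == inf,
    all requests arrive simultaneously (every gap is zero)."""
    if request_rate == float("inf"):
        return [0.0] * num_requests
    return [random.expovariate(request_rate) for _ in range(num_requests)]

# With inf, 4096 requests hit the server back to back, so measured TTFT
# mostly reflects queue position rather than serving latency.
print(sum(arrival_gaps(4096, float("inf"))))  # 0.0
```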
Hello, @TianyuZhang1214
When I try the image ghcr.io/antgroup/sglang:h20-blog-release and launch DeepSeek-R1 on my H20, it gets stuck during the startup phase. When I switch to the official SGLang image, everything works fine. Could you help double-check the image?
@jeffye-dev This PR is no longer maintained. Please refer to the active fork at: antgroup#4.
Additionally, I’ve double-checked the Docker image, and it has been validated by multiple users. Please verify your local environment setup, including NCCL and GLOO.
Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
docker pull ghcr.io/antgroup/sglang:h20-blog-release
The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
Launching SGLang
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
Both Prefill nodes use the same launch parameters.
2. Launching Decode Nodes
Note:
Set {node_rank} to 0 or 1 for the respective node. Replace {decode_master_ip} with the IP address of Node 0.
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
  --disaggregation-bootstrap-port 9000 \
  --attention-backend flashmla \
  --host 0.0.0.0 \
  --port 61001 \
  --trust-remote-code \
  --dist-init-addr {decode_master_ip}:62001 \
  --nnodes 2 \
  --node-rank {node_rank} \
  --tp-size 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --mem-fraction-static 0.88 \
  --max-running-requests 512 \
  --context-length 65535 \
  --log-level info \
  --decode-log-interval 50 \
  --page-size 64 \
  --schedule-conservativeness 0.3 \
  --enable-cache-report \
  --moe-dense-tp-size 1 \
  --enable-deepep-moe \
  --enable-dp-lm-head \
  --cuda-graph-max-bs 32 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --init-expert-location /root/expert_workload.json \
  --prefill-round-robin-balance \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --deepep-mode low_latency_overlap \
  --enable-single-batch-overlap \
  > /home/admin/logs/stdout.log 2>&1 &
3. Launching SGLang Router
Note:
Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
nohup python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --host 0.0.0.0 \
  --decode http://{decode_master_ip}:61001 \
  --port 8000 \
  --prefill http://{prefill_node_0_ip}:61001 \
  --prefill http://{prefill_node_1_ip}:61001 \
  > /home/admin/logs/router.log 2>&1 &
Testing
1. Running the Benchmark
Note:
Since --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful. Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
nohup python3 -m sglang.bench_serving \
  --host 0.0.0.0 \
  --port 8000 \
  --dataset-path {path-to-shareGPT} \
  --num-prompt 4096 \
  --random-input 4096 \
  --random-output 1536 \
  --request-rate "inf" \
  --max-concurrency 2048 \
  --warmup-requests 0 \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  > /home/local/workspace/bench.log 2>&1 &
2. Observing Logs
To monitor peak performance, filter logs for entries with running-req: 32:
grep -E 'Decode batch.*running-req: 32' /home/admin/logs/sglang.log
Example Output (for batch size = 32):
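The same filtering can be scripted, for example to pull the peak generation throughput out of matching lines. The sketch below assumes a decode-log line shape like "Decode batch. #running-req: N, ... gen throughput (token/s): X, ..."; verify the pattern against your actual sglang.log before relying on it:

```python
import re

# Assumed log-line shape; adjust to match your actual decode logs.
PATTERN = re.compile(
    r"Decode batch\..*#running-req: (\d+).*gen throughput \(token/s\): ([\d.]+)"
)

def peak_throughput(lines, running_req=32):
    """Return the highest gen throughput seen among decode batches
    of the given running-request count."""
    best = 0.0
    for line in lines:
        m = PATTERN.search(line)
        if m and int(m.group(1)) == running_req:
            best = max(best, float(m.group(2)))
    return best

sample = [
    "Decode batch. #running-req: 32, #token: 2048, gen throughput (token/s): 1500.2, #queue-req: 0",
    "Decode batch. #running-req: 16, #token: 1024, gen throughput (token/s): 900.0, #queue-req: 0",
]
print(peak_throughput(sample))  # 1500.2
```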
Related PRs