Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Introduction
We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and provide access to our code, we have released this pull request in the DeepSeek-R1 repository.
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
docker pull ghcr.io/antgroup/sglang:h20-blog-release
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
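As a quick sanity check before launching, you can compare the variables declared in env.sh against what your shell actually exports. Below is a minimal sketch; the parser is illustrative, and the sample variables (taken from the launch command later in this guide) stand in for the real contents of /root/env.sh:

```python
import os

def parse_env_file(text):
    """Parse simple KEY=value lines from an env.sh-style file,
    ignoring comments, blank lines, and a leading 'export '."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

# Illustrative contents only; the real /root/env.sh is tuned for the authors' H20 setup.
sample = """
# H20 tuning (example values)
export SGL_ENABLE_JIT_DEEPGEMM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
"""

for key, value in parse_env_file(sample).items():
    current = os.environ.get(key)
    if current != value:
        print(f"{key}: file says {value!r}, environment has {current!r}")
```

This catches the common failure mode where a variable was edited in the file but the running shell never re-sourced it.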
Launching SGLang
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
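Conceptually, each request does its prefill on one of the two Prefill nodes (picked round-robin by the router described later in this guide) and then streams tokens from the Decode group. A toy model of that routing, with made-up endpoint URLs and none of the actual sglang_router logic:

```python
from itertools import cycle

class MiniPDRouter:
    """Toy model of PD-disaggregated routing: prefill workers are chosen
    round-robin, while decode always targets the decode group's master."""

    def __init__(self, prefill_urls, decode_url):
        self._prefill = cycle(prefill_urls)
        self._decode = decode_url

    def route(self):
        # Returns the (prefill, decode) pair a new request would use.
        return next(self._prefill), self._decode

router = MiniPDRouter(
    ["http://prefill-0:61001", "http://prefill-1:61001"],
    "http://decode-master:61001",
)
print(router.route())  # alternates between the two prefill endpoints
```

The real router additionally proxies the request bodies and handles failures; this sketch only shows why two Prefill containers can be addressed symmetrically while the Decode pair is reached through a single master endpoint.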
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
Both Prefill nodes use the same launch parameters.
This script is designed for observing peak performance in the logs. Since --request-rate is set to inf, all requests are sent at once, making the TTFT and TPOT figures less meaningful.
Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
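To see why --request-rate inf skews TTFT and TPOT: at a finite rate the benchmark spaces requests out with random inter-arrival gaps, while at an infinite rate every request lands in the queue at once, so queueing delay dominates time-to-first-token. A small sketch of that arrival model (a simplification of what sglang.bench_serving does, not its actual code):

```python
import random

def arrival_gaps(num_requests, request_rate):
    """Inter-arrival gaps in seconds. With a finite rate, gaps are drawn
    from an exponential distribution with mean 1/rate; with rate == inf,
    all requests arrive simultaneously (every gap is zero)."""
    if request_rate == float("inf"):
        return [0.0] * num_requests
    return [random.expovariate(request_rate) for _ in range(num_requests)]

# With inf, 4096 requests hit the server back to back, so measured TTFT
# mostly reflects queue position rather than serving latency.
print(sum(arrival_gaps(4096, float("inf"))))  # 0.0
```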
Hello, @TianyuZhang1214
When I try the image ghcr.io/antgroup/sglang:h20-blog-release and launch DeepSeek-R1 on my H20, it gets stuck during the startup phase. When I switch to the official SGLang image, everything works fine. Could you help double-check the image?
@jeffye-dev This PR is no longer maintained. Please refer to the active fork at: antgroup#4.
Additionally, I’ve double-checked the Docker image, and it has been validated by multiple users. Please verify your local environment setup, including NCCL and GLOO.
Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
docker pull ghcr.io/antgroup/sglang:h20-blog-release
The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
Launching SGLang
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
Both Prefill nodes use the same launch parameters.
2. Launching Decode Nodes
Note:
Set {node_rank} to 0 or 1 for the respective node. Replace {decode_master_ip} with the IP address of Node 0.
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
  --disaggregation-bootstrap-port 9000 \
  --attention-backend flashmla \
  --host 0.0.0.0 \
  --port 61001 \
  --trust-remote-code \
  --dist-init-addr {decode_master_ip}:62001 \
  --nnodes 2 \
  --node-rank {node_rank} \
  --tp-size 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --mem-fraction-static 0.88 \
  --max-running-requests 512 \
  --context-length 65535 \
  --log-level info \
  --decode-log-interval 50 \
  --page-size 64 \
  --schedule-conservativeness 0.3 \
  --enable-cache-report \
  --moe-dense-tp-size 1 \
  --enable-deepep-moe \
  --enable-dp-lm-head \
  --cuda-graph-max-bs 32 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --init-expert-location /root/expert_workload.json \
  --prefill-round-robin-balance \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --deepep-mode low_latency_overlap \
  --enable-single-batch-overlap \
  > /home/admin/logs/stdout.log 2>&1 &
3. Launching SGLang Router
Note:
Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
nohup python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --host 0.0.0.0 \
  --decode http://{decode_master_ip}:61001 \
  --port 8000 \
  --prefill http://{prefill_node_0_ip}:61001 \
  --prefill http://{prefill_node_1_ip}:61001 \
  > /home/admin/logs/router.log 2>&1 &
Testing
1. Running the Benchmark
Note:
Since --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful. Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
nohup python3 -m sglang.bench_serving \
  --host 0.0.0.0 \
  --port 8000 \
  --dataset-path {path-to-shareGPT} \
  --num-prompt 4096 \
  --random-input 4096 \
  --random-output 1536 \
  --request-rate "inf" \
  --max-concurrency 2048 \
  --warmup-requests 0 \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  > /home/local/workspace/bench.log 2>&1 &
2. Observing Logs
To monitor peak performance, filter logs for entries with running-req: 32:
grep -E 'Decode batch.*running-req: 32' /home/admin/logs/sglang.log
Example Output (for batch size = 32):
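The same filtering can be scripted, for example to pull the peak generation throughput out of matching lines. The sketch below assumes a decode-log line shape like "Decode batch. #running-req: N, ... gen throughput (token/s): X, ..."; verify the pattern against your actual sglang.log before relying on it:

```python
import re

# Assumed log-line shape; adjust to match your actual decode logs.
PATTERN = re.compile(
    r"Decode batch\..*#running-req: (\d+).*gen throughput \(token/s\): ([\d.]+)"
)

def peak_throughput(lines, running_req=32):
    """Return the highest gen throughput seen among decode batches
    of the given running-request count."""
    best = 0.0
    for line in lines:
        m = PATTERN.search(line)
        if m and int(m.group(1)) == running_req:
            best = max(best, float(m.group(2)))
    return best

sample = [
    "Decode batch. #running-req: 32, #token: 2048, gen throughput (token/s): 1500.2, #queue-req: 0",
    "Decode batch. #running-req: 16, #token: 1024, gen throughput (token/s): 900.0, #queue-req: 0",
]
print(peak_throughput(sample))  # 1500.2
```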
Related PRs