Deepseek-v4-Pro share expert tp1 on H20 by zhangxiaolei123456 · Pull Request #23911 · sgl-project/sglang

zhangxiaolei123456 · 2026-04-28T05:57:35Z

Motivation

PR #23686 implements a TP16 deployment of DeepSeek-V4-Pro on SM90. The shared expert cannot be deployed with TP16, so this PR deploys the shared expert with TP1.
Co-authored-by: shiyu7

Modifications

Accuracy Tests

Command

SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 16  --enable-dp-attention --cuda-graph-max-bs 1 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 0 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 

SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 16 --enable-dp-attention --cuda-graph-max-bs 1 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 1 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4
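The two launch commands above differ only in `--node-rank`. As a minimal sketch, both can be generated from one template so the environment variables and flags stay in sync across nodes. The helper name `build_launch_command` is hypothetical; the environment variables and flags are copied from the commands above, and the flag order is assumed not to matter.

```python
import shlex

# Environment variables copied from the commands above.
ENV = {
    "SGLANG_SHARED_EXPERT_TP1": "1",
    "SGLANG_ENABLE_THINKING": "1",
    "SGLANG_DSV4_FP4_EXPERTS": "1",
    "SGLANG_JIT_DEEPGEMM_PRECOMPILE": "0",
    "GLOO_SOCKET_IFNAME": "eth0",
    "NCCL_MIN_NCHANNELS": "24",
    "NCCL_IB_QPS_PER_CONNECTION": "8",
}

# Flags shared by both nodes, copied from the commands above.
BASE_ARGS = [
    "sglang", "serve",
    "--trust-remote-code",
    "--model-path", "/data00/models/DeepSeek-V4-Pro",
    "--tp", "16",
    "--dp-size", "16",
    "--enable-dp-attention",
    "--cuda-graph-max-bs", "1",
    "--max-running-requests", "16",
    "--enable-metrics",
    "--host", "0.0.0.0",
    "--port", "8080",
    "--mem-fraction-static", "0.9",
    "--moe-runner-backend", "marlin",
    "--dist-init-addr", "192.168.3.198:30300",
    "--nnodes", "2",
    "--tool-call-parser", "deepseekv4",
    "--reasoning-parser", "deepseek-v4",
]


def build_launch_command(node_rank: int) -> str:
    """Return the shell command line for one node (hypothetical helper)."""
    env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in ENV.items())
    # Only the node rank changes between the two nodes.
    args = BASE_ARGS + ["--node-rank", str(node_rank)]
    return f"{env_prefix} {shlex.join(args)}"


if __name__ == "__main__":
    for rank in range(2):
        print(build_launch_command(rank))
```

Run the line printed for rank 0 on the node that owns the `--dist-init-addr` address, and the line for rank 1 on the second node.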

GSM8K

python3 bench_sglang.py --host http://localhost  --port 8080 --data-path /data00 --num-questions 5000 --parallel 100
100%|██████████████████████████████████████████████████████████| 1319/1319 [08:41<00:00,  2.53it/s]
Accuracy: 0.949
Invalid: 0.001
Latency: 521.499 s
Output throughput: 232.396 token/s

MMLU

 python3 bench_sglang.py --parallel 128 --backend srt --host http://127.0.0.1 --port 8080 --data_dir /data00/mmlu


100%|████████████████████████████████████████████████████████| 14042/14042 [15:24<00:00, 15.19it/s]
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.896
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.850
subject: clinical_knowledge, #q:265, acc: 0.932
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.710
subject: college_computer_science, #q:100, acc: 0.910
subject: college_mathematics, #q:100, acc: 0.820
subject: college_medicine, #q:173, acc: 0.896
subject: college_physics, #q:102, acc: 0.961
subject: computer_security, #q:100, acc: 0.850
subject: conceptual_physics, #q:235, acc: 0.957
subject: econometrics, #q:114, acc: 0.833
subject: electrical_engineering, #q:145, acc: 0.910
subject: elementary_mathematics, #q:378, acc: 0.966
subject: formal_logic, #q:126, acc: 0.802
subject: global_facts, #q:100, acc: 0.760
subject: high_school_biology, #q:310, acc: 0.968
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.903
subject: high_school_geography, #q:198, acc: 0.955
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.936
subject: high_school_mathematics, #q:270, acc: 0.859
subject: high_school_microeconomics, #q:238, acc: 0.971
subject: high_school_physics, #q:151, acc: 0.907
subject: high_school_psychology, #q:545, acc: 0.969
subject: high_school_statistics, #q:216, acc: 0.926
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.966
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.901
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.898
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.893
subject: management, #q:103, acc: 0.971
subject: marketing, #q:234, acc: 0.966
subject: medical_genetics, #q:100, acc: 0.980
subject: miscellaneous, #q:783, acc: 0.966
subject: moral_disputes, #q:346, acc: 0.876
subject: moral_scenarios, #q:895, acc: 0.847
subject: nutrition, #q:306, acc: 0.925
subject: philosophy, #q:311, acc: 0.929
subject: prehistory, #q:324, acc: 0.951
subject: professional_accounting, #q:282, acc: 0.894
subject: professional_law, #q:1534, acc: 0.744
subject: professional_medicine, #q:272, acc: 0.945
subject: professional_psychology, #q:612, acc: 0.926
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.882
subject: sociology, #q:201, acc: 0.955
subject: us_foreign_policy, #q:100, acc: 0.940
subject: virology, #q:166, acc: 0.590
subject: world_religions, #q:171, acc: 0.930
Total latency: 924.528
Average accuracy: 0.896
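To cross-check a log like the one above, the per-subject lines can be parsed and aggregated. The sketch below computes both a question-weighted (micro) and an unweighted per-subject (macro) average. Which of the two the benchmark script reports as "Average accuracy" is not stated here, so treat both as assumptions to compare against. The function names are hypothetical, and the sample uses only the first three subjects from the log.

```python
import re

# Matches lines such as: "subject: anatomy, #q:135, acc: 0.896"
LINE_RE = re.compile(r"subject: (?P<subject>\w+), #q:(?P<n>\d+), acc: (?P<acc>[\d.]+)")


def parse_mmlu_log(text):
    """Return a list of (subject, num_questions, accuracy) tuples."""
    return [
        (m["subject"], int(m["n"]), float(m["acc"]))
        for m in LINE_RE.finditer(text)
    ]


def summarize(rows):
    """Return (total_questions, micro_average, macro_average)."""
    total_q = sum(n for _, n, _ in rows)
    micro = sum(n * acc for _, n, acc in rows) / total_q  # weighted by #q
    macro = sum(acc for _, _, acc in rows) / len(rows)    # each subject counts once
    return total_q, micro, macro


# First three subject lines from the log above; paste the full log to check all 57.
SAMPLE = """\
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.896
subject: astronomy, #q:152, acc: 0.941
"""

if __name__ == "__main__":
    rows = parse_mmlu_log(SAMPLE)
    total_q, micro, macro = summarize(rows)
    print(f"{len(rows)} subjects, {total_q} questions")
    print(f"micro={micro:.3f} macro={macro:.3f}")
```

Because the logged accuracies are rounded to three decimals, a recomputed average can differ slightly from the value the script prints.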

longbench_v2

python result.py
file='DeepSeek-V4-Pro.jsonl' (easy_acc + hard_acc) / len(pred_data) = 0.7272727272727273
['Model\tOverall\tEasy\tHard\tShort\tMedium\tLong', 'DeepSeek-V4-Pro\t72.7\t73.2\t72.4\t69.0\t74.1\t100.0']
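The result rows are easier to read once split into named columns. A minimal sketch, assuming every column (including Overall and Easy) is tab-separated; the helper name `parse_result_table` is hypothetical and the values are taken from the output above.

```python
def parse_result_table(lines):
    """Turn tab-separated header and rows into {model: {column: score}}."""
    header, *rows = [line.split("\t") for line in lines]
    table = {}
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}: {row}")
        model, *scores = row
        # header[0] is the "Model" column; the rest name the score columns.
        table[model] = dict(zip(header[1:], map(float, scores)))
    return table


LINES = [
    "Model\tOverall\tEasy\tHard\tShort\tMedium\tLong",
    "DeepSeek-V4-Pro\t72.7\t73.2\t72.4\t69.0\t74.1\t100.0",
]

if __name__ == "__main__":
    for model, scores in parse_result_table(LINES).items():
        print(model)
        for column, score in scores.items():
            print(f"  {column}: {score}")
```

The column-count check raises an error if a separator is missing, which makes a malformed row visible instead of silently shifting scores into the wrong columns.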

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Co-authored-by: Baizhou Zhang <baizhouzhang@radixark.ai>
Co-authored-by: Baizhou Zhang <baizhou.zhang@radixark.ai>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Fridge003 <sobereddiezhang@gmail.com>
Co-authored-by: Ke Bao <26454835+ispobock@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Mingyi Lu <wisclmy0611@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>

… events + debug_prev_state + silence none
Re-port from feat/debug_prefill_delayer commit 1b02e2d4f. Changes:
- _NegotiateOutput.debug_prev_state field for wait_success/wait_timeout timing
- _record_single_pass_result: print no_wait/wait_success/delay/wait_timeout events
- silence prefillable_status==none branch (was log explosion under decode-log-interval=1)
- Computed _dbg_wait_seconds/_dbg_forward_passes from next_state OR debug_prev_state
Gated by SGLANG_PREFILL_DELAYER_DEBUG_LOG=1. forward_pass_id alignment via existing built-in SGLANG_LOG_FORWARD_ITERS=1 (no extra patch needed).

fzyzcjy and others added 30 commits April 24, 2026 10:51

d40ca83 fix index_topk
0d6856b Revert "fix index_topk" (This reverts commit d40ca83.)
927e149 hisparse scheduling fix
4807f6c fix: topk 1024
5c59a71 feat: support 1024 topk
f5d03db Reapply "fix index_topk" (This reverts commit 0d6856b.)
a74b25f update Dockerfile for B300
5031406 fix dockerfile
2045dc0 add gb dockerfile
c48efaf add h200/b200 dockerfile
0d4735b fix gb dockerfile
0f94b5d update b300 dockerfile
ca21ebe fix gb dockerfile
5e483b7 fix b300
8756f36 fix b300 dockerfile
dc2b507 SGLANG_FIX_DSV4_BASE_MODEL_LOAD
6c396d5 Merge remote-tracking branch 'upstream/deepseek_v4' into deepseek_v4
cb591d3 Support dsv4 task / latest_reminder / content parts in OpenAI chat API (sgl-project#23692)
4bf81c9 [NSA] Fall back to fast_hadamard_transform when sgl_kernel lacks the symbol (sgl-project#23699)
02451ff route ignore_eos+disable_radix_cache path through prefill_delayer (was bypassing) (cherry picked from commit f2fb9795d1b4f0609bdf5c1339b542551e66ad69)
00651d8 Merge remote-tracking branch 'upstream/deepseek_v4' into deepseek_v4
7f58083 fix: fix fast ep masked
2777a6f feat: new flags
2e43d2b swa split leaf on insert
05ab33b rm token_usage call in prefill delayer
6a5a127 hack bench_one_batch_server_internal.py
914bc4d fix swa batch full
97d73a1 opt swa mem

Fridge003 closed this May 12, 2026

zhangxiaolei123456 wants to merge 79 commits into sgl-project:main from zhangxiaolei123456:deepseek_v4_share_expert_tp1


16 participants
