Deepseek-v4-Pro share expert tp1 on H20#23911

Closed
zhangxiaolei123456 wants to merge 79 commits into
sgl-project:mainfrom
zhangxiaolei123456:deepseek_v4_share_expert_tp1

zhangxiaolei123456 commented Apr 28, 2026

Motivation

PR #23686 implements a TP16 deployment of DeepSeek-V4-Pro on SM90. Because the shared expert cannot be deployed with TP16, this PR deploys the shared expert with TP1.
Co-authored-by: shiyu7

Modifications

Accuracy Tests

Command

SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 16 --enable-dp-attention --cuda-graph-max-bs 1 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 0 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4

SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 16 --enable-dp-attention --cuda-graph-max-bs 1 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 1 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4
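
The two commands above are identical except for `--node-rank`. As a minimal sketch, both can be generated from one template. `build_launch_command` is a hypothetical helper for illustration and not part of sglang. Every flag and environment variable is copied from the commands above, although `--node-rank` is moved to the end of the argument list.

```python
import shlex

# Environment variables copied verbatim from the commands in this PR description.
ENV = {
    "SGLANG_SHARED_EXPERT_TP1": "1",
    "SGLANG_ENABLE_THINKING": "1",
    "SGLANG_DSV4_FP4_EXPERTS": "1",
    "SGLANG_JIT_DEEPGEMM_PRECOMPILE": "0",
    "GLOO_SOCKET_IFNAME": "eth0",
    "NCCL_MIN_NCHANNELS": "24",
    "NCCL_IB_QPS_PER_CONNECTION": "8",
}

NNODES = 2  # matches --nnodes 2 in the commands above

# Arguments shared by every node, also copied from the commands above.
COMMON_ARGS = [
    "sglang", "serve", "--trust-remote-code",
    "--model-path", "/data00/models/DeepSeek-V4-Pro",
    "--tp", "16", "--dp-size", "16", "--enable-dp-attention",
    "--cuda-graph-max-bs", "1", "--max-running-requests", "16",
    "--enable-metrics", "--host", "0.0.0.0", "--port", "8080",
    "--mem-fraction-static", "0.9", "--moe-runner-backend", "marlin",
    "--dist-init-addr", "192.168.3.198:30300", "--nnodes", str(NNODES),
    "--tool-call-parser", "deepseekv4", "--reasoning-parser", "deepseek-v4",
]


def build_launch_command(node_rank: int) -> str:
    """Hypothetical helper: return the shell command line for one node."""
    if not 0 <= node_rank < NNODES:
        raise ValueError(f"node_rank must be in [0, {NNODES}), got {node_rank}")
    env = " ".join(f"{key}={shlex.quote(value)}" for key, value in ENV.items())
    args = COMMON_ARGS + ["--node-rank", str(node_rank)]
    return f"{env} {shlex.join(args)}"


if __name__ == "__main__":
    for rank in range(NNODES):
        print(build_launch_command(rank))
```

Keeping the shared arguments in one place makes it harder for the two node commands to drift apart when a flag is changed.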

GSM8K

python3 bench_sglang.py --host http://localhost --port 8080 --data-path /data00 --num-questions 5000 --parallel 100
100%|██████████████████████████████████████████████████████████| 1319/1319 [08:41<00:00,  2.53it/s]
Accuracy: 0.949
Invalid: 0.001
Latency: 521.499 s
Output throughput: 232.396 token/s
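
The progress bar shows 1319 questions even though `--num-questions 5000` was requested, which matches the size of the GSM8K test split. As a rough consistency check on the figures above, the sketch below derives the implied total output tokens and the approximate number of correct answers. It assumes that the reported throughput is total output tokens divided by the reported latency, which is an assumption about how the script computes it.

```python
# Figures copied from the GSM8K output above.
num_questions = 1319
accuracy = 0.949
latency_s = 521.499
throughput_tok_per_s = 232.396

# Assumption: throughput = total output tokens / latency.
total_output_tokens = throughput_tok_per_s * latency_s
tokens_per_question = total_output_tokens / num_questions

# Accuracy is printed with three decimals, so this count is approximate.
approx_correct = round(accuracy * num_questions)

print(f"implied total output tokens: {total_output_tokens:,.0f}")
print(f"implied output tokens per question: {tokens_per_question:.1f}")
print(f"approximate correct answers: {approx_correct} of {num_questions}")
```

Under that assumption the run produced roughly 121,000 output tokens, or about 92 tokens per question.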

MMLU

python3 bench_sglang.py --parallel 128 --backend srt --host http://127.0.0.1 --port 8080 --data_dir /data00/mmlu

100%|████████████████████████████████████████████████████████| 14042/14042 [15:24<00:00, 15.19it/s]
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.896
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.850
subject: clinical_knowledge, #q:265, acc: 0.932
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.710
subject: college_computer_science, #q:100, acc: 0.910
subject: college_mathematics, #q:100, acc: 0.820
subject: college_medicine, #q:173, acc: 0.896
subject: college_physics, #q:102, acc: 0.961
subject: computer_security, #q:100, acc: 0.850
subject: conceptual_physics, #q:235, acc: 0.957
subject: econometrics, #q:114, acc: 0.833
subject: electrical_engineering, #q:145, acc: 0.910
subject: elementary_mathematics, #q:378, acc: 0.966
subject: formal_logic, #q:126, acc: 0.802
subject: global_facts, #q:100, acc: 0.760
subject: high_school_biology, #q:310, acc: 0.968
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.903
subject: high_school_geography, #q:198, acc: 0.955
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.936
subject: high_school_mathematics, #q:270, acc: 0.859
subject: high_school_microeconomics, #q:238, acc: 0.971
subject: high_school_physics, #q:151, acc: 0.907
subject: high_school_psychology, #q:545, acc: 0.969
subject: high_school_statistics, #q:216, acc: 0.926
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.966
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.901
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.898
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.893
subject: management, #q:103, acc: 0.971
subject: marketing, #q:234, acc: 0.966
subject: medical_genetics, #q:100, acc: 0.980
subject: miscellaneous, #q:783, acc: 0.966
subject: moral_disputes, #q:346, acc: 0.876
subject: moral_scenarios, #q:895, acc: 0.847
subject: nutrition, #q:306, acc: 0.925
subject: philosophy, #q:311, acc: 0.929
subject: prehistory, #q:324, acc: 0.951
subject: professional_accounting, #q:282, acc: 0.894
subject: professional_law, #q:1534, acc: 0.744
subject: professional_medicine, #q:272, acc: 0.945
subject: professional_psychology, #q:612, acc: 0.926
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.882
subject: sociology, #q:201, acc: 0.955
subject: us_foreign_policy, #q:100, acc: 0.940
subject: virology, #q:166, acc: 0.590
subject: world_religions, #q:171, acc: 0.930
Total latency: 924.528
Average accuracy: 0.896
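
Per-subject accuracies can be aggregated in two ways: weighted by question count (micro average) or as a plain mean over subjects (macro average). The sketch below parses lines in the format shown above and computes both. `summarize` is a hypothetical helper, and whether the reported "Average accuracy" is the question-weighted figure is an assumption here. Only three lines from the log are used as sample input, so the printed values do not reproduce the full-run number.

```python
import re

# Matches lines such as "subject: anatomy, #q:135, acc: 0.896".
LINE_RE = re.compile(r"subject: (\w+), #q:(\d+), acc: ([0-9.]+)")


def summarize(log_text: str):
    """Hypothetical helper: return (total questions, micro avg, macro avg)."""
    rows = [(name, int(n), float(acc)) for name, n, acc in LINE_RE.findall(log_text)]
    if not rows:
        raise ValueError("no per-subject lines found")
    total_q = sum(n for _, n, _ in rows)
    # Micro average: every question counts equally.
    micro = sum(n * acc for _, n, acc in rows) / total_q
    # Macro average: every subject counts equally.
    macro = sum(acc for _, _, acc in rows) / len(rows)
    return total_q, micro, macro


# Three lines copied from the MMLU log above, used as sample input.
sample = """\
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.896
subject: virology, #q:166, acc: 0.590
"""

total_q, micro, macro = summarize(sample)
print(f"questions={total_q} micro={micro:.3f} macro={macro:.3f}")
```

Running `summarize` on the complete log also gives a total question count that can be compared against the 14042 shown in the progress bar.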

longbench_v2

python result.py
file='DeepSeek-V4-Pro.jsonl' (easy_acc + hard_acc)/ Len(pred_data)=0.7272727272727273
['Model\tOverall\tEasy\tHard\tShort\tMedium\tLong', 'DeepSeek-V4-Pro\t72.7\t73.2\t72.4\t69.0\t74.1\t100.0']
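
The result row is tab-separated and lines up with the header row. A minimal sketch of turning it into a labeled mapping follows, using a hypothetical `parse_result_row` helper. It assumes the first two scores are separated by a tab like the others (`72.7` for Overall, `73.2` for Easy), and it checks that the Overall score agrees with the printed fraction.

```python
# Header and row taken from the result.py output above.
# Assumption: Overall (72.7) and Easy (73.2) are separate tab-delimited fields.
header = "Model\tOverall\tEasy\tHard\tShort\tMedium\tLong"
row = "DeepSeek-V4-Pro\t72.7\t73.2\t72.4\t69.0\t74.1\t100.0"


def parse_result_row(header_line: str, row_line: str) -> dict:
    """Hypothetical helper: map column names to values for one result row."""
    keys = header_line.split("\t")
    values = row_line.split("\t")
    if len(keys) != len(values):
        raise ValueError(f"expected {len(keys)} fields, got {len(values)}")
    scores = {key: float(value) for key, value in zip(keys[1:], values[1:])}
    return {keys[0]: values[0], **scores}


scores = parse_result_row(header, row)

# The fraction printed by result.py, rounded to one decimal as a percentage.
overall_fraction = 0.7272727272727273
overall_percent = round(overall_fraction * 100, 1)

for name, value in scores.items():
    print(f"{name}: {value}")
print(f"overall from fraction: {overall_percent}")
```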

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

fzyzcjy and others added 30 commits April 24, 2026 10:51
Co-authored-by: Baizhou Zhang <baizhouzhang@radixark.ai>
Co-authored-by: Baizhou Zhang <baizhou.zhang@radixark.ai>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Fridge003 <sobereddiezhang@gmail.com>
Co-authored-by: Ke Bao <26454835+ispobock@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Mingyi Lu <wisclmy0611@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>
This reverts commit d40ca83.
This reverts commit 0d6856b.
…s bypassing)

(cherry picked from commit f2fb9795d1b4f0609bdf5c1339b542551e66ad69)
… events + debug_prev_state + silence none

Re-port from feat/debug_prefill_delayer commit 1b02e2d4f. Changes:
- _NegotiateOutput.debug_prev_state field for wait_success/wait_timeout timing
- _record_single_pass_result: print no_wait/wait_success/delay/wait_timeout events
- silence prefillable_status==none branch (was log explosion under decode-log-interval=1)
- Computed _dbg_wait_seconds/_dbg_forward_passes from next_state OR debug_prev_state

Gated by SGLANG_PREFILL_DELAYER_DEBUG_LOG=1. forward_pass_id alignment via
existing built-in SGLANG_LOG_FORWARD_ITERS=1 (no extra patch needed).