Deepseek v4 support dp attention is not TP size by zhangxiaolei123456 · Pull Request #24952 · sgl-project/sglang

zhangxiaolei123456 · 2026-05-11T07:14:29Z

Motivation

DeepSeekV4 branch PR is here: #23933

Modifications

Accuracy Tests

Launch on node 0:
SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 2 --enable-dp-attention --cuda-graph-max-bs 8 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 0 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4

Launch on node 1 (same command except --node-rank 1):
SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 2 --enable-dp-attention --cuda-graph-max-bs 8 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 1 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4
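With `--tp 16 --dp-size 2 --enable-dp-attention`, the attention TP size is smaller than the full TP size, which is the case this PR targets. The sketch below reproduces, as we understand it, how SGLang derives the DP-attention layout (`attn_tp_size = tp_size // dp_size`, with the DP rank and attention-TP rank given by `divmod(tp_rank, attn_tp_size)`). The helper `dp_attention_layout` is hypothetical, and the node mapping assumes 8 GPUs per node with TP ranks placed on nodes contiguously. It also evaluates the `wo_b` `reduce_results` condition from this PR's diff for this launch.

```python
from typing import Dict, List, Tuple


def dp_attention_layout(tp_size: int, dp_size: int, nnodes: int) -> Tuple[int, List[Dict[str, int]]]:
    """Hypothetical helper: map each TP rank to its node, DP group and attention-TP rank."""
    assert tp_size % dp_size == 0, "tp_size must be divisible by dp_size"
    assert tp_size % nnodes == 0, "assumes the same number of GPUs on every node"
    attn_tp_size = tp_size // dp_size  # GPUs that shard one attention replica
    gpus_per_node = tp_size // nnodes
    layout = []
    for tp_rank in range(tp_size):
        attn_dp_rank, attn_tp_rank = divmod(tp_rank, attn_tp_size)
        layout.append({
            "tp_rank": tp_rank,
            "node_rank": tp_rank // gpus_per_node,  # assumes contiguous rank placement
            "attn_dp_rank": attn_dp_rank,
            "attn_tp_rank": attn_tp_rank,
        })
    return attn_tp_size, layout


# The launch above: --tp 16 --dp-size 2 --nnodes 2
tp_size = 16
attn_tp_size, layout = dp_attention_layout(tp_size=tp_size, dp_size=2, nnodes=2)

# Condition from this PR's diff for wo_b (a RowParallelLinear)
wo_b_reduce_results = attn_tp_size == tp_size and attn_tp_size > 1
print(attn_tp_size, wo_b_reduce_results)  # 8 False
```

Because `reduce_results` is False for this launch, `wo_b` does not all-reduce over the full 16-GPU TP group; per the review summary below, the PR adds an explicit all-reduce for this partial-TP case instead. Under the contiguous-placement assumption, each DP group's attention shards sit entirely on one node.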

### MMLU
python3 bench_sglang.py --parallel 128 --backend srt --host http://127.0.0.1 --port 8080 --data_dir /data00/mmlu
100%|████████████████████████████████████████████| 14042/14042 [23:40<00:00,  9.88it/s]
subject: abstract_algebra, #q:100, acc: 0.830
subject: anatomy, #q:135, acc: 0.904
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.928
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.710
subject: college_computer_science, #q:100, acc: 0.930
subject: college_mathematics, #q:100, acc: 0.840
subject: college_medicine, #q:173, acc: 0.890
subject: college_physics, #q:102, acc: 0.971
subject: computer_security, #q:100, acc: 0.870
subject: conceptual_physics, #q:235, acc: 0.953
subject: econometrics, #q:114, acc: 0.851
subject: electrical_engineering, #q:145, acc: 0.910
subject: elementary_mathematics, #q:378, acc: 0.963
subject: formal_logic, #q:126, acc: 0.802
subject: global_facts, #q:100, acc: 0.780
subject: high_school_biology, #q:310, acc: 0.965
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.903
subject: high_school_geography, #q:198, acc: 0.955
subject: high_school_government_and_politics, #q:193, acc: 0.995
subject: high_school_macroeconomics, #q:390, acc: 0.944
subject: high_school_mathematics, #q:270, acc: 0.844
subject: high_school_microeconomics, #q:238, acc: 0.975
subject: high_school_physics, #q:151, acc: 0.921
subject: high_school_psychology, #q:545, acc: 0.972
subject: high_school_statistics, #q:216, acc: 0.917
subject: high_school_us_history, #q:204, acc: 0.936
subject: high_school_world_history, #q:237, acc: 0.958
subject: human_aging, #q:223, acc: 0.874
subject: human_sexuality, #q:131, acc: 0.893
subject: international_law, #q:121, acc: 0.967
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.902
subject: management, #q:103, acc: 0.981
subject: marketing, #q:234, acc: 0.966
subject: medical_genetics, #q:100, acc: 0.970
subject: miscellaneous, #q:783, acc: 0.963
subject: moral_disputes, #q:346, acc: 0.879
subject: moral_scenarios, #q:895, acc: 0.848
subject: nutrition, #q:306, acc: 0.928
subject: philosophy, #q:311, acc: 0.923
subject: prehistory, #q:324, acc: 0.954
subject: professional_accounting, #q:282, acc: 0.890
subject: professional_law, #q:1534, acc: 0.745
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.922
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.890
subject: sociology, #q:201, acc: 0.950
subject: us_foreign_policy, #q:100, acc: 0.940
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.924
Total latency: 1420.867
Average accuracy: 0.896
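The reported Average accuracy of 0.896 matches the question-weighted mean of the per-subject rows (about 12,583 correct out of 14,042 questions), not the plain mean of the 57 subject scores, which is about 0.904; large subjects with lower scores, such as professional_law (1,534 questions at 0.745), pull it down. A small sketch for recomputing both from the log lines, assuming the `subject: <name>, #q:<n>, acc: <a>` format shown above (`summarize_mmlu` is a hypothetical helper, not part of `bench_sglang.py`). Per-subject accuracies are rounded to three decimals, so recomputed values can differ in the last digit.

```python
import re
from typing import Iterable, Tuple

# Matches lines such as "subject: anatomy, #q:135, acc: 0.904"
LINE_RE = re.compile(r"subject: ([^,]+), #q:(\d+), acc: ([0-9.]+)")


def summarize_mmlu(lines: Iterable[str]) -> Tuple[float, float, int]:
    """Return (question-weighted accuracy, unweighted subject mean, total questions)."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            rows.append((int(m.group(2)), float(m.group(3))))
    if not rows:
        raise ValueError("no per-subject lines found")
    total_q = sum(n for n, _ in rows)
    weighted = sum(n * acc for n, acc in rows) / total_q  # each question counts once
    unweighted = sum(acc for _, acc in rows) / len(rows)   # each subject counts once
    return weighted, unweighted, total_q


sample = [
    "subject: professional_law, #q:1534, acc: 0.745",
    "subject: virology, #q:166, acc: 0.578",
    "subject: miscellaneous, #q:783, acc: 0.963",
]
weighted, unweighted, total_q = summarize_mmlu(sample)
print(f"{weighted:.3f} {unweighted:.3f} {total_q}")  # 0.803 0.762 2483
```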
### GSM8K
python3 bench_sglang.py --host http://localhost  --port 8080 --data-path /data00 --num-questions 5000 --parallel 100
100%|██████████████████████████████████████████████| 1319/1319 [07:59<00:00,  2.75it/s]
Accuracy: 0.952
Invalid: 0.000
Latency: 479.254 s
Output throughput: 252.254 token/s
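The GSM8K figures are internally consistent, which is a quick sanity check on the run. 1,319 questions were evaluated (the size of the GSM8K test split, so `--num-questions 5000` presumably just takes all of them), and 1,319 / 479.254 s ≈ 2.75 questions/s matches the progress bar. Assuming Output throughput is total generated tokens divided by wall-clock latency, the run produced roughly 121k output tokens, about 92 tokens per question on average. A minimal sketch of that arithmetic:

```python
num_questions = 1319         # from the progress bar
latency_s = 479.254          # "Latency" line
output_tok_per_s = 252.254   # "Output throughput" line

questions_per_s = num_questions / latency_s
# Assumption: throughput = total output tokens / wall-clock latency
total_output_tokens = output_tok_per_s * latency_s
tokens_per_question = total_output_tokens / num_questions

print(f"{questions_per_s:.2f} q/s, {total_output_tokens:,.0f} tokens, "
      f"{tokens_per_question:.1f} tokens/question")
```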

@Fridge003

Speed Tests and Profiling

Checklist

- Format your code according to the Format code with pre-commit guide.
- Add unit tests according to the Run and add unit tests guide.
- Update documentation according to the Write documentations guide.
- Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
- Follow the SGLang code style guidance.

Review and Merge Process

1. Ping Merge Oncalls to start the process. See the PR Merge Process.
2. Get approvals from CODEOWNERS and other reviewers.
3. Trigger CI tests with comments or contact authorized users to do so.
   - Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request refines the tensor parallelism (TP) logic in the DeepseekV4 model, specifically adjusting how results are reduced and gathered across TP groups. It modifies the reduce_results condition in RowParallelLinear, introduces explicit all-reduce calls for partial TP scenarios, and replaces dp_gather_partial with dp_gather_replicate to correctly handle replicated activations. Review feedback suggests that removing a safety assertion for zero-sized tensors could lead to hangs and recommends simplifying a redundant boolean expression in the reduce_results assignment.
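The switch from dp_gather_partial to dp_gather_replicate matters because, as we understand SGLang's `dp_attention` module, both gathers zero-fill a global buffer, copy each DP group's local tokens into that group's slot, and then sum all-reduce over the full TP group. The only difference is which ranks write: in partial mode every attention-TP rank writes its values (they are partial sums that the all-reduce completes), while in replicate mode only attention-TP rank 0 writes (every rank already holds the full value). The toy model below (`toy_dp_gather` is hypothetical and uses scalar tokens instead of tensors) shows why, once the new explicit all-reduce has made activations replicated within the attention-TP group, a partial gather would count each value `attn_tp_size` times.

```python
from typing import List


def toy_dp_gather(local_tokens: List[List[float]], attn_tp_size: int, is_partial: bool) -> List[float]:
    """Toy DP gather: zero-fill + copy into this DP group's slot + sum all-reduce.

    local_tokens[tp_rank] holds the scalar token values that TP rank has for its
    DP group; ranks are grouped as tp_rank = attn_dp_rank * attn_tp_size + attn_tp_rank.
    """
    tp_size = len(local_tokens)
    assert tp_size % attn_tp_size == 0
    dp_size = tp_size // attn_tp_size
    # Every rank of a DP group holds the same number of tokens.
    counts = [len(local_tokens[d * attn_tp_size]) for d in range(dp_size)]
    starts = [sum(counts[:d]) for d in range(dp_size)]
    total = sum(counts)

    contributions = []
    for tp_rank, tokens in enumerate(local_tokens):
        attn_dp_rank, attn_tp_rank = divmod(tp_rank, attn_tp_size)
        buf = [0.0] * total  # zero-filled global buffer on this rank
        if is_partial or attn_tp_rank == 0:
            for i, value in enumerate(tokens):
                buf[starts[attn_dp_rank] + i] = value
        contributions.append(buf)

    # Sum all-reduce over the whole TP group: every rank ends up with this result.
    return [sum(column) for column in zip(*contributions)]


# attn_tp_size=2, dp_size=2 (4 TP ranks). DP group 0 has 2 tokens, group 1 has 1.
replicated = [[1.0, 2.0], [1.0, 2.0], [3.0], [3.0]]
partial = [[0.25, 0.5], [0.75, 1.5], [1.0], [2.0]]  # per-rank partial sums

print(toy_dp_gather(replicated, 2, is_partial=False))  # [1.0, 2.0, 3.0]
print(toy_dp_gather(partial, 2, is_partial=True))      # [1.0, 2.0, 3.0]
print(toy_dp_gather(replicated, 2, is_partial=True))   # [2.0, 4.0, 6.0]: double-counted
```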

gemini-code-assist · 2026-05-11T07:17:22Z

        if not get_attn_tp_context().input_scattered and x.shape[0] == 0:
-            assert (
-                not self.wo_b.reduce_results
-            ), "short-circuiting allreduce will lead to hangs"
            return x


The removal of the assertion that checks self.wo_b.reduce_results could lead to hangs if not handled carefully. The original assertion assert (not self.wo_b.reduce_results) was a safeguard against short-circuiting an all-reduce operation when x.shape[0] == 0.

While the logic for input_scattered seems to cover the DP case, it's safer to retain a check to prevent potential hangs in non-DP scenarios where token distribution might be uneven across ranks, or if all ranks have zero tokens but reduce_results is true. A more robust approach would be to handle the zero-sized tensor case within the RowParallelLinear layer itself, but as a direct fix, consider reintroducing a check or ensuring that x.shape[0] == 0 implies all ranks have zero tokens when reduce_results is true.
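The hang the review warns about is a general property of collectives: an all-reduce completes only when every rank in the group enters it, so a rank that returns early on an empty batch leaves its peers waiting. The sketch below models this with Python threads and a `threading.Barrier` standing in for the all-reduce rendezvous; it is a simulation, not `torch.distributed`, and `simulate_allreduce` is hypothetical. A timeout replaces what would be an indefinite block in NCCL. It also covers the case the review describes as safe: if every rank has zero tokens and all of them skip the collective, nobody waits.

```python
import threading
from typing import List, Optional


def simulate_allreduce(num_ranks: int, tokens_per_rank: List[int],
                       short_circuit_empty: bool, timeout: float = 0.5) -> List[Optional[int]]:
    """Toy sum all-reduce across threads; None marks a rank stuck in the collective."""
    barrier = threading.Barrier(num_ranks, timeout=timeout)  # rendezvous of all ranks
    partial = [0] * num_ranks
    results: List[Optional[int]] = [None] * num_ranks

    def rank_fn(rank: int) -> None:
        n = tokens_per_rank[rank]
        if short_circuit_empty and n == 0:
            results[rank] = 0  # early return: this rank never joins the collective
            return
        partial[rank] = n
        try:
            barrier.wait()  # blocks until every rank arrives
        except threading.BrokenBarrierError:
            return  # timed out; a real NCCL all-reduce would keep waiting (a hang)
        results[rank] = sum(partial)

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(simulate_allreduce(2, [3, 0], short_circuit_empty=True))   # [None, 0]: rank 0 stuck
print(simulate_allreduce(2, [3, 0], short_circuit_empty=False))  # [3, 3]
print(simulate_allreduce(2, [0, 0], short_circuit_empty=True))   # [0, 0]: all ranks skip
```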

gemini-code-assist · 2026-05-11T07:17:22Z

            bias=False,
            quant_config=quant_config,
-            reduce_results=attn_tp_size > 1,
+            reduce_results=attn_tp_size == get_tensor_model_parallel_world_size() and attn_tp_size > 1,


The condition and attn_tp_size > 1 is redundant here.

If get_tensor_model_parallel_world_size() is 1, attn_tp_size is also 1, making attn_tp_size > 1 false. The whole expression is false.

If get_tensor_model_parallel_world_size() is greater than 1, and attn_tp_size == get_tensor_model_parallel_world_size(), then attn_tp_size > 1 is implicitly true.

The RowParallelLinear layer already checks if tp_size > 1 before performing an all-reduce, so setting reduce_results=True when tp_size=1 has no effect. Simplifying this condition improves clarity.

Suggested change:

- reduce_results=attn_tp_size == get_tensor_model_parallel_world_size() and attn_tp_size > 1,
+ reduce_results=attn_tp_size == get_tensor_model_parallel_world_size(),
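The reviewer's argument can be checked mechanically. The sketch below assumes, as the review states, that RowParallelLinear only all-reduces when `reduce_results` is true and its own TP size is greater than 1; `row_parallel_does_allreduce` is a hypothetical stand-in for that check, and the sketch further assumes `wo_b` is sharded over the attention-TP group (layer TP size = `attn_tp_size`). Over all layouts where `attn_tp_size` divides the TP world size, the two expressions disagree only when the world size is 1, and there the layer skips the all-reduce either way.

```python
from typing import List, Tuple


def row_parallel_does_allreduce(reduce_results: bool, layer_tp_size: int) -> bool:
    # Assumed RowParallelLinear behavior: all-reduce only if requested and actually sharded
    return reduce_results and layer_tp_size > 1


def original_expr(attn_tp_size: int, world_size: int) -> bool:
    return attn_tp_size == world_size and attn_tp_size > 1


def simplified_expr(attn_tp_size: int, world_size: int) -> bool:
    return attn_tp_size == world_size


mismatches: List[Tuple[int, int]] = []
for world_size in (1, 2, 4, 8, 16):
    for dp_size in range(1, world_size + 1):
        if world_size % dp_size:
            continue  # only layouts where dp_size divides the TP world size
        attn_tp_size = world_size // dp_size
        a = original_expr(attn_tp_size, world_size)
        b = simplified_expr(attn_tp_size, world_size)
        if a != b:
            mismatches.append((world_size, attn_tp_size))
        # The flag may differ, but the effective all-reduce decision never does
        assert row_parallel_does_allreduce(a, attn_tp_size) == row_parallel_does_allreduce(b, attn_tp_size)

print(mismatches)  # [(1, 1)]: flag differs, but a 1-way layer never all-reduces
```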

zhangxiaolei123456 · 2026-05-12T08:24:56Z

/tag-and-rerun-ci

Update deepseek_v4.py

ad78d18

github-actions Bot added the deepseek label May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026


zhangxiaolei123456 added 5 commits May 12, 2026 10:24

Merge branch 'main' into main_deepseek_dp_attention

bc4135a

Merge branch 'main' into main_deepseek_dp_attention

a9b5b05

Merge branch 'main' into main_deepseek_dp_attention

3944eda

Update deepseek_v4.py

45c6521

Update deepseek_v4.py

d4da6dc

Merge branch 'main' into main_deepseek_dp_attention

4813ac5

shiyu7 mentioned this pull request May 12, 2026

[rebase]Deepseek_v4 support w4(mxfp4)a16 on hopper #24986

Merged

5 tasks

zhangxiaolei123456 and others added 2 commits May 14, 2026 10:44

Merge branch 'main' into main_deepseek_dp_attention

eb1d8df

Feat/dp attn 0514 (#490)

5ec497c

zhangxiaolei123456 requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners May 14, 2026 09:34

Merge branch 'main' into main_deepseek_dp_attention

fe29439

zhangxiaolei123456 wants to merge 10 commits into sgl-project:main from bytedance-iaas:main_deepseek_dp_attention
