Deepseek-v4-Pro share expert tp1 by zhangxiaolei123456 · Pull Request #24949 · sgl-project/sglang

zhangxiaolei123456 · 2026-05-11T07:02:06Z

Motivation

The shared expert cannot be deployed with TP16, so this PR implements a TP1 deployment of the shared expert. The DeepSeekV4 branch PR is #23911.

Co-authored-by: shiyu7

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request introduces the SGLANG_SHARED_EXPERT_TP1 environment variable to allow disabling Tensor Parallelism for shared experts in DeepSeek-V2 models. This change is designed to support checkpoints where shared scales are not divisible by the global TP size. The implementation ensures that when shared experts are replicated (TP1), their outputs are added after the all-reduce operation to avoid incorrect summation. A review comment suggests adding a null check for shared_output in forward_normal_dual_stream to improve robustness and ensure consistency with the _post_combine_hook method.
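The ordering issue described in the summary can be illustrated with a small, framework-free sketch. It is not the code from this PR: `all_reduce_sum`, `combine_before_all_reduce`, and `combine_after_all_reduce` are hypothetical helpers that simulate a sum all-reduce over plain Python lists, assuming each TP rank holds a partial routed-expert output and an identical (replicated) shared-expert output.

```python
from typing import List

Vector = List[float]


def all_reduce_sum(per_rank: List[Vector]) -> List[Vector]:
    """Simulate a sum all-reduce: every rank receives the elementwise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def add(a: Vector, b: Vector) -> Vector:
    """Elementwise addition of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def combine_before_all_reduce(partials: List[Vector], shared: Vector) -> Vector:
    """Incorrect for replicated shared experts: the shared output is added on
    every rank first, so the all-reduce sums it once per TP rank."""
    return all_reduce_sum([add(p, shared) for p in partials])[0]


def combine_after_all_reduce(partials: List[Vector], shared: Vector) -> Vector:
    """Order described in the review summary: reduce the routed partial sums
    first, then add the replicated shared output exactly once."""
    return add(all_reduce_sum(partials)[0], shared)


# Hypothetical data: 4 TP ranks, hidden size 2.
routed_partials = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shared_output = [10.0, 20.0]  # identical on every rank under TP1

print(combine_after_all_reduce(routed_partials, shared_output))   # [26.0, 40.0]
print(combine_before_all_reduce(routed_partials, shared_output))  # [56.0, 100.0]
```

In the second call the shared output is counted four times (once per simulated rank), which is the over-summation the PR avoids by deferring the addition.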

gemini-code-assist · 2026-05-11T07:09:43Z

            final_hidden_states = tensor_model_parallel_all_reduce(final_hidden_states)
+        # TP1 shared experts are replicated, so add them after all-reduce to
+        # avoid summing the same shared output once per TP rank.
+        if self._shared_expert_tp1:


For consistency and robustness, it's good practice to check if shared_output is not None before performing the addition, similar to the change in _post_combine_hook. Although the call site for forward_normal_dual_stream checks for hidden_states.shape[0] > 0, _forward_shared_experts could potentially return None if self.shared_experts is not initialized (e.g., if n_shared_experts is 0).

Suggested change

-        if self._shared_expert_tp1:
+        if shared_output is not None and self._shared_expert_tp1:
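A minimal sketch of the guarded addition the reviewer suggests, using plain lists in place of tensors. The function `add_shared_after_all_reduce` and its parameters are hypothetical stand-ins, not names from `deepseek_v2.py`; only the condition mirrors the suggested change.

```python
from typing import List, Optional


def add_shared_after_all_reduce(
    final_hidden_states: List[float],
    shared_output: Optional[List[float]],
    shared_expert_tp1: bool,
) -> List[float]:
    """Add the replicated shared-expert output only when it exists and the
    TP1 shared-expert mode is enabled; otherwise return the input unchanged."""
    if shared_output is not None and shared_expert_tp1:
        return [h + s for h, s in zip(final_hidden_states, shared_output)]
    return final_hidden_states


# With no shared experts (shared_output is None) the guard skips the addition
# instead of raising a TypeError.
print(add_shared_after_all_reduce([1.0, 2.0], None, True))          # [1.0, 2.0]
print(add_shared_after_all_reduce([1.0, 2.0], [0.5, 0.5], True))    # [1.5, 2.5]
print(add_shared_after_all_reduce([1.0, 2.0], [0.5, 0.5], False))   # [1.0, 2.0]
```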

Fridge003 · 2026-05-12T02:02:27Z

@zhangxiaolei123456 Please fix lint and post accuracy results after rebasing to main

Fridge003 · 2026-05-12T03:15:40Z

/rerun-stage stage-c-test-dsv4-4-gpu-b200

Fridge003 · 2026-05-12T03:16:01Z

/rerun-stage stage-c-test-dsv4-8-gpu-h200

github-actions · 2026-05-12T03:16:10Z

🚀 Triggered stage-c-test-dsv4-4-gpu-b200 to run independently (skipping dependencies). View workflow run

github-actions · 2026-05-12T03:16:31Z

🚀 Triggered stage-c-test-dsv4-8-gpu-h200 to run independently (skipping dependencies). View workflow run

…ack) Brings in upstream sgl-project/sglang main commits since 096ad02 (merge base, Laguna-XS.2 model support). Total: 28 upstream commits composed.

Custom-stack files preserved intact (entirely-ours, byte-identical to origin/main):
- Blackwell CuTe kernel suite (warp_decode_cute, g1_attention_cute, gated_norm_cute, layersplit_cute, fused_store_index_cache)
- TurboQuant 2.5-bit dense KV cache path
- HIGGS 2-bit dense KV cache path (with split-K decode)
- NVFP4 IndexCache dispatcher (active gate)
- quantization_config_dispatch (HF-config-driven runtime routing)
- All custom server-args flags and runtime methods preserved

Verification:
- 200+ merged Python files compile cleanly
- Dispatcher symbol presence verified
- HIGGS pool / TurboQuant pool classes present at expected lines
- compressed_tensors_w4a4_nvfp4_moe imports clean
- All custom server-args flags present (enable_higgs_dense_2bit_kv_cache, enable_turboquant_dense_kv_cache, turboquant_dense_kv_preset, indexer_quantization_declared, higgs_mla_decode_num_splits, etc.)

Manual-merged shared files (auto-merge gave broken/mixed output; cleaned up post-merge):
- python/sglang/srt/disaggregation/mooncake/conn.py: upstream's PR#24932 refactored maybe_send_extra into a state-types-loop. Replayed our LayerSplit NSA state-index-length-mismatch check inside the SWA/NSA branch of the new loop body.
- sgl-kernel/python/sgl_kernel/__init__.py: upstream's PR#23449 (Apple Silicon Metal kernel) wrapped the entire module body in `if darwin/arm64: from sgl_kernel.metal import * else: ...`. The auto-merge duplicated the file body; rewrote cleanly with upstream's structure and re-injected our `g1_gate_forward`, `warp_decode_cute_moe_forward`, and `warp_decode_cute_moe_packed_forward` imports plus `g1_gate_forward` in _DEBUG_EXPORT_NAMES.
- python/sglang/srt/managers/scheduler_output_processor_mixin.py: line 628 still referenced `result.num_accepted_drafts` (renamed by PR sgl-project#25038 to `num_correct_drafts`). Renamed in place.
- python/sglang/srt/observability/scheduler_metrics_mixin.py: a block around the spec-decode logging path had mixed old/new names from auto-merge (lines 553/557/560). Renamed `spec_num_accepted_tokens` -> `spec_num_accept_tokens` and local `num_accepted_drafts` -> `num_correct_drafts` to match the rest of the file.
- test/test_smc_info.py: stub Req mock used the old field names `spec_accepted_drafts` and `update_spec_acceptance_histogram`. Renamed to `spec_num_correct_drafts` and `update_spec_correct_drafts_histogram` per PR sgl-project#24081.

Auto-merge cleanly integrated upstream changes to:
- server_args.py (new fields: prefill_only_disable_kv_cache, weight_loader_drop_cache_after_load, prefill_delayer_queue_min_ratio, prefill_delayer_max_delay_ms, speculative_draft_window_size, etc.)
- mem_cache/memory_pool.py (new NoOpMHATokenToKVPool)
- model_executor/model_runner_kv_cache_mixin.py (NoOpMHATokenToKVPool pool factory + _validate_prefill_only_disable_kv_cache_pool_family)
- layers/attention/nsa_backend.py (spec rename num_accepted_drafts -> num_correct_drafts; num_accepted_tokens -> num_accept_tokens)
- layers/attention/nsa/nsa_indexer.py (new _apply_q_scale_and_softmax_scale compile method; torch.mm replaces deep_gemm wrapper)
- 28+ disaggregation/spec/runner files with mostly clean upstream-side-only integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

----- upstream commit subjects (28) -----
fd3eb77 [Cookbook]: add Laguna-XS.2 (Poolside) (sgl-project#24730)
6be1a45 Fix swa component host hit (sgl-project#25085)
693f497 [NPU] use causal_conv1d_update_v2 for performance (sgl-project#24595)
1efe9e2 [Bug Fix] Reject incompatible combination of --disable-cuda-graph-padding and --enable-torch-compile (sgl-project#23903)
8d27ce7 Optimize uvicorn startup command (sgl-project#25041)
b35fd5f [fix] skip legacy minicpmv conv template for MiniCPM-V 4.6 (sgl-project#24998)
7582237 [Tiny Fix] Disable BCG when inner layer_model unresolved (sgl-project#25021)
ca3bc05 Deepseek-v4-Pro share expert tp1 (sgl-project#24949)
a72d3ae [Spec] Multi-layer mamba scatter cleanup; fix positional call bug (sgl-project#25030)
7128533 Revert "Migrate Intel CPU cases to the test/registered." (sgl-project#25044)
1f985c5 [Spec] Rename `accepted_indices` -> `accept_indices`; drop `_token_id` suffix per Rule 5 (sgl-project#25038)
ecf5d84 Migrate Intel CPU cases to the test/registered. (sgl-project#22670)
d7f4761 [PD] Refactor hybrid state transfer (sgl-project#24932)
91907b7 [UnifiedTree]: Fix Unified HiCache tombstone lock release replay (sgl-project#24972)
4ad63ad [Spec] Rename `accepted_drafts` -> `correct_drafts` for unambiguous naming (sgl-project#24081)
6bfb365 [PD] Rate limit prefill inflight polling warnings (sgl-project#24967)
6bb79c1 [Linear Attn] Add CUSTOM enum and plugin extensibility for kernel backends (sgl-project#24937)
cfc41d5 Fix kimi k2.5 mla eagle + dp attention (sgl-project#25033)
0f3932c [Fix] Qwen3-ASR config: set thinker_config before super().__init__ (sgl-project#24187)
f526e3f [Spec] Mamba scatter cleanup; fix multi-layer positional bug; dflash naming (sgl-project#25029)
10375a1 [NIXL][XPU] Fix uint64 overflow for mismatched P/D TP sizes (e.g. prefill_tp=1, decode_tp=2) (sgl-project#24648)
0a37d24 [diffusion] hardware: support sage attention backend on MUSA (attn backend, 21/N) (sgl-project#24752)
5495026 [HiCache] feat: default storage prefetch timeout (sgl-project#23309)
186eb42 Feat: Support SWA (Sliding Window Attention) for EAGLE-3 drafter (sgl-project#24664)
a75b79e Feat: Support newer EAGLE-3 drafters (sgl-project#24663)
f3a8189 [Spec] Internal rename per N2 v2 naming rule (sgl-project#25014)
bfc2eda [MUSA] Use MUSA-optimized operators in piecewise CUDA graph (sgl-project#23633)
74d70af [Apple Silicon] Add Metal kernel support in sgl-kernel (sgl-project#23449)

zhangxiaolei123456 added 5 commits May 11, 2026 14:18

Update environ.py

0d4bd0c

Update deepseek_v2.py

9d12b6b

Update model_runner.py

3d5b495

Update deepseek_v2.py

1c2bc5c

Update model_runner.py

1753cd7

zhangxiaolei123456 requested review from Fridge003, Ying1123, ch-wan, fzyzcjy, hnyls2002, ispobock and merrymercy as code owners May 11, 2026 07:02

github-actions Bot added the deepseek label May 11, 2026

Merge branch 'main' into main_deepseek_share_expert_tp1

0b1b57e

gemini-code-assist Bot reviewed May 11, 2026

View reviewed changes

Fridge003 approved these changes May 12, 2026

View reviewed changes

zhangxiaolei123456 added 2 commits May 12, 2026 11:07

Update deepseek_v2.py

3daa429

Merge branch 'main' into main_deepseek_share_expert_tp1

48683fe

Merge branch 'main' into main_deepseek_share_expert_tp1

b8fb040

Fridge003 merged commit ca3bc05 into sgl-project:main May 12, 2026
63 of 73 checks passed

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

Deepseek-v4-Pro share expert tp1 (sgl-project#24949)

fb9ddaf

xjpang pushed a commit to xjpang/sglang that referenced this pull request May 13, 2026

Deepseek-v4-Pro share expert tp1 (sgl-project#24949)

4e8b0f8

Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

Deepseek-v4-Pro share expert tp1 (sgl-project#24949)

352d5f4

Fridge003 merged 9 commits into sgl-project:main from bytedance-iaas:main_deepseek_share_expert_tp1
