[DeepSeek V4] Fix meaningless numbers in chat output by adding swiglu_limit clamp to DeepseekV2MLP#23776
Conversation
Routed-expert path already reads config.swiglu_limit and propagates it through FusedMoE -> MoeRunnerConfig.gemm1_clamp_limit -> triton/deepgemm kernels for SiLU clamping. However, the shared-expert path and the dense-MLP path use DeepseekV2MLP which never receives swiglu_limit; DeepseekV2MLP.forward calls bare SiluAndMul() on gate_up, leaving SiLU output unbounded. For DeepSeek-V4 checkpoints trained with swiglu_limit=10.0, the missing clamp causes shared-expert SiLU output to grow to ±2000+ during inference, polluting the residual stream and degrading lm_head logits at sentence boundaries. Symptoms include deterministic spurious tokens like '16', '14:05', or special tokens leaked into chat output. Fixes sgl-project#23752 Co-authored-by: Yongfei Xu <xu-yfei@users.noreply.github.com>
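For readers unfamiliar with the clamp convention, here is a minimal pure-Python sketch of the `swiglu_limit` semantics described above: the gate half is clamped from above only, the up half on both sides, mirroring what `_swiglu_silu_clamp_mul` does in the routed-expert kernel. The scalar helper names are illustrative, not part of the codebase:

```python
import math
from typing import Optional

def silu(x: float) -> float:
    # SiLU / swish: x * sigmoid(x)
    return x * (1.0 / (1.0 + math.exp(-x)))

def swiglu_clamped(gate: float, up: float, limit: Optional[float]) -> float:
    # Illustrative scalar sketch of the clamp semantics (not the actual
    # kernel): gate is clamped from above only, up is clamped on both
    # sides, then silu(gate) * up.
    if limit is not None:
        gate = min(gate, limit)
        up = max(-limit, min(up, limit))
    return silu(gate) * up
```

With `limit=10.0`, a runaway gate activation of 2000 contributes `silu(10.0) * up` instead of roughly `2000 * up`, which is exactly the boundedness the shared-expert path was missing.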
cc @Fridge003 @fzyzcjy @ispobock for review — this is a small but high-impact fix for #23752 (V4-Pro chat output corruption)
Code Review
This pull request introduces a swiglu_limit parameter to the DeepSeek-V2 model to clamp activations within the MLP layer. The review feedback suggests optimizing the implementation by pre-casting the limit to a float during initialization and utilizing in-place clamping operations to reduce memory overhead and avoid redundant tensor concatenations in the forward pass.
```python
) -> None:
    super().__init__()
    self.tp_size = tp_size
    self.swiglu_limit = swiglu_limit
```
```python
if self.swiglu_limit is not None:
    _g, _u = gate_up.chunk(2, dim=-1)
    _lim = float(self.swiglu_limit)
    gate_up = torch.cat(
        [_g.clamp(max=_lim), _u.clamp(min=-_lim, max=_lim)], dim=-1
    )
```
Using torch.cat creates a new tensor and involves extra memory allocation and copying. Since gate_up is a fresh tensor returned by the linear projection, you can perform the clamping in-place on its chunks. This is more efficient and avoids unnecessary overhead in the forward pass.
```python
if self.swiglu_limit is not None:
    _lim = self.swiglu_limit
    _g, _u = gate_up.chunk(2, dim=-1)
    _g.clamp_(max=_lim)
    _u.clamp_(min=-_lim, max=_lim)
```
@GaoYusong Thanks, we are checking.
MoE values grow uncontrollably.
Fixes #23752
Summary
DeepSeek-V4 chat output contains deterministic spurious tokens (e.g. plain `16`, `14:05`, `<|end▁of▁file|>` followed by training-data file paths and decode loops) at sentence-boundary positions. Reported in #23752 with an exemplar prompt. The `16` tokens have IDs `[49, 54]` (ASCII `'1' '6'`); they are plain text picked by argmax at `temperature=0`.

Root cause
The routed-expert path already reads `config.swiglu_limit` and propagates it through `FusedMoE` → `MoeRunnerConfig.gemm1_clamp_limit` → triton/deepgemm kernels for SiLU clamping (see `deepseek_v2.py:485` → `layers/moe/moe_runner/triton_utils/fused_moe.py:299` `_swiglu_silu_clamp_mul`). However, the shared-expert path and the dense-MLP path use `DeepseekV2MLP`, which never receives `swiglu_limit`. `DeepseekV2MLP.forward` calls bare `SiluAndMul()` on `gate_up`, leaving SiLU output unbounded.

For DeepSeek-V4 checkpoints trained with `swiglu_limit=10.0` (e.g. V4-Pro / V4-Flash with the `2604B` submode, see `configs/config_backup_large.json`), the missing clamp causes shared-expert SiLU output to grow to ±2000+ during inference. This pollutes the residual stream via `final_hidden_states += shared_output`, propagates through `mhc_post`, and degrades `lm_head` logits at sentence-boundary positions, producing the spurious tokens reported in #23752.

A `mhc_post` debug hook on V4-Pro (multiple layers, contributed by @xu-yfei) shows the magnitude. `mhc_post`'s output formula is `post * x + sum(comb * residual, dim=1)`, where `post ∈ [0, 2]` (sigmoid-bounded) and `comb` is Sinkhorn-normalized to a doubly-stochastic matrix with elements in `[0, 1]`. Neither of these can blow up the output by itself, so the ±2000+ values trace back to the `residual`/`x` inputs being poisoned by unclamped shared-expert SiLU output.

The 2604B-specific comment at `deepseek_v4.py:1432-1433` already flagged the issue: disabling shared-expert fusion was implemented, but the corresponding clamp on the unfused `DeepseekV2MLP` path was missing. This PR closes that gap.

Fix
Three minimal edits to `python/sglang/srt/models/deepseek_v2.py`:

1. `DeepseekV2MLP.__init__`: add a `swiglu_limit: Optional[float] = None` kwarg, stored as `self.swiglu_limit`.
2. `DeepseekV2MLP.forward`: clamp the `gate_up` chunks before `SiluAndMul` if `self.swiglu_limit` is set, mirroring the `_swiglu_silu_clamp_mul` semantics in the routed-expert kernel.
3. Both call sites (`shared_experts` at L530, dense-MLP `self.mlp` at L2589) pass `swiglu_limit=getattr(config, "swiglu_limit", None)`, the same pattern already used at L485 for routed experts.

No env vars are introduced; behavior is gated entirely on `config.swiglu_limit`. Models that don't set it (e.g. plain V2/V3) retain current behavior (no clamp).

Verification
- `16` tokens eliminated at all 3 positions (including the `16` interjections and the `the16 the16 ...` decode loop in the long-form case).
- Verified on commit `97d1a672` of the `deepseek_v4` branch.
`swiglu_limit=None` is the existing implicit behavior; adding the kwarg with default `None` is non-breaking. Models without `config.swiglu_limit` see no behavioral change.
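The non-breaking default can be illustrated with a small standalone sketch of the gating (a hypothetical helper for illustration, not the actual `DeepseekV2MLP.forward`):

```python
from typing import Optional

import torch

def clamp_gate_up(gate_up: torch.Tensor,
                  swiglu_limit: Optional[float]) -> torch.Tensor:
    """Sketch of the forward-path change: a no-op when swiglu_limit is
    None (plain V2/V3 configs), otherwise clamp the gate/up chunks the
    way the PR does before SiluAndMul."""
    if swiglu_limit is None:
        return gate_up  # existing implicit behavior, unchanged
    lim = float(swiglu_limit)
    g, u = gate_up.chunk(2, dim=-1)
    return torch.cat([g.clamp(max=lim), u.clamp(min=-lim, max=lim)], dim=-1)

x = torch.tensor([[2000.0, 1.0, 5.0, -2000.0]])
print(clamp_gate_up(x, None) is x)            # True: unchanged tensor
print(clamp_gate_up(x, 10.0).tolist())        # [[10.0, 1.0, 5.0, -10.0]]
```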