[DeepSeek V4] Fix meaningless numbers in chat output by adding swiglu_limit clamp to DeepseekV2MLP#23776

Merged
Fridge003 merged 1 commit into sgl-project:deepseek_v4 from antgroup:deepseek_v4 on Apr 27, 2026

Conversation

@GaoYusong
Contributor

Fixes #23752

Summary

DeepSeek-V4 chat output contains deterministic spurious tokens (e.g. plain 16, 14:05, <|end▁of▁file|> followed by training-data file paths and decode loops) at sentence-boundary positions. Reported in #23752 with this exemplar prompt:

"...Thiopentone is a barbiturate with anticonvulsant properties,16 used sometimes in status epilepticus. Propofol also has anticonvulsant effects and is16 used in refractory status epilepticus..."

The 16 tokens have IDs [49, 54] (ASCII '1' '6') — they are plain text picked by argmax at temperature=0.

Root cause

The routed-expert path already reads config.swiglu_limit and propagates it through FusedMoE -> MoeRunnerConfig.gemm1_clamp_limit -> the triton/deepgemm kernels for SiLU clamping (see deepseek_v2.py:485 and _swiglu_silu_clamp_mul at layers/moe/moe_runner/triton_utils/fused_moe.py:299). However, the shared-expert path and the dense-MLP path use DeepseekV2MLP, which never receives swiglu_limit: DeepseekV2MLP.forward calls bare SiluAndMul() on gate_up, leaving the SiLU output unbounded.

For DeepSeek-V4 checkpoints trained with swiglu_limit=10.0 (e.g. V4-Pro / V4-Flash with the 2604B submode, see configs/config_backup_large.json), the missing clamp causes shared-expert SiLU output to grow to ±2000+ during inference. This pollutes the residual stream via final_hidden_states += shared_output, propagates through mhc_post, and degrades lm_head logits at sentence-boundary positions, producing the spurious tokens reported in #23752.
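To see the scale of the effect, a toy calculation (illustrative numbers only; the clamp convention follows the PR code shown under "Fix" below, with swiglu_limit=10.0):

    import torch
    import torch.nn.functional as F

    # Illustrative only: why an unclamped SwiGLU can reach +/-2000.
    g = torch.tensor([50.0])  # hypothetical large gate pre-activation
    u = torch.tensor([40.0])  # hypothetical large up pre-activation

    # Without the clamp, silu(g) ~= g for large g, so the product grows like g*u.
    print((F.silu(g) * u).item())  # ~2000.0

    # With swiglu_limit=10.0 both halves are bounded, so the product stays ~100.
    lim = 10.0
    print((F.silu(g.clamp(max=lim)) * u.clamp(min=-lim, max=lim)).item())  # ~100.0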

An mhc_post debug hook on V4-Pro (multiple layers, contributed by @xu-yfei) shows the magnitude:

hc_post #0 return: min=-1368 max=2576 mean=0.369627 nan=0 inf=0
hc_post #9 return: min=-2240 max=2224 mean=0.243731 nan=0 inf=0
hc_post #N return: min/max ±1000 to ±2576, mean ~0.1
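A hook of roughly this shape could produce those stats (a minimal sketch; the module path and registration point are assumptions, not the actual debug code):

    import torch

    def stats_hook(name: str):
        # Print min/max/mean/nan/inf of a module's output in the format
        # quoted above (sketch only; the real hook may differ).
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            out = out.detach().float()
            print(
                f"{name} return: min={out.min():.0f} max={out.max():.0f} "
                f"mean={out.mean():.6f} nan={int(torch.isnan(out).sum())} "
                f"inf={int(torch.isinf(out).sum())}"
            )
        return hook

    # Hypothetical registration on each layer's mhc_post module:
    # for i, layer in enumerate(model.model.layers):
    #     layer.mhc_post.register_forward_hook(stats_hook(f"hc_post #{i}"))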

mhc_post's output formula is post * x + sum(comb * residual, dim=1) where post ∈ [0, 2] (sigmoid-bounded) and comb is Sinkhorn-normalized to a doubly-stochastic matrix with elements in [0, 1]. Neither of these can blow up the output by themselves, so the ±2000+ values trace back to the residual / x inputs being poisoned by unclamped shared-expert SiLU output.
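In code form, that combination reads roughly as follows (names and shapes are assumptions read off the quoted formula, not the actual implementation):

    import torch

    def mhc_post_out(post, x, comb, residual):
        # post is sigmoid-bounded to [0, 2]; comb is Sinkhorn-normalized with
        # elements in [0, 1]. Both act as bounded gates, so the output can
        # only blow up if x or residual is already large.
        return post * x + (comb * residual).sum(dim=1)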

The 2604B-specific comment at deepseek_v4.py:1432-1433 already flagged the issue:

disable_reason = "2604B checkpoint requires different clamping for shared and routed experts"

Disabling shared-expert fusion was implemented but the corresponding clamp on the unfused DeepseekV2MLP path was missing. This PR closes that gap.

Fix

Three minimal edits to python/sglang/srt/models/deepseek_v2.py:

  1. DeepseekV2MLP.__init__: add swiglu_limit: Optional[float] = None kwarg, store as self.swiglu_limit.

  2. DeepseekV2MLP.forward: clamp gate_up chunks before SiluAndMul if self.swiglu_limit is set. Mirrors _swiglu_silu_clamp_mul semantics in the routed-expert kernel:

    if self.swiglu_limit is not None:
        _g, _u = gate_up.chunk(2, dim=-1)
        _lim = float(self.swiglu_limit)
        gate_up = torch.cat(
            [_g.clamp(max=_lim), _u.clamp(min=-_lim, max=_lim)], dim=-1
        )
  3. Both call sites (shared_experts at L530, dense-MLP self.mlp at L2589) pass swiglu_limit=getattr(config, "swiglu_limit", None), the same pattern already used at L485 for routed experts. A consolidated sketch of the patched module follows this list.
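Putting the three edits together, a simplified stand-in for the patched module (TP, quantization, and prefix plumbing omitted; class and argument names here are illustrative, not the real signature):

    from typing import Optional

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchedMLP(nn.Module):
        """Simplified stand-in for DeepseekV2MLP showing only the new clamp."""

        def __init__(self, hidden_size: int, intermediate_size: int,
                     swiglu_limit: Optional[float] = None):
            super().__init__()
            self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
            self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
            self.swiglu_limit = swiglu_limit

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            gate_up = self.gate_up_proj(x)
            if self.swiglu_limit is not None:
                _g, _u = gate_up.chunk(2, dim=-1)
                _lim = float(self.swiglu_limit)
                gate_up = torch.cat(
                    [_g.clamp(max=_lim), _u.clamp(min=-_lim, max=_lim)], dim=-1
                )
            g, u = gate_up.chunk(2, dim=-1)
            return self.down_proj(F.silu(g) * u)  # bare SiluAndMul equivalent

    # Call-site pattern from the PR (config assumed to be the model config):
    # mlp = PatchedMLP(h, i, swiglu_limit=getattr(config, "swiglu_limit", None))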

No env vars introduced; behavior is gated entirely on config.swiglu_limit. Models that don't set this (e.g. plain V2/V3) retain current behavior (no clamp).

Verification

  • Anaesthesia MCQ reproduction from #23752 ([Bug] The deepseek-v4 pro model outputs meaningless numbers): spurious 16 tokens eliminated at all 3 positions (a minimal repro sketch follows this list).
  • Multi-prompt smoke test (medical / dialog / long-form explanation / code / math): 0 leaks across all prompts (without patch, multiple 16 interjections and a the16 the16 ... decode loop in the long-form case).
  • Tested on 97d1a672 of the deepseek_v4 branch.
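A repro harness of roughly this shape exercises the failing prompt (port, served-model name, and prompt text are placeholders; sglang exposes an OpenAI-compatible endpoint):

    from openai import OpenAI

    # Placeholder prompt; substitute the full anaesthesia MCQ from #23752.
    PROMPT = "Thiopentone is a barbiturate with anticonvulsant properties, ..."

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="default",  # placeholder served-model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,    # deterministic argmax decoding, as in the bug report
    )
    print(resp.choices[0].message.content)  # inspect for stray '16' interjections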

Backwards compatibility

swiglu_limit=None is the existing implicit behavior; adding the kwarg with default None is non-breaking. Models without config.swiglu_limit see no behavioral change.

Commit message:

Routed-expert path already reads config.swiglu_limit and propagates it
through FusedMoE -> MoeRunnerConfig.gemm1_clamp_limit -> triton/deepgemm
kernels for SiLU clamping. However, the shared-expert path and the
dense-MLP path use DeepseekV2MLP which never receives swiglu_limit;
DeepseekV2MLP.forward calls bare SiluAndMul() on gate_up, leaving SiLU
output unbounded.

For DeepSeek-V4 checkpoints trained with swiglu_limit=10.0, the missing
clamp causes shared-expert SiLU output to grow to ±2000+ during inference,
polluting the residual stream and degrading lm_head logits at sentence
boundaries. Symptoms include deterministic spurious tokens like '16',
'14:05', or special tokens leaked into chat output.

Fixes sgl-project#23752

Co-authored-by: Yongfei Xu <xu-yfei@users.noreply.github.com>
@GaoYusong
Contributor Author

cc @Fridge003 @fzyzcjy @ispobock for review — this is a small but high-impact fix for #23752 (V4-Pro chat output corruption)


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a swiglu_limit parameter to the DeepSeek-V2 model to clamp activations within the MLP layer. The review feedback suggests optimizing the implementation by pre-casting the limit to a float during initialization and utilizing in-place clamping operations to reduce memory overhead and avoid redundant tensor concatenations in the forward pass.

Comment on DeepseekV2MLP.__init__ (diff context):

    ) -> None:
        super().__init__()
        self.tp_size = tp_size
        self.swiglu_limit = swiglu_limit

Severity: medium

It is better to cast swiglu_limit to a float once during initialization to avoid repeated casting in the forward pass, which is on the hot path.

Suggested change:

    - self.swiglu_limit = swiglu_limit
    + self.swiglu_limit = float(swiglu_limit) if swiglu_limit is not None else None

Comment on lines +288 to +293
if self.swiglu_limit is not None:
_g, _u = gate_up.chunk(2, dim=-1)
_lim = float(self.swiglu_limit)
gate_up = torch.cat(
[_g.clamp(max=_lim), _u.clamp(min=-_lim, max=_lim)], dim=-1
)

Severity: medium

Using torch.cat creates a new tensor and involves extra memory allocation and copying. Since gate_up is a fresh tensor returned by the linear projection, you can perform the clamping in-place on its chunks. This is more efficient and avoids unnecessary overhead in the forward pass.

        if self.swiglu_limit is not None:
            _lim = self.swiglu_limit
            _g, _u = gate_up.chunk(2, dim=-1)
            _g.clamp_(max=_lim)
            _u.clamp_(min=-_lim, max=_lim)
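The in-place variant works because chunk returns views rather than copies, so clamp_ writes through to gate_up. A quick standalone check (illustrative values):

    import torch

    t = torch.tensor([[-20.0, 5.0, 20.0, -3.0]])
    g, u = t.chunk(2, dim=-1)      # views into t, not copies
    g.clamp_(max=10.0)             # gate half: clamp from above only
    u.clamp_(min=-10.0, max=10.0)  # up half: clamp symmetrically
    print(t)  # tensor([[-20., 5., 10., -3.]]) -- parent tensor was updated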

@Fridge003
Collaborator

@GaoYusong Thanks, we are checking.

@xu-yfei
Contributor

xu-yfei commented Apr 27, 2026

MoE values grow uncontrollably across layers; per-layer min~max ranges from the debug hooks:

| L | attn hc_post OUT.return | moe hc_pre OUT.hidden_states | moe hc_pre OUT.post | moe hc_pre OUT.comb | moe hc_post IN.hidden_states | moe hc_post OUT.return |
|---|---|---|---|---|---|---|
| 0 | -0.9062~0.8086 | -0.8906~0.8242 | 6.86e-07~1.06 | 9.21e-10~0.9988 | -36.75~49.75 | -27.00~36.50 |
| 1 | -15.31~21.25 | -22.62~31.12 | 3.44e-08~1.96 | 1.14e-09~1.0000 | -111.00~198.00 | -218.00~390.00 |
| 2 | -218.00~390.00 | -64.50~118.00 | 1.74e-09~1.56 | 5.00e-08~0.9998 | -56.75~153.00 | -218.00~392.00 |
| 3 | -218.00~392.00 | -140.00~328.00 | 1.61e-09~1.99 | 6.44e-10~0.9999 | -199.00~238.00 | -604.00~864.00 |
| 4 | -604.00~864.00 | -22.25~45.75 | 3.74e-06~1.39 | 2.93e-08~0.9998 | -26.25~32.50 | -604.00~868.00 |

@Fridge003 Fridge003 merged commit a7e27be into sgl-project:deepseek_v4 Apr 27, 2026
2 checks passed