[DeepSeek V4] Fix meaningless numbers in chat output by adding swiglu_limit clamp to DeepseekV2MLP#23776
Conversation
Routed-expert path already reads config.swiglu_limit and propagates it through FusedMoE -> MoeRunnerConfig.gemm1_clamp_limit -> triton/deepgemm kernels for SiLU clamping. However, the shared-expert path and the dense-MLP path use DeepseekV2MLP which never receives swiglu_limit; DeepseekV2MLP.forward calls bare SiluAndMul() on gate_up, leaving SiLU output unbounded. For DeepSeek-V4 checkpoints trained with swiglu_limit=10.0, the missing clamp causes shared-expert SiLU output to grow to ±2000+ during inference, polluting the residual stream and degrading lm_head logits at sentence boundaries. Symptoms include deterministic spurious tokens like '16', '14:05', or special tokens leaked into chat output. Fixes sgl-project#23752 Co-authored-by: Yongfei Xu <xu-yfei@users.noreply.github.com>
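For readers unfamiliar with the clamp convention, here is a minimal pure-Python sketch of the `swiglu_limit` semantics described above: the gate half is clamped from above only, the up half on both sides, mirroring what `_swiglu_silu_clamp_mul` does in the routed-expert kernel. The scalar helper names are illustrative, not part of the codebase:

```python
import math
from typing import Optional

def silu(x: float) -> float:
    # SiLU / swish: x * sigmoid(x)
    return x * (1.0 / (1.0 + math.exp(-x)))

def swiglu_clamped(gate: float, up: float, limit: Optional[float]) -> float:
    # Illustrative scalar sketch of the clamp semantics (not the actual
    # kernel): gate is clamped from above only, up is clamped on both
    # sides, then silu(gate) * up.
    if limit is not None:
        gate = min(gate, limit)
        up = max(-limit, min(up, limit))
    return silu(gate) * up
```

With `limit=10.0`, a runaway gate activation of 2000 contributes `silu(10.0) * up` instead of roughly `2000 * up`, which is exactly the boundedness the shared-expert path was missing.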
cc @Fridge003 @fzyzcjy @ispobock for review — this is a small but high-impact fix for #23752 (V4-Pro chat output corruption)
Code Review
This pull request introduces a swiglu_limit parameter to the DeepSeek-V2 model to clamp activations within the MLP layer. The review feedback suggests optimizing the implementation by pre-casting the limit to a float during initialization and utilizing in-place clamping operations to reduce memory overhead and avoid redundant tensor concatenations in the forward pass.
```python
) -> None:
    super().__init__()
    self.tp_size = tp_size
    self.swiglu_limit = swiglu_limit
```
```python
if self.swiglu_limit is not None:
    _g, _u = gate_up.chunk(2, dim=-1)
    _lim = float(self.swiglu_limit)
    gate_up = torch.cat(
        [_g.clamp(max=_lim), _u.clamp(min=-_lim, max=_lim)], dim=-1
    )
```
Using torch.cat creates a new tensor and involves extra memory allocation and copying. Since gate_up is a fresh tensor returned by the linear projection, you can perform the clamping in-place on its chunks. This is more efficient and avoids unnecessary overhead in the forward pass.
```python
if self.swiglu_limit is not None:
    _lim = self.swiglu_limit
    _g, _u = gate_up.chunk(2, dim=-1)
    _g.clamp_(max=_lim)
    _u.clamp_(min=-_lim, max=_lim)
```
@GaoYusong Thanks, we are checking.
MoE values grow uncontrollably.
Fixes #23752
Summary
DeepSeek-V4 chat output contains deterministic spurious tokens (e.g. plain `16`, `14:05`, `<|end▁of▁file|>` followed by training-data file paths and decode loops) at sentence-boundary positions. Reported in #23752 with an exemplar prompt. The `16` tokens have IDs `[49, 54]` (ASCII `'1' '6'`); they are plain text picked by argmax at `temperature=0`.

Root cause
The routed-expert path already reads `config.swiglu_limit` and propagates it through `FusedMoE` → `MoeRunnerConfig.gemm1_clamp_limit` → triton/deepgemm kernels for SiLU clamping (see `deepseek_v2.py:485` → `layers/moe/moe_runner/triton_utils/fused_moe.py:299` `_swiglu_silu_clamp_mul`). However, the shared-expert path and the dense-MLP path use `DeepseekV2MLP`, which never receives `swiglu_limit`. `DeepseekV2MLP.forward` calls bare `SiluAndMul()` on `gate_up`, leaving SiLU output unbounded.

For DeepSeek-V4 checkpoints trained with `swiglu_limit=10.0` (e.g. V4-Pro / V4-Flash with the `2604B` submode, see `configs/config_backup_large.json`), the missing clamp causes shared-expert SiLU output to grow to ±2000+ during inference. This pollutes the residual stream via `final_hidden_states += shared_output`, propagates through `mhc_post`, and degrades `lm_head` logits at sentence-boundary positions, producing the spurious tokens reported in #23752.

A `mhc_post` debug hook on V4-Pro (multiple layers, contributed by @xu-yfei) shows the magnitude. `mhc_post`'s output formula is `post * x + sum(comb * residual, dim=1)`, where `post ∈ [0, 2]` (sigmoid-bounded) and `comb` is Sinkhorn-normalized to a doubly-stochastic matrix with elements in `[0, 1]`. Neither of these can blow up the output by itself, so the ±2000+ values trace back to the `residual`/`x` inputs being poisoned by unclamped shared-expert SiLU output.

The 2604B-specific comment at `deepseek_v4.py:1432-1433` already flagged the issue: disabling shared-expert fusion was implemented, but the corresponding clamp on the unfused `DeepseekV2MLP` path was missing. This PR closes that gap.

Fix
Three minimal edits to `python/sglang/srt/models/deepseek_v2.py`:

1. `DeepseekV2MLP.__init__`: add a `swiglu_limit: Optional[float] = None` kwarg, stored as `self.swiglu_limit`.
2. `DeepseekV2MLP.forward`: clamp the `gate_up` chunks before `SiluAndMul` if `self.swiglu_limit` is set, mirroring the `_swiglu_silu_clamp_mul` semantics in the routed-expert kernel.
3. Both call sites (`shared_experts` at L530, dense-MLP `self.mlp` at L2589) pass `swiglu_limit=getattr(config, "swiglu_limit", None)`, the same pattern already used at L485 for routed experts.

No env vars are introduced; behavior is gated entirely on `config.swiglu_limit`. Models that don't set it (e.g. plain V2/V3) retain current behavior (no clamp).

Verification
- `16` tokens eliminated at all 3 positions (including the `16` interjections and the `the16 the16 ...` decode loop in the long-form case).
- Verified on commit `97d1a672` of the `deepseek_v4` branch.
`swiglu_limit=None` is the existing implicit behavior; adding the kwarg with default `None` is non-breaking. Models without `config.swiglu_limit` see no behavioral change.
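The non-breaking default can be illustrated with a small standalone sketch of the gating (a hypothetical helper for illustration, not the actual `DeepseekV2MLP.forward`):

```python
from typing import Optional

import torch

def clamp_gate_up(gate_up: torch.Tensor,
                  swiglu_limit: Optional[float]) -> torch.Tensor:
    """Sketch of the forward-path change: a no-op when swiglu_limit is
    None (plain V2/V3 configs), otherwise clamp the gate/up chunks the
    way the PR does before SiluAndMul."""
    if swiglu_limit is None:
        return gate_up  # existing implicit behavior, unchanged
    lim = float(swiglu_limit)
    g, u = gate_up.chunk(2, dim=-1)
    return torch.cat([g.clamp(max=lim), u.clamp(min=-lim, max=lim)], dim=-1)

x = torch.tensor([[2000.0, 1.0, 5.0, -2000.0]])
print(clamp_gate_up(x, None) is x)            # True: unchanged tensor
print(clamp_gate_up(x, 10.0).tolist())        # [[10.0, 1.0, 5.0, -10.0]]
```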