feat(w4a16-deepseek): SM90 W4A16 MoE path for DSv4 FP4 checkpoint by Fridge003 · Pull Request #23681 · sgl-project/sglang

Fridge003 · 2026-04-25T00:42:05Z

To be updated

…00) (#27)

* feat(w4a16-deepseek): add SM90 W4A16 MoE path for DSv4 FP4 checkpoint

  Add DeepSeekW4A16MoEMethod, the H200/SM90 counterpart to DeepSeekMxfp4MoEMethod. Both classes consume the same DSv4 FP4 checkpoint (SGLANG_DSV4_MODE=2604 SGLANG_DSV4_FP4_EXPERTS=1); mxfp4_deepseek targets B200's trtllm_fp4_block_scale_routed_moe (MXFP8xMXFP4), and this new path targets flashinfer's SM90 mixed-input cutlass_fused_moe(..., use_w4_group_scaling=True) (BF16xMXFP4) introduced in flashinfer-ai/flashinfer#3084.

  Key differences from mxfp4_deepseek:

  - Pre-interleaves FP4 weights and MXFP4 block scales at load time via flashinfer's interleave_moe_weights_for_hopper_mixed_gemm / interleave_moe_scales_for_hopper_mixed_gemm helpers. Without this the SM90 LDSM-based FP4->BF16 pipeline reads LUT bytes from wrong positions and the output decorrelates for K > 128 (DSv4 has K=4096).
  - Kernel takes raw (token_selected_experts, token_final_scales) rather than the packed int32 (id<<16 | weight_bf16) that the TRT-LLM routed kernel expects; no PackTopkIds step.
  - Local-expert filtering is done via ep_size/ep_rank parameters on cutlass_fused_moe, so topk_ids are handed over in the GLOBAL id space (same as the mxfp4_deepseek dispatcher-mapping-undo logic).
  - SwiGLU clamp is plumbed via swiglu_limit (no separate gemm1_clamp_limit).

  w13 row order is unchanged: checkpoint stores [w1(gate), w3(up)] and we reorder to [w3(up), w1(gate)] to match the SM90 kernel's reference (test_moe_bf16_mxfp4 splits as `w3, w1 = torch.chunk(w31, 2, dim=0)`).

  Enabled by --moe-runner-backend flashinfer_w4a16.

* fix(w4a16-deepseek): bump sunrise_moe_code_path_checker.observed on 260415

  deepseek_v4.py forward asserts observed == 1 exactly once per MoE layer under SGLANG_DSV4_2604_SUBMODE=260415 and then resets. mxfp4_deepseek bumps the counter; the W4A16 path was forgetting, so the server crashed at the first real forward on the DSv4 0415_v5 checkpoint. Mirror the mxfp4_deepseek bump.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(w4a16-deepseek): add SGLANG_HACK_DEBUG_W4A16_REMOVE_SWIGLU_LIMIT

  Force-override swiglu_limit to None in cutlass_fused_moe call when flag is set, to A/B-test whether the swiglu_limit=10.0 up-branch clamp path is the root cause of the AIME25 accuracy regression observed on the DSv4 260415 FP4 checkpoint. Keeps _swiglu_limit_tensor non-None so create_moe_runner's sanity assert and the sunrise_moe_code_path_checker bump remain unaffected. Flag defaults to False: production behavior unchanged. See journal 2026-04-21-024 for the repro plan.

* fix(w4a16-deepseek): rename interleave_moe_*_for_hopper → _sm90

  Upstream flashinfer PR #3084 branch (samuellees/flashinfer@feat/w4a16-moe-kernel commit cb90611) renamed interleave_moe_{weights,scales}_for_{Hopper,hopper}_mixed_gemm to the _sm90_ variants. Sync the sglang W4A16 MoE wrapper to the new names.

* journal(2026-04-21-022): DSv4 W4A16 H200 with cuda graph enabled

  First cuda-graph run of the W4A16 path on DSv4 FP4 ckpt. Records a clean 8m52s cold start (41s capture for 36 batch sizes), ~6000 tok/s aggregate peak at mc=256, and 17k+ decode batches all dispatching under graph with no kernel-level issues. AIME25 x16 ran ~5h to 12/16 seeds before a gloo TCP connection reset peeled the scheduler; partial pass@1[avg-of-12] = 75.56% ± 6.25%. GPQA queued but did not run. Next: relaunch + GPQA.

* revert: remove 2026-04-21-022 journal (belongs on rcli-config branch)

  Journal committed to wrong branch; this branch (w4a16 PR) contains code changes only. Journal will be added under rcli-config per convention.

* feat(w4a16-deepseek): add aime25_q6 single-question bench dataset

  Generated via sunrise/filter_nemo_skills_questions.py from the canonical nemo_skills aime25 source jsonl (see generate.sh for the reproducible one-liner). Purpose: one-question dense A/B subset for the W4A16 accuracy regression investigation. Journal 0421-024 observed pred=271 clustering on 11/31 wrong seeds for aime25-6 across all three arms; aime25_q6:64 lets us run a 64-repeat concentration experiment in minutes instead of 9 hours.

* debug(w4a16-deepseek): assert fp32→UE8M0 scale conversion is lossless

  UE8M0 stores only the biased exponent, so a float32 block scale is only preserved when it's an exact power of 2. If DSv4's ckpt stores scales that aren't pure powers of 2, the .to(float8_e8m0fnu) round-trip silently rounds and feeds the kernel wrong scales, a plausible culprit for the AIME25 accuracy drop. This helper crashes loudly on the first mismatch with a sample of bad values instead of silently degrading.

* feat(w4a16-deepseek): add SGLANG_HACK_DEBUG_W4A16_USE_BF16_API for dequant-ref

  When the flag is set, process_weights_after_loading dequants FP4+UE8M0 expert weights into plain bf16 (post reorder_w1w3_to_w3w1) and drops the scale parameters, and apply() calls cutlass_fused_moe with bf16 weights, quant_scales=None, use_w4_group_scaling=False. This routes the MoE through flashinfer's CutlassMoeFCRunner<bf16, bf16> specialization, a numerically independent reference path that does not share the SM90 mixed-input dequant/interleave code of PR #3084, so any acc gap it closes isolates the regression to the W4A16 kernel / interleave side. Flag defaults to False; W4A16 behavior unchanged when off.

* docs(w4a16-deepseek): cite flashinfer source for _dequant_mxfp4 copy

  Body and LUT copied verbatim from flashinfer-sunrise PR #3084 (commit 77746b81) at tests/moe/test_trtllm_cutlass_fused_moe.py lines 2419-2452 (_MXFP4_LUT + _dequant_mxfp4_on_device). Bitwise equivalence verified on 5 random uint8 shapes (CPU torch.equal on bf16 output; NaN-position agreement on UE8M0=255 case separately).
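The UE8M0 losslessness check and the MXFP4 dequant described in the commits above can be illustrated with a small pure-Python sketch. It is not the `_dequant_mxfp4_on_device` code from flashinfer or the sglang helper; the function names are hypothetical, and the low-nibble-first packing order is an assumption made for illustration. The E2M1 value table and the UE8M0 decoding (2^(code - 127), with 255 reserved for NaN) follow the MX format definitions.

```python
import math

# E2M1 (FP4) magnitudes indexed by the low 3 bits; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def ue8m0_to_float(code: int) -> float:
    """Decode a UE8M0 byte: 2**(code - 127); code 255 is NaN."""
    if code == 255:
        return math.nan
    return math.ldexp(1.0, code - 127)


def float_to_ue8m0_checked(scale: float) -> int:
    """Encode a scale as UE8M0, raising if the conversion would be lossy.

    UE8M0 keeps only a biased exponent, so only exact positive powers
    of two survive the round trip.
    """
    # frexp returns (m, e) with scale == m * 2**e and 0.5 <= |m| < 1,
    # so an exact positive power of two has m == 0.5.
    mantissa, exponent = math.frexp(scale)
    if mantissa != 0.5:
        raise ValueError(f"scale {scale!r} is not an exact power of two")
    code = exponent - 1 + 127
    if not 0 <= code <= 254:
        raise ValueError(f"scale {scale!r} is outside the UE8M0 range")
    return code


def dequant_mxfp4_block(packed: bytes, scale_code: int) -> list[float]:
    """Dequantize packed FP4 bytes that share one UE8M0 block scale."""
    scale = ue8m0_to_float(scale_code)
    out = []
    for byte in packed:
        # ASSUMPTION: the low nibble holds the first element of each pair.
        for nibble in (byte & 0x0F, byte >> 4):
            sign = -1.0 if nibble & 0x8 else 1.0
            out.append(sign * E2M1_MAGNITUDES[nibble & 0x7] * scale)
    return out
```

For example, a scale of 0.75 is rejected by `float_to_ue8m0_checked`, which is the kind of silent rounding the debug assert in the commit is meant to surface.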
* test(sunrise): add verify_dequant_mxfp4.py

  Bitwise equivalence check between flashinfer-sunrise PR #3084's _dequant_mxfp4_on_device (tests/moe/test_trtllm_cutlass_fused_moe.py @ 77746b81) and the sglang local copy in w4a16_deepseek.py. Paths are resolved relative to the script, with FLASHINFER_SUNRISE_TEST_FILE env override. CPU-only; no CUDA needed.

* fix(w4a16-deepseek): extend StandardDispatcher skip_local_expert_mapping to flashinfer_w4a16

  Previously the skip gate was `enable_flashinfer_mxfp4_moe and SGLANG_OPT_MXFP4_SKIP_DISPATCHER_MAPPING`, so for --moe-runner-backend=flashinfer_w4a16 the dispatcher always applied the global->local+sentinel mapping, while w4a16_deepseek.apply() (copy of mxfp4 logic) skipped the inverse undo when the env default (True) was active.

  Net effect under --ep>1: cutlass_fused_moe received local-index+(-1)-sentinel topk_ids interpreted as globals, causing ep_rank>0 experts to be filtered out and producing garbage output (degenerate token loops). TP-only arm masked it because local-id == global-id and no sentinels fire when num_local_experts == num_experts.

  Fix: include flashinfer_w4a16 in the skip gate alongside flashinfer_mxfp4_moe.

  Repro + diagnosis: sunrise/bench_records/journals/2026-04-22-003-w4a16-ep-garbage-bug-repro.md

* feat(w4a16-deepseek): add SGLANG_HACK_DEBUG_W4A16_USE_TORCH_REF env flag

  Gate a new pure-torch MoE reference path as an acc-investigation arm that sits one level deeper than SGLANG_HACK_DEBUG_W4A16_USE_BF16_API: both dequant FP4 to bf16 once at load time, but BF16_API still calls the flashinfer bf16 grouped GEMM while TORCH_REF bypasses it entirely.

* feat(w4a16-deepseek): add pure-torch MoE ref in debug_utils

  torch_ref_cutlass_fused_moe mirrors the flashinfer cutlass_fused_moe signature so the w4a16_deepseek apply() site can swap one for the other via a single local import, matching the pattern used by mxfp4_deepseek/naive_torch_trtllm_fp4_block_scale_routed_moe. Body adapted from flashinfer-sunrise tests/moe/test_trtllm_cutlass_fused_moe.py _compute_with_active_experts (commit 77746b81).

* feat(w4a16-deepseek): wire torch-ref path through single MoE call site

  Gate the bf16-weight dequant branch in process_weights_after_loading on either BF16_API or TORCH_REF (mutually exclusive), and swap the MoE function in apply() via local import rather than duplicating the call with different args.

* test(sunrise): add verify_torch_ref_w4a16_moe.py

  Element-wise smoke comparing torch_ref_cutlass_fused_moe against the flashinfer cutlass_fused_moe(use_w4_group_scaling=True, swiglu_limit=...) kernel on tiny random MXFP4 weights (shapes borrowed from flashinfer's own W4A16_CORRECTNESS_CONFIGS). Before committing bench-scale wall-clock to the torch-ref path, we want this to show that the two agree within ~1% at small shape.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
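The expert-parallel failure described in the skip_local_expert_mapping fix can be reproduced in miniature. The sketch below is a toy model, not sglang's dispatcher or the flashinfer kernel: both function names are hypothetical, and it assumes experts are split into contiguous, equally sized ranges per EP rank.

```python
def map_global_to_local(topk_ids, ep_rank, num_local_experts):
    """Dispatcher-style mapping: ids owned by this rank become local
    indices, every other id becomes the -1 sentinel."""
    lo = ep_rank * num_local_experts
    hi = lo + num_local_experts
    return [i - lo if lo <= i < hi else -1 for i in topk_ids]


def experts_kept_by_kernel(ids_as_global, ep_rank, num_local_experts):
    """Kernel-style filter: treats incoming ids as GLOBAL and keeps only
    those in this rank's range (ASSUMPTION: contiguous partition)."""
    lo = ep_rank * num_local_experts
    hi = lo + num_local_experts
    return [i for i in ids_as_global if lo <= i < hi]


# 8 experts over 2 EP ranks; rank 1 owns global experts 4..7.
topk_ids = [1, 5, 6]

# Intended behaviour: global ids reach the kernel, rank 1 keeps 5 and 6.
kept_ok = experts_kept_by_kernel(topk_ids, ep_rank=1, num_local_experts=4)

# Buggy behaviour: mapped ids [-1, 1, 2] are read as globals, so nothing
# falls inside rank 1's range and all of its experts are dropped.
mapped = map_global_to_local(topk_ids, ep_rank=1, num_local_experts=4)
kept_bug = experts_kept_by_kernel(mapped, ep_rank=1, num_local_experts=4)

# TP-only (one rank owning all 8 experts): the mapping is the identity
# and no sentinel fires, which is why that arm masked the bug.
mapped_tp = map_global_to_local(topk_ids, ep_rank=0, num_local_experts=8)
```

Under these assumptions the mismatch only shows up on ranks with ep_rank > 0, which matches the symptom reported in the commit message.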
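The first commit in the list contrasts raw (token_selected_experts, token_final_scales) inputs with the packed int32 (id<<16 | weight_bf16) layout that the TRT-LLM routed kernel expects. As a rough illustration of that layout only, the sketch below packs one expert id and one routing weight in pure Python. The helper names are hypothetical, and the bf16 conversion truncates the float32 bits, which is an assumption; a real PackTopkIds step may round instead.

```python
import struct


def bf16_bits(x: float) -> int:
    """Upper 16 bits of the float32 encoding of x.

    ASSUMPTION: truncation is used here for simplicity; production code
    may use round-to-nearest-even when converting to bf16.
    """
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16


def pack_topk(expert_id: int, weight: float) -> int:
    """Pack an expert id and its routing weight as id<<16 | weight_bf16."""
    return (expert_id << 16) | bf16_bits(weight)


def unpack_topk(packed: int):
    """Recover (expert_id, weight) from the packed representation."""
    expert_id = packed >> 16
    (weight,) = struct.unpack("<f", struct.pack("<I", (packed & 0xFFFF) << 16))
    return expert_id, weight
```

The W4A16 path described in this PR skips this packing entirely and hands the ids and weights to cutlass_fused_moe as two separate tensors.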

gemini-code-assist · 2026-04-25T00:42:09Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

github-actions Bot added quant LLM Quantization deepseek labels Apr 25, 2026

Fridge003 changed the title ~~feat(w4a16-deepseek): SM90 W4A16 MoE path for DSv4 FP4 checkpoint (H2…~~ feat(w4a16-deepseek): SM90 W4A16 MoE path for DSv4 FP4 checkpoint Apr 25, 2026

seindum mentioned this pull request Apr 25, 2026: DeepSeek V4 Roadmap #23602 (Open, 34 tasks)

samuellees mentioned this pull request May 11, 2026: Add FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DeepSeek-V4 #24816 (Merged)

Fridge003 closed this May 12, 2026

Fridge003 deleted the w4a16_v4 branch May 13, 2026 00:38

Fridge003 wants to merge 1 commit into deepseek_v4 from w4a16_v4
