
feat(w4a16-deepseek): SM90 W4A16 MoE path for DSv4 FP4 checkpoint#23681

Closed
Fridge003 wants to merge 1 commit into deepseek_v4 from w4a16_v4

Collaborator

@Fridge003 Fridge003 commented Apr 25, 2026

To be updated

…00) (#27)

* feat(w4a16-deepseek): add SM90 W4A16 MoE path for DSv4 FP4 checkpoint

Add DeepSeekW4A16MoEMethod, the H200/SM90 counterpart to DeepSeekMxfp4MoEMethod.
Both classes consume the same DSv4 FP4 checkpoint (SGLANG_DSV4_MODE=2604
SGLANG_DSV4_FP4_EXPERTS=1); mxfp4_deepseek targets B200's
trtllm_fp4_block_scale_routed_moe (MXFP8xMXFP4), and this new path targets
flashinfer's SM90 mixed-input cutlass_fused_moe(..., use_w4_group_scaling=True)
(BF16xMXFP4) introduced in flashinfer-ai/flashinfer#3084.

Key differences from mxfp4_deepseek:
- Pre-interleaves FP4 weights and MXFP4 block scales at load time via
  flashinfer's interleave_moe_weights_for_hopper_mixed_gemm /
  interleave_moe_scales_for_hopper_mixed_gemm helpers. Without this the
  SM90 LDSM-based FP4->BF16 pipeline reads LUT bytes from wrong positions
  and the output decorrelates for K > 128 (DSv4 has K=4096).
- Kernel takes raw (token_selected_experts, token_final_scales) rather than
  the packed int32 (id<<16 | weight_bf16) that the TRT-LLM routed kernel
  expects; no PackTopkIds step.
- Local-expert filtering is done via ep_size/ep_rank parameters on
  cutlass_fused_moe, so topk_ids are handed over in the GLOBAL id space
  (same as the mxfp4_deepseek dispatcher-mapping-undo logic).
- SwiGLU clamp is plumbed via swiglu_limit (no separate gemm1_clamp_limit).
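
To make the routing-input difference above concrete, here is a minimal pure-Python sketch of the packed form. It assumes the layout is exactly what the `(id<<16 | weight_bf16)` shorthand suggests: expert id in the upper 16 bits, bf16 bit pattern of the routing weight in the lower 16. The helper names `bf16_bits`, `pack_topk`, and `unpack_topk` are hypothetical and are not part of sglang or flashinfer.

```python
import struct


def bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of a finite float (round to nearest even)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    # bf16 keeps the top 16 bits of float32; round on the 16 bits that are dropped.
    rounding_bias = 0x7FFF + ((u32 >> 16) & 1)
    return ((u32 + rounding_bias) >> 16) & 0xFFFF


def pack_topk(expert_id: int, weight: float) -> int:
    """Assumed packed layout: expert id in the high half, bf16 weight in the low half."""
    return (expert_id << 16) | bf16_bits(weight)


def unpack_topk(packed: int):
    """Inverse of pack_topk: recover (expert_id, weight as a Python float)."""
    expert_id = packed >> 16
    (weight,) = struct.unpack("<f", struct.pack("<I", (packed & 0xFFFF) << 16))
    return expert_id, weight
```

The W4A16 path skips this step entirely and hands the two tensors to the kernel separately, so no information is squeezed through a 16-bit weight before the kernel sees it.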

w13 row-order handling is unchanged from mxfp4_deepseek: the checkpoint stores [w1(gate), w3(up)] and we
reorder to [w3(up), w1(gate)] to match the SM90 kernel's reference
(test_moe_bf16_mxfp4 splits as `w3, w1 = torch.chunk(w31, 2, dim=0)`).
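
As an illustration of the reorder described above, the following sketch swaps the two halves of a fused w13 matrix represented as a plain list of rows. The function name `reorder_rows_w1w3_to_w3w1` and the list-of-rows representation are assumptions made for the example; the real code operates on packed FP4 tensors.

```python
def reorder_rows_w1w3_to_w3w1(w13_rows):
    """Swap the halves of a fused [w1(gate); w3(up)] matrix to get [w3(up); w1(gate)]."""
    half, remainder = divmod(len(w13_rows), 2)
    if remainder:
        raise ValueError("fused w13 must have an even number of rows")
    w1_gate, w3_up = w13_rows[:half], w13_rows[half:]
    return w3_up + w1_gate


def chunk_in_two(rows):
    """List analogue of torch.chunk(rows, 2, dim=0) for an even row count."""
    half = len(rows) // 2
    return rows[:half], rows[half:]
```

After the swap, splitting the result in two yields `(w3, w1)`, which is the order the reference test unpacks.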

Enabled by --moe-runner-backend flashinfer_w4a16.

* fix(w4a16-deepseek): bump sunrise_moe_code_path_checker.observed on 260415

deepseek_v4.py forward asserts observed == 1 exactly once per MoE layer
under SGLANG_DSV4_2604_SUBMODE=260415 and then resets. mxfp4_deepseek
bumps the counter; the W4A16 path did not, so the server crashed
at the first real forward on the DSv4 0415_v5 checkpoint. Mirror the
mxfp4_deepseek bump.
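
The bump/assert/reset contract can be sketched as below. This is a hypothetical stand-in for sunrise_moe_code_path_checker, whose real interface is not shown in this PR; only the "exactly one bump per MoE layer, then reset" behavior is taken from the description above.

```python
class CodePathChecker:
    """Counts how many times the expected MoE code path ran in one layer."""

    def __init__(self):
        self.observed = 0

    def bump(self):
        # Called by the MoE method that is supposed to handle the layer.
        self.observed += 1

    def check_and_reset(self):
        # Called by the model forward once per MoE layer.
        assert self.observed == 1, f"expected exactly 1 bump, saw {self.observed}"
        self.observed = 0
```

A backend that never calls `bump()` trips the assertion on the first forward pass, which matches the crash described above.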

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(w4a16-deepseek): add SGLANG_HACK_DEBUG_W4A16_REMOVE_SWIGLU_LIMIT

Force-override swiglu_limit to None in cutlass_fused_moe call when flag
is set, to A/B-test whether the swiglu_limit=10.0 up-branch clamp path
is the root cause of the AIME25 accuracy regression observed on the
DSv4 260415 FP4 checkpoint.

Keeps _swiglu_limit_tensor non-None so create_moe_runner's sanity assert
and the sunrise_moe_code_path_checker bump remain unaffected. Flag
defaults to False: production behavior unchanged.

See journal 2026-04-21-024 for the repro plan.

* fix(w4a16-deepseek): rename interleave_moe_*_for_hopper → _sm90

Upstream flashinfer PR #3084 branch (samuellees/flashinfer@feat/w4a16-moe-kernel
commit cb90611) renamed interleave_moe_{weights,scales}_for_{Hopper,hopper}_mixed_gemm
to the _sm90_ variants. Sync the sglang W4A16 MoE wrapper to the new names.

* journal(2026-04-21-022): DSv4 W4A16 H200 with cuda graph enabled

First cuda-graph run of the W4A16 path on DSv4 FP4 ckpt. Records a clean
8m52s cold start (41s capture for 36 batch sizes), ~6000 tok/s aggregate
peak at mc=256, and 17k+ decode batches all dispatching under graph with
no kernel-level issues. AIME25 x16 ran ~5h to 12/16 seeds before a gloo
TCP connection reset peeled the scheduler; partial pass@1[avg-of-12] =
75.56% ± 6.25%. GPQA queued but did not run. Next: relaunch + GPQA.

* revert: remove 2026-04-21-022 journal (belongs on rcli-config branch)

Journal committed to wrong branch; this branch (w4a16 PR) contains code
changes only. Journal will be added under rcli-config per convention.

* feat(w4a16-deepseek): add aime25_q6 single-question bench dataset

Generated via sunrise/filter_nemo_skills_questions.py from the canonical
nemo_skills aime25 source jsonl (see generate.sh for the reproducible
one-liner).

Purpose: one-question dense A/B subset for the W4A16 accuracy regression
investigation. Journal 0421-024 observed pred=271 clustering on 11/31
wrong seeds for aime25-6 across all three arms; aime25_q6:64 lets us run
a 64-repeat concentration experiment in minutes instead of 9 hours.

* debug(w4a16-deepseek): assert fp32→UE8M0 scale conversion is lossless

UE8M0 stores only the biased exponent, so a float32 block scale is only
preserved when it's an exact power of 2. If DSv4's ckpt stores scales that
aren't pure powers of 2, the .to(float8_e8m0fnu) round-trip silently rounds
and feeds the kernel wrong scales — a plausible culprit for the AIME25
accuracy drop. This helper crashes loudly on the first mismatch with a
sample of bad values instead of silently degrading.
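
A scalar sketch of the losslessness condition, assuming the usual UE8M0 convention (bias 127, byte 255 reserved for NaN). The helper names are hypothetical; the helper in this commit works on tensors, which this example does not attempt to reproduce.

```python
import math

UE8M0_BIAS = 127
UE8M0_NAN = 255


def is_power_of_two(x: float) -> bool:
    """True when x is a positive finite value of the form 2**k."""
    if not (x > 0.0) or math.isinf(x):
        return False
    mantissa, _ = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    return mantissa == 0.5


def encode_ue8m0(x: float) -> int:
    """Encode a scale as a biased-exponent byte, refusing any lossy conversion."""
    if not is_power_of_two(x):
        raise ValueError(f"scale {x!r} is not an exact power of 2")
    _, exponent = math.frexp(x)  # x == 0.5 * 2**exponent == 2**(exponent - 1)
    biased = (exponent - 1) + UE8M0_BIAS
    if not 0 <= biased < UE8M0_NAN:
        raise ValueError(f"scale {x!r} is outside the UE8M0 range")
    return biased


def decode_ue8m0(byte: int) -> float:
    """Decode a UE8M0 byte back to a float scale."""
    if byte == UE8M0_NAN:
        return math.nan
    return math.ldexp(1.0, byte - UE8M0_BIAS)
```

A scale such as 0.75 has a nonzero mantissa fraction, so no UE8M0 byte decodes back to it; raising on such values is the scalar analogue of crashing on the first mismatch.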

* feat(w4a16-deepseek): add SGLANG_HACK_DEBUG_W4A16_USE_BF16_API for dequant-ref

When the flag is set, process_weights_after_loading dequantizes FP4+UE8M0
expert weights into plain bf16 (post reorder_w1w3_to_w3w1) and drops the
scale parameters, and apply() calls cutlass_fused_moe with bf16 weights,
quant_scales=None, use_w4_group_scaling=False. This routes the MoE
through flashinfer's CutlassMoeFCRunner<bf16, bf16> specialization —
a numerically independent reference path that does not share the SM90
mixed-input dequant/interleave code of PR #3084, so any accuracy gap it
closes isolates the regression to the W4A16 kernel / interleave side.

Flag defaults to False; W4A16 behavior unchanged when off.

* docs(w4a16-deepseek): cite flashinfer source for _dequant_mxfp4 copy

Body and LUT copied verbatim from flashinfer-sunrise PR #3084 (commit
77746b81) at tests/moe/test_trtllm_cutlass_fused_moe.py lines 2419-2452
(_MXFP4_LUT + _dequant_mxfp4_on_device). Bitwise equivalence verified
on 5 random uint8 shapes (CPU torch.equal on bf16 output; NaN-position
agreement on UE8M0=255 case separately).
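
For readers unfamiliar with the format, here is a pure-Python sketch of MXFP4 dequantization. It is not the flashinfer code cited above. It assumes the standard E2M1 value table, 32 elements per UE8M0 block scale, and low-nibble-first packing; the nibble order in particular is an assumption that should be checked against the real LUT.

```python
import math
from typing import List

# E2M1 magnitudes for the low three bits; bit 3 of the nibble is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements sharing one UE8M0 scale


def decode_fp4(nibble: int) -> float:
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_mxfp4(packed: bytes, scales: bytes) -> List[float]:
    """Expand packed FP4 bytes (two values per byte) using per-block UE8M0 scales."""
    values = []
    for i, byte in enumerate(packed):
        scale_byte = scales[(2 * i) // BLOCK_SIZE]
        # UE8M0: value is 2**(byte - 127); byte 255 encodes NaN.
        scale = math.nan if scale_byte == 255 else math.ldexp(1.0, scale_byte - 127)
        # Assumption: the low nibble holds the earlier element.
        for nibble in (byte & 0xF, byte >> 4):
            values.append(decode_fp4(nibble) * scale)
    return values
```

With a scale byte of 255 every element of the block becomes NaN, which is why that case needs a NaN-position comparison rather than a plain equality check.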

* test(sunrise): add verify_dequant_mxfp4.py

Bitwise equivalence check between flashinfer-sunrise PR #3084's
_dequant_mxfp4_on_device (tests/moe/test_trtllm_cutlass_fused_moe.py
@ 77746b81) and the sglang local copy in w4a16_deepseek.py. Paths are
resolved relative to the script, with FLASHINFER_SUNRISE_TEST_FILE env
override. CPU-only; no CUDA needed.

* fix(w4a16-deepseek): extend StandardDispatcher skip_local_expert_mapping to flashinfer_w4a16

Previously the skip gate was `enable_flashinfer_mxfp4_moe and SGLANG_OPT_MXFP4_SKIP_DISPATCHER_MAPPING`,
so for --moe-runner-backend=flashinfer_w4a16 the dispatcher always applied the
global->local+sentinel mapping, while w4a16_deepseek.apply() (copy of mxfp4 logic)
skipped the inverse undo when the env default (True) was active. Net effect under
--ep>1: cutlass_fused_moe received local-index+(-1)-sentinel topk_ids interpreted
as globals, causing ep_rank>0 experts to be filtered out and producing garbage
output (degenerate token loops). TP-only arm masked it because local-id == global-id
and no sentinels fire when num_local_experts == num_experts.

Fix: include flashinfer_w4a16 in the skip gate alongside flashinfer_mxfp4_moe.
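
The failure mode can be reproduced on paper with a few lines of Python. The sketch assumes experts are partitioned contiguously across EP ranks and models the kernel's ep_size/ep_rank filtering as a simple range check; both function names are hypothetical.

```python
def global_to_local(topk_ids, ep_rank, num_local_experts):
    """Dispatcher-style mapping: local index for owned experts, -1 sentinel otherwise."""
    start = ep_rank * num_local_experts
    return [
        gid - start if start <= gid < start + num_local_experts else -1
        for gid in topk_ids
    ]


def kept_by_kernel(topk_ids, ep_rank, num_local_experts):
    """Ids that a kernel expecting GLOBAL ids would treat as owned by this rank."""
    start = ep_rank * num_local_experts
    return [gid for gid in topk_ids if start <= gid < start + num_local_experts]
```

With 8 experts over 2 ranks, rank 1 owns experts 4..7. For global ids `[1, 5, 6]` the kernel should keep `[5, 6]`, but after the dispatcher mapping it sees `[-1, 1, 2]` and keeps nothing. On rank 0, and in the TP-only case where one rank owns every expert, the mapped ids still select the right experts, which is why those arms looked healthy.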

Repro + diagnosis: sunrise/bench_records/journals/2026-04-22-003-w4a16-ep-garbage-bug-repro.md

* feat(w4a16-deepseek): add SGLANG_HACK_DEBUG_W4A16_USE_TORCH_REF env flag

Gate a new pure-torch MoE reference path as an accuracy-investigation arm that
sits one level deeper than SGLANG_HACK_DEBUG_W4A16_USE_BF16_API: both
dequantize FP4 to bf16 once at load time, but BF16_API still calls the
flashinfer bf16 grouped GEMM while TORCH_REF bypasses it entirely.

* feat(w4a16-deepseek): add pure-torch MoE ref in debug_utils

torch_ref_cutlass_fused_moe mirrors the flashinfer cutlass_fused_moe
signature so the w4a16_deepseek apply() site can swap one for the other
via a single local import, matching the pattern used by
mxfp4_deepseek/naive_torch_trtllm_fp4_block_scale_routed_moe. Body
adapted from flashinfer-sunrise tests/moe/test_trtllm_cutlass_fused_moe.py
_compute_with_active_experts (commit 77746b81).
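
To show what such a reference computes, here is a single-token sketch in plain Python rather than torch. It is not the body of torch_ref_cutlass_fused_moe. It assumes already-dequantized weights, the [w3(up); w1(gate)] row order, and a SwiGLU of the form silu(gate) * up, and it omits the swiglu_limit clamp because the exact clamp semantics are not spelled out here.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def ref_moe_token(x, w31, w2, expert_ids, expert_scales):
    """Reference MoE output for one token.

    w31[e] has 2*I rows of length H, ordered [w3(up); w1(gate)].
    w2[e] has H rows of length I.
    """
    out = [0.0] * len(x)
    for expert, scale in zip(expert_ids, expert_scales):
        hidden = matvec(w31[expert], x)
        inter = len(hidden) // 2
        up, gate = hidden[:inter], hidden[inter:]
        activated = [silu(g) * u for g, u in zip(gate, up)]
        projected = matvec(w2[expert], activated)
        out = [o + scale * p for o, p in zip(out, projected)]
    return out
```

Looping over experts one at a time is slow but leaves no room for grouped-GEMM layout or interleave mistakes, which is the point of a reference arm.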

* feat(w4a16-deepseek): wire torch-ref path through single MoE call site

Gate the bf16-weight dequant branch in process_weights_after_loading on
either BF16_API or TORCH_REF (mutually exclusive), and swap the MoE
function in apply() via local import rather than duplicating the call
with different args.

* test(sunrise): add verify_torch_ref_w4a16_moe.py

Element-wise smoke test comparing torch_ref_cutlass_fused_moe against the
flashinfer cutlass_fused_moe(use_w4_group_scaling=True, swiglu_limit=...)
kernel on tiny random MXFP4 weights (shapes borrowed from flashinfer's
own W4A16_CORRECTNESS_CONFIGS). Before committing bench-scale wall-clock
to the torch-ref path, we want this to show that the two agree within
~1% at small shape.
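
One way to phrase a "within ~1%" check is sketched below. How the script actually measures agreement is not stated here, so both the metric (worst absolute difference normalized by the reference peak) and the function names are assumptions.

```python
def max_relative_error(reference, candidate, eps=1e-12):
    """Largest absolute difference, normalized by the reference's peak magnitude."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    peak = max(abs(r) for r in reference)
    worst = max(abs(r - c) for r, c in zip(reference, candidate))
    return worst / (peak + eps)


def agrees(reference, candidate, tol=0.01):
    """True when the candidate matches the reference within tol (1% by default)."""
    return max_relative_error(reference, candidate) <= tol
```

Normalizing by the peak rather than per element avoids blowing up on outputs that are close to zero, at the cost of hiding errors in small-magnitude entries.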

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added quant LLM Quantization deepseek labels Apr 25, 2026
@Fridge003 Fridge003 changed the title from "feat(w4a16-deepseek): SM90 W4A16 MoE path for DSv4 FP4 checkpoint (H2…" to "feat(w4a16-deepseek): SM90 W4A16 MoE path for DSv4 FP4 checkpoint" Apr 25, 2026
@seindum seindum mentioned this pull request Apr 25, 2026
34 tasks
@Fridge003 Fridge003 closed this May 12, 2026
@Fridge003 Fridge003 deleted the w4a16_v4 branch May 13, 2026 00:38