
revert flashinfer 0.6.11 bumps #25310

Merged
hnyls2002 merged 2 commits into main from lsyin/revert-flashinfer-0611 on May 14, 2026

Conversation

hnyls2002 (Collaborator) commented May 14, 2026

Reverts #24452 and #25129. The 0.6.11 bump causes a CUDA illegal-address crash in the triton_kernels mxfp4 MoE matmul (_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu) during piecewise CUDA graph capture for gpt-oss-120b on 4xH100. The same failure signature appears on the original PR's CI and on the main scheduled CI. Pin back to 0.6.8.post1 until upstream fixes the mxfp4 MoE path.

Failure links:
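For local verification of the pin, a minimal check along these lines could be used; it assumes the package is installed under the distribution name flashinfer_python (an assumption, not confirmed by this PR) and is not part of the change itself:

  # Hypothetical sanity check, not part of this PR: fail fast if a stray
  # 0.6.11 install is still present after the revert.
  import importlib.metadata

  EXPECTED = "0.6.8.post1"  # version this PR pins back to
  found = importlib.metadata.version("flashinfer_python")  # assumed distribution name
  if found != EXPECTED:
      raise RuntimeError(f"flashinfer_python {found} installed, expected {EXPECTED}")
  print(f"flashinfer_python {found} OK")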

gemini-code-assist (Contributor)

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

github-actions bot added the dependencies (Pull requests that update a dependency file) label on May 14, 2026
hnyls2002 (Collaborator, Author) commented

/rerun-test test_gpt_oss_4gpu.py

github-actions bot commented May 14, 2026

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/4-gpu-models/test_gpt_oss_4gpu.py

b8zhong (Collaborator) commented May 14, 2026

I'm testing whether #24281 could fix it as well. Currently, the versions of triton and triton_kernels are not aligned.
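A quick way to see that mismatch (a hypothetical helper, not part of this PR; it assumes triton and triton_kernels are the installed distribution names) is to print both versions side by side:

  # Hypothetical helper, not part of this PR: show the installed triton and
  # triton_kernels versions so a misalignment is visible at a glance.
  import importlib.metadata

  for dist in ("triton", "triton_kernels"):
      try:
          print(dist, importlib.metadata.version(dist))
      except importlib.metadata.PackageNotFoundError:
          print(dist, "not installed")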

hnyls2002 merged commit 22dfcda into main May 14, 2026
72 of 82 checks passed
hnyls2002 deleted the lsyin/revert-flashinfer-0611 branch May 14, 2026 22:29
Fridge003 pushed a commit that referenced this pull request May 14, 2026
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
Earlier draft (b0591ab) blamed the flashinfer dependency bump d5f3254
(0.6.8.post1 -> 0.6.11). That was wrong, and PR sgl-project#25310 reverting the
bump did NOT fix the Mistral NVFP4 test - confirmed by running the
test locally at the post-revert SHA 0fde615 and getting the identical
"RuntimeError: shape '[1]' is invalid for input of size 128" traceback
under flashinfer 0.6.8.post1.

Corrected root cause: 1d80a1a (PR sgl-project#23745, "Use Cute-DSL NVFP4
quantization kernels", merged 2026-05-10) wraps flashinfer.fp4_quantize
in fp4_utils.py and unconditionally selects backend="cute-dsl" on
SM100/B200. The cute-dsl kernel does
`global_scale.float().reshape(1)`, which crashes when the MoE
apply_weights call site passes the per-expert
`layer.w13_input_scale_quant` (shape [num_experts] = 128). The same
kernel exists in both flashinfer 0.6.8.post1 and 0.6.11+, so the
flashinfer version is irrelevant to this failure path.

Evidence:
- CI metrics show Mistral NVFP4 was last green in run 607 (2026-05-11,
  SHA aa7a9af, flashinfer 0.6.8.post1) and first red in run 608
  (2026-05-12, SHA 74d70af, still flashinfer 0.6.8.post1). 1d80a1a
  lands in that window.
- Run 613 (2026-05-15, post-revert SHA 0fde615, flashinfer 0.6.8.post1)
  still fails on partition 3 - the Mistral NVFP4 row is absent from the
  per-partition metrics artifact for runs 608/609/610/613.
- Local experiment matrix:
  A) 13afe8a (pre-1d80a1a) + flashinfer 0.6.8.post1: PASS gsm8k 0.951
  B) 34c0029 (CI failure SHA) + cute-dsl backend + flashinfer 0.6.11.post1: FAIL reshape '[1]'
  C) 34c0029 + cuda backend patch + flashinfer 0.6.11.post1: FAIL globalScale strict check
  D) 0fde615 (post-revert) + cute-dsl backend + flashinfer 0.6.8.post1: FAIL reshape '[1]'
  E) 0fde615 + cuda backend one-line patch + flashinfer 0.6.8.post1: PASS gsm8k 0.949

One-line fix on post-revert main:
  sed -i 's|"cute-dsl" if is_sm100_supported() else "cuda"|"cuda"|' \
    python/sglang/srt/layers/quantization/fp4_utils.py

Verified passes gsm8k 0.949 in 23m 43s. The proper long-term fix is to
collapse layer.w13_input_scale_quant to scalar or per-token before
calling fp4_quantize in compressed_tensors_w4a4_nvfp4_moe.py:315.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
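To make the failure mode described in the commit message above concrete, here is a minimal reproduction sketch using only torch; it is not the actual flashinfer cute-dsl kernel, just the same reshape applied to a tensor with the shape the commit message describes ([num_experts] = 128):

  # Reproduction sketch (torch only, not the real cute-dsl kernel):
  # reshape(1) requires a single-element tensor, so a per-expert scale
  # of shape [128] fails with the exact error quoted above.
  import torch

  global_scale = torch.ones(128)  # stand-in for layer.w13_input_scale_quant
  try:
      global_scale.float().reshape(1)  # what the cute-dsl path effectively does
  except RuntimeError as err:
      print(err)  # shape '[1]' is invalid for input of size 128

  scalar_scale = torch.ones(())  # a single-element scale reshapes fine
  print(scalar_scale.float().reshape(1).shape)  # torch.Size([1])

As the commit message notes, the longer-term fix would reduce the per-expert tensor to a single element (or quantize per expert) before this reshape; which reduction is semantically correct is not determined here.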
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
