
revert flashinfer 0.6.11 bumps #25310

Merged
hnyls2002 merged 2 commits into main from lsyin/revert-flashinfer-0611 on May 14, 2026

Conversation

hnyls2002 (Collaborator) commented May 14, 2026

Reverts #24452 and #25129. The 0.6.11 bump causes a CUDA illegal-address crash in the triton_kernels mxfp4 MoE matmul (_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu) during piecewise CUDA graph capture for gpt-oss-120b on 4xH100. The same failure signature appears on the original PR's CI and on the main scheduled CI. Pin back to 0.6.8.post1 until upstream fixes the mxfp4 MoE path.

Failure links:
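For local verification of the pin, a minimal check along these lines could be used; it assumes the package is installed under the distribution name flashinfer_python (an assumption, not confirmed by this PR) and is not part of the change itself:

  # Hypothetical sanity check, not part of this PR: fail fast if a stray
  # 0.6.11 install is still present after the revert.
  import importlib.metadata

  EXPECTED = "0.6.8.post1"  # version this PR pins back to
  found = importlib.metadata.version("flashinfer_python")  # assumed distribution name
  if found != EXPECTED:
      raise RuntimeError(f"flashinfer_python {found} installed, expected {EXPECTED}")
  print(f"flashinfer_python {found} OK")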

gemini-code-assist (Contributor)

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

github-actions bot added the dependencies (Pull requests that update a dependency file) label on May 14, 2026
hnyls2002 (Collaborator, Author) commented

/rerun-test test_gpt_oss_4gpu.py

github-actions bot commented May 14, 2026

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/4-gpu-models/test_gpt_oss_4gpu.py

b8zhong (Collaborator) commented May 14, 2026

I'm testing whether #24281 could fix it as well. Currently, the versions of triton and triton_kernels are not aligned.
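A quick way to see that mismatch (a hypothetical helper, not part of this PR; it assumes triton and triton_kernels are the installed distribution names) is to print both versions side by side:

  # Hypothetical helper, not part of this PR: show the installed triton and
  # triton_kernels versions so a misalignment is visible at a glance.
  import importlib.metadata

  for dist in ("triton", "triton_kernels"):
      try:
          print(dist, importlib.metadata.version(dist))
      except importlib.metadata.PackageNotFoundError:
          print(dist, "not installed")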

hnyls2002 merged commit 22dfcda into main May 14, 2026
72 of 82 checks passed
hnyls2002 deleted the lsyin/revert-flashinfer-0611 branch May 14, 2026 22:29
Fridge003 pushed a commit that referenced this pull request May 14, 2026
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
Earlier draft (b0591ab) blamed the flashinfer dependency bump d5f3254
(0.6.8.post1 -> 0.6.11). That was wrong, and PR sgl-project#25310 reverting the
bump did NOT fix the Mistral NVFP4 test - confirmed by running the
test locally at the post-revert SHA 0fde615 and getting the identical
"RuntimeError: shape '[1]' is invalid for input of size 128" traceback
under flashinfer 0.6.8.post1.

Corrected root cause: 1d80a1a (PR sgl-project#23745, "Use Cute-DSL NVFP4
quantization kernels", merged 2026-05-10) wraps flashinfer.fp4_quantize
in fp4_utils.py and unconditionally selects backend="cute-dsl" on
SM100/B200. The cute-dsl kernel does
`global_scale.float().reshape(1)`, which crashes when the MoE
apply_weights call site passes the per-expert
`layer.w13_input_scale_quant` (shape [num_experts] = 128). The same
kernel exists in both flashinfer 0.6.8.post1 and 0.6.11+, so the
flashinfer version is irrelevant to this failure path.

Evidence:
- CI metrics show Mistral NVFP4 was last green in run 607 (2026-05-11,
  SHA aa7a9af, flashinfer 0.6.8.post1) and first red in run 608
  (2026-05-12, SHA 74d70af, still flashinfer 0.6.8.post1). 1d80a1a
  lands in that window.
- Run 613 (2026-05-15, post-revert SHA 0fde615, flashinfer 0.6.8.post1)
  still fails on partition 3 - the Mistral NVFP4 row is absent from the
  per-partition metrics artifact for runs 608/609/610/613.
- Local experiment matrix:
  A) 13afe8a (pre-1d80a1a) + flashinfer 0.6.8.post1: PASS gsm8k 0.951
  B) 34c0029 (CI failure SHA) + cute-dsl backend + flashinfer 0.6.11.post1: FAIL reshape '[1]'
  C) 34c0029 + cuda backend patch + flashinfer 0.6.11.post1: FAIL globalScale strict check
  D) 0fde615 (post-revert) + cute-dsl backend + flashinfer 0.6.8.post1: FAIL reshape '[1]'
  E) 0fde615 + cuda backend one-line patch + flashinfer 0.6.8.post1: PASS gsm8k 0.949

One-line fix on post-revert main:
  sed -i 's|"cute-dsl" if is_sm100_supported() else "cuda"|"cuda"|' \
    python/sglang/srt/layers/quantization/fp4_utils.py

Verified passes gsm8k 0.949 in 23m 43s. The proper long-term fix is to
collapse layer.w13_input_scale_quant to scalar or per-token before
calling fp4_quantize in compressed_tensors_w4a4_nvfp4_moe.py:315.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
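To make the failure mode described in the commit message above concrete, here is a minimal reproduction sketch using only torch; it is not the actual flashinfer cute-dsl kernel, just the same reshape applied to a tensor with the shape the commit message describes ([num_experts] = 128):

  # Reproduction sketch (torch only, not the real cute-dsl kernel):
  # reshape(1) requires a single-element tensor, so a per-expert scale
  # of shape [128] fails with the exact error quoted above.
  import torch

  global_scale = torch.ones(128)  # stand-in for layer.w13_input_scale_quant
  try:
      global_scale.float().reshape(1)  # what the cute-dsl path effectively does
  except RuntimeError as err:
      print(err)  # shape '[1]' is invalid for input of size 128

  scalar_scale = torch.ones(())  # a single-element scale reshapes fine
  print(scalar_scale.float().reshape(1).shape)  # torch.Size([1])

As the commit message notes, the longer-term fix would reduce the per-expert tensor to a single element (or quantize per expert) before this reshape; which reduction is semantically correct is not determined here.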
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
