revert flashinfer 0.6.11 bumps #25310
Merged
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
Author
/rerun-test test_gpt_oss_4gpu.py
Contributor
🚀
Collaborator
I'm testing whether #24281 could fix it as well. Currently, the versions of triton and triton_kernels are not aligned.
Fridge003 pushed a commit that referenced this pull request on May 14, 2026
Jiminator added a commit to Jiminator/sglang that referenced this pull request on May 15, 2026
Earlier draft (b0591ab) blamed the flashinfer dependency bump d5f3254 (0.6.8.post1 -> 0.6.11). That was wrong, and PR sgl-project#25310 reverting the bump did NOT fix the Mistral NVFP4 test - confirmed by running the test locally at the post-revert SHA 0fde615 and getting the identical "RuntimeError: shape '[1]' is invalid for input of size 128" traceback under flashinfer 0.6.8.post1.

Corrected root cause: 1d80a1a (PR sgl-project#23745, "Use Cute-DSL NVFP4 quantization kernels", merged 2026-05-10) wraps flashinfer.fp4_quantize in fp4_utils.py and unconditionally selects backend="cute-dsl" on SM100/B200. The cute-dsl kernel does `global_scale.float().reshape(1)`, which crashes when the MoE apply_weights call site passes the per-expert `layer.w13_input_scale_quant` (shape [num_experts] = 128). The same kernel exists in both flashinfer 0.6.8.post1 and 0.6.11+, so the flashinfer version is irrelevant to this failure path.

Evidence:
- CI metrics show Mistral NVFP4 was last green in run 607 (2026-05-11, SHA aa7a9af, flashinfer 0.6.8.post1) and first red in run 608 (2026-05-12, SHA 74d70af, still flashinfer 0.6.8.post1). 1d80a1a lands in that window.
- Run 613 (2026-05-15, post-revert SHA 0fde615, flashinfer 0.6.8.post1) still fails on partition 3 - the Mistral NVFP4 row is absent from the per-partition metrics artifact for runs 608/609/610/613.
- Local experiment matrix:
  A) 13afe8a (pre-1d80a1a) + flashinfer 0.6.8.post1: PASS (gsm8k 0.951)
  B) 34c0029 (CI failure SHA) + cute-dsl backend + flashinfer 0.6.11.post1: FAIL (reshape '[1]')
  C) 34c0029 + cuda backend patch + flashinfer 0.6.11.post1: FAIL (globalScale strict check)
  D) 0fde615 (post-revert) + cute-dsl backend + flashinfer 0.6.8.post1: FAIL (reshape '[1]')
  E) 0fde615 + cuda backend one-line patch + flashinfer 0.6.8.post1: PASS (gsm8k 0.949)

One-line fix on post-revert main:

    sed -i 's|"cute-dsl" if is_sm100_supported() else "cuda"|"cuda"|' \
        python/sglang/srt/layers/quantization/fp4_utils.py

Verified: passes gsm8k 0.949 in 23m 43s. The proper long-term fix is to collapse layer.w13_input_scale_quant to a scalar or per-token scale before calling fp4_quantize in compressed_tensors_w4a4_nvfp4_moe.py:315.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
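A minimal sketch of the failure mode described above, assuming only torch; the tensors here are stand-ins for `layer.w13_input_scale_quant` and a scalar global scale, not sglang's actual call site:

```python
import torch

# Stand-in for the per-expert layer.w13_input_scale_quant: shape [num_experts] = 128.
per_expert_scale = torch.ones(128)

# What the cute-dsl kernel expects: a single-element global scale.
scalar_scale = torch.tensor([0.5])

print(scalar_scale.float().reshape(1))  # OK: one element reshapes to [1]

try:
    per_expert_scale.float().reshape(1)  # same reshape the commit message quotes
except RuntimeError as e:
    print(e)  # shape '[1]' is invalid for input of size 128
```

And a hypothetical sketch of the long-term fix named in the last paragraph; the helper name is invented here, and the choice of reduction is an assumption (taking the per-expert minimum is one conservative guess; the correct reduction depends on how NVFP4 derives its global scale):

```python
import torch

def collapse_global_scale(scale: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: fp4_quantize's backends expect a single-element
    # global scale. Reducing with amin() is an assumption; the right
    # conservative choice depends on NVFP4 scale semantics.
    return scale if scale.numel() == 1 else scale.amin().reshape(1)
```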
Reverts #24452 and #25129. The 0.6.11 bump causes a CUDA illegal-address crash in the triton_kernels mxfp4 MoE matmul (`_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu`) during piecewise CUDA graph capture for gpt-oss-120b on 4xH100. The same failure signature appears on the original PR CI and on main scheduled CI. Pin back to 0.6.8.post1 until upstream fixes the mxfp4 MoE path.
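For anyone verifying the pin locally, a minimal guard sketch (not part of this PR; it assumes flashinfer exposes the conventional `__version__` attribute):

```python
import flashinfer

PINNED = "0.6.8.post1"  # version this PR pins back to
if flashinfer.__version__ != PINNED:
    raise RuntimeError(
        f"flashinfer {flashinfer.__version__} installed; pin to {PINNED} "
        "until the mxfp4 MoE path is fixed upstream"
    )
```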
Failure links: