[Fix] Fix gpt oss triton kernels and upgrade flashinfer back to 0.6.11.post1 #25335
Merged
This commit updates the sglang-kernel version across SGLang files to match
the version defined in sgl-kernel/pyproject.toml.
Files updated:
- docker/Dockerfile
- python/pyproject.toml
- python/sglang/srt/entrypoints/engine.py
🤖 Generated with GitHub Actions
This reverts commit 22dfcda.
Collaborator (Author): /tag-and-rerun-ci
b8zhong approved these changes on May 15, 2026
Collaborator (Author): /rerun-test test_gpt_oss_4gpu.py
Contributor: 🚀
Contributor: @Fridge003 what about https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.11.post2, which has some more bug fixes on the flashinfer side?
Collaborator (Author): @mmangkad we can open another PR for that.
This reverts commit 1913cb4.
Collaborator (Author): /rerun-test test_gpt_oss_4gpu.py
Contributor: 🚀
Fridge003 added a commit that referenced this pull request on May 15, 2026:
…1.post1 (#25335)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Co-authored-by: mmangkad <mmangkad@users.noreply.github.com>
Jiminator added a commit to Jiminator/sglang that referenced this pull request on May 15, 2026:
…2c1034

Two findings appended to the bisect report:

1. PR sgl-project#25335 ("Fix gpt oss triton kernels and upgrade flashinfer back to 0.6.11.post1") re-bumped flashinfer past PR sgl-project#25310's revert. The one-line fix in fp4_utils.py:22 (cute-dsl -> cuda) is therefore no longer sufficient on latest main: experiment G reproduces the strict cuda-side check from fp4Quantize.cpp:64 ("globalScale should have shape [1] or [num_tokens]"), identical to experiment C. The proper fix is now at the call site in compressed_tensors_w4a4_nvfp4_moe.py:315: collapse layer.w13_input_scale_quant (shape [num_experts]) to scalar [1] or per-token [num_tokens] before passing as global_scale.

2. The TP8+MTP variant has its own separate pre-existing regression, bisected to d2c1034 ("[Gemma 4] Adding MTP support", PR sgl-project#24436). That PR added _resolve_speculative_algorithm_alias in server_args.py:318-342 which unconditionally calls AutoConfig.from_pretrained on the draft path to detect Gemma4 drafts. It crashes on any draft in Mistral native format (params.json, no HF config.json), even when --speculative-algorithm is already explicit EAGLE.

Empirical proof for (2):
- d2c1034 + TP8+MTP-only test: FAIL with "Unrecognized model in ...Eagle. Should have a model_type key in its config.json", total wall time 60.7s (crashes before model load).
- f1395af (parent of d2c1034) + same test: PASS, gsm8k 0.949.

Both with flashinfer 0.6.8.post1, sglang-kernel 0.4.2.post1+cu130, torch 2.11.0+cu130, SGLANG_IS_IN_CI=true, SGLANG_ENABLE_JIT_DEEPGEMM=0, SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1.

Minimal fix for (2): wrap the AutoConfig.from_pretrained call in _resolve_speculative_algorithm_alias with try/except, or short-circuit when speculative_algorithm is already explicit and the user did not request NEXTN aliasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
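The minimal fix for (2) in the commit message above could look roughly like the sketch below. It is not the actual SGLang code. The function names, the `ConfigLoader` type, and the returned tuple are hypothetical, and the config lookup is passed in as a plain callable so the example runs without `transformers`. It also assumes that a missing or unrecognized `config.json` surfaces as `OSError` or `ValueError`.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for the draft-config lookup. Per the commit message,
# the real code calls AutoConfig.from_pretrained at this point.
ConfigLoader = Callable[[str], dict]


def detect_draft_model_type(draft_path: str, load_config: ConfigLoader) -> Optional[str]:
    """Return the draft's model_type, or None when no usable HF config exists."""
    try:
        config = load_config(draft_path)
    except (OSError, ValueError):
        # Assumption: drafts without an HF config.json (e.g. Mistral native
        # format with params.json) fail with one of these exception types.
        return None
    return config.get("model_type")


def should_probe_draft(speculative_algorithm: Optional[str]) -> bool:
    """Probe the draft only when the algorithm is unset or is the NEXTN alias."""
    return speculative_algorithm is None or speculative_algorithm.upper() == "NEXTN"


def resolve_algorithm(
    speculative_algorithm: Optional[str],
    draft_path: str,
    load_config: ConfigLoader,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (algorithm, detected draft model_type or None)."""
    if not should_probe_draft(speculative_algorithm):
        # Short-circuit: an explicit algorithm such as EAGLE never touches
        # the draft config, so a missing config.json cannot crash startup.
        return speculative_algorithm, None
    return speculative_algorithm, detect_draft_model_type(draft_path, load_config)


def mistral_native_loader(path: str) -> dict:
    """Simulates a draft directory that has params.json but no HF config.json."""
    raise ValueError(f"Unrecognized model in {path}")


# An explicit EAGLE request skips the probe; a NEXTN request probes but
# tolerates the failing lookup instead of crashing.
explicit = resolve_algorithm("EAGLE", "draft-dir", mistral_native_loader)
aliased = resolve_algorithm("NEXTN", "draft-dir", mistral_native_loader)
```

The sketch combines both options from the commit message: the short-circuit avoids the lookup entirely for explicit algorithms, and the try/except keeps the alias path from crashing when the lookup still has to run.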
Motivation
Co-authors: @b8zhong, @mmangkad
Builds on #25312.