[Fix] Fix gpt oss triton kernels and upgrade flashinfer back to 0.6.11.post1#25335

Merged
Fridge003 merged 5 commits into main from fix_flashinfer on May 15, 2026
@Fridge003
Collaborator

@Fridge003 Fridge003 commented May 15, 2026

Motivation

Co-authors: @b8zhong @mmangkad
Based on #25312, with modifications.

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

sglang-bot and others added 3 commits May 15, 2026 02:10
This commit updates the sglang-kernel version across SGLang files to match
the version defined in sgl-kernel/pyproject.toml.

Files updated:
  - docker/Dockerfile
  - python/pyproject.toml
  - python/sglang/srt/entrypoints/engine.py

🤖 Generated with GitHub Actions
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the dependencies Pull requests that update a dependency file label May 15, 2026
@Fridge003 Fridge003 changed the title from "[Fix] Fix gpt oss triton kernels and upgrade flashinfer back to 0.6.11.post2" to "[Fix] Fix gpt oss triton kernels and upgrade flashinfer back to 0.6.11.post1" May 15, 2026
@Fridge003
Collaborator Author

/tag-and-rerun-ci

@Fridge003
Collaborator Author

/rerun-test test_gpt_oss_4gpu.py

@github-actions
Contributor

github-actions Bot commented May 15, 2026

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/4-gpu-models/test_gpt_oss_4gpu.py

@mmangkad
Contributor

@Fridge003
Collaborator Author

@mmangkad we can open another PR for that

@Fridge003
Collaborator Author

/rerun-test test_gpt_oss_4gpu.py

@github-actions
Contributor

github-actions Bot commented May 15, 2026

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/4-gpu-models/test_gpt_oss_4gpu.py

@Fridge003 Fridge003 merged commit 0c19540 into main May 15, 2026
123 of 139 checks passed
@Fridge003 Fridge003 deleted the fix_flashinfer branch May 15, 2026 08:04
Fridge003 added a commit that referenced this pull request May 15, 2026
…1.post1 (#25335)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Co-authored-by: mmangkad <mmangkad@users.noreply.github.com>
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
…2c1034

Two findings appended to the bisect report:

1. PR sgl-project#25335 ("Fix gpt oss triton kernels and upgrade flashinfer back
   to 0.6.11.post1") re-bumped flashinfer past PR sgl-project#25310's revert.
   The one-line fix in fp4_utils.py:22 (cute-dsl -> cuda) is therefore
   no longer sufficient on latest main: experiment G reproduces the
   strict cuda-side check from fp4Quantize.cpp:64 ("globalScale should
   have shape [1] or [num_tokens]"), identical to experiment C.
   The proper fix is now at the call site in
   compressed_tensors_w4a4_nvfp4_moe.py:315: collapse
   layer.w13_input_scale_quant (shape [num_experts]) to scalar [1] or
   per-token [num_tokens] before passing as global_scale.

2. The TP8+MTP variant has its own separate pre-existing regression,
   bisected to d2c1034 ("[Gemma 4] Adding MTP support", PR sgl-project#24436).
   That PR added _resolve_speculative_algorithm_alias in
   server_args.py:318-342 which unconditionally calls
   AutoConfig.from_pretrained on the draft path to detect Gemma4
   drafts. It crashes on any draft in Mistral native format (params.json,
   no HF config.json), even when --speculative-algorithm is already
   set explicitly to EAGLE.
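
To illustrate finding 1, the sketch below mirrors the quoted shape rule (a global scale of shape [1] or [num_tokens]) and collapses a per-expert scale to a single value. It uses plain Python lists as stand-ins for 1-D tensors. The helper names and the `min` reduction are assumptions made for illustration only; they are not taken from sglang or flashinfer, and the report does not say which reduction is correct for `w13_input_scale_quant`.

```python
from typing import Callable, List, Sequence


def is_valid_global_scale(scale: Sequence[float], num_tokens: int) -> bool:
    """Mirror the quoted rule: shape must be [1] or [num_tokens]."""
    return len(scale) in (1, num_tokens)


def collapse_to_scalar(
    per_expert_scale: Sequence[float],
    reduce: Callable[[Sequence[float]], float] = min,  # assumption: placeholder reduction
) -> List[float]:
    """Collapse a [num_experts] scale to shape [1] (hypothetical helper)."""
    if len(per_expert_scale) == 0:
        raise ValueError("per_expert_scale must not be empty")
    return [reduce(per_expert_scale)]


# A [num_experts] scale fails the rule unless num_experts happens to equal
# num_tokens; the collapsed [1] scale always passes.
per_expert = [0.5, 0.25, 1.0]
num_tokens = 7
print(is_valid_global_scale(per_expert, num_tokens))
print(is_valid_global_scale(collapse_to_scalar(per_expert), num_tokens))
```

Note that a length check alone cannot tell a per-expert scale from a per-token one when the two sizes coincide, which is one reason to collapse at the call site rather than rely on the downstream check.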

Empirical proof for (2):
- d2c1034 + TP8+MTP-only test: FAIL with
  "Unrecognized model in ...Eagle. Should have a model_type key in
  its config.json", total wall time 60.7s (crashes before model load).
- f1395af (parent of d2c1034) + same test: PASS, gsm8k 0.949.

Both with flashinfer 0.6.8.post1, sglang-kernel 0.4.2.post1+cu130,
torch 2.11.0+cu130, SGLANG_IS_IN_CI=true, SGLANG_ENABLE_JIT_DEEPGEMM=0,
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1.

Minimal fix for (2): wrap the AutoConfig.from_pretrained call in
_resolve_speculative_algorithm_alias with try/except, or
short-circuit when speculative_algorithm is already explicit and
the user did not request NEXTN aliasing.
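
The two options for (2) can be sketched as follows. This is a minimal, self-contained illustration, not the actual `_resolve_speculative_algorithm_alias` code: the function names, the `ALIAS_ALGORITHMS` set, and the injected `load_model_type` callable (standing in for the `AutoConfig.from_pretrained` lookup) are all hypothetical. It also assumes the loader signals an unreadable or unrecognized config with `ValueError` or `OSError`.

```python
from typing import Callable, Optional

# Assumption: only these values require inspecting the draft model.
ALIAS_ALGORITHMS = {"NEXTN"}


def needs_draft_inspection(speculative_algorithm: Optional[str]) -> bool:
    """Short-circuit: skip the config lookup when the algorithm is explicit."""
    return speculative_algorithm is None or speculative_algorithm in ALIAS_ALGORITHMS


def safe_draft_model_type(
    draft_path: str,
    load_model_type: Callable[[str], str],  # hypothetical stand-in for the config loader
) -> Optional[str]:
    """Return the draft's model type, or None if its config cannot be read."""
    try:
        return load_model_type(draft_path)
    except (ValueError, OSError):
        # e.g. a draft without an HF config.json; treat it as "not detected"
        return None


def failing_loader(path: str) -> str:
    # Simulates a draft whose config is not recognized.
    raise ValueError(f"Unrecognized model in {path}")


print(needs_draft_inspection("EAGLE"))
print(safe_draft_model_type("some/draft/path", failing_loader))
```

The short-circuit avoids touching the draft path at all for explicit algorithms, while the try/except keeps detection best-effort for the cases that still need it; the two can be combined.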

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 participants