Align triton_kernels with Triton 3.6.0 and fix SM120 MXFP4 MoE performance #24281

Closed

mmangkad wants to merge 3 commits into sgl-project:main from mmangkad:update-triton-kernels-3.6.0

Conversation

mmangkad (Contributor) commented May 2, 2026

Summary

  • Following the Torch 2.11 upgrade in [Dependency] Upgrade to Torch 2.11.0 #21247, align the bundled triton_kernels source with the Triton 3.6.0 version shipped by Torch.
  • Update SGLang's triton_kernels MoE integration for Triton 3.6.0 API changes (a sketch of the import and FnSpecs changes follows this list):
    • GatherIndx, RoutingData, and ScatterIndx moved from triton_kernels.routing to triton_kernels.matmul_ogs.
    • triton_kernels.routing is no longer exposed, so SGLang now rebuilds routing from triton_kernels.topk and ragged tensor metadata.
    • swiglu fused activation now passes reduction_n=2 through FnSpecs.
  • Tighten is_triton_kernels_available() so stale pre-3.6 installs are not treated as compatible.
  • Fix severe SM120 GPT-OSS MXFP4 decode slowdown after the Triton 3.6.0 update.
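
A minimal sketch of the import-level and activation changes, assuming the Triton 3.6.0 layout described above; swiglu_fn's import path, alpha, and limit are illustrative, and the exact FnSpecs signature may differ:

# Triton 3.6.0: the routing dataclasses now live in matmul_ogs;
# triton_kernels.routing no longer exists.
from triton_kernels.matmul_ogs import (
    FnSpecs,
    FusedActivation,
    GatherIndx,
    RoutingData,
    ScatterIndx,
)
from triton_kernels.swiglu import swiglu_fn  # illustrative import path

alpha, limit = 1.702, 7.0  # illustrative activation parameters

# Pre-3.6, reduction_n=2 was passed alongside the activation; with 3.6.0
# it travels through FnSpecs instead.
swiglu_spec = FnSpecs("swiglu", swiglu_fn, ("alpha", "limit"), reduction_n=2)
activation = FusedActivation(swiglu_spec, (alpha, limit))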

The SM120 performance issue comes from this triton_kernels heuristic change:

# Triton 3.5.1
return max(block_m * block_n // 4096, 4)

# Triton 3.6.0
return max(block_m * block_n // 4096, 4 if is_persistent else 1)

For small decode/ragged MoE batches on SM120, the non-persistent branch can now select num_warps=1 (any tile with block_m * block_n < 4096 floors to max(0, 1) = 1), which cripples throughput. Without this patch, I was seeing SM120 decode at around 35-36 tokens/s. This patch restores the old 4-warp floor only for SM120 non-persistent StridedLayout MXFP4 matmuls.

This also removes the explicit SM120 block_k=128 override added in #20040. That override existed because the older triton_kernels path could hit assert num_stages >= 1 during GPT-OSS startup on SM120, most likely during PCG warmup/capture. With the Triton 3.6.0 triton_kernels path, I can no longer reproduce that failure without the override, so this PR lets Triton choose its default block_k again, which is currently 256 for this path.
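
For context, the removed override looked roughly like this (a sketch; update_opt_flags_constraints is the opt-flags constraint hook triton_kernels exposes, and the capability check is illustrative):

import torch
from triton_kernels.matmul_ogs_details import opt_flags

# Pre-3.6 workaround from #20040, removed by this PR: pin block_k to 128 on
# SM120 to dodge the num_stages assertion during GPT-OSS startup.
if torch.cuda.get_device_capability() == (12, 0):
    opt_flags.update_opt_flags_constraints({"block_k": 128})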

The SM120 num_warps patch is ugly, but it is the only practical way to restore SM120 performance from within SGLang right now: triton_kernels does not expose num_warps as an opt-flags constraint, so the patch narrowly adjusts the heuristic only for the SM120 non-persistent StridedLayout MXFP4 path.
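
A hedged sketch of that shape, not the exact diff: wrap the heuristic in triton_kernels' opt-flags module and re-impose the 4-warp floor on the narrow path above. _num_warps_heuristic is a stand-in name; the real patch in mxfp4.py targets the actual function and also checks that the operands are the SM120 StridedLayout MXFP4 case:

import functools
from triton_kernels.matmul_ogs_details import opt_flags

def restore_sm120_warp_floor():
    # Stand-in attribute name; see the lead-in above.
    orig = opt_flags._num_warps_heuristic

    @functools.wraps(orig)
    def patched(block_m, block_n, is_persistent, *args, **kwargs):
        num_warps = orig(block_m, block_n, is_persistent, *args, **kwargs)
        # Triton 3.6.0 dropped the floor to 1 for non-persistent kernels;
        # restore the 3.5.1 floor of 4 on this path only.
        return num_warps if is_persistent else max(num_warps, 4)

    opt_flags._num_warps_heuristic = patched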

Accuracy Tests

H200 (MXFP4):

python -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 256 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_212928.html
{'chars': np.float64(52.16729797979798), 'chars:std': np.float64(218.80938828184415), 'score': np.float64(0.5744949494949495), 'score:std': np.float64(0.494419358945162)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_212928.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_212928_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260502_212928', 'metric': 0.5744949494949495}]

GB300 (MXFP4):

python -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_214250.html
{'chars': np.float64(52.530934343434346), 'chars:std': np.float64(200.0052121862134), 'score': np.float64(0.5549242424242424), 'score:std': np.float64(0.4969741719587881)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_214250.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_214250_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260502_214250', 'metric': 0.5549242424242424}]

GB300 (BF16):

python -m gpt_oss.evals --model lmsys/gpt-oss-20b-bf16 --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425.html
{'chars': np.float64(50.92550505050505), 'chars:std': np.float64(214.67684371696978), 'score': np.float64(0.5568181818181818), 'score:std': np.float64(0.4967612044180544)}
Writing results to /tmp/gpqa_lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425.json
Writing all results to /tmp/gpqa_lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425', 'metric': 0.5568181818181818}]

RTX PRO 6000 (MXFP4):

python -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 256 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_223913.html
{'chars': np.float64(52.41856060606061), 'chars:std': np.float64(206.45098944881033), 'score': np.float64(0.553030303030303), 'score:std': np.float64(0.4971798336221153)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_223913.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_223913_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260502_223913', 'metric': 0.553030303030303}]

gemini-code-assist (Bot) left a comment

Code Review

This pull request updates the Triton dependency to version 3.6.0 and adapts the MoE and quantization layers accordingly. Key changes include a local implementation of the routing function in topk.py, which introduces a regression by disabling simulated expert parallelism, and a monkey-patching utility in mxfp4.py that enforces a minimum warp count for specific matmuls on SM120 hardware. The is_triton_kernels_available check was also expanded to include ragged tensor metadata, and FusedActivation calls were updated to pass explicit reduction parameters. One piece of feedback notes the loss of functionality for simulated expert parallelism.

Comment thread: python/sglang/srt/layers/moe/topk.py
mmangkad (Contributor, Author) commented May 2, 2026

/rerun-failed-ci

@github-actions github-actions Bot added the run-ci label May 2, 2026
johnnynunez (Contributor) commented May 4, 2026

Could you align with the new 3.7.0? It fixes behavior on AGX Thor and DGX Spark:
#24351

tbraun96 commented
Disclosure: Atlas maintainer. We carry a patch for roughly this same gap; dropping it here in case it shortens the review.

The Triton MXFP4 path on sm_121 (GB10) is unsalvageable for the same reason it's broken on RTX 50-series consumer Blackwell: the kernel emits .tile::scatter4 PTX, which is SM100/Hopper-only. SGLang's Triton 3.6.0 alignment fixes the perf regression where the path lands at all, but on sm_121 you'll still hit the codegen wall.

What worked for us was bypassing the PTX path entirely with a software E2M1 conversion. We patch FlashInfer's CUTLASS headers at container build time:

docker/gb10/fix_flashinfer_e2m1_sm121.py

The patch is roughly 30 lines: it disables CUDA_PTX_FP4FP6_CVT_ENABLED for SM121 in float_subbyte.h and adds a __float_as_uint-based fallback. It applies cleanly to FlashInfer mainline at the time of writing. The bench numbers we got with Qwen3.6-35B-A3B-NVFP4 on a single Spark: 214.6 tok/s decode at c=1 with MTP K=2, with a sparkrun-benchmark provenance bundle attached on this PR for verification:

Avarok-Cybersecurity/atlas-recipes#2

For the SGLang-internal route specifically, the equivalent NVFP4-via-Marlin-W4A16 fallback is probably the path of least resistance, since you avoid the FlashInfer dependency in the kernel build. Either way, happy to land the FlashInfer patch as a PR if it'd help reviewers compare approaches.
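
For reference, a minimal sketch of the build-time patch script described above, assuming the guard appears as a single #if defined(...) line in float_subbyte.h; the header path and match string are assumptions, and the real ~30-line patch also injects the __float_as_uint-based E2M1 fallback:

from pathlib import Path

# Header path inside the FlashInfer source tree is an assumption.
hdr = Path("include/cutlass/float_subbyte.h")
src = hdr.read_text()

# Narrow the PTX cvt guard so sm_121 (__CUDA_ARCH__ == 1210) falls through
# to the software E2M1 conversion path instead of the PTX instructions.
src = src.replace(
    "#if defined(CUDA_PTX_FP4FP6_CVT_ENABLED)",
    "#if defined(CUDA_PTX_FP4FP6_CVT_ENABLED) && (__CUDA_ARCH__ != 1210)",
)
hdr.write_text(src)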

b8zhong (Collaborator) commented May 14, 2026

@mmangkad Can you please fix the conflicts

Fridge003 (Collaborator) commented
To be included in #25312

@Fridge003 Fridge003 closed this May 15, 2026
@mmangkad mmangkad deleted the update-triton-kernels-3.6.0 branch May 15, 2026 16:07