Align triton_kernels with Triton 3.6.0 and fix SM120 MXFP4 MoE performance #24281

Closed

mmangkad wants to merge 3 commits into sgl-project:main from mmangkad:update-triton-kernels-3.6.0

Conversation

mmangkad (Contributor) commented May 2, 2026

Summary

  • Following the Torch 2.11 upgrade in [Dependency] Upgrade to Torch 2.11.0 #21247, align the bundled triton_kernels source with the Triton 3.6.0 version shipped by Torch.
  • Update SGLang's triton_kernels MoE integration for Triton 3.6.0 API changes (a sketch of the import and FnSpecs changes follows this list):
    • GatherIndx, RoutingData, and ScatterIndx moved from triton_kernels.routing to triton_kernels.matmul_ogs.
    • triton_kernels.routing is no longer exposed, so SGLang now rebuilds routing from triton_kernels.topk and ragged tensor metadata.
    • swiglu fused activation now passes reduction_n=2 through FnSpecs.
  • Tighten is_triton_kernels_available() so stale pre-3.6 installs are not treated as compatible.
  • Fix severe SM120 GPT-OSS MXFP4 decode slowdown after the Triton 3.6.0 update.
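
A minimal sketch of the import-level and activation changes, assuming the Triton 3.6.0 layout described above; swiglu_fn's import path, alpha, and limit are illustrative, and the exact FnSpecs signature may differ:

# Triton 3.6.0: the routing dataclasses now live in matmul_ogs;
# triton_kernels.routing no longer exists.
from triton_kernels.matmul_ogs import (
    FnSpecs,
    FusedActivation,
    GatherIndx,
    RoutingData,
    ScatterIndx,
)
from triton_kernels.swiglu import swiglu_fn  # illustrative import path

alpha, limit = 1.702, 7.0  # illustrative activation parameters

# Pre-3.6, reduction_n=2 was passed alongside the activation; with 3.6.0
# it travels through FnSpecs instead.
swiglu_spec = FnSpecs("swiglu", swiglu_fn, ("alpha", "limit"), reduction_n=2)
activation = FusedActivation(swiglu_spec, (alpha, limit))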

The SM120 performance issue comes from this triton_kernels heuristic change:

# Triton 3.5.1
return max(block_m * block_n // 4096, 4)

# Triton 3.6.0
return max(block_m * block_n // 4096, 4 if is_persistent else 1)

For small decode/ragged MoE batches on SM120, the non-persistent branch can now select num_warps=1 (any tile with block_m * block_n < 4096 floors to max(0, 1) = 1), which cripples throughput. Without this patch, I was seeing SM120 decode at around 35-36 tokens/s. This patch restores the old 4-warp floor only for SM120 non-persistent StridedLayout MXFP4 matmuls.

This also removes the explicit SM120 block_k=128 override added in #20040. That override existed because the older triton_kernels path could hit assert num_stages >= 1 during GPT-OSS startup on SM120, most likely during PCG warmup/capture. With the Triton 3.6.0 triton_kernels path, I can no longer reproduce that failure without the override, so this PR lets Triton choose its default block_k again, which is currently 256 for this path.
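
For context, the removed override looked roughly like this (a sketch; update_opt_flags_constraints is the opt-flags constraint hook triton_kernels exposes, and the capability check is illustrative):

import torch
from triton_kernels.matmul_ogs_details import opt_flags

# Pre-3.6 workaround from #20040, removed by this PR: pin block_k to 128 on
# SM120 to dodge the num_stages assertion during GPT-OSS startup.
if torch.cuda.get_device_capability() == (12, 0):
    opt_flags.update_opt_flags_constraints({"block_k": 128})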

The SM120 num_warps patch is ugly, but it is the only practical way to restore SM120 performance from within SGLang right now: triton_kernels does not expose num_warps as an opt-flags constraint, so the patch narrowly adjusts the heuristic only for the SM120 non-persistent StridedLayout MXFP4 path.
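
A hedged sketch of that shape, not the exact diff: wrap the heuristic in triton_kernels' opt-flags module and re-impose the 4-warp floor on the narrow path above. _num_warps_heuristic is a stand-in name; the real patch in mxfp4.py targets the actual function and also checks that the operands are the SM120 StridedLayout MXFP4 case:

import functools
from triton_kernels.matmul_ogs_details import opt_flags

def restore_sm120_warp_floor():
    # Stand-in attribute name; see the lead-in above.
    orig = opt_flags._num_warps_heuristic

    @functools.wraps(orig)
    def patched(block_m, block_n, is_persistent, *args, **kwargs):
        num_warps = orig(block_m, block_n, is_persistent, *args, **kwargs)
        # Triton 3.6.0 dropped the floor to 1 for non-persistent kernels;
        # restore the 3.5.1 floor of 4 on this path only.
        return num_warps if is_persistent else max(num_warps, 4)

    opt_flags._num_warps_heuristic = patched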

Accuracy Tests

H200 (MXFP4):

python -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 256 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_212928.html
{'chars': np.float64(52.16729797979798), 'chars:std': np.float64(218.80938828184415), 'score': np.float64(0.5744949494949495), 'score:std': np.float64(0.494419358945162)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_212928.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_212928_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260502_212928', 'metric': 0.5744949494949495}]

GB300 (MXFP4):

python -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_214250.html
{'chars': np.float64(52.530934343434346), 'chars:std': np.float64(200.0052121862134), 'score': np.float64(0.5549242424242424), 'score:std': np.float64(0.4969741719587881)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_214250.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_214250_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260502_214250', 'metric': 0.5549242424242424}]

GB300 (BF16):

python -m gpt_oss.evals --model lmsys/gpt-oss-20b-bf16 --eval gpqa --n-threads 512 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425.html
{'chars': np.float64(50.92550505050505), 'chars:std': np.float64(214.67684371696978), 'score': np.float64(0.5568181818181818), 'score:std': np.float64(0.4967612044180544)}
Writing results to /tmp/gpqa_lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425.json
Writing all results to /tmp/gpqa_lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lmsys__gpt-oss-20b-bf16-low_temp1.0_20260502_214425', 'metric': 0.5568181818181818}]

RTX PRO 6000 (MXFP4):

python -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 256 --reasoning-effort low --base-url http://127.0.0.1:30000/v1

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_223913.html
{'chars': np.float64(52.41856060606061), 'chars:std': np.float64(206.45098944881033), 'score': np.float64(0.553030303030303), 'score:std': np.float64(0.4971798336221153)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_223913.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20260502_223913_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260502_223913', 'metric': 0.553030303030303}]

gemini-code-assist (Bot) left a comment

Code Review

This pull request updates the Triton dependency to version 3.6.0 and adapts the MoE and quantization layers accordingly. Key changes include a local implementation of the routing function in topk.py, which introduces a regression by disabling simulated expert parallelism, and a monkey-patching utility in mxfp4.py that enforces a minimum warp count for specific matmuls on SM120 hardware. The is_triton_kernels_available check was also expanded to include ragged tensor metadata, and FusedActivation calls were updated to pass explicit reduction parameters. One piece of feedback notes the loss of functionality for simulated expert parallelism.

Comment thread: python/sglang/srt/layers/moe/topk.py
mmangkad (Contributor, Author) commented May 2, 2026

/rerun-failed-ci

@github-actions github-actions Bot added the run-ci label May 2, 2026
johnnynunez (Contributor) commented May 4, 2026

Could you align with the new 3.7.0? It fixes behavior on AGX Thor and DGX Spark:
#24351

tbraun96 commented
Disclosure: Atlas maintainer. We carry a patch for roughly this same gap; dropping it here in case it shortens the review.

The Triton MXFP4 path on sm_121 (GB10) is unsalvageable for the same reason it's broken on RTX 50-series consumer Blackwell: the kernel emits .tile::scatter4 PTX, which is SM100/Hopper-only. SGLang's Triton 3.6.0 alignment fixes the perf regression where the path lands at all, but on sm_121 you'll still hit the codegen wall.

What worked for us was bypassing the PTX path entirely with a software E2M1 conversion. We patch FlashInfer's CUTLASS headers at container build time:

docker/gb10/fix_flashinfer_e2m1_sm121.py

The patch is roughly 30 lines: it disables CUDA_PTX_FP4FP6_CVT_ENABLED for SM121 in float_subbyte.h and adds a __float_as_uint-based fallback. It applies cleanly to FlashInfer mainline at the time of writing. The bench numbers we got with Qwen3.6-35B-A3B-NVFP4 on a single Spark: 214.6 tok/s decode at c=1 with MTP K=2, with a sparkrun-benchmark provenance bundle attached on this PR for verification:

Avarok-Cybersecurity/atlas-recipes#2

For the SGLang-internal route specifically, the equivalent NVFP4-via-Marlin-W4A16 fallback is probably the path of least resistance, since you avoid the FlashInfer dependency in the kernel build. Either way, happy to land the FlashInfer patch as a PR if it'd help reviewers compare approaches.
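
For reference, a minimal sketch of the build-time patch script described above, assuming the guard appears as a single #if defined(...) line in float_subbyte.h; the header path and match string are assumptions, and the real ~30-line patch also injects the __float_as_uint-based E2M1 fallback:

from pathlib import Path

# Header path inside the FlashInfer source tree is an assumption.
hdr = Path("include/cutlass/float_subbyte.h")
src = hdr.read_text()

# Narrow the PTX cvt guard so sm_121 (__CUDA_ARCH__ == 1210) falls through
# to the software E2M1 conversion path instead of the PTX instructions.
src = src.replace(
    "#if defined(CUDA_PTX_FP4FP6_CVT_ENABLED)",
    "#if defined(CUDA_PTX_FP4FP6_CVT_ENABLED) && (__CUDA_ARCH__ != 1210)",
)
hdr.write_text(src)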

b8zhong (Collaborator) commented May 14, 2026

@mmangkad Can you please fix the conflicts

Fridge003 (Collaborator) commented
To be included in #25312

@Fridge003 Fridge003 closed this May 15, 2026
@mmangkad mmangkad deleted the update-triton-kernels-3.6.0 branch May 15, 2026 16:07