Optimize Gemma4 H200 MoE and extend attention by BBuf · Pull Request #26588 · sgl-project/sglang

BBuf · 2026-05-28T14:20:34Z

Summary

Add H200 Triton fused MoE configs for Gemma4 E=128,N=704 normal and down projections.
Tune Hopper extend-attention block sizes for Lq=129..256 to reduce TTFT/TPOT on Gemma4 prefill-heavy serving.
Add a small-batch (M ≤ 256) per-head Triton kernel for Gemma4 QKV RMSNorm to lower MTP draft latency.
~~Add a Gemma4-specific Triton routing kernel that fuses top-k selection, top-k softmax, and per-expert scale into gemma4_topk_softmax_scale.~~ Reverted (commit 27cb94c45) due to BF16 precision regression on MoE routing — see "Reverted optimizations" below.
~~Replace Gemma4Router.norm(x) * fused_scale with a single fused rmsnorm(x, fused_scale, eps) call on the CUDA fast path.~~ Reverted (commit 0d3c9c694) for the same reason.
Keep the optimization scoped to Gemma4/Hopper paths so existing model families keep their current behavior.

Performance

Re-measured on ion8-h200, container sglang_bbuf, single NVIDIA H200, TP=1, google/gemma-4-26B-A4B-it, BF16, --attention-backend triton, random 1024/256, request rate 8, max concurrency 64, 80 prompts. Baseline = latest origin/main (376635c1e); patched = PR head (27cb94c45, with c2 `03826cdd9` and c3 `d72d246a3` reverted).

| Workload | Build | Output tok/s | p50 TTFT (ms) | p50 TPOT (ms) | Median E2E (ms) | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Random `1024/256`, 80 prompts | SGLang baseline (`376635c1e`) | 1670.72 | 96.56 | 20.30 | 5295.91 | Baseline |
| Random `1024/256`, 80 prompts | SGLang patched (`27cb94c45`) | 1725.50 | 74.04 | 16.62 | 4338.60 | +3.28% output tok/s, -23.3% TTFT, -18.2% TPOT, -18.1% E2E |
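
As a quick cross-check, the relative deltas in the Result column can be recomputed from the rounded values in the table. This is only a sketch: the dictionary keys are illustrative names, and because the inputs are already rounded to two decimals, the last digit can differ slightly from the reported figures (for example TPOT).

```python
def pct_change(baseline: float, patched: float) -> float:
    """Relative change of patched vs. baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


# Values copied from the table above (key names are illustrative).
baseline = {
    "output_tok_s": 1670.72,
    "p50_ttft_ms": 96.56,
    "p50_tpot_ms": 20.30,
    "median_e2e_ms": 5295.91,
}
patched = {
    "output_tok_s": 1725.50,
    "p50_ttft_ms": 74.04,
    "p50_tpot_ms": 16.62,
    "median_e2e_ms": 4338.60,
}

deltas = {name: pct_change(baseline[name], patched[name]) for name in baseline}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```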

Full benchmark details:

| Build | Completed | Duration (s) | Request throughput (req/s) | Total tok/s | Median E2E (ms) |
| --- | --- | --- | --- | --- | --- |
| SGLang baseline (`376635c1e`) | 80 | 12.26 | 6.526 | 8353.59 | 5295.91 |
| SGLang patched (`27cb94c45`) | 80 | 11.87 | 6.740 | 8627.50 | 4338.60 |

The remaining speedup comes from the H200 MoE Triton configs + Hopper extend-attention block-size tuning + small-batch QKV RMSNorm; the heavier routing-fusion optimizations were dropped to keep accuracy on the MTP path.

Accuracy

Validated on ion8-h200 via the registered Gemma4 MTP CI test (test_gemma4_mtp_26b_a4b_extra::test_gsm8k_mtp, TP=2, --enable-deterministic-inference, NEXTN spec decode, 200 GSM8K examples, 5-shot). With the two commits reverted, the GSM8K MTP score is back to baseline:

| Build (TP=2, deterministic) | GSM8K MTP topk=1 | Avg accept length | Result (threshold 0.41) |
| --- | --- | --- | --- |
| SGLang baseline (`376635c1e`, origin/main, equivalent to reverting all 4 commits) | 0.445 | 4.494 | ✓ pass |
| SGLang patched (`27cb94c45`, c1+c4 kept, c2+c3 reverted) | 0.445 | 4.494 | ✓ pass |
| ~~PR head before revert (`10ab189e3`, all 4 commits)~~ | ~~0.360~~ | ~~4.475~~ | ✗ fail (5pp below threshold) |

Standalone GSM8K (TP=1, no MTP, 8-shot, 200 questions, no deterministic mode) shows the typical run-to-run variance (~5pp swing) but no systematic regression — patched and baseline land within noise of each other (patched: 0.41 / 0.495 across two runs; baseline: 0.465 / 0.445).

Reverted optimizations

Bisecting test_gemma4_mtp_26b_a4b_extra on ion8-h200 (TP=2, deterministic):

| Configuration | GSM8K topk=1 | Δ vs PR head |
| --- | --- | --- |
| PR head (all 4 commits) | 0.360 | 0 |
| Revert only `fabb5b5ee` (QKV RMSNorm small batch) | 0.360 | 0 (innocuous) |
| Revert only `d72d246a3` (Router RMSNorm fuse) | 0.395 | +0.035 |
| Revert only `03826cdd9` (MoE topk fuse) | 0.380 | +0.020 |
| Revert only `3398cb7af` (extend-attn + MoE configs) | 0.360 | 0 (innocuous) |
| Revert `d72d246a3` + `03826cdd9` together | 0.445 | +0.085 (pass) |
| Revert all 4 (= origin/main) | 0.445 | identical to c2+c3 revert |
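
The bisect numbers can be read as "score recovered relative to PR head". The short sketch below, using only the scores from the table (the dictionary labels are illustrative), shows that the two single-commit recoveries do not add up to the joint recovery, which is consistent with the two routing changes interacting rather than contributing independently.

```python
PR_HEAD_SCORE = 0.360  # GSM8K topk=1 with all 4 commits applied

# Scores from the bisect table above.
scores = {
    "revert d72d246a3 only": 0.395,
    "revert 03826cdd9 only": 0.380,
    "revert both": 0.445,
}

# Score recovered by each revert, relative to PR head.
recovered = {name: round(score - PR_HEAD_SCORE, 3) for name, score in scores.items()}

sum_of_singles = round(
    recovered["revert d72d246a3 only"] + recovered["revert 03826cdd9 only"], 3
)
joint = recovered["revert both"]

print(recovered)
print(f"sum of single reverts: {sum_of_singles}, joint revert: {joint}")
```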

Both reverted commits modify the Gemma4 MoE routing path in gemma4_causal.py:

d72d246a3: rmsnorm(x, fused_scale, eps) fast path in Gemma4Router.forward. The math is equivalent to (self.norm(x) with weight=1) * fused_scale, but BF16 accumulation order in the fused kernel diverges from the two-step path when fused_scale ≈ hidden_size**-0.5 ≈ 0.022.
03826cdd9: _gemma4_topk_softmax_scale_kernel doing topk + stable-softmax + per-expert scale in one pass. Equivalent to softmax(topk_logits) * scale[topk_ids] for topk ≤ 8, but its FP32-internal exp(top_logit - top1) / sum_top_exp ordering doesn't match torch.nn.functional.softmax bit-for-bit.
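
To illustrate why a mathematically equivalent fusion can still change BF16 results, here is a minimal pure-Python sketch. It does not reproduce the actual kernels: `to_bf16` is a hypothetical helper that emulates bfloat16 rounding, `HIDDEN_SIZE = 2048` and `EPS` are assumptions chosen so that the scale is close to 0.022, and the only effect modeled is that the two-step path rounds the normalized value to BF16 before scaling while the fused path rounds once.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Emulate bfloat16: round to float32, then round-to-nearest-even to 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


HIDDEN_SIZE = 2048  # assumption, gives a scale of about 0.022
EPS = 1e-6          # assumption
fused_scale = HIDDEN_SIZE ** -0.5

rng = random.Random(0)
x = [to_bf16(rng.gauss(0.0, 1.0)) for _ in range(HIDDEN_SIZE)]
inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + EPS)

# Two-step path: normalize, round to BF16, then scale and round again.
two_step = [to_bf16(to_bf16(v * inv_rms) * fused_scale) for v in x]
# Fused path: normalize and scale in higher precision, round once.
fused = [to_bf16(v * inv_rms * fused_scale) for v in x]

mismatches = sum(a != b for a, b in zip(two_step, fused))
max_rel_diff = max(
    abs(a - b) / abs(b) for a, b in zip(two_step, fused) if b != 0.0
)
print(f"{mismatches} of {HIDDEN_SIZE} elements differ, max rel diff {max_rel_diff:.4f}")
```

Each individual difference is at most about one BF16 ulp, but router logits feed a top-k selection, so small shifts near a decision boundary can change which experts are chosen.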

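Reverting either commit alone recovers only 0.020 to 0.035 of GSM8K accuracy (0.380 to 0.395, still below the 0.41 threshold); reverting both restores the full 0.085. With both fusions active, the routing logits drift enough to flip expert selections, which widens the MTP draft/target distribution gap. MTP avg_accept_length still looks healthy at 4.48, which masks the underlying quality regression.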

If you want to recover these speedups, the kernels need to be made bit-equivalent to the eager paths — e.g. accept weight=ones in the fused router RMSNorm and apply fused_scale as a separate elementwise op, and have _gemma4_topk_softmax_scale_kernel mirror PyTorch's exact softmax floating-point order (compute max, subtract, exp, sum, divide — all in fp32 with the same reduction tree).
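
For reference, the eager formula that a fused kernel would need to match, softmax(topk_logits) * scale[topk_ids], can be written in plain Python following the step order named above (max, subtract, exp, sum, divide). This is only a sketch of the math: the function name and the tie-breaking rule (lower index first) are assumptions, and it says nothing about the reduction tree or precision used by PyTorch or Triton.

```python
import math


def topk_softmax_scale(logits, scale, k):
    """Return (ids, weights) for softmax over the top-k logits times per-expert scale."""
    # Select the k largest logits; ties go to the lower index (an assumption).
    ids = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    top = [logits[i] for i in ids]

    # Numerically stable softmax: max, subtract, exp, sum, divide.
    top_max = max(top)
    exps = [math.exp(v - top_max) for v in top]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Apply the per-expert scale after the softmax.
    weights = [p * scale[i] for p, i in zip(probs, ids)]
    return ids, weights


ids, weights = topk_softmax_scale(
    logits=[0.1, 2.0, -1.0, 2.0, 0.5],
    scale=[1.0, 2.0, 1.0, 4.0, 1.0],
    k=2,
)
print(ids, weights)
```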

Validation

Bisect, perf re-measurement, and Gemma4 MTP accuracy validation all completed on ion8-h200, container sglang_bbuf.
Python syntax check passed for the changed Python files.
New H200 Triton config JSON files (E=128,N=704) parse successfully.
Extend-attention block-size sanity checked for Lq=128,192,256,288.
Server benchmark + Gemma4 MTP CI test (TP=2, deterministic, 200-question GSM8K) confirm patched ≡ baseline accuracy (both 0.445).

CI States

Latest PR Test (Base): ❌ Run #26700275243
Latest PR Test (Extra): ✅ Run #26700275191

pyc96 · 2026-05-28T21:44:55Z

@BBuf Could you please take a look at #26502 for the fused router and #25461 for the norm? We can potentially merge the efforts.

I just did a quick check myself: #26502 seems better due to lower top-k reduce complexity. #25461 seems complementary to your changes.

BBuf · 2026-05-30T01:59:59Z

> @BBuf Could you please take a look at #26502 for the fused router and #25461 for the norm? We can potentially merge the efforts.
>
> I just did a quick check myself: #26502 seems better due to lower top-k reduce complexity. #25461 seems complementary to your changes.

Ok, I'll review it.

BBuf added 5 commits May 27, 2026 11:14

3398cb7: Optimize Gemma4 H200 MoE and extend attention
03826cd: Optimize Gemma4 MoE routing topk
d72d246: Fuse Gemma4 router RMSNorm scale
fabb5b5: Optimize Gemma4 QKV RMSNorm for small batches
256e1d6: Merge remote-tracking branch 'bbuf/main' into codex/gemma4-h200-moe-attn

BBuf requested review from Edwardf0t1, Fridge003, HaiShaw, Qiaolin-Yu, Ying1123, ch-wan, hebiao064, ispobock, kpham-sgl and merrymercy as code owners May 28, 2026 14:20

kpham-sgl mentioned this pull request May 28, 2026

[Roadmap] Gemma4 #26596

Open

7 tasks

67f9d13: Merge remote-tracking branch 'origin/main' into codex/gemma4-h200-moe-attn

github-actions Bot added the run-ci label May 30, 2026

github-actions Bot added the run-ci-extra label May 30, 2026
