Optimize Gemma4 H200 MoE and extend attention #26588

Open
BBuf wants to merge 9 commits into sgl-project:main from BBuf:codex/gemma4-h200-moe-attn

Collaborator

@BBuf BBuf commented May 28, 2026

Summary

  • Add H200 Triton fused MoE configs for Gemma4 E=128,N=704 normal and down projections.
  • Tune Hopper extend-attention block sizes for Lq=129..256 to reduce TTFT/TPOT on Gemma4 prefill-heavy serving.
  • Add a small-batch (M ≤ 256) per-head Triton kernel for Gemma4 QKV RMSNorm to lower MTP draft latency.
  • Add a Gemma4-specific Triton routing kernel that fuses top-k selection, top-k softmax, and per-expert scale into gemma4_topk_softmax_scale. Reverted (commit 27cb94c45) due to BF16 precision regression on MoE routing — see "Reverted optimizations" below.
  • Replace Gemma4Router.norm(x) * fused_scale with a single fused rmsnorm(x, fused_scale, eps) call on the CUDA fast path. Reverted (commit 0d3c9c694) for the same reason.
  • Keep the optimization scoped to Gemma4/Hopper paths so existing model families keep their current behavior.
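For readers unfamiliar with per-head QKV RMSNorm, here is a minimal pure-Python reference of what such a kernel computes, plus the small-batch dispatch condition from the summary. It is only a sketch: the function names, the nested-list layout, the plain multiplicative `weight`, and the `eps` value are assumptions made for illustration, not code from this PR.

```python
import math

SMALL_BATCH_MAX_M = 256  # threshold quoted in the summary (M <= 256)


def per_head_rmsnorm(x, weight, eps=1e-6):
    """Reference per-head RMSNorm.

    x: nested lists shaped [M][num_heads][head_dim]
    weight: list of length head_dim (assumed plain multiplicative weight)
    """
    out = []
    for token in x:
        heads = []
        for head in token:
            mean_sq = sum(v * v for v in head) / len(head)
            inv_rms = 1.0 / math.sqrt(mean_sq + eps)
            heads.append([v * inv_rms * w for v, w in zip(head, weight)])
        out.append(heads)
    return out


def use_small_batch_kernel(m):
    """Hypothetical dispatch predicate: take the small-batch path only for M <= 256."""
    return m <= SMALL_BATCH_MAX_M


# One token, two heads, head_dim = 2.
y = per_head_rmsnorm([[[3.0, 4.0], [1.0, 1.0]]], weight=[1.0, 1.0])
print(y, use_small_batch_kernel(64), use_small_batch_kernel(512))
```

Each head is normalized independently over its own `head_dim` values, which is why a small-M launch can assign one program per (token, head) pair.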

Performance

Re-measured on ion8-h200, sglang_bbuf, single NVIDIA H200, TP=1, google/gemma-4-26B-A4B-it, BF16, --attention-backend triton, random 1024/256, request rate 8, max concurrency 64, 80 prompts. Baseline = latest origin/main (376635c1e); patched = PR head (27cb94c45, with c2/c3 reverted).

| Workload | Build | Output tok/s | p50 TTFT (ms) | p50 TPOT (ms) | Median E2E (ms) | Result |
|---|---|---|---|---|---|---|
| Random 1024/256, 80 prompts | SGLang baseline (376635c1e) | 1670.72 | 96.56 | 20.30 | 5295.91 | Baseline |
| Random 1024/256, 80 prompts | SGLang patched (27cb94c45) | 1725.50 | 74.04 | 16.62 | 4338.60 | +3.28% output tok/s, -23.3% TTFT, -18.2% TPOT, -18.1% E2E |
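The relative changes in the Result column can be recomputed from the rounded table values. The snippet below is a small sketch for that check; the dictionary keys are names chosen here. Recomputing from the rounded figures gives about -18.1% for TPOT rather than the quoted -18.2%, which is presumably a rounding difference in the underlying measurements.

```python
baseline = {"output_tok_s": 1670.72, "ttft_ms": 96.56, "tpot_ms": 20.30, "e2e_ms": 5295.91}
patched = {"output_tok_s": 1725.50, "ttft_ms": 74.04, "tpot_ms": 16.62, "e2e_ms": 4338.60}


def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


deltas = {k: round(pct_change(patched[k], baseline[k]), 1) for k in baseline}
for name, value in deltas.items():
    print(f"{name}: {value:+.1f}%")
```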

Full benchmark details:

| Build | Completed | Duration (s) | Request throughput (req/s) | Total tok/s | Median E2E (ms) |
|---|---|---|---|---|---|
| SGLang baseline (376635c1e) | 80 | 12.26 | 6.526 | 8353.59 | 5295.91 |
| SGLang patched (27cb94c45) | 80 | 11.87 | 6.740 | 8627.50 | 4338.60 |
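As a consistency check on these numbers, request throughput should equal completed requests divided by duration, and with a 1024-token input and 256-token output the total token rate should be about (1024 + 256) / 256 = 5 times the output token rate. A short sketch, using only values from the two tables (the variable names are chosen here):

```python
import math

runs = {
    "baseline": {"completed": 80, "duration_s": 12.26, "req_s": 6.526,
                 "total_tok_s": 8353.59, "output_tok_s": 1670.72},
    "patched": {"completed": 80, "duration_s": 11.87, "req_s": 6.740,
                "total_tok_s": 8627.50, "output_tok_s": 1725.50},
}

expected_ratio = (1024 + 256) / 256  # total tokens per output token

for name, r in runs.items():
    req_s = r["completed"] / r["duration_s"]
    ratio = r["total_tok_s"] / r["output_tok_s"]
    print(f"{name}: recomputed req/s = {req_s:.3f}, total/output = {ratio:.4f}")
```

Small differences in the last digit of req/s are expected because the duration is itself rounded to two decimals.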

The remaining speedup comes from the H200 MoE Triton configs + Hopper extend-attention block-size tuning + small-batch QKV RMSNorm; the heavier routing-fusion optimizations were dropped to keep accuracy on the MTP path.

Accuracy

Validated on ion8-h200 via the registered Gemma4 MTP CI test (test_gemma4_mtp_26b_a4b_extra::test_gsm8k_mtp, TP=2, --enable-deterministic-inference, NEXTN spec decode, 200 GSM8K examples, 5-shot). With the two reverted commits, GSM8K MTP score is back to baseline:

| Build (TP=2, deterministic) | GSM8K MTP topk=1 | Avg accept length | Threshold 0.41 |
|---|---|---|---|
| SGLang baseline (376635c1e, all reverts = main) | 0.445 | 4.494 | ✓ pass |
| SGLang patched (27cb94c45, c1+c4 kept, c2+c3 reverted) | 0.445 | 4.494 | ✓ pass |
| PR head before revert (10ab189e3, all 4 commits) | 0.360 | 4.475 | ✗ fail (-5pp vs threshold) |

Standalone GSM8K (TP=1, no MTP, 8-shot, 200 questions, no deterministic mode) shows the typical run-to-run variance (~5pp swing) but no systematic regression — patched and baseline land within noise of each other (patched: 0.41 / 0.495 across two runs; baseline: 0.465 / 0.445).

Reverted optimizations

Bisecting test_gemma4_mtp_26b_a4b_extra on ion8-h200 (TP=2, deterministic):

| Configuration | GSM8K topk=1 | Δ vs PR head |
|---|---|---|
| PR head (all 4 commits) | 0.360 | 0 |
| Revert only fabb5b5ee (QKV RMSNorm small batch) | 0.360 | 0 (innocuous) |
| Revert only d72d246a3 (Router RMSNorm fuse) | 0.395 | +0.035 |
| Revert only 03826cdd9 (MoE topk fuse) | 0.380 | +0.020 |
| Revert only 3398cb7af (extend-attn + MoE configs) | 0.360 | 0 (innocuous) |
| Revert d72d246a3 + 03826cdd9 together | 0.445 | +0.085 (pass) |
| Revert all 4 (= origin/main) | 0.445 | identical to c2+c3 revert |
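The bisect numbers can be read two ways: how much each revert recovers relative to the PR head, and how far each configuration still sits below the 0.445 baseline. The sketch below recomputes both from the table; the dictionary labels are shorthand introduced here.

```python
PR_HEAD = 0.360
BASELINE = 0.445

# GSM8K topk=1 score after each revert, taken from the bisect table.
after_revert = {
    "router_rmsnorm_fuse (d72d246a3)": 0.395,
    "moe_topk_fuse (03826cdd9)": 0.380,
    "both": 0.445,
}

recovered = {k: round(v - PR_HEAD, 3) for k, v in after_revert.items()}
still_missing = {k: round(BASELINE - v, 3) for k, v in after_revert.items()}

sum_of_single_reverts = round(
    recovered["router_rmsnorm_fuse (d72d246a3)"] + recovered["moe_topk_fuse (03826cdd9)"], 3
)
print(recovered)
print(still_missing)
print("single reverts sum to", sum_of_single_reverts, "vs", recovered["both"], "for both")
```

The two single reverts add up to less than the joint revert, so the effect of the two routing changes is not additive on this test.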

Both reverted commits modify the Gemma4 MoE routing path in gemma4_causal.py:

  • d72d246a3: rmsnorm(x, fused_scale, eps) fast path in Gemma4Router.forward. The math is equivalent to (self.norm(x) with weight=1) * fused_scale, but BF16 accumulation order in the fused kernel diverges from the two-step path when fused_scale ≈ hidden_size**-0.5 ≈ 0.022.
  • 03826cdd9: _gemma4_topk_softmax_scale_kernel doing topk + stable-softmax + per-expert scale in one pass. Equivalent to softmax(topk_logits) * scale[topk_ids] for topk ≤ 8, but its FP32-internal exp(top_logit - top1) / sum_top_exp ordering doesn't match torch.nn.functional.softmax bit-for-bit.
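To illustrate how a fused path and a two-step path can be mathematically equivalent yet differ in BF16, here is a minimal sketch of one plausible mechanism: the two-step path rounds the normalized value to BF16 before scaling, while a fused path rounds only once at the end. This is not the actual kernel code. `to_bf16` is a helper written here that emulates round-to-nearest-even BF16 rounding with the standard library, and the input values are chosen for easy hand-checking rather than taken from Gemma4 (the real `fused_scale` is about 0.022).

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); returns a Python float.

    Ignores NaN/Inf handling, which is not needed for this illustration.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def two_step(normed: float, scale: float) -> float:
    """Eager-style path: store the norm output in BF16, then scale and store again."""
    return to_bf16(to_bf16(normed) * scale)


def fused(normed: float, scale: float) -> float:
    """Fused-style path: scale at higher precision, round to BF16 once."""
    return to_bf16(normed * scale)


normed = 1 + 2**-8 + 2**-12  # just above a BF16 rounding midpoint
scale = 1.5 + 2**-7          # exactly representable in BF16

print(two_step(normed, scale), fused(normed, scale))
```

For these inputs the two paths land on adjacent BF16 values (1.5234375 and 1.515625), a one-ULP difference. A difference of that size in a routing logit is enough to change which expert wins a near-tie.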

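Reverting either commit alone recovers only about 0.020 to 0.035 GSM8K: with just one of the two still in place, the score stays 0.050 to 0.065 below the 0.445 baseline, and with both in place the drop is 0.085. Together, the routing logits drift enough to flip enough expert selections that the MTP draft/target distribution gap widens. MTP avg accept length still looks healthy at 4.475, which masks the underlying quality regression.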

If you want to recover these speedups, the kernels need to be made bit-equivalent to the eager paths — e.g. accept weight=ones in the fused router RMSNorm and apply fused_scale as a separate elementwise op, and have _gemma4_topk_softmax_scale_kernel mirror PyTorch's exact softmax floating-point order (compute max, subtract, exp, sum, divide — all in fp32 with the same reduction tree).
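As a point of reference for the operation order described above (max, subtract, exp, sum, divide, then per-expert scale), here is a pure-Python sketch of `softmax(topk_logits) * scale[topk_ids]`. The function name and the tie-breaking rule (lower index wins) are assumptions for illustration. Python floats are float64, so this shows the order of operations only; it does not reproduce fp32 reduction trees or PyTorch's bit-level results.

```python
import math


def topk_softmax_scale(logits, per_expert_scale, k):
    """Reference top-k routing: select k experts, softmax over them, apply per-expert scale."""
    # 1. top-k selection (ties broken by lower expert index, an assumption)
    ids = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    top = [logits[i] for i in ids]
    # 2. stable softmax: max, subtract, exp, sum, divide
    top_max = max(top)
    exps = [math.exp(v - top_max) for v in top]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    # 3. per-expert scale applied after the softmax
    weights = [p * per_expert_scale[i] for p, i in zip(probs, ids)]
    return ids, weights


ids, weights = topk_softmax_scale(
    logits=[0.0, 2.0, 1.0, 3.0],
    per_expert_scale=[1.0, 2.0, 1.0, 1.0],
    k=2,
)
print(ids, weights)
```

A fused kernel would need to match this sequence step by step, including the order in which the `sum` is reduced, to be bit-equivalent to the eager path.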

Validation

  • Bisect, perf re-measurement, and Gemma4 MTP accuracy validation all completed on ion8-h200, container sglang_bbuf.
  • Python syntax check passed for the changed Python files.
  • New H200 Triton config JSON files (E=128,N=704) parse successfully.
  • Extend-attention block-size sanity checked for Lq=128,192,256,288.
  • Server benchmark + Gemma4 MTP CI test (TP=2, deterministic, 200-question GSM8K) confirm patched ≡ baseline accuracy (both 0.445).
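The JSON parse check mentioned above can be sketched as follows. This assumes the layout commonly used for Triton fused-MoE config files, where each top-level key is a batch size M mapping to tile parameters. The values below are placeholders and not the tuned numbers from this PR, and the nearest-key lookup in `pick_config` is an assumption about how a config is selected at runtime.

```python
import json

# Hypothetical example content; real files would hold the tuned H200 values.
EXAMPLE_CONFIG_JSON = """
{
  "1":   {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
  "64":  {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
          "GROUP_SIZE_M": 8,  "num_warps": 4, "num_stages": 4},
  "256": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
          "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4}
}
"""

REQUIRED_KEYS = {
    "BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K",
    "GROUP_SIZE_M", "num_warps", "num_stages",
}


def load_moe_config(text):
    """Parse a config and verify every batch-size entry has the expected fields."""
    configs = {}
    for m, cfg in json.loads(text).items():
        missing = REQUIRED_KEYS - cfg.keys()
        if missing:
            raise ValueError(f"M={m} is missing {sorted(missing)}")
        configs[int(m)] = cfg
    return configs


def pick_config(configs, m):
    """Pick the entry whose batch-size key is closest to m (assumed selection rule)."""
    return configs[min(configs, key=lambda key: abs(key - m))]


configs = load_moe_config(EXAMPLE_CONFIG_JSON)
print(sorted(configs), pick_config(configs, 80))
```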

CI States

Latest PR Test (Base): ❌ Run #26700275243
Latest PR Test (Extra): ✅ Run #26700275191

@kpham-sgl kpham-sgl mentioned this pull request May 28, 2026
7 tasks
@pyc96
Collaborator

pyc96 commented May 28, 2026

@BBuf Could you please take a look at #26502 for the fused router and #25461 for the norm? We can potentially merge the efforts.

I had a quick look myself: #26502 seems better because its top-k reduce is less complex, and #25461 seems complementary to your changes.

@BBuf
Collaborator Author

BBuf commented May 30, 2026

Ok, I'll review it.

@BBuf
Collaborator Author

BBuf commented May 30, 2026

/tag-and-rerun-ci

@BBuf
Collaborator Author

BBuf commented May 30, 2026

/rerun-failed-ci

@BBuf
Collaborator Author

BBuf commented May 30, 2026

/tag-and-rerun-ci extra
