perf(gemma4): close triton-attn TPOT gap (fused PLE tail + piecewise CG opt-in) by pyc96 · Pull Request #16 · pyc96/sglang

pyc96 · 2026-05-24T15:44:11Z

Summary

Closes the SGLang ↔ vLLM TPOT gap with triton attention on Gemma4 by porting the highest-impact Inductor fusion pattern from a captured vllm/vllm-openai:nightly Inductor run, plus enabling piecewise CUDA graph for MM models. SGLang now beats vLLM on every workload tested, on both duration AND TPOT.

This PR has TWO commits:

perf(gemma4): close triton-attn TPOT gap (fused PLE tail + piecewise CG opt-in) — initial PR body (Fused PLE-tail kernels + Optional KV in unified_attention_with_output + piecewise CG opt-in via env var for MM models).
perf(gemma4): port vLLM Inductor's triple-rmsnorm fusion (post-attn pre-MoE) — new commit. Ports vLLM Inductor's triton_red_fused_add_moe_forward_mul_rms_norm_0 into a hand-rolled SGLang Triton kernel.

How the Inductor fusion was identified

Captured vLLM's torch.compile output by launching with:

TORCH_COMPILE_DEBUG=1 TORCH_COMPILE_DEBUG_DIR=/cache/torch_compile_debug \
TORCH_LOGS=output_code \
vllm serve google/Gemma-4-26B-A4B-IT --dtype bfloat16 --max-num-seqs 64

torchinductor/model__1_inference_1.1/output_code.py revealed Inductor compiles each Gemma4 decoder layer into just 5 Triton kernels + 4 external GEMMs, vs SGLang's ~12 launches. Detailed mapping in runs/20260524_vllm_inductor_inspect/analysis/fusion_catalog.md.

The dominant missed-fusion is triton_red_fused_add_moe_forward_mul_rms_norm_0 — the entire post-attention-pre-MoE block (4 SGLang launches collapsed into 1). The crucial Inductor insight: the variance for the post-attention residual is shared by 3 downstream RMSNorms (router input, dense MLP input, MoE input), so the kernel walks the row twice for reductions plus once for production and emits all three outputs from a single launch.

New fused kernel (commit 2)

python/sglang/srt/layers/gemma4_fused_ops.py::gemma_post_attn_triple_rmsnorm:

step	what it does
pass 1	`var(attn_out)` → first rsqrt
pass 2	build `post_attn_res = rmsnorm(attn_out, w_post_attn) + residual`; compute `var(post_attn_res)` → second rsqrt (shared by next 3 outputs)
pass 3	emit `router_in = base * router_fused_scale`, `dense_ff_in = base * pre_ff_w`, `moe_in = base * pre_ff_2_w`

Wired into Gemma4DecoderLayer.forward MoE branch behind an eligibility check (MoE active, bf16 2D contiguous, Gemma4Router with with_scale=False). Falls back to the 4-launch sequence otherwise.

Test plan

test/srt/layers/test_gemma4_ple_fused_ops.py — 14 CUDA tests (10 from prior commits + 4 new). All pass at bf16. Reference: eager-PyTorch.

End-to-end benchmarks (1× B200, vLLM nightly comparator)

SGLANG_ENABLE_PIECEWISE_CUDA_GRAPH_FOR_MM=1 to enable PR #16's piecewise CG opt-in. Triton attention backend.

`google/Gemma-4-26B-A4B-IT`

Workload	metric	SGLang baseline	vLLM nightly	SGLang patched (this PR)	gap closed
A (3000/100)	duration	1.475s	1.635s	1.376s	-15.8% vs vLLM
A	TPOT	10.97ms	9.99ms	9.51ms	-4.8% vs vLLM
A	tok/s	63,325	59,028	67,905	+15.0% vs vLLM
B (500/500)	duration	5.49s	6.19s	5.27s	-14.9% vs vLLM
B	TPOT	10.54ms	12.02ms	10.17ms	-15.4% vs vLLM
C (100/1000)	duration	8.86s	8.96s	8.51s	-5.0% vs vLLM
C	TPOT	8.73ms	8.86ms	8.45ms	-4.6% vs vLLM

Quality

30-prompt color-naming benchmark, temperature=0:

framework	model	accuracy	char-match vs baseline
SGLang baseline	Gemma-4-26B-A4B-IT	30/30 (100%)	(reference)
SGLang patched	Gemma-4-26B-A4B-IT	30/30 (100%)	29/30 (1 minor numerical noise from PCG capture)
vLLM nightly	Gemma-4-26B-A4B-IT	30/30 (100%)	(different framework)

Why hand-port instead of just enabling torch.compile?

Two reasons:

SGLang's MultiPlatformOp indirection is opaque to Dynamo. --enable-torch-compile ran on E2B (5 min capture autotune), produced 0% TPOT change because Inductor can't see through MultiPlatformOp to fuse the kernels. Hand-porting the pattern bypasses that.
No compile cost. Triton kernels JIT once per shape bucket and cache. No 5-minute warmup, no risk of recompilation under load.

Approach is reusable

The methodology is: capture vLLM's Inductor output → identify hot fused patterns by name → port the highest-impact ones as hand-rolled Triton kernels. For Gemma4, the next two candidates are:

vLLM kernel	what it fuses	when worth porting
`triton_red_fused_add_mul_rms_norm_2`	post-FFN block + next layer's input_layernorm (cross-layer)	Requires plumbing `next_layer.input_layernorm` weight through the layer loop. ~1 launch/layer savings, but invasive. Recommend separate PR.
`triton_red_fused_3` + `triton_poi_fused_4`	attention preamble (qkv_rmsnorm + rope + cache layout prep)	SGLang's `gemma_qkv_rmsnorm` already does the 3-norm fusion; only RoPE+layout remains. ~1 launch/layer. Recommend separate PR.

Limitations

E2B/E4B still need a separate fix for the KV-shared-under-capture interaction before they can enable piecewise CG. The fused PLE-tail kernels and the triple-rmsnorm fusion both work on E2B (eager mode), but cannot pick up the additional PCG win until that bug is fixed.
The triple-rmsnorm fusion is gated to Gemma4 MoE variants (enable_moe_block=True with Gemma4Router.norm.with_scale=False). Non-MoE Gemma4 variants (gemma-4-31B-it) fall through to the original eager path.

Refs

vLLM Inductor dump at runs/20260524_vllm_inductor_inspect/cache/torch_compile_debug/...
Per-kernel mapping at runs/20260524_vllm_inductor_inspect/analysis/fusion_catalog.md
Implementation files: python/sglang/srt/layers/gemma4_fused_ops.py, python/sglang/srt/models/gemma4_causal.py, python/sglang/srt/server_args.py, python/sglang/srt/layers/radix_attention.py

CI States

Latest PR Test (Base): ❌ Run #26380474262
Latest PR Test (Extra): ❌ Run #26380474185

Gemma4MoE.routing_function previously emitted four per-layer GPU kernels: torch.topk -> at::native::sbtopk::gatherTopK<bf16,uint,2,false> + at::native::bitonicSortKVInPlace<2,-1,16,16,bf16,...> softmax -> at::native::cunn_SoftMaxForward<4,float,...> per_expert_scale[] -> at::native::index_elementwise_kernel<bf16,...> topk_weights * ... -> at::native::elementwise_kernel<MulFunctor<bf16>> cast to fp32 -> at::native::elementwise_kernel<copy> torch.profiler triage of `Gemma-4-26B-A4B-IT` + Gemma4 MTP on a single B200 (sm_100a, bf16, --attention-backend triton, --speculative-num-steps 3 --speculative-num-draft-tokens 4 --speculative-eagle-topk 1) attributed ~5.8% of decode GPU time to these split kernels. vLLM (PR vllm-project/vllm#39083) ships an equivalent single-launch Triton kernel that does the same logical work in ~1.1% of its decode GPU time. This commit ports the algorithm to SGLang: * New `_gemma4_routing_kernel` + `gemma4_fused_routing` in python/sglang/srt/layers/gemma4_fused_ops.py. One Triton program per token loads all E logits, packs (bijective(logit_bits), expert_id) into int64, runs a single `tl.sort`, masks to the K largest, softmaxes in fp32, multiplies by `per_expert_scale[topk_ids]`, and writes (weights, ids) in (fp32, int32). num_warps=1 because Gemma4 E=128 fits in a warp. * `Gemma4MoE.routing_function` now calls the fused kernel on CUDA fp16/ bf16/fp32 inputs and falls back to the torch path otherwise. Math is bitwise comparable on fp32 inputs and within bf16 round-trip eps for bf16/fp16. Real-model results on 1x B200 (host venv SGLang, baseline = PR sgl-project#26026 head + the 3 launch-blocking fixes): workload baseline this patch delta chat random 1000/1000 2729.30 tok/s 2880.94 tok/s +5.6% summariz. random 8000/1000 1060.98 tok/s 1108.42 tok/s +4.5% chat median TPOT (ms) 21.11 20.70 -1.9% chat accept length 2.75 2.80 +1.8% MMLU @ 500 random questions (seed 0, temp 0): 0.708 vs vLLM 0.710 -- no quality regression. Tests: test/srt/layers/test_gemma4_fused_routing.py exercises 47 shape/dtype combinations against the previous torch routing function. Provenance: algorithm follows vLLM `_gemma4_routing_kernel` (apache-2.0, PR vllm-project/vllm#39083); kernel rewritten from scratch in SGLang style. Co-authored-by: Claude

…l split Gemma-4 textual layers are a 25:5 SWA:full split (see `Gemma4TextConfig.layer_types`). SGLang's default `swa_full_tokens_ratio=0.8` is tuned for models where the sliding-window pool is the binding constraint; for Gemma-4 the **full-attention** pool is binding under any realistic concurrent long-context workload. On a 180 GB B200 with TP=1, bf16, MTP (assistant draft model), 16 k context, the default pool layout solves to: full_layer_tokens = 593_956 <-- fits ~65 concurrent 9k-token requests swa_layer_tokens = 475_164 <-- fits ~464 concurrent 1024-token windows A typical 80-prompt summarization workload (8 k input + 1 k output = 9 k tokens / request) needs ~720 k full-attention tokens. Because the full pool is too small, the scheduler partially evicts the KV of in-flight requests and re-prefills them later, visible in the serving log as: Prefill batch, ..., #cached-token: 1003, #new-token: 7010, ... These re-prefills inflate TTFT well past the measured per-step prefill GPU time. Setting `swa_full_tokens_ratio = 0.15` (matching the precedent in `apply_deepseek_v4_defaults`) shifts memory from the over-provisioned SWA pool to the under-provisioned full pool: full_layer_tokens = 2_138_243 <-- fits ~237 concurrent 9k-token reqs swa_layer_tokens = 320_736 <-- still ~313 1024-token windows Real-model results on the same B200 (host venv SGLang, baseline = PR #1 on pyc96/sglang head = sota-loop-base + fused router): workload Patch 1 this patch delta chat random 1000/1000 2881 tok/s 2913 tok/s +1.1 % summariz. random 8000/1000 median TTFT (ms) 10459 8763 **-16.2 %** output tok/s 1108 1097 -1.0 % median TPOT (ms) 44.6 37.9 -15.0 % Median summarization TTFT now matches vLLM nightly (8763 ms vs vLLM 8916 ms, within run-to-run noise). MMLU @ 500 random questions (seed 0, temp 0): SGLang 0.706 vs vLLM 0.710 -- within MMLU sampling noise; no regression. User override of `--swa-full-tokens-ratio` is preserved (mirrors the guard in `apply_deepseek_v4_defaults`). Tests: test/srt/test_gemma4_swa_full_tokens_ratio.py exercises the override-fires and user-override-preserved paths; 3 passed, 1 smoke test skipped on environments that do not have full ModelConfig stubs. Co-authored-by: Claude

Opt-in bounds-check before flashinfer trtllm_batch_decode_with_kv_cache that traps OOB page indices and dumps page_table + cache_seqlens. Turns the async CUDA illegal-address error into a deterministic Python exception with a serialisable dump for post-mortem. See crash_repro/TRIAGE_REPORT.md and crash_repro/repro_e4b_bounds.sh. Co-authored-by: Claude

…rap) Adds an opt-in trap inside SWATokenToKVPoolAllocator.alloc_extend and alloc_decode that fires when the SWA paged allocator returns a token index >= swa_pool_size, and dumps the offending alloc_swa_indices. Same env var (SGLANG_TRTLLM_MHA_DEBUG=1) as the trtllm_mha bounds check. Independent of attention backend, so we can run this on triton and trtllm_mha side-by-side and compare. Empirical result from running this on Gemma-4-E4B-IT + MTP + summarisation 8 k/1 k x 80 prompts: triton backend: SWA usage reaches 1.00, ZERO trap fires, no crash trtllm_mha backend: SWA usage 0.83-0.86, ZERO trap fires either, but CUDA illegal address crash in fmhaSm100fKernel_* That is, the SWA allocator is NOT the source of the OOB. Both backends write the same valid swa indices; what differs is how trtllm_mha's init_forward_metadata builds the page_table. Specifically: metadata.page_table = req_to_token[req_pool_indices, :max_seq_len_k] For rows where cache_seqlens_int32[row] < max_seq_len_k, the trailing positions are unwritten (zeros in req_to_token). full_to_swa_index_mapping[0] is the swa slot most recently bound to full slot 0, which can address any swa page (in-bounds for the SWA buffer, but the trtllm_mha kernel treats the row as the *whole* sequence-length window and dereferences it). This commit ships only the instrumentation, not a fix; the fix path (mask trailing page_table entries before translation OR use windowed indices like the triton backend) is recorded in crash_repro/TRIAGE_REPORT.md. Co-authored-by: Claude

…A crash Prevents the deterministic CUDA Warp Illegal Address crash in 'fmhaSm100fKernel_*SlidingOrChunkedCausal*' that triggers under Gemma-4 + --attention-backend trtllm_mha + MTP + summarization workloads at ~85% SWA pool utilization (see crash_repro/TRIAGE_REPORT.md). Root cause: the full_to_swa_index_mapping accumulates entries that become invalid in certain MTP draft-token allocation patterns; after //page_size, the resulting swa_page_table can contain values >= num_swa_pages, which the trtllm SWA kernel TMA-prefetches and traps on. Fix: clamp page_table values to [0, k_cache.shape[0] - 1] right before the kernel call in both forward_decode and forward_extend. Applies to BOTH the regular page_table and swa_page_table paths. Verification on Gemma-4-E4B-IT + trtllm_mha + MTP + summarization (8 k/1 k x 80 prompts, max_concurrency=64): before this fix: CRASH at ~85% SWA fill, ~30 s into bench after this fix: COMPLETED, output 4032 tok/s peak, no trap events Verification on Gemma-4-26B-A4B-IT + trtllm_mha + MTP + summarization (8 k/1 k x 80 prompts, max_concurrency=64): before: CRASH (same kernel, same SWA fill trigger) after: COMPLETED, output 1832 tok/s peak (vs Patch 1+2 triton 1097 tok/s = +67%), TPOT 25 ms (vs triton 38 ms = -34%), TTFT 2.9 s (vs triton 8.8 s = -67%) MMLU @ 500 questions on 26B with this fix: 0.718 (vs Patch 2 baseline 0.706, vLLM 0.710) -- within noise, no regression. KNOWN LIMITATION: accept length drops vs triton backend (1.69 vs 2.76 on 26B summarization). Clamped page indices that fall in the attention window cause the kernel to read the LAST valid SWA page's K/V instead of the correct one, producing slightly wrong attention values for those positions. The clamp is a defensive safety net, not a complete fix; the underlying ownership of stale full_to_swa_index_mapping entries needs upstream investigation (filed in humanize/source-idea-ledger.md as Patch E). For workloads where the quality regression is acceptable (or workloads that don't hit the near-pool-full edge), this fix unlocks the trtllm_mha attention backend with MTP -- which is otherwise unusable. Cost: one clamp() per kernel call (~few microseconds, no measurable perf impact). See crash_repro/TRIAGE_REPORT.md. Co-authored-by: Claude

Root-cause fix for the SWA-aware page_table OOB that crashed trtllm_mha + MTP + hybrid-SWA models (Gemma-4 26B-A4B-IT, E4B-IT). The TRTLLMHAAttnBackend caches use_sliding_window_kv_pool and _swa_kv_pool at __init__ time from model_runner.token_to_kv_pool. For the FROZEN_KV_MTP draft worker, the draft model_runner's pool is NOT an SWAKVPool (the draft model is a small assistant); so those SWA-aware attributes are set to (False, None) at init. At forward time, frozen_kv_target_view / target_kv_pool_view swap draft_attn_backend.token_to_kv_pool to the target's SWAKVPool, but the cached SWA-aware attributes are NOT updated. The backend then builds full-pool page_table values for layers that the assistant remaps to SWA layers (via Gemma4Assistant.bind_frozen_kv_context: assistant SWA layers all point at target physical layer 22 via the KV-shared owner map), and the trtllm_mha sm_100a paged-attention kernel (fmhaSm100fKernel_*SlidingOrChunkedCausal*) reads those out-of-range page indices from the SWA k_cache (only 8657 pages on E4B) and traps with Warp Illegal Address. Definitive evidence captured by the Patch-E investigation: [Patch-E DEBUG] backend has use_sliding_window_kv_pool=False, _swa_kv_pool is None? True, layer_id=22, layer.sliding_window_size=512 The fix has two parts: 1. frozen_kv_mtp_utils.py: add _maybe_swap_swa_state / _restore_swa_state helpers and wire them into both frozen_kv_target_view and target_kv_pool_view so the backend's use_sliding_window_kv_pool and _swa_kv_pool attributes flip in lockstep with the token_to_kv_pool swap. 2. trtllm_mha_backend.py: add self.model_has_sliding_window computed from model_runner.sliding_window_size and use it in _alloc_swa_page_table so the SWA page_table buffer is eagerly allocated even when the backend's pool is non-SWA at init. This is required for the FROZEN_KV_MTP cuda-graph capture path which binds the buffer at replay time. 3. frozen_kv_mtp_cuda_graph_runner.py: also swap SWA state during the cuda-graph capture wrapper (the manual swap there mirrors the context-manager pattern). Results on Gemma-4 + trtllm_mha + MTP + summarization (random 8 k/1 k × 80 prompts, max-concurrency=64 for E4B / unbounded for 26B): E4B | clamp PR #5 | this PR (proper) | delta -----|-------------|------------------|------- outcome OK OK same output tok/s 4032 4022 ~same accept length 1.61 **2.13** +32% total throughput 31.5 k tok/s 36.2 k tok/s +15% median TPOT (ms) 12.16 9.99 -18% 26B | clamp PR #5 | this PR (proper) | delta -----|-------------|------------------|------- outcome OK OK same output tok/s 1832 2503 +37% accept length 1.67 **2.84** +70% total throughput 16.5 k tok/s 22.5 k tok/s +37% median TPOT (ms) 24.97 20.35 -18% median TTFT (ms) 2887 3468 +20% benchmark duration ~60 s 32 s -47% 26B beats the triton baseline (1097 tok/s, TPOT 37.87 ms, accept 2.76) by +128%, -46%, +3% respectively. MMLU @ 500 questions: 0.716 (vs triton baseline 0.706, vLLM 0.710) -- within sampling noise. 26B chat 1000/1000: TTFT 510 ms (vs vLLM 880 ms), TPOT 8.72 ms (vs vLLM 8.46 ms), accept 2.89 (vs vLLM 2.80). This makes the defensive clamp from #5 unnecessary; that PR can be reverted (or kept as a belt-and-suspenders safety net). Co-authored-by: Claude

This reverts commit 5547e41. PR #5 (the clamp) is no longer needed because PR #6 (Patch E) eliminates the source of OOB page_table values entirely. The clamp's only side-effect was a known quality limitation -- when the clamp actually triggered, it replaced an OOB page index with the LAST valid SWA page, producing slightly wrong attention values for that position and lowering MTP draft acceptance. With Patch E in place those OOB values never occur and the clamp never fires, so it's dead code that adds one .clamp() per kernel call for no benefit. Verified after this revert (Gemma-4-E4B-IT + trtllm_mha + MTP + summarization 8 k/1 k x 80 on 1x B200): outcome: OK (zero trap events from PR #3 debug) accept length: matches the pre-revert PR #6 run TPOT: matches the pre-revert PR #6 run If a future code change reintroduces an OOB page_table value, the opt-in bounds-check trap from PR #3 (SGLANG_TRTLLM_MHA_DEBUG=1) will still catch it with a deterministic Python exception + dump for triage. Co-authored-by: Claude

Patch 2 (PR #2) set swa_full_tokens_ratio=0.15 for every Gemma-4 model. That value was tuned for `Gemma-4-26B-A4B-IT` (MoE, 128 experts, top-k 8) where the MoE sparsity leaves plenty of GPU memory for the full-attention KV pool, and the 5:1 SWA:full layer ratio means the shipped default 0.8 over-provisions the SWA pool. For dense Gemma-4 variants (`31B-it`, `E4B-IT`) the same ratio is harmful: dense weights take more GPU memory, leaving less for KV, so 0.15 shrinks the SWA pool below what an 80-request concurrent workload needs. Empirically (on `gemma-4-31B-it` + trtllm_mha + MTP + 1x B200 with 80 concurrent 1k/1k chat requests): ratio=0.15: SWA pool 71808 tokens (~70 windows-worth), saturates at 100%, scheduler stalls admission, output throughput collapses to ~1135 tok/s. ratio=0.8: SWA pool 106368 tokens (~104 windows-worth), still saturates at 80 concurrent reqs but at conc=32 the workload runs to completion at 4715 tok/s -- beats vLLM's 4077 tok/s on the same workload. This commit gates the 0.15 override on `num_experts > 0`, read from the model's `hf_text_config`. Mirrors the MoE-detection pattern in `gemma4_causal.py:1166`. Per-model verification on 1x B200: 26B-A4B-IT (MoE, num_experts=128): log: 'Setting swa_full_tokens_ratio to 0.15 for ... ' pool: full_layer_tokens=2138240 swa_layer_tokens=320704 (unchanged from Patch 2 -- regression-safe) 31B-it (dense, num_experts=0): log: 'Keeping default swa_full_tokens_ratio=0.8 ... ' pool: full_layer_tokens=132992 swa_layer_tokens=106368 (instead of the broken 478720 / 71808 layout from Patch 2) E4B-IT (dense, num_experts=0): same MoE-only-skipped path as 31B. Benchmark improvements on 31B-it + trtllm_mha + MTP + 1x B200 vs vLLM nightly (random 40 prompts x 1k/1k chat, max-concurrency=32): metric | SGLang (this PR) | vLLM nightly | Delta ------------------|------------------|--------------|---- outcome | OK | OK | same median TTFT | 673 ms | 901 ms | SGLang +25% median TPOT | 8.69 ms | 9.69 ms | SGLang +10% total throughput | 4715 tok/s | 4077 tok/s | SGLang +16% accept length | 3.13 | n/a | -- Same workload at conc=32 summarization (8k/1k x 40): median TPOT | 17.02 ms | 27.33 ms | SGLang +38% total throughput | 7475 tok/s | 6468 tok/s | SGLang +16% MMLU @ 500 questions on 31B-it: 0.680 vs vLLM 0.660 (within noise). Tests: 6 unit-test cases now cover (moe-default-overridden, dense-default-preserved, moe-user-override-preserved x 2 archs, moe-full-smoke, dense-full-smoke). Co-authored-by: Claude

…CG opt-in) Three independent changes to close the SGLang \u2194 vLLM TPOT gap when serving Gemma4 with the triton attention backend: 1. Fused PLE-tail kernels (gemma4_fused_ops.py) Adds two new Triton kernels: * gemma_rmsnorm_add(x, w, r) : out = rmsnorm(x,w) + r * gemma_gelu_tanh_mul(gate, ple) : out = gelu_tanh(gate) * ple Re-uses gemma_rmsnorm_residual_scalar for the 3rd tail stage. The PLE branch in Gemma4DecoderLayer.forward (taken when has_ple=True, i.e. E2B / E4B) used to issue 7 launches at the layer tail (post_ff_norm; add residual; gate gelu; mul ple; project norm; add+mul). The two GEMMs around the PLE input are unavoidable; the remaining five pointwise ops collapse into three Triton launches. For E2B (35 layers) that's ~140 launches saved per decode step. 2. Optional key/value in unified_attention_with_output (radix_attention.py) The piecewise/breakable CUDA graph attention wrapper sliced key / value unconditionally, which crashed on Gemma4 E2B / E4B KV-shared layers (those pass key=None, value=None and read both from the cache written by an earlier layer). The custom op now declares the args as Optional[torch.Tensor] and skips the slice when None. 3. Piecewise CUDA graph opt-in for multimodal models (server_args.py) The blanket disable for is_multimodal=True is too coarse: the piecewise CG runner already extracts model.language_model explicitly, so the vision tower stays eager while the language-model decode path gets piecewise capture. Default behavior is unchanged; opt in with SGLANG_ENABLE_PIECEWISE_CUDA_GRAPH_FOR_MM=1 to pick up the prefill capture. Safe today on Gemma-4-26B-A4B-IT (no KV-shared layers). Benchmark (1\u00d7 B200, vllm bench serve random text 3000-input/100-output, 30 prompts, vLLM nightly comparator): Gemma-4-26B-A4B-IT (--enforce-piecewise-cuda-graph + this PR): baseline dur 1.475s | TPOT 10.97ms | tok/s 63325 patched dur 1.405s | TPOT 9.80ms | tok/s 66438 vLLM nightly dur 1.635s | TPOT 9.99ms | tok/s 58420 -> SGLang patched now beats vLLM TPOT (9.80 vs 9.99 ms) and wall-time (1.405 vs 1.635 s) on this workload. gemma-4-E2B-it (fused PLE only; piecewise CG still disabled on E2B because of a separate KV-shared / capture interaction): baseline dur 0.895s | TPOT 5.44ms | tok/s 104329 patched dur 0.875s | TPOT 5.20ms | tok/s 105861 vLLM nightly dur 0.735s | TPOT 3.75ms | tok/s 127468 Quality (30-prompt color-naming MM test, temperature=0): 26B baseline 30/30 == patched 30/30 (29/30 char-match, 1 minor numerical noise from PCG capture, accuracy unchanged). E2B baseline 26/30 == patched 26/30 (30/30 char-match on the fused-PLE-only build). Test: test/srt/layers/test_gemma4_ple_fused_ops.py (10 CUDA tests). Refs: vllm-project/vllm uses analogous Inductor-level fusions in its piecewise compile pipeline; this PR ports the highest-impact subset directly into SGLang's Triton kernel library so Gemma4 closes the TPOT gap without depending on Inductor.

…re-MoE) Inspects vLLM's torch.compile/Inductor output for Gemma-4-26B-A4B-IT (via TORCH_COMPILE_DEBUG=1) and ports the highest-impact fused kernel into SGLang's Triton kernel library. The Inductor kernel `triton_red_fused_add_moe_forward_mul_rms_norm_0` fuses the entire post-attention-pre-MoE block: 1) post_attn_residual = rmsnorm(attn_out, w_post_attn) + residual 2) dense_ff_input = rmsnorm(post_attn_residual, w_pre_ff) 3) router_input = rmsnorm(post_attn_residual, 1) * router_scale 4) moe_input = rmsnorm(post_attn_residual, w_pre_ff_2) Steps 2, 3, 4 share the same rsqrt(variance(post_attn_residual)); Inductor walks the row twice for reductions and once for production, emitting all three outputs from a single kernel. This commit: * adds `gemma_post_attn_triple_rmsnorm` in gemma4_fused_ops.py that replicates the 3-pass-reduction layout in Triton. * wires Gemma4DecoderLayer.forward (MoE branch) to call it instead of the 4 separate kernel launches (post_attn_norm; pre_ff_norm fused-add; router.norm + scale; pre_ff_norm_2). * adds 4 CUDA-only unit tests against an eager reference. Eligibility gates (falls back to the original 4-launch sequence): * MoE branch active (enable_moe_block=True) * 2D contiguous bf16 hidden_states (the common decode path) * Gemma4Router with with_scale=False norm (the canonical setup) * Lazily populates router._fused_scale on the first call. Benchmark (1x B200, vllm bench serve random, vLLM nightly comparator, SGLANG_ENABLE_PIECEWISE_CUDA_GRAPH_FOR_MM=1 to enable PR #16's piecewise CG): Gemma-4-26B-A4B-IT workload A (3000-input / 100-output, 30 prompts): baseline dur 1.475s | TPOT 10.97ms | tok/s 63325 PR #16 only dur 1.406s | TPOT 9.80ms | tok/s 66437 + this PR dur 1.376s | TPOT 9.51ms | tok/s 67905 vLLM nightly dur 1.635s | TPOT 9.99ms | tok/s 59028 -> SGLang beats vLLM by 4.8% TPOT and 15.8% wall time. Workload B (500/500, 50 prompts): baseline: 5.49s | 10.54ms + this PR: 5.27s | 10.17ms (vLLM 6.19s | 12.02ms; -15.4% TPOT) Workload C (100/1000, 30 prompts, decode-heavy): baseline: 8.86s | 8.73ms + this PR: 8.51s | 8.45ms (vLLM 8.96s | 8.86ms; -4.6% TPOT) SGLang now beats vLLM on every workload, on both duration AND TPOT. Quality (30-prompt color-naming MM test, temperature=0): 26B baseline 30/30 (100%) == patched 30/30 (100%), 29/30 char-match (1 minor numerical noise). Refs: vLLM torch.compile Inductor output for Gemma-4-26B-A4B-IT (captured 2026-05-25 from vllm/vllm-openai:nightly with TORCH_COMPILE_DEBUG=1; pattern preserved in the run artifact at runs/20260524_vllm_inductor_inspect/analysis/fusion_catalog.md).

pyc96 and others added 13 commits May 22, 2026 00:26

Fix two assistant-MTP regressions surfaced by frozen-KV E4B smoke test

e07a7ac

Merge branch 'main' into pyc/fix/gemma4-assistant-mtp-regressions

2c94273

Fix Gemma-4 BF16 MoE backend auto-select on SM100

2a516ce

Merge branch 'main' into pyc/fix/gemma4-assistant-mtp-regressions

155cc4a

github-actions Bot added the blackwell label May 24, 2026

This was referenced May 25, 2026

perf(gemma4): ultimate composed branch — beats vLLM on every workload #18

Open

perf(gemma4): ULTIMATE v2 -- ties or beats vLLM no-MTP on 3 models (31B-it, 26B-A4B, E4B), MMLU tied #21

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(gemma4): close triton-attn TPOT gap (fused PLE tail + piecewise CG opt-in)#16

perf(gemma4): close triton-attn TPOT gap (fused PLE tail + piecewise CG opt-in)#16
pyc96 wants to merge 14 commits into
mainfrom
pyc/feat-gemma4-triton-fusions

pyc96 commented May 24, 2026 •

edited by github-actions Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pyc96 commented May 24, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How the Inductor fusion was identified

New fused kernel (commit 2)

Test plan

End-to-end benchmarks (1× B200, vLLM nightly comparator)

google/Gemma-4-26B-A4B-IT

Quality

Why hand-port instead of just enabling torch.compile?

Approach is reusable

Limitations

Refs

CI States

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pyc96 commented May 24, 2026 •

edited by github-actions Bot

Loading

`google/Gemma-4-26B-A4B-IT`