fix(gemma4): only apply swa_full_tokens_ratio=0.15 to MoE variants #8
Open
pyc96 wants to merge 1 commit into
Patch 2 (PR #2) set swa_full_tokens_ratio=0.15 for every Gemma-4 model. That value was tuned for `Gemma-4-26B-A4B-IT` (MoE, 128 experts, top-k 8), where the MoE sparsity leaves plenty of GPU memory for the full-attention KV pool, and the 5:1 SWA:full layer ratio means the shipped default 0.8 over-provisions the SWA pool.

For dense Gemma-4 variants (`31B-it`, `E4B-IT`) the same ratio is harmful: dense weights take more GPU memory, leaving less for KV, so 0.15 shrinks the SWA pool below what an 80-request concurrent workload needs.

Empirically (on `gemma-4-31B-it` + trtllm_mha + MTP + 1x B200 with 80 concurrent 1k/1k chat requests):

  ratio=0.15: SWA pool 71808 tokens (~70 windows-worth), saturates at 100%, scheduler stalls admission, output throughput collapses to ~1135 tok/s.
  ratio=0.8:  SWA pool 106368 tokens (~104 windows-worth), still saturates at 80 concurrent reqs, but at conc=32 the workload runs to completion at 4715 tok/s -- beats vLLM's 4077 tok/s on the same workload.

This commit gates the 0.15 override on `num_experts > 0`, read from the model's `hf_text_config`. Mirrors the MoE-detection pattern in `gemma4_causal.py:1166`.

Per-model verification on 1x B200:

  26B-A4B-IT (MoE, num_experts=128):
    log:  'Setting swa_full_tokens_ratio to 0.15 for ... '
    pool: full_layer_tokens=2138240 swa_layer_tokens=320704
          (unchanged from Patch 2 -- regression-safe)
  31B-it (dense, num_experts=0):
    log:  'Keeping default swa_full_tokens_ratio=0.8 ... '
    pool: full_layer_tokens=132992 swa_layer_tokens=106368
          (instead of the broken 478720 / 71808 layout from Patch 2)
  E4B-IT (dense, num_experts=0):
    same MoE-only-skipped path as 31B.

Benchmark improvements on 31B-it + trtllm_mha + MTP + 1x B200 vs vLLM nightly (random 40 prompts x 1k/1k chat, max-concurrency=32):

  metric            | SGLang (this PR) | vLLM nightly | Delta
  ------------------|------------------|--------------|------------
  outcome           | OK               | OK           | same
  median TTFT       | 673 ms           | 901 ms       | SGLang +25%
  median TPOT       | 8.69 ms          | 9.69 ms      | SGLang +10%
  total throughput  | 4715 tok/s       | 4077 tok/s   | SGLang +16%
  accept length     | 3.13             | n/a          | --

Same workload at conc=32 summarization (8k/1k x 40):

  metric            | SGLang (this PR) | vLLM nightly | Delta
  ------------------|------------------|--------------|------------
  median TPOT       | 17.02 ms         | 27.33 ms     | SGLang +38%
  total throughput  | 7475 tok/s       | 6468 tok/s   | SGLang +16%

MMLU @ 500 questions on 31B-it: 0.680 vs vLLM 0.660 (within noise).

Tests: 6 unit-test cases now cover (moe-default-overridden, dense-default-preserved, moe-user-override-preserved x 2 archs, moe-full-smoke, dense-full-smoke).

Co-authored-by: Claude
This was referenced May 25, 2026
Summary
Refine Patch 2 (#2) so the `swa_full_tokens_ratio=0.15` override only applies to MoE Gemma-4 variants (`Gemma-4-26B-A4B-IT`), not to dense variants (`gemma-4-31B-it`, `Gemma-4-E4B-IT`).

Stacked on #7 (clamp revert). Staged on `pyc96/sglang` only.
Motivation
Patch 2's `0.15` ratio was tuned for the MoE 26B-A4B-IT: MoE sparse weights leave plenty of GPU memory for KV, so growing the full-attn pool 3.6x (and shrinking the over-provisioned SWA pool) materially improves long-context summarization TTFT.
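For dense 31B and E4B, the same ratio is harmful:
- Dense weights take more GPU memory, leaving less for KV. On `gemma-4-31B-it`, ratio=0.15 shrinks the SWA pool to 71808 tokens (~70 windows-worth). Concurrent 80-prompt chat → SWA usage 1.00 → scheduler stalls admission → output throughput collapses to 1135 tok/s.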
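The pool sizes quoted in this PR can be sanity-checked with a little arithmetic. The sketch below only re-derives numbers from the reported logs; it does not call into SGLang. The 1024-token window is an assumption inferred from the "~70 windows-worth" figure, and the helper names are made up for illustration.

```python
# (full_layer_tokens, swa_layer_tokens, configured ratio), as reported in this PR.
POOLS = {
    "26B-A4B-IT @ 0.15": (2138240, 320704, 0.15),
    "31B-it @ 0.15": (478720, 71808, 0.15),
    "31B-it @ 0.8": (132992, 106368, 0.8),
}

# ASSUMPTION: sliding-window size implied by "71808 tokens (~70 windows-worth)".
ASSUMED_WINDOW_TOKENS = 1024


def observed_ratio(full_tokens: int, swa_tokens: int) -> float:
    """SWA pool size divided by full-attention pool size."""
    return swa_tokens / full_tokens


def windows_worth(swa_tokens: int, window: int = ASSUMED_WINDOW_TOKENS) -> float:
    """How many full sliding windows fit in the SWA pool."""
    return swa_tokens / window


if __name__ == "__main__":
    for name, (full, swa, ratio) in POOLS.items():
        print(
            f"{name}: observed ratio {observed_ratio(full, swa):.4f} "
            f"(configured {ratio}), ~{windows_worth(swa):.0f} windows"
        )
    # Full-attention pool growth on 31B-it when moving from 0.8 to 0.15.
    print(f"full pool growth: {478720 / 132992:.2f}x")
```

Under that window assumption, the 0.15 layout on 31B-it holds fewer windows than the 80 concurrent requests in the failing workload, which is consistent with the saturation described above.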
This commit gates the override on `num_experts > 0` (read from `hf_text_config`), the standard SGLang MoE-detection pattern (mirrors `gemma4_causal.py:1166`).

Per-model behavior (verified on 1x B200)
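- `Gemma-4-26B-A4B-IT` (MoE): override applied, `swa_full_tokens_ratio=0.15`
- `gemma-4-31B-it` (dense): default kept, `swa_full_tokens_ratio=0.8`
- `Gemma-4-E4B-IT` (dense): default kept, same path as 31B

Log lines:
- MoE: `Setting swa_full_tokens_ratio to 0.15 for ...`
- dense: `Keeping default swa_full_tokens_ratio=0.8 ...`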
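The gating rule described in the Motivation section can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_swa_full_tokens_ratio` is a hypothetical helper, the config objects are plain stand-ins for `hf_text_config`, and treating a missing `num_experts` as dense is an assumption.

```python
from types import SimpleNamespace

UPSTREAM_DEFAULT_RATIO = 0.8  # shipped default, per this PR
MOE_TUNED_RATIO = 0.15        # value tuned for Gemma-4-26B-A4B-IT


def resolve_swa_full_tokens_ratio(current_ratio: float, hf_text_config) -> float:
    """Hypothetical helper: pick the ratio for a Gemma-4 model."""
    # A non-default value is taken to be a user override and is left alone.
    if current_ratio != UPSTREAM_DEFAULT_RATIO:
        return current_ratio
    # ASSUMPTION: a missing or None num_experts is treated as a dense model.
    num_experts = getattr(hf_text_config, "num_experts", 0) or 0
    if num_experts > 0:
        return MOE_TUNED_RATIO        # MoE variant: apply the tuned override
    return UPSTREAM_DEFAULT_RATIO     # dense variant: keep the default


# Stand-ins for the text configs of the three models discussed here.
moe_26b = SimpleNamespace(num_experts=128)
dense_31b = SimpleNamespace(num_experts=0)

if __name__ == "__main__":
    print(resolve_swa_full_tokens_ratio(0.8, moe_26b))
    print(resolve_swa_full_tokens_ratio(0.8, dense_31b))
    print(resolve_swa_full_tokens_ratio(0.5, moe_26b))
```

The three calls correspond to the moe-default-overridden, dense-default-preserved, and moe-user-override-preserved cases named in the commit message.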
Benchmark: 31B-it + trtllm_mha + MTP + 1x B200 vs vLLM nightly
random 40 prompts, max-concurrency=32, seed 1:

chat 1k/1k
summarization 8k/1k
Quality (MMLU @ N=500, seed=0, temp=0)
SGLang 0.680 vs vLLM 0.660 on 31B-it. Within sampling noise.
Regression test for 26B-A4B-IT
The 26B path is unchanged. Verified by relaunching 26B and confirming:
- `Setting swa_full_tokens_ratio to 0.15 for ... (MoE Gemma-4 with num_experts=128; ...)`
- `full_layer_tokens=2138240 swa_layer_tokens=320704` (same as Patch 2)

Tests
`test/srt/test_gemma4_swa_full_tokens_ratio.py` -- 4 cases pass, 2 skipped (full smoke tests skip when env lacks model-config stubs, same as before):
Known limitation (out-of-scope)
The predicate `if self.swa_full_tokens_ratio == ServerArgs.swa_full_tokens_ratio` still cannot distinguish "user passed `0.8` explicitly" from "user didn't pass the flag" (same caveat as the upstream `apply_deepseek_v4_defaults` pattern). Fixing this would require a sentinel default (`None`) and resolution in `__post_init__`, which is a wider surface area than this PR's MoE gating fix. For MoE Gemma-4, a user passing `0.8` will still get the override; if they want the upstream default explicitly, they can pass any other value like `0.81`.

CI States
Latest PR Test (Base): ❌ Missing `run-ci` label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- `run-ci` is required first.
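Note on the known limitation above: the sentinel-default approach it mentions might look roughly like the sketch below. `SketchServerArgs` is a hypothetical stand-in, not SGLang's `ServerArgs`, and the `num_experts` field replaces the real `hf_text_config` lookup purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

UPSTREAM_DEFAULT_RATIO = 0.8  # shipped default, per this PR
MOE_TUNED_RATIO = 0.15        # value tuned for the MoE variant


@dataclass
class SketchServerArgs:
    """Hypothetical stand-in for the real server-args dataclass."""

    # None is the sentinel meaning "the user did not pass the flag".
    swa_full_tokens_ratio: Optional[float] = None
    # ASSUMPTION: stands in for reading num_experts from hf_text_config.
    num_experts: int = 0

    def __post_init__(self) -> None:
        # Resolve the sentinel only when the user gave no value, so an
        # explicit 0.8 is no longer confused with "flag not passed".
        if self.swa_full_tokens_ratio is None:
            self.swa_full_tokens_ratio = (
                MOE_TUNED_RATIO if self.num_experts > 0 else UPSTREAM_DEFAULT_RATIO
            )


if __name__ == "__main__":
    print(SketchServerArgs(num_experts=128).swa_full_tokens_ratio)
    print(SketchServerArgs(swa_full_tokens_ratio=0.8, num_experts=128).swa_full_tokens_ratio)
```

With a sentinel, an MoE user who explicitly passes `0.8` would keep `0.8`, so the `0.81` workaround would no longer be needed.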