perf(gemma4 31b): cap chunked_prefill_size=4096 + bump mem_fraction floor to 0.88 (dense) by pyc96 · Pull Request #17 · pyc96/sglang

pyc96 · 2026-05-25T02:30:36Z

Summary

Two dense-Gemma-4-only auto-tune nudges in `_handle_model_specific_adjustments` (right after the existing MoE-only `swa_full_tokens_ratio` gate). Closes ~11 pp of the SGLang-vs-vLLM summ throughput gap on `google/gemma-4-31B-it` H100 TP=2.

What the patch does

| Setting | Default for 31B-it on H100 TP=2 | This PR | Why |
|---|---|---|---|
| `chunked_prefill_size` | auto-tuned to 8192 | 4096 | At 8192 a single 8k-token random-input prompt fills the entire prefill batch and blocks the decode batch from growing → peak #running-req stalls at 11-12. Capping at 4096 lets the scheduler pack two partial prefills per step → peak running-req ≈ 23. |
| `mem_fraction_static` | auto-tuned to 0.778 | 0.88 | Auto-tune leaves ~16 GB / GPU unused on 80 GB H100. Bumping the floor grows `max_total_num_tokens` 68k → 106k (+27 %), parity with vLLM nightly (109k). |

Both overrides:

* fire only inside the dense-Gemma-4 branch (MoE Gemma-4 has different memory characteristics; the MoE-only swa-ratio override above already retunes along a different axis)
* respect explicit user overrides via "only nudge in the right direction" predicates (`chunked_prefill_size` is only lowered when at the auto-tune 8192 ceiling; `mem_fraction_static` is only raised when below 0.88)
* log before/after values for debugging
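
The two predicates above can be sketched roughly as follows. This is a minimal illustration, not the code in `server_args.py`: the `Args` dataclass, the function name, and the constant names are hypothetical stand-ins, and only the thresholds (8192, 4096, 0.88) come from this PR description.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

# Thresholds quoted in the PR description; the names are hypothetical.
AUTO_TUNE_CHUNKED_PREFILL_CEILING = 8192
DENSE_GEMMA4_CHUNKED_PREFILL_CAP = 4096
DENSE_GEMMA4_MEM_FRACTION_FLOOR = 0.88


@dataclass
class Args:
    """Hypothetical stand-in for the two server args touched by the PR."""

    chunked_prefill_size: Optional[int] = None
    mem_fraction_static: Optional[float] = None


def apply_dense_gemma4_overrides(args: Args) -> Args:
    """Nudge each setting in one direction only, logging before/after."""
    # Lower the chunk size only when it sits at the auto-tune ceiling, so
    # smaller values such as 2048 or 4096 are left untouched.
    if args.chunked_prefill_size == AUTO_TUNE_CHUNKED_PREFILL_CEILING:
        logger.info(
            "Capping chunked_prefill_size at %d (was %d)",
            DENSE_GEMMA4_CHUNKED_PREFILL_CAP,
            args.chunked_prefill_size,
        )
        args.chunked_prefill_size = DENSE_GEMMA4_CHUNKED_PREFILL_CAP

    # Raise the memory fraction only when it is below the floor, so a
    # larger value such as 0.92 is left untouched.
    if (
        args.mem_fraction_static is not None
        and args.mem_fraction_static < DENSE_GEMMA4_MEM_FRACTION_FLOOR
    ):
        logger.info(
            "Bumping mem_fraction_static from %s to %s",
            args.mem_fraction_static,
            DENSE_GEMMA4_MEM_FRACTION_FLOOR,
        )
        args.mem_fraction_static = DENSE_GEMMA4_MEM_FRACTION_FLOOR

    return args
```

In this sketch a value of exactly 8192, or any memory fraction below 0.88, is treated the same whether it came from auto-tune or from the command line, because the predicates look only at the value and not at its origin.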

Measured impact

`google/gemma-4-31B-it`, H100 TP=2, triton attention, FROZEN_KV_MTP (3/4/1), `--max-running-requests 80`, 80 prompts, warmup 2, seed 1:

| Scenario | Metric | Baseline | This PR | vLLM nightly MTP | Gap closure |
|---|---|---:|---:|---:|---|
| summ 8k/1k | output tok/s | 316 | 425 | 868 | −62 % → −51 % (+33 % SGLang) |
| summ 8k/1k | median TTFT (ms) | 78,567 | 80,637 | 39,706 | unchanged |
| summ 8k/1k | median TPOT (ms) | 29.0 | 25.3 | 30.8 | SGLang wins |
| chat 1k/1k | output tok/s | 1483 | 1513 | 2972 | −50 % → −49 % |
| chat 1k/1k | median TTFT (ms) | 2785 | 2848 | 3081 | SGLang wins |
| chat 1k/1k | median TPOT (ms) | 29.3 | 33.6 | 14.2 | regression (within MTP path) |
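
The gap figures in the last column appear to express SGLang output throughput relative to vLLM. A minimal sketch of that arithmetic, assuming the gap is defined as (SGLang / vLLM − 1) × 100; the function name is hypothetical and the inputs are the "This PR" and vLLM values from the table:

```python
def relative_gap_pct(sglang_tok_s: float, vllm_tok_s: float) -> float:
    """Signed throughput gap of SGLang versus vLLM, in percent.

    Negative values mean SGLang is slower than vLLM.
    """
    return (sglang_tok_s / vllm_tok_s - 1.0) * 100.0


# "This PR" column against vLLM nightly MTP.
print(round(relative_gap_pct(425, 868)))    # summ 8k/1k -> -51
print(round(relative_gap_pct(1513, 2972)))  # chat 1k/1k -> -49
```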

MMLU N=500 (seed 0, temp 0): 0.780 vs vLLM 0.778 (tied; identical to pre-patch SGLang 0.780).

Server-log evidence

```
[gemma4 dense override] Capping chunked_prefill_size at 4096 for Gemma4ForConditionalGeneration (was 8192; dense Gemma-4 with FROZEN_KV_MTP admits only one full prefill per scheduler step at 8192, leaving the decode batch starved at high concurrency; capping at 4096 lifts summ throughput +33% on the campaign workload).
[gemma4 dense override] Bumping mem_fraction_static from 0.778 to 0.88 for Gemma4ForConditionalGeneration (dense Gemma-4: the auto-tuned ceiling leaves ~16 GB per GPU unused on H100 TP=2; floor grows the KV pool ~27% to bring parity with vLLM nightly).
[init] max_total_num_tokens=106600, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=80
```
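
To confirm that both overrides took effect on a running server, the `[init]` line can be checked mechanically. A small sketch, assuming the line keeps the comma-separated `key=value` layout shown above; the helper name is hypothetical:

```python
import re


def parse_init_line(line: str) -> dict[str, int]:
    """Extract integer key=value pairs from an `[init]` server-log line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


init_line = (
    "[init] max_total_num_tokens=106600, chunked_prefill_size=4096, "
    "max_prefill_tokens=16384, max_running_requests=80"
)
settings = parse_init_line(init_line)

# The cap from this PR should be visible in the parsed settings.
assert settings["chunked_prefill_size"] == 4096
print(settings["max_total_num_tokens"])  # 106600
```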

Remaining gap (deferred)

The patch closes ~11 pp of the summ gap but vLLM still wins (−51 %). The remaining gap is structural in vLLM's compilation surface:

```
vLLM nightly compilation_config dump:
pass_config.fuse_allreduce_rms = True
cudagraph_mode = FULL_AND_PIECEWISE
backend = inductor
cudagraph_capture_sizes = [1, 2, 4, 8, ..., 512]
```

In this campaign I verified (separately) that `--enable-torch-compile` on SGLang is Inductor-opaque against Gemma-4's custom Triton norm kernels (`gemma_qkv_rmsnorm`, `gemma_rmsnorm_residual_scalar`, `gemma_dual_rmsnorm_residual_scalar`), matching the 26b D1 finding. The Inductor build takes ~20 minutes and ends in a net loss on both summ and chat. Closing the rest requires SGLang piecewise-CUDA-graph + Inductor coverage that wraps the custom kernels via `@register_custom_op` so Inductor sees them as opaque, which is multi-week framework work.

Stack

PR base: `pyc/sota-gemma4-31b-mm-disabled` @ `3a3195b30` (post mm_disabled_models patch).

Files

| File | Change |
|---|---|
| `python/sglang/srt/server_args.py` | +80 real lines (rest of the 404-line diff is auto-format reflow of unrelated `assert` statements). The real change is one block inside the dense-Gemma-4 `elif` branch immediately after `Keeping default swa_full_tokens_ratio...`. |

CI States

Latest PR Test (Base): ❌ Missing run-ci label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.

…0.88 (PR closes summ tok/s gap to vLLM)

For dense Gemma-4 with FROZEN_KV_MTP (the gemma-4-31B-it H100 TP=2 campaign workload), the default scheduler config left two big perf wins on the floor:

1. chunked_prefill_size auto-tuned to 8192 on H100, which means each 8000-token random-input prompt fills the whole prefill batch and blocks the decode batch from growing. Peak #running-req stalls at 11-12. Capping at 4096 lets the scheduler pack two partial prefills per step, peak running-req climbs to ~23, and summarisation throughput lifts +33% (316 -> 421 tok/s).
2. mem_fraction_static auto-tunes to 0.778, leaving ~16 GB per GPU unused on 80 GB H100 TP=2. Bumping the floor to 0.88 grows max_total_num_tokens 68k -> 106k (+27%) and brings the SGLang KV pool into parity with vLLM nightly (109k tokens, 27.6 GiB KV).

Both overrides:

* fire only inside the dense-Gemma-4 branch of _handle_model_specific_adjustments (immediately after the existing MoE-only swa_full_tokens_ratio gate). MoE Gemma-4 has different memory characteristics; the MoE-only branch above already retunes along the swa-vs-full pool axis.
* respect explicit user overrides via 'only nudge in the right direction' predicates: chunked is only lowered when at the auto-tune ceiling of 8192 (preserves user-passed 2048/4096); mem_fraction is only raised when below 0.88 (preserves user-passed 0.92).
* log the before/after values for debugging.

Measured on google/gemma-4-31B-it, H100 TP=2, triton attention, FROZEN_KV_MTP (3 spec steps, 4 draft tokens, eagle topk 1), num_prompts=80, warmup 2, seed 1:

Scenario       | Baseline | This PR  | vLLM nightly  | Gap closure
---------------|---------:|---------:|--------------:|-------------
summ tok/s     | 316      | **425**  | 868           | -62% -> -51%
summ med TTFT  | 78,567   | 80,637   | 39,706        | unchanged
summ med TPOT  | 29.0     | 25.3     | 30.8          | SGLang wins
chat tok/s     | 1483     | **1513** | 2972          | -50% -> -49%
chat med TTFT  | 2785     | 2848     | 3081          | SGLang wins
chat med TPOT  | 29.3     | 33.6     | 14.2          | regression (within MTP path)

MMLU N=500 (seed 0, temp 0): 0.780 vs vLLM 0.778, tied (identical to the pre-patch SGLang result).

Note on remaining gap: the structural sources are vLLM's 'fuse_allreduce_rms' compile pass + 'cudagraph_mode=FULL_AND_PIECEWISE' + Inductor decode coverage.

vLLM nightly compilation_config dump:

    pass_config.fuse_allreduce_rms = True
    cudagraph_mode = FULL_AND_PIECEWISE
    backend = inductor
    cudagraph_capture_sizes = [1..512]

SGLang's --enable-torch-compile is verified (in this campaign) to be Inductor-opaque against the Gemma-4 custom Triton norm kernels (gemma_qkv_rmsnorm / gemma_rmsnorm_residual_scalar / gemma_dual_*), matching the 26b D1 finding. Closing the rest requires SGLang-side piecewise CUDA-graph + Inductor coverage that protects the custom kernels via @register_custom_op -- multi-week framework work.

Stack base: pyc/sota-gemma4-31b-mm-disabled @ 3a3195b

Co-authored-by: Claude

pyc96 mentioned this pull request May 25, 2026

perf(gemma4): ULTIMATE v2 -- ties or beats vLLM no-MTP on 3 models (31B-it, 26B-A4B, E4B), MMLU tied #21

Draft

pyc96 wants to merge 1 commit into `pyc/sota-gemma4-31b-mm-disabled` from `pyc/gemma4-31b-prefill-tune`
