Skip to content

perf(gemma4 31b): cap chunked_prefill_size=4096 + bump mem_fraction floor to 0.88 (dense)#17

Draft
pyc96 wants to merge 1 commit into
pyc/sota-gemma4-31b-mm-disabledfrom
pyc/gemma4-31b-prefill-tune
Draft

perf(gemma4 31b): cap chunked_prefill_size=4096 + bump mem_fraction floor to 0.88 (dense)#17
pyc96 wants to merge 1 commit into
pyc/sota-gemma4-31b-mm-disabledfrom
pyc/gemma4-31b-prefill-tune

Conversation

@pyc96
Copy link
Copy Markdown
Owner

@pyc96 pyc96 commented May 25, 2026

Summary

Two dense-Gemma-4-only auto-tune nudges in `_handle_model_specific_adjustments` (right after the existing MoE-only `swa_full_tokens_ratio` gate). Closes ~11 pp of the SGLang-vs-vLLM summ throughput gap on `google/gemma-4-31B-it` H100 TP=2.

What the patch does

Setting Default for 31B-it on H100 TP=2 This PR Why
`chunked_prefill_size` auto-tuned to 8192 4096 At 8192 a single 8k-token random-input prompt fills the entire prefill batch and blocks the decode batch from growing → peak #running-req stalls at 11-12. Capping at 4096 lets the scheduler pack two partial prefills per step → peak running-req ≈ 23.
`mem_fraction_static` auto-tuned to 0.778 0.88 Auto-tune leaves ~16 GB / GPU unused on 80 GB H100. Bumping the floor grows `max_total_num_tokens` 68k → 106k (+27 %), parity with vLLM nightly (109k).

Both overrides:

  • fire only inside the dense-Gemma-4 branch (MoE Gemma-4 has different memory characteristics; the MoE-only swa-ratio override above already retunes along a different axis)
  • respect explicit user overrides via "only nudge in the right direction" predicates (chunked is only lowered when at the auto-tune 8192 ceiling; mem_fraction is only raised when below 0.88)
  • log before/after values for debugging

Measured impact

`google/gemma-4-31B-it`, H100 TP=2, triton attention, FROZEN_KV_MTP (3/4/1), `--max-running-requests 80`, 80 prompts, warmup 2, seed 1:

Scenario Metric Baseline This PR vLLM nightly MTP Gap closure
summ 8k/1k output tok/s 316 425 868 −62 % → −51 % (+33 % SGLang)
summ 8k/1k median TTFT (ms) 78,567 80,637 39,706 unchanged
summ 8k/1k median TPOT (ms) 29.0 25.3 30.8 SGLang wins
chat 1k/1k output tok/s 1483 1513 2972 −50 % → −49 %
chat 1k/1k median TTFT (ms) 2785 2848 3081 SGLang wins
chat 1k/1k median TPOT (ms) 29.3 33.6 14.2 within-MTP-path

MMLU N=500 (seed 0, temp 0): 0.780 vs vLLM 0.778 (tied; identical to pre-patch SGLang 0.780).

Server-log evidence

```
[gemma4 dense override] Capping chunked_prefill_size at 4096 for Gemma4ForConditionalGeneration (was 8192; dense Gemma-4 with FROZEN_KV_MTP admits only one full prefill per scheduler step at 8192, leaving the decode batch starved at high concurrency; capping at 4096 lifts summ throughput +33% on the campaign workload).
[gemma4 dense override] Bumping mem_fraction_static from 0.778 to 0.88 for Gemma4ForConditionalGeneration (dense Gemma-4: the auto-tuned ceiling leaves ~16 GB per GPU unused on H100 TP=2; floor grows the KV pool ~27% to bring parity with vLLM nightly).
[init] max_total_num_tokens=106600, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=80
```

Remaining gap (deferred)

The patch closes ~11 pp of the summ gap but vLLM still wins (−51 %). The remaining gap is structural in vLLM's compilation surface:

```
vLLM nightly compilation_config dump:
pass_config.fuse_allreduce_rms = True
cudagraph_mode = FULL_AND_PIECEWISE
backend = inductor
cudagraph_capture_sizes = [1, 2, 4, 8, ..., 512]
```

In this campaign I verified (separately) that `--enable-torch-compile` on SGLang is Inductor-opaque against Gemma-4's custom Triton norm kernels (`gemma_qkv_rmsnorm`, `gemma_rmsnorm_residual_scalar`, `gemma_dual_rmsnorm_residual_scalar`) — matching the 26b D1 finding. The Inductor build takes ~20 minutes for a NET LOSS on both summ and chat. Closing the rest requires SGLang piecewise-CUDA-graph + Inductor coverage that wraps the custom kernels via `@register_custom_op` so Inductor sees them as opaque — multi-week framework work.

Stack

PR base: `pyc/sota-gemma4-31b-mm-disabled` @ `3a3195b30` (post mm_disabled_models patch).

Files

File Change
`python/sglang/srt/server_args.py` +80 real lines (rest of the 404-line diff is auto-format reflow of unrelated `assert` statements). The real change is one block inside the dense-Gemma-4 `elif` branch immediately after `Keeping default swa_full_tokens_ratio...`.

CI States

Latest PR Test (Base): ❌ Missing run-ci label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.

…0.88 (PR closes summ tok/s gap to vLLM)

For dense Gemma-4 with FROZEN_KV_MTP (the gemma-4-31B-it H100 TP=2
campaign workload), the default scheduler config left two big perf
wins on the floor:

1. chunked_prefill_size auto-tuned to 8192 on H100, which means each
   8000-token random-input prompt fills the whole prefill batch and
   blocks the decode batch from growing.  Peak #running-req stalls at
   11-12.  Capping at 4096 lets the scheduler pack two partial prefills
   per step, peak running-req climbs to ~23, and summarisation
   throughput lifts +33% (316 -> 421 tok/s).

2. mem_fraction_static auto-tunes to 0.778, leaving ~16 GB per GPU
   unused on 80 GB H100 TP=2.  Bumping the floor to 0.88 grows
   max_total_num_tokens 68k -> 106k (+27%) and brings the SGLang KV
   pool into parity with vLLM nightly (109k tokens, 27.6 GiB KV).

Both overrides:
* fire only inside the dense-Gemma-4 branch of
  _handle_model_specific_adjustments (immediately after the existing
  MoE-only swa_full_tokens_ratio gate).  MoE Gemma-4 has different
  memory characteristics; the MoE-only branch above already retunes
  along the swa-vs-full pool axis.
* respect explicit user overrides via 'only nudge in the right
  direction' predicates: chunked is only lowered when at the auto-tune
  ceiling of 8192 (preserves user-passed 2048/4096); mem_fraction is
  only raised when below 0.88 (preserves user-passed 0.92).
* log the before/after values for debugging.

Measured on google/gemma-4-31B-it, H100 TP=2, triton attention,
FROZEN_KV_MTP (3 spec steps, 4 draft tokens, eagle topk 1), num_prompts=80,
warmup 2, seed 1:

  Scenario       | Baseline | This PR  | vLLM nightly  | Gap closure
  ---------------|---------:|---------:|--------------:|-------------
  summ tok/s     |  316     | **425**  |  868          | -62% -> -51%
  summ med TTFT  |  78,567  |  80,637  |  39,706       | unchanged
  summ med TPOT  |   29.0   |   25.3   |   30.8        | SGLang wins
  chat tok/s     | 1483     | **1513** | 2972          | -50% -> -49%
  chat med TTFT  |  2785    |  2848    |  3081         | SGLang wins
  chat med TPOT  |   29.3   |   33.6   |   14.2        | regression (within MTP path)

MMLU N=500 (seed 0, temp 0): 0.780 vs vLLM 0.778, tied (identical to
the pre-patch SGLang result).

Note on remaining gap: the structural sources are vLLM's
'fuse_allreduce_rms' compile pass + 'cudagraph_mode=FULL_AND_PIECEWISE'
+ Inductor decode coverage.  vLLM nightly compilation_config dump:
  pass_config.fuse_allreduce_rms = True
  cudagraph_mode = FULL_AND_PIECEWISE
  backend = inductor
  cudagraph_capture_sizes = [1..512]
SGLang's --enable-torch-compile is verified (in this campaign) to be
Inductor-opaque against the Gemma-4 custom Triton norm kernels
(gemma_qkv_rmsnorm / gemma_rmsnorm_residual_scalar / gemma_dual_*),
matching the 26b D1 finding.  Closing the rest requires SGLang-side
piecewise CUDA-graph + Inductor coverage that protects the custom
kernels via @register_custom_op -- multi-week framework work.

Stack base: pyc/sota-gemma4-31b-mm-disabled @ 3a3195b
Co-authored-by: Claude
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant