perf(gemma4 31b): cap chunked_prefill_size=4096 + bump mem_fraction floor to 0.88 (dense) #17
Draft
pyc96 wants to merge 1 commit into
…0.88 (PR closes summ tok/s gap to vLLM)

For dense Gemma-4 with FROZEN_KV_MTP (the gemma-4-31B-it H100 TP=2 campaign workload), the default scheduler config left two big perf wins on the floor:

1. chunked_prefill_size auto-tuned to 8192 on H100, which means each 8000-token random-input prompt fills the whole prefill batch and blocks the decode batch from growing. Peak #running-req stalls at 11-12. Capping at 4096 lets the scheduler pack two partial prefills per step, peak running-req climbs to ~23, and summarisation throughput lifts +33% (316 -> 421 tok/s).
2. mem_fraction_static auto-tunes to 0.778, leaving ~16 GB per GPU unused on 80 GB H100 TP=2. Bumping the floor to 0.88 grows max_total_num_tokens 68k -> 106k (+27%) and brings the SGLang KV pool into parity with vLLM nightly (109k tokens, 27.6 GiB KV).

Both overrides:

* fire only inside the dense-Gemma-4 branch of _handle_model_specific_adjustments (immediately after the existing MoE-only swa_full_tokens_ratio gate). MoE Gemma-4 has different memory characteristics; the MoE-only branch above already retunes along the swa-vs-full pool axis.
* respect explicit user overrides via 'only nudge in the right direction' predicates: chunked is only lowered when at the auto-tune ceiling of 8192 (preserves user-passed 2048/4096); mem_fraction is only raised when below 0.88 (preserves user-passed 0.92).
* log the before/after values for debugging.

Measured on google/gemma-4-31B-it, H100 TP=2, triton attention, FROZEN_KV_MTP (3 spec steps, 4 draft tokens, eagle topk 1), num_prompts=80, warmup 2, seed 1:

Scenario       | Baseline | This PR  | vLLM nightly | Gap closure
---------------|---------:|---------:|-------------:|-------------
summ tok/s     |      316 |  **425** |          868 | -62% -> -51%
summ med TTFT  |   78,567 |   80,637 |       39,706 | unchanged
summ med TPOT  |     29.0 |     25.3 |         30.8 | SGLang wins
chat tok/s     |     1483 | **1513** |         2972 | -50% -> -49%
chat med TTFT  |     2785 |     2848 |         3081 | SGLang wins
chat med TPOT  |     29.3 |     33.6 |         14.2 | regression (within MTP path)

MMLU N=500 (seed 0, temp 0): 0.780 vs vLLM 0.778, tied (identical to the pre-patch SGLang result).

Note on remaining gap: the structural sources are vLLM's 'fuse_allreduce_rms' compile pass + 'cudagraph_mode=FULL_AND_PIECEWISE' + Inductor decode coverage.

vLLM nightly compilation_config dump:
    pass_config.fuse_allreduce_rms = True
    cudagraph_mode = FULL_AND_PIECEWISE
    backend = inductor
    cudagraph_capture_sizes = [1..512]

SGLang's --enable-torch-compile is verified (in this campaign) to be Inductor-opaque against the Gemma-4 custom Triton norm kernels (gemma_qkv_rmsnorm / gemma_rmsnorm_residual_scalar / gemma_dual_*), matching the 26b D1 finding. Closing the rest requires SGLang-side piecewise CUDA-graph + Inductor coverage that protects the custom kernels via @register_custom_op -- multi-week framework work.

Stack base: pyc/sota-gemma4-31b-mm-disabled @ 3a3195b

Co-authored-by: Claude
Summary
Two dense-Gemma-4-only auto-tune nudges in `_handle_model_specific_adjustments` (right after the existing MoE-only `swa_full_tokens_ratio` gate). Closes ~11 pp of the SGLang-vs-vLLM summ throughput gap on `google/gemma-4-31B-it` H100 TP=2.
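What the patch does
1. Caps `chunked_prefill_size` at 4096 (it auto-tunes to 8192 on H100), so the scheduler can pack two partial prefills per step instead of one full 8000-token prefill. Peak running requests climb from 11-12 to ~23.
2. Raises the `mem_fraction_static` floor to 0.88 (it auto-tunes to 0.778), growing `max_total_num_tokens` from 68k to 106k.
Both overrides:
* fire only inside the dense-Gemma-4 branch; MoE Gemma-4 is left to the existing MoE-only branch.
* respect explicit user values: `chunked_prefill_size` is lowered only when it equals the auto-tune ceiling of 8192, and `mem_fraction_static` is raised only when it is below 0.88.
* log the before/after values for debugging.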
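The "only nudge in the right direction" predicates described for this patch can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: `TuningArgs` and `nudge_dense_gemma4` are hypothetical names, and only the thresholds (8192, 4096, 0.88) come from the PR text.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in the PR description.
AUTO_TUNED_CHUNK = 8192   # auto-tune ceiling on H100
CHUNK_CAP = 4096          # cap applied for dense Gemma-4
MEM_FLOOR = 0.88          # floor applied for dense Gemma-4


@dataclass
class TuningArgs:
    """Hypothetical stand-in for the two server arguments being nudged."""
    chunked_prefill_size: int
    mem_fraction_static: float


def nudge_dense_gemma4(args: TuningArgs) -> List[str]:
    """Apply both nudges in place and return before/after log messages."""
    messages: List[str] = []

    # Lower the chunk size only when it sits exactly at the auto-tune ceiling,
    # so smaller user-passed values (e.g. 2048 or 4096) are left untouched.
    if args.chunked_prefill_size == AUTO_TUNED_CHUNK:
        messages.append(
            f"chunked_prefill_size: {args.chunked_prefill_size} -> {CHUNK_CAP}"
        )
        args.chunked_prefill_size = CHUNK_CAP

    # Raise the memory fraction only when it is below the floor,
    # so larger user-passed values (e.g. 0.92) are left untouched.
    if args.mem_fraction_static < MEM_FLOOR:
        messages.append(
            f"mem_fraction_static: {args.mem_fraction_static} -> {MEM_FLOOR}"
        )
        args.mem_fraction_static = MEM_FLOOR

    return messages
```

Under these predicates an explicitly passed 8192, or an explicitly passed fraction below 0.88, is indistinguishable from the auto-tuned value and would also be adjusted. Whether the real patch handles that case differently is not stated in the description.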
Measured impact
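`google/gemma-4-31B-it`, H100 TP=2, triton attention, FROZEN_KV_MTP (3 spec steps, 4 draft tokens, eagle topk 1), `--max-running-requests 80`, 80 prompts, warmup 2, seed 1:

Scenario       | Baseline | This PR  | vLLM nightly | Gap closure
---------------|---------:|---------:|-------------:|-------------
summ tok/s     |      316 |  **425** |          868 | -62% -> -51%
summ med TTFT  |   78,567 |   80,637 |       39,706 | unchanged
summ med TPOT  |     29.0 |     25.3 |         30.8 | SGLang wins
chat tok/s     |     1483 | **1513** |         2972 | -50% -> -49%
chat med TTFT  |     2785 |     2848 |         3081 | SGLang wins
chat med TPOT  |     29.3 |     33.6 |         14.2 | regression (within MTP path)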
MMLU N=500 (seed 0, temp 0): 0.780 vs vLLM 0.778 (tied; identical to pre-patch SGLang 0.780).
Server-log evidence
```
[gemma4 dense override] Capping chunked_prefill_size at 4096 for Gemma4ForConditionalGeneration (was 8192; dense Gemma-4 with FROZEN_KV_MTP admits only one full prefill per scheduler step at 8192, leaving the decode batch starved at high concurrency; capping at 4096 lifts summ throughput +33% on the campaign workload).
[gemma4 dense override] Bumping mem_fraction_static from 0.778 to 0.88 for Gemma4ForConditionalGeneration (dense Gemma-4: the auto-tuned ceiling leaves ~16 GB per GPU unused on H100 TP=2; floor grows the KV pool ~27% to bring parity with vLLM nightly).
[init] max_total_num_tokens=106600, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=80
```
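To check that both overrides took effect, the `[init]` line can be parsed into key/value pairs. The sketch below assumes the line has exactly the `key=integer` shape shown in the excerpt above; `parse_init_line` is a hypothetical helper, not part of SGLang.

```python
import re
from typing import Dict


def parse_init_line(line: str) -> Dict[str, int]:
    """Extract every `name=integer` pair from a log line into a dict."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


# The [init] line copied from the server-log excerpt in this PR.
INIT_LINE = (
    "[init] max_total_num_tokens=106600, chunked_prefill_size=4096, "
    "max_prefill_tokens=16384, max_running_requests=80"
)

settings = parse_init_line(INIT_LINE)

# Both overrides are visible in the parsed values.
chunk_capped = settings["chunked_prefill_size"] == 4096
pool_grown = settings["max_total_num_tokens"] > 100_000
```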
Remaining gap (deferred)
The patch closes ~11 pp of the summ gap but vLLM still wins (−51 %). The remaining gap is structural in vLLM's compilation surface:
```
vLLM nightly compilation_config dump:
pass_config.fuse_allreduce_rms = True
cudagraph_mode = FULL_AND_PIECEWISE
backend = inductor
cudagraph_capture_sizes = [1, 2, 4, 8, ..., 512]
```
In this campaign I verified (separately) that `--enable-torch-compile` on SGLang is Inductor-opaque against Gemma-4's custom Triton norm kernels (`gemma_qkv_rmsnorm`, `gemma_rmsnorm_residual_scalar`, `gemma_dual_rmsnorm_residual_scalar`) — matching the 26b D1 finding. The Inductor build takes ~20 minutes for a NET LOSS on both summ and chat. Closing the rest requires SGLang piecewise-CUDA-graph + Inductor coverage that wraps the custom kernels via `@register_custom_op` so Inductor sees them as opaque — multi-week framework work.
Stack
PR base: `pyc/sota-gemma4-31b-mm-disabled` @ `3a3195b30` (post mm_disabled_models patch).
Files
CI States
Latest PR Test (Base): ❌ Missing `run-ci` label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- `run-ci` is required first.