[Gemma4] Enable Fast Prefill Optimization #38879
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces comprehensive support for the Gemma 4 model family, encompassing both text-only and multimodal (image, audio, and video) capabilities. Key additions include specialized proportional RoPE, logic to handle heterogeneous head dimensions, and custom parsers for reasoning and tool calls. The review feedback identifies several critical robustness and performance improvements:
- replacing process-terminating `sys.exit(1)` calls with `ValueError` exceptions
- conditionalizing tensor clones in the fast prefill path to reduce memory use
- using `try...finally` blocks to ensure global context consistency
- bounds-checking batch sizes during profiling to prevent potential runtime crashes
Port the --kv-sharing-fast-prefill optimization from Gemma3n to Gemma4. When enabled, cross-decoder layers (KV-shared) skip prefill tokens and only process decode tokens, reducing TTFT by ~36% and improving throughput by up to ~39% under concurrent load. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
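The core idea can be illustrated with a minimal sketch. The function and tensor names below are hypothetical, not the actual vLLM implementation: a KV-shared cross-decoder layer gathers only the decode-token rows of the packed hidden-state tensor, runs attention on them, and scatters the results back, leaving skipped prefill positions untouched.

```python
import torch

def fast_prefill_forward(hidden_states: torch.Tensor,
                         decode_indices: torch.Tensor,
                         cross_decoder_layer) -> torch.Tensor:
    """Run a KV-shared cross-decoder layer on decode tokens only.

    hidden_states:  [num_tokens, hidden_size] for the whole packed batch
    decode_indices: flat positions of decode tokens; prefill tokens are
                    skipped, since their KV was already written by the
                    self-decoder layers that own the shared cache
    """
    # Gather only the tokens whose attention output is still needed.
    decode_states = hidden_states[decode_indices]
    # The cross-decoder attends against the shared KV cache.
    decode_out = cross_decoder_layer(decode_states)
    # Scatter results back; skipped prefill positions keep their inputs
    # (their hidden states are never read again for logits).
    out = hidden_states.clone()
    out[decode_indices] = decode_out
    return out
```

With an 8K-token prefill and a handful of decode tokens, the cross-decoder layers thus touch only the small gathered slice, which is where the TTFT reduction comes from.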
Force-pushed from 846935d to d67a659.
RyanMullins left a comment:
LGTM. Shared layers don't compute, so you can early-exit depending on the config.
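The early exit suggested above could look roughly like the sketch below (hypothetical names, not the actual model code): once a batch contains only prefill tokens and fast prefill is enabled, the KV-shared cross-decoder layers would produce outputs nobody reads, so the layer loop can stop before them.

```python
def run_decoder_layers(layers, x, *, prefill_only, first_shared_idx):
    """Hypothetical early-exit sketch for a prefill-only batch.

    layers:           ordered decoder layers; layers[first_shared_idx:]
                      are the KV-shared cross-decoder layers
    prefill_only:     True when the batch has no decode tokens
    first_shared_idx: index of the first KV-shared layer
    """
    for i, layer in enumerate(layers):
        if prefill_only and i == first_shared_idx:
            break  # skip all KV-shared layers: their output is unused
        x = layer(x)
    return x
```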
Verified on TPU with the same setup as vllm-project/tpu-inference#2126 (comment); the MMMU-Pro score is identical before and after this PR. Performance metrics untested.
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> (cherry picked from commit 47e6050)
Just an FYI: this may have created correctness issues in certain situations. While debugging #39392, reverting this PR appears to have fixed that problem for me. It could be that this only affects certain hardware or setups; I'm using a DGX Spark, and the user in #39392 was using 8x RTX 4090.
Summary
Add `--kv-sharing-fast-prefill` support for Gemma 4 models, porting the YOCO (You Only Cache Once) fast prefill optimization from Gemma3n. When enabled, the cross-decoder (KV-shared) layers skip prefill tokens and only process decode tokens, significantly reducing prefill latency and improving throughput under concurrent load.

Shout-out to @sarckk for the original optimization (#22628).
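To skip prefill tokens, the model runner must first separate decode tokens from prefill tokens in the flattened batch. A simplified pure-Python sketch of that selection follows; the function name and inputs are illustrative (real batches also have to account for chunked-prefill edge cases):

```python
def decode_token_indices(query_lens, context_lens):
    """Return flat indices of decode tokens in a packed batch.

    A request is treated as being in the decode phase when it contributes
    a single new token (query_len == 1) on top of existing context; longer
    query spans are prefill chunks, which KV-shared layers may skip.
    """
    indices, offset = [], 0
    for q_len, ctx_len in zip(query_lens, context_lens):
        if q_len == 1 and ctx_len > 0:
            indices.append(offset)  # the lone decode token of this request
        offset += q_len
    return indices
```

For a batch mixing prefill requests (query_len > 1) with decode requests (query_len == 1), only the decode positions are returned, and only those rows are fed to the cross-decoder layers.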
Test Plan
- GSM8K accuracy (Gemma4-E4B, 5-shot)
- Serving benchmark
Test Results
GSM8K accuracy (Gemma4-E4B, 5-shot): no accuracy regression.
Serving performance (Gemma4-E4B, 1xB200, ISL=8192, OSL=150, n=256), measured at concurrency=8 and concurrency=32.