[Bugfix][Gemma 4] Clamp soft-token estimate to max_soft_tokens #40796
Merged
Isotr0py merged 3 commits into vllm-project:main on May 2, 2026
Conversation
Contributor
Code Review
This pull request introduces a typed MultimodalPlaceholderMismatch exception to handle client-input errors in multimodal models, preventing engine crashes and allowing the OpenAI serving layer to return HTTP 400 instead of 500. The changes include updates to EngineCore to catch these errors during the step phase, propagation of error metadata through the output pipeline, and comprehensive regression tests. Feedback was provided to ensure type consistency in EngineCoreOutputs by converting a list of request IDs to a set to match the expected schema.
Isotr0py reviewed Apr 25, 2026
hnt2601 added a commit to hnt2601/vllm that referenced this pull request Apr 27, 2026
Address review feedback on PR vllm-project#40796: replace the MagicMock-based _make_processing_info helper with build_model_context + MULTIMODAL_REGISTRY.create_processor, mirroring the existing test_limit_mm_per_prompt pattern. Drop the now-unused unittest.mock and Gemma4ProcessingInfo imports. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Hoang Nguyen <118159510+hnt2601@users.noreply.github.com>
Isotr0py reviewed Apr 28, 2026
Isotr0py approved these changes Apr 28, 2026
auto-merge was automatically disabled
May 2, 2026 01:34
Head branch was pushed to by a user without write access
For extreme-aspect-ratio images (e.g. 3x900), the prompt-side `Gemma4ProcessingInfo._compute_num_soft_tokens` returned more soft tokens than the HF Gemma 4 image processor's vision tower actually emits (which is capped at `max_soft_tokens`). The mismatch caused `_merge_multimodal_embeddings` to fail with `Attempted to assign 280 multimodal tokens to 289 placeholders` mid-forward, propagating `ValueError` out of `EngineCore.step()`. Fix: clamp the return value to `max_soft_tokens` so the prompt-side placeholder count matches the encoder output for any aspect ratio. Adds a parametrized unit test on the arithmetic that pins extreme aspect ratios (including pan-and-scan paths) without loading the real HF model. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Hoang Nguyen <118159510+hnt2601@users.noreply.github.com>
Contributor
Author
@Mergifyio update
Contributor
✅ Branch has been successfully updated
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Summary

For extreme-aspect-ratio images (e.g. 3x900), the prompt-side `Gemma4ProcessingInfo._compute_num_soft_tokens` returned more soft tokens than the HF Gemma 4 image processor's vision tower actually emits (which is capped at `max_soft_tokens`). The mismatch crashed `_merge_multimodal_embeddings` mid-forward and propagated `ValueError` out of `EngineCore.step()` with:

> Attempted to assign 280 multimodal tokens to 289 placeholders

This PR fixes the bug by clamping the prompt-side estimator to `max_soft_tokens` so the placeholder count always matches what the encoder will emit, and adds a small typed-exception layer in the Gemma 4 model so a future engine-layer change can classify this failure mode without parsing free-form `ValueError` messages.

Root cause
`Gemma4ProcessingInfo._compute_num_soft_tokens` computes the target H/W via `max(unit, floor(dim * scale / unit) * unit)`. For very thin or very tall images the `max(unit, …)` floor clamps one dimension up to `unit`, while the other scales freely. After dividing by `patch_size**2 * pooling_kernel_size**2`, the result can exceed `max_soft_tokens`.

Concrete repro for `image_height=3, image_width=900, max_soft_tokens=280, patch_size=14, pooling_kernel_size=2`:

- `scale ≈ 9.02`
- `target_h = 28` (floor would give 0; `max(unit=28, …)` lifts it)
- `target_w = 8092`
- `num_patches = (28 // 14) * (8092 // 14) = 1156`
- `soft_tokens = 1156 // 4 = 289`

The HF Gemma 4 image processor caps its vision-tower output at `max_soft_tokens = 280`, so the prompt has 289 image placeholders but only 280 embeddings to fill them.

Changes
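Both the overshoot and the effect of the clamp can be reproduced with a standalone sketch of the arithmetic described above. This is an approximation, not vLLM's actual code; in particular, the `scale` formula is inferred from the repro numbers rather than quoted from the source:

```python
import math


def estimate_soft_tokens(h: int, w: int, max_soft_tokens: int = 280,
                         patch_size: int = 14, pooling_kernel_size: int = 2,
                         clamp: bool = True) -> int:
    """Approximate the prompt-side soft-token estimate for an h x w image."""
    unit = patch_size * pooling_kernel_size            # 28
    # Assumption: scale targets a resized area of max_soft_tokens * unit**2.
    scale = math.sqrt(max_soft_tokens * unit ** 2 / (h * w))
    # The max(unit, ...) floor lifts a collapsed dimension up to one unit,
    # while the other dimension scales freely -- the source of the overshoot.
    target_h = max(unit, math.floor(h * scale / unit) * unit)
    target_w = max(unit, math.floor(w * scale / unit) * unit)
    num_patches = (target_h // patch_size) * (target_w // patch_size)
    n = num_patches // pooling_kernel_size ** 2
    return min(n, max_soft_tokens) if clamp else n


print(estimate_soft_tokens(3, 900, clamp=False))  # 289: overshoots the cap
print(estimate_soft_tokens(3, 900, clamp=True))   # 280: matches encoder output
```

With the cap disabled the 3x900 case reproduces the 289-placeholder estimate from the repro; with the clamp it lands exactly on `max_soft_tokens`.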
1. Clamp at the prompt-side estimator (the fix)

`Gemma4ProcessingInfo._compute_num_soft_tokens` now returns `min(num_patches // (pooling_kernel_size**2), max_soft_tokens)`. This is a strict tightening of an existing upper bound: any caller that previously got a value ≤ `max_soft_tokens` is unaffected; any caller that got a value > `max_soft_tokens` was already producing a hard crash downstream.

2. Typed exception for the count mismatch (defense in depth)
Adds `Gemma4MultimodalPlaceholderMismatch(ValueError)` and a small `_count_multimodal_embedding_rows` helper to `gemma4_mm.py`, plus an explicit count check at the top of `Gemma4ForConditionalGeneration.embed_input_ids`.

The clamp above means this branch should not fire under normal operation. It exists so that any future regression reintroducing a count mismatch raises a typed `ValueError` subclass with structured `actual`/`expected` attributes, instead of the generic `ValueError` that `_merge_multimodal_embeddings` raises after a failing `inputs_embeds[is_multimodal] = mm_embeds_flat` index-put. Subclassing `ValueError` keeps existing `except ValueError` handlers working unchanged.

Note: the exception is not yet caught at the engine layer; that hardening is a separate follow-up. Until then, the failure mode for a count mismatch is the same as before, but the offending case is identifiable without string matching on the error message.
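Based on the description above, the exception and guard might look roughly like the following. This is a hypothetical reconstruction: the class name and `actual`/`expected` attributes come from the PR text, while the signatures and the row-counting stand-in are assumptions:

```python
class Gemma4MultimodalPlaceholderMismatch(ValueError):
    """Placeholder/embedding-row count mismatch.

    Hypothetical reconstruction: subclasses ValueError so existing
    `except ValueError` call sites keep working, and carries structured
    actual/expected counts as keyword-only attributes.
    """

    def __init__(self, *, actual: int, expected: int) -> None:
        self.actual = actual
        self.expected = expected
        super().__init__(
            f"Attempted to assign {actual} multimodal embedding rows "
            f"to {expected} placeholders"
        )


def _count_multimodal_embedding_rows(mm_embeds) -> int:
    # Stand-in: total rows across per-image [num_rows, hidden] embeddings.
    return sum(len(e) for e in mm_embeds)


def check_placeholder_counts(mm_embeds, num_placeholders: int) -> None:
    # Sketch of the explicit check at the top of embed_input_ids.
    actual = _count_multimodal_embedding_rows(mm_embeds)
    if actual != num_placeholders:
        raise Gemma4MultimodalPlaceholderMismatch(
            actual=actual, expected=num_placeholders
        )
```

Making `actual`/`expected` keyword-only matches the `..._requires_kwargs` test in the plan below and prevents silently swapping the two counts at a call site.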
Why this is not duplicating an existing PR
Open PRs touching gemma4/multimodal were checked:

- [MM][Gemma4] Support `max_soft_tokens="auto"` to avoid wasting the vision budget on small images #40599. Adds an auto mode that selects `max_soft_tokens` per image; orthogonal to the prompt-side estimator overshooting whatever cap is set.
- [Bugfix][Gemma4] Fix vision fp16 overflow causing `<pad>` output. Different bug (numerical overflow), different layer.
- gemma4: batch multimodal vision processing. Performance work; does not touch the soft-token arithmetic.
- Fix gemma4 core: quantized inference + gguf loading + fused moe gelu_tanh. Unrelated to multimodal.
- … backend. Unrelated.
No open PR addresses the placeholder/encoder soft-token count mismatch.
Test plan

New tests in `tests/models/multimodal/processing/test_gemma4.py` run without loading the real Gemma 4 weights (uses `MagicMock` for `get_hf_config()`):

- `test_compute_num_soft_tokens_does_not_exceed_max_soft_tokens` (4 cases): `(900, 3, 280)`, swapped `(3, 900, 280)`, video-frame budget `(900, 3, 70)`, high cap `(4000, 2, 1120)`
- `test_gemma4_multimodal_placeholder_mismatch_is_value_error`: is a `ValueError` (preserves `except ValueError` call sites)
- `test_gemma4_multimodal_placeholder_mismatch_carries_counts`: `actual`/`expected` attributes + message contains both counts
- `test_gemma4_multimodal_placeholder_mismatch_requires_kwargs`
- `test_count_multimodal_embedding_rows` (4 cases)

All pass locally on Python 3.12 in a `uv` venv.

Backwards compatibility
- `min(…, max_soft_tokens)` is a strict tightening of an existing upper bound; no public API surface changes.
- `Gemma4MultimodalPlaceholderMismatch` is a new public name in `vllm.model_executor.models.gemma4_mm`. It subclasses `ValueError`, so any caller using `except ValueError` continues to catch it.
- `_count_multimodal_embedding_rows` is module-private (leading underscore).
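The second compatibility point can be demonstrated generically; the class below is a stand-in, not the actual vLLM definition, showing that a `ValueError` subclass is caught by unchanged `except ValueError` handlers:

```python
class PlaceholderMismatch(ValueError):
    """Generic stand-in illustrating the compatibility argument."""


def legacy_handler() -> str:
    # A pre-existing call site written against plain ValueError.
    try:
        raise PlaceholderMismatch("289 placeholders, 280 embedding rows")
    except ValueError as err:  # unchanged handler still catches the subclass
        return type(err).__name__


print(legacy_handler())  # PlaceholderMismatch
```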