
[Bugfix][Gemma 4] Clamp soft-token estimate to max_soft_tokens #40796

Merged

Isotr0py merged 3 commits into vllm-project:main from hnt2601:fix/gemma4-multimodal-crash on May 2, 2026

Conversation

Contributor

@hnt2601 hnt2601 commented Apr 24, 2026

Summary

For extreme-aspect-ratio images (e.g. 3x900), the prompt-side Gemma4ProcessingInfo._compute_num_soft_tokens returned more soft tokens than the HF Gemma 4 image processor's vision tower actually emits (the vision tower output is capped at max_soft_tokens). The mismatch crashed _merge_multimodal_embeddings mid-forward and propagated a ValueError out of EngineCore.step() with:

Attempted to assign 280 multimodal tokens to 289 placeholders

This PR fixes the bug by clamping the prompt-side estimator to max_soft_tokens so the placeholder count always matches what the encoder will emit, and adds a small typed-exception layer in the Gemma 4 model so a future engine-layer change can classify this failure mode without parsing free-form ValueError messages.

Root cause

Gemma4ProcessingInfo._compute_num_soft_tokens computes the target H/W via max(unit, floor(dim * scale / unit) * unit), where unit = patch_size * pooling_kernel_size. For very thin or very tall images, the max(unit, …) lower bound lifts one dimension up to a full unit while the other scales freely. After dividing the resulting area by patch_size**2 * pooling_kernel_size**2, the result can exceed max_soft_tokens.

Concrete repro for image_height=3, image_width=900, max_soft_tokens=280, patch_size=14, pooling_kernel_size=2:

  • scale ≈ 9.02
  • target_h = 28 (floor would give 0; max(unit=28, …) lifts it)
  • target_w = 8092
  • num_patches = (28 // 14) * (8092 // 14) = 1156
  • soft_tokens = 1156 // 4 = 289

The HF Gemma 4 image processor caps its vision-tower output at max_soft_tokens = 280, so the prompt has 289 image placeholders but only 280 embeddings to fill them.
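The arithmetic is easy to reproduce standalone. Below is a minimal sketch of the pre-fix estimator; the function name is made up, and the scale derivation (sizing the target area to hold max_soft_tokens pooled patches) is inferred from the numbers above rather than lifted from the HF code:

import math

def unclamped_soft_tokens(height: int, width: int, *,
                          max_soft_tokens: int = 280,
                          patch_size: int = 14,
                          pooling_kernel_size: int = 2) -> int:
    unit = patch_size * pooling_kernel_size              # 28
    # Scale the image so its area holds max_soft_tokens pooled patches.
    scale = math.sqrt(max_soft_tokens * unit**2 / (height * width))
    # max(unit, ...) lifts a degenerate dimension up to one full unit.
    target_h = max(unit, math.floor(height * scale / unit) * unit)
    target_w = max(unit, math.floor(width * scale / unit) * unit)
    num_patches = (target_h // patch_size) * (target_w // patch_size)
    return num_patches // pooling_kernel_size**2

print(unclamped_soft_tokens(3, 900))   # 289, exceeding the 280 cap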

Changes

1. Clamp at the prompt-side estimator (the fix)

Gemma4ProcessingInfo._compute_num_soft_tokens now returns min(num_patches // (pooling_kernel_size**2), max_soft_tokens). This is a strict tightening of an existing upper bound: any caller that previously got a value ≤ max_soft_tokens is unaffected, and any caller that got a value > max_soft_tokens was already crashing downstream.
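In sketch form (argument names assumed; the real method derives num_patches internally):

def compute_num_soft_tokens(num_patches: int, pooling_kernel_size: int,
                            max_soft_tokens: int) -> int:
    # Never promise more placeholders than the vision tower will fill.
    return min(num_patches // pooling_kernel_size**2, max_soft_tokens)

For the repro above this yields min(289, 280) = 280, matching the encoder output.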

2. Typed exception for the count mismatch (defense in depth)

Adds Gemma4MultimodalPlaceholderMismatch(ValueError) and a small _count_multimodal_embedding_rows helper to gemma4_mm.py, and an explicit count check at the top of Gemma4ForConditionalGeneration.embed_input_ids:

expected = int(is_multimodal.sum().item())
actual = _count_multimodal_embedding_rows(multimodal_embeddings)
if actual != expected:
    raise Gemma4MultimodalPlaceholderMismatch(
        actual=actual, expected=expected
    )

The clamp above means this branch should not fire under normal operation. It exists so that any future regression reintroducing a count mismatch raises a typed ValueError subclass with structured actual / expected attributes — instead of the generic ValueError that _merge_multimodal_embeddings raises after a failing inputs_embeds[is_multimodal] = mm_embeds_flat index-put. Subclassing ValueError keeps existing except ValueError handlers working unchanged.
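For illustration, a minimal sketch of the two pieces, with shapes assumed from the tests described below (the real definitions live in gemma4_mm.py):

import torch

class Gemma4MultimodalPlaceholderMismatch(ValueError):
    def __init__(self, *, actual: int, expected: int) -> None:
        # Keyword-only constructor; structured attributes spare callers
        # from parsing the message.
        self.actual = actual
        self.expected = expected
        super().__init__(
            f"Attempted to assign {actual} multimodal tokens "
            f"to {expected} placeholders")

def _count_multimodal_embedding_rows(multimodal_embeddings) -> int:
    # A bare tensor contributes its row count; a list or tuple of
    # per-item tensors contributes the sum of theirs (0 when empty).
    if isinstance(multimodal_embeddings, torch.Tensor):
        return multimodal_embeddings.shape[0]
    return sum(e.shape[0] for e in multimodal_embeddings)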

Note: the exception is not yet caught at the engine layer — that hardening is a separate follow-up. Until then, the failure mode for a count mismatch is the same as before, but the offending case is identifiable without string matching on the error message.

Why this is not duplicating an existing PR

Open PRs touching gemma4 / multimodal were checked:

No open PR addresses the placeholder/encoder soft-token count mismatch.

Test plan

New tests in tests/models/multimodal/processing/test_gemma4.py run without loading the real Gemma 4 weights (they use MagicMock for get_hf_config()):

.venv/bin/python -m pytest tests/models/multimodal/processing/test_gemma4.py -v

Test pins:

  • test_compute_num_soft_tokens_does_not_exceed_max_soft_tokens (4 cases): the clamp, via the production repro (900,3,280), swapped (3,900,280), video-frame budget (900,3,70), and high cap (4000,2,1120); see the sketch after this list
  • test_gemma4_multimodal_placeholder_mismatch_is_value_error: the exception subclasses ValueError (preserves except ValueError call sites)
  • test_gemma4_multimodal_placeholder_mismatch_carries_counts: actual / expected attributes; message contains both counts
  • test_gemma4_multimodal_placeholder_mismatch_requires_kwargs: constructor is keyword-only
  • test_count_multimodal_embedding_rows (4 cases): helper handles tensor, list, tuple, and empty inputs
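As an illustration, the clamp test can pin the arithmetic without any model weights; this sketch reuses the hypothetical unclamped_soft_tokens helper from the root-cause section above:

import pytest

@pytest.mark.parametrize("height,width,max_soft_tokens", [
    (900, 3, 280),    # production repro
    (3, 900, 280),    # swapped aspect ratio
    (900, 3, 70),     # video-frame budget
    (4000, 2, 1120),  # high cap
])
def test_soft_token_estimate_is_clamped(height, width, max_soft_tokens):
    raw = unclamped_soft_tokens(height, width,
                                max_soft_tokens=max_soft_tokens)
    assert raw > max_soft_tokens            # each case overflowed pre-fix
    assert min(raw, max_soft_tokens) == max_soft_tokens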

All pass locally on Python 3.12 in a uv venv.

Backwards compatibility

  • min(…, max_soft_tokens) is a strict tightening of an existing upper bound; no public API surface changes.
  • Gemma4MultimodalPlaceholderMismatch is a new public name in vllm.model_executor.models.gemma4_mm. It subclasses ValueError, so any caller using except ValueError continues to catch it.
  • _count_multimodal_embedding_rows is module-private (leading underscore).


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the frontend, multi-modality (Related to multi-modality (#4194)), and v1 labels Apr 24, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a typed MultimodalPlaceholderMismatch exception to handle client-input errors in multimodal models, preventing engine crashes and allowing the OpenAI serving layer to return HTTP 400 instead of 500. The changes include updates to EngineCore to catch these errors during the step phase, propagation of error metadata through the output pipeline, and comprehensive regression tests. Feedback was provided to ensure type consistency in EngineCoreOutputs by converting a list of request IDs to a set to match the expected schema.

Comment thread vllm/v1/engine/core.py Outdated
Comment thread vllm/model_executor/models/gemma4_mm.py Outdated
Comment thread tests/models/multimodal/processing/test_gemma4.py
Comment thread tests/models/multimodal/processing/test_gemma4.py
hnt2601 added a commit to hnt2601/vllm that referenced this pull request Apr 27, 2026
Address review feedback on PR vllm-project#40796: replace the MagicMock-based
_make_processing_info helper with build_model_context +
MULTIMODAL_REGISTRY.create_processor, mirroring the existing
test_limit_mm_per_prompt pattern. Drop the now-unused
unittest.mock and Gemma4ProcessingInfo imports.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Hoang Nguyen <118159510+hnt2601@users.noreply.github.com>
@hnt2601 hnt2601 requested a review from Isotr0py April 27, 2026 02:56
Comment thread tests/models/multimodal/processing/test_gemma4.py Outdated
@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 28, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) April 28, 2026 08:07
@github-actions github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 28, 2026
auto-merge was automatically disabled May 2, 2026 01:34

Head branch was pushed to by a user without write access

For extreme-aspect-ratio images (e.g. 3x900), the prompt-side
`Gemma4ProcessingInfo._compute_num_soft_tokens` returned more soft
tokens than the HF Gemma 4 image processor's vision tower actually
emits (which is capped at `max_soft_tokens`). The mismatch caused
`_merge_multimodal_embeddings` to fail with
`Attempted to assign 280 multimodal tokens to 289 placeholders`
mid-forward, propagating `ValueError` out of `EngineCore.step()`.

Fix: clamp the return value to `max_soft_tokens` so the prompt-side
placeholder count matches the encoder output for any aspect ratio.

Adds a parametrized unit test on the arithmetic that pins extreme
aspect ratios (including pan-and-scan paths) without loading the
real HF model.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Hoang Nguyen <118159510+hnt2601@users.noreply.github.com>
@hnt2601
Contributor Author

hnt2601 commented May 2, 2026

@Mergifyio update

@mergify
Contributor

mergify Bot commented May 2, 2026

update

✅ Branch has been successfully updated

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Labels

  • frontend
  • multi-modality Related to multi-modality (#4194)
  • ready ONLY add when PR is ready to merge/full CI is needed
  • v1

Projects

Status: Done

Development


3 participants