
[Bugfix][Gemma 4] Clamp soft-token estimate to max_soft_tokens #40796

Merged

Isotr0py merged 3 commits into vllm-project:main from hnt2601:fix/gemma4-multimodal-crash on May 2, 2026

Conversation

Contributor

@hnt2601 hnt2601 commented Apr 24, 2026

Summary

For extreme-aspect-ratio images (e.g. 3x900), the prompt-side Gemma4ProcessingInfo._compute_num_soft_tokens returned more soft tokens than the HF Gemma 4 image processor's vision tower actually emits (the vision tower output is capped at max_soft_tokens). The mismatch crashed _merge_multimodal_embeddings mid-forward and propagated a ValueError out of EngineCore.step() with:

Attempted to assign 280 multimodal tokens to 289 placeholders

This PR fixes the bug by clamping the prompt-side estimator to max_soft_tokens so the placeholder count always matches what the encoder will emit, and adds a small typed-exception layer in the Gemma 4 model so a future engine-layer change can classify this failure mode without parsing free-form ValueError messages.

Root cause

Gemma4ProcessingInfo._compute_num_soft_tokens computes the target H/W via max(unit, floor(dim * scale / unit) * unit), where unit = patch_size * pooling_kernel_size. For very thin or very tall images, the max(unit, …) lower bound lifts one dimension up to a full unit while the other scales freely. After dividing the resulting area by patch_size**2 * pooling_kernel_size**2, the result can exceed max_soft_tokens.

Concrete repro for image_height=3, image_width=900, max_soft_tokens=280, patch_size=14, pooling_kernel_size=2:

  • scale ≈ 9.02
  • target_h = 28 (floor would give 0; max(unit=28, …) lifts it)
  • target_w = 8092
  • num_patches = (28 // 14) * (8092 // 14) = 1156
  • soft_tokens = 1156 // 4 = 289

The HF Gemma 4 image processor caps its vision-tower output at max_soft_tokens = 280, so the prompt has 289 image placeholders but only 280 embeddings to fill them.
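The arithmetic is easy to reproduce standalone. Below is a minimal sketch of the pre-fix estimator; the function name is made up, and the scale derivation (sizing the target area to hold max_soft_tokens pooled patches) is inferred from the numbers above rather than lifted from the HF code:

import math

def unclamped_soft_tokens(height: int, width: int, *,
                          max_soft_tokens: int = 280,
                          patch_size: int = 14,
                          pooling_kernel_size: int = 2) -> int:
    unit = patch_size * pooling_kernel_size              # 28
    # Scale the image so its area holds max_soft_tokens pooled patches.
    scale = math.sqrt(max_soft_tokens * unit**2 / (height * width))
    # max(unit, ...) lifts a degenerate dimension up to one full unit.
    target_h = max(unit, math.floor(height * scale / unit) * unit)
    target_w = max(unit, math.floor(width * scale / unit) * unit)
    num_patches = (target_h // patch_size) * (target_w // patch_size)
    return num_patches // pooling_kernel_size**2

print(unclamped_soft_tokens(3, 900))   # 289, exceeding the 280 cap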

Changes

1. Clamp at the prompt-side estimator (the fix)

Gemma4ProcessingInfo._compute_num_soft_tokens now returns min(num_patches // (pooling_kernel_size**2), max_soft_tokens). This is a strict tightening of an existing upper bound: any caller that previously got a value ≤ max_soft_tokens is unaffected, and any caller that got a value > max_soft_tokens was already crashing downstream.
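In sketch form (argument names assumed; the real method derives num_patches internally):

def compute_num_soft_tokens(num_patches: int, pooling_kernel_size: int,
                            max_soft_tokens: int) -> int:
    # Never promise more placeholders than the vision tower will fill.
    return min(num_patches // pooling_kernel_size**2, max_soft_tokens)

For the repro above this yields min(289, 280) = 280, matching the encoder output.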

2. Typed exception for the count mismatch (defense in depth)

Adds Gemma4MultimodalPlaceholderMismatch(ValueError) and a small _count_multimodal_embedding_rows helper to gemma4_mm.py, and an explicit count check at the top of Gemma4ForConditionalGeneration.embed_input_ids:

expected = int(is_multimodal.sum().item())
actual = _count_multimodal_embedding_rows(multimodal_embeddings)
if actual != expected:
    raise Gemma4MultimodalPlaceholderMismatch(
        actual=actual, expected=expected
    )

The clamp above means this branch should not fire under normal operation. It exists so that any future regression reintroducing a count mismatch raises a typed ValueError subclass with structured actual / expected attributes — instead of the generic ValueError that _merge_multimodal_embeddings raises after a failing inputs_embeds[is_multimodal] = mm_embeds_flat index-put. Subclassing ValueError keeps existing except ValueError handlers working unchanged.
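For illustration, a minimal sketch of the two pieces, with shapes assumed from the tests described below (the real definitions live in gemma4_mm.py):

import torch

class Gemma4MultimodalPlaceholderMismatch(ValueError):
    def __init__(self, *, actual: int, expected: int) -> None:
        # Keyword-only constructor; structured attributes spare callers
        # from parsing the message.
        self.actual = actual
        self.expected = expected
        super().__init__(
            f"Attempted to assign {actual} multimodal tokens "
            f"to {expected} placeholders")

def _count_multimodal_embedding_rows(multimodal_embeddings) -> int:
    # A bare tensor contributes its row count; a list or tuple of
    # per-item tensors contributes the sum of theirs (0 when empty).
    if isinstance(multimodal_embeddings, torch.Tensor):
        return multimodal_embeddings.shape[0]
    return sum(e.shape[0] for e in multimodal_embeddings)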

Note: the exception is not yet caught at the engine layer — that hardening is a separate follow-up. Until then, the failure mode for a count mismatch is the same as before, but the offending case is identifiable without string matching on the error message.

Why this is not duplicating an existing PR

Open PRs touching gemma4 / multimodal were checked:

No open PR addresses the placeholder/encoder soft-token count mismatch.

Test plan

New tests in tests/models/multimodal/processing/test_gemma4.py run without loading the real Gemma 4 weights (they use MagicMock for get_hf_config()):

.venv/bin/python -m pytest tests/models/multimodal/processing/test_gemma4.py -v

Test pins:

  • test_compute_num_soft_tokens_does_not_exceed_max_soft_tokens (4 cases): the clamp, via the production repro (900,3,280), swapped (3,900,280), video-frame budget (900,3,70), and high cap (4000,2,1120); see the sketch after this list
  • test_gemma4_multimodal_placeholder_mismatch_is_value_error: the exception subclasses ValueError (preserves except ValueError call sites)
  • test_gemma4_multimodal_placeholder_mismatch_carries_counts: actual / expected attributes; message contains both counts
  • test_gemma4_multimodal_placeholder_mismatch_requires_kwargs: constructor is keyword-only
  • test_count_multimodal_embedding_rows (4 cases): helper handles tensor, list, tuple, and empty inputs
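As an illustration, the clamp test can pin the arithmetic without any model weights; this sketch reuses the hypothetical unclamped_soft_tokens helper from the root-cause section above:

import pytest

@pytest.mark.parametrize("height,width,max_soft_tokens", [
    (900, 3, 280),    # production repro
    (3, 900, 280),    # swapped aspect ratio
    (900, 3, 70),     # video-frame budget
    (4000, 2, 1120),  # high cap
])
def test_soft_token_estimate_is_clamped(height, width, max_soft_tokens):
    raw = unclamped_soft_tokens(height, width,
                                max_soft_tokens=max_soft_tokens)
    assert raw > max_soft_tokens            # each case overflowed pre-fix
    assert min(raw, max_soft_tokens) == max_soft_tokens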

All pass locally on Python 3.12 in a uv venv.

Backwards compatibility

  • min(…, max_soft_tokens) is a strict tightening of an existing upper bound; no public API surface changes.
  • Gemma4MultimodalPlaceholderMismatch is a new public name in vllm.model_executor.models.gemma4_mm. It subclasses ValueError, so any caller using except ValueError continues to catch it.
  • _count_multimodal_embedding_rows is module-private (leading underscore).


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the frontend, multi-modality (Related to multi-modality (#4194)), and v1 labels Apr 24, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a typed MultimodalPlaceholderMismatch exception to handle client-input errors in multimodal models, preventing engine crashes and allowing the OpenAI serving layer to return HTTP 400 instead of 500. The changes include updates to EngineCore to catch these errors during the step phase, propagation of error metadata through the output pipeline, and comprehensive regression tests. Feedback was provided to ensure type consistency in EngineCoreOutputs by converting a list of request IDs to a set to match the expected schema.

Comment thread vllm/v1/engine/core.py Outdated
Comment thread vllm/model_executor/models/gemma4_mm.py Outdated
Comment thread tests/models/multimodal/processing/test_gemma4.py
Comment thread tests/models/multimodal/processing/test_gemma4.py
hnt2601 added a commit to hnt2601/vllm that referenced this pull request Apr 27, 2026
Address review feedback on PR vllm-project#40796: replace the MagicMock-based
_make_processing_info helper with build_model_context +
MULTIMODAL_REGISTRY.create_processor, mirroring the existing
test_limit_mm_per_prompt pattern. Drop the now-unused
unittest.mock and Gemma4ProcessingInfo imports.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Hoang Nguyen <118159510+hnt2601@users.noreply.github.com>
@hnt2601 hnt2601 requested a review from Isotr0py April 27, 2026 02:56
Comment thread tests/models/multimodal/processing/test_gemma4.py Outdated
@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 28, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) April 28, 2026 08:07
@github-actions github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 28, 2026
auto-merge was automatically disabled May 2, 2026 01:34

Head branch was pushed to by a user without write access

For extreme-aspect-ratio images (e.g. 3x900), the prompt-side
`Gemma4ProcessingInfo._compute_num_soft_tokens` returned more soft
tokens than the HF Gemma 4 image processor's vision tower actually
emits (which is capped at `max_soft_tokens`). The mismatch caused
`_merge_multimodal_embeddings` to fail with
`Attempted to assign 280 multimodal tokens to 289 placeholders`
mid-forward, propagating `ValueError` out of `EngineCore.step()`.

Fix: clamp the return value to `max_soft_tokens` so the prompt-side
placeholder count matches the encoder output for any aspect ratio.

Adds a parametrized unit test on the arithmetic that pins extreme
aspect ratios (including pan-and-scan paths) without loading the
real HF model.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Hoang Nguyen <118159510+hnt2601@users.noreply.github.com>
@hnt2601
Contributor Author

hnt2601 commented May 2, 2026

@Mergifyio update

@mergify
Contributor

mergify Bot commented May 2, 2026

update

✅ Branch has been successfully updated

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Labels

  • frontend
  • multi-modality Related to multi-modality (#4194)
  • ready ONLY add when PR is ready to merge/full CI is needed
  • v1

Projects

Status: Done

Development


3 participants