[Bugfix][Multi Modal] Fix incorrect Molmo image processing by sangho-vision · Pull Request #26563 · vllm-project/vllm

sangho-vision · 2025-10-10T04:32:29Z

Purpose

Restore correct Molmo outputs by re-introducing image patch re-ordering using image_input_idx.
In PR #12966, the re-ordering that maps image features to their corresponding patch tokens was removed and replaced with a boolean mask (feat_is_patch). That mask only indicates whether each image feature is used as a patch token in the multimodal input sequence, not where each valid patch feature should be placed.

In addition, get_num_image_tokens in MolmoProcessingInfo did not account for the image start/end tokens or the column separator tokens.

This PR fixes that regression by:

Restoring correct patch-token placement using image_input_idx.
Correcting the token-count computation to include start/end and column separator tokens.

FIX #26451

Note
PR #26518 also addresses the patch re-ordering issue, but it does not include the correction for get_num_image_tokens (which ensures image start/end and column separator tokens are properly counted).
This PR provides a more complete fix covering both aspects.

Background

Issue #26451 ("[Bug]: Molmo produces incorrect outputs") reported incorrect pointing results in Molmo.
As discussed in PR #26518 ("[Bugfix][Multi Modal] Fix incorrect output in Molmo"), removing image_input_idx broke the alignment between image features and patch-token positions.
The boolean array feat_is_patch lacks the per-feature index mapping needed to reconstruct the original patch order.
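To make the difference concrete, here is a minimal pure-Python sketch, not the vLLM implementation. The helper names, the toy feature labels, and the convention that a negative index marks an unused feature are all assumptions made for illustration. It shows that a mask can only say which features are kept, while an index also says where each one goes.

```python
def order_with_mask(features, feat_is_patch):
    """Keep the used features in stored order; their target positions are unknown."""
    return [f for f, keep in zip(features, feat_is_patch) if keep]


def order_with_index(features, image_input_idx):
    """Order the used features by the sequence position assigned to each one.

    Assumption for this sketch: a negative index means the feature is unused.
    """
    used = [(pos, f) for f, pos in zip(features, image_input_idx) if pos >= 0]
    return [f for _, f in sorted(used)]


# Toy data: four features stored in an order that differs from sequence order.
features = ["A", "B", "C", "D"]
image_input_idx = [7, -1, 5, 6]  # hypothetical target positions
feat_is_patch = [idx >= 0 for idx in image_input_idx]  # all a mask can express

mask_order = order_with_mask(features, feat_is_patch)
index_order = order_with_index(features, image_input_idx)
print(mask_order, index_order)
```

In this toy case the mask keeps A, C, D in stored order, whereas the index places C, D, A at positions 5, 6, 7. The two agree only when features are already stored in sequence order.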

Changes made

Replace feat_is_patch field in MolmoImageInputs TensorSchema with image_input_idx
Update multimodal field configuration to include image_input_idx in the processing pipeline
Re-order patches back to their correct spatial positions in the final sequence using image_input_idx
Fix get_num_image_tokens to include start/end and column separator tokens in the total count
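The token-count correction can be illustrated with a simplified sketch. It assumes a single rows x cols grid of patch tokens, one column separator token per row, and one start plus one end token. The function name and this layout are hypothetical; the real Molmo preprocessing (crops, pooling) is more involved and is not reproduced here.

```python
def count_image_tokens(rows: int, cols: int, *, include_special: bool = True) -> int:
    """Count tokens for one grid of image patches.

    Assumed layout (illustrative only): rows * cols patch tokens, one column
    separator token closing each row, and a start and an end token.
    """
    patch_tokens = rows * cols
    if not include_special:
        return patch_tokens
    column_tokens = rows  # one separator per row of patches
    start_end_tokens = 2  # image start + image end
    return patch_tokens + column_tokens + start_end_tokens


patches_only = count_image_tokens(12, 12, include_special=False)
full_count = count_image_tokens(12, 12)
print(patches_only, full_count, full_count - patches_only)
```

Under these assumptions, counting only patches under-reports a 12 x 12 grid by 14 tokens (12 separators plus start and end), which is the kind of gap the corrected count closes.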

Test Plan

Used the same test script from #26518:

from vllm import LLM
from vllm.sampling_params import SamplingParams
import requests
from PIL import Image

model = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype='bfloat16',
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(max_tokens=448, temperature=0)

image_url = "https://www.visitscotland.com/binaries/content/gallery/visitscotland/cms-images/2022/06/24/clashnessie-bay-car-road"
image = Image.open(requests.get(image_url, stream=True).raw)

inputs = [{
    "prompt": "Point to the car.",
    "multi_modal_data": {"image": image},
}]

outputs = model.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Test Result

 <point x="69.0" y="48.6" alt="car">car</point>

The coordinates correctly align with the car in the image.
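One way to check such a result against the image is to convert the point into pixel coordinates. The sketch below assumes that x and y are percentages of the image width and height; that interpretation, the helper name, and the 1000 x 500 image size are assumptions for illustration and do not come from this PR.

```python
import re

# Matches the x and y attributes of the first <point ...> tag in the text.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')


def point_to_pixels(text: str, width: int, height: int):
    """Return (px, py) for the first point tag, or None if no tag is found.

    Assumption: x and y are percentages (0-100) of width and height.
    """
    match = POINT_RE.search(text)
    if match is None:
        return None
    x_pct, y_pct = float(match.group(1)), float(match.group(2))
    return round(x_pct / 100 * width), round(y_pct / 100 * height)


output_text = ' <point x="69.0" y="48.6" alt="car">car</point>'
pixel = point_to_pixels(output_text, width=1000, height=500)  # hypothetical size
print(pixel)
```

With the actual image, the width and height would come from image.size, and the resulting pixel could be drawn on the image for a visual check.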

Code Review

This pull request provides a crucial bugfix for Molmo image processing. It correctly restores the patch re-ordering mechanism by re-introducing image_input_idx, which was a regression from a previous change. Additionally, it fixes an issue in get_num_image_tokens to accurately account for special tokens, ensuring correct token count estimation. The changes are well-implemented, logical, and directly address the reported bugs. The code is clear and the fix appears to be complete and correct. I have no further comments.

DarkLight1337 · 2025-10-10T06:12:11Z

Thanks, can you fix pre-commit?

vllm/model_executor/models/molmo.py

sangho-vision · 2025-10-10T23:29:01Z

I fixed the pre-commit.

Signed-off-by: sanghol <sanghol@allenai.org>

DarkLight1337

Thanks, LGTM

gemini-code-assist bot reviewed Oct 10, 2025

sangho-vision mentioned this pull request Oct 10, 2025

[Bug]: Molmo produces incorrect outputs #26451

Closed

1 task

DarkLight1337 reviewed Oct 10, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing

d5afbde

Signed-off-by: sanghol <sanghol@allenai.org>

DarkLight1337 approved these changes Oct 11, 2025

DarkLight1337 enabled auto-merge (squash) October 11, 2025 02:44

DarkLight1337 disabled auto-merge October 11, 2025 02:45

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 11, 2025

DarkLight1337 enabled auto-merge (squash) October 11, 2025 03:19

vllm-bot merged commit 55392bc into vllm-project:main Oct 11, 2025
54 of 56 checks passed

DarkLight1337 mentioned this pull request Oct 11, 2025

[Bugfix][Multi Modal] Fix incorrect output in Molmo #26518

Closed

5 tasks

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing (vllm-proj…

38ab3e8

…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>

sangho-vision deleted the fix_molmo_image_processing branch October 15, 2025 01:38

bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing (vllm-proj…

e316969

…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org> Signed-off-by: bbartels <benjamin@bartels.dev>

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing (vllm-proj…

8246ed2

…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing (vllm-proj…

280c574

…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>

DarkLight1337 mentioned this pull request Oct 26, 2025

[Doc] Remove Molmo warning #27527

Merged

5 tasks

rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing (vllm-proj…

bac2c4f

…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

[Bugfix][Multi Modal] Fix incorrect Molmo image processing (vllm-proj…

dc0c4c7

…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>
