[Bugfix][Multi Modal] Fix incorrect Molmo image processing#26563
[Bugfix][Multi Modal] Fix incorrect Molmo image processing#26563vllm-bot merged 1 commit intovllm-project:mainfrom sangho-vision:fix_molmo_image_processing
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request provides a crucial bugfix for Molmo image processing. It correctly restores the patch re-ordering mechanism by re-introducing image_input_idx, which was a regression from a previous change. Additionally, it fixes an issue in get_num_image_tokens to accurately account for special tokens, ensuring correct token count estimation. The changes are well-implemented, logical, and directly address the reported bugs. The code is clear and the fix appears to be complete and correct. I have no further comments.
|
Thanks, can you fix pre-commit? |
|
I fixed the pre-commit. |
Signed-off-by: sanghol <sanghol@allenai.org>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org> Signed-off-by: bbartels <benjamin@bartels.dev>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>
…ect#26563) Signed-off-by: sanghol <sanghol@allenai.org>
Purpose
Restore correct Molmo outputs by re-introducing image patch re-ordering using
image_input_idx.In PR #12966, the re-ordering that maps image features to their corresponding patch tokens was removed and replaced with a boolean mask (
feat_is_patch). That mask only indicates whether each image feature is used as a patch token in the multimodal input sequence, not where each valid patch feature should be placed.In addition, the method
get_num_image_tokensinMolmoProcessingInfodid not account for image start/end tokens as well as column separator tokens.This PR fixes that regression by:
image_input_idxFIX #26518
Note
PR #26451 also addresses the patch re-ordering issue, but it does not include the correction for
get_num_image_tokens(which ensures image start/end and column tokens are properly counted).This PR provides a more complete fix covering both aspects.
Background
image_input_idxbroke the alignment between image features and patch-token positions.feat_is_patchlacks the per-feature index mapping needed to reconstruct the original patch order.Changes made
feat_is_patchfield inMolmoImageInputsTensorSchema withimage_input_idximage_input_idxin the processing pipelineimage_input_idxget_num_image_tokensto include start/end and column separator tokens in the total countTest Plan
Used the same test script from #26518:
Test Result
The coordinates correctly align with the car in the image.
Related