
[Bugfix][Multi Modal] Fix incorrect Molmo token processing#26873

Merged
DarkLight1337 merged 1 commit into vllm-project:main from sangho-vision:fix_molmo_chat
Oct 15, 2025

Conversation

Contributor

@sangho-vision sangho-vision commented Oct 15, 2025

Purpose

When serving a Molmo model online using chat completion, the vLLM code first applies Molmo's chat template to the input text and tokenizes it. It then calls the custom _apply_hf_processor_tokens_only method in MolmoMultiModalProcessor for further processing.
However, _apply_hf_processor_tokens_only internally calls the get_tokens_input method of Molmo's HF processor, which applies the chat template a second time, resulting in double templating.

This behavior can be verified by running the following code:

import torch

from vllm import LLM
from vllm.sampling_params import SamplingParams
from transformers import AutoProcessor

model = LLM(
    model="allenai/Molmo-7B-D-0924",
    tensor_parallel_size=torch.cuda.device_count(),
    trust_remote_code=True,
    dtype='bfloat16',
    gpu_memory_utilization=0.95,
)

processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype="auto",
    device_map="auto",
)

sampling_params = SamplingParams(max_tokens=448, temperature=0)

image_url = "https://www.visitscotland.com/binaries/content/gallery/visitscotland/cms-images/2022/06/24/clashnessie-bay-car-road"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe the image."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            },
        ],
    },
]

outputs = model.chat(messages, sampling_params=sampling_params)
prompt = processor.tokenizer.decode(outputs[0].prompt_token_ids, skip_special_tokens=True)
print(prompt)

This prints:

 User: User: Describe the image. Assistant: Assistant:

The double "User:" and "Assistant:" indicate that the chat template is applied twice.
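The failure mode can be reproduced in isolation with a toy template function. The apply_template below is a hypothetical stand-in for Molmo's real chat template (not vLLM or Molmo code), used only to show why applying the template twice duplicates the role markers:

```python
# Hypothetical stand-in for Molmo's chat template (for illustration only):
# wrap the user text in "User:" / "Assistant:" role markers.
def apply_template(text: str) -> str:
    return f" User: {text} Assistant:"

prompt = "Describe the image."
once = apply_template(prompt)   # what the chat endpoint should produce
twice = apply_template(once)    # what the buggy path effectively produced

print(once)
print(twice)  # role markers are duplicated, matching the bug report
```

Applying the toy template to already-templated text nests the role markers, which is exactly the "User: User: ... Assistant: Assistant:" pattern observed above.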

This PR fixes this double templating issue.

Changes Made

Make sure that _apply_hf_processor_tokens_only uses the "none" message format instead of the model's default configuration:

        # The chat template is already applied to the prompt tokens
        # Use message_format="none" to avoid applying it again
        # Prepend an empty space if `always_start_with_space` is True
        tokens = processor.processor.get_tokens_input(  # type: ignore
            self.info.get_tokenizer().decode(prompt_tokens),
            message_format="none",
            always_start_with_space=processor.always_start_with_space,
        )

Test Plan

Run the same code snippet above.

Test Result

 User: Describe the image. Assistant:

The double templating is resolved.

Signed-off-by: sanghol <sanghol@allenai.org>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug where the Molmo chat template was being applied twice, leading to incorrect model inputs. The fix is simple and effective. My review includes a suggestion to optimize the token processing logic to avoid an inefficient decode-then-encode cycle, which can improve performance for long prompts.

Comment on lines +1267 to 1274

         # The chat template is already applied to the prompt tokens
         # Use message_format="none" to avoid applying it again
         # Prepend an empty space if `always_start_with_space` is True
         tokens = processor.processor.get_tokens_input(  # type: ignore
             self.info.get_tokenizer().decode(prompt_tokens),
-            message_format=processor.message_format,
+            message_format="none",
             always_start_with_space=processor.always_start_with_space,
         )

Severity: high

While this correctly fixes the double-templating issue, calling get_tokens_input here incurs a decode-then-encode cycle for the entire prompt on every request. This can be inefficient for long prompts.

Since the main purpose of this call (with message_format="none") is to enforce the always_start_with_space logic, we can optimize this by inlining a more efficient version of that logic. The suggestion below avoids re-encoding the prompt if it already starts with a space, which can provide a significant performance improvement.

Suggested change

-        # The chat template is already applied to the prompt tokens
-        # Use message_format="none" to avoid applying it again
-        # Prepend an empty space if `always_start_with_space` is True
-        tokens = processor.processor.get_tokens_input(  # type: ignore
-            self.info.get_tokenizer().decode(prompt_tokens),
-            message_format="none",
-            always_start_with_space=processor.always_start_with_space,
-        )
+        tokenizer = self.info.get_tokenizer()
+        # The chat template is already applied. The logic below is an
+        # optimized reimplementation of `processor.get_tokens_input`
+        # with `message_format="none"`. It avoids a decode-encode cycle
+        # if the prompt already starts with a space, improving performance.
+        if processor.always_start_with_space:
+            decoded_prompt = tokenizer.decode(prompt_tokens)
+            if not decoded_prompt.startswith(" "):
+                tokens = tokenizer.encode(" " + decoded_prompt,
+                                          add_special_tokens=False)
+            else:
+                tokens = prompt_tokens
+        else:
+            tokens = prompt_tokens
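The proposed fast path can be sketched in isolation. The ToyTokenizer below is a character-level stand-in for the real HF tokenizer (hypothetical, so the snippet runs without model weights), and ensure_leading_space mirrors the control flow of the suggestion: decode and re-encode only when a leading space actually needs to be prepended:

```python
class ToyTokenizer:
    """Character-level stand-in for the real HF tokenizer (illustration only)."""

    def decode(self, token_ids):
        return "".join(token_ids)

    def encode(self, text, add_special_tokens=False):
        return list(text)


def ensure_leading_space(prompt_tokens, tokenizer, always_start_with_space=True):
    # Fast path: nothing to enforce, return the tokens untouched.
    if not always_start_with_space:
        return prompt_tokens
    decoded = tokenizer.decode(prompt_tokens)
    # Already starts with a space: skip the re-encode entirely.
    if decoded.startswith(" "):
        return prompt_tokens
    # Only this branch pays for the decode-then-encode cycle.
    return tokenizer.encode(" " + decoded, add_special_tokens=False)


tok = ToyTokenizer()
print(ensure_leading_space(list("User: hi"), tok))   # space prepended
print(ensure_leading_space(list(" User: hi"), tok))  # returned unchanged
```

The design point is that the common case (prompt already space-prefixed, or always_start_with_space disabled) never re-encodes, which is where the reviewer expects the savings on long prompts.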

Member

@DarkLight1337 DarkLight1337 left a comment


Thanks for fixing

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 15, 2025 03:08
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 15, 2025
@DarkLight1337 DarkLight1337 merged commit 8865da1 into vllm-project:main Oct 15, 2025
56 checks passed
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
@DarkLight1337 DarkLight1337 mentioned this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
@sangho-vision sangho-vision deleted the fix_molmo_chat branch December 2, 2025 01:09

Labels

ready ONLY add when PR is ready to merge/full CI is needed

3 participants