Merged
7 changes: 5 additions & 2 deletions vllm/model_executor/models/molmo.py
```diff
@@ -1264,13 +1264,16 @@ def _apply_hf_processor_tokens_only(
     ) -> list[int]:
         processor = self.info.get_hf_processor()

-        # Apply the chat template to the tokens
+        # The chat template is already applied to the prompt tokens
+        # Use message_format="none" to avoid applying it again
+        # Prepend an empty space if `always_start_with_space` is True
         tokens = processor.processor.get_tokens_input(  # type: ignore
             self.info.get_tokenizer().decode(prompt_tokens),
-            message_format=processor.message_format,
+            message_format="none",
             always_start_with_space=processor.always_start_with_space,
         )
```
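As context for this fix: applying a chat template to a prompt that was already templated wraps it twice. A minimal, self-contained sketch of the failure mode — `apply_template` below is a toy stand-in, not Molmo's actual processor:

```python
def apply_template(text, message_format):
    # Toy chat template (hypothetical): wraps text in role markers
    # unless message_format is "none".
    if message_format == "none":
        return text
    return f"User: {text} Assistant:"


prompt = apply_template("Hello", "role")    # templated once upstream
doubled = apply_template(prompt, "role")    # bug: template applied again
untouched = apply_template(prompt, "none")  # fix: no second wrap
print(doubled)    # User: User: Hello Assistant: Assistant:
print(untouched)  # User: Hello Assistant:
```

Passing `message_format="none"` plays the same role here as in the patched call: it tells the processor the prompt is already fully templated.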
Comment on lines +1267 to 1274
Contributor
Severity: high

While this correctly fixes the double-templating issue, calling `get_tokens_input` here incurs a decode-then-encode cycle for the entire prompt on every request. This can be inefficient for long prompts.

Since the main purpose of this call (with `message_format="none"`) is to enforce the `always_start_with_space` logic, we can optimize this by inlining a more efficient version of that logic. The suggestion below avoids re-encoding the prompt if it already starts with a space, which can provide a significant performance improvement.

Suggested change

```diff
-        # The chat template is already applied to the prompt tokens
-        # Use message_format="none" to avoid applying it again
-        # Prepend an empty space if `always_start_with_space` is True
-        tokens = processor.processor.get_tokens_input(  # type: ignore
-            self.info.get_tokenizer().decode(prompt_tokens),
-            message_format="none",
-            always_start_with_space=processor.always_start_with_space,
-        )
+        tokenizer = self.info.get_tokenizer()
+        # The chat template is already applied. The logic below is an
+        # optimized reimplementation of `processor.get_tokens_input`
+        # with `message_format="none"`. It avoids a decode-encode cycle
+        # if the prompt already starts with a space, improving performance.
+        if processor.always_start_with_space:
+            decoded_prompt = tokenizer.decode(prompt_tokens)
+            if not decoded_prompt.startswith(" "):
+                tokens = tokenizer.encode(" " + decoded_prompt,
+                                          add_special_tokens=False)
+            else:
+                tokens = prompt_tokens
+        else:
+            tokens = prompt_tokens
```
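The fast-path logic in the suggestion can be illustrated in isolation. Below is a sketch using a toy character-level tokenizer — `ToyTokenizer` and `apply_leading_space` are hypothetical names for illustration, not part of vLLM or the HF tokenizer API:

```python
class ToyTokenizer:
    """Toy character-level tokenizer: one token per character."""

    def decode(self, token_ids):
        return "".join(token_ids)

    def encode(self, text, add_special_tokens=False):
        return list(text)


def apply_leading_space(tokenizer, prompt_tokens, always_start_with_space):
    """Mirror of the suggested optimization: re-encode only when a
    leading space actually has to be prepended."""
    if not always_start_with_space:
        return prompt_tokens
    decoded = tokenizer.decode(prompt_tokens)
    if decoded.startswith(" "):
        # Fast path: prompt already starts with a space, so the
        # original token ids can be reused without re-encoding.
        return prompt_tokens
    return tokenizer.encode(" " + decoded, add_special_tokens=False)


tok = ToyTokenizer()
print(apply_leading_space(tok, list("hi"), True))   # [' ', 'h', 'i']
print(apply_leading_space(tok, list(" hi"), True))  # [' ', 'h', 'i']
```

In the second call the input tokens are returned unchanged, which is exactly the case the suggestion optimizes: for most templated prompts the decoded text already begins with a space, so the (potentially long) prompt is never re-encoded.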


```diff
         # Prepend a BOS token id to the tokens
         processed_data = self.info.ctx.call_hf_processor(
             processor,  # type: ignore
             dict(tokens=tokens),
```