
Conversation

@myselvess
Contributor

@myselvess myselvess commented Aug 18, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as the command(s) used to run the tests.
  • The test results, such as a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.

Purpose

Add support for Ovis2.5.

FIX #23011

Test Plan

Test Result

(Optional) Documentation Update

Updated supported_models.md

Signed-off-by: myselvess <[email protected]>
@myselvess myselvess requested a review from hmellor as a code owner August 18, 2025 06:34
@mergify mergify bot added the documentation (Improvements or additions to documentation) and new-model (Requests to new models) labels Aug 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the ovis2.5 model. The changes include the model implementation, processor, and updates to the model registry and documentation. I've found a few critical issues in the implementation related to tensor manipulation and data processing logic that need to be addressed.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@Magmanat

Should controlling thinking and the thinking budget be in this PR as well?

Example from huggingface:

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

MODEL_PATH = "AIDC-AI/Ovis2.5-9B"
# Controls whether to enable thinking mode.
enable_thinking = True
# NOTE: The thinking budget mechanism is effective only when
# enable_thinking and enable_thinking_budget are both True.
# Setting enable_thinking=True and enable_thinking_budget=False
# enables thinking without budget. In such case the model might
# spend a lot of tokens in the thinking phase and could be slow.
enable_thinking_budget = True
# max_new_tokens is the upper limit for thinking and non-thinking tokens combined.
# MUST ensure that max_new_tokens > thinking_budget + 25
# when using the thinking budget mechanism.
max_new_tokens = 3072
thinking_budget = 2048

# The implementation of thinking budget involves two-phase generation,
# which is incompatible with the official transformers TextIteratorStreamer.
# MUST use this new class for streaming whether thinking budget is used
# or not. See the commented lines below that involve "streamer" for usage.
from transformers import TextIteratorStreamer
class MyTextIteratorStreamer(TextIteratorStreamer):
    def manual_end(self):
        """Flushes any remaining cache and prints a newline to stdout."""
        # Flush the cache, if it exists
        if len(self.token_cache) > 0:
            text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
            printable_text = text[self.print_len :]
            self.token_cache = []
            self.print_len = 0
        else:
            printable_text = ""

        self.next_tokens_are_prompt = True
        self.on_finalized_text(printable_text, stream_end=True)
    
    def end(self):
        pass

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda()

# streamer = MyTextIteratorStreamer(model.text_tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open(requests.get("https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png", stream=True).raw)},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=enable_thinking
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
    # streamer=streamer
)

response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Member

@DarkLight1337 DarkLight1337 left a comment


Thanks for continuing your work on this model! Some initial comments

@JumpingRain

@Isotr0py

@myselvess
Contributor Author

Should controlling thinking and the thinking budget be in this PR as well?


vLLM already supports the enable_thinking parameter, but supporting enable_thinking_budget would require modifications to the vLLM framework itself, so this PR will not support enable_thinking_budget.
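For the online server, enable_thinking can be passed through chat_template_kwargs. A minimal sketch (the server URL and sampling settings are placeholders, and whether thinking is actually disabled depends on the model's chat template honoring the flag):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="AIDC-AI/Ovis2.5-9B",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=128,
    # vLLM-specific extra parameter, forwarded to the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)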

@myselvess myselvess requested a review from ywang96 as a code owner August 18, 2025 08:46
@mergify mergify bot added the multi-modality (Related to multi-modality (#4194)) label Aug 18, 2025
myselvess and others added 2 commits August 18, 2025 16:48
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: myselvess <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: myselvess <[email protected]>
myselvess and others added 3 commits August 18, 2025 18:01
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Member

@Isotr0py Isotr0py left a comment


Both the generation test and the processor test passed locally. LGTM!

Comment on lines +640 to +643
marks=[pytest.mark.skipif(
    not is_flash_attn_2_available(),
    reason="HF model needs `flash_attn` installed"
)],
Member

@myselvess The HF implementation's ViT needs flash_attn installed (https://huggingface.co/AIDC-AI/Ovis2.5-2B/blob/main/modeling_ovis2_5.py#L221-L250), which won't be installed on our CI. Do we have plans to support SDPA as a fallback for it?

Contributor Author

I have already added SDPA support in siglip2navit: https://github.com/myselvess/vllm/blob/ovis2_5_new/vllm/model_executor/models/siglip2navit.py#L257-L281. Do you mean that SDPA is also required in the HF implementation?
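For context, the fallback boils down to dispatching to PyTorch's built-in SDPA when flash_attn is not available. A minimal sketch of that dispatch pattern (not the actual siglip2navit code, which also handles variable-length sequences via cu_seqlens; the tensor layout here is an assumption):

import torch
import torch.nn.functional as F

def vit_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    try:
        from flash_attn import flash_attn_func  # optional dependency, needs fp16/bf16 CUDA tensors
        # flash_attn expects (batch, seq_len, num_heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))
        return out.transpose(1, 2)
    except ImportError:
        # Fall back to PyTorch's scaled_dot_product_attention (SDPA)
        return F.scaled_dot_product_attention(q, k, v)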

Member

Do you mean that SDPA is also required in the HF implementation?

Yes, because we need to run the HF implementation on CI without FA installed for the correctness comparison.

Contributor Author

Do you mean that SDPA is also required in the HF implementation?

Yes, because we need to run the HF implementation on CI without FA installed for the correctness comparison.

OK, I'll ask my colleagues to add this part.

@Isotr0py Isotr0py added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 18, 2025
@Dineshkumar-Anandan-ZS0367

@hmellor @ywang96

Please approve and merge these changes.

@Isotr0py Isotr0py enabled auto-merge (squash) August 19, 2025 06:04
@Magmanat

@myselvess,

Seems like I'm unable to disable thinking using vLLM for Ovis2.5.

@myselvess
Contributor Author

@myselvess,

Seems like I'm unable to disable thinking using vLLM for Ovis2.5.

How did you configure it? @Magmanat

@Magmanat

@myselvess

Sorry about that. For enable_thinking, I managed to get it to work with extra_body: chat_template_kwargs.

However, because I was trying to emulate the thinking-budget behaviour, I edited the chat template to allow assistant prefill, so that I can grab the thinking and prefill my own thinking via online vLLM inference. I can get prefilled thinking to work on other models like MiMo-VL-RL.

Something like:

{
  "role": "assistant",
  "content": "<think> some thought/n</think>"
}

And I will only get the answer as the result.

But for some reason it does not work with Ovis2.5 when I approach it the same way; it will still generate its own thinking again.
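Roughly, the prefill request I am describing looks like this against vLLM's OpenAI-compatible API (a sketch only; the URL and the prefilled thought are placeholders, and it relies on vLLM's add_generation_prompt / continue_final_message extra parameters):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="AIDC-AI/Ovis2.5-9B",
    messages=[
        {"role": "user", "content": "Calculate 17 * 23."},
        # Prefill the assistant turn with my own thinking.
        {"role": "assistant", "content": "<think>\nsome thought\n</think>"},
    ],
    max_tokens=256,
    extra_body={
        # Continue from the prefilled assistant message instead of
        # opening a fresh assistant turn.
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)
print(response.choices[0].message.content)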

This is the example chat_template I used:

{
 *"chat_template"*: "{%- for message in messages %}{{- '<|im_start|>' + message.role + '\n'}}{%- if message.role == 'system' or message.role == 'user' %}{%- if message.content is string %}{{- message.content | replace('<image>', '') | replace('<video>', '') }}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{{- item.text | replace('<image>', '') | replace('<video>', '') }}{%- elif item.type == 'image' %}{{- '<image>'}}{%- elif item.type == 'video' %}{{- '<video>'}}{%- else %}{{- raise_exception('Invalid content type. Supported types for system and user are text, image, video.')}}{%- endif %}{%- if not loop.last %}{{- '\n'}}{%- endif %}{%- endfor %}{%- endif %}{%- elif message.role == 'assistant' %}{%- set content = '' %}{%- if message.content is string %}{%- set content = message.content | replace('<image>', '') | replace('<video>', '') %}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{%- set content = content ~ (item.text | replace('<image>', '') | replace('<video>', '')) %}{%- else %}{{- raise_exception('Invalid content type. Supported type for assistant is text.')}}{%- endif %}{%- endfor %}{%- endif %}{%- if not loop.last %}{%- set content = content.split('</think>')[-1].lstrip('\n') %}{%- endif %}{{- content }}{%- else %}{{- raise_exception('Invalid role. Supported roles are system, user, assistant.')}}{%- endif %}{%- if not (loop.last and message.role == 'assistant') %}{{- '<|im_end|>\n'}}{%- endif %}{%- endfor %}{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}{%- endif %}"
}

Also, I'm not sure whether to keep this discussion in the PR; maybe it would be better as an issue in the vLLM or Ovis GitHub repo.

djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: Duncan Moss <[email protected]>
@myselvess myselvess deleted the ovis2_5_new branch August 26, 2025 02:56
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: Xiao Yu <[email protected]>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
@Chris-Sigopt Chris-Sigopt mentioned this pull request Sep 3, 2025
@yujunhuics

Unable to perform vLLM offline inference.

AssertionError: Failed to apply prompt replacement for mm_items['image'][0]
ERROR 09-09 02:47:19 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.

@DarkLight1337
Member

Can you provide more details? Which version of vLLM and transformers are you using?

@yujunhuics

yujunhuics commented Sep 10, 2025

Can you provide more details? Which version of vLLM and transformers are you using?

Thanks. I installed the latest version of vLLM from source (git clone https://github.com/vllm-project/vllm.git) and used the offline inference example scripts to run offline inference for Ovis2.5.

offline inference demo:

transformers                      4.56.1

The detailed error message is as follows:

armup model) took 87.06 seconds
INFO 09-09 03:55:51 [llm.py:283] Supported_tasks: ['generate']
Adding requests:   0%|                                                       | 0/1 [00:00<?, ?it/s]Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Adding requests:   0%|                                                       | 0/1 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "/xxxxxx/xxx/vllm_offine_infer.py", line 115, in <module>
    run_generate(QUESTION, image_urls)
  File "/xxxxxx/xxx/vllm_offine_infer.py", line 97, in run_generate
    outputs = llm.generate(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 379, in generate
    self._validate_and_add_requests(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1416, in _validate_and_add_requests
    self._add_request(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1434, in _add_request
    self.llm_engine.add_request(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 206, in add_request
    prompt_str, request = self.processor.process_inputs(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/v1/engine/processor.py", line 308, in process_inputs
    processed_inputs: ProcessorInputs = self.input_preprocessor.preprocess(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 881, in preprocess
    return self._process_decoder_only_prompt(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 828, in _process_decoder_only_prompt
    prompt_comps = self._prompt_to_llm_inputs(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 508, in _prompt_to_llm_inputs
    return self._process_text(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 417, in _process_text
    inputs = self._process_multimodal(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 278, in _process_multimodal
    return mm_processor.apply(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1778, in apply
    prompt_ids, prompt, mm_placeholders = self._maybe_apply_prompt_updates(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1730, in _maybe_apply_prompt_updates
    ) = self._apply_prompt_updates(
  File "/xxxxxx/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1652, in _apply_prompt_updates
    assert update_idx is not None, (
AssertionError: Failed to apply prompt replacement for mm_items['image'][0]
ERROR 09-09 03:55:57 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.

It is worth mentioning that inference works normally when the vLLM server is started (online serving).
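For reference, a stripped-down sketch of the kind of offline call I am making (the image path, prompt format, and sampling parameters here are placeholders rather than my exact script, and I am unsure whether trust_remote_code is still required after this PR):

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="AIDC-AI/Ovis2.5-9B",
    trust_remote_code=True,  # may not be needed once the model is natively supported
    limit_mm_per_prompt={"image": 1},
)

image = Image.open("example.jpg")
prompt = "<image>\nDescribe this image."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)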

@DarkLight1337
Member

cc @Isotr0py

@lxiaoxiaoxing

Why doesn't vLLM support the thinking_budget parameter?

@DarkLight1337
Member

Why doesn't vLLM support the thinking_budget parameter?

Please open a separate issue regarding this, unless it is specific to this model.

@lxiaoxiaoxing

Why doesn't vLLM support the thinking_budget parameter?

Please open a separate issue regarding this, unless it is specific to this model.

Yes, thinking_budget is specific to the Ovis2.5 model, but this parameter does not take effect in vLLM.

@DarkLight1337
Member

Thinking budget is not supported by vLLM in general; please comment on #17887 if you want support for this.

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

Labels

documentation: Improvements or additions to documentation
multi-modality: Related to multi-modality (#4194)
new-model: Requests to new models
ready: ONLY add when PR is ready to merge/full CI is needed


Development

Successfully merging this pull request may close these issues.

[Feature]: Support for AIDC-AI/Ovis2.5-9B multimodal

9 participants