[Model] support new model ovis2.5 #23084
Conversation
Signed-off-by: myselvess <[email protected]>
Code Review
This pull request adds support for the ovis2.5 model. The changes include the model implementation, processor, and updates to the model registry and documentation. I've found a few critical issues in the implementation related to tensor manipulation and data processing logic that need to be addressed.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Should controlling thinking and the thinking budget be in this PR as well? Example from Hugging Face:

```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

MODEL_PATH = "AIDC-AI/Ovis2.5-9B"

# Controls whether to enable thinking mode.
enable_thinking = True

# NOTE: The thinking budget mechanism is effective only when
# enable_thinking and enable_thinking_budget are both True.
# Setting enable_thinking=True and enable_thinking_budget=False
# enables thinking without budget. In such case the model might
# spend a lot of tokens in the thinking phase and could be slow.
enable_thinking_budget = True

# max_new_tokens is the upper limit for thinking and non-thinking tokens combined.
# MUST ensure that max_new_tokens > thinking_budget + 25
# when using the thinking budget mechanism.
max_new_tokens = 3072
thinking_budget = 2048

# The implementation of thinking budget involves two-phase generation,
# which is incompatible with the official transformers TextIteratorStreamer.
# MUST use this new class for streaming whether thinking budget is used
# or not. See the commented lines below that involve "streamer" for usage.
from transformers import TextIteratorStreamer

class MyTextIteratorStreamer(TextIteratorStreamer):
    def manual_end(self):
        """Flushes any remaining cache and prints a newline to stdout."""
        # Flush the cache, if it exists
        if len(self.token_cache) > 0:
            text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
            printable_text = text[self.print_len :]
            self.token_cache = []
            self.print_len = 0
        else:
            printable_text = ""
        self.next_tokens_are_prompt = True
        self.on_finalized_text(printable_text, stream_end=True)

    def end(self):
        pass

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda()

# streamer = MyTextIteratorStreamer(model.text_tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open(requests.get("https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png", stream=True).raw)},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=enable_thinking
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
    # streamer=streamer
)

response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
DarkLight1337 left a comment
Thanks for continuing your work on this model! Some initial comments
vLLM already supports the enable_thinking parameter, but supporting enable_thinking_budget would require modifications to the vLLM framework itself. This PR will not support enable_thinking_budget.
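For context, the existing enable_thinking support in vLLM is driven through chat-template kwargs. Below is a minimal sketch of how a client might toggle it, assuming an already-running OpenAI-compatible vLLM server and that the Ovis2.5 chat template honours the flag; the model name and server URL are placeholders.

```python
# Hypothetical illustration: toggling enable_thinking through vLLM's
# OpenAI-compatible server via chat_template_kwargs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="AIDC-AI/Ovis2.5-9B",  # placeholder model name
    messages=[{"role": "user", "content": "Describe the image."}],
    extra_body={
        # Forwarded to the chat template; takes effect only if the
        # template defines an enable_thinking switch.
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(response.choices[0].message.content)
```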
… Siglip2Attention Signed-off-by: myselvess <[email protected]>
Co-authored-by: Isotr0py <[email protected]> Signed-off-by: myselvess <[email protected]>
Co-authored-by: Isotr0py <[email protected]> Signed-off-by: myselvess <[email protected]>
Co-authored-by: Isotr0py <[email protected]> Signed-off-by: myselvess <[email protected]>
Signed-off-by: myselvess <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Isotr0py left a comment
Both generation test and processor test passed locally. LGTM!
```python
marks=[pytest.mark.skipif(
    not is_flash_attn_2_available(),
    reason="HF model needs `flash_attn` installed"
)],
```
@myselvess The HF implementation's ViT needs flash_attn installed (https://huggingface.co/AIDC-AI/Ovis2.5-2B/blob/main/modeling_ovis2_5.py#L221-L250), which won't be installed on our CI. Do we have plans to support SDPA as a fallback for it?
I have supported SDPA in siglip2navit: https://github.com/myselvess/vllm/blob/ovis2_5_new/vllm/model_executor/models/siglip2navit.py#L257-L281. Do you mean that SDPA is also required in the HF implementation?
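For reference, an SDPA attention path as a fallback when flash_attn is unavailable can be sketched as follows; this is a simplified, standalone illustration, not the actual siglip2navit code.

```python
# Minimal sketch of an SDPA attention fallback (illustrative only);
# shapes are simplified and no attention mask or dropout is applied.
import torch
import torch.nn.functional as F


def sdpa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   num_heads: int) -> torch.Tensor:
    # q/k/v: (batch, seq_len, hidden_size)
    b, s, h = q.shape
    head_dim = h // num_heads
    # Reshape to (batch, num_heads, seq_len, head_dim) as expected by SDPA.
    q = q.view(b, s, num_heads, head_dim).transpose(1, 2)
    k = k.view(b, s, num_heads, head_dim).transpose(1, 2)
    v = v.view(b, s, num_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)
    # Merge heads back into the hidden dimension.
    return out.transpose(1, 2).reshape(b, s, h)
```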
> Do you mean that SDPA is also required in the HF implementation?

Yes, because we need to run the HF implementation on CI without FA installed for correctness comparison.
> Do you mean that SDPA is also required in the HF implementation?
>
> Yes, because we need to run the HF implementation on CI without FA installed for correctness comparison.

OK, I'll tell my colleagues to add this part.
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: myselvess <[email protected]>
It seems like I'm unable to disable thinking using vLLM for Ovis.
How did you configure it? @Magmanat |
Sorry about that. However, because I was trying to emulate the max_thinking_budget behaviour, I edited the chat template to allow assistant prefill so that I can grab the thinking and prefill my own thinking via online vLLM inference. I can get prefill thinking to work on other models like MiMo-VL-RL. Something like: and I will only get the answer as the result. But for some reason it does not work with Ovis2.5 when I approached it the same way: it will still generate its own thinking again. This is the example chat_template I used. Also, I'm not sure whether to keep the discussion in this PR; maybe it is better as an issue in vLLM or on the Ovis GitHub.
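For illustration, a minimal sketch of that kind of assistant-prefill request against vLLM's OpenAI-compatible server might look like the following. This is not the commenter's actual snippet: the model name and the `<think>` tags are assumptions, and whether the prefill is respected depends on the model's chat template.

```python
# Hypothetical sketch of assistant prefill: the final assistant message
# is continued instead of starting a new turn, so the prefilled
# "thinking" block is treated as already generated.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="AIDC-AI/Ovis2.5-9B",  # placeholder model name
    messages=[
        {"role": "user", "content": "Calculate 17 + 25."},
        # Prefilled assistant turn containing an already-closed thinking block.
        {"role": "assistant", "content": "<think>17 + 25 = 42.</think>"},
    ],
    extra_body={
        # Continue the last assistant message rather than opening a new turn.
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)
print(response.choices[0].message.content)
```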
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Signed-off-by: Duncan Moss <[email protected]>
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Unable to perform vLLM offline inference: `AssertionError: Failed to apply prompt replacement for mm_items['image'][0]`
Can you provide more details? Which version of vLLM and transformers are you using?
Thanks. I use `git clone https://github.com/vllm-project/vllm.git` to install the latest version of vLLM, and I use the offline inference example scripts to perform offline inference for Ovis2.5. Offline inference demo:
The detailed error message is as follows. It is worth mentioning that the vLLM server works normally when started.
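For reference, a generic sketch of vLLM's offline multimodal inference API is shown below. This is not the reporter's actual script: the image path is a placeholder and the `<image>` prompt format is an assumption that may not match Ovis2.5's expected chat template.

```python
# Generic sketch of offline multimodal inference with vLLM
# (illustrative only; prompt format and paths are assumptions).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="AIDC-AI/Ovis2.5-9B", trust_remote_code=True)

image = Image.open("example.png")  # placeholder image path
outputs = llm.generate(
    {
        "prompt": "<image>\nCalculate the sum of the numbers in the middle box.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)
```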
cc @Isotr0py
Why doesn't vLLM support the thinking_budget parameter?
Please open a separate issue regarding this, unless it is specific to this model |
Yes, thinking_budget is specific to the Ovis2.5 model, but this parameter does not take effect in vLLM.
Thinking budget is not supported in general by vLLM; please comment on #17887 if you want support for this.
Signed-off-by: myselvess <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
Adding support for the Ovis2.5 model.
FIX #23011
Test Plan
Test Result
(Optional) Documentation Update
Updated supported_models.md