[Model] Add OpenCUA-7B support#29068
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: lim4349 <rockmanzero@naver.com>
- Add OpenCUAForConditionalGeneration model implementation
- Add OpenCUAConfig for HuggingFace compatibility
- Register OpenCUA in model registry
- Add OpenCUA test example model

OpenCUA-7B is a multimodal model based on Qwen2.5-VL but uses 1D-RoPE instead of M-RoPE for the vision encoder.

Signed-off-by: lim4349 <rockmanzero@naver.com>
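For context, the positional-encoding difference mentioned in this commit can be sketched abstractly. This is illustrative only, not vLLM code: plain 1D-RoPE assigns every token a single sequential index, while Qwen2.5-VL's M-RoPE gives image patches three indices (temporal, height, width).

```python
# Illustrative sketch (not vLLM's implementation) of the index shapes
# behind 1D-RoPE versus M-RoPE.
def rope_1d_positions(num_tokens: int) -> list[int]:
    # 1D-RoPE: every token, text or image patch, gets one sequential index.
    return list(range(num_tokens))


def mrope_positions(grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    # M-RoPE (simplified): each patch of a single image gets a
    # (temporal, height, width) index triple.
    return [(0, h, w) for h in range(grid_h) for w in range(grid_w)]


print(rope_1d_positions(4))   # [0, 1, 2, 3]
print(mrope_positions(2, 2))  # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```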
Code Review
This pull request adds support for the OpenCUA-7B model, which is a multimodal model based on Qwen2.5-VL. The changes include adding the model implementation, configuration, and registering it within the vLLM framework. My review identifies a few high-severity issues related to multimodal token handling and potential non-determinism in embedding generation. Specifically, there are inconsistencies in the image placeholder token used across different parts of the implementation, an incorrect configuration for the video token ID, and a potential for non-deterministic ordering of multimodal embeddings. Addressing these points will improve the correctness and robustness of the model's integration.
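On the non-determinism point, the fix the review suggests can be sketched in a few lines (names are illustrative, not vLLM's actual API): gather embeddings in an order keyed by their position in the prompt rather than by dict insertion order.

```python
# Illustrative sketch: pin the ordering of multimodal embeddings by
# sorting on their prompt position, so the result does not depend on
# dict insertion order.
def order_embeddings(embeds_by_pos: dict[int, str]) -> list[str]:
    return [embeds_by_pos[pos] for pos in sorted(embeds_by_pos)]


# Insertion order here is position 5 first, but output follows prompt order.
print(order_embeddings({5: "img_b", 2: "img_a"}))  # ['img_a', 'img_b']
```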
💡 Codex Review
https://github.com/vllm-project/vllm/blob/89d55f69ea863901244cba6d09ffc88ec1573281/vllm/transformers_utils/configs/opencua.py#L12-L15
OpenCUAConfig is introduced here, but
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Zero <rockmanzero@naver.com>
- Register OpenCUAConfig in _CONFIG_REGISTRY to avoid requiring trust_remote_code
- Remove video support (OpenCUA only supports images)
- Fix embed_multimodal to handle only images
- Add get_language_model() method
- Update image token to use <|media_placeholder|> instead of <|image_pad|>
- Fix code duplication in embed_multimodal method

Signed-off-by: lim4349 <rockmanzero@naver.com>
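The image-token change above can be pictured with a small standalone sketch. The placeholder string matches the commit; the expansion logic is simplified and is not vLLM's real prompt-update machinery:

```python
# Simplified sketch of placeholder expansion: the single
# <|media_placeholder|> token in the prompt is repeated once per image
# feature. Illustrative only; vLLM handles this via its prompt-update
# mechanism, not plain string replacement.
def expand_placeholder(prompt: str, num_repeats: int,
                       placeholder: str = "<|media_placeholder|>") -> str:
    return prompt.replace(placeholder, placeholder * num_repeats, 1)


out = expand_placeholder("Describe <|media_placeholder|> please.", 3)
print(out.count("<|media_placeholder|>"))  # 3
```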
- Call nn.Module.__init__ directly instead of super().__init__
- Initialize language_model with text_config to avoid AttributeError
- Remove duplicate initialization code
- Remove unused head_dim variable

Signed-off-by: lim4349 <rockmanzero@naver.com>
Don't forget to update the model support doc, see: https://docs.vllm.ai/en/latest/models/supported_models/#generative-models_1
- Remove video support from OpenCUA (images only)
- Override get_supported_mm_limits() to return only image modality
- Fix get_hf_config() to use ctx.get_hf_config() without type checking
- Update supported_models.md documentation

Signed-off-by: lim4349 <rockmanzero@naver.com>
@DarkLight1337
@DarkLight1337 My initial thinking was to avoid relying on remote code inside the engine and keep the integration stable even if the upstream repo changes. I also wanted to try implementing the integration myself to better understand how vLLM handles multimodal processors, and explore whether we could later optimize this path more tightly within vLLM. Also, since the OpenCUA team mentioned they were working on vLLM support on the HF side but it seemed to be taking quite a while, I thought it would be helpful (and interesting) to prototype the integration directly here. But if the project prefers relying on trust_remote_code instead, I'm happy to go that route.
I prefer relying on trust_remote_code.
@DarkLight1337 Instead, I’ll keep this PR minimal:
If you’d rather not list it in the supported models table and only keep the test/example, I can also remove the docs change and leave just the registry entry. |
We only have to remove the config. Everything else should be kept intact. |
Remove OpenCUAConfig as it's not needed when using trust_remote_code=True. The config will be loaded from the HuggingFace repository directly.

Signed-off-by: lim4349 <rockmanzero@naver.com>
DarkLight1337
left a comment
Thanks for your patience!
@DarkLight1337 I removed the vLLM-side OpenCUA config and rely on the HF config via trust_remote_code.
The model initialization test is failing on main, let's wait for #29092 to be merged first so we can check this PR |
DarkLight1337
left a comment
Sorry for the delay, let's see if the tests pass now
OpenCUAProcessor does not add placeholders to text in its __call__ method, so we need to override _hf_processor_applies_updates to return False, allowing vLLM to handle prompt updates directly via _get_prompt_updates. This fixes the test error: 'Expected there to be 3 prompt placeholders corresponding to 3 image items, but instead found 0 prompt placeholders!'

Signed-off-by: lim4349 <rockmanzero@naver.com>
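The control flow this commit describes can be mimicked in a few lines. This is purely illustrative; the real logic lives in vLLM's multimodal processor base class:

```python
# Illustrative mimic of the fix: when the HF processor does not insert
# placeholders (_hf_processor_applies_updates returning False), the
# framework applies prompt updates itself before counting placeholders.
PLACEHOLDER = "<|media_placeholder|>"


def build_prompt(text: str, num_images: int, hf_applies_updates: bool) -> str:
    prompt = text
    if not hf_applies_updates:
        # Stand-in for vLLM applying _get_prompt_updates; the real code
        # inserts placeholders at the right positions, not at the end.
        prompt += PLACEHOLDER * num_images
    found = prompt.count(PLACEHOLDER)
    assert found == num_images, (
        f"Expected there to be {num_images} prompt placeholders "
        f"corresponding to {num_images} image items, but instead found "
        f"{found} prompt placeholders!")
    return prompt


build_prompt("What is on the screen?", 3, hf_applies_updates=False)  # passes
```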
Head branch was pushed to by a user without write access
@DarkLight1337
Signed-off-by: lim4349 <rockmanzero@naver.com>
Signed-off-by: Zero <rockmanzero@naver.com>
Co-authored-by: Cloud User <ubuntu@a100-80g-4.novalocal>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Purpose
Add support for OpenCUA-7B, a multimodal model based on Qwen2.5-VL that uses 1D-RoPE instead of M-RoPE for the vision encoder. This PR enables vLLM to load and run inference with the OpenCUA-7B model from HuggingFace (xlangai/OpenCUA-7B).

Test Plan
1. Registry Import Test
✅ PASSED - Model class can be imported and registered successfully
The model is properly registered in the vLLM model registry:
- vllm/model_executor/models/registry.py: added to the _MULTIMODAL_MODELS dictionary
- tests/models/registry.py: added a test example model entry

2. API Inference Test with System Prompt
✅ PASSED - Model successfully processes multimodal inputs via OpenAI-compatible API
Test Command:
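The actual test command is not preserved in the description above. As an illustration only, a chat-completions payload for an image-plus-system-prompt request to an OpenAI-compatible server might be constructed like this (the model name comes from this PR; the image URL and message text are placeholders, and nothing is sent here):

```python
# Illustrative only: builds (does not send) a chat-completions payload
# for an image + system-prompt request. Image URL and prompts are
# placeholders, not from the original test.
import json

payload = {
    "model": "xlangai/OpenCUA-7B",
    "messages": [
        {"role": "system", "content": "You are a GUI agent."},
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},
                {"type": "text", "text": "Click the search button."},
            ],
        },
    ],
    "max_tokens": 128,
}

print(json.dumps(payload, indent=2)[:40])
```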
Test Result:
pyautogui.click(x=1428, y=264)

Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.