[Model] Add OpenCUA-7B support#29068

Merged
DarkLight1337 merged 26 commits into vllm-project:main from lim4349:main
Nov 24, 2025
Conversation

@lim4349
Contributor

@lim4349 lim4349 commented Nov 20, 2025

Purpose

Add support for OpenCUA-7B, a multimodal model based on Qwen2.5-VL that uses 1D-RoPE instead of M-RoPE for the vision encoder. This PR enables vLLM to load and run inference with the OpenCUA-7B model from HuggingFace (xlangai/OpenCUA-7B).
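The key architectural difference can be sketched in isolation (a minimal illustration with hypothetical helper names, not vLLM's actual implementation): M-RoPE assigns each vision token a (temporal, height, width) index triple, while 1D-RoPE simply numbers patches sequentially.

```python
# Minimal sketch (hypothetical helpers, not vLLM internals): contrast the
# position indices produced by 1D-RoPE vs. Qwen2.5-VL-style M-RoPE for a
# grid of vision patches.

def rope_1d_positions(num_patches: int) -> list[int]:
    """1D-RoPE: every patch gets a single sequential position index."""
    return list(range(num_patches))

def mrope_positions(t: int, h: int, w: int) -> list[tuple[int, int, int]]:
    """M-RoPE: every patch gets a (temporal, height, width) index triple."""
    return [(ti, hi, wi)
            for ti in range(t)
            for hi in range(h)
            for wi in range(w)]

# A single 2x2 image frame: 1D-RoPE sees positions 0..3, while M-RoPE
# sees axis-aligned triples.
print(rope_1d_positions(4))      # [0, 1, 2, 3]
print(mrope_positions(1, 2, 2))  # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```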

Test Plan

1. Registry Import Test

PASSED - Model class can be imported and registered successfully

The model is properly registered in the vLLM model registry:

  • vllm/model_executor/models/registry.py: Added to _MULTIMODAL_MODELS dictionary
  • tests/models/registry.py: Added test example model entry

2. API Inference Test with System Prompt

PASSED - Model successfully processes multimodal inputs via OpenAI-compatible API

Test Command:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "xlangai/OpenCUA-7B",
    "messages": [
      {
        "role": "system",
        "content": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task."
      },
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://..."}},
          {"type": "text", "text": "Close file explorer"}
        ]
      }
    ],
    "max_tokens": 512,
    "temperature": 0
  }'

Test Result

  • ✅ Model successfully loads and processes requests
  • ✅ System prompt is correctly handled
  • ✅ Image input is properly processed via multimodal pipeline
  • ✅ Response generation works correctly
  • ✅ Output format matches expected structure (e.g., pyautogui.click(x=1428, y=264))
  • ✅ Input/output format is identical to the official HuggingFace Spaces demo (xlangai/OpenCUA-demo)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the new-model Requests to new models label Nov 20, 2025
lim4349 and others added 12 commits November 20, 2025 13:44
Signed-off-by: lim4349 <rockmanzero@naver.com>
- Add OpenCUAForConditionalGeneration model implementation
- Add OpenCUAConfig for HuggingFace compatibility
- Register OpenCUA in model registry
- Add OpenCUA test example model

OpenCUA-7B is a multimodal model based on Qwen2.5-VL but uses
1D-RoPE instead of M-RoPE for the vision encoder.

Signed-off-by: lim4349 <rockmanzero@naver.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the OpenCUA-7B model, which is a multimodal model based on Qwen2.5-VL. The changes include adding the model implementation, configuration, and registering it within the vLLM framework. My review identifies a few high-severity issues related to multimodal token handling and potential non-determinism in embedding generation. Specifically, there are inconsistencies in the image placeholder token used across different parts of the implementation, an incorrect configuration for the video token ID, and a potential for non-deterministic ordering of multimodal embeddings. Addressing these points will improve the correctness and robustness of the model's integration.

@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/89d55f69ea863901244cba6d09ffc88ec1573281/vllm/transformers_utils/configs/opencua.py#L12-L15
P1 Badge Register OpenCUA config to avoid forced remote code

OpenCUAConfig is introduced here, but _CONFIG_REGISTRY in vllm/transformers_utils/config.py was not updated to map the new model_type ("opencua"). With the default ModelConfig.trust_remote_code=False, the config parser will fall back to AutoConfig.from_pretrained, which rejects unknown configs and raises the usual “requires you to execute the configuration file” error. That means xlangai/OpenCUA-7B cannot be loaded in the new built-in implementation unless users manually enable remote code execution. Please register this config in _CONFIG_REGISTRY so the local class is used and the model loads without trust_remote_code.
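The lookup Codex describes follows a simple registry pattern. Below is a simplified stand-in (stub classes and return values, not the real vllm/transformers_utils/config.py) showing why an unregistered model_type falls through to the remote-code path.

```python
# Simplified stand-in for the config-resolution pattern described above
# (stub classes; not the real vllm/transformers_utils/config.py).

class OpenCUAConfig:
    model_type = "opencua"

_CONFIG_REGISTRY = {
    "opencua": OpenCUAConfig,  # the mapping Codex asks the PR to add
}

def resolve_config(model_type: str, trust_remote_code: bool):
    if model_type in _CONFIG_REGISTRY:
        # A registered model_type uses the local class; no remote code needed.
        return _CONFIG_REGISTRY[model_type]
    if not trust_remote_code:
        # Unregistered configs are rejected, mirroring AutoConfig's error.
        raise ValueError(
            f"Loading {model_type!r} requires trust_remote_code=True"
        )
    return "AutoConfig-with-remote-code"  # fallback path

print(resolve_config("opencua", trust_remote_code=False).__name__)  # OpenCUAConfig
```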

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

lim4349 and others added 3 commits November 20, 2025 13:47
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Zero <rockmanzero@naver.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Zero <rockmanzero@naver.com>
- Register OpenCUAConfig in _CONFIG_REGISTRY to avoid requiring trust_remote_code
- Remove video support (OpenCUA only supports images)
- Fix embed_multimodal to handle only images
- Add get_language_model() method
- Update image token to use <|media_placeholder|> instead of <|image_pad|>
- Fix code duplication in embed_multimodal method

Signed-off-by: lim4349 <rockmanzero@naver.com>
- Call nn.Module.__init__ directly instead of super().__init__
- Initialize language_model with text_config to avoid AttributeError
- Remove duplicate initialization code
- Remove unused head_dim variable

Signed-off-by: lim4349 <rockmanzero@naver.com>
@jeejeelee
Collaborator

Don't forget to update the model support doc, see: https://docs.vllm.ai/en/latest/models/supported_models/#generative-models_1

- Remove video support from OpenCUA (images only)
- Override get_supported_mm_limits() to return only image modality
- Fix get_hf_config() to use ctx.get_hf_config() without type checking
- Update supported_models.md documentation

Signed-off-by: lim4349 <rockmanzero@naver.com>
@lim4349
Contributor Author

lim4349 commented Nov 20, 2025

@DarkLight1337
I added it intentionally to follow the Apache-2.0 license, but I can remove it if the project prefers to omit SPDX headers in new files. Let me know what style you'd like.

@lim4349
Contributor Author

lim4349 commented Nov 20, 2025

@DarkLight1337
You're right that OpenCUA can be loaded via trust_remote_code=True just like other custom models, so the local config isn’t strictly required.

My initial thinking was to avoid relying on remote code inside the engine and keep the integration stable even if the upstream repo changes. I also wanted to try implementing the integration myself to better understand how vLLM handles multimodal processors, and explore whether we could later optimize this path more tightly within vLLM.

Also, since the OpenCUA team mentioned they were working on vLLM support on the HF side but it seemed to be taking quite a while, I thought it would be helpful (and interesting) to prototype the integration directly here.

But if the project prefers relying on trust_remote_code for this type of model (consistent with other integrations), I'm happy to drop the local config and simplify the patch.

@DarkLight1337
Member

DarkLight1337 commented Nov 20, 2025

I prefer relying on trust_remote_code as it frees us from the burden of having to maintain the configs vLLM-side. (Since the config.json file itself still depends on the HF repo)

@lim4349
Contributor Author

lim4349 commented Nov 20, 2025

@DarkLight1337
Given your preference to rely on trust_remote_code, I’ll drop the vLLM-side OpenCUA model implementation and config so we don’t have to maintain a duplicate architecture/config in this repo.

Instead, I’ll keep this PR minimal:

  • add an entry in tests/models/registry.py to validate xlangai/OpenCUA-7B via trust_remote_code=True, and
  • (optionally) keep the documentation row in supported_models.md to show that OpenCUA-7B can be used with vLLM via HF + trust_remote_code.

If you’d rather not list it in the supported models table and only keep the test/example, I can also remove the docs change and leave just the registry entry.
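Shaped as a plain dict, the minimal registry addition looks like the sketch below (hypothetical structure; the real tests/models/registry.py wraps entries in its own example-info objects).

```python
# Hypothetical sketch of the minimal registry addition; the real
# tests/models/registry.py wraps entries in its own example-info objects.
EXAMPLE_MODELS = {
    "OpenCUAForConditionalGeneration": {
        "model": "xlangai/OpenCUA-7B",
        # The HF repo ships the model code, so remote code must be trusted.
        "trust_remote_code": True,
    },
}

entry = EXAMPLE_MODELS["OpenCUAForConditionalGeneration"]
print(entry["model"])  # xlangai/OpenCUA-7B
```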

@DarkLight1337
Member

We only have to remove the config. Everything else should be kept intact.

Remove OpenCUAConfig as it's not needed when using trust_remote_code=True.
The config will be loaded from HuggingFace repository directly.

Signed-off-by: lim4349 <rockmanzero@naver.com>
Member

@DarkLight1337 DarkLight1337 left a comment


Thanks for your patience!

@lim4349
Contributor Author

lim4349 commented Nov 20, 2025

@DarkLight1337
Got it, thanks for clarifying!

I removed the vLLM-side OpenCUA config and now rely on the HF config.json via trust_remote_code, keeping the rest of the integration as-is.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 20, 2025 15:28
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
@DarkLight1337 DarkLight1337 removed the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
@DarkLight1337
Member

The model initialization test is failing on main, let's wait for #29092 to be merged first so we can check this PR

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 23, 2025
Member

@DarkLight1337 DarkLight1337 left a comment


Sorry for the delay, let's see if the tests pass now

OpenCUAProcessor does not add placeholders to text in its __call__ method,
so we need to override _hf_processor_applies_updates to return False,
allowing vLLM to handle prompt updates directly via _get_prompt_updates.

This fixes the test error: 'Expected there to be 3 prompt placeholders
corresponding to 3 image items, but instead found 0 prompt placeholders!'

Signed-off-by: lim4349 <rockmanzero@naver.com>
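The fix above can be sketched in isolation (simplified, hypothetical helper; not the actual vLLM processor code): when the HF processor leaves the prompt untouched, vLLM must expand one placeholder per image item itself, which is what the failing check was counting.

```python
# Simplified sketch (not the actual vLLM processor code) of the prompt
# update vLLM applies when _hf_processor_applies_updates returns False:
# insert one <|media_placeholder|> per image, since OpenCUAProcessor's
# __call__ does not add them.
PLACEHOLDER = "<|media_placeholder|>"

def apply_prompt_updates(prompt: str, num_images: int) -> str:
    """Prepend one image placeholder token per image item."""
    placeholders = "".join(PLACEHOLDER for _ in range(num_images))
    return placeholders + prompt

updated = apply_prompt_updates("Close file explorer", num_images=3)
# 3 placeholders now correspond to 3 image items, satisfying the check.
print(updated.count(PLACEHOLDER))  # 3
```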
auto-merge was automatically disabled November 24, 2025 00:03

Head branch was pushed to by a user without write access

@lim4349
Contributor Author

lim4349 commented Nov 24, 2025

@DarkLight1337
All tests are now passing.
PTAL when you have a moment.

@DarkLight1337 DarkLight1337 merged commit 3085478 into vllm-project:main Nov 24, 2025
51 checks passed
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
Signed-off-by: lim4349 <rockmanzero@naver.com>
Signed-off-by: Zero <rockmanzero@naver.com>
Co-authored-by: Cloud User <ubuntu@a100-80g-4.novalocal>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: lim4349 <rockmanzero@naver.com>
Signed-off-by: Zero <rockmanzero@naver.com>
Co-authored-by: Cloud User <ubuntu@a100-80g-4.novalocal>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
Signed-off-by: lim4349 <rockmanzero@naver.com>
Signed-off-by: Zero <rockmanzero@naver.com>
Co-authored-by: Cloud User <ubuntu@a100-80g-4.novalocal>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

documentation Improvements or additions to documentation new-model Requests to new models ready ONLY add when PR is ready to merge/full CI is needed

4 participants