[Model] Add OpenCUA-7B support#29068
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: lim4349 <rockmanzero@naver.com>
- Add OpenCUAForConditionalGeneration model implementation
- Add OpenCUAConfig for HuggingFace compatibility
- Register OpenCUA in model registry
- Add OpenCUA test example model

OpenCUA-7B is a multimodal model based on Qwen2.5-VL but uses 1D-RoPE instead of M-RoPE for the vision encoder.

Signed-off-by: lim4349 <rockmanzero@naver.com>
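For context, the positional-encoding difference mentioned in this commit can be sketched abstractly. This is illustrative only, not vLLM code: plain 1D-RoPE assigns every token a single sequential index, while Qwen2.5-VL's M-RoPE gives image patches three indices (temporal, height, width).

```python
# Illustrative sketch (not vLLM's implementation) of the index shapes
# behind 1D-RoPE versus M-RoPE.
def rope_1d_positions(num_tokens: int) -> list[int]:
    # 1D-RoPE: every token, text or image patch, gets one sequential index.
    return list(range(num_tokens))


def mrope_positions(grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    # M-RoPE (simplified): each patch of a single image gets a
    # (temporal, height, width) index triple.
    return [(0, h, w) for h in range(grid_h) for w in range(grid_w)]


print(rope_1d_positions(4))   # [0, 1, 2, 3]
print(mrope_positions(2, 2))  # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```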
Code Review
This pull request adds support for the OpenCUA-7B model, which is a multimodal model based on Qwen2.5-VL. The changes include adding the model implementation, configuration, and registering it within the vLLM framework. My review identifies a few high-severity issues related to multimodal token handling and potential non-determinism in embedding generation. Specifically, there are inconsistencies in the image placeholder token used across different parts of the implementation, an incorrect configuration for the video token ID, and a potential for non-deterministic ordering of multimodal embeddings. Addressing these points will improve the correctness and robustness of the model's integration.
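On the non-determinism point, the fix the review suggests can be sketched in a few lines (names are illustrative, not vLLM's actual API): gather embeddings in an order keyed by their position in the prompt rather than by dict insertion order.

```python
# Illustrative sketch: pin the ordering of multimodal embeddings by
# sorting on their prompt position, so the result does not depend on
# dict insertion order.
def order_embeddings(embeds_by_pos: dict[int, str]) -> list[str]:
    return [embeds_by_pos[pos] for pos in sorted(embeds_by_pos)]


# Insertion order here is position 5 first, but output follows prompt order.
print(order_embeddings({5: "img_b", 2: "img_a"}))  # ['img_a', 'img_b']
```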
💡 Codex Review
https://github.com/vllm-project/vllm/blob/89d55f69ea863901244cba6d09ffc88ec1573281/vllm/transformers_utils/configs/opencua.py#L12-L15
OpenCUAConfig is introduced here, but
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Zero <rockmanzero@naver.com>
- Register OpenCUAConfig in _CONFIG_REGISTRY to avoid requiring trust_remote_code
- Remove video support (OpenCUA only supports images)
- Fix embed_multimodal to handle only images
- Add get_language_model() method
- Update image token to use <|media_placeholder|> instead of <|image_pad|>
- Fix code duplication in embed_multimodal method

Signed-off-by: lim4349 <rockmanzero@naver.com>
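The image-token change above can be pictured with a small standalone sketch. The placeholder string matches the commit; the expansion logic is simplified and is not vLLM's real prompt-update machinery:

```python
# Simplified sketch of placeholder expansion: the single
# <|media_placeholder|> token in the prompt is repeated once per image
# feature. Illustrative only; vLLM handles this via its prompt-update
# mechanism, not plain string replacement.
def expand_placeholder(prompt: str, num_repeats: int,
                       placeholder: str = "<|media_placeholder|>") -> str:
    return prompt.replace(placeholder, placeholder * num_repeats, 1)


out = expand_placeholder("Describe <|media_placeholder|> please.", 3)
print(out.count("<|media_placeholder|>"))  # 3
```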
- Call nn.Module.__init__ directly instead of super().__init__
- Initialize language_model with text_config to avoid AttributeError
- Remove duplicate initialization code
- Remove unused head_dim variable

Signed-off-by: lim4349 <rockmanzero@naver.com>
Don't forget to update the model support doc, see: https://docs.vllm.ai/en/latest/models/supported_models/#generative-models_1
- Remove video support from OpenCUA (images only)
- Override get_supported_mm_limits() to return only image modality
- Fix get_hf_config() to use ctx.get_hf_config() without type checking
- Update supported_models.md documentation

Signed-off-by: lim4349 <rockmanzero@naver.com>
@DarkLight1337
@DarkLight1337 My initial thinking was to avoid relying on remote code inside the engine and keep the integration stable even if the upstream repo changes. I also wanted to try implementing the integration myself to better understand how vLLM handles multimodal processors, and explore whether we could later optimize this path more tightly within vLLM. Also, since the OpenCUA team mentioned they were working on vLLM support on the HF side but it seemed to be taking quite a while, I thought it would be helpful (and interesting) to prototype the integration directly here. But if the project prefers relying on trust_remote_code instead, I'm happy to go that route.
I prefer relying on trust_remote_code.
@DarkLight1337 Instead, I’ll keep this PR minimal:
If you’d rather not list it in the supported models table and only keep the test/example, I can also remove the docs change and leave just the registry entry. |
We only have to remove the config. Everything else should be kept intact. |
Remove OpenCUAConfig as it's not needed when using trust_remote_code=True. The config will be loaded from the HuggingFace repository directly.

Signed-off-by: lim4349 <rockmanzero@naver.com>
DarkLight1337
left a comment
Thanks for your patience!
@DarkLight1337 I removed the vLLM-side OpenCUA config and rely on the HF config via trust_remote_code.
The model initialization test is failing on main, let's wait for #29092 to be merged first so we can check this PR |
DarkLight1337
left a comment
Sorry for the delay, let's see if the tests pass now
OpenCUAProcessor does not add placeholders to text in its __call__ method, so we need to override _hf_processor_applies_updates to return False, allowing vLLM to handle prompt updates directly via _get_prompt_updates. This fixes the test error: 'Expected there to be 3 prompt placeholders corresponding to 3 image items, but instead found 0 prompt placeholders!'

Signed-off-by: lim4349 <rockmanzero@naver.com>
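The control flow this commit describes can be mimicked in a few lines. This is purely illustrative; the real logic lives in vLLM's multimodal processor base class:

```python
# Illustrative mimic of the fix: when the HF processor does not insert
# placeholders (_hf_processor_applies_updates returning False), the
# framework applies prompt updates itself before counting placeholders.
PLACEHOLDER = "<|media_placeholder|>"


def build_prompt(text: str, num_images: int, hf_applies_updates: bool) -> str:
    prompt = text
    if not hf_applies_updates:
        # Stand-in for vLLM applying _get_prompt_updates; the real code
        # inserts placeholders at the right positions, not at the end.
        prompt += PLACEHOLDER * num_images
    found = prompt.count(PLACEHOLDER)
    assert found == num_images, (
        f"Expected there to be {num_images} prompt placeholders "
        f"corresponding to {num_images} image items, but instead found "
        f"{found} prompt placeholders!")
    return prompt


build_prompt("What is on the screen?", 3, hf_applies_updates=False)  # passes
```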
Head branch was pushed to by a user without write access
@DarkLight1337
Signed-off-by: lim4349 <rockmanzero@naver.com>
Signed-off-by: Zero <rockmanzero@naver.com>
Co-authored-by: Cloud User <ubuntu@a100-80g-4.novalocal>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Purpose
Add support for OpenCUA-7B, a multimodal model based on Qwen2.5-VL that uses 1D-RoPE instead of M-RoPE for the vision encoder. This PR enables vLLM to load and run inference with the OpenCUA-7B model from HuggingFace (xlangai/OpenCUA-7B).

Test Plan
1. Registry Import Test
✅ PASSED - Model class can be imported and registered successfully
The model is properly registered in the vLLM model registry:
- vllm/model_executor/models/registry.py: added to the _MULTIMODAL_MODELS dictionary
- tests/models/registry.py: added a test example model entry

2. API Inference Test with System Prompt
✅ PASSED - Model successfully processes multimodal inputs via OpenAI-compatible API
Test Command:
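The actual test command is not preserved in the description above. As an illustration only, a chat-completions payload for an image-plus-system-prompt request to an OpenAI-compatible server might be constructed like this (the model name comes from this PR; the image URL and message text are placeholders, and nothing is sent here):

```python
# Illustrative only: builds (does not send) a chat-completions payload
# for an image + system-prompt request. Image URL and prompts are
# placeholders, not from the original test.
import json

payload = {
    "model": "xlangai/OpenCUA-7B",
    "messages": [
        {"role": "system", "content": "You are a GUI agent."},
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},
                {"type": "text", "text": "Click the search button."},
            ],
        },
    ],
    "max_tokens": 128,
}

print(json.dumps(payload, indent=2)[:40])
```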
Test Result:
pyautogui.click(x=1428, y=264)

Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.