[CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic#29873
[CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic#29873vllm-bot merged 15 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Code Review
The pull request successfully refactors the apply_rotary_emb functionality into a CustomOp class, unifying dispatch logic and removing redundant definitions across several files. This is a positive step towards better modularity and extensibility. However, there are a few critical issues that need to be addressed to ensure correctness and prevent runtime errors.
ProExpertProg
left a comment
There was a problem hiding this comment.
Why is this necessary - there's already a RotaryEmbedding custom op class?
|
I have also tested this PR together with vllm-project/vllm-ascend#4667 on Ascend NPU. |
|
@ProExpertProg - mind following up on this? |
ProExpertProg
left a comment
There was a problem hiding this comment.
Nice dispatch cleanup, approving so that it's not blocked while I'm gone for the next 2 weeks but please address comments!!
Really thanks for your review. I will address the comments and fix CI errors recently. |
c18f8a6 to
15797ae
Compare
apply_rotary_emb as CustomOp and unify the dispatch logic|
@shen-shanshan Can you check if there are performance regression. When you change to use customops, it will by default run the |
|
@DarkLight1337 @ProExpertProg
|
|
Last time I optimized the ViT rotary embedding before so the best config that we want is
I tested the the new I am getting error on ROCm. In the mrope despite calling its own Let's make sure that the
Related comments https://github.com/vllm-project/vllm/pull/29873/files#r2605818486 |
There was a problem hiding this comment.
It is breaking on ROCm, and I think it also breaks CUDA when we enable custom_ops +apply_rotary_emb
The bug is not captured because we don't have tests for this new custom op, even on CUDA.
@shen-shanshan can you add unit tests? Especially a small model's unit tests.
Because we want to replace some plugin-device ops here, such as https://github.com/vllm-project/vllm-ascend/pull/4667/changes#diff-2d5d0eeaa29da1fddb16a845a92692c9277676ed31cf60604eccbc3bb0ed764aR441-R480. |
Ok, I will add related UT later. |
|
@shen-shanshan Thanks for going through my long reviews. Ping me when you are ready. |
|
Documentation preview: https://vllm--29873.org.readthedocs.build/en/29873/ |
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Does this improvement comes from using enforce custom VIT? or it is using |
I think that it's just because of the random deviation? Let me test again based on the newest commit. |
|
@tjtanaa I tested again with the default parameters and the same command. Not sure what makes the performance better. Now the behavior of forward should be in our expectation. Maybe the performance difference is within the noise of the benchmark. Before: After: |
|
The amd CI has failed seems something unrelated to this PR. |
…patch (#4667) ### What this PR does / why we need it? Following vllm-project/vllm#29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-In struct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is writ ten in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio": null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"tok en_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: shen-shanshan <467638484@qq.com>
…logic (vllm-project#29873) Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…logic (vllm-project#29873) Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
…patch (vllm-project#4667) ### What this PR does / why we need it? Following vllm-project/vllm#29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-In struct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is writ ten in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio": null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"tok en_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…patch (vllm-project#4667) ### What this PR does / why we need it? Following vllm-project/vllm#29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-In struct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is writ ten in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio": null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"tok en_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Purpose
apply_rotary_embfunction by using pre-computed cos/sin cache, like: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen2_5_vl.py#L383-L385. This is just a function, not an operation, and cannot be overwritten by some plugins (e.g., vllm-ascend). By extracting it as an CustomOp, we can just extend this class and implement ourforward_oot()function, after which we register it to easily replace this op with other op of OOT device.apply_rotary_embwhich may make users or developers confused, such as https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L56-L70 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L73-L97. This PR has unified these separate dispatch logic into one CustomOp to make it clearer.rotate_half()andapply_rotary_emb_torch(). This PR has removed these replicated definitions in different modeling files and just remained one replica in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py.Recent updates:
find_spec("flash_attn")to__init__()ofApplyRotaryEmb.enable_fp32_computeparam toApplyRotaryEmbto determine whether convert input to float and recover it after computation inside the OP.ApplyRotaryEmbinXxxRope. Make sureforward_native()->forward_native(),forward_cuda() / forward_hip()->forward().ApplyRotaryEmbin ViT part to avoid extra configs when lauching the server. Find more details at [CustomOp] Support object-level enable for CustomOp #30547.Test Plan
Test Result
✅ Functional test on Ascend A2 NPU
I have tested this PR on Ascend A2 NPU together with vllm-project/vllm-ascend#4667.
Run:
Output:
{"id":"chatcmpl-9ab4de23690c85aa","object":"chat.completion","created":1764748509,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image reads \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}✅ Functional test on NVIDIA A100 GPU
Run:
Output:
✅ Benchmark on NVIDIA A100 GPU
Thanks for @gcanlin doing this benchmark on NVIDIA A100 GPU and adding UT for testing dispatching logic.
Run:
Before this PR:
After this PR:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Signed-off-by: shen-shanshan 467638484@qq.com
Co-authored-by: gcanlin canlinguosdu@gmail.com