[MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention #4349
Conversation
Code Review
This pull request refactors the Ascend-specific implementation of `Qwen2_5_VisionAttention`. Instead of subclassing `Qwen2_5_VisionAttention`, the custom attention logic is now monkey-patched onto it, and weight padding is replaced with activation padding. This removes a large amount of redundant code from `vllm_ascend/models/qwen2_5_vl.py`.
However, the new implementation in `vllm_ascend/patch/worker/patch_qwen2_5_vl.py` introduces a critical bug in the `forward` method due to incorrect state management: instance attributes are modified in a way that causes incorrect behavior on subsequent calls. I've left a detailed comment with a suggested fix for this issue.
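To make the hazard concrete, here is a minimal sketch of the anti-pattern and its fix; the attribute names and the 128-wide alignment are illustrative, not the actual patch code:

```python
import torch
import torch.nn.functional as F


class VisionAttentionSketch:
    """Illustrative only: shows why mutating instance state in forward()
    breaks repeated calls, and the local-variable fix."""

    def __init__(self, head_dim: int = 80):
        self.head_dim = head_dim  # logical head size

    def forward_buggy(self, q: torch.Tensor) -> torch.Tensor:
        # BUG: self.head_dim is overwritten with the padded size, so on the
        # second call pad_len == 0 and the 80-wide input is never padded.
        pad_len = 128 - self.head_dim
        self.head_dim += pad_len           # persists across calls
        return F.pad(q, (0, pad_len))      # pads the last dimension

    def forward_fixed(self, q: torch.Tensor) -> torch.Tensor:
        # FIX: compute the padding into a local; self.head_dim stays 80.
        pad_len = 128 - self.head_dim
        return F.pad(q, (0, pad_len)) if pad_len > 0 else q


attn = VisionAttentionSketch()
q = torch.randn(4, 16, 80)
print(attn.forward_buggy(q).shape)  # torch.Size([4, 16, 128]) -- OK
print(attn.forward_buggy(q).shape)  # torch.Size([4, 16, 80])  -- wrong!
```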
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Force-pushed: ced4ab1 to 3ed998e
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed: 6ee38c9 to 599d3e0
Title changed: "… AscendQwen2_5_VisionAttention and remove redundant code" → "… AscendQwen2_5_VisionAttention"
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed: 3a9acfc to 3e16d66
### What this PR does / why we need it?
Following #4349, remove Qwen2-VL modeling files.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: shen-shanshan <467638484@qq.com>

### What this PR does / why we need it?
Following #4349, remove Qwen3-VL modeling files.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
What this PR does / why we need it?
- [x] Patch `Qwen2_5_VisionAttention` with `AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with `Qwen2_5_VisionTransformer` in vLLM.
- [x] Move the padding logic (q/k/v and cos/sin) before FA into `forward()` of `Qwen2_5_VisionAttention`.
- [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative form to intervals and move it to the CPU (compatible with NPU FA); a sketch follows below.
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embeddings. Find more details at vllm#29388 ([Platform] Make forward context manager pluggable for other device).
- [x] Simplify the padding logic for FA.
- [x] Add a patch for vllm#28798.

Does this PR introduce any user-facing change?
No.
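For illustration, here is a minimal sketch of two checklist items from above, the `cu_seqlens` conversion and the activation padding before FA, assuming a hypothetical 128-aligned head size; the actual patch in `vllm_ascend/patch/worker/patch_qwen2_5_vl.py` may differ in names and details:

```python
import torch
import torch.nn.functional as F

ALIGN = 128  # hypothetical NPU-friendly head size


def cu_seqlens_to_intervals(cu_seqlens: torch.Tensor) -> torch.Tensor:
    # Cumulative form [0, 4, 10, 12] -> per-sequence lengths [4, 6, 2],
    # moved to the CPU as the NPU fused-attention op expects.
    return (cu_seqlens[1:] - cu_seqlens[:-1]).cpu()


def pad_activation(x: torch.Tensor, head_dim: int) -> torch.Tensor:
    # Pad the last (head) dimension of q/k/v (and cos/sin) up to ALIGN.
    # Padding activations per call replaces the previous approach of
    # padding the projection weights once at load time.
    pad_len = ALIGN - head_dim
    return F.pad(x, (0, pad_len)) if pad_len > 0 else x


# The patch itself is a module-attribute swap, roughly:
#   Qwen2_5_VisionAttention.forward = ascend_vision_attention_forward
print(cu_seqlens_to_intervals(torch.tensor([0, 4, 10, 12])))  # tensor([4, 6, 2])
```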
How was this patch tested?
✅ Functional Test - Eager Mode
Run:
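The exact command was not captured on this page; a request of roughly the following shape (hypothetical port and image URL; the model path matches the one reported in the output) exercises the OpenAI-compatible endpoint:

```python
import requests

# Hypothetical server address and image URL; adjust to your deployment.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.png"}},
                {"type": "text", "text": "What is the text in this image?"},
            ],
        }],
    },
    timeout=120,
)
print(resp.json())
```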
Output:
{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}✅ Functional Test - Graph Mode
Run:
Output:
{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}✅ Benchmark
After replacing the whole modeling file with the one in the latest vLLM, we observed a considerable performance improvement.
Before removing:
After removing: