Conversation
Code Review
This pull request updates the versioning policy documentation to include a specific vLLM commit hash for compatibility with the main branch. While this adds precision, the pull request description is empty. More importantly, the corresponding Chinese translation file has not been updated, leading to inconsistent documentation. This should be addressed to ensure all users have correct information.
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|-------------|--------------|------------------|-------------|--------------------|
| main | 5fbfa8d9ef15948599631baeb91e8220b2ee9bcc, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
The compatibility matrix has been updated, but the corresponding Chinese translation file (docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po) was not updated. This will cause a discrepancy and provide outdated information to users relying on the Chinese documentation. Please update the translation files to ensure consistency across all supported languages.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
### What this PR does / why we need it?

Fix vllm break:

1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](vllm-project/vllm#29558)

   Fix solution: add the now-necessary `all2all_backend` parameter. The only impact of this parameter on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.

2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](vllm-project/vllm#30684)

   Fix solution: the GPU does not need to convert qkv to 3D because its flash_attention operator accepts both the 4D and 3D layouts (`b s h d` and `s b (h d)`), while the NPU's flash_attention_unpad operator only supports the 3D layout (`s b (h d)`). Therefore, we need to introduce the reshape_qkv_to_3d operation.

3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following issue after the vLLM code upgrade: vllm-project#5297

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
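The graph-mode interaction described in fix 1 can be sketched as follows. `cudagraph_disabled` is a hypothetical standalone helper written for illustration, not vLLM's actual `set_splitting_ops_for_v1` logic; only the rule it encodes (graph mode is turned off exactly when the `deepep_high_throughput` all2all backend is active) comes from the description above.

```python
def cudagraph_disabled(all2all_backend: str) -> bool:
    """Sketch of the rule described above: vLLM disables graph mode
    only when the deepep_high_throughput all2all backend is enabled.
    Hypothetical helper, not vLLM's real implementation."""
    return all2all_backend == "deepep_high_throughput"

print(cudagraph_disabled("deepep_high_throughput"))  # True
print(cudagraph_disabled("naive"))  # False
```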
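The qkv layout conversion described in fix 2 can be sketched as below. This is a standalone numpy illustration of the `b s h d` → `s b (h d)` reshape that the description attributes to `reshape_qkv_to_3d`; the shapes and the numpy implementation are assumptions for illustration, not the actual torch-based vllm-ascend code.

```python
import numpy as np

def reshape_qkv_to_3d(x: np.ndarray) -> np.ndarray:
    """Collapse a 4D (b, s, h, d) attention input to the 3D
    (s, b, h*d) layout required by the NPU flash_attention_unpad op.
    Illustrative sketch only."""
    b, s, h, d = x.shape
    # (b, s, h, d) -> (s, b, h, d) -> (s, b, h*d)
    return x.transpose(1, 0, 2, 3).reshape(s, b, h * d)

q = np.zeros((2, 16, 8, 64))  # batch=2, seq=16, heads=8, head_dim=64
print(reshape_qkv_to_3d(q).shape)  # (16, 2, 512)
```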