Conversation
Code Review
This pull request updates the versioning policy documentation to include a specific vLLM commit hash for compatibility with the main branch. While this adds precision, the pull request description is empty. More importantly, the corresponding Chinese translation file has not been updated, leading to inconsistent documentation. This should be addressed to ensure all users have correct information.
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|-------------|--------------|------------------|-------------|--------------------|
| main | 5fbfa8d9ef15948599631baeb91e8220b2ee9bcc, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
The compatibility matrix has been updated, but the corresponding Chinese translation file (docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po) was not updated. This will cause a discrepancy and provide outdated information to users relying on the Chinese documentation. Please update the translation files to ensure consistency across all supported languages.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
### What this PR does / why we need it?

Fix vllm break:

1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](vllm-project/vllm#29558)

   Fix solution: add the now-necessary `all2all_backend` parameter. The only impact of this parameter on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.

2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](vllm-project/vllm#30684)

   Fix solution: the GPU does not need to convert qkv to 3D because its flash_attention operator accepts both the 4D and 3D layouts (`b s h d` and `s b (h d)`), while the NPU's flash_attention_unpad operator only supports the 3D layout (`s b (h d)`). Therefore, we need to introduce the reshape_qkv_to_3d operation.

3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following issue after the vLLM code upgrade: vllm-project#5297

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
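The graph-mode interaction described in fix 1 can be sketched as follows. `cudagraph_disabled` is a hypothetical standalone helper written for illustration, not vLLM's actual `set_splitting_ops_for_v1` logic; only the rule it encodes (graph mode is turned off exactly when the `deepep_high_throughput` all2all backend is active) comes from the description above.

```python
def cudagraph_disabled(all2all_backend: str) -> bool:
    """Sketch of the rule described above: vLLM disables graph mode
    only when the deepep_high_throughput all2all backend is enabled.
    Hypothetical helper, not vLLM's real implementation."""
    return all2all_backend == "deepep_high_throughput"

print(cudagraph_disabled("deepep_high_throughput"))  # True
print(cudagraph_disabled("naive"))  # False
```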
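The qkv layout conversion described in fix 2 can be sketched as below. This is a standalone numpy illustration of the `b s h d` → `s b (h d)` reshape that the description attributes to `reshape_qkv_to_3d`; the shapes and the numpy implementation are assumptions for illustration, not the actual torch-based vllm-ascend code.

```python
import numpy as np

def reshape_qkv_to_3d(x: np.ndarray) -> np.ndarray:
    """Collapse a 4D (b, s, h, d) attention input to the 3D
    (s, b, h*d) layout required by the NPU flash_attention_unpad op.
    Illustrative sketch only."""
    b, s, h, d = x.shape
    # (b, s, h, d) -> (s, b, h, d) -> (s, b, h*d)
    return x.transpose(1, 0, 2, 3).reshape(s, b, h * d)

q = np.zeros((2, 16, 8, 64))  # batch=2, seq=16, heads=8, head_dim=64
print(reshape_qkv_to_3d(q).shape)  # (16, 2, 512)
```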