[BugFix] converting pa get_workspace back to capturing #5833
wangxiyuan merged 1 commit into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the workspace management for paged attention in ACL graphs. It removes a runtime workspace calculation that was implemented as a workaround for a bug in _npu_paged_attention. The new approach relies on using a pre-captured workspace from graph_params.
While this simplifies the code, I have a critical concern about removing the workaround. The original code comments indicated that this safeguard was necessary because smaller sequence lengths could unexpectedly require a larger workspace. My review comment details this concern, highlighting the risk of re-introducing runtime errors if the underlying issue has not been resolved.
vllm_ascend/compilation/acl_graph.py (240-257)
This change removes the runtime workspace calculation for _npu_paged_attention. This calculation was originally added as a workaround for a bug where smaller seq_lens could require a larger workspace than what's allocated during graph capture. The new implementation relies on graph_params.workspaces.get(runtime_shape) to provide the workspace.
The removed TODO comment indicated this workaround should only be removed once _npu_paged_attention is replaced by npu_fused_infer_attention_score. Since _npu_paged_attention is still in use, removing this safeguard is risky unless the underlying bug in torch_npu has been fixed, or the graph capture logic has been updated to determine a sufficiently large workspace for all cases.
Without confirmation that the underlying issue is resolved, this change may re-introduce the original bug, potentially leading to runtime errors.
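The reviewer's concern can be illustrated with a small, self-contained sketch. The names `graph_params`, `workspaces`, and `runtime_shape` follow the review comment above; the cost model in `required_workspace` is a hypothetical stand-in for the reported torch_npu behavior (it does not model the real `_npu_paged_attention` kernel), chosen only to show why a per-shape cached workspace can be undersized when size is not monotonic in sequence length:

```python
# Hypothetical illustration of the reviewed lookup: workspaces are keyed by
# the shape seen at graph-capture time, and replays reuse the cached buffer.
from dataclasses import dataclass, field


@dataclass
class GraphParams:
    # Maps a captured runtime_shape -> pre-allocated workspace size (bytes).
    workspaces: dict = field(default_factory=dict)


def required_workspace(seq_len: int) -> int:
    # Stand-in cost model: in the reported torch_npu bug, a *smaller*
    # seq_len could require a *larger* workspace, so size is not monotonic.
    return 1024 if seq_len >= 128 else 2048


def capture(params: GraphParams, runtime_shape: int, seq_len: int) -> None:
    # Size the workspace once, using the seq_len seen during capture.
    params.workspaces[runtime_shape] = required_workspace(seq_len)


def replay_is_safe(params: GraphParams, runtime_shape: int, seq_len: int) -> bool:
    # graph_params.workspaces.get(runtime_shape) is the lookup the new code
    # relies on; it is only safe if the cached buffer is large enough.
    ws = params.workspaces.get(runtime_shape)
    return ws is not None and ws >= required_workspace(seq_len)


params = GraphParams()
capture(params, runtime_shape=4, seq_len=256)  # captured with a long seq_len
print(replay_is_safe(params, 4, 256))  # True: same seq_len as at capture
print(replay_is_safe(params, 4, 64))   # False: shorter seq needs a bigger buffer
```

Under this (assumed) non-monotonic cost model, a replay with a shorter sequence would overflow the cached workspace, which is exactly the failure mode the removed TODO guarded against.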
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from df842aa to 54c4e59.
…turing (#6108)

### What this PR does / why we need it?
This cherry-picks #5833. It fixes a bug in pa get_workspace. In the earlier implementation, we used `_npu_paged_attention_get_workspace` in `_update_pa_attn_params`. However, this could cause memory problems, as it dynamically allocates new memory for the workspace each time the API is called. Therefore, we move this back to capturing and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get the maximum workspace.

---------

Signed-off-by: Angazenn <supperccell@163.com>
What this PR does / why we need it?
This fixes a bug in pa get_workspace. In the earlier implementation, we used `_npu_paged_attention_get_workspace` in `_update_pa_attn_params`. However, this could cause memory problems, as it dynamically allocates new memory for the workspace each time the API is called. Therefore, we move this back to capturing and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get the maximum workspace.

Does this PR introduce any user-facing change?

How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
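The fix described in this PR, allocating the workspace once at capture time for a fixed maximum sequence length instead of calling a get-workspace API on every parameter update, can be sketched as follows. The constant name `SEQ_LEN_WITH_MAX_PA_WORKSPACE` is from the PR; its value here, the `get_workspace_size` function, and the `PagedAttentionGraph` class are hypothetical stand-ins, not the real torch_npu APIs:

```python
# Minimal sketch, assuming the workspace can be sized from a sequence length
# and that sizing at the fixed maximum seq_len covers all replayed shapes.
SEQ_LEN_WITH_MAX_PA_WORKSPACE = 4096  # fixed upper bound used at capture time


def get_workspace_size(seq_len: int) -> int:
    # Hypothetical stand-in for _npu_paged_attention_get_workspace.
    return 256 * seq_len


class PagedAttentionGraph:
    def __init__(self) -> None:
        # Old approach: size and allocate a workspace on every call inside
        # _update_pa_attn_params, creating fresh memory each time.
        # New approach: allocate once, at capture, for the worst case.
        self.workspace = bytearray(
            get_workspace_size(SEQ_LEN_WITH_MAX_PA_WORKSPACE))

    def update_params(self, seq_len: int) -> bytearray:
        # Replays reuse the captured buffer; no runtime allocation happens.
        assert get_workspace_size(seq_len) <= len(self.workspace)
        return self.workspace


graph = PagedAttentionGraph()
buf_a = graph.update_params(seq_len=128)
buf_b = graph.update_params(seq_len=1024)
print(buf_a is buf_b)  # True: the same pre-captured workspace is reused
```

The design choice is a space-for-safety trade: the buffer is always sized for the worst case, but replays with any supported sequence length reuse one stable allocation, avoiding the dynamic allocations the PR identifies as the source of the memory problems.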