[v0.13.0][cherry-pick][BugFix] converting pa get_workspace back to capturing #6108
Conversation
Code Review
This pull request refactors the workspace management for paged attention operations during NPU graph capturing. It removes a direct call to _npu_paged_attention_get_workspace in acl_graph.py and introduces a fixed sequence length (SEQ_LEN_WITH_MAX_PA_WORKSPACE) to be used during graph capturing to ensure maximum workspace allocation. While the intention is to streamline workspace handling, there's a potential for correctness issues if the new fixed sequence length doesn't universally cover all edge cases previously handled by the explicit get_workspace call, especially for smaller sequence lengths that might require larger workspaces in specific scenarios.
vllm_ascend/compilation/acl_graph.py (239-256)
The removal of the explicit _npu_paged_attention_get_workspace call, along with its accompanying comment, raises a critical concern. The original comment explicitly stated that this call was necessary to address "rare bugs for FULL_DECODE_ONLY mode with GQA" where "smaller seq_lens might encounter a bigger workspace" than what max_model_len (used in capturing) provides. While model_runner_v1.py now uses SEQ_LEN_WITH_MAX_PA_WORKSPACE during capturing to get the "max workspace", it's unclear if this single fixed value universally resolves the issue for all smaller seq_lens that might have previously required a larger workspace. If the underlying issue of varying workspace requirements for different seq_lens still exists, removing this explicit call could reintroduce or worsen these rare bugs, leading to runtime errors or incorrect behavior due to insufficient workspace during graph replay.
vllm_ascend/worker/model_runner_v1.py (1942-1947)
The comment here explains that _npu_paged_attention_get_workspace "only returns max workspace with specific seq_lens" and SEQ_LEN_WITH_MAX_PA_WORKSPACE is used for capturing. This directly contradicts the rationale for the removed code in acl_graph.py (lines 239-256 in the LEFT diff), which stated that "smaller seq_lens might encounter a bigger workspace" than what max_model_len (used in capturing) provides, necessitating an additional get_workspace call. This suggests an unresolved or re-introduced correctness issue. If SEQ_LEN_WITH_MAX_PA_WORKSPACE does not cover all edge cases where smaller seq_lens previously required a larger workspace, the system will be prone to runtime errors or performance degradation when replaying graphs with insufficient pre-allocated workspace.
vllm_ascend/compilation/acl_graph.py (249)
This change replaces the locally computed workspace with one retrieved from graph_params.workspaces. This is directly linked to the removal of the explicit _npu_paged_attention_get_workspace call in the previous diff. If the graph_params.workspaces (populated during capturing using SEQ_LEN_WITH_MAX_PA_WORKSPACE) does not guarantee sufficient workspace for all seq_lens variations, especially those 'rare cases' where smaller seq_lens needed a larger workspace, this could lead to runtime failures or incorrect results. It's crucial to confirm that SEQ_LEN_WITH_MAX_PA_WORKSPACE is indeed sufficient for all scenarios, or that the underlying bug it addressed has been resolved by other means.
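The allocate-at-capture, reuse-at-replay pattern these comments are debating can be sketched generically. Everything below is hypothetical scaffolding: `get_workspace_size` is a toy stand-in for `_npu_paged_attention_get_workspace` (whose real signature may differ), and `GraphParams` only mimics the role of `graph_params.workspaces`. The toy size function is deliberately non-monotonic in `seq_len` to illustrate the hazard the review raises.

```python
from dataclasses import dataclass, field

def get_workspace_size(seq_len: int) -> int:
    # Toy stand-in for _npu_paged_attention_get_workspace. It is
    # intentionally non-monotonic: small seq_lens get an extra bump,
    # mimicking the GQA corner case described in the review.
    return 1024 * (seq_len // 128) + (4096 if seq_len < 512 else 0)

@dataclass
class GraphParams:
    # Workspace buffers recorded at capture time, keyed per graph.
    workspaces: dict = field(default_factory=dict)

def capture(graph_params: GraphParams, key: str, capture_seq_len: int) -> None:
    # Allocate once, during capture, using the seq_len believed to
    # yield the maximum workspace (e.g. SEQ_LEN_WITH_MAX_PA_WORKSPACE).
    graph_params.workspaces[key] = bytearray(get_workspace_size(capture_seq_len))

def replay(graph_params: GraphParams, key: str, runtime_seq_len: int) -> bytearray:
    # No per-call allocation: reuse the captured workspace. This is
    # only safe if the captured buffer covers the runtime requirement,
    # which is exactly the invariant the review asks to confirm.
    ws = graph_params.workspaces[key]
    assert len(ws) >= get_workspace_size(runtime_seq_len), "insufficient workspace"
    return ws
```

If the invariant ever fails for some runtime `seq_len`, the replay-side assertion is where the "insufficient workspace" failure mode described above would surface.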
vllm_ascend/worker/model_runner_v1.py (131)
The constant SEQ_LEN_WITH_MAX_PA_WORKSPACE = 6144 is a magic number. While the comment in a later diff explains its purpose (to obtain max workspace during graph capturing), the specific value 6144 is not self-explanatory. It's critical to ensure this value is robust and universally yields the maximum required workspace for _npu_paged_attention across all possible seq_lens and configurations. If this value is not precisely the one that triggers the largest workspace allocation, it could lead to insufficient memory during graph replay, causing critical failures. Consider adding a more detailed explanation or a reference to why this specific value is chosen, or if possible, derive it dynamically.
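One way to act on the reviewer's suggestion of deriving the value dynamically would be to scan candidate seq_lens at startup and pick the one that maximizes the workspace query. This is only a sketch: `get_workspace_size` below is a hypothetical, non-monotonic stand-in for `_npu_paged_attention_get_workspace`, and the step/range choices are illustrative.

```python
def get_workspace_size(seq_len: int) -> int:
    # Hypothetical stand-in for _npu_paged_attention_get_workspace;
    # deliberately non-monotonic so a smaller seq_len can need a
    # larger workspace, as the review warns.
    return 1024 * (seq_len // 128) + (4096 if seq_len < 512 else 0)

def find_max_workspace_seq_len(max_model_len: int, step: int = 128) -> int:
    """Scan candidate seq_lens and return the one requiring the largest
    workspace, so a capture-time allocation at that length covers every
    replay-time length up to max_model_len."""
    candidates = range(step, max_model_len + 1, step)
    return max(candidates, key=get_workspace_size)
```

With the toy size function above, a model limited to 512 tokens would pick 384 rather than 512, showing why a fixed constant like 6144 needs verification rather than assumption.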
…turing (vllm-project#6108)

Signed-off-by: Angazenn <supperccell@163.com>
What this PR does / why we need it?
This cherry-picks #5833 .
This helps fix a bug in pa get_workspace. In the earlier implementation, we called `_npu_paged_attention_get_workspace` in `_update_pa_attn_params`. However, this could cause potential memory problems, as it dynamically allocates new workspace memory each time the API is called. Therefore, we move this back to capturing and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get the max workspace.
Does this PR introduce any user-facing change?
How was this patch tested?