
[v0.13.0][cherry-pick][BugFix]converting pa get_workspace back to capturing#6108

Merged
wangxiyuan merged 2 commits into vllm-project:releases/v0.13.0 from Angazenn:pa_fix_dev on Jan 22, 2026

Conversation

@Angazenn Angazenn (Collaborator) commented on Jan 21, 2026

What this PR does / why we need it?

This cherry-picks #5833 .

This fixes a bug in pa get_workspace. In the earlier implementation, we called `_npu_paged_attention_get_workspace` inside `_update_pa_attn_params`. However, this could cause memory problems because the API dynamically allocates new workspace memory on every call. Therefore, we move workspace allocation back to graph capturing and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to obtain the maximum workspace.
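The change described above can be sketched as follows. This is a hypothetical illustration, not the actual vllm-ascend code: `fake_get_workspace`, `GraphParams`, and the method names are stand-ins for `_npu_paged_attention_get_workspace` and the graph-capture machinery, showing why a single capture-time allocation avoids per-update allocations during replay.

```python
# Hypothetical sketch (not the real vllm-ascend implementation): moving
# workspace allocation from per-call runtime updates to one-time graph capture.

# Fixed seq_len assumed (per the PR) to yield the maximum workspace size.
SEQ_LEN_WITH_MAX_PA_WORKSPACE = 6144

def fake_get_workspace(seq_len: int) -> bytearray:
    # Stand-in for _npu_paged_attention_get_workspace: the real op's
    # workspace size depends on seq_len in an op-specific way. Here we
    # just model it as linear in seq_len.
    return bytearray(64 * seq_len)

class GraphParams:
    """Toy holder for per-graph state that is captured once, replayed many times."""

    def __init__(self):
        self.workspaces = {}

    def capture(self, graph_key: str) -> None:
        # Old approach: call get_workspace on every parameter update,
        # allocating fresh memory each time (the reported bug).
        # New approach: allocate the maximum workspace once at capture time.
        self.workspaces[graph_key] = fake_get_workspace(
            SEQ_LEN_WITH_MAX_PA_WORKSPACE)

    def update_pa_attn_params(self, graph_key: str, seq_len: int) -> bytearray:
        # Replay path: reuse the pre-captured workspace; no new allocation,
        # regardless of the actual seq_len of this batch.
        return self.workspaces[graph_key]

params = GraphParams()
params.capture("decode_graph")
ws = params.update_pa_attn_params("decode_graph", seq_len=128)
print(len(ws))  # size of the one pre-allocated max workspace
```

The key property is that `update_pa_attn_params` never allocates: every replay reuses the buffer sized for the worst case chosen at capture time.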

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors the workspace management for paged attention operations during NPU graph capturing. It removes a direct call to _npu_paged_attention_get_workspace in acl_graph.py and introduces a fixed sequence length (SEQ_LEN_WITH_MAX_PA_WORKSPACE) to be used during graph capturing to ensure maximum workspace allocation. While the intention is to streamline workspace handling, there's a potential for correctness issues if the new fixed sequence length doesn't universally cover all edge cases previously handled by the explicit get_workspace call, especially for smaller sequence lengths that might require larger workspaces in specific scenarios.

I am having trouble creating individual review comments; my feedback is listed below.

vllm_ascend/compilation/acl_graph.py (239-256)

critical

The removal of the explicit _npu_paged_attention_get_workspace call, along with its accompanying comment, raises a critical concern. The original comment explicitly stated that this call was necessary to address "rare bugs for FULL_DECODE_ONLY mode with GQA" where "smaller seq_lens might encounter a bigger workspace" than what max_model_len (used in capturing) provides. While model_runner_v1.py now uses SEQ_LEN_WITH_MAX_PA_WORKSPACE during capturing to get the "max workspace", it's unclear if this single fixed value universally resolves the issue for all smaller seq_lens that might have previously required a larger workspace. If the underlying issue of varying workspace requirements for different seq_lens still exists, removing this explicit call could reintroduce or worsen these rare bugs, leading to runtime errors or incorrect behavior due to insufficient workspace during graph replay.

vllm_ascend/worker/model_runner_v1.py (1942-1947)

critical

The comment here explains that _npu_paged_attention_get_workspace "only returns max workspace with specific seq_lens" and SEQ_LEN_WITH_MAX_PA_WORKSPACE is used for capturing. This directly contradicts the rationale for the removed code in acl_graph.py (lines 239-256 in the LEFT diff), which stated that "smaller seq_lens might encounter a bigger workspace" than what max_model_len (used in capturing) provides, necessitating an additional get_workspace call. This suggests an unresolved or re-introduced correctness issue. If SEQ_LEN_WITH_MAX_PA_WORKSPACE does not cover all edge cases where smaller seq_lens previously required a larger workspace, the system will be prone to runtime errors or performance degradation when replaying graphs with insufficient pre-allocated workspace.

vllm_ascend/compilation/acl_graph.py (249)

high

This change replaces the locally computed workspace with one retrieved from graph_params.workspaces. This is directly linked to the removal of the explicit _npu_paged_attention_get_workspace call in the previous diff. If the graph_params.workspaces (populated during capturing using SEQ_LEN_WITH_MAX_PA_WORKSPACE) does not guarantee sufficient workspace for all seq_lens variations, especially those 'rare cases' where smaller seq_lens needed a larger workspace, this could lead to runtime failures or incorrect results. It's crucial to confirm that SEQ_LEN_WITH_MAX_PA_WORKSPACE is indeed sufficient for all scenarios, or that the underlying bug it addressed has been resolved by other means.

vllm_ascend/worker/model_runner_v1.py (131)

high

The constant SEQ_LEN_WITH_MAX_PA_WORKSPACE = 6144 is a magic number. While the comment in a later diff explains its purpose (to obtain max workspace during graph capturing), the specific value 6144 is not self-explanatory. It's critical to ensure this value is robust and universally yields the maximum required workspace for _npu_paged_attention across all possible seq_lens and configurations. If this value is not precisely the one that triggers the largest workspace allocation, it could lead to insufficient memory during graph replay, causing critical failures. Consider adding a more detailed explanation or a reference to why this specific value is chosen, or if possible, derive it dynamically.
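The reviewer's suggestion to derive the value dynamically could look roughly like the following. This is a hypothetical sketch: `mock_workspace_size` is an illustrative stand-in for querying `_npu_paged_attention_get_workspace`, and the block-padding size model is invented purely to show why workspace requirements need not grow monotonically with seq_len.

```python
# Hypothetical sketch of deriving the max-workspace seq_len instead of
# hard-coding 6144. mock_workspace_size is a toy stand-in for the real
# workspace query; its padding model is invented for illustration.

def mock_workspace_size(seq_len: int) -> int:
    # Toy size model: padding seq_len up to blocks of 512 means two
    # different seq_lens can require the same workspace, and sizes jump
    # at block boundaries rather than growing smoothly.
    blocks = -(-seq_len // 512)  # ceiling division
    return blocks * 512 * 96

def find_max_workspace_seq_len(candidates: list[int]) -> int:
    # Probe each candidate seq_len and return the one whose workspace
    # requirement is largest, so capture-time allocation covers them all.
    return max(candidates, key=mock_workspace_size)

best = find_max_workspace_seq_len([128, 1024, 6144, 4096])
print(best)  # prints 6144
```

Probing candidates at startup would trade a small initialization cost for the guarantee the reviewer is asking about: that the captured workspace really is an upper bound over every seq_len the graph can replay with.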

@Angazenn Angazenn changed the title [v0.13.0]converting pa get_workspace back to capturing [v0.13.0][cherry-pick][BugFix]converting pa get_workspace back to capturing Jan 22, 2026
Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn added ready read for review ready-for-test start test by label for PR labels Jan 22, 2026
Signed-off-by: Angazenn <supperccell@163.com>
@wangxiyuan wangxiyuan merged commit 0216038 into vllm-project:releases/v0.13.0 Jan 22, 2026
12 checks passed
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
[v0.13.0][cherry-pick][BugFix]converting pa get_workspace back to capturing (vllm-project#6108)
@Angazenn Angazenn deleted the pa_fix_dev branch February 4, 2026 06:30
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026