[v0.13.0][cherry-pick][BugFix] converting pa get_workspace back to capturing #6108
Conversation
Code Review
This pull request refactors the workspace management for paged attention operations during NPU graph capturing. It removes a direct call to _npu_paged_attention_get_workspace in acl_graph.py and introduces a fixed sequence length (SEQ_LEN_WITH_MAX_PA_WORKSPACE) to be used during graph capturing to ensure maximum workspace allocation. While the intention is to streamline workspace handling, there's a potential for correctness issues if the new fixed sequence length doesn't universally cover all edge cases previously handled by the explicit get_workspace call, especially for smaller sequence lengths that might require larger workspaces in specific scenarios.
vllm_ascend/compilation/acl_graph.py (239-256)
The removal of the explicit _npu_paged_attention_get_workspace call, along with its accompanying comment, raises a critical concern. The original comment explicitly stated that this call was necessary to address "rare bugs for FULL_DECODE_ONLY mode with GQA" where "smaller seq_lens might encounter a bigger workspace" than what max_model_len (used in capturing) provides. While model_runner_v1.py now uses SEQ_LEN_WITH_MAX_PA_WORKSPACE during capturing to get the "max workspace", it's unclear if this single fixed value universally resolves the issue for all smaller seq_lens that might have previously required a larger workspace. If the underlying issue of varying workspace requirements for different seq_lens still exists, removing this explicit call could reintroduce or worsen these rare bugs, leading to runtime errors or incorrect behavior due to insufficient workspace during graph replay.
vllm_ascend/worker/model_runner_v1.py (1942-1947)
The comment here explains that _npu_paged_attention_get_workspace "only returns max workspace with specific seq_lens" and SEQ_LEN_WITH_MAX_PA_WORKSPACE is used for capturing. This directly contradicts the rationale for the removed code in acl_graph.py (lines 239-256 in the LEFT diff), which stated that "smaller seq_lens might encounter a bigger workspace" than what max_model_len (used in capturing) provides, necessitating an additional get_workspace call. This suggests an unresolved or re-introduced correctness issue. If SEQ_LEN_WITH_MAX_PA_WORKSPACE does not cover all edge cases where smaller seq_lens previously required a larger workspace, the system will be prone to runtime errors or performance degradation when replaying graphs with insufficient pre-allocated workspace.
vllm_ascend/compilation/acl_graph.py (249)
This change replaces the locally computed workspace with one retrieved from graph_params.workspaces. This is directly linked to the removal of the explicit _npu_paged_attention_get_workspace call in the previous diff. If the graph_params.workspaces (populated during capturing using SEQ_LEN_WITH_MAX_PA_WORKSPACE) does not guarantee sufficient workspace for all seq_lens variations, especially those 'rare cases' where smaller seq_lens needed a larger workspace, this could lead to runtime failures or incorrect results. It's crucial to confirm that SEQ_LEN_WITH_MAX_PA_WORKSPACE is indeed sufficient for all scenarios, or that the underlying bug it addressed has been resolved by other means.
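The allocate-at-capture, reuse-at-replay pattern these comments are debating can be sketched generically. Everything below is hypothetical scaffolding: `get_workspace_size` is a toy stand-in for `_npu_paged_attention_get_workspace` (whose real signature may differ), and `GraphParams` only mimics the role of `graph_params.workspaces`. The toy size function is deliberately non-monotonic in `seq_len` to illustrate the hazard the review raises.

```python
from dataclasses import dataclass, field

def get_workspace_size(seq_len: int) -> int:
    # Toy stand-in for _npu_paged_attention_get_workspace. It is
    # intentionally non-monotonic: small seq_lens get an extra bump,
    # mimicking the GQA corner case described in the review.
    return 1024 * (seq_len // 128) + (4096 if seq_len < 512 else 0)

@dataclass
class GraphParams:
    # Workspace buffers recorded at capture time, keyed per graph.
    workspaces: dict = field(default_factory=dict)

def capture(graph_params: GraphParams, key: str, capture_seq_len: int) -> None:
    # Allocate once, during capture, using the seq_len believed to
    # yield the maximum workspace (e.g. SEQ_LEN_WITH_MAX_PA_WORKSPACE).
    graph_params.workspaces[key] = bytearray(get_workspace_size(capture_seq_len))

def replay(graph_params: GraphParams, key: str, runtime_seq_len: int) -> bytearray:
    # No per-call allocation: reuse the captured workspace. This is
    # only safe if the captured buffer covers the runtime requirement,
    # which is exactly the invariant the review asks to confirm.
    ws = graph_params.workspaces[key]
    assert len(ws) >= get_workspace_size(runtime_seq_len), "insufficient workspace"
    return ws
```

If the invariant ever fails for some runtime `seq_len`, the replay-side assertion is where the "insufficient workspace" failure mode described above would surface.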
vllm_ascend/worker/model_runner_v1.py (131)
The constant SEQ_LEN_WITH_MAX_PA_WORKSPACE = 6144 is a magic number. While the comment in a later diff explains its purpose (to obtain max workspace during graph capturing), the specific value 6144 is not self-explanatory. It's critical to ensure this value is robust and universally yields the maximum required workspace for _npu_paged_attention across all possible seq_lens and configurations. If this value is not precisely the one that triggers the largest workspace allocation, it could lead to insufficient memory during graph replay, causing critical failures. Consider adding a more detailed explanation or a reference to why this specific value is chosen, or if possible, derive it dynamically.
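One way to act on the reviewer's suggestion of deriving the value dynamically would be to scan candidate seq_lens at startup and pick the one that maximizes the workspace query. This is only a sketch: `get_workspace_size` below is a hypothetical, non-monotonic stand-in for `_npu_paged_attention_get_workspace`, and the step/range choices are illustrative.

```python
def get_workspace_size(seq_len: int) -> int:
    # Hypothetical stand-in for _npu_paged_attention_get_workspace;
    # deliberately non-monotonic so a smaller seq_len can need a
    # larger workspace, as the review warns.
    return 1024 * (seq_len // 128) + (4096 if seq_len < 512 else 0)

def find_max_workspace_seq_len(max_model_len: int, step: int = 128) -> int:
    """Scan candidate seq_lens and return the one requiring the largest
    workspace, so a capture-time allocation at that length covers every
    replay-time length up to max_model_len."""
    candidates = range(step, max_model_len + 1, step)
    return max(candidates, key=get_workspace_size)
```

With the toy size function above, a model limited to 512 tokens would pick 384 rather than 512, showing why a fixed constant like 6144 needs verification rather than assumption.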
…turing (vllm-project#6108)

Signed-off-by: Angazenn <supperccell@163.com>
What this PR does / why we need it?
This cherry-picks #5833 .
This helps fix a bug in pa get_workspace. In the earlier implementation, we called `_npu_paged_attention_get_workspace` in `_update_pa_attn_params`. However, this could cause potential memory problems, as it dynamically allocates new workspace memory each time the API is called. Therefore, we move this back to capturing and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get the max workspace.
Does this PR introduce any user-facing change?
How was this patch tested?