Skip to content

[BugFix]converting pa get_workspace back to capturing#5833

Merged
wangxiyuan merged 1 commit intovllm-project:mainfrom
Angazenn:pa_fix
Jan 22, 2026
Merged

[BugFix]converting pa get_workspace back to capturing#5833
wangxiyuan merged 1 commit intovllm-project:mainfrom
Angazenn:pa_fix

Conversation

@Angazenn
Copy link
Copy Markdown
Collaborator

@Angazenn Angazenn commented Jan 13, 2026

What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier implementation, we use _npu_paged_attention_get_workspace in _update_pa_attn_params. However, this might cause some potential memory problems as it dynamically allocate new memory for workspace when calling this api. Therefor, we move this back to capturing, and use a fixed SEQ_LEN_WITH_MAX_PA_WORKSPACE to get max workspace.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Copy link
Copy Markdown
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the workspace management for paged attention in ACL graphs. It removes a runtime workspace calculation that was implemented as a workaround for a bug in _npu_paged_attention. The new approach relies on using a pre-captured workspace from graph_params.

While this simplifies the code, I have a critical concern about removing the workaround. The original code comments indicated that this safeguard was necessary because smaller sequence lengths could unexpectedly require a larger workspace. My review comment details this concern, highlighting the risk of re-introducing runtime errors if the underlying issue has not been resolved.

I am having trouble creating individual review comments. Click here to see my feedback.

vllm_ascend/compilation/acl_graph.py (240-257)

critical

This change removes the runtime workspace calculation for _npu_paged_attention. This calculation was originally added as a workaround for a bug where smaller seq_lens could require a larger workspace than what's allocated during graph capture. The new implementation relies on graph_params.workspaces.get(runtime_shape) to provide the workspace.

The removed TODO comment indicated this workaround should only be removed once _npu_paged_attention is replaced by npu_fused_infer_attention_score. Since _npu_paged_attention is still in use, removing this safeguard is risky unless the underlying bug in torch_npu has been fixed, or the graph capture logic has been updated to determine a sufficiently large workspace for all cases.

Without confirmation that the underlying issue is resolved, this change may re-introduce the original bug, potentially leading to runtime errors.

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Angazenn Angazenn force-pushed the pa_fix branch 2 times, most recently from df842aa to 54c4e59 Compare January 21, 2026 15:00
@Angazenn Angazenn marked this pull request as ready for review January 21, 2026 15:00
@Angazenn Angazenn requested a review from MengqingCao as a code owner January 21, 2026 15:00
@Angazenn Angazenn added ready read for review ready-for-test start test by label for PR labels Jan 21, 2026
Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn changed the title [Draft]converting pa get_workspace back to capturing [BugFix]converting pa get_workspace back to capturing Jan 22, 2026
@wangxiyuan wangxiyuan merged commit 1d3544c into vllm-project:main Jan 22, 2026
20 checks passed
wangxiyuan pushed a commit that referenced this pull request Jan 22, 2026
…turing (#6108)

### What this PR does / why we need it?
This cherry-picks #5833 .

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

---------

Signed-off-by: Angazenn <supperccell@163.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 22, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (51 commits)
  [Bugfix] Remove `use_aclgraph` in mtp_proposer and use `use_cuda_graph` (vllm-project#6032)
  [BugFix] fix 3vl dense model load quant weight (vllm-project#6100)
  [CP&SP] Integrate FIA operator in mla_cp._forward_decode (vllm-project#5641)
  [CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (vllm-project#6145)
  [CI]Install clang in dokerfile for triton ascend (vllm-project#4409)
  [Main] Upgrade PTA to 2.9.0 (vllm-project#6112)
  [Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (vllm-project#5721)
  [P/D][PCP]bugfix pcp force free twice caused logger error (vllm-project#6124)
  [BugFix]converting pa get_workspace back to capturing (vllm-project#5833)
  [CI] optimize lint term (vllm-project#5986)
  [Bugfix] Fix Triton operator usage for multimodal models based on `the mrope_interleaved` parameter (vllm-project#6042)
  [bugfix][npugraph_ex]fix the model output type issue caused by manually modify FX graph (vllm-project#6015)
  [BugFix] Support setting tp=1 for the Eagle draft model to take effect (vllm-project#6097)
  [Misc] Bump mooncake version to v0.3.8.post1 (vllm-project#6110)
  [Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (vllm-project#5758)
  [bugfix] adapt_remote_request_id (vllm-project#6051)
  [Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (vllm-project#5143)
  [Feature] Support DSA-CP for Hybrid scenario (vllm-project#5702)
  [CI] Upgrade CANN to 8.5.0 (vllm-project#6070)
  Default enable MLAPO (vllm-project#5952)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
)

### What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: Angazenn <supperccell@163.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…turing (vllm-project#6108)

### What this PR does / why we need it?
This cherry-picks vllm-project#5833 .

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

---------

Signed-off-by: Angazenn <supperccell@163.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
)

### What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn deleted the pa_fix branch February 4, 2026 06:30
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
)

### What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
)

### What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: Angazenn <supperccell@163.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
)

### What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
)

### What this PR does / why we need it?

This helps to fix a bug in for pa get_workspace. In earlier
implementation, we use `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this might cause some potential
memory problems as it dynamically allocate new memory for workspace when
calling this api. Therefor, we move this back to capturing, and use a
fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get max workspace.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: Angazenn <supperccell@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants