[0.13.0][Bugfix] fix pcp aclgraph qwen FIA bug #6038
zzzzwwjj merged 1 commit into vllm-project:releases/v0.13.0
Conversation
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Code Review
This pull request aims to fix a bug related to actual_seq_lengths_q in the update_attn_dcp_pcp_params function. While removing the incorrect slicing logic is correct, removing the padding logic as well might introduce a new issue. The query tensor's shape is tied to runtime_shape, and actual_seq_lengths_q needs to have a matching length. I've provided a suggestion to re-introduce the padding logic without the incorrect slicing to prevent potential runtime errors due to shape mismatch. Please review the suggestion.
```diff
-actual_seq_lengths_q = actual_seq_lengths_q + [
-    actual_seq_lengths_q[-1]
-] * (runtime_shape - len(actual_seq_lengths_q))
+actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q
```
This change removes the padding logic for `actual_seq_lengths_q`, which could lead to a shape mismatch at runtime. It implies that `attn_metadata.actual_seq_lengths_q` now always has length equal to `runtime_shape`.
However, the existing test `test_update_attn_dcp_pcp_params` suggests this is not always the case (`runtime_shape=4` vs. `len(actual_seq_lengths_q)=2`). That would cause a mismatch with the query tensor's shape inside the `npu_fused_infer_attention_score` operator, leading to a runtime error. The test only passes because the operator is mocked.
While the original slicing was likely buggy, the padding seems necessary. The suggestion below reintroduces the padding without the incorrect slicing.
```diff
-actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q
+actual_seq_lengths_q = list(attn_metadata.actual_seq_lengths_q)
+pad_len = runtime_shape - len(actual_seq_lengths_q)
+if pad_len > 0:
+    actual_seq_lengths_q.extend([actual_seq_lengths_q[-1]] * pad_len)
```
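The padding behavior suggested above can be sketched in isolation. This is a minimal illustration, not code from vllm-ascend: the helper name `pad_actual_seq_lengths_q` and the sample values are assumptions, chosen to mirror the `runtime_shape=4` vs. `len(actual_seq_lengths_q)=2` case from the test mentioned earlier.

```python
def pad_actual_seq_lengths_q(actual_seq_lengths_q, runtime_shape):
    """Pad the cumulative q-length list to runtime_shape by repeating its last entry.

    This mirrors the review suggestion: copy the metadata list, then extend it
    so its length matches the (padded) query tensor's first dimension.
    """
    padded = list(actual_seq_lengths_q)  # avoid mutating attn_metadata in place
    pad_len = runtime_shape - len(padded)
    if pad_len > 0:
        padded.extend([padded[-1]] * pad_len)
    return padded


# Hypothetical values mirroring the test scenario (runtime_shape=4, list length 2):
print(pad_actual_seq_lengths_q([3, 5], 4))  # -> [3, 5, 5, 5]
# Already at full length: unchanged.
print(pad_actual_seq_lengths_q([3, 5, 7, 9], 4))  # -> [3, 5, 7, 9]
```

Repeating the last cumulative length is what makes the padded entries act as empty (zero-token) sequences for the attention operator, so the extra graph-padding slots contribute nothing.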
…lm-ascend into FIA_v0.13.0

* 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend:
  - [0.13.0][Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5958)
  - [v0.13.0][Bugfix] Fix XliteModelRunner init failed when aclgraph is enabled (vllm-project#5887)
  - [0.13.0][Bugfix] Fixed a problem related to embeddings sharing (vllm-project#5972)
  - [Bugfix] Fixed precision issues caused by pooled request pooling (vllm-project#6057)
  - [0.13.0][Bugfix] fix pcp aclgraph qwen FIA bug (vllm-project#6038)
  - [0.13.0][cherry-pick][bugfix] fix bug of triton mrope (vllm-project#6009)
  - [0.13.0][bugfix] Resolved memory deallocation failure in the pooling layer under re-computation workloads (vllm-project#6056)
What this PR does / why we need it?
[releases/v0.13.0] In the pcp full-graph Qwen model scenario, this PR fixes an inconsistency between the query (Q) tensor shape and the actual q length passed to the FIA operator.
PR for main branch: #6037
Does this PR introduce any user-facing change?
No
How was this patch tested?
vLLM version: v0.13.0