[0.13.0][cherry-pick][CP&SP] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 #6039
Conversation
Code Review
This pull request removes redundant code related to batch_seq_mask following an update to CANN 8.5. The changes are consistent across the codebase, simplifying the attention mechanism by relying on the new capabilities of the FIA operator. However, I've identified a critical issue in vllm_ascend/spec_decode/mtp_proposer.py where a numpy array is assigned to a field expecting a torch tensor, which will likely lead to a runtime error. Please see the detailed comment below.
batch_seq_mask = builder.batch_seq_mask_buf[:batch_seq_mask.shape[0]]
cp_seq_len = torch.where(cp_seq_len == 0, 1, cp_seq_len)
attn_metadata_i.decode.cp_seq_len = cp_seq_len
The cp_seq_len variable is a numpy array, but it is being assigned to attn_metadata_i.decode.cp_seq_len, which is defined as a torch.Tensor in the AscendMLADecodeMetadata dataclass. This type mismatch will likely cause a runtime error in downstream operations that expect a tensor. You should convert cp_seq_len to a torch.Tensor before the assignment, similar to the implementation in vllm_ascend/attention/context_parallel/mla_cp.py.
Suggested change:
- attn_metadata_i.decode.cp_seq_len = cp_seq_len
+ attn_metadata_i.decode.cp_seq_len = torch.tensor(cp_seq_len, dtype=torch.int32)
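A minimal sketch of the clamping logic visible in the diff, using numpy in place of torch to illustrate the same element-wise `where`: ranks that cached no KV tokens report a sequence length of 0, which is clamped to 1 so downstream attention does not divide or index by zero. The variable names follow the diff; the surrounding builder context is assumed.

```python
import numpy as np

# cp_seq_len holds the per-rank cached KV lengths; with small inputs some
# ranks may have cached nothing and report 0 (assumed scenario from the PR).
cp_seq_len = np.array([0, 3, 0, 7], dtype=np.int32)

# Clamp zero-length entries to 1, mirroring
# torch.where(cp_seq_len == 0, 1, cp_seq_len) in the diff.
cp_seq_len = np.where(cp_seq_len == 0, 1, cp_seq_len)
print(cp_seq_len.tolist())  # [1, 3, 1, 7]
```

The reviewer's point is orthogonal to this logic: `np.where` returns a numpy array, so it must be converted with `torch.tensor(...)` (or `torch.from_numpy(...)`) before being stored in a field typed as `torch.Tensor`.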
Force-pushed: b49f7e9 to e900057
This PR should be merged after #6046
Force-pushed: e900057 to 1ec5c89
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed: 3466774 to ca172a0
Signed-off-by: dsxsteven <dsxsteven@sina.com>
What this PR does / why we need it?
PCP/DCP splits the kv-cache across different cards. With the parameter cp-kv-cache-interleave-size, the first `size` tokens are cached on card 0, the next `size` tokens on card 1, and so on.
However, when there are too few tokens, some cards store no key-value pairs, producing zero-length entries, corrupted values, and precision issues. Additional operations were previously introduced to avoid this precision problem.
Now that the FIA operator is integrated in mla_cp._forward_decode and CANN has been updated to 8.5.0, these additional operations can be removed.
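The interleaved placement described above can be sketched as a simple round-robin over cards. This is an illustrative model only: the function name, the exact mapping, and the assumption that placement is a pure modulo over `interleave_size` are hypothetical, not taken from the vllm-ascend source.

```python
def card_for_token(token_idx: int, interleave_size: int, num_cards: int) -> int:
    """Hypothetical mapping: the first `interleave_size` tokens go to card 0,
    the next `interleave_size` to card 1, ..., wrapping around."""
    return (token_idx // interleave_size) % num_cards

# With 10 tokens, interleave size 2, and 4 cards, every card receives tokens.
placement = [card_for_token(i, 2, 4) for i in range(10)]
print(placement)  # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0]

# With only 3 tokens, cards 2 and 3 receive nothing -- the zero-length
# situation this PR's removed workaround used to guard against.
short = [card_for_token(i, 2, 4) for i in range(3)]
print(short)  # [0, 0, 1]
```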
pick-from: #6013
Does this PR introduce any user-facing change?
How was this patch tested?
Passed all CI with CANN 8.5.0.