
[Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties#31773

Merged
vllm-bot merged 7 commits into vllm-project:main from
neuralmagic:deprecate-cpu-props/part-1
Jan 6, 2026
Conversation

@LucasWilkinson
Collaborator

Update tests, MLA backends and CUDA full-attention backends

…itonAttention

Replace deprecated CommonAttentionMetadata.seq_lens_cpu property with
explicit seq_lens.cpu() call to avoid implicit H<->D sync.

This is part of the deprecation effort for seq_lens_cpu and
num_computed_tokens_cpu properties (to be removed in v0.14.0).

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
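The idea behind this change can be illustrated with a small sketch in plain Python. `DeviceTensor` is a toy stand-in for a GPU tensor (not a real vLLM or PyTorch class, and the values are made up); the point is that the device-to-host copy now happens at an explicit, visible call site rather than behind a property access.

```python
class DeviceTensor:
    """Toy stand-in for a GPU tensor; .cpu() models the D2H copy."""

    def __init__(self, values):
        self.values = list(values)
        self.d2h_copies = 0  # count simulated device-to-host transfers

    def cpu(self):
        self.d2h_copies += 1
        return list(self.values)


seq_lens = DeviceTensor([5, 8, 3])

# Before (deprecated): reading metadata.seq_lens_cpu hid the sync
# behind a property access.
# After: the transfer is spelled out where it happens.
seq_lens_cpu = seq_lens.cpu()
```

Making the copy explicit means a reader (or profiler) can see exactly which call sites pay for a host-device synchronization.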
…ashInfer

Replace deprecated CommonAttentionMetadata.seq_lens_cpu property with
explicit seq_lens.cpu() call. The conditional logic already guards
against unnecessary CPU transfers.

This is part of the deprecation effort for seq_lens_cpu and
num_computed_tokens_cpu properties (to be removed in v0.14.0).

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…n in FlexAttention

Replace deprecated CommonAttentionMetadata.num_computed_tokens_cpu property
with explicit computation from query_start_loc_cpu and seq_lens.

num_computed_tokens = seq_lens - query_lens

This is part of the deprecation effort for seq_lens_cpu and
num_computed_tokens_cpu properties (to be removed in v0.14.0).

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
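The derivation above can be sketched in plain Python (lists stand in for the CPU tensors, and the values are invented for illustration): `query_start_loc` holds cumulative offsets of each request's query tokens, so consecutive differences recover the per-request query lengths, and subtracting those from `seq_lens` yields the already-computed token counts.

```python
query_start_loc = [0, 4, 6, 11]  # stand-in for query_start_loc_cpu
seq_lens = [10, 8, 15]           # per-request total sequence lengths

# Per-request query lengths are the deltas of the cumulative offsets.
query_lens = [b - a for a, b in zip(query_start_loc, query_start_loc[1:])]

# Tokens already computed (i.e. in the KV cache) = total - new query tokens.
num_computed_tokens = [s - q for s, q in zip(seq_lens, query_lens)]
```

On real tensors the same two steps would be a `diff` and an elementwise subtraction, with no host round-trip required.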
…ends

Replace deprecated CommonAttentionMetadata properties with explicit
computation:
- mla/common.py: Compute num_computed_tokens from query_start_loc and seq_lens
- mla/flashmla_sparse.py: Use explicit seq_lens.cpu() call

This is part of the deprecation effort for seq_lens_cpu and
num_computed_tokens_cpu properties (to be removed in v0.14.0).

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Replace deprecated property usage in test files with explicit computation:
- test_attention_backends.py: Use seq_lens.cpu() and derive context_lens
- test_mla_backends.py: Use seq_lens.cpu() and derive context_lens
- test_sparse_mla_backends.py: Use seq_lens.cpu()

Files intentionally NOT modified:
- test_chunked_local_attention.py: Separate PR in progress
- utils.py, test_tree_attention.py: Set internal cache for test fixtures
- test_async_spec_decode.py: Tests the deprecation behavior itself
- test_batch_reordering.py: Uses unrelated MockInputBatch class

This is part of the deprecation effort for seq_lens_cpu and
num_computed_tokens_cpu properties (to be removed in v0.14.0).

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is part of a series to remove the deprecated seq_lens_cpu and num_computed_tokens_cpu properties from CommonAttentionMetadata. The changes in this PR correctly replace the usage of these properties with their direct computation across various test files and attention backends. The refactoring is clean, makes the device-to-host data transfers explicit, and achieves its goal. I have found no high or critical issues with these changes.

…Metadata

Add a new method that computes num_computed_tokens on device (GPU):
  num_computed_tokens = seq_lens - query_lens

This avoids the H<->D sync that the deprecated num_computed_tokens_cpu
property causes.

Updated backends to use this method:
- flex_attention.py: Use compute_num_computed_tokens() directly
- mla/common.py: Use compute_num_computed_tokens().cpu() for CPU indexing

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Jan 6, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 6, 2026
@vllm-bot vllm-bot merged commit e0327c9 into vllm-project:main Jan 6, 2026
50 of 53 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 6, 2026
LucasWilkinson added a commit to neuralmagic/vllm that referenced this pull request Jan 6, 2026
…omputed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 13, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules` due to
vllm-project/vllm#31786
2. fix spec_decode e2e test due to the vllm-project/vllm#29821 breaking
change
3. fix `vllm.v1.attention.backends.utils` due to
vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` so both operands are on the same
device, due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules ` due to
vllm-project/vllm#31786
2. fix spec_decode e2e test due to
vllm-project/vllm#29821 break
3. fix `vllm.v1.attention.backends.utils` duo to
vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to
vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aabbccddwasd added a commit to aabbccddwasd/vllm that referenced this pull request Feb 5, 2026
This reverts the change in vllm-project#31773 that replaced seq_lens_cpu with
seq_lens.cpu() in the FlashInfer backend. The property access provides
better performance by avoiding unnecessary D2H transfers when the cached
value is already available.

Fixes performance regression on GLM-4.7-GPTQ-INT4-INT8MIX model with
MTP (Multi-Token Prediction) enabled, where throughput dropped from
95 to 77 tps.

Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
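The performance argument in this revert is the classic cached-property trade-off. A minimal sketch (a toy class, not the real vLLM metadata) shows why a cached CPU copy can beat a per-call `.cpu()`: the simulated transfer runs at most once no matter how often the value is read.

```python
from functools import cached_property


class Metadata:
    """Toy illustration of caching the host copy of a device tensor."""

    def __init__(self, seq_lens):
        self._seq_lens = seq_lens
        self.transfers = 0  # count simulated D2H copies

    @cached_property
    def seq_lens_cpu(self):
        self.transfers += 1            # stand-in for tensor.cpu()
        return list(self._seq_lens)


m = Metadata([5, 8, 3])
first = m.seq_lens_cpu
second = m.seq_lens_cpu  # served from the cache; no second transfer
```

This is the tension the PR series navigates: an explicit `.cpu()` makes the sync visible but pays for it on every call, while a cached property hides the sync but amortizes it across accesses.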
aabbccddwasd added a commit to aabbccddwasd/vllm that referenced this pull request Feb 8, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules ` due to
vllm-project/vllm#31786
2. fix spec_decode e2e test due to
vllm-project/vllm#29821 break
3. fix `vllm.v1.attention.backends.utils` duo to
vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to
vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants