[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (#31773)
Merged
vllm-bot merged 7 commits into vllm-project:main on Jan 6, 2026
Conversation
TritonAttention: Replace the deprecated `CommonAttentionMetadata.seq_lens_cpu` property with an explicit `seq_lens.cpu()` call to avoid an implicit H<->D sync. This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
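The deprecation pattern being removed can be illustrated with a minimal pure-Python sketch. The class and field layout here are hypothetical stand-ins: in vLLM, `seq_lens` is a `torch.Tensor` on the GPU, and the `.cpu()` call (mimicked below with `list(...)`) is a device-to-host copy.

```python
import warnings

class AttnMetadataSketch:
    """Hypothetical stand-in for a metadata object holding a device tensor.

    Here ``seq_lens`` is a plain list; in the real code it is a GPU tensor,
    and producing a host copy forces a device-to-host synchronization.
    """

    def __init__(self, seq_lens):
        self.seq_lens = seq_lens

    @property
    def seq_lens_cpu(self):
        # Deprecated accessor: it hides the host transfer behind a property,
        # so callers cannot see at the call site that they pay for a sync.
        warnings.warn(
            "seq_lens_cpu is deprecated; call seq_lens.cpu() explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return list(self.seq_lens)  # stands in for seq_lens.cpu()

meta = AttnMetadataSketch([8, 16, 4])

# Old style (deprecated, implicit transfer) -- emits a DeprecationWarning:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = meta.seq_lens_cpu
assert any(issubclass(w.category, DeprecationWarning) for w in caught)

# New style (explicit): the caller spells out the copy at the call site.
new = list(meta.seq_lens)
assert old == new == [8, 16, 4]
```

The point of the refactor is visibility, not a behavior change: both forms yield the same host-side values, but only the explicit call makes the transfer obvious in a code review.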
FlashInfer: Replace the deprecated `CommonAttentionMetadata.seq_lens_cpu` property with an explicit `seq_lens.cpu()` call. The conditional logic already guards against unnecessary CPU transfers. This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
FlexAttention: Replace the deprecated `CommonAttentionMetadata.num_computed_tokens_cpu` property with an explicit computation from `query_start_loc_cpu` and `seq_lens`: `num_computed_tokens = seq_lens - query_lens`. This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
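The arithmetic in this commit message can be made concrete. Per-request query lengths fall out of adjacent differences of the cumulative `query_start_loc` offsets, and subtracting them from `seq_lens` gives the tokens already in the KV cache. A pure-Python sketch (the helper name is illustrative; the real code does this with torch tensor ops):

```python
def num_computed_tokens_from(query_start_loc, seq_lens):
    """Derive per-request computed-token counts.

    query_start_loc holds the cumulative start offset of each request's
    query tokens (length num_reqs + 1), so adjacent differences give
    query_lens. Then num_computed_tokens = seq_lens - query_lens: the
    tokens already present in the KV cache for each request.
    """
    query_lens = [
        query_start_loc[i + 1] - query_start_loc[i]
        for i in range(len(query_start_loc) - 1)
    ]
    return [s - q for s, q in zip(seq_lens, query_lens)]

# Three requests: a chunked prefill of 8 new tokens on a 10-token sequence,
# a decode step (1 new token), and a fresh prefill of 4 tokens.
qsl = [0, 8, 9, 13]
seq_lens = [10, 6, 4]
print(num_computed_tokens_from(qsl, seq_lens))  # → [2, 5, 0]
```

Note the fresh prefill yields 0 computed tokens, while the decode request has everything but its single new token cached.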
MLA backends: Replace deprecated CommonAttentionMetadata properties with explicit computation:
- mla/common.py: compute `num_computed_tokens` from `query_start_loc` and `seq_lens`
- mla/flashmla_sparse.py: use an explicit `seq_lens.cpu()` call
This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Replace deprecated property usage in test files with explicit computation:
- test_attention_backends.py: use `seq_lens.cpu()` and derive `context_lens`
- test_mla_backends.py: use `seq_lens.cpu()` and derive `context_lens`
- test_sparse_mla_backends.py: use `seq_lens.cpu()`
Files intentionally NOT modified:
- test_chunked_local_attention.py: separate PR in progress
- utils.py, test_tree_attention.py: set the internal cache for test fixtures
- test_async_spec_decode.py: tests the deprecation behavior itself
- test_batch_reordering.py: uses an unrelated MockInputBatch class
This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Contributor
Code Review
This pull request is part of a series to remove the deprecated seq_lens_cpu and num_computed_tokens_cpu properties from CommonAttentionMetadata. The changes in this PR correctly replace the usage of these properties with their direct computation across various test files and attention backends. The refactoring is clean, makes the device-to-host data transfers explicit, and achieves its goal. I have found no high or critical issues with these changes.
CommonAttentionMetadata: Add a new method that computes `num_computed_tokens` on device (GPU): `num_computed_tokens = seq_lens - query_lens`. This avoids the H<->D sync that the deprecated `num_computed_tokens_cpu` property causes. Updated backends to use this method:
- flex_attention.py: use `compute_num_computed_tokens()` directly
- mla/common.py: use `compute_num_computed_tokens().cpu()` for CPU indexing
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
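The contract of the new method can be sketched as follows. This is a hypothetical pure-Python stand-in mirroring the names in the commit message; in vLLM the fields are GPU `torch.Tensor`s, the subtraction runs on device with no host synchronization, and only callers that need host-side indexing (the mla/common.py case) pay for a `.cpu()` copy at the call site:

```python
class CommonAttentionMetadataSketch:
    """Hypothetical stand-in; vLLM's fields are torch tensors on device."""

    def __init__(self, query_start_loc, seq_lens):
        self.query_start_loc = query_start_loc  # cumulative query offsets
        self.seq_lens = seq_lens                # total tokens per request

    def compute_num_computed_tokens(self):
        # In the real code this is a tensor subtraction that stays on
        # device, so callers that never need host values never sync.
        query_lens = [
            self.query_start_loc[i + 1] - self.query_start_loc[i]
            for i in range(len(self.query_start_loc) - 1)
        ]
        return [s - q for s, q in zip(self.seq_lens, query_lens)]

meta = CommonAttentionMetadataSketch([0, 8, 9], [10, 6])

# flex_attention.py-style caller: result stays on device.
on_device = meta.compute_num_computed_tokens()

# mla/common.py-style caller: one explicit host copy for CPU indexing
# (list(...) stands in for .cpu()).
on_host = list(on_device)
assert on_host == [2, 5]
```

The design choice is to push the transfer decision to the caller: the method itself is sync-free, and the single `.cpu()` in mla/common.py marks the one place a host copy is actually required.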
DarkLight1337
approved these changes
Jan 6, 2026
LucasWilkinson
added a commit
to neuralmagic/vllm
that referenced
this pull request
Jan 6, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
This was referenced Jan 8, 2026
yugong333
pushed a commit
to yugong333/vllm
that referenced
this pull request
Jan 9, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
wangxiyuan
pushed a commit
to vllm-project/vllm-ascend
that referenced
this pull request
Jan 13, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aipaes
pushed a commit
to aipaes/vllm-ascend
that referenced
this pull request
Jan 15, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
akh64bit
pushed a commit
to akh64bit/vllm
that referenced
this pull request
Jan 16, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
dsuhinin
pushed a commit
to dsuhinin/vllm
that referenced
this pull request
Jan 21, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aabbccddwasd
added a commit
to aabbccddwasd/vllm
that referenced
this pull request
Feb 5, 2026
This reverts the change in vllm-project#31773 that replaced seq_lens_cpu with seq_lens.cpu() in the FlashInfer backend. The property access provides better performance by avoiding unnecessary D2H transfers when the cached value is already available. Fixes performance regression on GLM-4.7-GPTQ-INT4-INT8MIX model with MTP (Multi-Token Prediction) enabled, where throughput dropped from 95 to 77 tps. Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
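The rationale behind this revert can be sketched with a caching accessor (hypothetical pure-Python stand-in; in vLLM the cached value is a pre-populated host tensor and the simulated transfer is a real D2H copy): when the model runner has already populated the host copy, a caching property returns it for free, whereas a bare `seq_lens.cpu()` at each call site pays the transfer every time.

```python
class CachedHostCopy:
    """Sketch of why a caching property can beat a bare .cpu() call.

    If the metadata is built with the CPU copy already populated, the
    property returns the cached value and never touches the device; a
    raw seq_lens.cpu() at every call site would pay the device-to-host
    transfer each time.
    """

    def __init__(self, seq_lens, seq_lens_cpu=None):
        self.seq_lens = seq_lens           # device tensor in real code
        self._seq_lens_cpu = seq_lens_cpu  # pre-populated host copy, if any
        self.transfers = 0                 # count simulated D2H copies

    @property
    def seq_lens_cpu(self):
        if self._seq_lens_cpu is None:
            self.transfers += 1                      # simulated D2H transfer
            self._seq_lens_cpu = list(self.seq_lens)  # stands in for .cpu()
        return self._seq_lens_cpu

# Host copy pre-populated: repeated access never transfers.
meta = CachedHostCopy([8, 16], seq_lens_cpu=[8, 16])
for _ in range(3):
    _ = meta.seq_lens_cpu
assert meta.transfers == 0
```

This is the trade-off the PR and the revert sit on opposite sides of: explicit `.cpu()` calls make transfers visible, while the cached property makes repeated host access cheap when the copy already exists.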
aabbccddwasd
added a commit
to aabbccddwasd/vllm
that referenced
this pull request
Feb 8, 2026
This reverts the change in vllm-project#31773 that replaced seq_lens_cpu with seq_lens.cpu() in the FlashInfer backend. The property access provides better performance by avoiding unnecessary D2H transfers when the cached value is already available. Fixes performance regression on GLM-4.7-GPTQ-INT4-INT8MIX model with MTP (Multi-Token Prediction) enabled, where throughput dropped from 95 to 77 tps. Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
ItzDEXX
pushed a commit
to ItzDEXX/vllm
that referenced
this pull request
Feb 19, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
ZRJ026
pushed a commit
to ZRJ026/vllm-ascend
that referenced
this pull request
Feb 28, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241
pushed a commit
to maoxx241/vllm-ascend
that referenced
this pull request
Mar 2, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
ZRJ026
pushed a commit
to ZRJ026/vllm-ascend
that referenced
this pull request
Mar 4, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ
pushed a commit
to LCAIZJ/vllm-ascend
that referenced
this pull request
Mar 7, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Update tests, MLA backends and CUDA full-attention backends