[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (#31773)
Merged
vllm-bot merged 7 commits into vllm-project:main on Jan 6, 2026
Conversation
TritonAttention: Replace the deprecated `CommonAttentionMetadata.seq_lens_cpu` property with an explicit `seq_lens.cpu()` call to avoid an implicit H<->D sync. This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
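The deprecation pattern being removed can be illustrated with a minimal pure-Python sketch. The class and field layout here are hypothetical stand-ins: in vLLM, `seq_lens` is a `torch.Tensor` on the GPU, and the `.cpu()` call (mimicked below with `list(...)`) is a device-to-host copy.

```python
import warnings

class AttnMetadataSketch:
    """Hypothetical stand-in for a metadata object holding a device tensor.

    Here ``seq_lens`` is a plain list; in the real code it is a GPU tensor,
    and producing a host copy forces a device-to-host synchronization.
    """

    def __init__(self, seq_lens):
        self.seq_lens = seq_lens

    @property
    def seq_lens_cpu(self):
        # Deprecated accessor: it hides the host transfer behind a property,
        # so callers cannot see at the call site that they pay for a sync.
        warnings.warn(
            "seq_lens_cpu is deprecated; call seq_lens.cpu() explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return list(self.seq_lens)  # stands in for seq_lens.cpu()

meta = AttnMetadataSketch([8, 16, 4])

# Old style (deprecated, implicit transfer) -- emits a DeprecationWarning:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = meta.seq_lens_cpu
assert any(issubclass(w.category, DeprecationWarning) for w in caught)

# New style (explicit): the caller spells out the copy at the call site.
new = list(meta.seq_lens)
assert old == new == [8, 16, 4]
```

The point of the refactor is visibility, not a behavior change: both forms yield the same host-side values, but only the explicit call makes the transfer obvious in a code review.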
FlashInfer: Replace the deprecated `CommonAttentionMetadata.seq_lens_cpu` property with an explicit `seq_lens.cpu()` call. The conditional logic already guards against unnecessary CPU transfers. This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
FlexAttention: Replace the deprecated `CommonAttentionMetadata.num_computed_tokens_cpu` property with an explicit computation from `query_start_loc_cpu` and `seq_lens`: `num_computed_tokens = seq_lens - query_lens`. This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
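The arithmetic in this commit message can be made concrete. Per-request query lengths fall out of adjacent differences of the cumulative `query_start_loc` offsets, and subtracting them from `seq_lens` gives the tokens already in the KV cache. A pure-Python sketch (the helper name is illustrative; the real code does this with torch tensor ops):

```python
def num_computed_tokens_from(query_start_loc, seq_lens):
    """Derive per-request computed-token counts.

    query_start_loc holds the cumulative start offset of each request's
    query tokens (length num_reqs + 1), so adjacent differences give
    query_lens. Then num_computed_tokens = seq_lens - query_lens: the
    tokens already present in the KV cache for each request.
    """
    query_lens = [
        query_start_loc[i + 1] - query_start_loc[i]
        for i in range(len(query_start_loc) - 1)
    ]
    return [s - q for s, q in zip(seq_lens, query_lens)]

# Three requests: a chunked prefill of 8 new tokens on a 10-token sequence,
# a decode step (1 new token), and a fresh prefill of 4 tokens.
qsl = [0, 8, 9, 13]
seq_lens = [10, 6, 4]
print(num_computed_tokens_from(qsl, seq_lens))  # → [2, 5, 0]
```

Note the fresh prefill yields 0 computed tokens, while the decode request has everything but its single new token cached.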
MLA backends: Replace deprecated CommonAttentionMetadata properties with explicit computation:
- mla/common.py: compute `num_computed_tokens` from `query_start_loc` and `seq_lens`
- mla/flashmla_sparse.py: use an explicit `seq_lens.cpu()` call
This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Replace deprecated property usage in test files with explicit computation:
- test_attention_backends.py: use `seq_lens.cpu()` and derive `context_lens`
- test_mla_backends.py: use `seq_lens.cpu()` and derive `context_lens`
- test_sparse_mla_backends.py: use `seq_lens.cpu()`
Files intentionally NOT modified:
- test_chunked_local_attention.py: separate PR in progress
- utils.py, test_tree_attention.py: set the internal cache for test fixtures
- test_async_spec_decode.py: tests the deprecation behavior itself
- test_batch_reordering.py: uses an unrelated MockInputBatch class
This is part of the deprecation effort for the `seq_lens_cpu` and `num_computed_tokens_cpu` properties (to be removed in v0.14.0). Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Contributor
Code Review
This pull request is part of a series to remove the deprecated seq_lens_cpu and num_computed_tokens_cpu properties from CommonAttentionMetadata. The changes in this PR correctly replace the usage of these properties with their direct computation across various test files and attention backends. The refactoring is clean, makes the device-to-host data transfers explicit, and achieves its goal. I have found no high or critical issues with these changes.
CommonAttentionMetadata: Add a new method that computes `num_computed_tokens` on device (GPU): `num_computed_tokens = seq_lens - query_lens`. This avoids the H<->D sync that the deprecated `num_computed_tokens_cpu` property causes. Updated backends to use this method:
- flex_attention.py: use `compute_num_computed_tokens()` directly
- mla/common.py: use `compute_num_computed_tokens().cpu()` for CPU indexing
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
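The contract of the new method can be sketched as follows. This is a hypothetical pure-Python stand-in mirroring the names in the commit message; in vLLM the fields are GPU `torch.Tensor`s, the subtraction runs on device with no host synchronization, and only callers that need host-side indexing (the mla/common.py case) pay for a `.cpu()` copy at the call site:

```python
class CommonAttentionMetadataSketch:
    """Hypothetical stand-in; vLLM's fields are torch tensors on device."""

    def __init__(self, query_start_loc, seq_lens):
        self.query_start_loc = query_start_loc  # cumulative query offsets
        self.seq_lens = seq_lens                # total tokens per request

    def compute_num_computed_tokens(self):
        # In the real code this is a tensor subtraction that stays on
        # device, so callers that never need host values never sync.
        query_lens = [
            self.query_start_loc[i + 1] - self.query_start_loc[i]
            for i in range(len(self.query_start_loc) - 1)
        ]
        return [s - q for s, q in zip(self.seq_lens, query_lens)]

meta = CommonAttentionMetadataSketch([0, 8, 9], [10, 6])

# flex_attention.py-style caller: result stays on device.
on_device = meta.compute_num_computed_tokens()

# mla/common.py-style caller: one explicit host copy for CPU indexing
# (list(...) stands in for .cpu()).
on_host = list(on_device)
assert on_host == [2, 5]
```

The design choice is to push the transfer decision to the caller: the method itself is sync-free, and the single `.cpu()` in mla/common.py marks the one place a host copy is actually required.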
DarkLight1337
approved these changes
Jan 6, 2026
LucasWilkinson
added a commit
to neuralmagic/vllm
that referenced
this pull request
Jan 6, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
This was referenced Jan 8, 2026
yugong333
pushed a commit
to yugong333/vllm
that referenced
this pull request
Jan 9, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
wangxiyuan
pushed a commit
to vllm-project/vllm-ascend
that referenced
this pull request
Jan 13, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aipaes
pushed a commit
to aipaes/vllm-ascend
that referenced
this pull request
Jan 15, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
akh64bit
pushed a commit
to akh64bit/vllm
that referenced
this pull request
Jan 16, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
dsuhinin
pushed a commit
to dsuhinin/vllm
that referenced
this pull request
Jan 21, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aabbccddwasd
added a commit
to aabbccddwasd/vllm
that referenced
this pull request
Feb 5, 2026
This reverts the change in vllm-project#31773 that replaced seq_lens_cpu with seq_lens.cpu() in the FlashInfer backend. The property access provides better performance by avoiding unnecessary D2H transfers when the cached value is already available. Fixes performance regression on GLM-4.7-GPTQ-INT4-INT8MIX model with MTP (Multi-Token Prediction) enabled, where throughput dropped from 95 to 77 tps. Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
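The rationale behind this revert can be sketched with a caching accessor (hypothetical pure-Python stand-in; in vLLM the cached value is a pre-populated host tensor and the simulated transfer is a real D2H copy): when the model runner has already populated the host copy, a caching property returns it for free, whereas a bare `seq_lens.cpu()` at each call site pays the transfer every time.

```python
class CachedHostCopy:
    """Sketch of why a caching property can beat a bare .cpu() call.

    If the metadata is built with the CPU copy already populated, the
    property returns the cached value and never touches the device; a
    raw seq_lens.cpu() at every call site would pay the device-to-host
    transfer each time.
    """

    def __init__(self, seq_lens, seq_lens_cpu=None):
        self.seq_lens = seq_lens           # device tensor in real code
        self._seq_lens_cpu = seq_lens_cpu  # pre-populated host copy, if any
        self.transfers = 0                 # count simulated D2H copies

    @property
    def seq_lens_cpu(self):
        if self._seq_lens_cpu is None:
            self.transfers += 1                      # simulated D2H transfer
            self._seq_lens_cpu = list(self.seq_lens)  # stands in for .cpu()
        return self._seq_lens_cpu

# Host copy pre-populated: repeated access never transfers.
meta = CachedHostCopy([8, 16], seq_lens_cpu=[8, 16])
for _ in range(3):
    _ = meta.seq_lens_cpu
assert meta.transfers == 0
```

This is the trade-off the PR and the revert sit on opposite sides of: explicit `.cpu()` calls make transfers visible, while the cached property makes repeated host access cheap when the copy already exists.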
aabbccddwasd
added a commit
to aabbccddwasd/vllm
that referenced
this pull request
Feb 8, 2026
This reverts the change in vllm-project#31773 that replaced seq_lens_cpu with seq_lens.cpu() in the FlashInfer backend. The property access provides better performance by avoiding unnecessary D2H transfers when the cached value is already available. Fixes performance regression on GLM-4.7-GPTQ-INT4-INT8MIX model with MTP (Multi-Token Prediction) enabled, where throughput dropped from 95 to 77 tps. Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
ItzDEXX
pushed a commit
to ItzDEXX/vllm
that referenced
this pull request
Feb 19, 2026
[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (vllm-project#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
ZRJ026
pushed a commit
to ZRJ026/vllm-ascend
that referenced
this pull request
Feb 28, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241
pushed a commit
to maoxx241/vllm-ascend
that referenced
this pull request
Mar 2, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
ZRJ026
pushed a commit
to ZRJ026/vllm-ascend
that referenced
this pull request
Mar 4, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ
pushed a commit
to LCAIZJ/vllm-ascend
that referenced
this pull request
Mar 7, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to the breaking change in vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Update tests, MLA backends and CUDA full-attention backends