[Bugfix] Keep all tensors to be on the same device #31958

Closed

wjunLu wants to merge 1 commit into vllm-project:main from wjunLu:bugfix

Conversation


@wjunLu wjunLu commented Jan 8, 2026

When running on Ascend NPU, I hit the following error:

(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2988, in _torch_cuda_wrapper
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     yield
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2934, in capture_model
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     super().capture_model()
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/worker/gpu_model_runner.py", line 4817, in capture_model
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     self._capture_cudagraphs(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/worker/gpu_model_runner.py", line 4895, in _capture_cudagraphs
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     self._dummy_run(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     return func(*args, **kwargs)
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2110, in _dummy_run
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     attn_metadata = self._build_dummy_attn_metadata(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1954, in _build_dummy_attn_metadata
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     attn_metadata_gdn_attention = builder.build_for_cudagraph_capture(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/attention/backends/gdn_attn.py", line 380, in build_for_cudagraph_capture
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     return self.build(0, m, num_accepted_tokens, num_decode_draft_tokens_cpu)
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/attention/backends/gdn_attn.py", line 152, in build
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     context_lens_tensor = m.compute_num_computed_tokens()
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/attention/backends/utils.py", line 139, in compute_num_computed_tokens
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     self._num_computed_tokens_cache = self.seq_lens - query_lens
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                                       ~~~~~~~~~~~~~~^~~~~~~~~~~~
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] RuntimeError: Expected all tensors to be on the same device. Expected NPU tensor, please check whether the input tensor device is correct.
Error: (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] [ERROR] 2026-01-08-03:39:40 (PID:41296, Device:0, RankID:-1) ERR01002 OPS invalid type
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] 

I think this may be due to PR #31773: the two tensors on the right-hand side of self._num_computed_tokens_cache = self.seq_lens - query_lens are not on the same device.
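The failure mode can be sketched without any NPU hardware. The snippet below is an illustrative stand-in, not vLLM code: FakeTensor mimics a torch.Tensor just enough to show why subtracting tensors on different devices raises, and how moving one operand first avoids it.

```python
# Illustrative sketch (not vLLM code): why `seq_lens - query_lens`
# fails when the operands live on different devices, and the fix of
# moving one operand onto the other's device before subtracting.

class FakeTensor:
    """Minimal stand-in for a torch.Tensor with a .device attribute."""
    def __init__(self, data, device):
        self.data = list(data)
        self.device = device

    def to(self, device):
        # Mimics torch.Tensor.to(device): returns a copy on `device`.
        return FakeTensor(self.data, device)

    def __sub__(self, other):
        if self.device != other.device:
            raise RuntimeError(
                "Expected all tensors to be on the same device"
            )
        return FakeTensor(
            [a - b for a, b in zip(self.data, other.data)], self.device
        )

seq_lens = FakeTensor([10, 12], device="npu:0")  # lives on device
query_lens = FakeTensor([1, 1], device="cpu")    # built on CPU

try:
    _ = seq_lens - query_lens
except RuntimeError as e:
    print(e)  # same class of error as in the traceback above

# Fix: move one operand so both live on the same device.
num_computed = seq_lens - query_lens.to(seq_lens.device)
print(num_computed.data)  # [9, 11]
```

With real torch tensors the same repair is one `.to(...)` call on whichever operand is on the wrong device.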

Purpose

Bugfix

Test Plan

Tested on Ascend NPU with

pytest -sv tests/e2e/multicard/4-cards/test_qwen3_next.py::test_qwen3_next_distributed_mp_full_decode_only_tp4

Test Result

With this PR applied, it works now:

(Worker_TP0 pid=48805) query_lens.device = npu:0
(Worker_TP2 pid=48807) self.seq_lens.deveice = npu:2
(Worker_TP2 pid=48807) query_lens.device = npu:2
(Worker_TP3 pid=48808) self.seq_lens.deveice = npu:3
(Worker_TP3 pid=48808) query_lens.device = npu:3
(Worker_TP0 pid=48805) self.seq_lens.deveice = npu:0
(Worker_TP1 pid=48806) self.seq_lens.deveice = npu:1
(Worker_TP0 pid=48805) query_lens.device = npu:0
(Worker_TP1 pid=48806) query_lens.device = npu:1
Processed prompts: 100%|████████████████████████| 4/4 [00:05<00:00,  1.49s/it, est. speed input: 3.36 toks/s, output: 3.36 toks/s]
(Worker_TP0 pid=48805) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP3 pid=48808) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP2 pid=48807) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP1 pid=48806) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
PASSED

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wjunLu <wjunlu217@gmail.com>
@wjunLu wjunLu requested a review from LucasWilkinson as a code owner January 8, 2026 09:35
@mergify mergify bot added the v1 label Jan 8, 2026

github-actions bot commented Jan 8, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Collaborator

@LucasWilkinson LucasWilkinson left a comment


compute_num_computed_tokens is intended to be on device, hence no _cpu suffix; please use compute_num_computed_tokens().cpu() or create a compute_num_computed_tokens_cpu
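The reviewer's two options can be sketched with a toy stand-in. FakeTensor and Meta below are illustrative, not vLLM classes; only the name compute_num_computed_tokens comes from the actual code, and compute_num_computed_tokens_cpu is the hypothetical variant the reviewer proposes.

```python
# Illustrative sketch of the reviewer's suggestion: keep
# compute_num_computed_tokens() on device, and get a CPU result either
# by converting at the call site or via an explicit *_cpu variant.

class FakeTensor:
    """Minimal torch.Tensor stand-in with .device and .cpu()."""
    def __init__(self, data, device):
        self.data, self.device = list(data), device

    def cpu(self):
        # Mimics torch.Tensor.cpu(): returns a copy on the CPU.
        return FakeTensor(self.data, "cpu")

    def __sub__(self, other):
        assert self.device == other.device, "device mismatch"
        return FakeTensor(
            [a - b for a, b in zip(self.data, other.data)], self.device
        )

class Meta:
    """Stand-in for the attention metadata object."""
    def __init__(self, seq_lens, query_lens):
        self.seq_lens, self.query_lens = seq_lens, query_lens

    def compute_num_computed_tokens(self):
        # Stays on device, as the name (no _cpu suffix) implies.
        return self.seq_lens - self.query_lens

    def compute_num_computed_tokens_cpu(self):
        # Hypothetical explicit CPU variant (option 2).
        return self.seq_lens.cpu() - self.query_lens.cpu()

m = Meta(FakeTensor([10, 12], "npu:0"), FakeTensor([1, 1], "npu:0"))
on_device = m.compute_num_computed_tokens()       # device-resident result
on_cpu = m.compute_num_computed_tokens().cpu()    # option 1: convert at call site
also_cpu = m.compute_num_computed_tokens_cpu()    # option 2: dedicated variant
print(on_device.device, on_cpu.device, also_cpu.device)
```

Either option keeps the device-resident path intact for callers that want it, which is the point of leaving the original method's semantics unchanged.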

@LucasWilkinson LucasWilkinson self-assigned this Jan 10, 2026
@wjunLu
Author

wjunLu commented Jan 10, 2026

compute_num_computed_tokens is intended to be on device, hence no _cpu suffix; please use compute_num_computed_tokens().cpu() or create a compute_num_computed_tokens_cpu

Thank you @LucasWilkinson!
I re-checked the code; I think we should refactor vllm-ascend to fix this error, rather than vllm.

@wjunLu wjunLu closed this Jan 10, 2026
