[Bugfix] Keep all tensors to be on the same device #31958

Closed

wjunLu wants to merge 1 commit into vllm-project:main from wjunLu:bugfix

Conversation


@wjunLu wjunLu commented Jan 8, 2026

When running on Ascend NPU, I hit the following error:

(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2988, in _torch_cuda_wrapper
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     yield
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2934, in capture_model
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     super().capture_model()
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/worker/gpu_model_runner.py", line 4817, in capture_model
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     self._capture_cudagraphs(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/worker/gpu_model_runner.py", line 4895, in _capture_cudagraphs
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     self._dummy_run(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     return func(*args, **kwargs)
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2110, in _dummy_run
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     attn_metadata = self._build_dummy_attn_metadata(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1954, in _build_dummy_attn_metadata
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     attn_metadata_gdn_attention = builder.build_for_cudagraph_capture(
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/attention/backends/gdn_attn.py", line 380, in build_for_cudagraph_capture
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     return self.build(0, m, num_accepted_tokens, num_decode_draft_tokens_cpu)
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/attention/backends/gdn_attn.py", line 152, in build
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     context_lens_tensor = m.compute_num_computed_tokens()
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]   File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/attention/backends/utils.py", line 139, in compute_num_computed_tokens
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]     self._num_computed_tokens_cache = self.seq_lens - query_lens
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822]                                       ~~~~~~~~~~~~~~^~~~~~~~~~~~
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] RuntimeError: Expected all tensors to be on the same device. Expected NPU tensor, please check whether the input tensor device is correct.
Error: (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] [ERROR] 2026-01-08-03:39:40 (PID:41296, Device:0, RankID:-1) ERR01002 OPS invalid type
(Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] 

I think this may be due to PR #31773: the two tensors on the right-hand side of self._num_computed_tokens_cache = self.seq_lens - query_lens are not on the same device.
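The failure mode can be sketched without any NPU hardware. The snippet below is an illustrative stand-in, not vLLM code: FakeTensor mimics a torch.Tensor just enough to show why subtracting tensors on different devices raises, and how moving one operand first avoids it.

```python
# Illustrative sketch (not vLLM code): why `seq_lens - query_lens`
# fails when the operands live on different devices, and the fix of
# moving one operand onto the other's device before subtracting.

class FakeTensor:
    """Minimal stand-in for a torch.Tensor with a .device attribute."""
    def __init__(self, data, device):
        self.data = list(data)
        self.device = device

    def to(self, device):
        # Mimics torch.Tensor.to(device): returns a copy on `device`.
        return FakeTensor(self.data, device)

    def __sub__(self, other):
        if self.device != other.device:
            raise RuntimeError(
                "Expected all tensors to be on the same device"
            )
        return FakeTensor(
            [a - b for a, b in zip(self.data, other.data)], self.device
        )

seq_lens = FakeTensor([10, 12], device="npu:0")  # lives on device
query_lens = FakeTensor([1, 1], device="cpu")    # built on CPU

try:
    _ = seq_lens - query_lens
except RuntimeError as e:
    print(e)  # same class of error as in the traceback above

# Fix: move one operand so both live on the same device.
num_computed = seq_lens - query_lens.to(seq_lens.device)
print(num_computed.data)  # [9, 11]
```

With real torch tensors the same repair is one `.to(...)` call on whichever operand is on the wrong device.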

Purpose

Bugfix

Test Plan

Tested on Ascend NPU with

pytest -sv tests/e2e/multicard/4-cards/test_qwen3_next.py::test_qwen3_next_distributed_mp_full_decode_only_tp4

Test Result

With this PR applied, it works now:

(Worker_TP0 pid=48805) query_lens.device = npu:0
(Worker_TP2 pid=48807) self.seq_lens.deveice = npu:2
(Worker_TP2 pid=48807) query_lens.device = npu:2
(Worker_TP3 pid=48808) self.seq_lens.deveice = npu:3
(Worker_TP3 pid=48808) query_lens.device = npu:3
(Worker_TP0 pid=48805) self.seq_lens.deveice = npu:0
(Worker_TP1 pid=48806) self.seq_lens.deveice = npu:1
(Worker_TP0 pid=48805) query_lens.device = npu:0
(Worker_TP1 pid=48806) query_lens.device = npu:1
Processed prompts: 100%|████████████████████████| 4/4 [00:05<00:00,  1.49s/it, est. speed input: 3.36 toks/s, output: 3.36 toks/s]
(Worker_TP0 pid=48805) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP3 pid=48808) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP2 pid=48807) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP1 pid=48806) INFO 01-08 09:31:36 [multiproc_executor.py:707] Parent process exited, terminating worker
PASSED

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wjunLu <wjunlu217@gmail.com>
@wjunLu wjunLu requested a review from LucasWilkinson as a code owner January 8, 2026 09:35
@mergify mergify bot added the v1 label Jan 8, 2026

github-actions bot commented Jan 8, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Collaborator

@LucasWilkinson LucasWilkinson left a comment


compute_num_computed_tokens is intended to be on device, hence no _cpu suffix; please use compute_num_computed_tokens().cpu() or create a compute_num_computed_tokens_cpu
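The reviewer's two options can be sketched with a toy stand-in. FakeTensor and Meta below are illustrative, not vLLM classes; only the name compute_num_computed_tokens comes from the actual code, and compute_num_computed_tokens_cpu is the hypothetical variant the reviewer proposes.

```python
# Illustrative sketch of the reviewer's suggestion: keep
# compute_num_computed_tokens() on device, and get a CPU result either
# by converting at the call site or via an explicit *_cpu variant.

class FakeTensor:
    """Minimal torch.Tensor stand-in with .device and .cpu()."""
    def __init__(self, data, device):
        self.data, self.device = list(data), device

    def cpu(self):
        # Mimics torch.Tensor.cpu(): returns a copy on the CPU.
        return FakeTensor(self.data, "cpu")

    def __sub__(self, other):
        assert self.device == other.device, "device mismatch"
        return FakeTensor(
            [a - b for a, b in zip(self.data, other.data)], self.device
        )

class Meta:
    """Stand-in for the attention metadata object."""
    def __init__(self, seq_lens, query_lens):
        self.seq_lens, self.query_lens = seq_lens, query_lens

    def compute_num_computed_tokens(self):
        # Stays on device, as the name (no _cpu suffix) implies.
        return self.seq_lens - self.query_lens

    def compute_num_computed_tokens_cpu(self):
        # Hypothetical explicit CPU variant (option 2).
        return self.seq_lens.cpu() - self.query_lens.cpu()

m = Meta(FakeTensor([10, 12], "npu:0"), FakeTensor([1, 1], "npu:0"))
on_device = m.compute_num_computed_tokens()       # device-resident result
on_cpu = m.compute_num_computed_tokens().cpu()    # option 1: convert at call site
also_cpu = m.compute_num_computed_tokens_cpu()    # option 2: dedicated variant
print(on_device.device, on_cpu.device, also_cpu.device)
```

Either option keeps the device-resident path intact for callers that want it, which is the point of leaving the original method's semantics unchanged.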

@LucasWilkinson LucasWilkinson self-assigned this Jan 10, 2026
@wjunLu
Author

wjunLu commented Jan 10, 2026

compute_num_computed_tokens is intended to be on device, hence no _cpu suffix; please use compute_num_computed_tokens().cpu() or create a compute_num_computed_tokens_cpu

Thank you @LucasWilkinson!
I re-checked the code; I think we should refactor vllm-ascend to fix this error, rather than vllm.

@wjunLu wjunLu closed this Jan 10, 2026
