
[Performance]zero bubble async scheduling and spec decoding#7640

Open
HF-001 wants to merge 38 commits into vllm-project:main from
HF-001:zero_bubble_async_spec

Conversation

@HF-001
Contributor

@HF-001 HF-001 commented Mar 25, 2026

What this PR does / why we need it?

Refer to vllm-project/vllm#32951. This PR improves the async-ness of spec decoding by optimistically assuming on the CPU that all draft tokens are accepted, and deferring the correction until after the forward pass. The NPU-side tensors are taken as the source of truth.

At present the functionality works correctly, but performance remains unchanged, likely due to a problem with the Triton operator; this is currently being optimized.
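The optimistic-acceptance idea described above can be sketched in a few lines. This is an illustrative model only (the function names `optimistic_advance` and `deferred_correction` are hypothetical, not the PR's actual API): the CPU advances sequence state as if every draft token plus the bonus token were accepted, and the device-side acceptance count later corrects the overshoot.

```python
def optimistic_advance(seq_len: int, num_draft_tokens: int) -> int:
    """CPU side: assume all draft tokens plus the bonus token are accepted."""
    return seq_len + num_draft_tokens + 1

def deferred_correction(optimistic_seq_len: int,
                        num_draft_tokens: int,
                        num_accepted: int) -> int:
    """Device side (source of truth): roll back the rejected drafts."""
    rejected = num_draft_tokens - num_accepted
    return optimistic_seq_len - rejected

# Example: 4 drafts proposed, only 2 accepted by verification.
opt = optimistic_advance(10, 4)            # CPU proceeds with seq_len 15
true_len = deferred_correction(opt, 4, 2)  # corrected to 13 after the pass
print(opt, true_len)
```

The point is that the CPU never blocks on the device to learn the acceptance count; it schedules the next step with the optimistic value and applies the correction later.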

How was this patch tested?

todo

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances speculative decoding performance by introducing zero-bubble asynchronous scheduling. The core improvement lies in optimistically processing draft tokens on the CPU, assuming their acceptance, and then performing necessary corrections on the NPU after the model's forward pass. This approach aims to reduce latency and improve hardware utilization. The changes involve a fundamental shift in how sequence lengths and computed tokens are managed across CPU and GPU, optimizing KV cache slot mapping with a new kernel, and implementing robust deferred state corrections to maintain data consistency in this asynchronous execution model.

Highlights

  • Zero-Bubble Async Speculative Decoding: Implemented a new asynchronous scheduling mechanism for speculative decoding that optimistically assumes draft tokens are accepted on the CPU, deferring corrections to the NPU after the forward pass to minimize idle time.
  • Optimistic CPU State Management: Introduced optimistic_seq_lens_cpu to manage sequence lengths on the CPU, with GPU tensors serving as the authoritative source of truth for actual sequence lengths and computed tokens.
  • Kernel-based Slot Mapping: Refactored the compute_slot_mapping function in block_table.py to utilize a more efficient kernel-based approach, improving performance for KV cache management.
  • Deferred State Corrections: Added mechanisms in execute_model to defer state corrections, ensuring proper synchronization of CPU and GPU states for num_computed_tokens and num_accepted_tokens within the asynchronous flow.
  • Attention Metadata Alignment: Modified attention metadata construction and token preparation to align with the new asynchronous speculative decoding logic, including conditional handling of CPU-side sequence lengths and computed tokens.
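The "Kernel-based Slot Mapping" highlight refers to mapping each token position to a physical KV-cache slot via the block table. The sketch below shows the arithmetic such a kernel typically performs in a standard paged-KV layout (pure Python for illustration; the function name and signature are assumptions, not the PR's actual `compute_slot_mapping` kernel):

```python
def compute_slot_mapping(positions, block_table, block_size):
    """Map each token position to a physical KV-cache slot:
    slot = physical_block_id * block_size + offset_within_block."""
    slots = []
    for pos in positions:
        logical_block = pos // block_size            # which logical block
        offset = pos % block_size                    # offset inside the block
        physical_block = block_table[logical_block]  # logical -> physical id
        slots.append(physical_block * block_size + offset)
    return slots

# block_size 16; logical block 0 -> physical 7, logical 1 -> physical 3
print(compute_slot_mapping([0, 1, 16, 17], [7, 3], 16))  # [112, 113, 48, 49]
```

Doing this per-token arithmetic in a device kernel over GPU-side positions (rather than on the CPU) is what lets slot mapping stay consistent with the authoritative device state in the async flow.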


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR implements significant improvements for asynchronous speculative decoding and NPU (Ascend) specific optimizations within vLLM. It shifts towards GPU-centric state management for attention metadata, introducing optimistic_seq_lens_cpu for speculative decoding and moving slot mapping computation to a GPU kernel. Additionally, it includes deferred state corrections and re-synchronization for Mamba cache alignment. A critical review comment highlights a potential integer overflow risk from changing num_accepted_tokens_cpu_tensor from torch.int64 to torch.int32.

# Speculative decoding
self.num_accepted_tokens_cpu_tensor = torch.ones(
-    (max_num_reqs,), dtype=torch.int64, device="cpu", pin_memory=pin_memory
+    (max_num_reqs,), dtype=torch.int32, device="cpu", pin_memory=pin_memory
Contributor


critical

Changing the dtype of num_accepted_tokens_cpu_tensor from torch.int64 to torch.int32 could lead to an integer overflow if the number of accepted tokens for a request exceeds the maximum value for a 32-bit signed integer (2,147,483,647). Please confirm that int32 is sufficient for all expected scenarios, or revert to int64 to prevent potential data loss or incorrect behavior.

Suggested change
-    (max_num_reqs,), dtype=torch.int32, device="cpu", pin_memory=pin_memory
+    (max_num_reqs,), dtype=torch.int64, device="cpu", pin_memory=pin_memory
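The wraparound the reviewer warns about can be demonstrated with the standard library; `ctypes.c_int32` reproduces 32-bit signed overflow semantics (the helper `wrap_int32` is illustrative, unrelated to the PR's code):

```python
import ctypes

def wrap_int32(x: int) -> int:
    """Truncate a Python int to 32-bit signed semantics, as an int32
    tensor would store it."""
    return ctypes.c_int32(x).value

print(wrap_int32(2**31 - 1))      # 2147483647 : the int32 maximum
print(wrap_int32(2**31 - 1 + 1))  # -2147483648 : silent wraparound
```

In practice, whether int32 suffices here depends on whether an accepted-token count can ever approach 2^31, which the reviewer asks the author to confirm.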

Comment on lines +1180 to +1184
if common_attn_metadata.seq_lens_cpu is not None:
    common_attn_metadata.seq_lens_cpu[:batch_size] = (
        common_attn_metadata.seq_lens_cpu[:batch_size] + 1
    )
    exceeds_mask = common_attn_metadata.seq_lens_cpu[:batch_size] >= self.max_model_len
    common_attn_metadata.seq_lens_cpu[:batch_size].masked_fill_(exceeds_mask, 1)
if common_attn_metadata.num_computed_tokens_cpu is not None:
Contributor


critical

The addition of if ... is not None checks for common_attn_metadata.seq_lens_cpu and common_attn_metadata.num_computed_tokens_cpu is a critical improvement. This prevents AttributeError in scenarios where these attributes might be None due to the async spec decode logic, ensuring robustness and correctness.

Comment on lines +837 to +863
# Update num_computed_tokens on GPU. In async spec decode,
# CPU values are optimistic (all drafts accepted). The kernel
# corrects on GPU using the previous step's
# valid_sampled_token_count_gpu. Otherwise, just copy from CPU.
if (
    self.use_async_spec_decode
    and self.valid_sampled_token_count_gpu is not None
    and prev_req_id_to_index
):
    self.prev_positions.copy_to_gpu(num_reqs)
    self.prev_num_draft_tokens.copy_to_gpu()
    cpu_values = self.input_batch.num_computed_tokens_cpu_tensor[:num_reqs].to(
        device=self.device, non_blocking=True
    )
    update_num_computed_tokens_for_batch_change(
        self.num_computed_tokens,
        self.num_accepted_tokens.gpu[:num_reqs],
        self.prev_positions.gpu[:num_reqs],
        self.valid_sampled_token_count_gpu,
        self.prev_num_draft_tokens.gpu,
        cpu_values,
    )
else:
    self.num_computed_tokens[:num_reqs].copy_(
        self.input_batch.num_computed_tokens_cpu_tensor[:num_reqs],
        non_blocking=True,
    )
Contributor


critical

The logic for conditionally updating self.num_computed_tokens based on use_async_spec_decode is a core part of the asynchronous scheduling. When use_async_spec_decode is enabled, the GPU-side correction using update_num_computed_tokens_for_batch_change is essential for maintaining data consistency between the optimistic CPU state and the authoritative NPU state. This is a critical correctness change for the new speculative decoding approach.

Comment on lines +882 to +886
self.input_batch.block_table.compute_slot_mapping(
    num_reqs,
    self.query_start_loc.gpu[: num_reqs + 1],
    self.positions[:total_num_scheduled_tokens],
)
Contributor


critical

The compute_slot_mapping call has been moved and now correctly uses the GPU-side self.positions and self.query_start_loc.gpu. This is a critical change to ensure that the slot mapping is computed based on the most up-to-date and authoritative GPU state, which is essential for the attention mechanism.

Comment on lines +1356 to +1358
if deferred_state_corrections_fn:
    deferred_state_corrections_fn()
    deferred_state_corrections_fn = None
Contributor


critical

Applying deferred_state_corrections_fn before mamba_utils.preprocess_mamba is a critical correctness fix. preprocess_mamba relies on req_state.num_computed_tokens (CPU), so ensuring these corrections are applied beforehand prevents preprocess_mamba from operating on an outdated or optimistic CPU state.
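The deferred-correction pattern the reviewer describes can be sketched as a closure that captures the pending CPU-state fixes and is applied exactly once, before any consumer that reads CPU state. All names below (`make_deferred_corrections`, the `state` dict) are illustrative, not the PR's actual structures:

```python
def make_deferred_corrections(req_state, accepted_counts):
    """Capture pending CPU-state fixes; apply them later, exactly once."""
    def apply():
        for req_id, accepted in accepted_counts.items():
            req_state[req_id] += accepted  # sync CPU num_computed_tokens
    return apply

state = {"req-0": 10, "req-1": 20}
deferred = make_deferred_corrections(state, {"req-0": 2, "req-1": 3})

# ... forward pass runs while the CPU state is still optimistic ...

if deferred:      # apply before anything (e.g. preprocess_mamba) reads CPU state
    deferred()
    deferred = None
print(state)  # {'req-0': 12, 'req-1': 23}
```

Setting the callback to `None` after invocation mirrors the PR's `deferred_state_corrections_fn = None`, guarding against a double apply.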

Comment on lines +2216 to +2219
if self.use_async_spec_decode:
    # GPU tensors are authoritative in async mode.
    seq_lens_cpu = None
    num_computed_tokens_cpu = None
Contributor


critical

Setting seq_lens_cpu and num_computed_tokens_cpu to None when use_async_spec_decode is enabled is a critical change. This explicitly signals that in async mode, the GPU tensors are authoritative, preventing accidental reliance on potentially optimistic or outdated CPU values in AscendCommonAttentionMetadata. This is crucial for maintaining the integrity of the async scheduling logic.

Comment on lines +873 to +879
self.positions[:total_num_scheduled_tokens] = (
    self.num_computed_tokens[req_indices_gpu].to(torch.int64)
    + self.query_pos.gpu[:total_num_scheduled_tokens]
)
self.seq_lens[:num_reqs] = (
    self.num_computed_tokens[:num_reqs] + num_scheduled_tokens_gpu
)
Contributor


high

The calculation of self.positions and self.seq_lens directly on the GPU using self.num_computed_tokens and self.query_pos.gpu is a significant change. This aligns with the strategy of making NPU-side tensors the source of truth and reduces CPU-GPU synchronization overhead. This is a high-severity correctness and performance improvement.
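The tensor arithmetic in that snippet can be restated with plain lists (purely illustrative; `req_indices` and `query_pos` here stand in for the per-token index tensors in the PR): each token's position is its request's computed-token count plus the token's offset within this step's query, and each request's sequence length is computed tokens plus tokens scheduled this step.

```python
num_computed = [5, 8]           # per-request computed tokens
num_scheduled = [3, 2]          # tokens scheduled this step per request
req_indices = [0, 0, 0, 1, 1]   # owning request of each scheduled token
query_pos = [0, 1, 2, 0, 1]     # token offset within its request's query

positions = [num_computed[r] + q for r, q in zip(req_indices, query_pos)]
seq_lens = [c + s for c, s in zip(num_computed, num_scheduled)]
print(positions)  # [5, 6, 7, 8, 9]
print(seq_lens)   # [8, 10]
```

Doing this gather-and-add on the device means the host never needs to wait for the true `num_computed_tokens` values; that is the synchronization overhead the review calls out.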

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@leo-pony
Collaborator

leo-pony commented Mar 30, 2026

@HF-001 We are also working on adapting vllm-ascend to vLLM PR vllm-project/vllm#32951 (PR: #7610).
Can we have a conversation?

@HF-001
Contributor Author

HF-001 commented Mar 30, 2026

> @HF-001 We are also working on adapting vllm-ascend to vLLM PR vllm-project/vllm#32951 (PR: #7610). Can we have a conversation? (WeChat wx380024392)

@leo-pony Alright, I've already sent you a friend request.

@HF-001 HF-001 requested a review from Yikun as a code owner March 30, 2026 10:02
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@HF-001 HF-001 requested a review from realliujiaxu as a code owner March 30, 2026 10:03
@HF-001 HF-001 changed the title [wip][Spec Decoding] Zero-bubble async scheduling + spec decoding [wip]zero bubble async scheduling and spec decoding Mar 30, 2026
@HF-001 HF-001 changed the title [wip]zero bubble async scheduling and spec decoding [performance]zero bubble async scheduling and spec decoding Mar 30, 2026
@HF-001 HF-001 changed the title [performance]zero bubble async scheduling and spec decoding [Performance]zero bubble async scheduling and spec decoding Mar 31, 2026
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from 7a17418 to ec20743 Compare March 31, 2026 02:00
@HF-001 HF-001 requested review from LCAIZJ and zzzzwwjj as code owners March 31, 2026 02:00
01267596 and others added 6 commits March 31, 2026 02:14
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Upstream vLLM has removed the vllm_is_batch_invariant() function from
batch_invariant.py and now uses envs.VLLM_BATCH_INVARIANT directly.

Create a compatibility wrapper in vllm_ascend/batch_invariant.py that
checks envs.VLLM_BATCH_INVARIANT and update all imports across the codebase
to use the local implementation instead of trying to import from vllm.

Changes:
- Add vllm_is_batch_invariant() function to vllm_ascend/batch_invariant.py
- Update imports in ascend_config.py, sample/sampler.py, and utils.py

Fixes: ImportError when running multicard tests

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from ec20743 to 1dfa935 Compare March 31, 2026 02:25
@leo-pony leo-pony added ready read for review ready-for-test start test by label for PR labels Mar 31, 2026
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from 609eb30 to 946107d Compare April 1, 2026 03:27
01267596 added 2 commits April 1, 2026 06:06
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from 93f70f3 to 5c88ee5 Compare April 1, 2026 06:51
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from 74a2015 to c1e05db Compare April 1, 2026 11:01
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: kx <1670186653@qq.com>
01267596 added 3 commits April 2, 2026 01:14
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from e4f2c4a to e931e98 Compare April 2, 2026 06:39
01267596 added 2 commits April 2, 2026 09:00
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from 6b68b1a to 4363d13 Compare April 2, 2026 09:03
HF-001 and others added 4 commits April 2, 2026 22:17
Signed-off-by: HF-001 <1670186653@qq.com>
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001 HF-001 force-pushed the zero_bubble_async_spec branch from 8a9b13f to c83ea55 Compare April 3, 2026 06:47
fix
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
MengqingCao pushed a commit that referenced this pull request Apr 3, 2026
### What this PR does / why we need it?
Main-to-main upgrade to vLLM 0324.
Fixes the following breaking changes:

1. PR [#37487](vllm-project/vllm#37487) [V0
Deprecation] Refactor kv cache from list to element (c59a132f9) —
self.kv_cache changed from list[tensor] (per virtual engine) to a single
tensor.

2. PR [#37874](vllm-project/vllm#37874) [KV
Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend
abstraction, restructure into cpu/ package (e3c6c10ca) —
LRUOffloadingManager + CPUBackend were refactored into CPUOffloadingManager.

3. PR [#32951](vllm-project/vllm#32951)
[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding
(fafe76b4a) — a) changes self.positions and self.seq_lens from
CpuGpuBuffer to plain GPU tensors; b) changes the _get_cumsum_and_arange
output parameter. In addition, _prepare_input_ids gains a num_reqs argument.

4. PR [#35007](https://github.com/vllm-project/vllm/pull/35007) [Bugfix]
Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var
warning (dc6908ac6) — deletes vllm_is_batch_invariant() and the constant
VLLM_BATCH_INVARIANT, replacing them with vllm.envs.

Known issues:
1. The 310p Qwen3.5 test fails because of a qwen3.5 patch failure; see issue #7976.
@YangShuai52 is fixing it.

### Does this PR introduce _any_ user-facing change?
1. Zero-bubble async scheduling + spec decode needs the NPU
_compute_slot_mapping_kernel and the corresponding deferred
accepted-draft-token validation support (see PR #7640), so this PR
disables the async scheduler in the spec decode case. This way, the
main-to-main upgrade can be developed in parallel with Spec Decode + Async
scheduler until the next release version.

Co-Authored-By: zhaomingyu <zhaomingyu13@h-partners.com>
Co-Authored-By: wangbj127 <wangbj1207@126.com>
Co-Authored-By: SidaoY <1024863041@qq.com>
Co-Authored-By: 22dimensions <waitingwind@foxmail.com>

- vLLM main:
vllm-project/vllm@35141a7
---------
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: wangbj127 <wangbj1207@126.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: wangbj127 <wangbj1207@126.com>
@github-actions
Contributor

github-actions bot commented Apr 3, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.


Labels

merge-conflicts ready read for review ready-for-test start test by label for PR
