[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding by MatthewBonanni · Pull Request #32951 · vllm-project/vllm

MatthewBonanni · 2026-01-23T17:01:50Z

Purpose

This is a refactor of #29957. It improves the async-ness of spec decoding by optimistically assuming all draft tokens are accepted on the CPU and deferring the correction until after the forward pass. The GPU-side tensors are taken as the source of truth.

The compute_slot_mappings_kernel from model runner V2 (plus @yewentao256's changes from #34179) is adapted into V1 here.

Test Plan

On H100:

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --no-enable-prefix-caching \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

with

vllm bench serve \
    --dataset-name spec_bench \
    --dataset-path question.jsonl \
    --spec-bench-output-len 2048 \
    --num-prompts 10 \
    --max-concurrency 1 \
    --seed 42 --ignore-eos \
    --temperature 0 \
    --tokenizer deepseek-ai/DeepSeek-V3

Test Result

TPOT 9.19 ms -> 8.90 ms (3% speedup)

Main

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  188.82
Total input tokens:                      801
Total generated tokens:                  20480
Request throughput (req/s):              0.05
Output token throughput (tok/s):         108.46
Peak output token throughput (tok/s):    57.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          112.70
---------------Time to First Token----------------
Mean TTFT (ms):                          62.86
Median TTFT (ms):                        57.16
P99 TTFT (ms):                           105.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.19
Median TPOT (ms):                        9.19
P99 TPOT (ms):                           9.51
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.79
Median ITL (ms):                         17.78
P99 ITL (ms):                            18.56
---------------Speculative Decoding---------------
Acceptance rate (%):                     93.56
Acceptance length:                       1.94
Drafts:                                  10578
Draft tokens:                            10578
Accepted tokens:                         9897
Per-position acceptance (%):
  Position 0:                            93.56
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  182.79
Total input tokens:                      801
Total generated tokens:                  20480
Request throughput (req/s):              0.05
Output token throughput (tok/s):         112.04
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          116.42
---------------Time to First Token----------------
Mean TTFT (ms):                          53.30
Median TTFT (ms):                        52.90
P99 TTFT (ms):                           80.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.90
Median TPOT (ms):                        8.89
P99 TPOT (ms):                           9.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.27
Median ITL (ms):                         17.27
P99 ITL (ms):                            17.90
---------------Speculative Decoding---------------
Acceptance rate (%):                     94.05
Acceptance length:                       1.94
Drafts:                                  10551
Draft tokens:                            10551
Accepted tokens:                         9923
Per-position acceptance (%):
  Position 0:                            94.05
==================================================

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

…ble_mode Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

mergify · 2026-01-23T17:02:38Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

gemini-code-assist

Code Review

This pull request introduces significant changes to support zero-bubble async scheduling and speculative decoding. The refactoring primarily focuses on shifting state management and computations to the GPU to minimize CPU-GPU synchronization overhead, which is crucial for performance in asynchronous modes. Key changes include conditional disabling of cascade attention for async speculative decoding, GPU-side computation of slot mappings, and a revised approach to handling accepted and rejected tokens directly on the GPU. The changes in gpu_model_runner.py are particularly impactful, modifying how num_computed_tokens and mrope_position_delta are managed and updated. Overall, the changes aim to improve the efficiency and correctness of async speculative decoding.

I am having trouble creating individual review comments. Click here to see my feedback.

vllm/v1/worker/gpu_model_runner.py (994-996)

In async speculative decoding mode, num_accepted is set to req_state.prev_num_draft_len and num_rejected to 0. This implies an optimistic assumption that all draft tokens are accepted, and no placeholders are added to output_token_ids at this stage. This is a critical change in logic compared to the non-async path. It relies heavily on update_async_output_token_ids to correctly extend output_token_ids later. Please ensure this optimistic approach is robust and correctly handled in all scenarios, especially concerning potential rollbacks or partial acceptances.

vllm/v1/worker/gpu_model_runner.py (1531-1572)

This block introduces complex logic for adjusting num_computed_tokens on the GPU for async speculative decoding by subtracting rejected tokens from the previous step. The calculation of num_rejected involves prev_draft_lens_gpu + 1 - self.valid_sampled_token_count_gpu[prev_indices_gpu].int(). This is a critical piece of the GPU-first state management. Any inaccuracies here could lead to incorrect sequence lengths or KV cache indexing. Thorough testing of this calculation under various acceptance/rejection scenarios is essential.

vllm/v1/worker/gpu_model_runner.py (3760-3765)

This new block for async speculative decoding directly updates self.num_computed_tokens.gpu with valid_sampled_tokens_count. This is a crucial step for maintaining the GPU as the source of truth for computed tokens. It's vital that valid_sampled_tokens_count accurately reflects the accepted tokens and that this addition correctly updates the cumulative count, as this directly influences subsequent sequence length and position calculations.

vllm/v1/worker/gpu_model_runner.py (960-962)

The valid_sampled_token_count is initialized as an empty list and conditionally populated only for non-async speculative decoding. For async spec decode, it remains empty, which might lead to unexpected behavior in downstream logic that expects this list to contain valid counts, even if zero. While the subsequent if self.use_async_spec_decode: block handles this, it's important to ensure that any other parts of the code relying on valid_sampled_token_count are aware of its potentially empty state in async mode.

vllm/v1/worker/gpu_model_runner.py (1593-1597)

The computation of seq_lens is now performed directly on the GPU using self.num_computed_tokens.gpu and num_scheduled_tokens_gpu. This is a significant shift from CPU-based computation. Ensure that self.num_computed_tokens.gpu is always up-to-date and accurate before this operation, especially considering the adjustments made for rejected tokens in the preceding block. Incorrect seq_lens can lead to attention calculation errors.

vllm/v1/worker/gpu_model_runner.py (1622-1625)

The call to self.input_batch.block_table.compute_slot_mapping_gpu indicates that slot mapping is now entirely handled on the GPU. This is a good performance optimization, but it's crucial to verify that the req_indices_gpu and self.positions.gpu tensors are correctly prepared and synchronized before this call to prevent any data inconsistencies or indexing errors on the device.

vllm/v1/worker/gpu_model_runner.py (1726-1746)

The conditional logic for populating self.num_accepted_tokens.gpu now prioritizes self.valid_sampled_token_count_gpu in async spec decode mode. This is a critical optimization to avoid CPU synchronization. It's important to confirm that valid_sampled_token_count_gpu is reliably available and contains the correct data when use_async_spec_decode is true, as any error here would directly impact the acceptance logic in speculative decoding.

vllm/v1/worker/gpu_model_runner.py (1780-1787)

For async speculative decoding, seq_lens_cpu and num_computed_tokens_cpu are explicitly set to None. This change implies that these CPU-side copies are no longer considered valid or necessary in this mode, reinforcing the GPU-first approach. Ensure that no other parts of the system inadvertently rely on these CPU tensors being populated when async spec decode is active, as this could lead to unexpected None access errors.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

izhuhaoran

@MatthewBonanni Thanks for your work—overall LGTM, just a few minor questions.

vllm/v1/executor/multiproc_executor.py

vllm/v1/worker/gpu_model_runner.py

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

vllm/config/vllm.py

vllm/v1/worker/gpu_model_runner.py

izhuhaoran

@MatthewBonanni Thanks for the update, I’ve left a few more suggestions.

Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

with vllm top commit 1b6cb920e6ebcac57154e6154578c39d4892a16c has some diffs with vllm-omni, vllm-project/vllm#32104 vllm-project/vllm#32951 vllm-project/vllm#37287 vllm-project/vllm#36483 just modify vllm-omni to work Signed-off-by: Michael Qiu <qiudayu.qdy@antgroup.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>

caused by: vllm-project/vllm#32951 1. AscendEagleProposer missing runner attribute: Added self.runner assignment right after parent __init__ to ensure availability - Affected 32 test cases (spec_decode tests) 2. Tensor.gpu deprecated API: Added _get_device_tensor() compatibility wrapper to handle both CpuGpuBuffer and direct Tensor objects - Affected 5 test cases 3. PrefillNoCache mode doesn't need to read from key_cache, only write to it. Skip key_cache initialization requirement for PrefillNoCache mode. 4. For plain GPU tensors, fill the tensor directly instead of calling .cpu() which creates a separate CPU copy that won't affect the GPU tensor. 5. Fixed unprotected buffer property accesses that could cause garbage output: - Line 919: query_start_loc.gpu access for logits_indices calculation - Line 1071: input_ids.gpu access for draft token computation - Lines 1147-1149: pcp_manager buffer access - Lines 1171, 1207: input_ids.gpu slicing for target tokens - Line 1399: query_start_loc.np assignment - Lines 859-862: mrope_positions gpu/cpu access 6. These were causing non-deterministic/garbage output during generation. 7. Fix unprotected access to xdrope_positions.gpu for plain tensor compatibility. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>

…re_next_token_ids_padded) When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951), optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens is passed to prepare_next_token_ids_padded as seq_lens_cpu. This value is inflated relative to the actual committed output_token_ids because _prepare_inputs appends -1 placeholder slots optimistically. The backup token lookup calls: request.get_token_id(seq_lens_cpu[i]) where seq_lens_cpu[i] points one past the end of the committed tokens, causing get_token_id() to return -1 (placeholder). The drafter then receives -1 as its next input token, which corrupts its hidden state and degrades the draft acceptance rate — causing the Nemotron-3-Super-120B BF16 GSM8K score to drop from ~0.93 to ~0.74. Fix: use (num_tokens_no_spec[i] - 1) — the index of the last committed output token — for the backup token lookup in both EagleProposer (eagle.py) and ExtractHiddenStatesProposer (extract_hidden_states.py). num_tokens_no_spec is set to request.num_tokens before the optimistic extend, so it always points to a valid token slot. Fixes: vllm-project#38098 Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>

caused by: vllm-project/vllm#32951 1. AscendEagleProposer missing runner attribute: Added self.runner assignment right after parent __init__ to ensure availability - Affected 32 test cases (spec_decode tests) 2. Tensor.gpu deprecated API: Added _get_device_tensor() compatibility wrapper to handle both CpuGpuBuffer and direct Tensor objects - Affected 5 test cases 3. PrefillNoCache mode doesn't need to read from key_cache, only write to it. Skip key_cache initialization requirement for PrefillNoCache mode. 4. For plain GPU tensors, fill the tensor directly instead of calling .cpu() which creates a separate CPU copy that won't affect the GPU tensor. 5. Fixed unprotected buffer property accesses that could cause garbage output: - Line 919: query_start_loc.gpu access for logits_indices calculation - Line 1071: input_ids.gpu access for draft token computation - Lines 1147-1149: pcp_manager buffer access - Lines 1171, 1207: input_ids.gpu slicing for target tokens - Line 1399: query_start_loc.np assignment - Lines 859-862: mrope_positions gpu/cpu access 6. These were causing non-deterministic/garbage output during generation. 7. Fix unprotected access to xdrope_positions.gpu for plain tensor compatibility. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>

caused by: vllm-project/vllm#32951 1. AscendEagleProposer missing runner attribute: Added self.runner assignment right after parent __init__ to ensure availability - Affected 32 test cases (spec_decode tests) 2. Tensor.gpu deprecated API: Added _get_device_tensor() compatibility wrapper to handle both CpuGpuBuffer and direct Tensor objects - Affected 5 test cases 3. PrefillNoCache mode doesn't need to read from key_cache, only write to it. Skip key_cache initialization requirement for PrefillNoCache mode. 4. For plain GPU tensors, fill the tensor directly instead of calling .cpu() which creates a separate CPU copy that won't affect the GPU tensor. 5. Fixed unprotected buffer property accesses that could cause garbage output: - Line 919: query_start_loc.gpu access for logits_indices calculation - Line 1071: input_ids.gpu access for draft token computation - Lines 1147-1149: pcp_manager buffer access - Lines 1171, 1207: input_ids.gpu slicing for target tokens - Line 1399: query_start_loc.np assignment - Lines 859-862: mrope_positions gpu/cpu access 6. These were causing non-deterministic/garbage output during generation. 7. Fix unprotected access to xdrope_positions.gpu for plain tensor compatibility. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

caused by: vllm-project/vllm#32951 1. AscendEagleProposer missing runner attribute: Added self.runner assignment right after parent __init__ to ensure availability - Affected 32 test cases (spec_decode tests) 2. Tensor.gpu deprecated API: Added _get_device_tensor() compatibility wrapper to handle both CpuGpuBuffer and direct Tensor objects - Affected 5 test cases 3. PrefillNoCache mode doesn't need to read from key_cache, only write to it. Skip key_cache initialization requirement for PrefillNoCache mode. 4. For plain GPU tensors, fill the tensor directly instead of calling .cpu() which creates a separate CPU copy that won't affect the GPU tensor. 5. Fixed unprotected buffer property accesses that could cause garbage output: - Line 919: query_start_loc.gpu access for logits_indices calculation - Line 1071: input_ids.gpu access for draft token computation - Lines 1147-1149: pcp_manager buffer access - Lines 1171, 1207: input_ids.gpu slicing for target tokens - Line 1399: query_start_loc.np assignment - Lines 859-862: mrope_positions gpu/cpu access 6. These were causing non-deterministic/garbage output during generation. 7. Fix unprotected access to xdrope_positions.gpu for plain tensor compatibility. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> Signed-off-by: leo-pony <nengjunma@outlook.com>

…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>

izhuhaoran and others added 7 commits January 16, 2026 00:55

[Perf][Async] Implement zero-bubble async speculative decoding

b141669

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

skip compute_slot_mapping for async_spec_zero_bubble_mode

679054f

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

remove seq_lens_cpu & num_computed_tokens_cpu for async_spec_zero_bub…

527b55d

…ble_mode Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

Get rid of async_spec_zero_bubble_mode option

8c2ce4c

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Fully async version

1554ab7

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Increase max_concurrent_batches

315e7e6

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Handle reordering

15d8ee9

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

mergify bot added speculative-decoding v1 labels Jan 23, 2026

mergify bot added the needs-rebase label Jan 23, 2026

Merge branch 'main' into async-eagle-mod

abb8ba7

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

gemini-code-assist bot reviewed Jan 23, 2026

View reviewed changes

mergify bot removed the needs-rebase label Jan 23, 2026

MatthewBonanni added 2 commits January 23, 2026 18:31

Fix

dd8b7c3

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Merge branch 'main' into async-eagle-mod

1573015

MatthewBonanni changed the title ~~[WIP][Async Scheduling][Spec Decoding] Zero-bubble async scheduling + spec decoding~~ [WIP][Async][Spec Decoding] Zero-bubble async scheduling + spec decoding Jan 23, 2026

MatthewBonanni added 3 commits January 23, 2026 18:37

Cleanup

317b452

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Cleanup

e7e39ce

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Fix

6fb3042

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

izhuhaoran reviewed Jan 26, 2026

View reviewed changes

MatthewBonanni added 2 commits January 26, 2026 18:00

Cleanup

7ff5674

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Fix hybrid

c491421

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>