[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding#32951
MatthewBonanni merged 112 commits into vllm-project:main
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces significant changes to support zero-bubble async scheduling and speculative decoding. The refactoring primarily focuses on shifting state management and computations to the GPU to minimize CPU-GPU synchronization overhead, which is crucial for performance in asynchronous modes. Key changes include conditional disabling of cascade attention for async speculative decoding, GPU-side computation of slot mappings, and a revised approach to handling accepted and rejected tokens directly on the GPU. The changes in gpu_model_runner.py are particularly impactful, modifying how num_computed_tokens and mrope_position_delta are managed and updated. Overall, the changes aim to improve the efficiency and correctness of async speculative decoding.
vllm/v1/worker/gpu_model_runner.py (994-996)
In async speculative decoding mode, num_accepted is set to req_state.prev_num_draft_len and num_rejected to 0. This implies an optimistic assumption that all draft tokens are accepted, and no placeholders are added to output_token_ids at this stage. This is a critical change in logic compared to the non-async path. It relies heavily on update_async_output_token_ids to correctly extend output_token_ids later. Please ensure this optimistic approach is robust and correctly handled in all scenarios, especially concerning potential rollbacks or partial acceptances.
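The optimistic bookkeeping described above can be sketched in plain Python (hypothetical helper names; lists stand in for the real request state): placeholders are appended as if every draft token were accepted, and update_async_output_token_ids later replaces them with what the sampler actually kept.

```python
def extend_optimistically(output_token_ids: list[int], num_draft: int) -> None:
    # Assume all drafts accepted: append one -1 placeholder per draft token.
    output_token_ids.extend([-1] * num_draft)

def update_async_output_token_ids(
    output_token_ids: list[int], sampled: list[int], num_draft: int
) -> None:
    # Drop the placeholders, then append the tokens actually sampled
    # (accepted drafts plus the one bonus token).
    del output_token_ids[len(output_token_ids) - num_draft:]
    output_token_ids.extend(sampled)

ids = [10, 11]
extend_optimistically(ids, 3)                      # 3 drafts proposed
assert ids == [10, 11, -1, -1, -1]
update_async_output_token_ids(ids, [7, 8, 42], 3)  # 2 drafts accepted + bonus
assert ids == [10, 11, 7, 8, 42]
```

Partial acceptance is handled by the final truncate-and-extend: the corrected list can end up shorter than the optimistic one.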
vllm/v1/worker/gpu_model_runner.py (1531-1572)
This block introduces complex logic for adjusting num_computed_tokens on the GPU for async speculative decoding by subtracting rejected tokens from the previous step. The calculation of num_rejected involves prev_draft_lens_gpu + 1 - self.valid_sampled_token_count_gpu[prev_indices_gpu].int(). This is a critical piece of the GPU-first state management. Any inaccuracies here could lead to incorrect sequence lengths or KV cache indexing. Thorough testing of this calculation under various acceptance/rejection scenarios is essential.
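A CPU-side sketch of that rejection count, with Python lists standing in for the GPU tensors (names follow the review text but the helper itself is illustrative): valid_counts holds, per previous-step request, the accepted drafts plus one bonus token, and prev_indices maps each current batch slot to its previous-step slot.

```python
def compute_num_rejected(prev_draft_lens, valid_counts, prev_indices):
    # rejected = (drafts proposed + 1 bonus) - tokens the sampler kept
    return [
        prev_draft_lens[i] + 1 - valid_counts[prev_indices[i]]
        for i in range(len(prev_draft_lens))
    ]

# Request 0: 3 drafts, all accepted (3 + bonus = 4 valid) -> 0 rejected.
# Request 1: 2 drafts, none accepted (bonus only = 1 valid) -> 2 rejected.
assert compute_num_rejected([3, 2], [4, 1], [0, 1]) == [0, 2]
```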
vllm/v1/worker/gpu_model_runner.py (3760-3765)
This new block for async speculative decoding directly updates self.num_computed_tokens.gpu with valid_sampled_tokens_count. This is a crucial step for maintaining the GPU as the source of truth for computed tokens. It's vital that valid_sampled_tokens_count accurately reflects the accepted tokens and that this addition correctly updates the cumulative count, as this directly influences subsequent sequence length and position calculations.
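A minimal sketch of that cumulative update (plain Python, illustrative only): num_computed_tokens advances by exactly the tokens the sampler kept, i.e. accepted drafts plus the bonus token.

```python
def advance_num_computed_tokens(num_computed, valid_counts):
    # element-wise: committed tokens grow only by what was actually sampled
    return [c + v for c, v in zip(num_computed, valid_counts)]

assert advance_num_computed_tokens([10, 20], [4, 1]) == [14, 21]
```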
vllm/v1/worker/gpu_model_runner.py (960-962)
The valid_sampled_token_count is initialized as an empty list and conditionally populated only for non-async speculative decoding. For async spec decode, it remains empty, which might lead to unexpected behavior in downstream logic that expects this list to contain valid counts, even if zero. While the subsequent if self.use_async_spec_decode: block handles this, it's important to ensure that any other parts of the code relying on valid_sampled_token_count are aware of its potentially empty state in async mode.
vllm/v1/worker/gpu_model_runner.py (1593-1597)
The computation of seq_lens is now performed directly on the GPU using self.num_computed_tokens.gpu and num_scheduled_tokens_gpu. This is a significant shift from CPU-based computation. Ensure that self.num_computed_tokens.gpu is always up-to-date and accurate before this operation, especially considering the adjustments made for rejected tokens in the preceding block. Incorrect seq_lens can lead to attention calculation errors.
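The ordering this comment depends on can be sketched end-to-end (a hypothetical CPU version; lists stand in for the device tensors): num_computed_tokens is first corrected for last step's rejections, and only then added to num_scheduled_tokens to form seq_lens.

```python
def compute_seq_lens(num_computed, num_rejected, num_scheduled):
    # 1) roll back tokens that were optimistically counted but rejected
    adjusted = [c - r for c, r in zip(num_computed, num_rejected)]
    # 2) seq_len = corrected computed tokens + tokens scheduled this step
    return [a + s for a, s in zip(adjusted, num_scheduled)]

# Request 1 had 2 rejections last step, shrinking its sequence accordingly.
assert compute_seq_lens([14, 23], [0, 2], [4, 3]) == [18, 24]
```

Skipping step 1 (or running it after the addition) would silently inflate seq_lens, which is why the adjustment block must precede this computation.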
vllm/v1/worker/gpu_model_runner.py (1622-1625)
The call to self.input_batch.block_table.compute_slot_mapping_gpu indicates that slot mapping is now entirely handled on the GPU. This is a good performance optimization, but it's crucial to verify that the req_indices_gpu and self.positions.gpu tensors are correctly prepared and synchronized before this call to prevent any data inconsistencies or indexing errors on the device.
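For reference, the arithmetic a slot-mapping kernel performs can be sketched on the CPU under the usual paged-KV layout (the block size and toy block table below are assumptions for illustration, not the PR's kernel): each position indexes into its request's block table, and the slot is the physical block's base offset plus the position within the block.

```python
BLOCK_SIZE = 16  # assumed block size for this sketch

def compute_slot_mapping(block_table, req_indices, positions):
    slots = []
    for req, pos in zip(req_indices, positions):
        # which physical block holds this logical position
        block_number = block_table[req][pos // BLOCK_SIZE]
        # slot = block base + offset within the block
        slots.append(block_number * BLOCK_SIZE + pos % BLOCK_SIZE)
    return slots

# Request 0 owns physical blocks 5 and 9; positions 15 and 16 straddle
# the block boundary.
assert compute_slot_mapping([[5, 9]], [0, 0], [15, 16]) == [95, 144]
```

This is why req_indices_gpu and positions must be consistent before the kernel runs: a stale position indexes the wrong block-table column.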
vllm/v1/worker/gpu_model_runner.py (1726-1746)
The conditional logic for populating self.num_accepted_tokens.gpu now prioritizes self.valid_sampled_token_count_gpu in async spec decode mode. This is a critical optimization to avoid CPU synchronization. It's important to confirm that valid_sampled_token_count_gpu is reliably available and contains the correct data when use_async_spec_decode is true, as any error here would directly impact the acceptance logic in speculative decoding.
vllm/v1/worker/gpu_model_runner.py (1780-1787)
For async speculative decoding, seq_lens_cpu and num_computed_tokens_cpu are explicitly set to None. This change implies that these CPU-side copies are no longer considered valid or necessary in this mode, reinforcing the GPU-first approach. Ensure that no other parts of the system inadvertently rely on these CPU tensors being populated when async spec decode is active, as this could lead to unexpected None access errors.
izhuhaoran
left a comment
@MatthewBonanni Thanks for your work—overall LGTM, just a few minor questions.
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
izhuhaoran
left a comment
@MatthewBonanni Thanks for the update, I’ve left a few more suggestions.
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
vLLM top commit 1b6cb920e6ebcac57154e6154578c39d4892a16c has some diffs with vllm-omni (vllm-project/vllm#32104, vllm-project/vllm#32951, vllm-project/vllm#37287, vllm-project/vllm#36483); modify vllm-omni to keep it working. Signed-off-by: Michael Qiu <qiudayu.qdy@antgroup.com>
…llm-project#32951) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
caused by: vllm-project/vllm#32951
1. AscendEagleProposer missing runner attribute: added a self.runner assignment right after the parent __init__ to ensure availability (affected 32 spec_decode test cases).
2. Deprecated Tensor.gpu API: added a _get_device_tensor() compatibility wrapper to handle both CpuGpuBuffer and plain Tensor objects (affected 5 test cases).
3. PrefillNoCache mode only writes to key_cache and never reads from it, so the key_cache initialization requirement is skipped for that mode.
4. For plain GPU tensors, fill the tensor directly instead of calling .cpu(), which creates a separate CPU copy that does not affect the GPU tensor.
5. Fixed unprotected buffer property accesses that were causing non-deterministic/garbage output during generation: query_start_loc.gpu for the logits_indices calculation (line 919), input_ids.gpu for draft token computation (line 1071) and target-token slicing (lines 1171, 1207), pcp_manager buffer access (lines 1147-1149), query_start_loc.np assignment (line 1399), and mrope_positions gpu/cpu access (lines 859-862).
6. Fixed unprotected access to xdrope_positions.gpu for plain-tensor compatibility.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…re_next_token_ids_padded) When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951), optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens is passed to prepare_next_token_ids_padded as seq_lens_cpu. This value is inflated relative to the actual committed output_token_ids because _prepare_inputs appends -1 placeholder slots optimistically. The backup token lookup calls: request.get_token_id(seq_lens_cpu[i]) where seq_lens_cpu[i] points one past the end of the committed tokens, causing get_token_id() to return -1 (placeholder). The drafter then receives -1 as its next input token, which corrupts its hidden state and degrades the draft acceptance rate — causing the Nemotron-3-Super-120B BF16 GSM8K score to drop from ~0.93 to ~0.74. Fix: use (num_tokens_no_spec[i] - 1) — the index of the last committed output token — for the backup token lookup in both EagleProposer (eagle.py) and ExtractHiddenStatesProposer (extract_hidden_states.py). num_tokens_no_spec is set to request.num_tokens before the optimistic extend, so it always points to a valid token slot. Fixes: vllm-project#38098 Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
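The off-by-one described in this fix can be reproduced minimally with a plain list standing in for the request's token buffer: the inflated index lands on a -1 placeholder, while num_tokens_no_spec - 1 indexes the last committed token.

```python
tokens = [101, 102, 103]           # committed output tokens
num_tokens_no_spec = len(tokens)   # recorded before the optimistic extend
tokens += [-1, -1]                 # placeholders for 2 draft tokens

inflated_idx = num_tokens_no_spec  # seq_lens_cpu points past the committed tokens
assert tokens[inflated_idx] == -1  # buggy backup token: a placeholder

fixed_idx = num_tokens_no_spec - 1  # index of the last committed token
assert tokens[fixed_idx] == 103     # fix: a real token id
```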
co-authored by @izhuhaoran
Purpose
This is a refactor of #29957. It improves the async-ness of spec decoding by optimistically assuming all draft tokens are accepted on the CPU and deferring the correction until after the forward pass. The GPU-side tensors are taken as the source of truth.
The compute_slot_mappings_kernel from model runner V2 (plus @yewentao256's changes from #34179) is adapted into V1 here.
Test Plan
On H100:
with
Test Result
TPOT 9.19 ms -> 8.90 ms (3% speedup)
Main
PR:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.