Cherry-pick PR #34880: enable FULL CUDAGraph for EAGLE proposer #20

Open

mgehre-amd wants to merge 9 commits into matthias.merge-upstream from
Conversation

Commits
…oposer

Cherry-pick 409a12e to enable FULL CUDAGraph mode for the EAGLE proposer during draft speculative steps, reducing CPU overhead.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
b1158f4 to 85e9d81
…WEEP

The sweep functions in skinny_gemms_int4.cu and skinny_gemms_int8.cu instantiate many template combinations that are used only for benchmarking. Wrapping them in #ifdef VLLM_SKINNY_GEMM_SWEEP (matching the existing pattern in skinny_gemms.cu) reduces build time from ~1236s to ~213s.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
- Pass spec_decode_common_attn_metadata to drafter.dummy_run() so the drafter can dispatch uniform_decode=True and match FULL batch keys
- Allow any non-NONE cudagraph mode during capture (not just PIECEWISE) so the drafter's FULL CUDAGraphWrapper actually triggers capture
- Add a hasattr fallback for get_eagle3_aux_hidden_state_layers to support models like Qwen3 that only provide the default method
- Add a _dump_all_full_graphs() call after capture for hipGraph debugging
- Re-apply PR vllm-project#34880 changes lost during the merge with awq_gemv_ifdef_sweep

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Capture Drafter forward, Target forward, and compute_logits as a single CUDA graph for EAGLE speculative decoding. This eliminates GPU launch overhead between the three phases during decode.

Key changes:
- cuda_graph.py: Add a _merged_capture_bypass flag so CUDAGraphWrapper passes through during merged capture; add hipGraph DOT dump utilities; use keep_graph=True for FULL mode to retain raw graph handles.
- gpu_model_runner.py: Add _merged_capture() to record the combined [Drafter → Target → compute_logits] graph with persistent buffers; restructure _capture_cudagraphs to capture individual graphs first, then merged graphs; replay the merged graph in execute_model when ready.
- eagle.py: Handle target_model_batch_desc=None gracefully in propose().

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The previous ordering (Drafter→Target→Logits) caused the drafter to use stale next_token_ids from the previous step, collapsing the acceptance rate from ~37% to ~2%. Reorder to Target→Logits→Drafter so the drafter receives fresh hidden states and next_token_ids computed via argmax(logits) inside the graph.

Key changes:
- Compute in-graph greedy rejection: compare argmax(logits) with the draft tokens in input_ids to derive token_indices_to_sample, num_rejected_tokens_gpu, and next_token_ids, all within the captured CUDA graph.
- Remove the prev_input_ids/prev_positions/prev_seq_lens/prev_slot_mapping persistent buffers that were needed only for the old ordering.
- The drafter now uses the current step's target hidden states and the correctly computed bonus token, matching the eager flow exactly.

Results (Qwen3-4B + EAGLE3, 2 spec tokens, Strix Halo):
- Baseline (no merge): TPOT 8.98ms, acceptance 37.1%
- Merged graph: TPOT 9.31ms, acceptance 37.2%

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
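The greedy acceptance logic the commit describes can be sketched outside the graph in plain Python. This is a hedged illustration of the argmax-vs-draft comparison only, not vLLM's actual in-graph kernel; the function name and list-based arguments are hypothetical stand-ins for the captured GPU tensors.

```python
def greedy_reject(draft_tokens, target_argmax):
    """Compare the target model's argmax predictions against the draft
    tokens position by position. The longest matching prefix is accepted;
    the first mismatching position (or the bonus position after a full
    match) supplies the token fed back to the drafter as next_token_ids.

    draft_tokens:  tokens proposed by the drafter this step
    target_argmax: argmax(logits) of the target model at each draft
                   position plus one bonus position at the end
    """
    num_accepted = 0
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            break
        num_accepted += 1
    num_rejected = len(draft_tokens) - num_accepted
    # Token at the first mismatch (or bonus) position becomes the
    # next_token_id for the following drafter step.
    next_token_id = target_argmax[num_accepted]
    return num_accepted, num_rejected, next_token_id
```

With three draft tokens of which two match, this accepts the prefix and returns the target's token at the first mismatch, mirroring the behavior of greedy rejection sampling.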
The in-graph greedy rejection (added in the previous commit) overwrites _merged_token_indices_to_sample and _merged_num_rejected_tokens_gpu before the drafter reads them, making the pre-replay eagle_prepare_inputs_padded_kernel redundant. Removing it also eliminates a hidden cu_draft[-1].item() GPU→CPU sync that cost ~0.3ms per step.

Results (Qwen3-4B + EAGLE3, 2 spec tokens, Strix Halo):
- Before: TPOT 9.31ms
- After: TPOT 9.00ms (baseline 8.98ms)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
For greedy decoding, the in-graph rejection already computes next_token_ids and num_rejected_tokens_gpu via the same argmax comparison used by the real rejection sampler. Derive valid_sampled_tokens_count directly from the in-graph results instead of calling prepare_next_token_ids_padded, which involves a Python loop (backup token computation), a CPU→GPU copy, tensor allocations, and a Triton kernel launch.

Falls back to the full prepare_next_token_ids_padded for non-greedy sampling, where the real sampler may produce different acceptance patterns than the greedy in-graph rejection.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
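The derivation the commit refers to reduces to simple per-request arithmetic: each request keeps its accepted draft tokens plus the one token sampled at the first mismatch (or bonus) position. A minimal sketch under that assumption; the function name and list inputs are illustrative, not vLLM's actual tensors.

```python
def valid_sampled_tokens_count(num_draft_tokens, num_rejected_tokens):
    """Per-request count of valid sampled tokens, derived from the
    in-graph rejection results: accepted drafts (draft - rejected)
    plus the single token sampled at the mismatch/bonus position."""
    return [
        n_draft - n_rejected + 1
        for n_draft, n_rejected in zip(num_draft_tokens, num_rejected_tokens)
    ]
```

Because both inputs already live on the GPU inside the captured graph, deriving the count this way avoids the host-side loop and extra kernel launch of the padded-preparation path.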
When all conditions are met (greedy decoding, no logprobs, no penalties, no bad words, no constrained decoding, no non-argmax-invariant logits processors, no grammar output), construct sampled_token_ids directly from the in-graph rejection results instead of running the external rejection sampler. This eliminates the bonus argmax, target logits extraction, and rejection kernel while producing identical output.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
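The condition list above amounts to a single eligibility check per step. A minimal sketch, assuming a hypothetical per-step state object; the field names are assumptions for illustration and do not match vLLM's SamplingParams exactly.

```python
from dataclasses import dataclass

@dataclass
class StepState:
    # Illustrative fields; names are assumptions, not vLLM's actual API.
    is_greedy: bool
    needs_logprobs: bool
    has_penalties: bool
    has_bad_words: bool
    has_constrained_decoding: bool
    has_nonargmax_invariant_processors: bool
    has_grammar_output: bool

def can_skip_rejection_sampler(state: StepState) -> bool:
    """True only when the in-graph greedy rejection is guaranteed to
    produce the same sampled_token_ids as the external rejection
    sampler, so the external sampler can be bypassed entirely."""
    return (
        state.is_greedy
        and not state.needs_logprobs
        and not state.has_penalties
        and not state.has_bad_words
        and not state.has_constrained_decoding
        and not state.has_nonargmax_invariant_processors
        and not state.has_grammar_output
    )
```

Any single failing condition falls back to the external sampler, since non-greedy sampling or logits post-processing can change which tokens are accepted.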
Summary
Benchmark results (Qwen3-4B W4A16 + Eagle3, 20 prompts, gfx1151)
The PR adds ~0.27ms Python overhead in eager mode but is neutral with CUDAGraph + compile_sizes.
Test plan