
Cherry-pick PR #34880: enable FULL CUDAGraph for EAGLE proposer #20

Open
mgehre-amd wants to merge 9 commits into matthias.merge-upstream from matthias.pr34880

Conversation

@mgehre-amd

Summary

Benchmark results (Qwen3-4B W4A16 + Eagle3, 20 prompts, gfx1151)

| Configuration | Without PR | With PR |
|---|---|---|
| `--enforce-eager` | 7.57 ms | 7.84 ms (+0.27 ms) |
| cudagraph + `compile_sizes=[3,6,12]` | 7.56 ms | 7.59 ms (+0.03 ms) |

The PR adds ~0.27ms Python overhead in eager mode but is neutral with CUDAGraph + compile_sizes.

Test plan

  • Benchmarked with --enforce-eager (20 prompts)
  • Benchmarked with cudagraph + compile_sizes=[3,6,12] (20 prompts)

…oposer

Cherry-pick 409a12e to enable FULL CUDAGraph mode for the EAGLE
proposer during draft speculative steps, reducing CPU overhead.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
…WEEP

The sweep functions in skinny_gemms_int4.cu and skinny_gemms_int8.cu
instantiate many template combinations only used for benchmarking.
Wrapping them with #ifdef VLLM_SKINNY_GEMM_SWEEP (matching the existing
pattern in skinny_gemms.cu) reduces build time from ~1236s to ~213s.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
- Pass spec_decode_common_attn_metadata to drafter.dummy_run() so the
  drafter can dispatch uniform_decode=True and match FULL batch keys
- Allow any non-NONE cudagraph mode during capture (not just PIECEWISE)
  so the drafter's FULL CUDAGraphWrapper actually triggers capture
- Add hasattr fallback for get_eagle3_aux_hidden_state_layers to support
  models like Qwen3 that only have the default method
- Add _dump_all_full_graphs() call after capture for hipGraph debugging
- Re-apply PR vllm-project#34880 changes lost during merge with awq_gemv_ifdef_sweep
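The hasattr fallback above can be sketched as follows; the hook name `get_eagle3_aux_hidden_state_layers` comes from this commit message, but the default layer choice and function signature here are illustrative assumptions:

```python
# Hedged sketch of the hasattr fallback; the default layer indices are an
# assumption for illustration, not vLLM's actual defaults.
def aux_hidden_state_layers(model, num_layers: int) -> tuple:
    """Layer indices whose hidden states are fed to the EAGLE3 drafter."""
    if hasattr(model, "get_eagle3_aux_hidden_state_layers"):
        # Models that override the hook pick their own layers.
        return model.get_eagle3_aux_hidden_state_layers()
    # Fallback for models (like Qwen3) that only ship the default method:
    # an early, a middle, and a late layer (illustrative choice).
    return (2, num_layers // 2, num_layers - 3)
```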

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Capture Drafter forward, Target forward, and compute_logits as a single
CUDA graph for EAGLE speculative decoding. This eliminates GPU launch
overhead between the three phases during decode.

Key changes:
- cuda_graph.py: Add _merged_capture_bypass flag so CUDAGraphWrapper
  passes through during merged capture; add hipGraph DOT dump utilities;
  use keep_graph=True for FULL mode to retain raw graph handles.
- gpu_model_runner.py: Add _merged_capture() to record the combined
  [Drafter → Target → compute_logits] graph with persistent buffers;
  restructure _capture_cudagraphs to capture individual graphs first,
  then merged graphs; replay merged graph in execute_model when ready.
- eagle.py: Handle target_model_batch_desc=None gracefully in propose().

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The previous ordering (Drafter→Target→Logits) caused the drafter to
use stale next_token_ids from the previous step, collapsing the
acceptance rate from ~37% to ~2%. Reorder to Target→Logits→Drafter
so the drafter receives fresh hidden states and next_token_ids
computed via argmax(logits) inside the graph.

Key changes:
- Compute in-graph greedy rejection: compare argmax(logits) with
  draft tokens in input_ids to derive token_indices_to_sample,
  num_rejected_tokens_gpu, and next_token_ids — all within the
  captured CUDA graph.
- Remove prev_input_ids/prev_positions/prev_seq_lens/prev_slot_mapping
  persistent buffers that were needed for the old ordering.
- The drafter now uses the current step's target hidden states and
  the correctly computed bonus token, matching the eager flow exactly.
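The in-graph greedy rejection can be sketched for a single request with k draft tokens as follows; function and variable names are assumptions, and the real code operates on batched persistent buffers inside the captured graph:

```python
import torch

def greedy_reject(logits: torch.Tensor, draft_tokens: torch.Tensor):
    """logits: [k+1, vocab] target logits at the k draft positions plus the
    bonus position; draft_tokens: [k] tokens proposed by the drafter."""
    target_tokens = logits.argmax(dim=-1)         # [k+1] greedy target picks
    matches = target_tokens[:-1] == draft_tokens  # [k] per-position agreement
    # Accepted drafts = length of the matching prefix. All-tensor math, so
    # no .item() GPU->CPU sync inside the graph.
    num_accepted = torch.cumprod(matches.long(), dim=0).sum()
    # Next token: the target's pick at the first mismatch, or the bonus
    # position when every draft was accepted.
    next_token = target_tokens[num_accepted]
    num_rejected = draft_tokens.numel() - num_accepted
    return num_accepted, num_rejected, next_token
```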

Results (Qwen3-4B + EAGLE3, 2 spec tokens, Strix Halo):
  Baseline (no merge): TPOT 8.98ms, acceptance 37.1%
  Merged graph:        TPOT 9.31ms, acceptance 37.2%

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The in-graph greedy rejection (added in the previous commit) overwrites
_merged_token_indices_to_sample and _merged_num_rejected_tokens_gpu
before the drafter reads them, making the pre-replay
eagle_prepare_inputs_padded_kernel redundant. Removing it also
eliminates a hidden cu_draft[-1].item() GPU→CPU sync that cost ~0.3ms
per step.
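The hidden sync can be illustrated in miniature (`cu_draft` is the buffer name from this commit message; its contents here are made up). On a GPU tensor, `.item()` copies the value to the host and blocks until the stream has produced it, whereas keeping the value as a 0-d tensor stays on device:

```python
import torch

# cu_draft: cumulative draft-token counts per request (illustrative values).
cu_draft = torch.tensor([0, 2, 5, 9])
total_synced = cu_draft[-1].item()  # host int: forces a GPU->CPU sync on GPU tensors
total_on_device = cu_draft[-1]      # 0-d tensor: stays on device, no sync
```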

Results (Qwen3-4B + EAGLE3, 2 spec tokens, Strix Halo):
  Before: TPOT 9.31ms
  After:  TPOT 9.00ms (baseline 8.98ms)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
For greedy decoding, the in-graph rejection already computes
next_token_ids and num_rejected_tokens_gpu via the same argmax
comparison used by the real rejection sampler. Derive
valid_sampled_tokens_count directly from the in-graph results
instead of calling prepare_next_token_ids_padded, which involves
a Python loop (backup token computation), a CPU→GPU copy, tensor
allocations, and a Triton kernel launch.

Falls back to the full prepare_next_token_ids_padded for non-greedy
sampling where the real sampler may produce different acceptance
patterns than the greedy in-graph rejection.
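Deriving the count from the in-graph results can be sketched as below; the names come from this commit message, but the exact formula is an assumption: with k speculative tokens per request, each request samples its accepted drafts plus one bonus/replacement token:

```python
import torch

# num_rejected: [batch] rejected-draft counts from the in-graph rejection.
def valid_sampled_tokens_count(num_rejected: torch.Tensor, k: int) -> torch.Tensor:
    # accepted drafts (k - rejected) plus one bonus/replacement token,
    # computed entirely on-device (no Python loop, no CPU->GPU copy).
    return (k - num_rejected) + 1
```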

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
When all conditions are met (greedy decoding, no logprobs, no penalties,
no bad words, no constrained decoding, no non-argmax-invariant logits
processors, no grammar output), construct sampled_token_ids directly
from the in-graph rejection results instead of running the external
rejection sampler. This eliminates the bonus argmax, target logits
extraction, and rejection kernel while maintaining identical output.
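Building one row of `sampled_token_ids` straight from the in-graph results can be sketched as follows; the -1 padding convention and all names here are assumptions for illustration:

```python
import torch

def build_sampled_row(draft_tokens, next_token, num_accepted, k):
    # Row layout: accepted draft tokens, then the bonus/replacement token,
    # then -1 padding for the rejected positions.
    row = torch.full((k + 1,), -1, dtype=torch.long)
    row[:num_accepted] = draft_tokens[:num_accepted]  # accepted drafts
    row[num_accepted] = next_token                    # bonus or replacement
    return row
```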

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
