
Cherry-pick PR #34880: enable FULL CUDAGraph for EAGLE proposer #20

Open
mgehre-amd wants to merge 9 commits into matthias.merge-upstream from matthias.pr34880

Conversation

@mgehre-amd

Summary

Benchmark results (Qwen3-4B W4A16 + Eagle3, 20 prompts, gfx1151)

| Configuration | Without PR | With PR |
|---|---|---|
| `--enforce-eager` | 7.57 ms | 7.84 ms (+0.27 ms) |
| cudagraph + `compile_sizes=[3,6,12]` | 7.56 ms | 7.59 ms (+0.03 ms) |

The PR adds ~0.27ms Python overhead in eager mode but is neutral with CUDAGraph + compile_sizes.

Test plan

  • Benchmarked with --enforce-eager (20 prompts)
  • Benchmarked with cudagraph + compile_sizes=[3,6,12] (20 prompts)

…oposer

Cherry-pick 409a12e to enable FULL CUDAGraph mode for the EAGLE
proposer during draft speculative steps, reducing CPU overhead.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
…WEEP

The sweep functions in skinny_gemms_int4.cu and skinny_gemms_int8.cu
instantiate many template combinations only used for benchmarking.
Wrapping them with #ifdef VLLM_SKINNY_GEMM_SWEEP (matching the existing
pattern in skinny_gemms.cu) reduces build time from ~1236s to ~213s.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
- Pass spec_decode_common_attn_metadata to drafter.dummy_run() so the
  drafter can dispatch uniform_decode=True and match FULL batch keys
- Allow any non-NONE cudagraph mode during capture (not just PIECEWISE)
  so the drafter's FULL CUDAGraphWrapper actually triggers capture
- Add hasattr fallback for get_eagle3_aux_hidden_state_layers to support
  models like Qwen3 that only have the default method
- Add _dump_all_full_graphs() call after capture for hipGraph debugging
- Re-apply PR vllm-project#34880 changes lost during merge with awq_gemv_ifdef_sweep
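The hasattr fallback above can be sketched as follows; the hook name `get_eagle3_aux_hidden_state_layers` comes from this commit message, but the default layer choice and function signature here are illustrative assumptions:

```python
# Hedged sketch of the hasattr fallback; the default layer indices are an
# assumption for illustration, not vLLM's actual defaults.
def aux_hidden_state_layers(model, num_layers: int) -> tuple:
    """Layer indices whose hidden states are fed to the EAGLE3 drafter."""
    if hasattr(model, "get_eagle3_aux_hidden_state_layers"):
        # Models that override the hook pick their own layers.
        return model.get_eagle3_aux_hidden_state_layers()
    # Fallback for models (like Qwen3) that only ship the default method:
    # an early, a middle, and a late layer (illustrative choice).
    return (2, num_layers // 2, num_layers - 3)
```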

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Capture Drafter forward, Target forward, and compute_logits as a single
CUDA graph for EAGLE speculative decoding. This eliminates GPU launch
overhead between the three phases during decode.

Key changes:
- cuda_graph.py: Add _merged_capture_bypass flag so CUDAGraphWrapper
  passes through during merged capture; add hipGraph DOT dump utilities;
  use keep_graph=True for FULL mode to retain raw graph handles.
- gpu_model_runner.py: Add _merged_capture() to record the combined
  [Drafter → Target → compute_logits] graph with persistent buffers;
  restructure _capture_cudagraphs to capture individual graphs first,
  then merged graphs; replay merged graph in execute_model when ready.
- eagle.py: Handle target_model_batch_desc=None gracefully in propose().

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The previous ordering (Drafter→Target→Logits) caused the drafter to
use stale next_token_ids from the previous step, collapsing the
acceptance rate from ~37% to ~2%. Reorder to Target→Logits→Drafter
so the drafter receives fresh hidden states and next_token_ids
computed via argmax(logits) inside the graph.

Key changes:
- Compute in-graph greedy rejection: compare argmax(logits) with
  draft tokens in input_ids to derive token_indices_to_sample,
  num_rejected_tokens_gpu, and next_token_ids — all within the
  captured CUDA graph.
- Remove prev_input_ids/prev_positions/prev_seq_lens/prev_slot_mapping
  persistent buffers that were needed for the old ordering.
- The drafter now uses the current step's target hidden states and
  the correctly computed bonus token, matching the eager flow exactly.
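The in-graph greedy rejection can be sketched for a single request with k draft tokens as follows; function and variable names are assumptions, and the real code operates on batched persistent buffers inside the captured graph:

```python
import torch

def greedy_reject(logits: torch.Tensor, draft_tokens: torch.Tensor):
    """logits: [k+1, vocab] target logits at the k draft positions plus the
    bonus position; draft_tokens: [k] tokens proposed by the drafter."""
    target_tokens = logits.argmax(dim=-1)         # [k+1] greedy target picks
    matches = target_tokens[:-1] == draft_tokens  # [k] per-position agreement
    # Accepted drafts = length of the matching prefix. All-tensor math, so
    # no .item() GPU->CPU sync inside the graph.
    num_accepted = torch.cumprod(matches.long(), dim=0).sum()
    # Next token: the target's pick at the first mismatch, or the bonus
    # position when every draft was accepted.
    next_token = target_tokens[num_accepted]
    num_rejected = draft_tokens.numel() - num_accepted
    return num_accepted, num_rejected, next_token
```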

Results (Qwen3-4B + EAGLE3, 2 spec tokens, Strix Halo):
  Baseline (no merge): TPOT 8.98ms, acceptance 37.1%
  Merged graph:        TPOT 9.31ms, acceptance 37.2%

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The in-graph greedy rejection (added in the previous commit) overwrites
_merged_token_indices_to_sample and _merged_num_rejected_tokens_gpu
before the drafter reads them, making the pre-replay
eagle_prepare_inputs_padded_kernel redundant. Removing it also
eliminates a hidden cu_draft[-1].item() GPU→CPU sync that cost ~0.3ms
per step.
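The hidden sync can be illustrated in miniature (`cu_draft` is the buffer name from this commit message; its contents here are made up). On a GPU tensor, `.item()` copies the value to the host and blocks until the stream has produced it, whereas keeping the value as a 0-d tensor stays on device:

```python
import torch

# cu_draft: cumulative draft-token counts per request (illustrative values).
cu_draft = torch.tensor([0, 2, 5, 9])
total_synced = cu_draft[-1].item()  # host int: forces a GPU->CPU sync on GPU tensors
total_on_device = cu_draft[-1]      # 0-d tensor: stays on device, no sync
```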

Results (Qwen3-4B + EAGLE3, 2 spec tokens, Strix Halo):
  Before: TPOT 9.31ms
  After:  TPOT 9.00ms (baseline 8.98ms)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
For greedy decoding, the in-graph rejection already computes
next_token_ids and num_rejected_tokens_gpu via the same argmax
comparison used by the real rejection sampler. Derive
valid_sampled_tokens_count directly from the in-graph results
instead of calling prepare_next_token_ids_padded, which involves
a Python loop (backup token computation), a CPU→GPU copy, tensor
allocations, and a Triton kernel launch.

Falls back to the full prepare_next_token_ids_padded for non-greedy
sampling where the real sampler may produce different acceptance
patterns than the greedy in-graph rejection.
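Deriving the count from the in-graph results can be sketched as below; the names come from this commit message, but the exact formula is an assumption: with k speculative tokens per request, each request samples its accepted drafts plus one bonus/replacement token:

```python
import torch

# num_rejected: [batch] rejected-draft counts from the in-graph rejection.
def valid_sampled_tokens_count(num_rejected: torch.Tensor, k: int) -> torch.Tensor:
    # accepted drafts (k - rejected) plus one bonus/replacement token,
    # computed entirely on-device (no Python loop, no CPU->GPU copy).
    return (k - num_rejected) + 1
```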

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
When all conditions are met (greedy decoding, no logprobs, no penalties,
no bad words, no constrained decoding, no non-argmax-invariant logits
processors, no grammar output), construct sampled_token_ids directly
from the in-graph rejection results instead of running the external
rejection sampler. This eliminates the bonus argmax, target logits
extraction, and rejection kernel while maintaining identical output.
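Building one row of `sampled_token_ids` straight from the in-graph results can be sketched as follows; the -1 padding convention and all names here are assumptions for illustration:

```python
import torch

def build_sampled_row(draft_tokens, next_token, num_accepted, k):
    # Row layout: accepted draft tokens, then the bonus/replacement token,
    # then -1 padding for the rejected positions.
    row = torch.full((k + 1,), -1, dtype=torch.long)
    row[:num_accepted] = draft_tokens[:num_accepted]  # accepted drafts
    row[num_accepted] = next_token                    # bonus or replacement
    return row
```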

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
