[Spec Decode][CUDA Graphs] Enables Eagle drafter support for FULL CUDA Graph mode#34880
yiz-liu wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request enables full CUDA graph support for the Eagle drafter model, which is a significant performance enhancement. The changes are well-structured, including necessary modifications for CUDA graph compatibility like in-place tensor updates and proper dummy run setup for graph capturing. I've identified one critical issue: a logging statement with incorrect formatting that will cause a TypeError at runtime. I've provided a suggestion to fix it. Overall, this is a great contribution.
yiz-liu
left a comment
These are some questions I am not entirely sure about. Any comments?
vllm/v1/spec_decode/eagle.py
Outdated
```python
if not self.speculative_config.enforce_eager:
    # This is a temporary mapping, open to discussion:
    # FULL_AND_PIECEWISE -> PIECEWISE, FULL_DECODE_ONLY -> FULL
    # PIECEWISE -> PIECEWISE, FULL -> FULL
    eagle_cudagraph_mode = (
        CUDAGraphMode.PIECEWISE
        if cudagraph_mode.has_piecewise_cudagraphs()
        else cudagraph_mode.decode_mode()
    )
```
Do we have any other thoughts on this?
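To make the proposed mapping easy to eyeball, here is a standalone sketch. The enum members and the two helper methods mirror the names in the diff, but the member set and method bodies are assumptions for illustration, not vLLM's actual `CUDAGraphMode` implementation.

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Hypothetical stand-in for vLLM's enum; the member set is an assumption.
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"

    def has_piecewise_cudagraphs(self) -> bool:
        return self in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE)

    def decode_mode(self) -> "CUDAGraphMode":
        # Modes that run full graphs for the decode path collapse to FULL.
        if self in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_DECODE_ONLY):
            return CUDAGraphMode.FULL
        return self


def eagle_mode(target: CUDAGraphMode) -> CUDAGraphMode:
    # The temporary mapping from the diff:
    # FULL_AND_PIECEWISE -> PIECEWISE, FULL_DECODE_ONLY -> FULL,
    # PIECEWISE -> PIECEWISE, FULL -> FULL
    return (CUDAGraphMode.PIECEWISE
            if target.has_piecewise_cudagraphs()
            else target.decode_mode())


assert eagle_mode(CUDAGraphMode.FULL_AND_PIECEWISE) is CUDAGraphMode.PIECEWISE
assert eagle_mode(CUDAGraphMode.FULL_DECODE_ONLY) is CUDAGraphMode.FULL
```

The mapping keeps the drafter at the same or a more conservative graph level than the target, which is the consistency question raised above.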
```python
common_attn_metadata.query_start_loc[: batch_size + 1] = self.arange[
    : batch_size + 1
]
common_attn_metadata.query_start_loc_cpu[: batch_size + 1] = torch.from_numpy(
```
I am not sure why we set query_start_loc or slot_mapping to a different buffer in the first place, but I assume it's always safe to use the original buffer.
I think consistent addressing simply wasn't a consideration back then, since we didn't have full graphs yet.
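Why consistent addressing matters once full graphs are in play can be shown with a pure-Python analogy (lists standing in for tensors; not vLLM code): a captured CUDA graph replays reads against the memory recorded at capture time, so metadata updates must be written into the same buffer in place, never by rebinding the name to a new tensor.

```python
# `buf` stands in for the persistent query_start_loc buffer; `captured_ref`
# stands in for the address the captured CUDA graph reads from at replay.
buf = [0] * 8
captured_ref = buf


def update_in_place(buffer, values):
    # Analogous to tensor[:n].copy_(src): same storage, new contents.
    buffer[: len(values)] = values


update_in_place(buf, [0, 2, 5])
assert captured_ref[:3] == [0, 2, 5]  # the "replay" observes the update


def update_by_rebinding(values):
    # A fresh object: anything holding the old reference goes stale.
    return list(values)


stale = update_by_rebinding([0, 3, 7])
assert captured_ref[:3] != stale[:3]  # the "replay" never sees these values
```

This is the in-place-update requirement the PR enforces for `query_start_loc` and `slot_mapping`.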
```python
# NOTE: For CUDA Graph, we need the `num_reqs_padded` here
batch_size = common_attn_metadata.num_reqs
```
This is another core change: `input_batch.num_reqs != common_attn_metadata.num_reqs` after padding, and I wonder if there is a better way to deal with this?
@tomasruizt @LucasWilkinson Could you please take a look at this? Thanks!
```python
# if we have dedicated decode cudagraphs, and spec-decode is enabled,
# we need to adjust the cudagraph sizes to be a multiple of the uniform
# decode query length to avoid: https://github.com/vllm-project/vllm/issues/28207
# temp-fix: https://github.com/vllm-project/vllm/issues/28207#issuecomment-3504004536
# Will be removed in the near future when we have separate cudagraph capture
# sizes for decode and mixed prefill-decode.
if (
    cudagraph_mode.decode_mode() == CUDAGraphMode.FULL
    and cudagraph_mode.separate_routine()
    and self.uniform_decode_query_len > 1
):
```
Also, I checked the comments here: since we already have separate capture sizes now, we can remove this condition, right? It was added to solve the `num_speculative_tokens=2` issue.
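The temp-fix referenced in the comment can be sketched as follows (an illustrative helper under my own naming, not the exact vLLM code): capture sizes are snapped down to multiples of the uniform decode query length so that every captured batch holds a whole number of decode requests. With `num_speculative_tokens=2`, each uniform decode request contributes 1 + 2 = 3 tokens.

```python
def adjust_capture_sizes(sizes: list[int], uniform_decode_query_len: int) -> list[int]:
    """Snap each capture size down to a multiple of the decode query length.

    Sizes that round down to zero are dropped: they cannot hold even one
    uniform decode request.
    """
    adjusted = {
        (s // uniform_decode_query_len) * uniform_decode_query_len for s in sizes
    }
    adjusted.discard(0)
    return sorted(adjusted)


# With num_speculative_tokens=2, each decode request is 3 tokens wide.
assert adjust_capture_sizes([1, 2, 4, 8, 16], 3) == [3, 6, 15]
```

Once decode and mixed prefill-decode get separate capture sizes, this snapping becomes unnecessary, which is the point being made above.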
@yiz-liu Thanks a lot for this PR!

Edit: Perhaps the observation below is just a matter of enabling full CG also for

I profiled your branch using the PyTorch profiler and found:
I attach the command I used to generate the trace as well as the PyTorch trace, which you can open in https://ui.perfetto.dev/.

Script: profile-4b-sd-0.6b.sh
Trace: rank0.1771496331250686317.pt.trace.json.gz

I assume that you are seeing no change in performance whatsoever compared to main (TPOT, ITL). If that's correct, it means that some wiring is still missing to enable full CG for the drafter. Let me know if I'm wrong or missing something.
@tomasruizt Weird, I'll look into this later; in the meantime, the scripts and profiling are attached below:

```shell
python3 data_parallel.py \
    --model="/home/weight/Qwen3-30B-A3B-FP8" \
    -dp=1 \
    -tp=2
```
For EAGLE3, I'm observing the same phenomenon. I used gpt-oss-20b + eagle3.
Profiling script: profile-gpt-oss-20b-eagle3.sh
@tomasruizt Oh yeah, I checked these scripts and the profiling. I believe the behavior you're observing is due to the default CUDA graph mode, which resolves to

Could you please try explicitly setting the CUDA graph mode?

This brings up a design question; to elaborate on my earlier comment: do you think we should change the default strategy to be more aggressive (i.e., prefer
If the target model runs in full CG, then the draft model should run in full CG if possible, right? The higher-performance setting should be the default. What is the problem with setting full CG as the default?
Yeah, that's a good point. No problem at all; my initial design was just trying to honor the existing default behavior for consistency. However, I agree that prioritizing performance is the right call, so I'll go ahead and update the PR. Of course, for anyone who has concerns, this is still open for discussion. Thanks for the valuable feedback!
+1, behaviour should match the base model for consistency whenever possible. If the base model uses full graphs for a certain shape, so should the drafter.
Thanks for the great work! Can full CUDA Graph be applied to parallel EAGLE as well? CC @benchislett
@tomasruizt @benchislett Hi, please see the latest commit for the unified CUDA Graph mode; the target model and the drafter should share the same behavior now.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @yiz-liu, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the pre-commit hooks will run automatically on changed files.
@yiz-liu Are you able to generate the PyTorch profile? You can attach it once done as proof that the drafter uses |
Hmm, I ran into this error again; I'll take a closer look at it.
Thank you. Perhaps that's an issue on my end, then. I'll look into the setup more closely, and if I'm able to reproduce it in a simple way, I'll share the details here.
Oh sorry, I missed this comment before. Good catch. I've been struggling with the dispatching and
I suspect the regression is due to some mismatch between the first (1 + K)-token step and the subsequent 1-token decoding steps. Maybe they are reading stale values, and the illegal memory access may suggest that the CUDA Graph captured for the larger shape is accessing out-of-bounds memory when executed on the smaller 1-token batch. Do you have any insights on how to capture and dispatch graphs here, @benchislett? Thanks a lot. My next step is to handle these two scenarios independently by ensuring they are captured as distinct CUDA graph instances.
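The "distinct CUDA graph instances" idea amounts to keying the graph cache on the query length as well as the batch size, so a graph captured for the (1 + K)-token first step can never be replayed on a 1-token batch. A minimal sketch, with a dict standing in for the graph cache (names are mine, not vLLM's):

```python
# Maps (num_reqs, query_len) -> a captured graph (a sentinel object here).
graphs: dict[tuple[int, int], object] = {}


def dispatch(num_reqs: int, query_len: int) -> object:
    """Return the graph for this exact shape, capturing it on first use."""
    key = (num_reqs, query_len)
    if key not in graphs:
        graphs[key] = object()  # stand-in for an actual CUDA graph capture
    return graphs[key]


K = 2  # num_speculative_tokens
first = dispatch(8, 1 + K)  # first drafter step: (1 + K) tokens per request
later = dispatch(8, 1)      # subsequent steps: 1 token per request
assert first is not later   # the two shapes never share a graph
```

The cost, as noted below, is an additional set of graphs to capture and maintain.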
This pull request has merge conflicts that must be resolved before it can be merged.
@LucasWilkinson might have some ideas. I'm not entirely sure how to avoid the issue without an additional set of graphs, which seems like it would be a pain to maintain.
This pull request has merge conflicts that must be resolved before it can be merged.
Sorry for the late reply; I didn't feel well last week. I've been overhauling things over the past couple of days. The main changes:
Also, the acceptance rate should be OK now, as you can see in the Test Result section of the PR description, @gorski-m. Could you please review this PR again? Thanks! @benchislett @LucasWilkinson
Hi @yiz-liu, the pre-commit checks have failed. Please run:

```shell
uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the pre-commit hooks will run automatically on changed files.
…ificantly improving inference performance by reducing CPU overhead during the draft speculative steps.

1. CudagraphDispatcher
   * Added a for_draft_model flag to allow specialized graph capture logic for speculative decoding.
   * Updated initialize_cudagraph_keys to capture graphs up to max_num_tokens specifically for steps > 0.
   * Made uniform_decode_query_len an independent parameter, since for steps > 0 it should be 1.
2. EAGLE Proposer
   * Model wrapping: the draft model is now wrapped in CUDAGraphWrapper when FULL mode is enabled and padding is not disabled.
   * Metadata padding: fixed a potential crash by padding spec_decode_metadata.cu_num_draft_tokens to match the padded batch size.
   * Refined dispatching: updated _determine_batch_execution_and_padding to return and pass BatchDescriptor objects, ensuring the runtime uses the correct graph key.
   * Capture logic: enhanced dummy_run to simulate the actual speculative decoding steps during the graph capture phase.
3. GPUModelRunner
   * Introduced supports_sd_full_graph to identify proposers (like EAGLE) that are compatible with FULL graph mode.
   * Modified ExecuteModelState to track batch_desc, ensuring consistency between the target model and draft model.
   * Ensured CommonAttentionMetadata is correctly passed to the drafter's warmup/dummy runs to facilitate accurate metadata building during capture.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…oposer

Cherry-pick 409a12e to enable FULL CUDAGraph mode for the EAGLE proposer during draft speculative steps, reducing CPU overhead.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
- Pass spec_decode_common_attn_metadata to drafter.dummy_run() so the drafter can dispatch uniform_decode=True and match FULL batch keys
- Allow any non-NONE cudagraph mode during capture (not just PIECEWISE) so the drafter's FULL CUDAGraphWrapper actually triggers capture
- Add a hasattr fallback for get_eagle3_aux_hidden_state_layers to support models like Qwen3 that only have the default method
- Add a _dump_all_full_graphs() call after capture for hipGraph debugging
- Re-apply PR vllm-project#34880 changes lost during the merge with awq_gemv_ifdef_sweep

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
I'm happy to see Eagle drafters running under CUDA Graphs!
I think the main blocker is the D2H/H2D transfer between the target model and the draft model, for example in

I am not sure yet whether we can get rid of those transfers in the future; I will check the Model Runner V2 design later.
@LucasWilkinson Hi, I noticed that similar features have been added in MRV2 (#35959) over the past couple of weeks. I'll take a look as well, but would you mind reviewing this PR when you get a chance? I think V1 still needs this support; please let me know what you think. Any help would be greatly appreciated. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.






Purpose
As mentioned in vllm-project/vllm-ascend#5459 and #33341, this PR enables Full CUDA Graph mode for the Eagle drafter model to improve performance.
The main changes include:
- Wraps the drafter model in `CUDAGraphWrapper` during `load_model` and initializes the necessary keys for the dispatcher to manage graph-based execution.
- Sets up the `dummy_run` phase, which is required for successful graph capture.
- For the first draft step, the drafter shares `uniform_decode` with the target model and has essentially the same `batch_desc` and `cudagraph_mode`; for the following steps, `uniform_decode_query_len` is set to 1 and `uniform_decode` to `True`, making it possible to have separate `cudagraph_keys`.
- Updates `query_start_loc` and `slot_mapping` in place within the attention metadata.
- Fixes the issue when `num_speculative_tokens` was set to 2. Also fixes `prepare_inputs_padded` and `prepare_next_token_ids_padded` for padding issues.

Collectively, these changes allow the Eagle drafter to leverage the performance benefits of Full CUDA Graph mode, enhancing throughput for speculative decoding.
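The wrapper-plus-dispatch flow described above can be caricatured in pure Python (class name borrowed from the PR, but the structure here is a hypothetical sketch; the real `CUDAGraphWrapper` records and replays actual CUDA work):

```python
from typing import Any, Callable


class CUDAGraphWrapperSketch:
    """Hypothetical stand-in: run eagerly on the first call for a given
    batch descriptor (the "capture"), then reuse the recorded entry on
    every later call with the same descriptor (the "replay")."""

    def __init__(self, runnable: Callable[..., Any]):
        self.runnable = runnable
        self.captured: dict[Any, Callable[..., Any]] = {}
        self.replays = 0

    def __call__(self, batch_desc: Any, *args: Any) -> Any:
        if batch_desc not in self.captured:
            # First call for this shape: "capture" while running eagerly.
            self.captured[batch_desc] = self.runnable
            return self.runnable(*args)
        self.replays += 1
        return self.captured[batch_desc](*args)


wrapped = CUDAGraphWrapperSketch(lambda x: x * 2)  # toy drafter forward pass
assert wrapped((8, 1), 3) == 6   # capture for batch key (8, 1)
assert wrapped((8, 1), 5) == 10  # replay for the same key
assert wrapped.replays == 1
```

The `batch_desc` key plays the role of `BatchDescriptor` in the PR: the target and draft models must agree on it so both resolve to the same graph entry.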
Test Plan
The feature was tested by running the model with the following configuration:
- `num_speculative_tokens=2` (and also validated with 3/4/5)
- `cudagraph_mode="FULL"` (and also validated with `FULL_DECODE_ONLY` and `FULL_AND_PIECEWISE`)

Test Result
The model's acceptance rate in Full CUDA Graph mode is consistent with the results from eager mode.
For `FULL_AND_PIECEWISE`:

For `FULL`: