[v1][spec decode] Run eagle with full cudagraph support #21477
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces support for full CUDA graph execution with the Eagle speculative decoding model. The changes are well-structured, adding a new compilation configuration and updating the necessary components in the model runner and the Eagle proposer to handle CUDA graph-compatible attention metadata. The tests have also been extended to cover this new functionality.
I've identified a critical issue where an IndexError could occur if cudagraph_batch_sizes becomes empty, which can happen under certain configurations with sequence parallelism. I've provided suggestions to fix this by adding a check before accessing the list.
Other than that, the implementation looks solid and correctly follows the patterns for enabling CUDA graphs in vLLM.
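A minimal sketch of the kind of guard being suggested; the function and argument names below are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical illustration (not the PR's actual code) of guarding against an
# empty cudagraph_batch_sizes list before indexing into it.
def max_cudagraph_batch_size(cudagraph_batch_sizes):
    if not cudagraph_batch_sizes:
        # The list can become empty, e.g. when sequence parallelism filters
        # out all candidate batch sizes.
        return None
    return cudagraph_batch_sizes[-1]
```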
Force-pushed from 99e87fe to 7251b9e.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the great work in addressing the output discrepancies between full cudagraph and piecewise cudagraph. May I ask how the numerical difference of rms_norm was addressed compared to the previous #20190? I didn't identify how this was done after reading through the code; it would be very helpful if you could point it out, as I encountered a similar issue in the CI tests of #20059, where some output discrepancies were identified between full cudagraph and piecewise cudagraph.
Hi @fhl2000, really impressed by your great work in #20059! For the accuracy issue, what I learned from @zou3519 was that when using full cudagraph mode, inductor would pick an rms_norm kernel with different numerics than what's in the piecewise version: pytorch/pytorch#158699. You can confirm whether this is the same issue by turning off inductor ("use_inductor": false) and/or turning off compilation for rms_norm ("custom_ops": ["+rms_norm"]) in the compilation config.
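For reference, a hedged sketch of passing those two options through the offline LLM API; this assumes compilation_config accepts a plain dict in the vLLM version in use, and the model name is only an example:

```python
from vllm import LLM

# Sketch: disable inductor and keep the custom rms_norm op so the same kernel
# is used in both full-cudagraph and piecewise runs. Field names follow the
# comment above but may differ across vLLM versions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed example model
    compilation_config={
        "use_inductor": False,
        "custom_ops": ["+rms_norm"],
    },
)
```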
Force-pushed from eea737f to fa49837.
Thanks for the PR and sorry for the delayed review.
LGTM overall. Left some questions.
vllm/v1/spec_decode/eagle.py (outdated)
Do we really need this? This looks redundant as the loop updates attn_metadata in place.
Yeah, the loop updates the eager-mode attn_metadata in place; for the cudagraph case, however, we need to update attn_metadata_cudagraph with buffered tensors.
Directly copying over the eager-mode attn_metadata is the easiest way to do this, but let me also see if I can refactor it to be cleaner.
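Roughly the pattern being described, sketched with hypothetical field names (the real attention metadata fields in eagle.py differ):

```python
# Hypothetical sketch: CUDA graphs replay fixed tensor addresses, so instead of
# swapping in the eager-mode tensors, their contents are copied into persistent
# buffers that the captured graph already points at.
def copy_into_cudagraph_buffers(eager_md, graph_md, num_tokens):
    # Field names are illustrative only, not the PR's actual metadata layout.
    graph_md.seq_lens[:num_tokens].copy_(eager_md.seq_lens[:num_tokens])
    graph_md.slot_mapping[:num_tokens].copy_(eager_md.slot_mapping[:num_tokens])
    graph_md.block_table[:num_tokens].copy_(eager_md.block_table[:num_tokens])
```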
Refactored and removed this code block, thanks for the suggestion!
Force-pushed from aa14306 to 935ee07.
This pull request has merge conflicts that must be resolved before it can be merged.
WoosukKwon left a comment:
LGTM! Thanks for the refactoring!
@zixi-qi Can you please rebase the PR with main? While I think the failing tests are not relevant to this PR, it'd be nice to double-check.
Force-pushed from 4299dfd to 51939f0.
Done! Btw I also rebased #22691, which is a small fix and shouldn't trigger any test failures on its own. Maybe we can also use it to cross-validate whether any test failures are due to this change or not.
Seems the test_eagle_correctness test is broken on trunk, and adding another dimension to it won't help, so I moved the full cudagraph validation to a separate test.
@zixi-qi Hi, thanks for the great work. I have a quick question: have you compared the acceptance rates between full CUDA graph mode and eager mode? If so, do you observe any differences in acceptance behavior between them?
This pull request has merge conflicts that must be resolved before it can be merged.
Awesome! Let me rebase and try!
Thanks for the follow-up! CUDA graph itself should be supported in #23679; this change adds full graph support on top of that but still needs to be rebased.
After #23679 this PR is no longer relevant, and draft full cudagraph needs to be supported based on the new design. Hence closing this one to reduce confusion.
Purpose
Support running the eagle speculative decoding draft model in full cudagraph mode for v1 (referencing #16072). On H100 TP1 with the Llama 3.1 8B model, the full cudagraph version shows on-par performance (or a marginal improvement).
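For illustration, a hedged sketch of how such a setup might be launched through the offline API; the draft model name and the full_cuda_graph flag are assumptions, and the exact config fields vary across vLLM versions:

```python
from vllm import LLM, SamplingParams

# Sketch: Llama 3.1 8B target with an eagle draft model, requesting full CUDA
# graph capture via the compilation config. Names are illustrative only.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # assumed draft model
        "num_speculative_tokens": 3,
    },
    compilation_config={"full_cuda_graph": True},
    tensor_parallel_size=1,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```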
Test Plan & Result
Side Note
In a previous draft PR (#20190 (comment)), a numerical difference between full and piecewise cudagraph modes was identified. @zou3519 helped identify the root cause: the rms_norm kernel selected by inductor has slightly better numerics than the eager version (pytorch/pytorch#158699), which is expected behavior.