[0.18.0][BugFix] Update capture sizes after rounding operations#8380
Conversation
Summary of Changes: This pull request addresses a bug where capture sizes became mismatched following rounding operations in speculative or sp modes. By shifting from statically cached sizes to real-time descriptors provided by the dispatcher, it ensures that graph parameters stay aligned with the actual state of the model runner.
Signed-off-by: Zetong Li <slippersss@126.com>
Force-pushed 13602a4 to 959eddf (compare)
Code Review
Suggested PR Title:

```markdown
[Ops][Misc] Use dynamic capture sizes for ACL graph parameter initialization
```

Suggested PR Summary:

```markdown
### What this PR does / why we need it?
This pull request updates the `_check_and_update_cudagraph_mode` method to retrieve capture sizes from the `cudagraph_dispatcher`. This ensures that `set_graph_params` and `set_draft_graph_params` are initialized with the correct token counts for ACL graph execution instead of using static batch sizes. A review comment suggests simplifying the set comprehension used to extract these sizes for better readability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.
```
```python
capture_descs = self.cudagraph_dispatcher.get_capture_descs()
capture_sizes = sorted({
    desc.num_tokens
    for _, descs in capture_descs
    for desc in descs
})
```
It has a point, but it's not that important.
Merged b72ade9 into vllm-project:releases/v0.18.0
What this PR does / why we need it?
This PR is partially cherry-picked from #8172.
This PR aims to fix mismatched capture sizes after rounding operations when using sp or speculative decoding. The root cause is that the original `self.cudagraph_batch_sizes` is no longer updated and remains at its initial values. Now we use `self.cudagraph_dispatcher.get_capture_descs` to get the up-to-date sizes.
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
By CI.
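The mismatch described above can be illustrated with a small hypothetical sketch: a size list cached at init time goes stale once a rounding step (here, a simple round-up-to-power-of-two, which is only an illustrative stand-in for the actual rounding) changes the sizes that graphs are captured for:

```python
def round_up_to_pow2(n: int) -> int:
    """Illustrative rounding step; the real operation may differ."""
    p = 1
    while p < n:
        p *= 2
    return p


initial_sizes = [1, 3, 6, 12]  # cached once at initialization

# Sizes actually captured after rounding: dedup and sort, as the PR does.
rounded = sorted({round_up_to_pow2(n) for n in initial_sizes})

print(rounded)  # [1, 4, 8, 16]
# A consumer still reading the stale cache would see [1, 3, 6, 12],
# mismatching the graphs that were actually captured for `rounded`.
# Re-deriving sizes from the dispatcher's descriptors avoids this.
```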