[Model Runner V2] support piecewise & mixed cudagraph#32771
WoosukKwon merged 19 commits into vllm-project:main
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
This pull request introduces support for piecewise and mixed CUDA graphs in the v2 model runner. The changes are well-structured, refactoring the CUDA graph capture logic to handle different modes (FULL, PIECEWISE, FULL_AND_PIECEWISE, etc.) more cleanly. The runtime dispatch logic in the model runner is also updated accordingly. While the implementation for piecewise graphs seems correct, I've found a critical issue in the full graph capture implementation concerning LoRA support.
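For context, the mode-based runtime dispatch the review describes can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the enum mirrors vLLM's `CUDAGraphMode` values named above, while `select_graph` and its return strings are hypothetical names.

```python
from enum import Enum, auto

class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()
    FULL_AND_PIECEWISE = auto()

def select_graph(mode: CUDAGraphMode, is_uniform_decode: bool) -> str:
    """Decide which captured graph (if any) to replay for a batch.

    FULL graphs replay the entire step, so they only cover uniform
    decode batches; PIECEWISE graphs wrap just the cudagraph-safe
    segments of the model, so they can also serve mixed
    prefill/decode batches.
    """
    if mode is CUDAGraphMode.NONE:
        return "eager"
    if mode is CUDAGraphMode.PIECEWISE:
        return "piecewise"
    if is_uniform_decode:
        return "full"
    # A FULL-only config cannot serve a mixed batch and falls back to
    # eager; FULL_AND_PIECEWISE falls back to the piecewise graph.
    return "piecewise" if mode is CUDAGraphMode.FULL_AND_PIECEWISE else "eager"
```

The key point is that FULL_AND_PIECEWISE gets the full-graph fast path for uniform decode batches while still covering mixed batches with the piecewise graph.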
Hi @izhuhaoran, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
LucasWilkinson
left a comment
This PR seems to assume that PIECEWISE cudagraphs and FULL cudagraphs will have the same sizes; FULL cudagraphs are upper-bounded by max_num_seqs (or, with spec-decode, max_num_seqs * (1 + num_speculative_tokens)), while PIECEWISE cudagraphs are upper-bounded by max_cudagraph_capture_size. At least in V1, I think this distinction should be preserved for performance (it doesn't make sense to cut PIECEWISE cudagraphs off at 256 when they currently go up to 512 or 1024).
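The two bounds the comment distinguishes can be sketched like this. The parameter names come from the comment itself; the function name and the candidate-size list are hypothetical, for illustration only.

```python
def capture_size_lists(
    max_num_seqs: int,
    num_speculative_tokens: int,
    max_cudagraph_capture_size: int,
    candidate_sizes: list[int],
) -> tuple[list[int], list[int]]:
    """Derive separate capture-size lists for FULL and PIECEWISE graphs.

    FULL graphs replay a whole uniform-decode step, so their batch size
    is bounded by max_num_seqs (times tokens per sequence under
    speculative decoding). PIECEWISE graphs only wrap compilable
    segments and may go all the way up to max_cudagraph_capture_size.
    """
    full_bound = max_num_seqs * (1 + num_speculative_tokens)
    full_sizes = [s for s in candidate_sizes if s <= full_bound]
    piecewise_sizes = [s for s in candidate_sizes if s <= max_cudagraph_capture_size]
    return full_sizes, piecewise_sizes
```

For example, with max_num_seqs=256, no spec-decode, and a 512-token capture limit, the piecewise list keeps the 512 bucket that the full list must drop.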
Thanks for this suggestion, already fixed!
Force-pushed from ba3ff40 to 26729fc.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from b323e3d to ad4d8fb.
Note: after merging main, runtime errors appear; they'll be resolved in #33004.
njhill
left a comment
Thanks a lot for this @izhuhaoran! Great work
I tested it and it gives a huge speedup on blackwell with a small model / decode-heavy workload.
@njhill Thanks for your review, I've updated the code. PTAL when you have time.
Thanks for these suggestions, already reformatted.
Thanks for the contribution, @izhuhaoran! I don't think we need to do it now, but we should start thinking about how we can more uniquely identify cudagraphs based on the features/configuration of the captured graph. Currently this PR just maintains 2 types of batches, uniform and non-uniform, and handles that by maintaining 2 lists of token counts. If more "batch types" are needed, this doesn't feel very scalable; examples of features that would have different "batch types" are:
In MRV1 this is handled by a mapping.
@LucasWilkinson Thanks for bringing it up. I do agree that CUDA graph needs more design discussion. I'm accepting this PR to move us forward, but we must revisit this, either this week or next.
WoosukKwon
left a comment
Thanks for the PR! The code looks clean and well-structured, and I think it’s a solid implementation given the current CUDA graph design. Great work! 👍
@LucasWilkinson @WoosukKwon Thanks for the review, and sorry for the late reply due to Chinese New Year. Yes, this PR is only based on the current CUDA graph design, and I agree that it needs further improvement.
…2771) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Purpose
As titled, this PR adds piecewise & mixed CUDA graph support to Model Runner V2.