
[Model Runner V2] support piecewise & mixed cudagraph#32771

Merged
WoosukKwon merged 19 commits into vllm-project:main from izhuhaoran:MRV2-support-piecewise
Feb 18, 2026
Conversation

@izhuhaoran
Contributor

@izhuhaoran izhuhaoran commented Jan 21, 2026

Purpose

As titled, this PR supports piecewise & mixed cudagraph for model runner v2
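
For context, a minimal usage sketch of how mixed (full + piecewise) CUDA graphs can be requested through the compilation config. This is illustrative only; the field and value names follow vLLM's CompilationConfig and may differ across versions:

# Hedged sketch: enabling mixed full + piecewise CUDA graphs for an engine.
# "FULL_AND_PIECEWISE" captures full graphs for uniform decode batches and
# piecewise graphs for mixed prefill/decode batches.
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder; any supported model
    compilation_config=CompilationConfig(cudagraph_mode="FULL_AND_PIECEWISE"),
)
outputs = llm.generate("Hello, my name is")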

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for piecewise and mixed CUDA graphs in the v2 model runner. The changes are well-structured, refactoring the CUDA graph capture logic to handle different modes (FULL, PIECEWISE, FULL_AND_PIECEWISE, etc.) more cleanly. The runtime dispatch logic in the model runner is also updated accordingly. While the implementation for piecewise graphs seems correct, I've found a critical issue in the full graph capture implementation concerning LoRA support.
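
For readers unfamiliar with the dispatch being described, here is a rough, purely illustrative sketch of how a runner might pick which captured graph to replay for a batch. The enum values mirror the modes named above, but the function and variable names are hypothetical and not the actual model-runner code:

from enum import Enum, auto

class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()
    FULL_AND_PIECEWISE = auto()

def select_runtime_mode(
    mode: CUDAGraphMode,
    is_uniform_decode: bool,
    num_tokens: int,
    captured_full_sizes: set[int],
) -> CUDAGraphMode:
    # Full graphs only apply to uniform decode batches whose padded token
    # count was captured; otherwise fall back to piecewise graphs (if enabled)
    # or eager execution.
    if mode in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_AND_PIECEWISE):
        if is_uniform_decode and num_tokens in captured_full_sizes:
            return CUDAGraphMode.FULL
    if mode in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE):
        return CUDAGraphMode.PIECEWISE
    return CUDAGraphMode.NONE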

@mergify
Contributor

mergify Bot commented Jan 21, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment @cursor review or bugbot run to trigger another review on this PR

Comment thread vllm/v1/worker/gpu/model_runner.py
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Collaborator

@LucasWilkinson LucasWilkinson left a comment


This PR seems to assume that PIECEWISE cudagraphs and FULL cudagraphs will have the same sizes; FULL cudagraphs are upper-bounded by max_num_seqs (or, in the case of spec-decode, max_num_seqs * (1 + num_speculative_tokens)), while PIECEWISE cudagraphs are upper-bounded by max_cudagraph_capture_size. At least in V1, I think this should be preserved for performance (it doesn't make sense to cut PIECEWISE cudagraphs off at 256 when they currently go up to 512 or 1024).
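
To make the size-bound concern concrete, a small sketch of how the two capture-size lists could be derived with different upper bounds. The variable names are illustrative; the real limits come from vLLM's scheduler and compilation configs:

def build_capture_sizes(
    candidate_sizes: list[int],
    max_num_seqs: int,
    num_speculative_tokens: int,
    max_cudagraph_capture_size: int,
) -> tuple[list[int], list[int]]:
    # Full graphs only need to cover the largest uniform decode batch, while
    # piecewise graphs should cover sizes up to the capture-size cap.
    max_full = max_num_seqs * (1 + num_speculative_tokens)
    full_sizes = [s for s in candidate_sizes if s <= max_full]
    piecewise_sizes = [s for s in candidate_sizes if s <= max_cudagraph_capture_size]
    return full_sizes, piecewise_sizes

# e.g. with max_num_seqs=256 and max_cudagraph_capture_size=1024, full graphs
# stop at 256 while piecewise graphs still go up to 1024.
full, piecewise = build_capture_sizes(
    candidate_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024],
    max_num_seqs=256,
    num_speculative_tokens=0,
    max_cudagraph_capture_size=1024,
)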

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran
Contributor Author

This PR seems to assume that PIECEWISE cudagraphs and FULL cudagraphs will have the same sizes; FULL cudagraphs are upper-bounded by max_num_seqs (or, in the case of spec-decode, max_num_seqs * (1 + num_speculative_tokens)), while PIECEWISE cudagraphs are upper-bounded by max_cudagraph_capture_size. At least in V1, I think this should be preserved for performance (it doesn't make sense to cut PIECEWISE cudagraphs off at 256 when they currently go up to 512 or 1024).

Thanks for this suggestion, it's already fixed!

@mergify
Contributor

mergify Bot commented Jan 23, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran izhuhaoran force-pushed the MRV2-support-piecewise branch from ba3ff40 to 26729fc on January 24, 2026 02:51
@mergify
Contributor

mergify Bot commented Jan 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @izhuhaoran.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jan 24, 2026
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@mergify mergify Bot removed the needs-rebase label Jan 24, 2026
@mergify
Contributor

mergify Bot commented Jan 24, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@mergify
Contributor

mergify Bot commented Jan 24, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

…n piecewise

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran izhuhaoran force-pushed the MRV2-support-piecewise branch from b323e3d to ad4d8fb on January 24, 2026 16:41
@izhuhaoran
Contributor Author

Note: after merging main, runtime errors appear; they’ll be resolved in #33004.

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@mergify
Contributor

mergify Bot commented Feb 3, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Member

@njhill njhill left a comment


Thanks a lot for this @izhuhaoran! Great work

I tested it and it gives a huge speedup on Blackwell with a small model / decode-heavy workload.

Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/spec_decode/eagle_cudagraph.py Outdated
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread .gitignore Outdated
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran
Contributor Author

@njhill Thanks for your review, I've updated the code. PTAL when you have time.

Member

@njhill njhill left a comment


formatting nits to reduce loc

Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran
Contributor Author

formatting nits to reduce loc

Thanks for these suggestions, already reformatted.

@mergify
Contributor

mergify Bot commented Feb 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @izhuhaoran.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Feb 17, 2026
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@mergify mergify Bot removed the needs-rebase label Feb 18, 2026
@LucasWilkinson
Collaborator

Thanks for the contribution! @izhuhaoran

I don't think we need to do it now, but we should start thinking about how we can more uniquely identify cudagraphs based on the features/configuration of the captured graph. I don't think num_tokens is sufficient or explicit enough.

Currently this PR just maintains two types of batches, uniform and non-uniform, and handles that by maintaining two lists of token counts, i.e.

if uniform_decode and self.uniform_decode_cudagraph_sizes:
    return self.uniform_decode_cudagraph_sizes.get(num_tokens)
return self.cudagraph_sizes.get(num_tokens)

If more "batch types" are needed this doesn't feel very scalable, e.g. of features that would have different "batch types" are:

  1. LoRA: currently (MRV1) we support --specialize-active-lora, which captures different cudagraphs for different active LoRA counts
  2. dynamic spec-decode: we will likely want to capture different uniform graphs with the same token counts but a different number of requests, e.g. we may want 2 graphs for num_tokens = 8, one with 4 requests for num_speculated_tokens = 1, and one with 2 requests for num_speculated_tokens = 3

In MRV1 this is handled by mapping BatchDescriptor -> cudagraph, but it is not super clean (matching gets a bit cleaner in #34102). I think it's worth thinking a bit into the future about how we might support these cases more naturally.
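
As a rough, purely illustrative sketch of what a descriptor-keyed registry could look like (the class and field names here are hypothetical, not the MRV1 implementation):

from dataclasses import dataclass

@dataclass(frozen=True)
class BatchDescriptor:
    # Hypothetical key capturing the features a graph was specialized for.
    num_tokens: int
    uniform_decode: bool
    num_reqs: int | None = None          # e.g. for dynamic spec-decode
    num_active_loras: int | None = None  # e.g. for --specialize-active-lora

class CudagraphRegistry:
    def __init__(self) -> None:
        self._graphs: dict[BatchDescriptor, object] = {}

    def register(self, desc: BatchDescriptor, graph: object) -> None:
        self._graphs[desc] = graph

    def lookup(self, desc: BatchDescriptor) -> object | None:
        # A real implementation would need fuzzier matching, e.g. ignoring
        # fields that a captured graph is not specialized on.
        return self._graphs.get(desc)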

cc @njhill @WoosukKwon

@WoosukKwon
Collaborator

@LucasWilkinson Thanks for bringing it up. I do agree that the CUDA graph design needs more discussion. I'm accepting this PR to move us forward, but we must revisit this, this week or next.

Collaborator

@WoosukKwon WoosukKwon left a comment


Thanks for the PR! The code looks clean and well-structured, and I think it’s a solid implementation given the current CUDA graph design. Great work! 👍

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Feb 18, 2026
@WoosukKwon WoosukKwon merged commit 11d3976 into vllm-project:main Feb 18, 2026
7 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Feb 18, 2026
@izhuhaoran
Contributor Author

@LucasWilkinson @WoosukKwon Thanks for the review and sorry for the late reply due to Chinese New Year. Yes, this PR is only based on the current CUDA graph design, and I agree that it needs further improvement.

jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026
…2771)

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…2771)

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
jiangkuaixue123 pushed a commit to jiangkuaixue123/vllm that referenced this pull request Apr 28, 2026
…2771)

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

Projects

Status: Done


4 participants