[Model Runner V2] support piecewise & mixed cudagraph#32771
WoosukKwon merged 19 commits into vllm-project:main
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
This pull request introduces support for piecewise and mixed CUDA graphs in the v2 model runner. The changes are well-structured, refactoring the CUDA graph capture logic to handle different modes (FULL, PIECEWISE, FULL_AND_PIECEWISE, etc.) more cleanly. The runtime dispatch logic in the model runner is also updated accordingly. While the implementation for piecewise graphs seems correct, I've found a critical issue in the full graph capture implementation concerning LoRA support.
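For context, the mode-based runtime dispatch the review describes can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the enum mirrors vLLM's `CUDAGraphMode` values named above, while `select_graph` and its return strings are hypothetical names.

```python
from enum import Enum, auto

class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()
    FULL_AND_PIECEWISE = auto()

def select_graph(mode: CUDAGraphMode, is_uniform_decode: bool) -> str:
    """Decide which captured graph (if any) to replay for a batch.

    FULL graphs replay the entire step, so they only cover uniform
    decode batches; PIECEWISE graphs wrap just the cudagraph-safe
    segments of the model, so they can also serve mixed
    prefill/decode batches.
    """
    if mode is CUDAGraphMode.NONE:
        return "eager"
    if mode is CUDAGraphMode.PIECEWISE:
        return "piecewise"
    if is_uniform_decode:
        return "full"
    # A FULL-only config cannot serve a mixed batch and falls back to
    # eager; FULL_AND_PIECEWISE falls back to the piecewise graph.
    return "piecewise" if mode is CUDAGraphMode.FULL_AND_PIECEWISE else "eager"
```

The key point is that FULL_AND_PIECEWISE gets the full-graph fast path for uniform decode batches while still covering mixed batches with the piecewise graph.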
Hi @izhuhaoran, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
LucasWilkinson
left a comment
This PR seems to assume that PIECEWISE cudagraphs and FULL cudagraphs will have the same sizes; FULL cudagraphs are upper-bounded by max_num_seqs (or, with spec-decode, max_num_seqs * (1 + num_speculative_tokens)), while PIECEWISE cudagraphs are upper-bounded by max_cudagraph_capture_size. At least in V1, I think this distinction should be preserved for performance (it doesn't make sense to cut PIECEWISE cudagraphs off at 256 when they currently go up to 512 or 1024).
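The two bounds the comment distinguishes can be sketched like this. The parameter names come from the comment itself; the function name and the candidate-size list are hypothetical, for illustration only.

```python
def capture_size_lists(
    max_num_seqs: int,
    num_speculative_tokens: int,
    max_cudagraph_capture_size: int,
    candidate_sizes: list[int],
) -> tuple[list[int], list[int]]:
    """Derive separate capture-size lists for FULL and PIECEWISE graphs.

    FULL graphs replay a whole uniform-decode step, so their batch size
    is bounded by max_num_seqs (times tokens per sequence under
    speculative decoding). PIECEWISE graphs only wrap compilable
    segments and may go all the way up to max_cudagraph_capture_size.
    """
    full_bound = max_num_seqs * (1 + num_speculative_tokens)
    full_sizes = [s for s in candidate_sizes if s <= full_bound]
    piecewise_sizes = [s for s in candidate_sizes if s <= max_cudagraph_capture_size]
    return full_sizes, piecewise_sizes
```

For example, with max_num_seqs=256, no spec-decode, and a 512-token capture limit, the piecewise list keeps the 512 bucket that the full list must drop.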
Thanks for this suggestion, already fixed!
Force-pushed from ba3ff40 to 26729fc.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from b323e3d to ad4d8fb.
Note: after merging main, runtime errors appear; they'll be resolved in #33004.
njhill
left a comment
Thanks a lot for this @izhuhaoran! Great work
I tested it and it gives a huge speedup on blackwell with a small model / decode-heavy workload.
@njhill Thanks for your review, I've updated the code. PTAL when you have time.
Thanks for these suggestions, already reformatted.
Thanks for the contribution, @izhuhaoran! I don't think we need to do it now, but we should start thinking about how we can more uniquely identify cudagraphs based on the features/configuration of the captured graph. Currently this PR just maintains 2 types of batches, uniform and non-uniform, and handles that by maintaining 2 lists of token counts. If more "batch types" are needed, this doesn't feel very scalable; examples of features that would have different "batch types" are:
In MRV1 this is handled by a mapping.
@LucasWilkinson Thanks for bringing it up. I do agree that CUDA graph needs more design discussion. I'm accepting this PR to move us forward, but we must revisit this, either this week or next.
WoosukKwon
left a comment
Thanks for the PR! The code looks clean and well-structured, and I think it’s a solid implementation given the current CUDA graph design. Great work! 👍
@LucasWilkinson @WoosukKwon Thanks for the review, and sorry for the late reply due to Chinese New Year. Yes, this PR is only based on the current CUDA graph design, and I agree that it needs further improvement.
…2771) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Purpose
As titled, this PR adds piecewise & mixed CUDA graph support to Model Runner V2.