[MRV2] Add FULL CUDA graph support with PP #37821
Conversation
Add piecewise CUDA graph capture/replay support for PP in the V2 model runner.

model_runner.py:
- Replace the eager-only PP guard with PP-aware cudagraph mode handling
- Create a persistent IntermediateTensors buffer during capture
- Copy received tensors into the buffer at runtime for address stability

cudagraph_utils.py:
- Thread intermediate_tensors through the capture pipeline
- Handle IntermediateTensors output on non-last PP ranks
- Fix num_reqs divisibility for uniform query length backends
- Assert FULL cudagraph mode is not used with PP

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
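The PP-aware cudagraph mode handling described in the PR summary can be sketched roughly as follows. This is a minimal illustration only: `resolve_cudagraph_mode`, `pp_world_size`, and the mode strings are hypothetical names for this sketch, not vLLM's actual API. It shows the shape of the new guard — FULL capture is rejected under pipeline parallelism, while piecewise capture (and eager execution) remain valid, replacing the old behavior of forcing eager whenever PP was enabled.

```python
# Hypothetical sketch of a PP-aware cudagraph mode guard.
# Names here are illustrative, not vLLM's real identifiers.

def resolve_cudagraph_mode(requested_mode: str, pp_world_size: int) -> str:
    """Validate the requested cudagraph mode against pipeline parallelism."""
    valid_modes = {"NONE", "PIECEWISE", "FULL"}
    if requested_mode not in valid_modes:
        raise ValueError(f"unknown cudagraph mode: {requested_mode}")
    if pp_world_size > 1:
        # The PR asserts FULL mode is never combined with PP; piecewise
        # capture/replay is now allowed instead of falling back to eager.
        assert requested_mode != "FULL", (
            "FULL cudagraph mode is not supported with pipeline parallelism"
        )
    return requested_mode

print(resolve_cudagraph_mode("PIECEWISE", pp_world_size=2))  # PIECEWISE
```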
Code Review
This pull request introduces full CUDA graph support with pipeline parallelism (PP) in MRV2. The changes primarily involve modifications to cudagraph_utils.py and model_runner.py to handle intermediate tensors and pipeline parallelism during CUDA graph capture and execution. The code adds logic to manage intermediate tensors for non-first PP ranks, ensuring correct data slicing and copying during graph capture and replay. The changes also remove the previous limitation of skipping CUDA graph capture when pipeline parallelism is enabled.
```python
if not self.is_first_pp_rank:
    assert intermediate_tensors is not None
    intermediate_tensors_sliced = intermediate_tensors[:num_tokens]
else:
    intermediate_tensors_sliced = None
```
The `intermediate_tensors_sliced` variable is assigned `None` when `self.is_first_pp_rank` is true, and it is then passed unconditionally in the `model_inputs` dictionary on line 345. Because both branches assign the variable, a NameError cannot actually occur here; the real concern is that passing `None` may fail if the model expects `intermediate_tensors` to always be present.
```python
else:
    intermediate_tensors_sliced = None
model_inputs = {
    "input_ids": input_buffers.input_ids[:num_tokens],
    "positions": input_buffers.positions[:num_tokens],
    "intermediate_tensors": intermediate_tensors_sliced,
    **model_state.prepare_dummy_inputs(num_reqs, num_tokens),
}
```

```python
assert intermediate_tensors is not None
assert self.intermediate_tensors is not None
n = input_batch.num_tokens_after_padding
intermediate_tensors = IntermediateTensors(
    {
        k: v[:n].copy_(intermediate_tensors.tensors[k][:n])
        for k, v in self.intermediate_tensors.tensors.items()
    },
    intermediate_tensors.kv_connector_output,
)
model_inputs["intermediate_tensors"] = intermediate_tensors
```
The code copies data from intermediate_tensors to self.intermediate_tensors using copy_. This operation is in-place and modifies self.intermediate_tensors. If the CUDA graph is replayed multiple times with different intermediate_tensors, the initial data in self.intermediate_tensors will be overwritten, potentially leading to incorrect results in subsequent iterations. This is especially problematic if the graph is captured with one set of intermediate tensors and replayed with another.
```python
intermediate_tensors = IntermediateTensors(
    {
        k: v[:n].clone()  # Create a copy to avoid modifying the original tensor
        for k, v in intermediate_tensors.tensors.items()
    },
    intermediate_tensors.kv_connector_output,
)
```
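The trade-off between the PR's in-place `copy_` and the suggested `clone()` can be illustrated with a rough sketch. Plain Python objects stand in for torch tensors here, and the `Buffer` class is purely hypothetical: in-place copy refreshes a persistent buffer's contents while keeping its identity (the property CUDA graph replay relies on, since the captured graph bakes in fixed addresses), whereas clone allocates new storage each time.

```python
# Toy illustration (hypothetical Buffer class, not torch) of in-place copy
# into a persistent buffer vs. cloning into fresh storage.

class Buffer:
    """Stand-in for a tensor: fixed identity, mutable contents."""
    def __init__(self, data):
        self.data = list(data)

    def copy_(self, other):   # in-place, like torch.Tensor.copy_
        self.data[:] = other.data
        return self

    def clone(self):          # out-of-place, like torch.Tensor.clone
        return Buffer(self.data)

persistent = Buffer([0, 0, 0])        # created once at capture time
addr_before = id(persistent)

incoming = Buffer([1, 2, 3])          # received from the previous PP rank
persistent.copy_(incoming)            # refresh contents, keep the identity

assert id(persistent) == addr_before  # stable address -> replay sees new data
assert persistent.data == [1, 2, 3]

cloned = incoming.clone()             # clone allocates a new object instead
assert cloned is not incoming
```

The sketch shows why cloning would defeat the purpose of the persistent buffer: the replayed graph would never observe the cloned storage, while the in-place copy overwrites the buffer by design on every step.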
Hi @WoosukKwon, the pre-commit checks have failed. Please run:

```shell
uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged with #35162.
Should be merged after #35162 and #37818