
[MRV2] Add FULL CUDA graph support with PP#37821

Closed

WoosukKwon wants to merge 8 commits into main from
woosuk/mrv2-pp-full-cudagraph

Conversation

@WoosukKwon
Collaborator

Should be merged after #35162 and #37818

ZhanqiuHu and others added 7 commits March 2, 2026 18:34
Add piecewise CUDA graph capture/replay support for PP in V2 model runner.

model_runner.py:
- Replace eager-only PP guard with PP-aware cudagraph mode handling
- Create persistent IntermediateTensors buffer during capture
- Copy received tensors into the buffer at runtime for address stability

cudagraph_utils.py:
- Thread intermediate_tensors through the capture pipeline
- Handle IntermediateTensors output on non-last PP ranks
- Fix num_reqs divisibility for uniform query length backends
- Assert FULL cudagraph mode is not used with PP
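
The persistent-buffer pattern described in the commit message above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the class and parameter names (`IntermediateTensorsBuffer`, `max_tokens`, `hidden`) are invented for the sketch, and only the core idea — allocate once at capture time, copy in place at runtime so tensor addresses stay stable — reflects what the commit describes.

```python
import torch

class IntermediateTensorsBuffer:
    """Hypothetical sketch: a buffer allocated once during CUDA graph
    capture so that replays see the same tensor addresses."""

    def __init__(self, max_tokens: int, hidden: int,
                 keys=("hidden_states", "residual")):
        # Allocated once, sized for the largest captured batch.
        self.tensors = {k: torch.zeros(max_tokens, hidden) for k in keys}

    def copy_from(self, received: dict, num_tokens: int) -> dict:
        # Copy activations received from the previous PP rank into the
        # persistent buffer in place, then hand the model views of the
        # first num_tokens rows. copy_ returns the destination view, so
        # the addresses the captured graph baked in are preserved.
        return {
            k: self.tensors[k][:num_tokens].copy_(v[:num_tokens])
            for k, v in received.items()
        }

buf = IntermediateTensorsBuffer(max_tokens=16, hidden=4)
recv = {k: torch.randn(8, 4) for k in buf.tensors}
views = buf.copy_from(recv, num_tokens=8)
# The views alias the persistent buffer, not the received tensors.
assert all(views[k].data_ptr() == buf.tensors[k].data_ptr() for k in views)
```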

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces full CUDA graph support with pipeline parallelism (PP) in MRV2. The changes primarily involve modifications to cudagraph_utils.py and model_runner.py to handle intermediate tensors and pipeline parallelism during CUDA graph capture and execution. The code adds logic to manage intermediate tensors for non-first PP ranks, ensuring correct data slicing and copying during graph capture and replay. The changes also remove the previous limitation of skipping CUDA graph capture when pipeline parallelism is enabled.

Comment on lines +311 to +315
if not self.is_first_pp_rank:
assert intermediate_tensors is not None
intermediate_tensors_sliced = intermediate_tensors[:num_tokens]
else:
intermediate_tensors_sliced = None
Contributor


critical

The intermediate_tensors_sliced variable is assigned None when self.is_first_pp_rank is true, yet it is passed unconditionally in the model_inputs dictionary on line 345. Since both branches assign the variable, a NameError cannot occur; the actual risk is that the first PP rank passes intermediate_tensors=None, which is only safe if the model's forward signature accepts None there.

                intermediate_tensors_sliced = intermediate_tensors[:num_tokens]
            else:
                intermediate_tensors_sliced = None

            model_inputs = {
                "input_ids": input_buffers.input_ids[:num_tokens],
                "positions": input_buffers.positions[:num_tokens],
                "intermediate_tensors": intermediate_tensors_sliced if intermediate_tensors_sliced is not None else None,
                **model_state.prepare_dummy_inputs(num_reqs, num_tokens),
            }

Comment on lines 1026 to +1036
assert intermediate_tensors is not None
assert self.intermediate_tensors is not None
n = input_batch.num_tokens_after_padding
intermediate_tensors = IntermediateTensors(
{
k: v[:n].copy_(intermediate_tensors.tensors[k][:n])
for k, v in self.intermediate_tensors.tensors.items()
},
intermediate_tensors.kv_connector_output,
)
model_inputs["intermediate_tensors"] = intermediate_tensors
Contributor


high

The code copies data from intermediate_tensors to self.intermediate_tensors using copy_. This operation is in-place and modifies self.intermediate_tensors. If the CUDA graph is replayed multiple times with different intermediate_tensors, the initial data in self.intermediate_tensors will be overwritten, potentially leading to incorrect results in subsequent iterations. This is especially problematic if the graph is captured with one set of intermediate tensors and replayed with another.

                    k: v[:n].clone() # Create a copy to avoid modifying the original tensor
                    for k, v in intermediate_tensors.tensors.items()
                },
                intermediate_tensors.kv_connector_output,
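
For context on this review thread: replaying a captured CUDA graph requires input tensors to live at the same addresses the graph was captured with, which is why the PR copies into a persistent buffer in place. A minimal CPU-side sketch of that address-stability property (variable names hypothetical, and plain CPU tensors standing in for the CUDA buffers):

```python
import torch

# Persistent buffer allocated once, as during graph capture.
buf = torch.zeros(8)
addr = buf.data_ptr()

# A new tensor arriving at replay time from the previous PP rank.
incoming = torch.arange(8, dtype=torch.float32)

# In-place copy_ keeps the buffer's address stable, so a captured
# graph that baked this address in still reads the fresh data.
buf[:8].copy_(incoming)
assert buf.data_ptr() == addr

# clone() instead allocates new memory at a different address,
# which a previously captured graph would never observe.
cloned = incoming.clone()
assert cloned.data_ptr() != addr
```

Under this reading, swapping copy_ for clone() as the suggestion proposes would allocate fresh memory each step, which is incompatible with a graph that was captured against the persistent buffer's addresses.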

@mergify

mergify bot commented Mar 22, 2026

Hi @WoosukKwon, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@mergify

mergify bot commented Mar 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@WoosukKwon
Collaborator Author

merged with #35162

@WoosukKwon WoosukKwon closed this Mar 22, 2026