[MRV2] Add FULL CUDA graph support with PP #37821
Conversation
Add piecewise CUDA graph capture/replay support for PP in the V2 model runner.

model_runner.py:
- Replace the eager-only PP guard with PP-aware cudagraph mode handling
- Create a persistent IntermediateTensors buffer during capture
- Copy received tensors into the buffer at runtime for address stability

cudagraph_utils.py:
- Thread intermediate_tensors through the capture pipeline
- Handle IntermediateTensors output on non-last PP ranks
- Fix num_reqs divisibility for uniform query length backends
- Assert FULL cudagraph mode is not used with PP

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
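The PP-aware cudagraph mode handling described in the PR summary can be sketched roughly as follows. This is a minimal illustration only: `resolve_cudagraph_mode`, `pp_world_size`, and the mode strings are hypothetical names for this sketch, not vLLM's actual API. It shows the shape of the new guard — FULL capture is rejected under pipeline parallelism, while piecewise capture (and eager execution) remain valid, replacing the old behavior of forcing eager whenever PP was enabled.

```python
# Hypothetical sketch of a PP-aware cudagraph mode guard.
# Names here are illustrative, not vLLM's real identifiers.

def resolve_cudagraph_mode(requested_mode: str, pp_world_size: int) -> str:
    """Validate the requested cudagraph mode against pipeline parallelism."""
    valid_modes = {"NONE", "PIECEWISE", "FULL"}
    if requested_mode not in valid_modes:
        raise ValueError(f"unknown cudagraph mode: {requested_mode}")
    if pp_world_size > 1:
        # The PR asserts FULL mode is never combined with PP; piecewise
        # capture/replay is now allowed instead of falling back to eager.
        assert requested_mode != "FULL", (
            "FULL cudagraph mode is not supported with pipeline parallelism"
        )
    return requested_mode

print(resolve_cudagraph_mode("PIECEWISE", pp_world_size=2))  # PIECEWISE
```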
Code Review
This pull request introduces full CUDA graph support with pipeline parallelism (PP) in MRV2. The changes primarily involve modifications to cudagraph_utils.py and model_runner.py to handle intermediate tensors and pipeline parallelism during CUDA graph capture and execution. The code adds logic to manage intermediate tensors for non-first PP ranks, ensuring correct data slicing and copying during graph capture and replay. The changes also remove the previous limitation of skipping CUDA graph capture when pipeline parallelism is enabled.
```python
if not self.is_first_pp_rank:
    assert intermediate_tensors is not None
    intermediate_tensors_sliced = intermediate_tensors[:num_tokens]
else:
    intermediate_tensors_sliced = None
```
The `intermediate_tensors_sliced` variable is assigned `None` when `self.is_first_pp_rank` is true, and it is then passed unconditionally in the `model_inputs` dictionary on line 345. Because both branches assign the variable, a NameError cannot actually occur here; the real concern is that passing `None` may fail if the model expects `intermediate_tensors` to always be present.
```python
else:
    intermediate_tensors_sliced = None
model_inputs = {
    "input_ids": input_buffers.input_ids[:num_tokens],
    "positions": input_buffers.positions[:num_tokens],
    "intermediate_tensors": intermediate_tensors_sliced,
    **model_state.prepare_dummy_inputs(num_reqs, num_tokens),
}
```

```python
assert intermediate_tensors is not None
assert self.intermediate_tensors is not None
n = input_batch.num_tokens_after_padding
intermediate_tensors = IntermediateTensors(
    {
        k: v[:n].copy_(intermediate_tensors.tensors[k][:n])
        for k, v in self.intermediate_tensors.tensors.items()
    },
    intermediate_tensors.kv_connector_output,
)
model_inputs["intermediate_tensors"] = intermediate_tensors
```
The code copies data from intermediate_tensors to self.intermediate_tensors using copy_. This operation is in-place and modifies self.intermediate_tensors. If the CUDA graph is replayed multiple times with different intermediate_tensors, the initial data in self.intermediate_tensors will be overwritten, potentially leading to incorrect results in subsequent iterations. This is especially problematic if the graph is captured with one set of intermediate tensors and replayed with another.
```python
intermediate_tensors = IntermediateTensors(
    {
        k: v[:n].clone()  # Create a copy to avoid modifying the original tensor
        for k, v in intermediate_tensors.tensors.items()
    },
    intermediate_tensors.kv_connector_output,
)
```
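The trade-off between the PR's in-place `copy_` and the suggested `clone()` can be illustrated with a rough sketch. Plain Python objects stand in for torch tensors here, and the `Buffer` class is purely hypothetical: in-place copy refreshes a persistent buffer's contents while keeping its identity (the property CUDA graph replay relies on, since the captured graph bakes in fixed addresses), whereas clone allocates new storage each time.

```python
# Toy illustration (hypothetical Buffer class, not torch) of in-place copy
# into a persistent buffer vs. cloning into fresh storage.

class Buffer:
    """Stand-in for a tensor: fixed identity, mutable contents."""
    def __init__(self, data):
        self.data = list(data)

    def copy_(self, other):   # in-place, like torch.Tensor.copy_
        self.data[:] = other.data
        return self

    def clone(self):          # out-of-place, like torch.Tensor.clone
        return Buffer(self.data)

persistent = Buffer([0, 0, 0])        # created once at capture time
addr_before = id(persistent)

incoming = Buffer([1, 2, 3])          # received from the previous PP rank
persistent.copy_(incoming)            # refresh contents, keep the identity

assert id(persistent) == addr_before  # stable address -> replay sees new data
assert persistent.data == [1, 2, 3]

cloned = incoming.clone()             # clone allocates a new object instead
assert cloned is not incoming
```

The sketch shows why cloning would defeat the purpose of the persistent buffer: the replayed graph would never observe the cloned storage, while the in-place copy overwrites the buffer by design on every step.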
Hi @WoosukKwon, the pre-commit checks have failed. Please run:

```shell
uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged with #35162.
Should be merged after #35162 and #37818