[Model Runner V2] Enable piecewise & full CUDA graphs for pipeline parallelism#35162
Conversation
This pull request enables piecewise CUDA graph capture for pipeline parallelism in the V2 model runner, which was previously unsupported and forced eager mode execution. The changes are well-structured and correctly handle the complexities of this feature. Key changes include downgrading CUDA graph modes for compatibility with pipeline parallelism, introducing a persistent buffer for intermediate tensors to ensure stable memory addresses for graph replay, and updating the graph capture utilities to handle pipeline parallelism constructs. I have one suggestion to improve robustness by adding an assertion to verify the consistency of intermediate tensors between pipeline stages.
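The persistent-buffer point above follows from how CUDA graph replay works: a captured graph reuses the exact memory addresses recorded at capture time, so tensors received from the previous pipeline stage must be copied into a pre-allocated buffer rather than passed to the graph directly. A minimal sketch of that idea (class and method names are hypothetical, not vLLM's actual code):

```python
# Hypothetical sketch: keep intermediate tensors at fixed addresses so a
# captured CUDA graph can read them on every replay.
import torch


class PersistentIntermediateBuffer:
    """Holds fixed-address tensors that a captured CUDA graph reads from."""

    def __init__(self, shapes: dict[str, tuple[int, ...]], dtype, device):
        # Allocated once before capture; the addresses stay stable afterwards.
        self.tensors = {
            name: torch.zeros(shape, dtype=dtype, device=device)
            for name, shape in shapes.items()
        }

    def copy_from(self, received: dict[str, torch.Tensor]) -> None:
        # copy_() writes in place, preserving the captured addresses.
        for name, src in received.items():
            self.tensors[name][: src.shape[0]].copy_(src)
```

At runtime, each received `IntermediateTensors` payload would be copied into this buffer before replaying the graph, so the graph always reads from the same addresses it saw during capture.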
yewentao256
left a comment
Could you also try PR #34903?
Not sure if it is the same issue for enabling full CUDA graph.
Cool! Will take a look 👍
This pull request has merge conflicts that must be resolved before it can be merged.
For now, the only error I hit was the uniform-query-lengths assertion with TRTLLM decode; I don't think there is an illegal memory access from misaligned attention metadata at the moment. I will keep that issue in mind, since I may run into it later with full CUDA graph support.
Add piecewise CUDA graph capture/replay support for PP in V2 model runner.

model_runner.py:
- Replace eager-only PP guard with PP-aware cudagraph mode handling
- Create persistent IntermediateTensors buffer during capture
- Copy received tensors into the buffer at runtime for address stability

cudagraph_utils.py:
- Thread intermediate_tensors through the capture pipeline
- Handle IntermediateTensors output on non-last PP ranks
- Fix num_reqs divisibility for uniform query length backends
- Assert FULL cudagraph mode is not used with PP

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Friendly ping @WoosukKwon @njhill: This PR has been rebased and is ready for review. Would appreciate any feedback when you get a chance. Thanks!
WoosukKwon
left a comment
LGTM! Thanks for the PR and sorry for the late review.
I've edited the PR for faster merge. Will follow up with the FULL graph support.
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Actually, I ended up merging with #37821 for easier testing.
That sounds great! Thank you!
…sm (vllm-project#35162)

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
Summary
Add piecewise CUDA graph capture/replay support for PP in V2 model runner.
Related: #33960
`model_runner.py`:
- Replace eager-only PP guard with PP-aware cudagraph mode handling
- Create persistent `IntermediateTensors` buffer and copy received tensors into it during graph replay for address stability

`cudagraph_utils.py`:
- Thread `intermediate_tensors` through the capture pipeline
- Handle `IntermediateTensors` output on non-last PP ranks
- Fix `num_reqs` divisibility for uniform query length backends

Purpose
The V2 model runner did not support CUDA graph capture with PP and fell back to eager mode. This PR adds piecewise CUDA graph capture for PP.
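One way to read the "PP-aware cudagraph mode handling" above is as a downgrade rule: when pipeline parallelism is enabled, a requested full-graph mode is lowered to piecewise instead of forcing eager execution. A minimal sketch under that assumption (the enum and function are illustrative, not vLLM's actual API):

```python
# Illustrative sketch of PP-aware cudagraph mode selection; names are
# hypothetical and do not mirror vLLM's real configuration classes.
from enum import Enum


class CUDAGraphMode(Enum):
    NONE = "none"          # eager execution
    PIECEWISE = "piecewise"
    FULL = "full"


def resolve_cudagraph_mode(requested: CUDAGraphMode, pp_size: int) -> CUDAGraphMode:
    if pp_size > 1 and requested is CUDAGraphMode.FULL:
        # Downgrade rather than disable: piecewise capture works with PP,
        # and FULL mode is asserted against elsewhere.
        return CUDAGraphMode.PIECEWISE
    return requested
```

With `pp_size == 1` the requested mode passes through unchanged, so single-stage behavior is unaffected.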
Question
With CUDA graph capture, I ran into
`AssertionError: TRTLLM decode requires uniform query lengths per request` (`flashinfer.py:1109`) when `num_tokens % num_reqs != 0`. So I added this workaround to ensure divisibility, but I am not sure if this is the right approach.

Test plan
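One possible shape for that divisibility workaround is to shrink `num_reqs` to the largest value that evenly divides `num_tokens`, so every request gets the same query length. This is a hedged sketch of the idea only; the function name and placement are mine, not the PR's actual code:

```python
# Hypothetical helper: pick a request count that divides num_tokens evenly,
# satisfying the TRTLLM uniform-query-length assertion during capture.
def adjust_num_reqs(num_tokens: int, num_reqs: int) -> int:
    num_reqs = min(num_reqs, num_tokens)
    while num_tokens % num_reqs != 0:
        num_reqs -= 1
    # Every request is then assigned num_tokens // num_reqs tokens.
    return num_reqs
```

For example, with `num_tokens=130` and `num_reqs=128`, this would fall back to 65 requests of 2 tokens each; the trade-off is capturing with fewer requests than asked for.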
Test results
Config: Qwen3-30B-A3B-Thinking-2507-FP8, PP=2, 2×B200, `--max-num-seqs 128`

Performance (128 prompts, input=2, output=512)
Accuracy (GSM8K, 5-shot)
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.