fix: Add PIECEWISE cudagraph mode config for prefill server to avoid startup errors #29079
xbfs wants to merge 4 commits into vllm-project:main
|
Documentation preview: https://vllm--29079.org.readthedocs.build/en/29079/ |
… errors
The default cudagraph configuration (FULL_AND_PIECEWISE) causes prefill instance startup errors. This change explicitly sets the cudagraph_mode to PIECEWISE for prefill servers in the disaggregated serving script.
Signed-off-by: Bofeng BF1 Xue <xuebf1@Lenovo.com>
Code Review
This pull request addresses a startup error in the disaggregated serving example script. The error occurs in prefill server instances due to the default cudagraph_mode of FULL_AND_PIECEWISE. The proposed change correctly resolves this issue by explicitly setting the cudagraph_mode to PIECEWISE for the prefill server, using the --compilation-config argument. This is a targeted and appropriate fix, as the PIECEWISE mode is better suited for the dynamic nature of prefill operations, thus avoiding the startup failures. The implementation is correct, and I find no issues with this change.
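For readers who want to see the shape of the setting described above, here is a minimal sketch of how the `--compilation-config` argument could be assembled for a prefill instance. It only builds the command line and does not start a server. The helper name `prefill_server_command`, the model name, and the port are hypothetical placeholders; the actual script in this PR passes additional options (such as the KV-transfer settings) that are not reproduced here.

```python
import json
import shlex


def prefill_server_command(model: str, port: int) -> list[str]:
    """Build an argv list for a prefill instance (hypothetical helper).

    The only part taken from this PR is the compilation config that pins
    cudagraph_mode to PIECEWISE instead of the FULL_AND_PIECEWISE default.
    """
    compilation_config = {"cudagraph_mode": "PIECEWISE"}
    return [
        "vllm", "serve", model,
        "--port", str(port),
        # The config is passed as a JSON string on the command line.
        "--compilation-config", json.dumps(compilation_config),
    ]


if __name__ == "__main__":
    # Placeholder model name and port, chosen only for illustration.
    cmd = prefill_server_command("my-org/my-model", 8100)
    # shlex.join quotes the JSON so the line can be pasted into a shell script.
    print(shlex.join(cmd))
```

Serializing the config with `json.dumps` and quoting it with `shlex.join` avoids hand-escaping the JSON inside a shell script, which is an easy place to introduce a quoting mistake.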
Force-pushed: 4b9cb82 to 97a4457
|
Can you please document the errors you are seeing? |
|
When the prefill instance runs to 'Capturing CUDA graphs (decode, FULL)', an error occurs: |
|
(potentially related: #27026) |
|
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |