[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE #25444
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request changes the default CUDA graph mode for the v1 engine from PIECEWISE to FULL_AND_PIECEWISE when using piecewise compilation. This is a performance optimization, as FULL_AND_PIECEWISE can leverage full CUDA graphs for decode steps, which is often more efficient. The change is implemented by updating the default-setting logic in VllmConfig.__post_init__. Correspondingly, docstrings in CompilationConfig have been updated to reflect that FULL_AND_PIECEWISE is now the default mode. The changes are logical, self-contained, and appear to be correct. I have not identified any critical or high-severity issues.
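The default-selection behavior described in the review above can be sketched as follows. This is an illustrative, simplified sketch, not the actual vLLM implementation: the helper name `resolve_default_mode` and its signature are hypothetical, though the `CUDAGraphMode` values mirror the ones named in this PR.

```python
from enum import Enum
from typing import Optional


class CUDAGraphMode(Enum):
    """Modes named in this PR; values are illustrative."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_AND_PIECEWISE = 3


def resolve_default_mode(use_piecewise_compilation: bool,
                         user_mode: Optional[CUDAGraphMode]) -> CUDAGraphMode:
    # An explicit user setting (e.g. -O.cudagraph_mode=FULL) always wins.
    if user_mode is not None:
        return user_mode
    # With piecewise compilation, this PR changes the default from
    # PIECEWISE to FULL_AND_PIECEWISE so decode steps can use full graphs.
    if use_piecewise_compilation:
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return CUDAGraphMode.NONE
```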
Please wait on this, the time to capture the
I was just reading up on this and tried out
LucasWilkinson
left a comment
Overall LGTM, thanks! Left one nit.
…ISE (vllm-project#25444) Signed-off-by: mgoin <mgoin64@gmail.com>
…ISE (#25444) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
This PR proposes enabling full cudagraphs by default in vLLM V1. Support for cudagraphs beyond piecewise-only was added over the past months (notably #20059), and while the startup time increase is measurable, we believe the performance gain from full cudagraphs is worth it, especially for low-latency serving of small models or MoEs.
For instance, running Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 on 1xH100 with vLLM for 10 requests of 1024 input and 128 output tokens results in the following throughputs:
- `vllm serve` (default): 4.07 req/s
- `vllm serve --async-scheduling`: 4.29 req/s
- `vllm serve -O.cudagraph_mode=FULL`: 5.83 req/s
- `vllm serve --async-scheduling -O.cudagraph_mode=FULL`: 6.03 req/s
- `vllm serve -O.cudagraph_mode=FULL_AND_PIECEWISE`: 5.98 req/s
- `vllm serve --async-scheduling -O.cudagraph_mode=FULL_AND_PIECEWISE`: 6.20 req/s

Benchmark command:

```
vllm bench serve --model Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --port 8000 --num-prompts 10
```
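For context, the relative speedups implied by the throughput figures above can be computed directly. This is a quick illustrative calculation using the req/s numbers reported in this PR description:

```python
# Throughputs (req/s) reported above for Qwen3-Coder-30B-A3B-Instruct-FP8 on 1xH100.
baseline = 4.07  # vllm serve with the pre-PR default (PIECEWISE)
results = {
    "async-scheduling": 4.29,
    "FULL": 5.83,
    "async + FULL": 6.03,
    "FULL_AND_PIECEWISE": 5.98,
    "async + FULL_AND_PIECEWISE": 6.20,
}

# Speedup of each configuration relative to the old default.
speedups = {name: round(rps / baseline, 2) for name, rps in results.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x over default")
```

The new default (`FULL_AND_PIECEWISE`) alone accounts for most of the gain; async scheduling adds a further small improvement on top.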
Startup time impact:
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.