[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE #25444
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request changes the default CUDA graph mode for the v1 engine from PIECEWISE to FULL_AND_PIECEWISE when using piecewise compilation. This is a performance optimization, as FULL_AND_PIECEWISE can leverage full CUDA graphs for decode steps, which is often more efficient. The change is implemented by updating the default-setting logic in VllmConfig.__post_init__. Correspondingly, docstrings in CompilationConfig have been updated to reflect that FULL_AND_PIECEWISE is now the default mode. The changes are logical, self-contained, and appear to be correct. I have not identified any critical or high-severity issues.
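The default-selection behavior described in the review above can be sketched as follows. This is an illustrative, simplified sketch, not the actual vLLM implementation: the helper name `resolve_default_mode` and its signature are hypothetical, though the `CUDAGraphMode` values mirror the ones named in this PR.

```python
from enum import Enum
from typing import Optional


class CUDAGraphMode(Enum):
    """Modes named in this PR; values are illustrative."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_AND_PIECEWISE = 3


def resolve_default_mode(use_piecewise_compilation: bool,
                         user_mode: Optional[CUDAGraphMode]) -> CUDAGraphMode:
    # An explicit user setting (e.g. -O.cudagraph_mode=FULL) always wins.
    if user_mode is not None:
        return user_mode
    # With piecewise compilation, this PR changes the default from
    # PIECEWISE to FULL_AND_PIECEWISE so decode steps can use full graphs.
    if use_piecewise_compilation:
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return CUDAGraphMode.NONE
```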
Please wait on this, the time to capture the
I was just reading up on this and tried out
LucasWilkinson
left a comment
Overall LGTM, thanks! Left one nit.
…ISE (vllm-project#25444) Signed-off-by: mgoin <mgoin64@gmail.com>
…ISE (#25444) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
This PR proposes enabling full cudagraphs by default in vLLM V1. Support for cudagraphs beyond piecewise-only was added over the past months (notably #20059), and while the startup time increase is measurable, we believe the performance gain from full cudagraphs is worth it, especially for low-latency serving of small models or MoEs.
For instance, running Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 on 1xH100 with vLLM for 10 requests of 1024 input and 128 output tokens results in the following throughputs:
- `vllm serve` (default): 4.07 req/s
- `vllm serve --async-scheduling`: 4.29 req/s
- `vllm serve -O.cudagraph_mode=FULL`: 5.83 req/s
- `vllm serve --async-scheduling -O.cudagraph_mode=FULL`: 6.03 req/s
- `vllm serve -O.cudagraph_mode=FULL_AND_PIECEWISE`: 5.98 req/s
- `vllm serve --async-scheduling -O.cudagraph_mode=FULL_AND_PIECEWISE`: 6.20 req/s

Benchmark command:

```
vllm bench serve --model Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --port 8000 --num-prompts 10
```
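For context, the relative speedups implied by the throughput figures above can be computed directly. This is a quick illustrative calculation using the req/s numbers reported in this PR description:

```python
# Throughputs (req/s) reported above for Qwen3-Coder-30B-A3B-Instruct-FP8 on 1xH100.
baseline = 4.07  # vllm serve with the pre-PR default (PIECEWISE)
results = {
    "async-scheduling": 4.29,
    "FULL": 5.83,
    "async + FULL": 6.03,
    "FULL_AND_PIECEWISE": 5.98,
    "async + FULL_AND_PIECEWISE": 6.20,
}

# Speedup of each configuration relative to the old default.
speedups = {name: round(rps / baseline, 2) for name, rps in results.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x over default")
```

The new default (`FULL_AND_PIECEWISE`) alone accounts for most of the gain; async scheduling adds a further small improvement on top.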
Startup time impact:
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.