[Core] Whisper enable FULL_DECODE_ONLY CudaGraph #30072
vllm-bot merged 8 commits into vllm-project:main
Conversation
```python
self.compilation_config.cudagraph_mode
in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE)
```
@ProExpertProg need your eyes on this bit
Code Review
This pull request introduces a significant performance enhancement for Whisper models by enabling CUDA graph support through a new FULL_DECODE_ONLY mode. The changes are well-implemented and logically sound. Key modifications include updating the configuration to handle the new CUDA graph mode for encoder-decoder models and adjusting the model runner to correctly distinguish between the initial encoder-involved step and subsequent decode-only steps. This ensures that CUDA graphs are only used for the decode steps, which is the correct approach for models like Whisper. The benchmarks provided demonstrate a substantial improvement in throughput and latency. Overall, this is an excellent contribution that improves the performance of encoder-decoder models in vLLM.
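For context, the gating the review describes — full CUDA graphs only for the decode steps — can be sketched as follows. The mode names appear in the PR; the dispatcher function itself is a simplified illustration, not vLLM's actual code:

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Mode names taken from the PR; this enum is a simplified stand-in.
    NONE = "none"
    PIECEWISE = "piecewise"
    FULL_DECODE_ONLY = "full_decode_only"
    FULL_AND_PIECEWISE = "full_and_piecewise"


def use_full_cudagraph(mode: CUDAGraphMode, uniform_decode: bool) -> bool:
    """Illustrative only: replay a full CUDA graph solely for uniform decode
    batches, and only when the configured mode permits full graphs for them."""
    if mode in (CUDAGraphMode.FULL_DECODE_ONLY, CUDAGraphMode.FULL_AND_PIECEWISE):
        return uniform_decode
    return False
```

Under this sketch, `FULL_DECODE_ONLY` never captures the mixed prefill/encoder step, which matches the review's point that graphs should cover only the decode steps.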
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for reviewing @ProExpertProg @LucasWilkinson, I hope I've addressed your comments.
ProExpertProg left a comment
Looks good apart from 2 nits!
```diff
@@ -145,7 +145,7 @@ def dispatch(
     num_tokens: int,
     uniform_decode: bool,
     has_lora: bool,
-    use_cascade_attn: bool = False,
+    piecewise_or_eager_only: bool = False,
```
skip_attention or attention_cg_unsupported?
This is as per @LucasWilkinson's suggestion.
Actually, I went with disable_full in #30173 to help with line-width overflows; we can resolve the conflicts depending on which lands first. (I think we should go with disable_full since it's a bit more terse.)
```diff
                 "Overriding cudagraph_mode to PIECEWISE."
             )
             self.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
-        elif self.model_config.is_encoder_decoder:
-            logger.warning_once(
-                "Encoder-decoder models do not support full cudagraphs. "
-                "Overriding cudagraph_mode to PIECEWISE."
+        elif (
+            self.model_config.is_encoder_decoder
+            and self.compilation_config.cudagraph_mode
+            not in (CUDAGraphMode.NONE, CUDAGraphMode.FULL_DECODE_ONLY)
+        ):
```
if somebody sets mode=piecewise, we don't handle it here, but we should:
```python
# check support based on model type
if self.model_config is not None:
    if (
        self.model_config.pooler_config is not None
        and self.compilation_config.cudagraph_mode.has_full_cudagraphs()
    ):
        ...
    elif (
        self.model_config.is_encoder_decoder
        and self.compilation_config.cudagraph_mode
        not in (CUDAGraphMode.NONE, CUDAGraphMode.FULL_DECODE_ONLY)
    ):
```
Right! Done:
```
vllm serve openai/whisper-large-v3-turbo -cc.cudagraph_mode=PIECEWISE
...
INFO 12-09 11:15:02 [vllm.py:690] Encoder-decoder models do not support PIECEWISE. Overriding cudagraph_mode to FULL_DECODE_ONLY.
```
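A minimal, hypothetical sketch of this override behavior (function and constant names are made up; only the warning text mirrors the log above, and modes are plain strings for brevity):

```python
# Hypothetical sketch, not vLLM's implementation: encoder-decoder models accept
# only NONE or FULL_DECODE_ONLY; any other requested mode is overridden.
SUPPORTED_ENCDEC_MODES = {"NONE", "FULL_DECODE_ONLY"}


def resolve_encdec_cudagraph_mode(requested: str) -> str:
    if requested not in SUPPORTED_ENCDEC_MODES:
        print(
            f"Encoder-decoder models do not support {requested}. "
            "Overriding cudagraph_mode to FULL_DECODE_ONLY."
        )
        return "FULL_DECODE_ONLY"
    return requested
```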
LucasWilkinson left a comment
LGTM; thanks for doing this!
```diff
@@ -102,6 +102,7 @@ def run_test(
         max_model_len=448,
         tensor_parallel_size=tensor_parallel_size,
         distributed_executor_backend=distributed_executor_backend,
+        enforce_eager=True,
```
Could you explain a bit why the test needs to set this explicitly?
@DarkLight1337 There's a subtle difference in output which makes the test fail.
```diff
- And the 0-1 pitch on the way to Edgar Martinez. Swung on the line down the left field line for a base hit. Here comes Joy. Here is Junior to third base. They're going to wave him in. The throw to the plate will be late. The Mariners are going to play for the American League Championship. I don't believe it. It just continues. My, oh, my.
+ And the 0-1 pitch on the way to Edgar Martinez. Swung on the line down the left field line for a base hit. Here comes Joy. Here is Junior to third base. They're going to wave him in. The throw to the plate will be late. The Mariners are going to play for the American League Championship. I don't believe it. It just continues. My, oh, my God.
```
But tbh I think checking a copy-pasted output token by token is too strict.
We can restructure the test to use some new invariant-based techniques, or just relax it in a separate PR.
Yeah we can check logprobs instead. For now can you add a code comment with a TODO to fix this later?
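Relaxing the check to compare logprobs could look like this sketch (the helper name and tolerance are invented for illustration; this is not the test's actual code):

```python
import math


def logprobs_close(ref_logprobs, test_logprobs, atol=0.25):
    """Token-wise logprob comparison with a tolerance, instead of exact text
    matching. Both inputs are lists of per-token logprobs; atol is illustrative.
    """
    if len(ref_logprobs) != len(test_logprobs):
        return False
    return all(
        math.isclose(a, b, abs_tol=atol)
        for a, b in zip(ref_logprobs, test_logprobs)
    )
```

A check like this would tolerate the "my." vs "my God." divergence above as long as the per-token probabilities stay close, which is usually the invariant we actually care about.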
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Overview
Whisper currently only supports the PIECEWISE cudagraph mode, but it does NOT support torch.compile, so in practice it ends up running in eager mode (see the profiler screenshot below, filtered with a `cudagraph launch` query).
This PR addresses another important performance limitation by adding CUDA graph support through the FULL_DECODE_ONLY mode. This guarantees that all decode steps after the first one will just replay the captured graph.
Mind that the first decode step actually branches in control flow, due to `encoder_outputs` being present to populate the cross-attention KV cache. Hence we only run the steps that follow under CUDA graphs.
New profiler screenshot:

Full explicit command:
New default:
Benchmark
Pre
Post
cc @ProExpertProg @LucasWilkinson @robertgshaw2-redhat