
Conversation

@mgoin
Member

@mgoin mgoin commented Oct 1, 2025

Purpose

FIX #25960

Test Plan

Test Result

vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 --load-format dummy --max-model-len 32K
...
... 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, ...

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


I think it might be cleaner to do: #26034 ?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds a check to disable full CUDA graphs for models with chunked attention. This is a good bugfix. However, I've identified a potential issue with the placement of this check. It's located within a conditional block that only executes if cudagraph_mode is not explicitly set by the user. This could lead to runtime errors if a user provides an incompatible configuration. I've suggested a refactoring to make this check unconditional, which would improve the robustness of the configuration.

Comment on lines 334 to 339

     if self.model_config is not None and \
         (self.model_config.pooler_config is not None
-         or self.model_config.is_encoder_decoder):
+         or self.model_config.is_encoder_decoder
+         or self.model_config.attention_chunk_size is not None):
         self.compilation_config.cudagraph_mode = \
             CUDAGraphMode.PIECEWISE

high

While this change correctly identifies another condition for disabling full CUDA graphs, the check is located inside a block that only executes if cudagraph_mode is not explicitly set by the user (if self.compilation_config.cudagraph_mode is None:).

This means if a user explicitly sets cudagraph_mode to FULL or FULL_AND_PIECEWISE for a model with pooling, an encoder-decoder architecture, or chunked attention, the setting will not be overridden, which can lead to runtime errors.

To make the configuration more robust, I suggest moving this check, along with the existing ones for pooler_config and is_encoder_decoder, outside of the if self.compilation_config.cudagraph_mode is None: block. This would ensure that incompatible settings are always corrected, regardless of whether they are user-provided or default. A similar override is already performed for enforce_eager a few lines below.

Consider refactoring this logic to apply these correctness checks unconditionally. For example:

# In vllm/config/vllm.py, __post_init__

# ... after setting default cudagraph_mode

# pooling models, encoder-decoder models, and models with
# chunked attention do not support full cudagraphs.
# This check overrides user settings for correctness.
is_incompatible = (
    self.model_config is not None and (
        self.model_config.pooler_config is not None
        or self.model_config.is_encoder_decoder
        or self.model_config.attention_chunk_size is not None
    )
)
if is_incompatible and self.compilation_config.cudagraph_mode.has_full_cudagraphs():
    logger.warning(
        "The model has features that are not compatible with "
        "full CUDAGraphs. Disabling full CUDAGraphs."
    )
    if self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        self.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
    else: # FULL or FULL_DECODE_ONLY
        self.compilation_config.cudagraph_mode = CUDAGraphMode.NONE
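The downgrade logic the bot proposes can be sketched in isolation. The enum below is a toy stand-in for vLLM's CUDAGraphMode (the member names mirror the upstream enum, but the values, the has_full_cudagraphs method, and the resolve_cudagraph_mode helper are illustrative, not the actual vLLM implementation):

```python
from enum import Enum

class CUDAGraphMode(Enum):
    # Toy stand-in; values are illustrative, not vLLM's real ones.
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

    def has_full_cudagraphs(self) -> bool:
        # Any mode that captures full-model CUDA graphs.
        return self in (CUDAGraphMode.FULL,
                        CUDAGraphMode.FULL_DECODE_ONLY,
                        CUDAGraphMode.FULL_AND_PIECEWISE)

def resolve_cudagraph_mode(mode: CUDAGraphMode,
                           incompatible: bool) -> CUDAGraphMode:
    """Apply the unconditional override the review suggests:
    a model flagged incompatible (pooling, encoder-decoder, or
    chunked attention) never keeps a full-cudagraph mode."""
    if incompatible and mode.has_full_cudagraphs():
        if mode is CUDAGraphMode.FULL_AND_PIECEWISE:
            # Keep the piecewise half, drop the full half.
            return CUDAGraphMode.PIECEWISE
        # FULL or FULL_DECODE_ONLY: disable CUDA graphs entirely.
        return CUDAGraphMode.NONE
    return mode

# A user-forced full mode on a chunked-attention model is
# downgraded; a compatible model keeps its setting.
print(resolve_cudagraph_mode(CUDAGraphMode.FULL_AND_PIECEWISE, True).name)  # PIECEWISE
print(resolve_cudagraph_mode(CUDAGraphMode.FULL, True).name)                # NONE
print(resolve_cudagraph_mode(CUDAGraphMode.FULL, False).name)               # FULL
```

The key design point is that the helper runs regardless of whether the mode came from a user flag or a default, which is what distinguishes it from the check inside the `if cudagraph_mode is None:` block that this PR modified.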

@mgoin mgoin closed this Oct 1, 2025

Labels

llama Related to Llama models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: llama 4 family is incompatible with CUDA graph FULL_AND_PIECEWISE mode

2 participants