
Conversation

@mgoin
Member

@mgoin mgoin commented Oct 1, 2025

Purpose

FIX #25960

Test Plan

Test Result

vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 --load-format dummy --max-model-len 32K
...
... 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, ...

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


I think it might be cleaner to do: #26034 ?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds a check to disable full CUDA graphs for models with chunked attention. This is a good bugfix. However, I've identified a potential issue with the placement of this check. It's located within a conditional block that only executes if cudagraph_mode is not explicitly set by the user. This could lead to runtime errors if a user provides an incompatible configuration. I've suggested a refactoring to make this check unconditional, which would improve the robustness of the configuration.

Comment on lines 334 to 339

     if self.model_config is not None and \
         (self.model_config.pooler_config is not None
-         or self.model_config.is_encoder_decoder):
+         or self.model_config.is_encoder_decoder
+         or self.model_config.attention_chunk_size is not None):
         self.compilation_config.cudagraph_mode = \
             CUDAGraphMode.PIECEWISE

high

While this change correctly identifies another condition for disabling full CUDA graphs, the check is located inside a block that only executes if cudagraph_mode is not explicitly set by the user (if self.compilation_config.cudagraph_mode is None:).

This means if a user explicitly sets cudagraph_mode to FULL or FULL_AND_PIECEWISE for a model with pooling, an encoder-decoder architecture, or chunked attention, the setting will not be overridden, which can lead to runtime errors.

To make the configuration more robust, I suggest moving this check, along with the existing ones for pooler_config and is_encoder_decoder, outside of the if self.compilation_config.cudagraph_mode is None: block. This would ensure that incompatible settings are always corrected, regardless of whether they are user-provided or default. A similar override is already performed for enforce_eager a few lines below.

Consider refactoring this logic to apply these correctness checks unconditionally. For example:

# In vllm/config/vllm.py, __post_init__

# ... after setting default cudagraph_mode

# pooling models, encoder-decoder models, and models with
# chunked attention do not support full cudagraphs.
# This check overrides user settings for correctness.
is_incompatible = (
    self.model_config is not None and (
        self.model_config.pooler_config is not None
        or self.model_config.is_encoder_decoder
        or self.model_config.attention_chunk_size is not None
    )
)
if is_incompatible and self.compilation_config.cudagraph_mode.has_full_cudagraphs():
    logger.warning(
        "The model has features that are not compatible with "
        "full CUDAGraphs. Disabling full CUDAGraphs."
    )
    if self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        self.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
    else: # FULL or FULL_DECODE_ONLY
        self.compilation_config.cudagraph_mode = CUDAGraphMode.NONE
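The downgrade logic the bot proposes can be sketched in isolation. The enum below is a toy stand-in for vLLM's CUDAGraphMode (the member names mirror the upstream enum, but the values, the has_full_cudagraphs method, and the resolve_cudagraph_mode helper are illustrative, not the actual vLLM implementation):

```python
from enum import Enum

class CUDAGraphMode(Enum):
    # Toy stand-in; values are illustrative, not vLLM's real ones.
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

    def has_full_cudagraphs(self) -> bool:
        # Any mode that captures full-model CUDA graphs.
        return self in (CUDAGraphMode.FULL,
                        CUDAGraphMode.FULL_DECODE_ONLY,
                        CUDAGraphMode.FULL_AND_PIECEWISE)

def resolve_cudagraph_mode(mode: CUDAGraphMode,
                           incompatible: bool) -> CUDAGraphMode:
    """Apply the unconditional override the review suggests:
    a model flagged incompatible (pooling, encoder-decoder, or
    chunked attention) never keeps a full-cudagraph mode."""
    if incompatible and mode.has_full_cudagraphs():
        if mode is CUDAGraphMode.FULL_AND_PIECEWISE:
            # Keep the piecewise half, drop the full half.
            return CUDAGraphMode.PIECEWISE
        # FULL or FULL_DECODE_ONLY: disable CUDA graphs entirely.
        return CUDAGraphMode.NONE
    return mode

# A user-forced full mode on a chunked-attention model is
# downgraded; a compatible model keeps its setting.
print(resolve_cudagraph_mode(CUDAGraphMode.FULL_AND_PIECEWISE, True).name)  # PIECEWISE
print(resolve_cudagraph_mode(CUDAGraphMode.FULL, True).name)                # NONE
print(resolve_cudagraph_mode(CUDAGraphMode.FULL, False).name)               # FULL
```

The key design point is that the helper runs regardless of whether the mode came from a user flag or a default, which is what distinguishes it from the check inside the `if cudagraph_mode is None:` block that this PR modified.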

@mgoin mgoin closed this Oct 1, 2025

Labels

llama Related to Llama models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: llama 4 family is incompatible with CUDA graph FULL_AND_PIECEWISE mode

2 participants