7 changes: 4 additions & 3 deletions vllm/config/vllm.py
@@ -329,11 +329,12 @@ def __post_init__(self):
             self.compilation_config.cudagraph_mode = \
                 CUDAGraphMode.FULL_AND_PIECEWISE

-        # pooling models and encoder-decoder models
-        # do not support full cudagraphs
+        # pooling models, encoder-decoder models, and models with
+        # chunked attention do not support full cudagraphs
         if self.model_config is not None and \
             (self.model_config.pooler_config is not None
-             or self.model_config.is_encoder_decoder):
+             or self.model_config.is_encoder_decoder
+             or self.model_config.attention_chunk_size is not None):
             self.compilation_config.cudagraph_mode = \
                 CUDAGraphMode.PIECEWISE
Comment on lines 334 to 339
Severity: high

While this change correctly identifies another condition for disabling full CUDA graphs, the check is located inside a block that only executes if cudagraph_mode is not explicitly set by the user (if self.compilation_config.cudagraph_mode is None:).

This means if a user explicitly sets cudagraph_mode to FULL or FULL_AND_PIECEWISE for a model with pooling, an encoder-decoder architecture, or chunked attention, the setting will not be overridden, which can lead to runtime errors.
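The gap can be sketched with a minimal, self-contained stand-in (a simplified enum and resolver function, not vLLM's actual classes) that mirrors the current control flow, where the compatibility check only runs when no mode was set:

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Simplified stand-in for vLLM's CUDAGraphMode enum.
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_AND_PIECEWISE = 3


def resolve_mode(user_mode, has_chunked_attention):
    """Mirrors the current __post_init__ structure: the chunked-attention
    check is nested inside the 'mode is None' default-selection branch,
    so it never sees an explicitly configured mode."""
    mode = user_mode
    if mode is None:
        mode = CUDAGraphMode.FULL_AND_PIECEWISE
        if has_chunked_attention:
            # Downgrade only happens on this default path.
            mode = CUDAGraphMode.PIECEWISE
    return mode


# Default config is correctly downgraded for chunked attention.
assert resolve_mode(None, True) is CUDAGraphMode.PIECEWISE
# An explicit FULL setting sails through unchecked, which is the bug:
# the incompatible mode survives and can fail later at runtime.
assert resolve_mode(CUDAGraphMode.FULL, True) is CUDAGraphMode.FULL
```

Hoisting the compatibility check out of the `None` branch, as suggested below, makes both paths converge on a safe mode.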

To make the configuration more robust, I suggest moving this check, along with the existing ones for pooler_config and is_encoder_decoder, outside of the if self.compilation_config.cudagraph_mode is None: block. This would ensure that incompatible settings are always corrected, regardless of whether they are user-provided or default. A similar override is already performed for enforce_eager a few lines below.

Consider refactoring this logic to apply these correctness checks unconditionally. For example:

# In vllm/config/vllm.py, __post_init__

# ... after setting default cudagraph_mode

# pooling models, encoder-decoder models, and models with
# chunked attention do not support full cudagraphs.
# This check overrides user settings for correctness.
is_incompatible = (
    self.model_config is not None and (
        self.model_config.pooler_config is not None
        or self.model_config.is_encoder_decoder
        or self.model_config.attention_chunk_size is not None
    )
)
if is_incompatible and self.compilation_config.cudagraph_mode.has_full_cudagraphs():
    logger.warning(
        "The model has features that are not compatible with "
        "full CUDAGraphs. Disabling full CUDAGraphs."
    )
    if self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        self.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
    else: # FULL or FULL_DECODE_ONLY
        self.compilation_config.cudagraph_mode = CUDAGraphMode.NONE
