[Bugfix] Don't use FULL_AND_PIECEWISE cudagraph for Llama4 #26033
Conversation
Signed-off-by: mgoin <[email protected]>
LucasWilkinson left a comment:
I think it might be cleaner to do: #26034 ?
Code Review
This pull request correctly adds a check to disable full CUDA graphs for models with chunked attention. This is a good bugfix. However, I've identified a potential issue with the placement of this check. It's located within a conditional block that only executes if cudagraph_mode is not explicitly set by the user. This could lead to runtime errors if a user provides an incompatible configuration. I've suggested a refactoring to make this check unconditional, which would improve the robustness of the configuration.
```diff
 if self.model_config is not None and \
     (self.model_config.pooler_config is not None
-     or self.model_config.is_encoder_decoder):
+     or self.model_config.is_encoder_decoder
+     or self.model_config.attention_chunk_size is not None):
     self.compilation_config.cudagraph_mode = \
         CUDAGraphMode.PIECEWISE
```
While this change correctly identifies another condition for disabling full CUDA graphs, the check is located inside a block that only executes if cudagraph_mode is not explicitly set by the user (if self.compilation_config.cudagraph_mode is None:).
This means if a user explicitly sets cudagraph_mode to FULL or FULL_AND_PIECEWISE for a model with pooling, an encoder-decoder architecture, or chunked attention, the setting will not be overridden, which can lead to runtime errors.
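The bypass described above can be illustrated with a minimal, self-contained sketch. The enum and the `resolve_mode` helper below are simplified stand-ins for vLLM's actual config classes, not its real API; they only mirror the placement of the check inside the `cudagraph_mode is None` branch:

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Simplified stand-in for vLLM's CUDAGraphMode enum.
    NONE = 0
    PIECEWISE = 1
    FULL_AND_PIECEWISE = 2


def resolve_mode(user_mode, has_chunked_attention):
    """Mirror the PR's placement: the chunked-attention check only
    runs when the user did not set a cudagraph mode explicitly."""
    mode = user_mode
    if mode is None:
        if has_chunked_attention:
            mode = CUDAGraphMode.PIECEWISE
        else:
            mode = CUDAGraphMode.FULL_AND_PIECEWISE
    # An explicit user choice is returned untouched, even if incompatible.
    return mode


# Default config for a chunked-attention model is corrected...
print(resolve_mode(None, True))  # -> CUDAGraphMode.PIECEWISE
# ...but an explicit incompatible setting slips through unchanged.
print(resolve_mode(CUDAGraphMode.FULL_AND_PIECEWISE, True))
```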
To make the configuration more robust, I suggest moving this check, along with the existing ones for pooler_config and is_encoder_decoder, outside of the if self.compilation_config.cudagraph_mode is None: block. This would ensure that incompatible settings are always corrected, regardless of whether they are user-provided or default. A similar override is already performed for enforce_eager a few lines below.
Consider refactoring this logic to apply these correctness checks unconditionally. For example:
```python
# In vllm/config/vllm.py, __post_init__
# ... after setting default cudagraph_mode

# Pooling models, encoder-decoder models, and models with
# chunked attention do not support full cudagraphs.
# This check overrides user settings for correctness.
is_incompatible = (
    self.model_config is not None and (
        self.model_config.pooler_config is not None
        or self.model_config.is_encoder_decoder
        or self.model_config.attention_chunk_size is not None
    )
)
if is_incompatible and \
        self.compilation_config.cudagraph_mode.has_full_cudagraphs():
    logger.warning(
        "The model has features that are not compatible with "
        "full CUDAGraphs. Disabling full CUDAGraphs.")
    if self.compilation_config.cudagraph_mode == \
            CUDAGraphMode.FULL_AND_PIECEWISE:
        self.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
    else:  # FULL or FULL_DECODE_ONLY
        self.compilation_config.cudagraph_mode = CUDAGraphMode.NONE
```
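The demotion rule in that suggestion (FULL_AND_PIECEWISE falls back to PIECEWISE, while FULL or FULL_DECODE_ONLY fall back to NONE) can be exercised in isolation. The toy enum below is an assumption for illustration; `has_full_cudagraphs` here is a local stand-in, not vLLM's actual method:

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Toy stand-in covering the modes discussed in the review.
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

    def has_full_cudagraphs(self):
        return self in (CUDAGraphMode.FULL,
                        CUDAGraphMode.FULL_DECODE_ONLY,
                        CUDAGraphMode.FULL_AND_PIECEWISE)


def demote(mode):
    """Demote a mode that uses full cudagraphs, per the suggested rule."""
    if not mode.has_full_cudagraphs():
        return mode  # already compatible, leave untouched
    if mode is CUDAGraphMode.FULL_AND_PIECEWISE:
        return CUDAGraphMode.PIECEWISE  # keep the piecewise half
    return CUDAGraphMode.NONE  # FULL or FULL_DECODE_ONLY


print(demote(CUDAGraphMode.FULL_AND_PIECEWISE))  # -> CUDAGraphMode.PIECEWISE
print(demote(CUDAGraphMode.FULL))                # -> CUDAGraphMode.NONE
```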
Purpose
FIX #25960
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.