Do not allow disabling chunked prefill for generation models #28833
22quinn wants to merge 1 commit into vllm-project:main from
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Pull Request Overview
This PR prevents users from disabling chunked prefill for generation models (except on restricted CPU architectures). Previously, PR #28665 accidentally allowed this configuration, which could cause issues since generation models require chunked prefill to function properly.
Key changes:
- Added explicit validation to prevent disabling chunked prefill for generation models on non-restricted platforms
- Refactored the CPU architecture restriction logic to validate settings before applying defaults
- Platform restrictions (ARM, POWER, S390X, RISC-V CPUs) take precedence over model requirements
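The key changes above can be sketched as a single resolution function. This is only an illustration under stated assumptions, not the code in vllm/engine/arg_utils.py: the function name, the `cpu_arch` parameter, the `RESTRICTED_CPU_ARCHS` set, and the default for non-generation runners are all hypothetical. Only the `runner_type == "generate"` check and the `is False` comparison come from the diff quoted in this PR.

```python
from typing import Optional

# Hypothetical stand-in for the restricted CPU architectures listed above.
RESTRICTED_CPU_ARCHS = {"arm", "power", "s390x", "riscv"}


def resolve_chunked_prefill(
    runner_type: str,
    enable_chunked_prefill: Optional[bool],
    cpu_arch: Optional[str] = None,
) -> bool:
    """Return the effective chunked prefill setting, or raise if invalid."""
    if cpu_arch in RESTRICTED_CPU_ARCHS:
        # Platform restrictions take precedence over model requirements,
        # so the feature is forced off (assumed behavior for this sketch).
        return False
    if runner_type == "generate":
        # None means "not set by the user"; only an explicit False is rejected.
        if enable_chunked_prefill is False:
            raise ValueError(
                "Chunked prefill cannot be disabled for generation models."
            )
        return True
    # Assumption: other runner types honor the user's choice, defaulting to off.
    return bool(enable_chunked_prefill)
```

For example, `resolve_chunked_prefill("generate", False)` raises `ValueError`, while the same call with `cpu_arch="arm"` returns `False` because the platform check runs first.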
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vllm/engine/arg_utils.py | Adds validation to enforce chunked prefill requirement for generation models and refactors restricted CPU handling logic |
| tests/engine/test_arg_utils.py | Adds comprehensive test coverage for restricted CPU behavior, generation model validation, and platform-specific settings |
    and model_config.runner_type == "generate"
    and self.enable_chunked_prefill is False
):
    raise ValueError("Chunked prefill is required for generation models. ")
The error message has a trailing space before the closing quote. This should be removed for consistency with other error messages.
Suggested change:
-    raise ValueError("Chunked prefill is required for generation models. ")
+    raise ValueError("Chunked prefill is required for generation models.")
I would suggest changing the message to "Chunked prefill cannot be disabled for generation models."
Why can't we disable chunked prefill for generation models?
It's never tested, but deferring to @WoosukKwon on whether it should be allowed.
Lines 702 to 737 in 60e089f
As far as I know, VllmRunner in CI defaults to disabling chunked_prefill. May I ask which scenarios cause errors when chunked prefill is disabled? I am refactoring the logic that determines whether to enable chunked prefill.
    self, usage_context: UsageContext, model_config: ModelConfig
) -> None:
    """Set Default Arguments for V1 Engine."""
    # Check if running on CPU architecture with feature restrictions
Why not check these after applying defaults?
njhill left a comment
I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config postinit, and the logic in the former will be missed in cases VllmConfig is created directly and not via args.
@DarkLight1337 wdyt?
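One way to picture the suggestion above is a top-level config whose post-init hook owns both the default imputation and the cross-field validation, so the rule applies even when the config is built directly rather than through argument parsing. The classes below are hypothetical stand-ins, not vLLM's `VllmConfig` or its sub-configs, and using `None` to mean "not set by the user" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulerSettings:
    # Hypothetical sub-config; None means "not set by the user".
    enable_chunked_prefill: Optional[bool] = None


@dataclass
class EngineSettings:
    # Hypothetical top-level config that owns rules spanning sub-configs.
    runner_type: str
    scheduler: SchedulerSettings

    def __post_init__(self) -> None:
        is_generate = self.runner_type == "generate"
        if self.scheduler.enable_chunked_prefill is None:
            # Impute the default from a field outside the sub-config.
            self.scheduler.enable_chunked_prefill = is_generate
        elif is_generate and not self.scheduler.enable_chunked_prefill:
            # Validation runs no matter how the config object was created.
            raise ValueError(
                "Chunked prefill cannot be disabled for generation models."
            )
```

Because the check lives in `__post_init__`, `EngineSettings("generate", SchedulerSettings(enable_chunked_prefill=False))` fails at construction time, with no dependence on the CLI argument path.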
    "Chunked prefill is not supported for %s; "
    "disabling it for V1 backend.",
shouldn't mention V1 anymore ...
Suggested change:
-    "Chunked prefill is not supported for %s; "
-    "disabling it for V1 backend.",
+    "Chunked prefill is not supported for %s "
+    "and will be disabled.",
    "Prefix caching is not supported for %s; "
    "disabling it for V1 backend.",
Suggested change:
-    "Prefix caching is not supported for %s; "
-    "disabling it for V1 backend.",
+    "Prefix caching is not supported for %s; "
+    "and will be disabled.",
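The reworded warnings could be emitted along the following lines. This is a minimal sketch using the standard `logging` module: the function name, the logger name, and the use of a plain dict for settings are hypothetical, and the message wording follows the reviewer's suggestion for chunked prefill.

```python
import logging

logger = logging.getLogger("chunked_prefill_sketch")


def disable_unsupported_features(cpu_arch_name: str, settings: dict) -> dict:
    """Return a copy of settings with features a restricted CPU lacks turned off."""
    updated = dict(settings)
    for key, label in (
        ("enable_chunked_prefill", "Chunked prefill"),
        ("enable_prefix_caching", "Prefix caching"),
    ):
        if updated.get(key) is not False:
            # Arguments are passed separately so that string formatting is
            # deferred until the record is actually emitted.
            logger.warning(
                "%s is not supported for %s and will be disabled.",
                label,
                cpu_arch_name,
            )
            updated[key] = False
    return updated
```

The function copies its input, so the caller's dict is left untouched and the warning fires only for features that were not already disabled.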
@noooop there won't be any errors, the setting just has no effect (you set it to disabled but it will still do prefill chunking). Hence it's better to fail since disabling in this case is essentially not supported.
It's a bit of a catch-22 situation, since we need to impute the default values before
I think in other cases, where there are sub-config parameters whose defaults depend on parts of the config outside of the same sub-config, we have those default to
I think we should decide on a standard/consistent way for how we handle this. Ideally I think it's better to move it all into the config classes rather than arg parsing for the reason mentioned above.
That would mean we also have to move the validation logic from the sub-configs into
vllm/vllm/v1/core/sched/scheduler.py Lines 500 to 508 in 0168f69
As far as I understand, disabling chunked prefill still works, although it's not exactly the same as V0.
@noooop apologies I missed that somehow and had an incorrect understanding. Then I guess I have the same question of why it should be disallowed to disable chunked prefill for generative models. Perhaps because the way that it's implemented means a long prefill request could get stuck indefinitely if there's a continual stream of smaller requests. We don't want that to be the default of course in any case, and I don't think it should be set by default in VllmRunner either.
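The starvation concern raised in the comment above can be illustrated with a toy model. It is not the vLLM scheduler and does not confirm that the real scheduler behaves this way; it only assumes a fixed per-step token budget, a constant number of tokens consumed each step by smaller competing requests, and a rule that an unchunked prefill is scheduled only when the whole prompt fits in the leftover budget. All names and numbers are hypothetical.

```python
from typing import Optional


def steps_to_finish_prefill(
    prompt_len: int,
    token_budget: int,
    competing_tokens: int,
    chunked: bool,
    max_steps: int = 1000,
) -> Optional[int]:
    """Return the step at which the long prefill completes, or None if it never does."""
    remaining = prompt_len
    for step in range(1, max_steps + 1):
        # Budget left after the smaller requests have taken their share.
        free = max(0, token_budget - competing_tokens)
        if chunked:
            # Chunked: make partial progress with whatever budget is left.
            remaining -= min(free, remaining)
        elif remaining <= free:
            # Unchunked: all or nothing, only runs if the whole prompt fits.
            remaining = 0
        if remaining == 0:
            return step
    return None
```

With a 4096-token prompt, a 2048-token budget, and 512 competing tokens per step, the chunked variant finishes in 3 steps, while the unchunked variant returns `None` because the prompt never fits in the 1536 leftover tokens.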
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Purpose
#28665 accidentally opened up the path to disable chunked prefill for generation models. This PR bans disabling chunked prefill unless it's one of the restricted CPUs.
Test Plan
pytest tests/engine/test_arg_utils.py
Test Result
pass