
Do not allow disabling chunked prefill for generation models#28833

Closed
22quinn wants to merge 1 commit into vllm-project:main from 22quinn:chunked-prefill

Conversation

@22quinn
Collaborator

@22quinn 22quinn commented Nov 17, 2025

Purpose

#28665 accidentally opened up a path to disable chunked prefill for generation models. This PR disallows disabling chunked prefill unless running on one of the restricted CPU architectures.

Test Plan

pytest tests/engine/test_arg_utils.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull Request Overview

This PR prevents users from disabling chunked prefill for generation models (except on restricted CPU architectures). Previously, PR #28665 accidentally allowed this configuration, which could cause issues since generation models require chunked prefill to function properly.

Key changes:

  • Added explicit validation to prevent disabling chunked prefill for generation models on non-restricted platforms
  • Refactored the CPU architecture restriction logic to validate settings before applying defaults
  • Platform restrictions (ARM, POWER, S390X, RISC-V CPUs) take precedence over model requirements

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • vllm/engine/arg_utils.py: Adds validation to enforce the chunked prefill requirement for generation models and refactors the restricted CPU handling logic
  • tests/engine/test_arg_utils.py: Adds comprehensive test coverage for restricted CPU behavior, generation model validation, and platform-specific settings


    and model_config.runner_type == "generate"
    and self.enable_chunked_prefill is False
):
    raise ValueError("Chunked prefill is required for generation models. ")

Copilot AI Nov 17, 2025


The error message has a trailing space before the closing quote. This should be removed for consistency with other error messages.

Suggested change
raise ValueError("Chunked prefill is required for generation models. ")
raise ValueError("Chunked prefill is required for generation models.")

Member


I would suggest changing the message to "Chunked prefill cannot be disabled for generation models."

@noooop
Collaborator

noooop commented Nov 17, 2025

Why can't we disable chunked prefill for generation models?

@22quinn
Collaborator Author

22quinn commented Nov 17, 2025

Why can't we disable chunked prefill for generation models?

It's never been tested, but I'll defer to @WoosukKwon on whether it should be allowed.

@noooop
Collaborator

noooop commented Nov 17, 2025

Why can't we disable chunked prefill for generation models?

It's never been tested, but I'll defer to @WoosukKwon on whether it should be allowed.

vllm/tests/conftest.py

Lines 702 to 737 in 60e089f

class VllmRunner:
    """
    The default value of some arguments have been modified from
    {class}`~vllm.LLM` as follows:
    - `trust_remote_code`: Set to `True` instead of `False` for convenience.
    - `seed`: Set to `0` instead of `None` for test reproducibility.
    - `max_model_len`: Set to `1024` instead of `None` to reduce memory usage.
    - `block_size`: To reduce memory usage, set default to `64` if on XPU
      devices, otherwise default to `16`.
    - `enable_chunked_prefill`: Set to `False` instead of `None` for
      test reproducibility.
    - `enforce_eager`: Set to `False` to test CUDA graph.
    """

    def __init__(
        self,
        model_name: str,
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer_name: str | None = None,
        tokenizer_mode: str = "auto",
        trust_remote_code: bool = True,
        seed: int | None = 0,
        max_model_len: int | None = 1024,
        dtype: str = "auto",
        disable_log_stats: bool = True,
        tensor_parallel_size: int = 1,
        block_size: int = 16 if not torch.xpu.is_available() else 64,
        enable_chunked_prefill: bool | None = False,
        swap_space: int = 4,
        enforce_eager: bool | None = False,
        # Set this to avoid hanging issue
        default_torch_num_threads: int | None = None,
        **kwargs,
    ) -> None:

As far as I know, VllmRunner in ci defaults to disable chunked_prefill.

May I ask which scenarios will cause errors when chunked prefill is disabled?


I am refactoring the logic to determine whether to enable chunked prefill

    self, usage_context: UsageContext, model_config: ModelConfig
) -> None:
    """Set Default Arguments for V1 Engine."""
    # Check if running on CPU architecture with feature restrictions
Member


Why not check these after applying defaults?

Member

@njhill njhill left a comment


I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config post-init, and the logic in the former will be missed in cases where VllmConfig is created directly and not via args.

@DarkLight1337 wdyt?

    and model_config.runner_type == "generate"
    and self.enable_chunked_prefill is False
):
    raise ValueError("Chunked prefill is required for generation models. ")
Member


I would suggest changing the message to "Chunked prefill cannot be disabled for generation models."

Comment on lines +1961 to +1962
"Chunked prefill is not supported for %s; "
"disabling it for V1 backend.",
Member


shouldn't mention V1 anymore ...

Suggested change
"Chunked prefill is not supported for %s; "
"disabling it for V1 backend.",
"Chunked prefill is not supported for %s "
"and will be disabled.",

Comment on lines +1969 to +1970
"Prefix caching is not supported for %s; "
"disabling it for V1 backend.",
Member


Suggested change
"Prefix caching is not supported for %s; "
"disabling it for V1 backend.",
"Prefix caching is not supported for %s; "
"and will be disabled.",

@njhill
Member

njhill commented Nov 17, 2025

May I ask which scenarios will cause errors when chunked prefill is disabled?

@noooop there won't be any errors, the setting just has no effect (you set it to disabled but it will still do prefill chunking).

Hence it's better to fail since disabling in this case is essentially not supported.

@DarkLight1337
Member

DarkLight1337 commented Nov 17, 2025

I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config post-init, and the logic in the former will be missed in cases where VllmConfig is created directly and not via args.

It's a bit of a catch-22 situation, since we need to impute the default values before VllmConfig is created. Otherwise, the sub-config static defaults are applied which do not match the defaults in EngineArgs (which are dynamically set).

@njhill
Member

njhill commented Nov 17, 2025

I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config post-init, and the logic in the former will be missed in cases where VllmConfig is created directly and not via args.

It's a bit of a catch-22 situation, since we need to impute the default values before VllmConfig is created. Otherwise, the sub-config static defaults are applied which do not match the defaults in EngineArgs (which are dynamically set).

I think in other cases, where there are sub-config parameters whose default depend on parts of the config outside of the same sub-config, we have those default to None and resolve the value in VllmConfig post-init. Could we do that in all such cases?

I think we should decide on a standard/consistent way for how we handle this. Ideally I think it's better to move it all into the config classes rather than arg parsing for the reason mentioned above.

@DarkLight1337
Member

DarkLight1337 commented Nov 18, 2025

I think in other cases, where there are sub-config parameters whose default depend on parts of the config outside of the same sub-config, we have those default to None and resolve the value in VllmConfig post-init. Could we do that in all such cases?

That would mean we also have to move the validation logic from the sub-configs into VllmConfig. (e.g. for validating that max_num_batched_tokens is set correctly)

@noooop
Collaborator

noooop commented Nov 18, 2025

May I ask which scenarios will cause errors when chunked prefill is disabled?

@noooop there won't be any errors, the setting just has no effect (you set it to disabled but it will still do prefill chunking).

Hence it's better to fail since disabling in this case is essentially not supported.

# chunked prefill has to be enabled explicitly to allow
# pooling requests to be chunked
if (
    not self.scheduler_config.enable_chunked_prefill
    and num_new_tokens > token_budget
):
    self.waiting.pop_request()
    skipped_waiting_requests.prepend_request(request)
    continue

As far as I understand, disabling chunked prefill still works, although it's not exactly the same as V0.

@njhill
Member

njhill commented Nov 18, 2025

@noooop apologies, I missed that somehow and had an incorrect understanding. Then I guess I have the same question of why disabling chunked prefill should be disallowed for generative models.

Perhaps because the way that it's implemented means a long prefill request could get stuck indefinitely if there's a continual stream of smaller requests.

We don't want that to be the default of course in any case, and I don't think it should be set by default in VllmRunner either.
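The starvation concern above can be illustrated with a toy loop (entirely hypothetical numbers; the real scheduler is far more involved than this):

```python
def fits(prefill_tokens: int, budget_left: int, chunked_prefill: bool) -> bool:
    """With chunked prefill off, a prefill only runs if it fits whole."""
    return chunked_prefill or prefill_tokens <= budget_left


token_budget = 1024
long_prefill = 2048  # larger than the per-step budget
stuck_steps = 0
for _ in range(10):
    # Each step, two 512-token requests arrive and claim the budget first.
    budget_left = token_budget - 2 * 512
    if not fits(long_prefill, budget_left, chunked_prefill=False):
        stuck_steps += 1  # the long request is skipped yet again
print(stuck_steps)  # prints 10: skipped every step while small requests keep coming
```

With `chunked_prefill=True` the long request could always make partial progress, which is the behavior the PR wants to guarantee for generation models.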

@mergify

mergify bot commented Nov 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @22quinn.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale (Over 90 days of inactivity) label on Feb 19, 2026
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

github-actions bot closed this on Mar 21, 2026

Labels

needs-rebase, stale (Over 90 days of inactivity)

5 participants