
Do not allow disabling chunked prefill for generation models#28833

Closed
22quinn wants to merge 1 commit into vllm-project:main from 22quinn:chunked-prefill

Conversation

@22quinn
Collaborator

@22quinn 22quinn commented Nov 17, 2025

Purpose

#28665 accidentally opened up a path to disable chunked prefill for generation models. This PR disallows disabling chunked prefill unless running on one of the restricted CPU architectures.

Test Plan

pytest tests/engine/test_arg_utils.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull Request Overview

This PR prevents users from disabling chunked prefill for generation models (except on restricted CPU architectures). Previously, PR #28665 accidentally allowed this configuration, which could cause issues since generation models require chunked prefill to function properly.

Key changes:

  • Added explicit validation to prevent disabling chunked prefill for generation models on non-restricted platforms
  • Refactored the CPU architecture restriction logic to validate settings before applying defaults
  • Platform restrictions (ARM, POWER, S390X, RISC-V CPUs) take precedence over model requirements

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • vllm/engine/arg_utils.py: Adds validation to enforce the chunked prefill requirement for generation models and refactors the restricted CPU handling logic
  • tests/engine/test_arg_utils.py: Adds comprehensive test coverage for restricted CPU behavior, generation model validation, and platform-specific settings


    and model_config.runner_type == "generate"
    and self.enable_chunked_prefill is False
):
    raise ValueError("Chunked prefill is required for generation models. ")

Copilot AI Nov 17, 2025


The error message has a trailing space before the closing quote. This should be removed for consistency with other error messages.

Suggested change
raise ValueError("Chunked prefill is required for generation models. ")
raise ValueError("Chunked prefill is required for generation models.")

Member


I would suggest changing the message to "Chunked prefill cannot be disabled for generation models."

@noooop
Collaborator

noooop commented Nov 17, 2025

Why can't we disable chunked prefill for generation models?

@22quinn
Collaborator Author

22quinn commented Nov 17, 2025

Why can't we disable chunked prefill for generation models?

It's never been tested, but I'll defer to @WoosukKwon on whether it should be allowed.

@noooop
Collaborator

noooop commented Nov 17, 2025

Why can't we disable chunked prefill for generation models?

It's never been tested, but I'll defer to @WoosukKwon on whether it should be allowed.

vllm/tests/conftest.py

Lines 702 to 737 in 60e089f

class VllmRunner:
    """
    The default value of some arguments have been modified from
    {class}`~vllm.LLM` as follows:
    - `trust_remote_code`: Set to `True` instead of `False` for convenience.
    - `seed`: Set to `0` instead of `None` for test reproducibility.
    - `max_model_len`: Set to `1024` instead of `None` to reduce memory usage.
    - `block_size`: To reduce memory usage, set default to `64` if on XPU
      devices, otherwise default to `16`.
    - `enable_chunked_prefill`: Set to `False` instead of `None` for
      test reproducibility.
    - `enforce_eager`: Set to `False` to test CUDA graph.
    """

    def __init__(
        self,
        model_name: str,
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer_name: str | None = None,
        tokenizer_mode: str = "auto",
        trust_remote_code: bool = True,
        seed: int | None = 0,
        max_model_len: int | None = 1024,
        dtype: str = "auto",
        disable_log_stats: bool = True,
        tensor_parallel_size: int = 1,
        block_size: int = 16 if not torch.xpu.is_available() else 64,
        enable_chunked_prefill: bool | None = False,
        swap_space: int = 4,
        enforce_eager: bool | None = False,
        # Set this to avoid hanging issue
        default_torch_num_threads: int | None = None,
        **kwargs,
    ) -> None:

As far as I know, VllmRunner in ci defaults to disable chunked_prefill.

May I ask which scenarios will cause errors when chunked prefill is disabled?


I am refactoring the logic to determine whether to enable chunked prefill

    self, usage_context: UsageContext, model_config: ModelConfig
) -> None:
    """Set Default Arguments for V1 Engine."""
    # Check if running on CPU architecture with feature restrictions
Member


Why not check these after applying defaults?

Member

@njhill njhill left a comment


I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config post-init, and the logic in the former will be missed in cases where VllmConfig is created directly and not via args.

@DarkLight1337 wdyt?

    and model_config.runner_type == "generate"
    and self.enable_chunked_prefill is False
):
    raise ValueError("Chunked prefill is required for generation models. ")
Member


I would suggest changing the message to "Chunked prefill cannot be disabled for generation models."

Comment on lines +1961 to +1962
"Chunked prefill is not supported for %s; "
"disabling it for V1 backend.",
Member


shouldn't mention V1 anymore ...

Suggested change
"Chunked prefill is not supported for %s; "
"disabling it for V1 backend.",
"Chunked prefill is not supported for %s "
"and will be disabled.",

Comment on lines +1969 to +1970
"Prefix caching is not supported for %s; "
"disabling it for V1 backend.",
Member


Suggested change
"Prefix caching is not supported for %s; "
"disabling it for V1 backend.",
"Prefix caching is not supported for %s; "
"and will be disabled.",

@njhill
Member

njhill commented Nov 17, 2025

May I ask which scenarios will cause errors when chunked prefill is disabled?

@noooop there won't be any errors, the setting just has no effect (you set it to disabled but it will still do prefill chunking).

Hence it's better to fail since disabling in this case is essentially not supported.

@DarkLight1337
Member

DarkLight1337 commented Nov 17, 2025

I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config post-init, and the logic in the former will be missed in cases where VllmConfig is created directly and not via args.

It's a bit of a catch-22 situation, since we need to impute the default values before VllmConfig is created. Otherwise, the sub-config static defaults are applied which do not match the defaults in EngineArgs (which are dynamically set).

@njhill
Member

njhill commented Nov 17, 2025

I think we should move all of this arg imputation/validation logic into config/vllm.py since it's currently split between arg processing and config post-init, and the logic in the former will be missed in cases where VllmConfig is created directly and not via args.

It's a bit of a catch-22 situation, since we need to impute the default values before VllmConfig is created. Otherwise, the sub-config static defaults are applied which do not match the defaults in EngineArgs (which are dynamically set).

I think in other cases, where there are sub-config parameters whose default depend on parts of the config outside of the same sub-config, we have those default to None and resolve the value in VllmConfig post-init. Could we do that in all such cases?

I think we should decide on a standard/consistent way for how we handle this. Ideally I think it's better to move it all into the config classes rather than arg parsing for the reason mentioned above.

@DarkLight1337
Member

DarkLight1337 commented Nov 18, 2025

I think in other cases, where there are sub-config parameters whose default depend on parts of the config outside of the same sub-config, we have those default to None and resolve the value in VllmConfig post-init. Could we do that in all such cases?

That would mean we also have to move the validation logic from the sub-configs into VllmConfig. (e.g. for validating that max_num_batched_tokens is set correctly)

@noooop
Collaborator

noooop commented Nov 18, 2025

May I ask which scenarios will cause errors when chunked prefill is disabled?

@noooop there won't be any errors, the setting just has no effect (you set it to disabled but it will still do prefill chunking).

Hence it's better to fail since disabling in this case is essentially not supported.

# chunked prefill has to be enabled explicitly to allow
# pooling requests to be chunked
if (
    not self.scheduler_config.enable_chunked_prefill
    and num_new_tokens > token_budget
):
    self.waiting.pop_request()
    skipped_waiting_requests.prepend_request(request)
    continue

As far as I understand, disabling chunked prefill still works, although it's not exactly the same as V0.

@njhill
Member

njhill commented Nov 18, 2025

@noooop apologies, I missed that somehow and had an incorrect understanding. Then I guess I have the same question of why disabling chunked prefill should be disallowed for generative models.

Perhaps because the way that it's implemented means a long prefill request could get stuck indefinitely if there's a continual stream of smaller requests.

We don't want that to be the default of course in any case, and I don't think it should be set by default in VllmRunner either.
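The starvation concern above can be illustrated with a toy loop (entirely hypothetical numbers; the real scheduler is far more involved than this):

```python
def fits(prefill_tokens: int, budget_left: int, chunked_prefill: bool) -> bool:
    """With chunked prefill off, a prefill only runs if it fits whole."""
    return chunked_prefill or prefill_tokens <= budget_left


token_budget = 1024
long_prefill = 2048  # larger than the per-step budget
stuck_steps = 0
for _ in range(10):
    # Each step, two 512-token requests arrive and claim the budget first.
    budget_left = token_budget - 2 * 512
    if not fits(long_prefill, budget_left, chunked_prefill=False):
        stuck_steps += 1  # the long request is skipped yet again
print(stuck_steps)  # prints 10: skipped every step while small requests keep coming
```

With `chunked_prefill=True` the long request could always make partial progress, which is the behavior the PR wants to guarantee for generation models.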

@mergify

mergify bot commented Nov 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @22quinn.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale (Over 90 days of inactivity) label on Feb 19, 2026
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

github-actions bot closed this on Mar 21, 2026

Labels

needs-rebase, stale (Over 90 days of inactivity)

5 participants