
feat: add --prefill-step-size CLI flag #105

Merged
janhilgard merged 3 commits into waybarrios:main from kol22:fix/prefill-step-size-mllm
Apr 12, 2026

Conversation

@kol22
Contributor

@kol22 kol22 commented Feb 23, 2026

Summary

MLLMSchedulerConfig.prefill_step_size defaults to 1024 but isn't exposed as a
CLI argument. The MLLM batch generator enforces
total_prompt_tokens <= prefill_step_size * batch_count, so any single vision
request exceeding 1024 tokens fails with:

ValueError: Total prompt tokens (2289) exceeds safe limit (1024)

Since vision tokens alone typically exceed 1024 (images contribute ~1400+
tokens), this effectively blocks all MLLM inference under the default config.
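
To make the failure mode concrete, the guard behaves roughly like this (a simplified sketch with illustrative names, not the actual vllm-mlx source):

def check_prompt_budget(total_prompt_tokens: int,
                        prefill_step_size: int = 1024,
                        batch_count: int = 1) -> None:
    # Simplified sketch of the prefill guard described above; names are
    # illustrative, not taken from the vllm-mlx source.
    safe_limit = prefill_step_size * batch_count
    if total_prompt_tokens > safe_limit:
        raise ValueError(
            f"Total prompt tokens ({total_prompt_tokens}) "
            f"exceeds safe limit ({safe_limit})"
        )

# A single image-bearing prompt of ~2289 tokens already trips the 1024 default:
# check_prompt_budget(2289)  -> ValueError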

Fix

Add --prefill-step-size to both serve and bench commands. Default of 0
means "use engine default" — 2048 for LLM, 1024 for MLLM — preserving existing
behavior. When set, the value flows through SchedulerConfig to both the LLM
scheduler and the MLLM scheduler.

Example usage:

vllm-mlx serve model-name --continuous-batching --prefill-step-size 16384
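
For reference, the intended plumbing is roughly as follows (a sketch assuming argparse-style parsing; the real parser and config wiring live in vllm-mlx and may differ in detail):

import argparse

parser = argparse.ArgumentParser(prog="vllm-mlx serve")
parser.add_argument(
    "--prefill-step-size", type=int, default=0,
    help="Prefill step size; 0 means use the engine default "
         "(2048 for LLM, 1024 for MLLM).",
)

args = parser.parse_args(["--prefill-step-size", "16384"])
# 0 is the "use engine default" sentinel; anything else overrides it.
prefill_step_size = args.prefill_step_size or None  # None -> engine default
# This value is then carried on SchedulerConfig and handed to the LLM and
# MLLM schedulers.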

Test

Tested with Qwen3-VL-32B-Instruct-8bit. Before this fix, vision requests fail
with the ValueError above. After passing --prefill-step-size 16384, they
complete successfully.

@kol22 kol22 marked this pull request as draft February 24, 2026 23:32
@kol22 kol22 marked this pull request as ready for review February 25, 2026 01:31
@waybarrios waybarrios added the enhancement New feature or request label Feb 25, 2026
@waybarrios
Owner

Looked through this PR. The core plumbing for serve works correctly; the flag flows through SchedulerConfig into MLLMSchedulerConfig as expected.

A few things I noticed:

  1. The --prefill-step-size flag is also added to bench_parser, but bench_command uses AsyncEngineCore which only runs the LLM Scheduler. The MLLM scheduler path lives in BatchedEngine._start_mllm(), which bench never calls. So the flag gets accepted on vllm-mlx bench but does nothing. Might be worth removing it from bench or adding a note.

  2. The flag name --prefill-step-size is pretty generic, but it only controls the MLLM path (mllm_prefill_step_size). SchedulerConfig already has a separate prefill_step_size field (default 2048) for the LLM BatchGenerator. Someone running a plain LLM model with --prefill-step-size 512 would see the flag accepted with no effect. Something like --mllm-prefill-step-size would make the scope clearer and match the internal field name.

  3. The "0 means use default" contract lives only in the CLI layer. The dataclass field is Optional[int] with no validation, so if anything constructs SchedulerConfig(mllm_prefill_step_size=0) directly, that 0 gets forwarded straight to MLLMSchedulerConfig and you'd end up with max_batch_tokens = 0 (sketched below).
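
A rough illustration of that edge case (SchedulerConfig here is a simplified stand-in, not the real class):

from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfig:  # simplified stand-in for illustration
    mllm_prefill_step_size: Optional[int] = None  # None -> engine default

# Bypasses the CLI's "0 means default" translation entirely:
cfg = SchedulerConfig(mllm_prefill_step_size=0)
# Forwarding this 0 unchanged gives the MLLM scheduler max_batch_tokens = 0,
# rejecting every request.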

None of these are blockers; the fix for the original MLLM prefill guard issue works.

@waybarrios
Owner

Would you mind addressing point 1 (removing the flag from bench_parser or wiring it to the MLLM bench path) and point 2 (renaming to --mllm-prefill-step-size) before merging? Those two would avoid confusion for users. Point 3 is minor and can be addressed later if you prefer.

@waybarrios
Owner

btw @kol22 great work!

@kol22
Contributor Author

kol22 commented Feb 25, 2026

ee69b94

Addressed all three points:

  1. Removed the prefill override flag from bench_parser, since bench_command runs AsyncEngineCore (the LLM scheduler path).
  2. Renamed the serve flag to --mllm-prefill-step-size to make scope explicit and avoid confusion with the LLM prefill_step_size.
  3. Added validation in SchedulerConfig.__post_init__ so mllm_prefill_step_size must be > 0 when provided (prevents direct non-CLI construction with 0 from propagating); a minimal sketch is below.
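
Roughly what the added validation does (a simplified sketch, not the exact diff):

from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfig:  # simplified stand-in, not the exact class
    mllm_prefill_step_size: Optional[int] = None  # None -> engine default

    def __post_init__(self) -> None:
        # Explicit values must be positive; None still means "use engine default".
        if self.mllm_prefill_step_size is not None and self.mllm_prefill_step_size <= 0:
            raise ValueError("mllm_prefill_step_size must be > 0 when provided")

# SchedulerConfig(mllm_prefill_step_size=0) now raises immediately instead of
# silently producing max_batch_tokens = 0 downstream.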

Appreciate the feedback and quick review!

@janhilgard
Collaborator

@waybarrios — all three review items were addressed back on Feb 25 (removed from bench, renamed to --mllm-prefill-step-size, added > 0 validation). This has been waiting for a follow-up review for ~7 weeks now.

It does have merge conflicts from recent main changes — @kol22 would you mind rebasing? After that it should be ready to merge.

kol22 added 3 commits April 12, 2026 15:08
Expose prefill_step_size as a CLI argument for both serve and bench
commands. Default of 0 means "use engine default" (2048 for LLM, 1024
for MLLM), preserving existing behavior.

Vision models routinely exceed 1024 tokens per prompt (images alone
contribute 1400+), hitting the MLLM batch generator's safe limit.
This flag lets users raise the limit without patching source code.
@kol22 kol22 force-pushed the fix/prefill-step-size-mllm branch from ee69b94 to 186371d Compare April 12, 2026 20:09
@kol22
Contributor Author

kol22 commented Apr 12, 2026

@janhilgard - resolved & rebased. Should be good to go.

Collaborator

@janhilgard janhilgard left a comment


LGTM — all three review items from @waybarrios have been addressed:

  1. Removed from bench_parser — the flag is now serve-only, matching the actual MLLM scheduler path.
  2. Renamed to --mllm-prefill-step-size — clear scope, matches the internal field name.
  3. __post_init__ validation added — mllm_prefill_step_size must be > 0 when provided, preventing the 0 propagation edge case.

Rebase on current main is clean. Code is minimal and consistent with existing patterns. Ready to merge.

@janhilgard janhilgard merged commit 7cfae14 into waybarrios:main Apr 12, 2026
7 checks passed
@kol22 kol22 deleted the fix/prefill-step-size-mllm branch April 12, 2026 20:27

Labels

enhancement New feature or request


3 participants