feat: add --prefill-step-size CLI flag #105
Conversation
Looked through this PR. The core plumbing for `serve` works correctly; the flag flows through `SchedulerConfig` to both schedulers. A few things I noticed:

1. The flag is also registered in `bench_parser`, but `bench` doesn't exercise the MLLM scheduler path, so it should probably be `serve`-only.
2. `--prefill-step-size` is a broad name for what it controls; `--mllm-prefill-step-size` would match the internal field name.
3. There's no validation on the value, so an explicit `0` would propagate through as-is.

None of these are blockers; the fix for the original MLLM prefill guard issue works.
Would you mind addressing point 1 (removing the flag from `bench`)?
btw @kol22 great work!
Addressed all three points:

1. Removed the flag from `bench`; it's `serve`-only now.
2. Renamed to `--mllm-prefill-step-size`.
3. Added `__post_init__` validation (`mllm_prefill_step_size` must be `> 0` when provided).

Appreciate the feedback and quick review!
@waybarrios — all three review items were addressed back on Feb 25 (removed from `bench`, renamed to `--mllm-prefill-step-size`, `__post_init__` validation added). It does have merge conflicts from recent main changes — @kol22 would you mind rebasing? After that it should be ready to merge.
Expose prefill_step_size as a CLI argument for both serve and bench commands. Default of 0 means "use engine default" (2048 for LLM, 1024 for MLLM), preserving existing behavior. Vision models routinely exceed 1024 tokens per prompt (images alone contribute 1400+), hitting the MLLM batch generator's safe limit. This flag lets users raise the limit without patching source code.
kol22 force-pushed from ee69b94 to 186371d
@janhilgard - resolved & rebased. Should be good to go.
janhilgard left a comment
LGTM — all three review items from @waybarrios have been addressed:
- Removed from `bench_parser` — the flag is now `serve`-only, matching the actual MLLM scheduler path.
- Renamed to `--mllm-prefill-step-size` — clear scope, matches the internal field name.
- `__post_init__` validation — `mllm_prefill_step_size` must be `> 0` when provided, preventing the `0` propagation edge case (a minimal sketch follows below).
Rebase on current main is clean. Code is minimal and consistent with existing patterns. Ready to merge.
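For reference, a minimal sketch of that validation, assuming a dataclass-style config; the class shape and the `Optional` handling are assumptions, only the `MLLMSchedulerConfig` and `mllm_prefill_step_size` names come from this thread:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MLLMSchedulerConfig:
    # None means "not provided": the engine default (1024 for MLLM) applies.
    mllm_prefill_step_size: Optional[int] = None

    def __post_init__(self):
        # Must be > 0 when provided, so an explicit 0 never propagates into
        # the prefill guard as an effective budget of zero tokens.
        if self.mllm_prefill_step_size is not None and self.mllm_prefill_step_size <= 0:
            raise ValueError(
                f"mllm_prefill_step_size must be > 0, got {self.mllm_prefill_step_size}"
            )
```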
Summary
`MLLMSchedulerConfig.prefill_step_size` defaults to 1024 but isn't exposed as a CLI argument. The MLLM batch generator enforces `total_prompt_tokens <= prefill_step_size * batch_count`, so any single vision request exceeding 1024 tokens fails with a `ValueError`. Since vision tokens alone typically exceed 1024 (images contribute ~1400+ tokens), this effectively blocks all MLLM inference under the default config.
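To make the failure concrete, here is the guard paraphrased in a few lines, using the defaults and the token counts quoted above (illustrative only, not the repo's actual code):

```python
prefill_step_size = 1024    # MLLM engine default
batch_count = 1             # a single vision request
total_prompt_tokens = 1400  # image tokens alone, per the summary above

# The batch generator's safety check, paraphrased:
if total_prompt_tokens > prefill_step_size * batch_count:
    raise ValueError(
        f"prompt of {total_prompt_tokens} tokens exceeds the prefill "
        f"budget of {prefill_step_size * batch_count}"
    )  # 1400 > 1024, so the default config rejects the request
```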
Fix
Add `--prefill-step-size` to both `serve` and `bench` commands. Default of `0` means "use engine default" — 2048 for LLM, 1024 for MLLM — preserving existing behavior. When set, the value flows through `SchedulerConfig` to both the LLM scheduler and the MLLM scheduler.
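A rough sketch of that wiring, assuming an argparse-based CLI; the parser setup and the 0-to-`None` translation are assumptions, only the flag name and the `0` sentinel come from this PR:

```python
import argparse

parser = argparse.ArgumentParser(prog="serve")
parser.add_argument(
    "--prefill-step-size",
    type=int,
    default=0,  # 0 is the sentinel for "use engine default" (2048 LLM / 1024 MLLM)
    help="Prompt tokens processed per prefill step; 0 keeps the engine default.",
)
args = parser.parse_args()

# Translate the sentinel to "unset" before building SchedulerConfig, so the
# engine defaults stay in effect when the flag is omitted.
prefill_step_size = args.prefill_step_size if args.prefill_step_size > 0 else None
```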
Example usage:
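Something like the following, where the entrypoint and the `--model` spelling are placeholders; `serve`, the model name, and the flag value are taken from the test notes below:

```bash
$ my-inference-cli serve --model Qwen3-VL-32B-Instruct-8bit --prefill-step-size 16384
```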
Test
Tested with Qwen3-VL-32B-Instruct-8bit. Before this fix, vision requests fail with the `ValueError` above. After passing `--prefill-step-size 16384`, they complete successfully.