Add mid-batch extend for text-only MLLM requests #263
Closed
janhilgard wants to merge 1 commit into waybarrios:main
Previously, the MLLM batch generator only started a new batch when num_active == 0, which serialized concurrent requests: a short text-only request had to wait for a long request to finish all its tokens.

Now text-only requests (no images/videos) can join an active batch mid-generation via MLLMBatch.extend(). Multimodal requests still wait for batch completion due to vision encoding shape constraints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
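The admission rule described above can be illustrated with a small, self-contained toy model. This is only a sketch and not the code from this PR: ToyRequest, ToyBatch, admit, simulate, and make_arrivals are hypothetical names, the token counts are made up, and the toy extend() simply concatenates request lists, whereas the real MLLMBatch.extend() also has to merge KV caches. The sketch also assumes that an idle generator batches everything that is waiting, which the PR does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyRequest:
    """Hypothetical request: tokens still to generate plus an image count."""
    name: str
    tokens_left: int
    num_images: int = 0

    @property
    def is_text_only(self) -> bool:
        return self.num_images == 0


@dataclass
class ToyBatch:
    """Hypothetical stand-in for an active batch (not the real MLLMBatch)."""
    requests: List[ToyRequest] = field(default_factory=list)

    def extend(self, other: "ToyBatch") -> None:
        # Toy merge: append the other batch's requests (no KV cache here).
        self.requests.extend(other.requests)

    def step(self) -> List[str]:
        # Generate one token per request; return the names that finished.
        for r in self.requests:
            r.tokens_left -= 1
        done = [r.name for r in self.requests if r.tokens_left <= 0]
        self.requests = [r for r in self.requests if r.tokens_left > 0]
        return done


def admit(active: ToyBatch, waiting: List[ToyRequest]) -> List[ToyRequest]:
    """Admit what is allowed into `active`; return requests still waiting."""
    if not active.requests:
        # Idle (num_active == 0): assumed to start a batch with everything.
        active.extend(ToyBatch(list(waiting)))
        return []
    # Busy: only text-only requests may join mid-generation.
    joiners = [r for r in waiting if r.is_text_only]
    if joiners:  # nothing to add means extend is never called
        active.extend(ToyBatch(joiners))
    return [r for r in waiting if not r.is_text_only]


def simulate(arrivals: Dict[int, List[ToyRequest]],
             mid_batch_extend: bool) -> Dict[str, int]:
    """Return the decode step at which each request finishes."""
    active, waiting, finished, step = ToyBatch(), [], {}, 0
    while active.requests or waiting or any(s >= step for s in arrivals):
        waiting.extend(arrivals.get(step, []))
        # Old behaviour: admit only when idle. New: also try while busy.
        if mid_batch_extend or not active.requests:
            waiting = admit(active, waiting)
        step += 1
        for name in active.step():
            finished[name] = step
    return finished


def make_arrivals() -> Dict[int, List[ToyRequest]]:
    # Fresh objects on every call because simulate() mutates tokens_left.
    return {
        0: [ToyRequest("long", tokens_left=10)],
        2: [ToyRequest("short", tokens_left=3),
            ToyRequest("img", tokens_left=2, num_images=1)],
    }


if __name__ == "__main__":
    print("serialized:", simulate(make_arrivals(), mid_batch_extend=False))
    print("mid-batch: ", simulate(make_arrivals(), mid_batch_extend=True))
```

In this toy run the short text-only request finishes at step 5 with mid-batch extend instead of step 13 without it, while the multimodal request finishes at step 12 in both modes because it still waits for the active batch to drain. These are step counts from the toy model only, not measurements from the PR.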
Summary
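- Previously, the MLLM batch generator only started new batches when num_active == 0, serializing all concurrent requests. A short text-only request had to wait for a long request to finish generating all its tokens.
- Now text-only requests (no images/videos) can join an active batch mid-generation via MLLMBatch.extend(). Multimodal requests still wait for batch completion due to vision encoding shape constraints.
- Uses the existing MLLMBatch.extend() method and BatchKVCache.extend(); no new infrastructure needed.

Before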
After
Tested on Qwen3.5-35B-A3B (--mllm --continuous-batching)
Test plan
- num_active == 0 (no regression)
- _process_prompts returns None (handled, no extend called)

🤖 Generated with Claude Code