Add mid-batch extend for text-only MLLM requests#263

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/mllm-mid-batch-extend
Conversation

@janhilgard
Collaborator

Summary

  • The MLLM batch generator previously started new batches only when num_active == 0, serializing all concurrent requests. A short text-only request had to wait for a long request to finish generating all of its tokens.
  • Text-only requests (no images or videos) can now join an active batch mid-generation via MLLMBatch.extend(). Multimodal requests still wait for batch completion due to vision encoding shape constraints.
  • This reuses the existing MLLMBatch.extend() and BatchKVCache.extend() methods; no new infrastructure is needed.
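The scheduling decision described above can be sketched as follows. MLLMBatch.extend() and BatchKVCache.extend() are real methods named in this PR, but the Request/Batch classes, their field names, and the admit() helper here are hypothetical stand-ins for illustration, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    # Hypothetical request shape; only the text-only check matters here.
    rid: str
    images: list = field(default_factory=list)
    videos: list = field(default_factory=list)

    @property
    def is_text_only(self):
        return not self.images and not self.videos

@dataclass
class Batch:
    active: list = field(default_factory=list)

    def extend(self, req):
        # Stands in for MLLMBatch.extend(), which grows the batch
        # (and its BatchKVCache) while generation is in flight.
        self.active.append(req)

def admit(batch, pending):
    """Admit pending requests; return those that must keep waiting."""
    waiting = []
    for req in pending:
        if not batch.active:
            # Idle: any request may start a fresh batch.
            batch.active.append(req)
        elif req.is_text_only:
            # Text-only: join the active batch mid-generation.
            batch.extend(req)
        else:
            # Multimodal: vision encoding shape constraints mean it
            # waits until the current batch drains (num_active == 0).
            waiting.append(req)
    return waiting
```

The key point is the middle branch: before this PR, every request fell into the "waiting" path whenever a batch was active.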

Before

Request 1 (long):  |===prefill===|=========generation (2000 tokens)=========|
Request 2 (short): |                    waiting...                           |===prefill+gen===|

After

Request 1 (long):  |===prefill===|=========generation (2000 tokens)=========|
Request 2 (short):        |==prefill+gen==| done!

Tested on Qwen3.5-35B-A3B (--mllm --continuous-batching)

  • Long request: 53.8s (2000 max_tokens)
  • Short request: 1.0s (joined mid-batch, finished 48.8s earlier)
  • Both outputs correct, no artifacts

Test plan

  • Concurrent text-only requests: short request joins active batch and finishes independently
  • Output correctness: both requests produce coherent text
  • Multimodal requests still wait for num_active == 0 (no regression)
  • Edge case: _process_prompts returns None (handled, no extend called)
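The last bullet's edge case can be sketched like this. The name _process_prompts comes from the PR; its behavior here and the surrounding helper are illustrative assumptions, not the actual code:

```python
def _process_prompts(requests):
    # Stand-in: assume processing can return None when there is
    # nothing to prefill (the edge case named in the test plan).
    return requests or None

def try_extend(batch, requests):
    """Extend the batch only if prompt processing produced something."""
    processed = _process_prompts(requests)
    if processed is None:
        return False          # no extend called; batch is untouched
    batch.extend(processed)
    return True

class FakeBatch:
    # Minimal test double recording what was extended.
    def __init__(self):
        self.extended = []

    def extend(self, reqs):
        self.extended.extend(reqs)
```

The guard ensures a None result short-circuits before MLLMBatch.extend() is ever reached.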

🤖 Generated with Claude Code

Previously, MLLM batch generator only started new batches when
num_active == 0, serializing concurrent requests. A short text-only
request had to wait for a long request to finish all its tokens.

Now text-only requests (no images/videos) can join an active batch
mid-generation via MLLMBatch.extend(). Multimodal requests still
wait for batch completion due to vision encoding shape constraints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard
Collaborator Author

@Thump604 Superseded — mid-batch extend for text-only MLLM requests is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026