fix: MLLM hybrid batching + message normalization (#224)
Thump604 wants to merge 1 commit into waybarrios:main
Conversation
Split from waybarrios#165 — prefix cache hybrid changes deferred to waybarrios#217.

Fixes:
- mllm_batch_generator: hybrid cache handling (ArraysCache + KVCache interleaved) for make_batch_cache, merge, filter, extract, extend
- mllm_scheduler: hybrid cache scheduling
- server.py: _normalize_messages (developer->system role mapping, consecutive same-role merge), applied to both the MLLM and LLM paths
- tokenizer: VLM tokenizer loader with fallback
- qwen3_5_mllm: Qwen3.5 MLLM patch for hybrid batching

Fixes waybarrios#137 (developer role crash); fixes the OpenCode multi-system crash.
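The normalization step above can be sketched as follows. This is a minimal illustration, not the PR's actual server.py code; the function name, message shape, and merge separator are assumptions.

```python
# Sketch of _normalize_messages-style handling (assumed, not the PR's code):
# map the 'developer' role to 'system', and merge consecutive same-role
# messages so strict chat templates accept the sequence.
def normalize_messages(messages):
    normalized = []
    for msg in messages:
        # OpenAI-style clients (e.g. OpenCode) may send 'developer' messages;
        # templates that only know 'system' reject them, so remap the role.
        role = "system" if msg["role"] == "developer" else msg["role"]
        content = msg.get("content", "")
        if normalized and normalized[-1]["role"] == role:
            # Consecutive same-role messages: fold into the previous one
            # instead of emitting a sequence the template would reject.
            normalized[-1]["content"] += "\n" + content
        else:
            normalized.append({"role": role, "content": content})
    return normalized

msgs = [
    {"role": "developer", "content": "Be terse."},
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Run the tests."},
]
# Collapses to two messages: one system, one user.
print(normalize_messages(msgs))
```

A real implementation would also need to handle multimodal content parts (lists of text/image blocks) rather than plain strings; that detail is omitted here.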
Evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit, BatchedEngine:
All 3/3 tests pass. This is the slim split of #165, with prefix cache changes deferred to #217. The message normalization fix is particularly important -- Qwen 3.5 templates reject consecutive same-role messages and out-of-order system messages. Without it, multi-turn agent conversations crash (OpenCode sends [system, system, user, user] sequences).
@janhilgard - this fixes multi-turn agent crashes caused by developer-role messages and consecutive same-role messages. Would appreciate a review if you have time.
Split into three focused PRs for easier review:
Each PR is based on origin/main and stands alone, with no cross-dependencies.
Summary
Split from #165 — prefix cache hybrid changes deferred to #217 (better approach).
This PR contains the non-prefix-cache fixes from #165:
- Hybrid cache handling (ArraysCache + KVCache interleaved) for the _make_batch_cache, merge, filter, extract, and extend operations
- _normalize_messages() — maps the developer role to system and merges consecutive same-role messages. Applied to both the MLLM and LLM paths.

Fixes #137 (developer role crash on Qwen 3.5 templates). Fixes the OpenCode crash when sending [system, system, user, user] message sequences.

Test plan
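Why the batch operations need per-layer dispatch in a hybrid model can be sketched as follows. The stub classes and field layout are assumptions for illustration, not the actual mlx cache implementation: the point is that KV layers and arrays-backed layers store state differently, so a single all-KV code path breaks when the two are interleaved.

```python
# Sketch (assumed structure): a hybrid cache is a per-layer list mixing two
# storage types, so batch ops like filter must branch on the layer's type.
from dataclasses import dataclass

@dataclass
class KVLayer:          # attention layer: per-sequence keys and values
    keys: list
    values: list

@dataclass
class ArraysLayer:      # non-attention layer: one state array per sequence
    state: list

def filter_batch(cache, keep):
    """Keep only the sequences at the given batch indices, per storage type."""
    for layer in cache:
        if isinstance(layer, KVLayer):
            layer.keys = [layer.keys[i] for i in keep]
            layer.values = [layer.values[i] for i in keep]
        else:
            layer.state = [layer.state[i] for i in keep]

# Interleaved layers, batch of 3 sequences; drop the middle one.
cache = [
    KVLayer(keys=["k0", "k1", "k2"], values=["v0", "v1", "v2"]),
    ArraysLayer(state=["s0", "s1", "s2"]),
]
filter_batch(cache, keep=[0, 2])
print(cache[0].keys, cache[1].state)  # ['k0', 'k2'] ['s0', 's2']
```

The merge, extract, and extend operations in the PR face the same issue: each must dispatch on the layer's storage type rather than assume a uniform KV cache.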
- All 3/3 tests pass (test_mllm_hybrid_cache.py)

Context
The original #165 also included prefix cache hybrid support (prefix_cache.py). That part overlaps with #217, which has a cleaner storage-type dispatch model. This PR extracts the unique, non-overlapping fixes so they can be reviewed and merged independently.