fix: MLLM hybrid batching + message normalization (#224)
Thump604 wants to merge 1 commit into waybarrios:main
Conversation
Split from waybarrios#165 — prefix cache hybrid changes deferred to waybarrios#217.

Fixes:
- mllm_batch_generator: hybrid cache handling (ArraysCache + KVCache interleaved) for make_batch_cache, merge, filter, extract, extend
- mllm_scheduler: hybrid cache scheduling
- server.py: _normalize_messages (developer->system role mapping, consecutive same-role merge), applied to both the MLLM and LLM paths
- tokenizer: VLM tokenizer loader with fallback
- qwen3_5_mllm: Qwen3.5 MLLM patch for hybrid batching

Fixes waybarrios#137 (developer role crash); fixes the OpenCode multi-system crash.
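The normalization step above can be sketched as follows. This is a minimal illustration, not the PR's actual server.py code; the function name, message shape, and merge separator are assumptions.

```python
# Sketch of _normalize_messages-style handling (assumed, not the PR's code):
# map the 'developer' role to 'system', and merge consecutive same-role
# messages so strict chat templates accept the sequence.
def normalize_messages(messages):
    normalized = []
    for msg in messages:
        # OpenAI-style clients (e.g. OpenCode) may send 'developer' messages;
        # templates that only know 'system' reject them, so remap the role.
        role = "system" if msg["role"] == "developer" else msg["role"]
        content = msg.get("content", "")
        if normalized and normalized[-1]["role"] == role:
            # Consecutive same-role messages: fold into the previous one
            # instead of emitting a sequence the template would reject.
            normalized[-1]["content"] += "\n" + content
        else:
            normalized.append({"role": role, "content": content})
    return normalized

msgs = [
    {"role": "developer", "content": "Be terse."},
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Run the tests."},
]
# Collapses to two messages: one system, one user.
print(normalize_messages(msgs))
```

A real implementation would also need to handle multimodal content parts (lists of text/image blocks) rather than plain strings; that detail is omitted here.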
Evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit, BatchedEngine:
All 3/3 tests pass. This is the slim split of #165, with prefix cache changes deferred to #217. The message normalization fix is particularly important -- Qwen 3.5 templates reject consecutive same-role messages and out-of-order system messages. Without it, multi-turn agent conversations crash (OpenCode sends [system, system, user, user] sequences).
@janhilgard - this fixes multi-turn agent crashes caused by developer-role messages and consecutive same-role messages. Would appreciate a review if you have time.
Split into three focused PRs for easier review:
Each PR is based on origin/main and stands alone, with no cross-dependencies.
Summary
Split from #165 — prefix cache hybrid changes deferred to #217 (better approach).
This PR contains the non-prefix-cache fixes from #165:
- Hybrid cache handling (ArraysCache + KVCache interleaved) for the _make_batch_cache, merge, filter, extract, and extend operations
- _normalize_messages() — maps the developer role to system and merges consecutive same-role messages. Applied to both the MLLM and LLM paths.

Fixes #137 (developer role crash on Qwen 3.5 templates). Fixes the OpenCode crash when sending [system, system, user, user] message sequences.

Test plan
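Why the batch operations need per-layer dispatch in a hybrid model can be sketched as follows. The stub classes and field layout are assumptions for illustration, not the actual mlx cache implementation: the point is that KV layers and arrays-backed layers store state differently, so a single all-KV code path breaks when the two are interleaved.

```python
# Sketch (assumed structure): a hybrid cache is a per-layer list mixing two
# storage types, so batch ops like filter must branch on the layer's type.
from dataclasses import dataclass

@dataclass
class KVLayer:          # attention layer: per-sequence keys and values
    keys: list
    values: list

@dataclass
class ArraysLayer:      # non-attention layer: one state array per sequence
    state: list

def filter_batch(cache, keep):
    """Keep only the sequences at the given batch indices, per storage type."""
    for layer in cache:
        if isinstance(layer, KVLayer):
            layer.keys = [layer.keys[i] for i in keep]
            layer.values = [layer.values[i] for i in keep]
        else:
            layer.state = [layer.state[i] for i in keep]

# Interleaved layers, batch of 3 sequences; drop the middle one.
cache = [
    KVLayer(keys=["k0", "k1", "k2"], values=["v0", "v1", "v2"]),
    ArraysLayer(state=["s0", "s1", "s2"]),
]
filter_batch(cache, keep=[0, 2])
print(cache[0].keys, cache[1].state)  # ['k0', 'k2'] ['s0', 's2']
```

The merge, extract, and extend operations in the PR face the same issue: each must dispatch on the layer's storage type rather than assume a uniform KV cache.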
- All 3/3 tests pass (test_mllm_hybrid_cache.py)

Context
The original #165 also included prefix cache hybrid support (prefix_cache.py). That part overlaps with #217, which has a cleaner storage-type dispatch model. This PR extracts the unique, non-overlapping fixes so they can be reviewed and merged independently.