fix: MLLM hybrid batching + message normalization#224

Closed
Thump604 wants to merge 1 commit into waybarrios:main from Thump604:fix/mllm-hybrid-batching-slim

Conversation

@Thump604
Collaborator

Summary

Split from #165 — prefix cache hybrid changes deferred to #217 (better approach).

This PR contains the non-prefix-cache fixes from #165:

  • mllm_batch_generator.py: Hybrid cache handling (ArraysCache + KVCache interleaved) for _make_batch_cache, merge, filter, extract, extend operations
  • mllm_scheduler.py: Hybrid cache scheduling for BatchedEngine
  • server.py: _normalize_messages() — maps developer -> system role, merges consecutive same-role messages. Applied to both MLLM and LLM paths.
  • tokenizer.py: VLM tokenizer loader with mlx-lm fallback
  • qwen3_5_mllm.py: Qwen3.5 MLLM patch for hybrid batching
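The hybrid cache handling above boils down to per-layer dispatch on cache type: the batch cache is a list that interleaves attention-style caches (which grow along the sequence axis) with state-style caches (which hold a fixed-size recurrent state). A minimal sketch of that pattern, using stub classes rather than the PR's actual `ArraysCache`/`KVCache` implementations (the method names here are illustrative assumptions):

```python
# Sketch of per-layer cache-type dispatch for a hybrid model. The stub
# classes stand in for KVCache (attention layers) and ArraysCache
# (SSM/linear layers); only the interleaved-dispatch pattern is the point.

class KVCacheStub:
    """Attention layer: keys/values accumulate along the sequence axis."""
    def __init__(self):
        self.tokens = []

    def extend(self, toks):
        self.tokens.extend(toks)


class ArraysCacheStub:
    """SSM/linear layer: fixed-size state, overwritten rather than grown."""
    def __init__(self):
        self.state = None

    def extend(self, toks):
        # Keep only the latest state; nothing grows with sequence length.
        if toks:
            self.state = toks[-1]


def make_batch_cache(layer_types):
    # One cache object per layer, preserving the model's interleaved order.
    return [
        KVCacheStub() if t == "attention" else ArraysCacheStub()
        for t in layer_types
    ]


def extend_batch(caches, toks):
    # Dispatch per layer: each cache type absorbs new tokens its own way.
    for c in caches:
        c.extend(toks)
```

The merge, filter, and extract operations mentioned above would follow the same shape: iterate over the layer list and branch on the cache type at each position.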

Fixes #137 (developer role crash on Qwen 3.5 templates). Fixes OpenCode crash when sending [system, system, user, user] message sequences.
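The tokenizer bullet above (VLM loader with mlx-lm fallback) is a try-first pattern. A hedged sketch, where `vlm_load` and `llm_load` are hypothetical stand-ins for the real mlx-vlm and mlx-lm entry points:

```python
# Sketch of a try-VLM-first, fall-back-to-mlx-lm tokenizer loader.
# vlm_load / llm_load are hypothetical callables, not the libraries'
# actual APIs; only the fallback structure is the point.

def load_tokenizer(model_path, vlm_load, llm_load):
    try:
        return vlm_load(model_path)
    except Exception:
        # VLM loader failed (e.g. a text-only model); fall back to mlx-lm.
        return llm_load(model_path)
```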

Test plan

  • 524-line test suite for hybrid cache operations (test_mllm_hybrid_cache.py)
  • Validated on M2 Ultra 128GB with Qwen3.5-122B-A10B (hybrid MoE, ArraysCache + KVCache layers)
  • Tested with BatchedEngine continuous batching + prefix cache enabled

Context

The original #165 also included prefix cache hybrid support (prefix_cache.py). That part overlaps with #217 which has a cleaner storage-type dispatch model. This PR extracts the unique, non-overlapping fixes so they can be reviewed and merged independently.

Split from waybarrios#165 — prefix cache hybrid changes deferred to waybarrios#217.

Fixes:
- mllm_batch_generator: hybrid cache handling (ArraysCache + KVCache
  interleaved) for make_batch_cache, merge, filter, extract, extend
- mllm_scheduler: hybrid cache scheduling
- server.py: _normalize_messages (developer->system role mapping,
  consecutive same-role merge) applied to MLLM and LLM paths
- tokenizer: VLM tokenizer loader with fallback
- qwen3_5_mllm: Qwen3.5 MLLM patch for hybrid batching

Fixes waybarrios#137 (developer role crash), fixes OpenCode multi-system crash.
@Thump604
Collaborator Author

Evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit, BatchedEngine:

| Test | Result |
| --- | --- |
| Text request through BatchedEngine | PASS |
| Message normalization (consecutive user messages) | PASS |
| Developer role mapping | PASS |

All 3/3 tests pass. This is the slim split of #165 with prefix cache changes deferred to #217.

The message normalization fix is particularly important: Qwen 3.5 templates reject consecutive same-role messages and system messages that appear out of order. Without it, multi-turn agent conversations crash (OpenCode sends [system, system, user, user] sequences, and the OpenAI Responses API sends the developer role instead of system). This has caused production crashes three times.

@Thump604
Collaborator Author

@janhilgard - this fixes multi-turn agent crashes from developer role messages and consecutive same-role messages. Would appreciate a review if you have time.

@Thump604
Collaborator Author

Split into three focused PRs for easier review. Each PR is based on origin/main and stands alone with no cross-dependencies.


Development

Successfully merging this pull request may close these issues.

Segfault on long context with unrecognized message role (developer)