
feat: MTP per-request routing + system KV cache in BatchedEngine#192

Closed

Thump604 wants to merge 12 commits into waybarrios:main from Thump604:feat/batched-engine-parity

Conversation


@Thump604 Thump604 commented Mar 21, 2026

Port MTP routing, SpecPrefill, and system KV caching to BatchedEngine for continuous batching parity with SimpleEngine.

What:

  • Build TextModel from VLM backbone at load time (zero-copy)
  • Text-only → TextModel with MTP, media → MLLM (same routing as SimpleEngine)
  • SpecPrefill as pre-scheduler preprocessing (draft scoring → sparse tokens → batch queue)
  • System prompt KV caching via ChatML boundary detection
  • Fix specprefill CLI pipeline (args were parsed but never passed to engine)

Results (M2 Ultra 128GB, Qwen3.5-122B):

  • 5 concurrent x 1024 tokens in 8s (was 29s with single-request lock)
  • System KV cache: 5.9s cold → 0.8s cached (7.3x)
  • Zero crashes under concurrent load (MLX thread-local-streams #3281)

Depends on: #171, #180

Files: engine/batched.py, server.py, cli.py, api/models.py, specprefill.py, message_utils.py

Thump604 added 10 commits March 17, 2026 11:12

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

Multi-turn conversations with tool results can produce message lists
where system messages appear out of order or consecutive messages share
the same role. Qwen 3.5's chat template rejects these with "System
message must be at the beginning", crashing CLI agents on turn 2.

Add _normalize_messages() to SimpleEngine.stream_chat() to:
1. Map developer -> system (OpenAI Responses API compat)
2. Merge consecutive same-role messages (alternating-role requirement)

This matches the normalization already done in BatchedEngine paths
(PR waybarrios#165) but was missing from SimpleEngine's MTP text path.

Many CLIs (OpenCode, Qwen Code, Kilo) send system messages mid-conversation.
The Qwen 3.5 chat template enforces system messages at position [0], causing
TemplateError crashes and dropped connections (2600+ occurrences in logs).

Changes:
- _normalize_messages() now hoists all system messages to position [0] and
  merges their content, after the existing role-mapping and same-role merge
- Added _normalize_messages() call to SimpleEngine.chat() (non-streaming
  path was missing it)

The condition only triggers when system messages are out of position
(len > 1 or merged[0] != system), so well-ordered messages pass through
unchanged.

Add _hoist_system_messages() to mllm.py for the MLLM-internal
get_chat_template calls. These go through mlx_vlm's template
application which also enforces system-at-position-[0].

Defense-in-depth: SimpleEngine already normalizes before calling
MLLM methods, but direct MLLM usage (benchmarks, tests) and
edge cases during startup now also handle out-of-order system
messages correctly.

SimpleEngine.chat() lacked the MLLM+MTP per-request routing that
stream_chat() already had. Text-only requests via non-streaming API
went to mlx_vlm MLLM path, causing 160GB Metal buffer allocation
attempts and server crashes.

Routes text-only chat() requests through _stream_generate_text()
(MTP path) matching stream_chat() behavior.

Extract _normalize_messages() to shared message_utils.py module.
Add calls in BatchedEngine.chat() and BatchedEngine.stream_chat().

Qwen 3.5 templates reject developer role, consecutive same-role
messages, and system messages out of position [0].

Build mlx_lm TextModel with MTP weights from VLM backbone at load time
(zero-copy weight sharing). Route text-only requests to TextModel with
MTP speculative decoding, media requests to MLLM path.

Dual-backend architecture: MLLMScheduler for media, direct TextModel
generation with asyncio.Lock for text. Metal serialization prevents
concurrent GPU access.

Depends on PR waybarrios#171 (text_model_from_vlm.py).

Add specprefill_enabled, specprefill_draft_model_path,
specprefill_threshold, and specprefill_keep_pct to load_model()
and SimpleEngine.__init__. Previously these were parsed from the CLI
but never passed to the engine — reading _specprefill_threshold or
_specprefill_keep_pct would raise AttributeError if triggered.
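A sketch of the plumbing fix: thread the parsed CLI values through `load_model()` into the engine so the attributes exist when the dispatch path reads them. Names mirror the commit message; the signatures and default values are illustrative assumptions:

```python
# Illustrative sketch only — the real SimpleEngine takes many more
# arguments, and these defaults are assumed, not taken from the code.

class SimpleEngine:
    def __init__(self, specprefill_enabled=False,
                 specprefill_draft_model_path=None,
                 specprefill_threshold=8192,
                 specprefill_keep_pct=0.3):
        # Before the fix these were never set, so any specprefill-gated
        # code path raised AttributeError.
        self._specprefill_enabled = specprefill_enabled
        self._specprefill_draft_model_path = specprefill_draft_model_path
        self._specprefill_threshold = specprefill_threshold
        self._specprefill_keep_pct = specprefill_keep_pct

def load_model(**kwargs):
    """Forward specprefill config from CLI parsing to the engine."""
    return SimpleEngine(
        specprefill_enabled=kwargs.get("specprefill_enabled", False),
        specprefill_draft_model_path=kwargs.get("specprefill_draft_model_path"),
        specprefill_threshold=kwargs.get("specprefill_threshold", 8192),
        specprefill_keep_pct=kwargs.get("specprefill_keep_pct", 0.3),
    )
```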

On the first text-only request, the engine prefills the system prompt
tokens and snapshots the backbone KV state. Subsequent requests with
the same system prompt restore the snapshot and prefill only the suffix
tokens, saving ~57s per request on 122B.

Cache hit path: restores KV snapshot into fresh cache, passes suffix
tokens + primed prompt_cache to mlx_lm stream_generate with MTP.

Cache miss path: runs full generation, then separately prefills system
prefix on a fresh cache and saves the snapshot for next time.

Handles both KVCache (Qwen 3.5) and ArraysCache (hybrid) state formats.
System prefix boundary detected via ChatML <|im_start|>user marker.
Stats exposed via get_stats() for /v1/status visibility.
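The snapshot scheme can be sketched as below. The ChatML `<|im_start|>user` boundary marker is from the commit; the cache keying, stats fields, and snapshot payload are simplified assumptions — the real engine snapshots MLX KV tensors, not strings:

```python
# Hypothetical sketch of the system-prompt KV snapshot scheme.

USER_MARKER = "<|im_start|>user"

def split_system_prefix(prompt: str):
    """Split a ChatML prompt at the first user turn: (prefix, suffix)."""
    idx = prompt.find(USER_MARKER)
    if idx == -1:
        return "", prompt
    return prompt[:idx], prompt[idx:]

class SystemKVCache:
    def __init__(self):
        self._snapshots = {}  # system prefix -> opaque KV snapshot
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt):
        """Hit: return (snapshot, suffix) so only suffix tokens prefill.
        Miss: return (None, full prompt); caller runs a full prefill."""
        prefix, suffix = split_system_prefix(prompt)
        snap = self._snapshots.get(prefix)
        if snap is not None:
            self.hits += 1
            return snap, suffix
        self.misses += 1
        return None, prompt

    def store(self, prompt, snapshot):
        """After a miss, save the system-prefix KV state for next time."""
        prefix, _ = split_system_prefix(prompt)
        if prefix:
            self._snapshots[prefix] = snapshot
```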

Load draft model at init, dispatch specprefill for prompts exceeding
threshold. Uses score_tokens/select_chunks/sparse_prefill from
specprefill.py module. Composes with system KV cache: when cache hits,
sparse-prefills only the suffix with position_offset. Per-request API:
extra_body specprefill/specprefill_keep_pct. Graceful fallback to
normal MTP path on any error.

SpecPrefill generates autoregressively (no MTP) because sparse cache
is incompatible with MTP speculative decoding.

_SPECPREFILL_MAX_TOKENS = 196608 (supports up to 192K context).
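The dispatch and chunk-selection rules can be sketched as follows. `_SPECPREFILL_MAX_TOKENS` and the 8K default threshold come from this PR's description; the keep-fraction selection is a simplified stand-in for `score_tokens`/`select_chunks` in `specprefill.py`:

```python
_SPECPREFILL_MAX_TOKENS = 196_608  # supports up to 192K context

def should_specprefill(num_prompt_tokens, enabled=True, threshold=8192):
    """Dispatch rule sketched from the description: specprefill only for
    long prompts, never beyond the supported context length."""
    return enabled and threshold < num_prompt_tokens <= _SPECPREFILL_MAX_TOKENS

def select_keep(scores, keep_pct=0.3):
    """Keep the top keep_pct of token positions by draft score, returned
    in position order so the sparse prefill stays causal."""
    k = max(1, int(len(scores) * keep_pct))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Per the PR, `keep_pct` is overridable per request via `extra_body` (0.1–1.0), and any error in this path falls back to the normal MTP generation path.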

@Thump604

Update: SpecPrefill support added (commit 1244d93)

  • Load draft model at init alongside TextModel
  • Dispatch specprefill for prompts exceeding threshold (default 8K tokens)
  • Composes with system KV cache: cache HIT → score only suffix tokens
  • Per-request API: extra_body: {specprefill: true/false, specprefill_keep_pct: 0.1-1.0}
  • _SPECPREFILL_MAX_TOKENS raised to 196608 (was 65536)
  • Graceful fallback to normal generation on any error
  • specprefill.py module included (from feat/specprefill branch)

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 22, 2026

Port SimpleEngine features to BatchedEngine for continuous batching:

- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system,
  merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180, waybarrios#97).

@Thump604

Rebased as fresh branch against current main (after #180, #97 merges). Reopening as new PR.

