
feat: MTP per-request routing + system KV cache in BatchedEngine#192

Closed

Thump604 wants to merge 12 commits into waybarrios:main from Thump604:feat/batched-engine-parity

Conversation


@Thump604 Thump604 commented Mar 21, 2026

Port MTP routing, SpecPrefill, and system KV caching to BatchedEngine for continuous batching parity with SimpleEngine.

What:

  • Build TextModel from VLM backbone at load time (zero-copy)
  • Text-only → TextModel with MTP, media → MLLM (same routing as SimpleEngine)
  • SpecPrefill as pre-scheduler preprocessing (draft scoring → sparse tokens → batch queue)
  • System prompt KV caching via ChatML boundary detection
  • Fix specprefill CLI pipeline (args were parsed but never passed to engine)

Results (M2 Ultra 128GB, Qwen3.5-122B):

  • 5 concurrent x 1024 tokens in 8s (was 29s with single-request lock)
  • System KV cache: 5.9s cold → 0.8s cached (7.3x)
  • Zero crashes under concurrent load (MLX thread-local-streams #3281)

Depends on: #171, #180

Files: engine/batched.py, server.py, cli.py, api/models.py, specprefill.py, message_utils.py

Thump604 added 10 commits March 17, 2026 11:12

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

Multi-turn conversations with tool results can produce message lists
where system messages appear out of order or consecutive messages share
the same role. Qwen 3.5's chat template rejects these with "System
message must be at the beginning", crashing CLI agents on turn 2.

Add _normalize_messages() to SimpleEngine.stream_chat() to:
1. Map developer -> system (OpenAI Responses API compat)
2. Merge consecutive same-role messages (alternating-role requirement)

This matches the normalization already done in BatchedEngine paths
(PR waybarrios#165) but was missing from SimpleEngine's MTP text path.

Many CLIs (OpenCode, Qwen Code, Kilo) send system messages mid-conversation.
The Qwen 3.5 chat template enforces system messages at position [0], causing
TemplateError crashes and dropped connections (2600+ occurrences in logs).

Changes:
- _normalize_messages() now hoists all system messages to position [0] and
  merges their content, after the existing role-mapping and same-role merge
- Added _normalize_messages() call to SimpleEngine.chat() (non-streaming
  path was missing it)

The condition only triggers when system messages are out of position
(len > 1 or merged[0] != system), so well-ordered messages pass through
unchanged.

Add _hoist_system_messages() to mllm.py for the MLLM-internal
get_chat_template calls. These go through mlx_vlm's template
application which also enforces system-at-position-[0].

Defense-in-depth: SimpleEngine already normalizes before calling
MLLM methods, but direct MLLM usage (benchmarks, tests) and
edge cases during startup now also handle out-of-order system
messages correctly.

SimpleEngine.chat() lacked the MLLM+MTP per-request routing that
stream_chat() already had. Text-only requests via non-streaming API
went to mlx_vlm MLLM path, causing 160GB Metal buffer allocation
attempts and server crashes.

Routes text-only chat() requests through _stream_generate_text()
(MTP path) matching stream_chat() behavior.

Extract _normalize_messages() to shared message_utils.py module.
Add calls in BatchedEngine.chat() and BatchedEngine.stream_chat().

Qwen 3.5 templates reject developer role, consecutive same-role
messages, and system messages out of position [0].

Build mlx_lm TextModel with MTP weights from VLM backbone at load time
(zero-copy weight sharing). Route text-only requests to TextModel with
MTP speculative decoding, media requests to MLLM path.

Dual-backend architecture: MLLMScheduler for media, direct TextModel
generation with asyncio.Lock for text. Metal serialization prevents
concurrent GPU access.

Depends on PR waybarrios#171 (text_model_from_vlm.py).

Add specprefill_enabled, specprefill_draft_model_path,
specprefill_threshold, and specprefill_keep_pct to load_model()
and SimpleEngine.__init__. Previously these were parsed from the CLI
but never passed to the engine — reading _specprefill_threshold or
_specprefill_keep_pct would raise AttributeError if triggered.
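A sketch of the plumbing fix: thread the parsed CLI values through `load_model()` into the engine so the attributes exist when the dispatch path reads them. Names mirror the commit message; the signatures and default values are illustrative assumptions:

```python
# Illustrative sketch only — the real SimpleEngine takes many more
# arguments, and these defaults are assumed, not taken from the code.

class SimpleEngine:
    def __init__(self, specprefill_enabled=False,
                 specprefill_draft_model_path=None,
                 specprefill_threshold=8192,
                 specprefill_keep_pct=0.3):
        # Before the fix these were never set, so any specprefill-gated
        # code path raised AttributeError.
        self._specprefill_enabled = specprefill_enabled
        self._specprefill_draft_model_path = specprefill_draft_model_path
        self._specprefill_threshold = specprefill_threshold
        self._specprefill_keep_pct = specprefill_keep_pct

def load_model(**kwargs):
    """Forward specprefill config from CLI parsing to the engine."""
    return SimpleEngine(
        specprefill_enabled=kwargs.get("specprefill_enabled", False),
        specprefill_draft_model_path=kwargs.get("specprefill_draft_model_path"),
        specprefill_threshold=kwargs.get("specprefill_threshold", 8192),
        specprefill_keep_pct=kwargs.get("specprefill_keep_pct", 0.3),
    )
```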

On the first text-only request, the engine prefills the system prompt
tokens and snapshots the backbone KV state. Subsequent requests with
the same system prompt restore the snapshot and prefill only the suffix
tokens, saving ~57s per request on 122B.

Cache hit path: restores KV snapshot into fresh cache, passes suffix
tokens + primed prompt_cache to mlx_lm stream_generate with MTP.

Cache miss path: runs full generation, then separately prefills system
prefix on a fresh cache and saves the snapshot for next time.

Handles both KVCache (Qwen 3.5) and ArraysCache (hybrid) state formats.
System prefix boundary detected via ChatML <|im_start|>user marker.
Stats exposed via get_stats() for /v1/status visibility.
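The snapshot scheme can be sketched as below. The ChatML `<|im_start|>user` boundary marker is from the commit; the cache keying, stats fields, and snapshot payload are simplified assumptions — the real engine snapshots MLX KV tensors, not strings:

```python
# Hypothetical sketch of the system-prompt KV snapshot scheme.

USER_MARKER = "<|im_start|>user"

def split_system_prefix(prompt: str):
    """Split a ChatML prompt at the first user turn: (prefix, suffix)."""
    idx = prompt.find(USER_MARKER)
    if idx == -1:
        return "", prompt
    return prompt[:idx], prompt[idx:]

class SystemKVCache:
    def __init__(self):
        self._snapshots = {}  # system prefix -> opaque KV snapshot
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt):
        """Hit: return (snapshot, suffix) so only suffix tokens prefill.
        Miss: return (None, full prompt); caller runs a full prefill."""
        prefix, suffix = split_system_prefix(prompt)
        snap = self._snapshots.get(prefix)
        if snap is not None:
            self.hits += 1
            return snap, suffix
        self.misses += 1
        return None, prompt

    def store(self, prompt, snapshot):
        """After a miss, save the system-prefix KV state for next time."""
        prefix, _ = split_system_prefix(prompt)
        if prefix:
            self._snapshots[prefix] = snapshot
```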

Load draft model at init, dispatch specprefill for prompts exceeding
threshold. Uses score_tokens/select_chunks/sparse_prefill from
specprefill.py module. Composes with system KV cache: when cache hits,
sparse-prefills only the suffix with position_offset. Per-request API:
extra_body specprefill/specprefill_keep_pct. Graceful fallback to
normal MTP path on any error.

SpecPrefill generates autoregressively (no MTP) because sparse cache
is incompatible with MTP speculative decoding.

_SPECPREFILL_MAX_TOKENS = 196608 (supports up to 192K context).
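The dispatch and chunk-selection rules can be sketched as follows. `_SPECPREFILL_MAX_TOKENS` and the 8K default threshold come from this PR's description; the keep-fraction selection is a simplified stand-in for `score_tokens`/`select_chunks` in `specprefill.py`:

```python
_SPECPREFILL_MAX_TOKENS = 196_608  # supports up to 192K context

def should_specprefill(num_prompt_tokens, enabled=True, threshold=8192):
    """Dispatch rule sketched from the description: specprefill only for
    long prompts, never beyond the supported context length."""
    return enabled and threshold < num_prompt_tokens <= _SPECPREFILL_MAX_TOKENS

def select_keep(scores, keep_pct=0.3):
    """Keep the top keep_pct of token positions by draft score, returned
    in position order so the sparse prefill stays causal."""
    k = max(1, int(len(scores) * keep_pct))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Per the PR, `keep_pct` is overridable per request via `extra_body` (0.1–1.0), and any error in this path falls back to the normal MTP generation path.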

@Thump604

Update: SpecPrefill support added (commit 1244d93)

  • Load draft model at init alongside TextModel
  • Dispatch specprefill for prompts exceeding threshold (default 8K tokens)
  • Composes with system KV cache: cache HIT → score only suffix tokens
  • Per-request API: extra_body: {specprefill: true/false, specprefill_keep_pct: 0.1-1.0}
  • _SPECPREFILL_MAX_TOKENS raised to 196608 (was 65536)
  • Graceful fallback to normal generation on any error
  • specprefill.py module included (from feat/specprefill branch)

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 22, 2026

Port SimpleEngine features to BatchedEngine for continuous batching:

- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system,
  merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180, waybarrios#97).

@Thump604

Rebased as fresh branch against current main (after #180, #97 merges). Reopening as new PR.

