feat: system prompt KV caching for SimpleEngine by Thump604 · Pull Request #175 · waybarrios/vllm-mlx

Thump604 · 2026-03-18T00:11:48Z

Summary

Persist backbone KV cache after prefilling system prompt tokens in SimpleEngine's MTP text path
On subsequent requests with the same system prompt, restore the snapshot and only prefill the suffix (user + history) tokens
Saves ~57 seconds per request on 122B MoE with a 10K-token system prompt (coding assistant with 13 tools)

How it works

Detect system prefix via ChatML boundary markers (<|im_start|>user\n)
Hash prefix text for cache key validation
Cache MISS: prefill system tokens through model, snapshot backbone KV state (immutable MLX arrays)
Cache HIT: restore snapshot into fresh cache objects, pass only suffix tokens to stream_generate with pre-populated prompt_cache
Token prefix validation ensures tokenization splits cleanly at message boundaries
Single-entry cache (one system prompt at a time) — sufficient for single-user SimpleEngine
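The steps above can be sketched in plain Python. This is a minimal illustration, not the PR's code: `split_system_prefix`, `hash_prefix`, `SystemKVCache`, and `prepare` are hypothetical names, the SHA-256 choice is an assumption, and the KV snapshot is an opaque value returned by a caller-supplied `prefill` function rather than real MLX arrays.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Optional

USER_MARKER = "<|im_start|>user\n"  # ChatML boundary named in the PR


def split_system_prefix(prompt: str) -> tuple[str, str]:
    """Split a rendered ChatML prompt just before the first user turn."""
    idx = prompt.find(USER_MARKER)
    if idx <= 0:
        return "", prompt  # no system prefix to cache
    return prompt[:idx], prompt[idx:]


def hash_prefix(prefix: str) -> str:
    # Assumption: any stable digest works as a cache key; SHA-256 is used here.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


@dataclass
class Snapshot:
    prefix_hash: str
    prefix_tokens: list[int]
    kv_state: Any  # stands in for the backbone KV arrays


class SystemKVCache:
    """Single-entry cache: storing a new system prompt evicts the old one."""

    def __init__(self) -> None:
        self.entry: Optional[Snapshot] = None

    def lookup(self, prefix: str, tokens: list[int]) -> Optional[Snapshot]:
        e = self.entry
        if e is None or e.prefix_hash != hash_prefix(prefix):
            return None
        # Token prefix validation: the cached tokens must be an exact
        # prefix of the full prompt's tokens.
        if tokens[: len(e.prefix_tokens)] != e.prefix_tokens:
            return None
        return e

    def store(self, prefix: str, prefix_tokens: list[int], kv_state: Any) -> None:
        self.entry = Snapshot(hash_prefix(prefix), prefix_tokens, kv_state)


def prepare(
    cache: SystemKVCache,
    prompt: str,
    tokenize: Callable[[str], list[int]],
    prefill: Callable[[list[int]], Any],
) -> tuple[str, Any, list[int]]:
    """Return (status, kv_state, tokens still to prefill)."""
    prefix, _ = split_system_prefix(prompt)
    tokens = tokenize(prompt)
    if not prefix:
        return "SKIP", None, tokens
    hit = cache.lookup(prefix, tokens)
    if hit is not None:
        return "HIT", hit.kv_state, tokens[len(hit.prefix_tokens):]
    prefix_tokens = tokenize(prefix)
    if tokens[: len(prefix_tokens)] != prefix_tokens:
        return "SKIP", None, tokens  # tokenization does not split cleanly
    kv_state = prefill(prefix_tokens)
    cache.store(prefix, prefix_tokens, kv_state)
    return "MISS", kv_state, tokens[len(prefix_tokens):]
```

On a HIT only the returned suffix tokens would go to generation, alongside a cache restored from `kv_state`. The "SKIP" fallback for prompts that do not split cleanly is this sketch's choice; the PR does not say what happens in that case.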

Results (M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit)

Metric	Value
System prompt tokens	9,934
Snapshot memory	399 MB
TTFT savings per request	~57s
Request 1 (MISS) total	147.0s
Request 2 (HIT) total	90.4s
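As a quick consistency check on the table, the short script below derives the savings from the two request totals. The per-token snapshot size and the implied prefill rate are back-of-the-envelope figures derived here, not measurements from the PR, and they assume the whole difference between the two requests is system prompt prefill.

```python
# Figures copied from the results table above.
miss_total_s = 147.0
hit_total_s = 90.4
system_tokens = 9_934
snapshot_mb = 399

saved_s = miss_total_s - hit_total_s
saved_pct = 100 * saved_s / miss_total_s

# Derived estimates (assumptions: MB treated as MiB; all saved time is prefill).
kb_per_token = snapshot_mb * 1024 / system_tokens
implied_prefill_tok_s = system_tokens / saved_s

print(f"saved per request: {saved_s:.1f} s ({saved_pct:.1f}% of the MISS total)")
print(f"snapshot size per system token: {kb_per_token:.1f} KB")
print(f"implied prefill rate: {implied_prefill_tok_s:.0f} tok/s")
```

The difference of the two totals is 56.6 s, which matches the reported ~57 s.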

Scope

Only the MTP text path (_stream_generate_text) — the primary production path for MLLM+MTP models
Cache cleared on stop(), invalidated on system prompt change
Stats exposed via get_stats() → system_kv_cache (tokens, hash, memory_mb)
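A rough sketch of how the stats entry and the clear-on-stop behavior could fit together is shown below. `EngineStatsSketch` and `set_snapshot` are hypothetical names, `bytes` objects stand in for the snapshot's MLX arrays, and omitting the `system_kv_cache` key when nothing is cached is an assumption of this sketch, not something the PR states.

```python
from typing import Optional


class EngineStatsSketch:
    """Hypothetical stand-in for the engine's system KV cache bookkeeping."""

    def __init__(self) -> None:
        self._sys_hash: Optional[str] = None
        self._sys_tokens: int = 0
        self._sys_snapshot: Optional[list[bytes]] = None

    def set_snapshot(self, prefix_hash: str, n_tokens: int, buffers: list[bytes]) -> None:
        # Replaces any previous entry (single-entry cache).
        self._sys_hash = prefix_hash
        self._sys_tokens = n_tokens
        self._sys_snapshot = buffers

    def stop(self) -> None:
        # Cache cleared on stop(), as described under Scope.
        self._sys_hash = None
        self._sys_tokens = 0
        self._sys_snapshot = None

    def get_stats(self) -> dict:
        stats: dict = {}
        if self._sys_snapshot is not None:
            nbytes = sum(len(b) for b in self._sys_snapshot)
            stats["system_kv_cache"] = {
                "tokens": self._sys_tokens,
                "hash": self._sys_hash,
                "memory_mb": round(nbytes / 1e6, 1),
            }
        return stats
```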

Depends on

PR #171 (feat: MLLM+MTP per-request routing for text and vision), which adds the _stream_generate_text method this builds on

Test plan

Cache MISS on first request (logged with token count and snapshot size)
Cache HIT on subsequent requests with same system prompt (logged with reused/new token counts)
Cache invalidation on system prompt change
Correct generation output (verified with live OpenCode traffic)
No errors under normal operation
Verify with tools enabled/disabled (both produce correct prefix hash)
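The tools enabled/disabled check can be pictured with the sketch below. It assumes tool definitions are rendered into the system block, so the two configurations give different but individually stable prefix hashes. `render_system_prefix` is a hypothetical stand-in for the model's chat template, and SHA-256 is an assumed hash choice.

```python
import hashlib
import json
from typing import Optional


def render_system_prefix(system: str, tools: Optional[list[dict]] = None) -> str:
    """Hypothetical stand-in for a chat template's system block."""
    body = system
    if tools:
        # sort_keys keeps the rendering deterministic across dict orderings.
        body += "\n\n# Tools\n" + "\n".join(json.dumps(t, sort_keys=True) for t in tools)
    return f"<|im_start|>system\n{body}<|im_end|>\n"


def prefix_hash(prefix: str) -> str:
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


tools = [{"name": "read_file", "parameters": {"path": "string"}}]

hash_without_tools = prefix_hash(render_system_prefix("You are a coding assistant."))
hash_with_tools = prefix_hash(render_system_prefix("You are a coding assistant.", tools))

# Same configuration rendered twice must hash identically (stable cache key),
# while toggling tools must change the key and so invalidate the cache.
hash_with_tools_again = prefix_hash(render_system_prefix("You are a coding assistant.", tools))
```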

When both --mllm and --enable-mtp are set, SimpleEngine builds a parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy). Text-only requests route to mlx_lm with MTP speculative decoding; media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

Persist backbone KV cache after prefilling system prompt tokens. On subsequent requests with the same system prompt, restore the snapshot and only prefill the suffix (user + history) tokens. For a 10K-token system prompt on the 122B model, this saves ~57s per request by avoiding redundant system prompt prefill.

Implementation:
- Detect system prefix via ChatML boundary markers
- Hash prefix text for cache key validation
- On cache miss: prefill system tokens, snapshot backbone KV state
- On cache hit: restore snapshot into fresh cache, send suffix only
- Token prefix validation ensures correct split at tokenization boundary
- Single-entry cache (one system prompt at a time)
- Stats exposed via get_stats() → system_kv_cache
- Cache cleared on stop(), invalidated on system prompt change

Thump604 · 2026-03-21T19:45:28Z

Superseded — native prefix caching in AsyncEngineCore handles this. System KV cache was a workaround for SimpleEngine's lack of prefix caching.

Thump604 added 2 commits March 17, 2026 11:12

Thump604 closed this Mar 21, 2026

Thump604 wants to merge 2 commits into waybarrios:main from Thump604:feat/system-prompt-kv-cache
