
feat: system prompt KV caching for SimpleEngine#175

Closed
Thump604 wants to merge 2 commits into waybarrios:main from Thump604:feat/system-prompt-kv-cache

Conversation

@Thump604
Collaborator

Summary

  • Persist backbone KV cache after prefilling system prompt tokens in SimpleEngine's MTP text path
  • On subsequent requests with the same system prompt, restore the snapshot and only prefill the suffix (user + history) tokens
  • Saves ~57 seconds per request on 122B MoE with a 10K-token system prompt (coding assistant with 13 tools)

How it works

  1. Detect system prefix via ChatML boundary markers (<|im_start|>user\n)
  2. Hash prefix text for cache key validation
  3. Cache MISS: prefill system tokens through model, snapshot backbone KV state (immutable MLX arrays)
  4. Cache HIT: restore snapshot into fresh cache objects, pass only suffix tokens to stream_generate with pre-populated prompt_cache
  5. Token prefix validation ensures tokenization splits cleanly at message boundaries
  6. Single-entry cache (one system prompt at a time) — sufficient for single-user SimpleEngine
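The steps above can be sketched in plain Python. This is a hypothetical illustration, not the PR's actual code: the class and method names (`SystemKVCache`, `split_prefix`, `lookup`, `is_token_prefix`) are made up here; only the ChatML boundary marker, the hashing idea, and the token-prefix check come from the description.

```python
import hashlib

# ChatML boundary that separates the system prefix from the user turn (step 1).
IM_START_USER = "<|im_start|>user\n"


class SystemKVCache:
    """Single-entry cache: holds at most one system-prompt snapshot (step 6).
    Hypothetical sketch of the flow described in the PR."""

    def __init__(self):
        self.key = None       # hash of the system prefix text (step 2)
        self.tokens = None    # token ids that were prefilled on the last MISS
        self.snapshot = None  # per-layer backbone KV arrays (step 3)

    @staticmethod
    def split_prefix(prompt: str):
        """Split the rendered prompt at the first user-turn boundary (step 1)."""
        idx = prompt.find(IM_START_USER)
        if idx < 0:
            return None, prompt  # no system prefix to cache
        return prompt[:idx], prompt[idx:]

    def lookup(self, prefix: str):
        """Hash the prefix text and report HIT/MISS against the stored key (step 2)."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        return key == self.key, key

    def is_token_prefix(self, full_tokens) -> bool:
        """Step-5 check: the cached token ids must be an exact prefix of the
        freshly tokenized full prompt, or the snapshot cannot be reused."""
        cached = list(self.tokens or [])
        return list(full_tokens[: len(cached)]) == cached
```

On a MISS the engine would prefill the prefix tokens, then store `key`, `tokens`, and the KV `snapshot`; on a HIT it restores the snapshot and prefills only the suffix.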

Results (M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit)

| Metric                   | Value  |
| ------------------------ | ------ |
| System prompt tokens     | 9,934  |
| Snapshot memory          | 399 MB |
| TTFT savings per request | ~57 s  |
| Request 1 (MISS) total   | 147.0 s |
| Request 2 (HIT) total    | 90.4 s |

Scope

  • Only the MTP text path (_stream_generate_text) — the primary production path for MLLM+MTP models
  • Cache cleared on stop(), invalidated on system prompt change
  • Stats exposed via get_stats() → system_kv_cache (tokens, hash, memory_mb)
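A minimal sketch of what the system_kv_cache entry in get_stats() might look like, given the fields named above (tokens, hash, memory_mb). The CacheEntry dataclass and the helper's name are assumptions made for this illustration; only the three stat fields come from the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    """Hypothetical stand-in for the engine's single cached snapshot."""
    key: str              # hash of the system prefix text
    tokens: List[int]     # prefilled token ids
    snapshot_bytes: int   # total size of the stored KV arrays


def system_kv_cache_stats(entry: Optional[CacheEntry]) -> Optional[dict]:
    """Assumed shape of the system_kv_cache block returned by get_stats()."""
    if entry is None:
        return None
    return {
        "tokens": len(entry.tokens),
        "hash": entry.key,
        "memory_mb": round(entry.snapshot_bytes / (1024 * 1024), 1),
    }
```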

Depends on

Test plan

  • Cache MISS on first request (logged with token count and snapshot size)
  • Cache HIT on subsequent requests with same system prompt (logged with reused/new token counts)
  • Cache invalidation on system prompt change
  • Correct generation output (verified with live OpenCode traffic)
  • No errors under normal operation
  • Verify with tools enabled/disabled (both produce correct prefix hash)

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.
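The per-request routing could look roughly like this. `_has_media_content` is named in the PR, but its body here is a guess; the content-part type strings (`image_url`, `video_url`) are assumptions about the OpenAI-style message schema, not confirmed by the source.

```python
def _has_media_content(messages) -> bool:
    """Guess at the check: does any message carry an image/video part?
    Text-only messages have string content; multimodal ones use a list of parts."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") in ("image_url", "video_url"):
                    return True
    return False


def route(messages) -> str:
    # Text-only -> mlx_lm TextModel with MTP speculative decoding;
    # anything with media -> the mlx_vlm MLLM path.
    return "mlx_vlm" if _has_media_content(messages) else "mlx_lm"
```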

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

Persist backbone KV cache after prefilling system prompt tokens.
On subsequent requests with the same system prompt, restore the
snapshot and only prefill the suffix (user + history) tokens.

For a 10K-token system prompt on the 122B model, this saves ~57s
per request by avoiding redundant system prompt prefill.

Implementation:
- Detect system prefix via ChatML boundary markers
- Hash prefix text for cache key validation
- On cache miss: prefill system tokens, snapshot backbone KV state
- On cache hit: restore snapshot into fresh cache, send suffix only
- Token prefix validation ensures correct split at tokenization boundary
- Single-entry cache (one system prompt at a time)
- Stats exposed via get_stats() → system_kv_cache
- Cache cleared on stop(), invalidated on system prompt change
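The snapshot/restore step can be illustrated with a stdlib sketch. Note the assumption: because MLX arrays are immutable, the real snapshot can simply hold references to the per-layer keys/values; this illustration deep-copies plain Python lists to make the "restore into fresh cache objects" point explicit. The function names are hypothetical.

```python
import copy


def snapshot_kv(layer_caches):
    """Capture (keys, values) per layer after the system-prompt prefill.
    With immutable MLX arrays, holding references would suffice; this
    sketch copies so mutation of the live cache cannot touch the snapshot."""
    return [(copy.deepcopy(k), copy.deepcopy(v)) for k, v in layer_caches]


def restore_kv(snapshot):
    """Build fresh per-layer cache state from the snapshot, so the
    generation loop can append suffix-token KV without corrupting it."""
    return [[copy.deepcopy(k), copy.deepcopy(v)] for k, v in snapshot]
```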
@Thump604
Collaborator Author

Superseded — native prefix caching in AsyncEngineCore handles this. System KV cache was a workaround for SimpleEngine's lack of prefix caching.

@Thump604 Thump604 closed this Mar 21, 2026
