
feat: add KV prefix cache for MLLM text-only requests #259

Closed

janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/mllm-prefix-cache

Conversation

@janhilgard
Collaborator

Summary

  • Add prefix caching to MLLM batch generator to reuse KV states between text-only requests sharing the same prompt prefix (system prompt, conversation history)
  • Multimodal requests (with images/videos) are excluded since identical input_ids can correspond to different images
  • Uses existing MemoryAwarePrefixCache with exact, prefix, LCP, and supersequence matching
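For intuition, here is a minimal sketch of how the four match types relate a cached entry's tokens to an incoming prompt. The function is purely illustrative and is not the MemoryAwarePrefixCache API:

```python
# Illustrative only: how exact / prefix / supersequence / LCP matches
# relate a cached token sequence to a new prompt. Hypothetical helper,
# not the actual MemoryAwarePrefixCache interface.
def classify_match(cached: list[int], prompt: list[int]) -> str:
    if cached == prompt:
        return "exact"           # cached KV covers the whole prompt
    if len(cached) < len(prompt) and prompt[: len(cached)] == cached:
        return "prefix"          # reuse KV, prefill only the remaining tokens
    if len(cached) > len(prompt) and cached[: len(prompt)] == prompt:
        return "supersequence"   # cached KV contains the prompt; trim and reuse
    lcp = 0                      # longest common prefix: reuse shared tokens
    for a, b in zip(cached, prompt):
        if a != b:
            break
        lcp += 1
    return f"lcp:{lcp}" if lcp else "miss"
```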

How it works

  1. During preprocessing, requests are tagged is_text_only based on whether any images or videos are attached
  2. In _process_prompts(), text-only requests check the prefix cache before running the full VLM forward pass (see the sketch after this list):
    • Prefix/LCP match: Run language model on remaining tokens only (skip shared prefix)
    • Exact/supersequence match: Run language model on last token to get logits
    • Miss: Full VLM forward pass (existing behavior)
  3. On request completion, KV cache is extracted and stored (trimmed to prompt-only tokens)
  4. Cache stats exposed via /v1/status endpoint (memory_aware_cache section)
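The branching in step 2 might look roughly like the sketch below. Every name (cache.lookup, match.kind, match.length, the request fields) is a placeholder; the actual _process_prompts() operates on batched requests and BatchKVCache objects:

```python
# Placeholder sketch of the prompt-processing dispatch (steps 2-3 above).
def process_prompt(request, cache, language_model, vlm):
    if not request.is_text_only:
        return vlm.forward(request)            # multimodal: never cached

    match = cache.lookup(request.input_ids)    # hypothetical lookup API
    if match is None:
        logits, kv = vlm.forward(request)      # miss: full VLM prefill
    elif match.kind in ("exact", "supersequence"):
        # KV already covers the prompt: forward only the last token
        # to obtain logits for sampling.
        logits, kv = language_model.forward(request.input_ids[-1:], kv=match.kv)
    else:
        # prefix / LCP match: skip the shared prefix, prefill the tail.
        tail = request.input_ids[match.length:]
        logits, kv = language_model.forward(tail, kv=match.kv)

    request.kv = kv  # trimmed to prompt tokens and stored on completion (step 3)
    return logits
```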

Configuration

Enabled by default. New MLLMSchedulerConfig fields (example usage below):

  • enable_prefix_cache: bool = True — toggle prefix cache
  • prefix_cache_memory_mb: Optional[int] = None — memory limit (auto-detect if None)
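Assuming a dataclass-style config (the import path below is a guess from the class name; only the two field names are from this PR), usage would look like:

```python
from scheduler import MLLMSchedulerConfig  # hypothetical import path

config = MLLMSchedulerConfig(
    enable_prefix_cache=True,     # default; set False to disable caching
    prefix_cache_memory_mb=4096,  # cap the cache at 4 GB; None = auto-detect
)
```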

Measured speedups (Apple M3 Ultra, text-only requests)

Model                     Cold     Cached   Speedup
Gemma 4 26B-A4B (MoE)     497ms    255ms    1.94x
Gemma 4 31B (Dense)       1055ms   691ms    1.52x
Gemma 4 26B Uncensored    651ms    302ms    2.15x
Qwen3.5-27B               871ms    326ms    2.67x

Test plan

  • Text-only requests: second identical prompt hits cache (entry_count > 0, hits > 0)
  • Text-only requests: prefix match works (shared system prompt, different user message)
  • Multimodal requests: NOT cached (entry_count unchanged after image request)
  • Concurrent requests: no race conditions with BatchKVCache merge
  • /v1/status shows correct memory_aware_cache stats (see the check sketched after this list)
  • Non-MLLM models unaffected (prefix cache only in MLLM path)
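The status checks above can be scripted against a running server. The host/port and the exact field layout inside memory_aware_cache are assumptions:

```python
# Quick check that the prefix cache is populated and serving hits.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/status") as resp:
    status = json.load(resp)

stats = status["memory_aware_cache"]  # assumed JSON layout
assert stats["entry_count"] > 0, "nothing was cached"
assert stats["hits"] > 0, "repeated prompt did not hit the cache"
print(stats)
```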

Note: This PR depends on #256 for correct Gemma 4 generation with BatchKVCache. Without #256, Gemma 4 models produce repetitive output which masks the cache benefit.

🤖 Generated with Claude Code

Add prefix caching support to the MLLM batch generator to reuse KV
states between requests sharing the same prompt prefix (system prompt,
conversation history). This significantly reduces prefill latency for
repeated/similar prompts on MLLM-mode servers.

Key changes:
- Text-only requests (no images/videos) are eligible for prefix cache
- Uses existing MemoryAwarePrefixCache with exact/prefix/LCP matching
- Cache is stored on request completion (before batch.filter)
- Multimodal requests are explicitly excluded from caching since
  identical input_ids can correspond to different images
- Enabled by default; configurable via MLLMSchedulerConfig
- Stats exposed via /v1/status endpoint (memory_aware_cache section)

Measured speedups on Apple M3 Ultra (text-only requests):
- Gemma 4 26B-A4B: 1.94x (497ms → 255ms)
- Gemma 4 31B: 1.52x (1055ms → 691ms)
- Qwen3.5-27B: 2.67x (871ms → 326ms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard requested a review from waybarrios April 6, 2026 12:34
@janhilgard
Collaborator Author

@Thump604 Superseded — KV prefix cache (MemoryAwarePrefixCache) for MLLM is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026