
feat: add KV prefix cache for MLLM text-only requests #259

Closed

janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/mllm-prefix-cache

Conversation

@janhilgard
Collaborator

Summary

  • Add prefix caching to MLLM batch generator to reuse KV states between text-only requests sharing the same prompt prefix (system prompt, conversation history)
  • Multimodal requests (with images/videos) are excluded since identical input_ids can correspond to different images
  • Uses existing MemoryAwarePrefixCache with exact, prefix, LCP, and supersequence matching
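For intuition, here is a minimal sketch of how the four match types relate a cached entry's tokens to an incoming prompt. The function is purely illustrative and is not the MemoryAwarePrefixCache API:

```python
# Illustrative only: how exact / prefix / supersequence / LCP matches
# relate a cached token sequence to a new prompt. Hypothetical helper,
# not the actual MemoryAwarePrefixCache interface.
def classify_match(cached: list[int], prompt: list[int]) -> str:
    if cached == prompt:
        return "exact"           # cached KV covers the whole prompt
    if len(cached) < len(prompt) and prompt[: len(cached)] == cached:
        return "prefix"          # reuse KV, prefill only the remaining tokens
    if len(cached) > len(prompt) and cached[: len(prompt)] == prompt:
        return "supersequence"   # cached KV contains the prompt; trim and reuse
    lcp = 0                      # longest common prefix: reuse shared tokens
    for a, b in zip(cached, prompt):
        if a != b:
            break
        lcp += 1
    return f"lcp:{lcp}" if lcp else "miss"
```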

How it works

  1. During preprocessing, requests are tagged is_text_only based on whether any images or videos are attached
  2. In _process_prompts(), text-only requests check the prefix cache before running the full VLM forward pass (see the sketch after this list):
    • Prefix/LCP match: Run language model on remaining tokens only (skip shared prefix)
    • Exact/supersequence match: Run language model on last token to get logits
    • Miss: Full VLM forward pass (existing behavior)
  3. On request completion, KV cache is extracted and stored (trimmed to prompt-only tokens)
  4. Cache stats exposed via /v1/status endpoint (memory_aware_cache section)
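The branching in step 2 might look roughly like the sketch below. Every name (cache.lookup, match.kind, match.length, the request fields) is a placeholder; the actual _process_prompts() operates on batched requests and BatchKVCache objects:

```python
# Placeholder sketch of the prompt-processing dispatch (steps 2-3 above).
def process_prompt(request, cache, language_model, vlm):
    if not request.is_text_only:
        return vlm.forward(request)            # multimodal: never cached

    match = cache.lookup(request.input_ids)    # hypothetical lookup API
    if match is None:
        logits, kv = vlm.forward(request)      # miss: full VLM prefill
    elif match.kind in ("exact", "supersequence"):
        # KV already covers the prompt: forward only the last token
        # to obtain logits for sampling.
        logits, kv = language_model.forward(request.input_ids[-1:], kv=match.kv)
    else:
        # prefix / LCP match: skip the shared prefix, prefill the tail.
        tail = request.input_ids[match.length:]
        logits, kv = language_model.forward(tail, kv=match.kv)

    request.kv = kv  # trimmed to prompt tokens and stored on completion (step 3)
    return logits
```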

Configuration

Enabled by default. New MLLMSchedulerConfig fields (example usage below):

  • enable_prefix_cache: bool = True — toggle prefix cache
  • prefix_cache_memory_mb: Optional[int] = None — memory limit (auto-detect if None)
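Assuming a dataclass-style config (the import path below is a guess from the class name; only the two field names are from this PR), usage would look like:

```python
from scheduler import MLLMSchedulerConfig  # hypothetical import path

config = MLLMSchedulerConfig(
    enable_prefix_cache=True,     # default; set False to disable caching
    prefix_cache_memory_mb=4096,  # cap the cache at 4 GB; None = auto-detect
)
```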

Measured speedups (Apple M3 Ultra, text-only requests)

Model                     Cold     Cached   Speedup
Gemma 4 26B-A4B (MoE)     497ms    255ms    1.94x
Gemma 4 31B (Dense)       1055ms   691ms    1.52x
Gemma 4 26B Uncensored    651ms    302ms    2.15x
Qwen3.5-27B               871ms    326ms    2.67x

Test plan

  • Text-only requests: second identical prompt hits cache (entry_count > 0, hits > 0)
  • Text-only requests: prefix match works (shared system prompt, different user message)
  • Multimodal requests: NOT cached (entry_count unchanged after image request)
  • Concurrent requests: no race conditions with BatchKVCache merge
  • /v1/status shows correct memory_aware_cache stats (see the check sketched after this list)
  • Non-MLLM models unaffected (prefix cache only in MLLM path)
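The status checks above can be scripted against a running server. The host/port and the exact field layout inside memory_aware_cache are assumptions:

```python
# Quick check that the prefix cache is populated and serving hits.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/status") as resp:
    status = json.load(resp)

stats = status["memory_aware_cache"]  # assumed JSON layout
assert stats["entry_count"] > 0, "nothing was cached"
assert stats["hits"] > 0, "repeated prompt did not hit the cache"
print(stats)
```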

Note: This PR depends on #256 for correct Gemma 4 generation with BatchKVCache. Without #256, Gemma 4 models produce repetitive output which masks the cache benefit.

🤖 Generated with Claude Code

Add prefix caching support to the MLLM batch generator to reuse KV
states between requests sharing the same prompt prefix (system prompt,
conversation history). This significantly reduces prefill latency for
repeated/similar prompts on MLLM-mode servers.

Key changes:
- Text-only requests (no images/videos) are eligible for prefix cache
- Uses existing MemoryAwarePrefixCache with exact/prefix/LCP matching
- Cache is stored on request completion (before batch.filter)
- Multimodal requests are explicitly excluded from caching since
  identical input_ids can correspond to different images
- Enabled by default; configurable via MLLMSchedulerConfig
- Stats exposed via /v1/status endpoint (memory_aware_cache section)

Measured speedups on Apple M3 Ultra (text-only requests):
- Gemma 4 26B-A4B: 1.94x (497ms → 255ms)
- Gemma 4 31B: 1.52x (1055ms → 691ms)
- Qwen3.5-27B: 2.67x (871ms → 326ms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard requested a review from waybarrios April 6, 2026 12:34
@janhilgard
Collaborator Author

@Thump604 Superseded — KV prefix cache (MemoryAwarePrefixCache) for MLLM is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026