
fix: enable prefix caching for multi-turn conversations#3

Merged
waybarrios merged 1 commit into waybarrios:main from krestenkrab:fix-prefix-caching
Jan 14, 2026

Conversation

@krestenkrab
Contributor

Summary

  • Fix prefix caching to work correctly for multi-turn chat conversations
  • Handle an empty `remaining_tokens` list properly when there's an exact cache match
  • Store cache with full token sequence (prompt + output) to enable prefix matching across conversation turns

Problem

Prefix caching was enabled but not actually providing speedups because:

  1. An empty list `[]` is falsy in Python, so `remaining_tokens or prompt_token_ids` always returned all tokens for exact matches
  2. The cache was stored with only the prompt tokens as the key, but the KV cache included generated tokens, causing an offset mismatch
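The falsy-empty-list pitfall can be reproduced in isolation (the variable names below mirror the PR description; this is a standalone demonstration, not the project's code):

```python
# An empty list is falsy in Python, so `or` falls through to the
# right-hand operand even on an exact cache match.
prompt_token_ids = [101, 202, 303, 404]
remaining_tokens = []  # exact cache hit: nothing left to prefill

# The buggy expression reprocesses the entire prompt anyway.
tokens_to_process = remaining_tokens or prompt_token_ids
assert tokens_to_process == [101, 202, 303, 404]  # all 4 tokens, not 0
```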

Solution

  1. Explicitly check for an empty `remaining_tokens` list and pass only the last token to `BatchGenerator` for exact matches
  2. Store the cache keyed by the full token sequence (prompt + output) so the next turn's prompt (which includes the previous response) matches the cached prefix

Test Results

| Scenario | Before | After | Speedup |
|---|---|---|---|
| Multi-turn (298→342 tokens) | 3.45s | 1.75s | 2x |
| Exact match (short) | 0.78s | 0.55s | 1.4x |
| Long prompt (800+ tokens) | 5.3s | 0.2s | 26x |

Two changes to make prefix caching work correctly:

1. Handle empty `remaining_tokens` list properly (line 451):
   When there's an exact cache match, `remaining_tokens=[]`, but Python
   treats `[]` as falsy, so `[] or prompt_token_ids` returns all tokens.
   Fix: explicitly check for the empty list and pass only the last token
   so `BatchGenerator` can start generation without reprocessing.
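A minimal sketch of this first change (the function and parameter names here are illustrative, not the project's actual API):

```python
def tokens_for_generator(remaining_tokens, prompt_token_ids):
    # On an exact cache match, the KV cache already covers the whole
    # prompt, so feed only the final token to start decoding instead
    # of falling back to the full prompt via a bare `or`.
    if len(remaining_tokens) == 0:
        return prompt_token_ids[-1:]
    return remaining_tokens

# Exact match: only the last token is passed on.
assert tokens_for_generator([], [101, 202, 303]) == [303]
# Partial match: the uncached suffix is processed as before.
assert tokens_for_generator([202, 303], [101, 202, 303]) == [202, 303]
```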

2. Store cache with full token sequence (prompt + output):
   For multi-turn chat, we want to cache the entire conversation.
   The next turn's prompt includes the previous response, so storing
   with full sequence enables prefix matching across turns.
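The storage-side change can be sketched as follows (a toy dict-based cache with hypothetical names, standing in for the real KV-cache machinery):

```python
prefix_cache = {}  # full token sequence -> cached KV state

def store_after_generation(prompt_ids, output_ids, kv_state):
    # Key by prompt + output so the next turn's prompt, which embeds
    # the previous response, can match this entry as a prefix.
    prefix_cache[tuple(prompt_ids + output_ids)] = kv_state

def longest_cached_prefix(prompt_ids):
    # Return the longest cached sequence that is a prefix of the new prompt.
    best = ()
    for cached in prefix_cache:
        if len(cached) > len(best) and tuple(prompt_ids[:len(cached)]) == cached:
            best = cached
    return best

# Turn 1: prompt [1, 2] produced output [3, 4]; cache the full sequence.
store_after_generation([1, 2], [3, 4], kv_state="turn1")
# Turn 2's prompt includes the previous response, so it hits the cache.
assert longest_cached_prefix([1, 2, 3, 4, 5, 6]) == (1, 2, 3, 4)
```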

Testing shows:
- Multi-turn conversations: 2x speedup (3.45s -> 1.75s)
- Exact match requests: 1.4x speedup on short prompts
- Long prompts (800+ tokens): up to 26x speedup
@waybarrios
Owner

Verification Results

I tested this PR on M4 Max (128GB) and everything looks good.

Test Suite Results

pytest tests/ -v
================================
243 passed, 5 skipped, 8 failed, 15 deselected
================================

Note: The 8 failing tests in test_server.py are pre-existing import issues (also fail on main) and are not related to this PR.

Benchmark Results

All benchmarks pass successfully with the prefix caching fix:

LLM Performance

| Model | Gen Speed | TTFT | Memory |
|---|---|---|---|
| Qwen3-0.6B-8bit | 404.4 tok/s | 64.3 ms | 0.67 GB |
| Llama-3.2-1B-Instruct-4bit | 456.3 tok/s | 48.7 ms | 0.67 GB |
| Llama-3.2-3B-Instruct-4bit | 202.3 tok/s | 63.6 ms | 1.79 GB |

Paged Cache Test (20 requests, 2 rounds)

Without paged cache: 682.4 tok/s
With paged cache:    758.2 tok/s
Speedup: 1.11x

Cache hits: 10 (all Round 2 requests)
Tokens saved: 2,560 (~256 tokens × 10 requests)

Continuous Batching Test

5 concurrent requests
Throughput: 1121.2 tok/s
Requests/sec: 17.80

Code Review

The fix correctly addresses the issues described in the PR:

  1. Empty list handling - The fix properly checks `len(request.remaining_tokens) == 0` instead of relying on Python's falsy empty-list behavior
  2. Full token sequence caching - Storing prompt + output tokens enables proper prefix matching for multi-turn conversations

The implementation is clean and well-documented. Ready to merge.

@waybarrios waybarrios merged commit d34d5de into waybarrios:main Jan 14, 2026
WainWong pushed a commit to WainWong/vllm-mlx that referenced this pull request Mar 2, 2026
…mple-engine

feat: Prompt cache for SimpleEngine + tool logits safety
