
fix: enable prefix caching for multi-turn conversations#3

Merged
waybarrios merged 1 commit into waybarrios:main from krestenkrab:fix-prefix-caching
Jan 14, 2026

Conversation

@krestenkrab
Contributor

Summary

  • Fix prefix caching to work correctly for multi-turn chat conversations
  • Handle an empty `remaining_tokens` list properly when there's an exact cache match
  • Store cache with full token sequence (prompt + output) to enable prefix matching across conversation turns

Problem

Prefix caching was enabled but not actually providing speedups because:

  1. An empty list `[]` is falsy in Python, so `remaining_tokens or prompt_token_ids` always returned all tokens for exact matches
  2. The cache was stored with only the prompt tokens as the key, but the KV cache included generated tokens, causing an offset mismatch
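The falsy-empty-list pitfall can be reproduced in isolation (the variable names below mirror the PR description; this is a standalone demonstration, not the project's code):

```python
# An empty list is falsy in Python, so `or` falls through to the
# right-hand operand even on an exact cache match.
prompt_token_ids = [101, 202, 303, 404]
remaining_tokens = []  # exact cache hit: nothing left to prefill

# The buggy expression reprocesses the entire prompt anyway.
tokens_to_process = remaining_tokens or prompt_token_ids
assert tokens_to_process == [101, 202, 303, 404]  # all 4 tokens, not 0
```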

Solution

  1. Explicitly check for an empty `remaining_tokens` list and pass only the last token to `BatchGenerator` for exact matches
  2. Store the cache keyed by the full token sequence (prompt + output) so the next turn's prompt (which includes the previous response) matches the cached prefix

Test Results

| Scenario | Before | After | Speedup |
|---|---|---|---|
| Multi-turn (298→342 tokens) | 3.45s | 1.75s | 2x |
| Exact match (short) | 0.78s | 0.55s | 1.4x |
| Long prompt (800+ tokens) | 5.3s | 0.2s | 26x |

Two changes to make prefix caching work correctly:

1. Handle empty `remaining_tokens` list properly (line 451):
   When there's an exact cache match, `remaining_tokens=[]`, but Python
   treats `[]` as falsy, so `[] or prompt_token_ids` returns all tokens.
   Fix: explicitly check for the empty list and pass only the last token
   so `BatchGenerator` can start generation without reprocessing.
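A minimal sketch of this first change (the function and parameter names here are illustrative, not the project's actual API):

```python
def tokens_for_generator(remaining_tokens, prompt_token_ids):
    # On an exact cache match, the KV cache already covers the whole
    # prompt, so feed only the final token to start decoding instead
    # of falling back to the full prompt via a bare `or`.
    if len(remaining_tokens) == 0:
        return prompt_token_ids[-1:]
    return remaining_tokens

# Exact match: only the last token is passed on.
assert tokens_for_generator([], [101, 202, 303]) == [303]
# Partial match: the uncached suffix is processed as before.
assert tokens_for_generator([202, 303], [101, 202, 303]) == [202, 303]
```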

2. Store cache with full token sequence (prompt + output):
   For multi-turn chat, we want to cache the entire conversation.
   The next turn's prompt includes the previous response, so storing
   with full sequence enables prefix matching across turns.
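The storage-side change can be sketched as follows (a toy dict-based cache with hypothetical names, standing in for the real KV-cache machinery):

```python
prefix_cache = {}  # full token sequence -> cached KV state

def store_after_generation(prompt_ids, output_ids, kv_state):
    # Key by prompt + output so the next turn's prompt, which embeds
    # the previous response, can match this entry as a prefix.
    prefix_cache[tuple(prompt_ids + output_ids)] = kv_state

def longest_cached_prefix(prompt_ids):
    # Return the longest cached sequence that is a prefix of the new prompt.
    best = ()
    for cached in prefix_cache:
        if len(cached) > len(best) and tuple(prompt_ids[:len(cached)]) == cached:
            best = cached
    return best

# Turn 1: prompt [1, 2] produced output [3, 4]; cache the full sequence.
store_after_generation([1, 2], [3, 4], kv_state="turn1")
# Turn 2's prompt includes the previous response, so it hits the cache.
assert longest_cached_prefix([1, 2, 3, 4, 5, 6]) == (1, 2, 3, 4)
```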

Testing shows:
- Multi-turn conversations: 2x speedup (3.45s -> 1.75s)
- Exact match requests: 1.4x speedup on short prompts
- Long prompts (800+ tokens): up to 26x speedup
@waybarrios
Owner

Verification Results

I tested this PR on M4 Max (128GB) and everything looks good.

Test Suite Results

pytest tests/ -v
================================
243 passed, 5 skipped, 8 failed, 15 deselected
================================

Note: The 8 failing tests in test_server.py are pre-existing import issues (also fail on main) and are not related to this PR.

Benchmark Results

All benchmarks pass successfully with the prefix caching fix:

LLM Performance

| Model | Gen Speed | TTFT | Memory |
|---|---|---|---|
| Qwen3-0.6B-8bit | 404.4 tok/s | 64.3 ms | 0.67 GB |
| Llama-3.2-1B-Instruct-4bit | 456.3 tok/s | 48.7 ms | 0.67 GB |
| Llama-3.2-3B-Instruct-4bit | 202.3 tok/s | 63.6 ms | 1.79 GB |

Paged Cache Test (20 requests, 2 rounds)

Without paged cache: 682.4 tok/s
With paged cache:    758.2 tok/s
Speedup: 1.11x

Cache hits: 10 (all Round 2 requests)
Tokens saved: 2,560 (~256 tokens × 10 requests)

Continuous Batching Test

5 concurrent requests
Throughput: 1121.2 tok/s
Requests/sec: 17.80

Code Review

The fix correctly addresses the issues described in the PR:

  1. Empty list handling - The fix properly checks `len(request.remaining_tokens) == 0` instead of relying on Python's falsy empty-list behavior
  2. Full token sequence caching - Storing prompt + output tokens enables proper prefix matching for multi-turn conversations

The implementation is clean and well-documented. Ready to merge.

@waybarrios waybarrios merged commit d34d5de into waybarrios:main Jan 14, 2026
WainWong pushed a commit to WainWong/vllm-mlx that referenced this pull request Mar 2, 2026
…mple-engine

feat: Prompt cache for SimpleEngine + tool logits safety
