Apply LRU policy only to proper cache entries #42656
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Code Review
This pull request optimizes the block freeing process in the KV cache by distinguishing between blocks with and without hashes. Blocks without hashes are now prepended to the free block queue for immediate reallocation, while blocks with hashes are appended to maintain LRU order for potential reuse. To support this, a prepend_n method was added to the free list utility. I have no feedback to provide.
Hi @s3woz, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be
Purpose
During execution, vLLM allocates not only full blocks, which are reused through the caching mechanism, but also partial blocks, which are not assigned a hash and therefore can never be reused by the caching mechanism. Currently the LRU policy treats both types of block equally. This means that partial blocks, which are not in practice "cache" entries, are handled as if they were standard cache entries and evict proper cache entries in LRU order. As a consequence, proper cache entries are evicted earlier than necessary.
This PR applies the LRU policy only to proper cache entries: when blocks are freed, partial blocks are filtered out and inserted at the head of the free block queue instead of the tail, so they are reallocated first. This improves caching behavior.
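The difference between the two freeing policies can be sketched as follows. This is a minimal illustration, not the vLLM implementation: `Block`, `free_blocks_sketch`, and the `lru_only_for_hashed` flag are hypothetical names, and a `collections.deque` stands in for the free block queue, with allocation assumed to pop from the head.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    # Hypothetical stand-in for a KV cache block.
    block_id: int
    block_hash: Optional[str] = None  # None marks a partial (unhashed) block


def free_blocks_sketch(free_queue: deque, blocks: list,
                       lru_only_for_hashed: bool) -> None:
    """Return blocks to the free queue.

    Assumption: allocation pops from the left (head), so blocks near the
    head are reused, and thereby evicted, first.
    """
    if not lru_only_for_hashed:
        # Old behaviour: every freed block goes to the tail.
        free_queue.extend(blocks)
        return
    hashed = [b for b in blocks if b.block_hash is not None]
    unhashed = [b for b in blocks if b.block_hash is None]
    # Hashed blocks keep LRU order at the tail.
    free_queue.extend(hashed)
    # Unhashed blocks go to the head; extendleft reverses its input,
    # so reverse first to preserve their relative order.
    free_queue.extendleft(reversed(unhashed))


def next_allocated_id(lru_only_for_hashed: bool) -> int:
    # Two cached blocks are already free; then a partial block and a
    # full block are freed together.
    queue = deque([Block(0, "h0"), Block(1, "h1")])
    free_blocks_sketch(queue, [Block(2, None), Block(3, "h3")],
                       lru_only_for_hashed)
    return queue.popleft().block_id
```

With the old behaviour the next allocation takes block 0 and evicts a valid cache entry, while with the new behaviour it takes the partial block 2 and leaves the cached blocks untouched.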
In principle this PR affects all models. The test cases below showcase the effect for various architecture types:
ibm-granite/granite-4.0-tiny-preview - 46% speed-up for test case 1 with this PR
SWA: google/gemma-4-31B - 44% speed-up for test case 1 with this PR (update: covered by the later PR "[Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache" #43447, so currently no speed-up here)
Qwen/Qwen2.5-32B - enables a proper cache hit, reducing latency by 97% for prompts A in test case 2 with this PR
@tdoublep
Test Plan
Test case 1:
Test case 2:
Test Result
Test case 1 (ibm-granite/granite-4.0-tiny-preview):
Main: Always cache misses, as the partial blocks of the two prompts evict each other's valid cache entries.
This PR: Cache hits, as LRU is applied only to valid cache blocks.
Test case 1 (google/gemma-4-31B):
Main: Partial blocks of the two prompts evict each other's valid cache entries, reducing the cache hits (the SWA architecture is a bit more robust than Mamba, as some cache entries still partially match).
This PR: Proper full cache hits.
Test case 2 (Qwen/Qwen2.5-32B):
Main: Prompts A - cache miss, as the entire cache is evicted by partial blocks (the vllm:prefix_cache_hits counter doesn't increase beyond 115200).
This PR: Prompts A - cache hit (vllm:prefix_cache_hits increased from 115200 to 129200).