Apply LRU policy only to proper cache entries by s3woz · Pull Request #42656 · vllm-project/vllm

s3woz · 2026-05-14T16:01:39Z

Purpose

During execution, vLLM allocates not only full blocks, which are reused through the prefix-caching mechanism, but also partial blocks, which are never assigned a hash and therefore can never be reused by the cache. Currently the LRU policy treats both types of blocks equally. Partial blocks, which are not cache entries in practice, are queued like standard cache entries and push proper cache entries out in LRU order. As a result, proper cache entries are evicted earlier than necessary.
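To make the distinction concrete, a prompt's tokens can be split into full and partial blocks as follows. This is a minimal sketch: the function name and the block size of 16 tokens are illustrative assumptions, not values taken from this PR.

```python
def split_into_blocks(num_tokens: int, block_size: int = 16) -> tuple[int, int]:
    """Return (full_blocks, partial_blocks) needed to hold num_tokens.

    Only full blocks can receive a hash and become cache entries;
    a trailing partial block holds the leftover tokens.
    block_size=16 is an assumed value used for illustration only.
    """
    full, remainder = divmod(num_tokens, block_size)
    return full, (1 if remainder else 0)


# A prompt of 1601 tokens: 100 full (cacheable) blocks plus 1 partial block.
full_blocks, partial_blocks = split_into_blocks(1601)
print(full_blocks, partial_blocks)
```

Every request whose length is not a multiple of the block size therefore holds one block that can never produce a cache hit.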

This PR applies the LRU policy only to proper cache entries: when blocks are freed, partial blocks are filtered out and inserted at the head of the free queue instead of the tail. This improves caching behavior, for example:

in memory-tight scenarios handling a few prompts (see test case 1: partial blocks from two prompts evict each other's valid cache entries)
in memory-abundant scenarios handling a large number of concurrent prompts (see test case 2: partial blocks from numerous concurrent prompts completely evict the entire KV cache on an H100, more than 2k blocks, resulting in a cache miss for a prompt that should have had a cache hit)
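The effect can be reproduced with a toy model of the free queue. This is a sketch under stated assumptions rather than vLLM's implementation: the class and method names are hypothetical, requests run one at a time, each request frees all of its blocks when it finishes, and blocks are freed in reverse order.

```python
from collections import deque


class ToyBlockPool:
    """Toy free queue: the head (left) is allocated, and thus evicted, first."""

    def __init__(self, num_blocks: int, lru_only_for_hashed: bool):
        self.free = deque(range(num_blocks))
        self.hash_of = [None] * num_blocks  # block id -> hash, None if unhashed
        self.cache = {}                     # hash -> block id
        self.lru_only_for_hashed = lru_only_for_hashed

    def _allocate(self, new_hash):
        block = self.free.popleft()
        old_hash = self.hash_of[block]
        if old_hash is not None:
            del self.cache[old_hash]        # evict the previous cache entry
        self.hash_of[block] = new_hash
        if new_hash is not None:
            self.cache[new_hash] = block
        return block

    def run_request(self, prompt: str, num_full: int) -> int:
        """Serve one request; return the number of full blocks hit in cache."""
        blocks, hits = [], 0
        for i in range(num_full):           # reuse the longest cached prefix
            block = self.cache.get((prompt, i))
            if block is None:
                break
            self.free.remove(block)
            blocks.append(block)
            hits += 1
        for i in range(hits, num_full):     # allocate the missing full blocks
            blocks.append(self._allocate((prompt, i)))
        blocks.append(self._allocate(None))  # one partial block, never hashed
        self._free(blocks)
        return hits

    def _free(self, blocks):
        for block in reversed(blocks):
            if self.lru_only_for_hashed and self.hash_of[block] is None:
                self.free.appendleft(block)  # unhashed: reuse immediately
            else:
                self.free.append(block)      # hashed (or old policy): LRU tail


def hits_for_second_run_of_a(lru_only_for_hashed: bool) -> int:
    pool = ToyBlockPool(num_blocks=8, lru_only_for_hashed=lru_only_for_hashed)
    pool.run_request("A", num_full=3)        # populate the cache with A
    for _ in range(7):
        pool.run_request("B", num_full=1)    # short prompt, repeated
    return pool.run_request("A", num_full=3)


print(hits_for_second_run_of_a(False))  # all freed blocks go to the tail -> 0
print(hits_for_second_run_of_a(True))   # unhashed blocks go to the head  -> 3
```

In this toy run, every repetition of B takes a fresh block from the head for its partial block when all freed blocks are appended to the tail, so A's entries are gone by the time A returns. With unhashed blocks prepended to the head, B keeps reusing the same block and A hits on all 3 of its full blocks.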

In principle, this PR affects all models. The test cases below demonstrate this for several architecture types:

Mamba2 - ibm-granite/granite-4.0-tiny-preview - 46% speed-up for test case 1 with this PR
~~SWA - google/gemma-4-31B - 44% speed-up for test case 1 with this PR~~ (update: covered by the later PR "[Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache" #43447, so there is currently no speed-up here)
pure attention - Qwen/Qwen2.5-32B - enables a proper cache hit, reducing latency by 97% for promptsA in test case 2 with this PR

@tdoublep

Test Plan

Test case 1:

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory
import time
MODEL, BLOCKS = [
    ["google/gemma-4-31B", 963], 
    ["ibm-granite/granite-4.0-tiny-preview", 70]
][1] # <- select setup here
sampling_params = SamplingParams(temperature=0.0, max_tokens=5)
prompt1 = "The president of the United States is " * 200
prompt2 = "." + prompt1
engine = LLM(model=MODEL, enable_prefix_caching=True,
    max_model_len=2000, max_num_seqs=50, # to avoid engine start errors
    num_gpu_blocks_override=BLOCKS, #emulate low memory scenario
    disable_log_stats=False)
for i in range(11):
    if i == 0:
        print('Warm-up')
    if i == 1:
        print('Measuring')
        start_time = time.time()
    outputs = engine.generate(prompt1, sampling_params)
    print(f"Prompt 1: Generated text: {outputs[0].outputs[0].text!r}")    
    outputs = engine.generate(prompt2, sampling_params)
    print(f"Prompt 2: Generated text: {outputs[0].outputs[0].text!r}")
    for m in engine.llm_engine.get_metrics():
        if 'vllm:prefix_cache_hits' in m.name:
            print(m.name, m.value)
print("Took --- %s seconds ---" % (time.time() - start_time))

Test case 2:

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory
import time
sampling_params = SamplingParams(temperature=0.0, max_tokens=1)
promptsA = [f"A{i} " + "The president of the United States is " * 2000 for i in range(2)] 
promptsB = [f"B{i} " + "The president of the United States is " * 10 for i in range(60)] 
engine = LLM(model="Qwen/Qwen2.5-32B", enable_prefix_caching=True,
    max_model_len=30000, max_num_seqs=50, # to avoid engine start errors
    disable_log_stats=False)
start_time = time.time()
print('Warm-up')
outputs = engine.generate(promptsA, sampling_params)
outputs = engine.generate(promptsB, sampling_params)
print("Warm-up took --- %s seconds ---" % (time.time() - start_time), "A and B are in cache")
start_time = time.time()
print("Prompts B")
for i in range(30): 
    outputs = engine.generate(promptsB, sampling_params)
for m in engine.llm_engine.get_metrics():
    if 'vllm:prefix_cache_hits' in m.name:
        print(m.name, m.value)
print("Running repeatedly Prompts B took --- %s seconds ---" % (time.time() - start_time), "This shouldn't affect the cache state.")
start_time = time.time()
print("Prompts A")
outputs = engine.generate(promptsA, sampling_params)
for m in engine.llm_engine.get_metrics():
    if 'vllm:prefix_cache_hits' in m.name:
        print(m.name, m.value)
print("Running prompts A took --- %s seconds ---" % (time.time() - start_time), "We should get the KV cache hit.")
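Both scripts print the cumulative vllm:prefix_cache_hits counter, so the hits attributable to a single phase are the difference between two readings. A minimal sketch of that bookkeeping follows; the helper name is hypothetical, and the stand-in objects only mimic the .name and .value attributes that the loops above read from engine.llm_engine.get_metrics().

```python
from types import SimpleNamespace


def prefix_cache_hits(metrics) -> int:
    """Sum the values of all metrics whose name contains the hit counter name."""
    return sum(m.value for m in metrics if "vllm:prefix_cache_hits" in m.name)


# Stand-ins for two metric snapshots, using the counter values reported in the
# Test Result section for test case 2 with this PR (before and after prompts A).
before = [SimpleNamespace(name="vllm:prefix_cache_hits", value=115200)]
after = [SimpleNamespace(name="vllm:prefix_cache_hits", value=129200)]

hits_in_phase = prefix_cache_hits(after) - prefix_cache_hits(before)
print(hits_in_phase)
```

In a real run, the two snapshots would be taken immediately before and after the engine.generate call of interest.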

Test Result

Test case 1 (ibm-granite/granite-4.0-tiny-preview):
Main: always cache misses, as partial blocks of the two prompts evict each other's valid cache entries

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 27.01it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.96it/s, est. speed input: 6350.58 toks/s, output: 19.83 toks/s]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 276.41it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  7.67it/s, est. speed input: 12326.12 toks/s, output: 38.47 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 271.34it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  6.95it/s, est. speed input: 11156.71 toks/s, output: 34.84 toks/s]
[...]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 285.60it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  6.98it/s, est. speed input: 11247.76 toks/s, output: 35.10 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 0
Took --- 2.9889769554138184 seconds ---

This PR: cache hits, as LRU is applied only to valid cache blocks

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 55.78it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  5.82it/s, est. speed input: 9364.15 toks/s, output: 29.24 toks/s]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 236.81it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00, 10.61it/s, est. speed input: 17096.17 toks/s, output: 53.34 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 298.38it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  9.98it/s, est. speed input: 16075.32 toks/s, output: 50.18 toks/s]
[...]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 223.46it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00, 13.01it/s, est. speed input: 20984.68 toks/s, output: 65.47 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 30208
Took --- 1.6180028915405273 seconds ---

Test case 1 (google/gemma-4-31B):
Main: partial blocks of the two prompts evict each other's valid cache entries, reducing the cache hits (the SWA architecture is a bit more robust than Mamba, as some cache entries still partially match)

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 16.52it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.59it/s, est. speed input: 5047.25 toks/s, output: 18.00 toks/s]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 199.55it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.70it/s, est. speed input: 5206.13 toks/s, output: 18.55 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 245.73it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  4.06it/s, est. speed input: 5713.19 toks/s, output: 20.37 toks/s]
[...]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 230.48it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  4.07it/s, est. speed input: 5729.38 toks/s, output: 20.42 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 5120
Took --- 5.051628112792969 seconds ---

This PR: proper full cache hits

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 16.70it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.56it/s, est. speed input: 4992.74 toks/s, output: 17.80 toks/s]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 235.12it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.68it/s, est. speed input: 5174.89 toks/s, output: 18.44 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 247.99it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  7.51it/s, est. speed input: 10548.47 toks/s, output: 37.62 toks/s]
[...]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 247.39it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  7.46it/s, est. speed input: 10528.95 toks/s, output: 37.51 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 27520
Took --- 2.811150074005127 seconds ---

Test case 2 (Qwen/Qwen2.5-32B):
Main: Prompts A get a cache miss, as the entire cache is evicted by partial blocks (the vllm:prefix_cache_hits counter does not increase beyond 115200)

Warm-up
Rendering prompts: 100%|--------------| 2/2 [00:00<00:00, 17.78it/s]
Processed prompts: 100%|-------------------| 2/2 [00:03<00:00,  1.58s/it, est. speed input: 8845.22 toks/s, output: 0.63 toks/s]
...
Warm-up took --- 3.763427972793579 seconds --- A and B are in cache
Prompts B
Rendering prompts: 100%|----------| 60/60 [00:00<00:00, 2764.29it/s]
Processed prompts: 100%|-------------| 60/60 [00:00<00:00, 499.90it/s, est. speed input: 36963.12 toks/s, output: 500.60 toks/s]
...
vllm:prefix_cache_hits 115200
Running repeatedly Prompts B took --- 4.006891965866089 seconds --- This shouldn't affect the cache state.
Prompts A
Rendering prompts: 100%|--------------| 2/2 [00:00<00:00, 32.58it/s]
Processed prompts: 100%|-------------------| 2/2 [00:03<00:00,  1.58s/it, est. speed input: 8855.25 toks/s, output: 0.63 toks/s]
vllm:prefix_cache_hits 115200
Running prompts A took --- 3.2269678115844727 seconds --- We should get the KV cache hit. (But we don't)

This PR: Prompts A get a cache hit (vllm:prefix_cache_hits increases from 115200 to 129200)

Warm-up
Rendering prompts: 100%|-----------| 2/2 [00:00<00:00, 19.60it/s]
Processed prompts: 100%|----------------| 2/2 [00:03<00:00,  1.58s/it, est. speed input: 8884.42 toks/s, output: 0.63 toks/s]
...
Warm-up took --- 3.7505805492401123 seconds --- A and B are in cache
Prompts B
Rendering prompts: 100%|-------| 60/60 [00:00<00:00, 2752.68it/s]
Processed prompts: 100%|----------| 60/60 [00:00<00:00, 495.13it/s, est. speed input: 36667.98 toks/s, output: 496.57 toks/s]
...
vllm:prefix_cache_hits 115200
Running repeatedly Prompts B took --- 3.807586431503296 seconds --- This shouldn't affect the cache state.
Prompts A
Rendering prompts: 100%|-----------| 2/2 [00:00<00:00, 32.40it/s]
Processed prompts: 100%|-------------| 2/2 [00:00<00:00, 45.63it/s, est. speed input: 642011.74 toks/s, output: 45.83 toks/s]
vllm:prefix_cache_hits 129200
Running prompts A took --- 0.10813021659851074 seconds --- We should get the KV cache hit. (And we do)
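The percentages quoted in the Purpose section can be re-derived from the wall-clock timings above. A minimal sketch, assuming that "speed-up" there means the relative reduction in elapsed time; the timings are copied from the logs, and the helper name is illustrative.

```python
def time_reduction_pct(main_seconds: float, pr_seconds: float) -> float:
    """Relative reduction in wall-clock time, in percent."""
    return 100.0 * (main_seconds - pr_seconds) / main_seconds


# (seconds on main, seconds with this PR), as printed by the test scripts.
timings = {
    "granite-4.0-tiny-preview, test case 1": (2.9889769554138184, 1.6180028915405273),
    "gemma-4-31B, test case 1": (5.051628112792969, 2.811150074005127),
    "Qwen2.5-32B, test case 2 (prompts A)": (3.2269678115844727, 0.10813021659851074),
}

reductions = {name: time_reduction_pct(m, p) for name, (m, p) in timings.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.0f}% less time")
```

Under that reading the values round to 46%, 44% and 97%, which matches the figures quoted in the description.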

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request optimizes the block freeing process in the KV cache by distinguishing between blocks with and without hashes. Blocks without hashes are now prepended to the free block queue for immediate reallocation, while blocks with hashes are appended to maintain LRU order for potential reuse. To support this, a prepend_n method was added to the free list utility. I have no feedback to provide.
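For readers unfamiliar with the data structure, here is a minimal sketch of what a prepend operation could look like on a doubly linked free list with sentinel nodes. Only the name prepend_n comes from the summary above; the class names, signatures, and the remaining methods are assumptions for illustration and are not taken from the PR's diff.

```python
class Node:
    """One free block in the toy list."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None


class ToyFreeList:
    """Doubly linked list with sentinels; blocks are allocated from the head."""

    def __init__(self):
        self.head = Node(None)  # sentinel before the first real node
        self.tail = Node(None)  # sentinel after the last real node
        self.head.next = self.tail
        self.tail.prev = self.head

    def _insert_after(self, anchor, nodes):
        # Link each node right after the previous one, preserving input order.
        for node in nodes:
            following = anchor.next
            anchor.next, node.prev = node, anchor
            node.next, following.prev = following, node
            anchor = node

    def append_n(self, nodes):
        """Hashed blocks: queue at the tail so they are evicted last."""
        self._insert_after(self.tail.prev, nodes)

    def prepend_n(self, nodes):
        """Unhashed blocks: queue at the head so they are reallocated first."""
        self._insert_after(self.head, nodes)

    def popleft(self):
        node = self.head.next
        if node is self.tail:
            raise IndexError("no free blocks")
        self.head.next = node.next
        node.next.prev = self.head
        node.prev = node.next = None
        return node

    def ids(self):
        out, node = [], self.head.next
        while node is not self.tail:
            out.append(node.block_id)
            node = node.next
        return out


free_list = ToyFreeList()
free_list.append_n([Node(0), Node(1)])   # blocks with hashes
free_list.prepend_n([Node(7), Node(8)])  # blocks without hashes
order_after_free = free_list.ids()       # [7, 8, 0, 1]
first_allocated = free_list.popleft().block_id
print(order_after_free, first_allocated)
```

Both operations take constant time per node, so freeing a mixed batch of blocks costs the same as before; only the insertion point differs.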

mergify · 2026-05-14T16:11:04Z

Hi @s3woz, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

mergify · 2026-06-04T08:07:18Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @s3woz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

tdoublep · 2026-06-10T15:48:37Z

@njhill Could you please take a look at this one? I believe it generalizes the approach of #43447 to cover more cases.

Optimized LRU logic

2dd0759

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

s3woz requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners May 14, 2026 16:01

claude Bot reviewed May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

mergify Bot added the v1 label May 14, 2026

Pre-commit fixes

1c3ca42

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

s3woz mentioned this pull request Jun 2, 2026

[Bug]: Prefix cache align-mode has a 0% cache hit rate for Qwen3.6-35B-A3B #42317

Open

1 task

Merge branch 'main' into free_blocks

d093365

mergify Bot added the needs-rebase label Jun 4, 2026

Merge

aef2a0f

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

mergify Bot removed the needs-rebase label Jun 10, 2026

Comments cleanup

28c3805

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

s3woz wants to merge 5 commits into vllm-project:main from s3woz:free_blocks
