Apply LRU policy only to proper cache entries#42656

Open: s3woz wants to merge 5 commits into vllm-project:main from s3woz:free_blocks

@s3woz s3woz commented May 14, 2026

Purpose

During execution, vLLM allocates not only full blocks, which are reused through the prefix caching mechanism, but also partial blocks, which are not assigned a hash and therefore will never be reused by the caching mechanism. Currently the LRU policy treats both types of blocks equally. As a result, partial blocks, which are not cache entries in practice, are queued like standard cache entries and push proper cache entries out in LRU order. This causes proper cache entries to be evicted earlier than necessary.

This PR applies the LRU policy only to proper cache entries: when blocks are freed, partial blocks are filtered out and inserted at the head of the free block queue instead of the tail. This improves caching behavior, for example:

  • in memory-tight scenarios handling a few prompts (see test case 1: partial blocks from two prompts evict each other's valid cache entries)
  • in memory-abundant scenarios handling a large number of concurrent prompts (see test case 2: partial blocks from numerous concurrent prompts evict the entire KV cache on an H100, more than 2k blocks, resulting in a cache miss for a prompt that should have had a cache hit)
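
The effect described above can be illustrated with a toy model of the free block queue. This is a minimal sketch, not the vLLM implementation: the `free_blocks` and `count_evictions` helpers, the `(block_id, block_hash)` tuples, and the use of `collections.deque` are all assumptions made for illustration. Allocation pops from the head of the queue, and a block that still carries a hash when it is popped counts as an evicted cache entry.

```python
from collections import deque


def free_blocks(queue, blocks, filter_partial):
    """Return freed (block_id, block_hash) tuples to the free queue.

    Allocation pops from the left (head), so the tail is evicted last.
    """
    if not filter_partial:
        # Baseline: every freed block joins the LRU tail, hashed or not.
        queue.extend(blocks)
        return
    hashed = [b for b in blocks if b[1] is not None]
    unhashed = [b for b in blocks if b[1] is None]
    # Proper cache entries keep LRU order at the tail.
    queue.extend(hashed)
    # Partial (unhashed) blocks go to the head so they are reused first.
    queue.extendleft(reversed(unhashed))


def count_evictions(filter_partial, num_cached=4, num_requests=4):
    """Count cache entries evicted by requests that each use one partial block."""
    queue = deque((i, f"hash{i}") for i in range(num_cached))
    evicted = 0
    for _ in range(num_requests):
        block_id, block_hash = queue.popleft()  # allocate from the head
        if block_hash is not None:
            evicted += 1  # a reusable cache entry was overwritten
        # The request ends without filling the block, so it has no hash.
        free_blocks(queue, [(block_id, None)], filter_partial)
    return evicted


print("baseline evictions:", count_evictions(filter_partial=False))
print("filtered evictions:", count_evictions(filter_partial=True))
```

In this toy setup the baseline loses one cache entry per request, while the filtered variant keeps recycling the same unhashed block after the first allocation.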

In principle, this PR affects all models. The test cases below demonstrate this for several architecture types.

@tdoublep

Test Plan

Test case 1:

import time
from vllm import LLM, SamplingParams

# Each setup pairs a model with a small block budget.
MODEL, BLOCKS = [
    ["google/gemma-4-31B", 963],
    ["ibm-granite/granite-4.0-tiny-preview", 70],
][1]  # <- select setup here
sampling_params = SamplingParams(temperature=0.0, max_tokens=5)
# prompt2 differs from prompt1 from its first character, so the two prompts are cached separately.
prompt1 = "The president of the United States is " * 200
prompt2 = "." + prompt1
engine = LLM(model=MODEL, enable_prefix_caching=True,
    max_model_len=2000, max_num_seqs=50,  # to avoid engine start errors
    num_gpu_blocks_override=BLOCKS,  # emulate a low-memory scenario
    disable_log_stats=False)
for i in range(11):
    if i == 0:
        print('Warm-up')
    if i == 1:
        print('Measuring')
        start_time = time.time()
    outputs = engine.generate(prompt1, sampling_params)
    print(f"Prompt 1: Generated text: {outputs[0].outputs[0].text!r}")    
    outputs = engine.generate(prompt2, sampling_params)
    print(f"Prompt 2: Generated text: {outputs[0].outputs[0].text!r}")
    for m in engine.llm_engine.get_metrics():
        if 'vllm:prefix_cache_hits' in m.name:
            print(m.name, m.value)
print("Took --- %s seconds ---" % (time.time() - start_time))

Test case 2:

import time
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, max_tokens=1)
# Prompts A: two long prompts whose cache entries should survive the run.
promptsA = [f"A{i} " + "The president of the United States is " * 2000 for i in range(2)]
# Prompts B: 60 short concurrent prompts that produce many partial blocks.
promptsB = [f"B{i} " + "The president of the United States is " * 10 for i in range(60)]
engine = LLM(model="Qwen/Qwen2.5-32B", enable_prefix_caching=True,
    max_model_len=30000, max_num_seqs=50,  # to avoid engine start errors
    disable_log_stats=False)
start_time = time.time()
print('Warm-up')
outputs = engine.generate(promptsA, sampling_params)
outputs = engine.generate(promptsB, sampling_params)
print("Warm-up took --- %s seconds ---" % (time.time() - start_time), "A and B are in cache")
start_time = time.time()
print("Prompts B")
for i in range(30): 
    outputs = engine.generate(promptsB, sampling_params)
for m in engine.llm_engine.get_metrics():
    if 'vllm:prefix_cache_hits' in m.name:
        print(m.name, m.value)
print("Running repeatedly Prompts B took --- %s seconds ---" % (time.time() - start_time), "This shouldn't affect the cache state.")
start_time = time.time()
print("Prompts A")
outputs = engine.generate(promptsA, sampling_params)
for m in engine.llm_engine.get_metrics():
    if 'vllm:prefix_cache_hits' in m.name:
        print(m.name, m.value)
print("Running prompts A took --- %s seconds ---" % (time.time() - start_time), "We should get the KV cache hit.")
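
The vllm:prefix_cache_hits value printed by these scripts is cumulative, so what matters for each phase is how much it grows. A minimal sketch of that bookkeeping follows, assuming only that each metric object exposes `name` and `value` as used in the scripts above; `MetricSample`, `counter_value`, and `hits_gained` are hypothetical helpers and not part of vLLM.

```python
from dataclasses import dataclass


@dataclass
class MetricSample:
    """Hypothetical stand-in for an object returned by get_metrics()."""
    name: str
    value: int


def counter_value(metrics, name):
    """Sum the values of all metrics whose name contains `name`."""
    return sum(m.value for m in metrics if name in m.name)


def hits_gained(before, after, name="vllm:prefix_cache_hits"):
    """Return how much the cumulative counter grew between two snapshots."""
    return counter_value(after, name) - counter_value(before, name)


# Snapshots mirroring the counter values reported for test case 2 with this PR.
before = [MetricSample("vllm:prefix_cache_hits", 115200)]
after = [MetricSample("vllm:prefix_cache_hits", 129200)]
print(hits_gained(before, after))
```

A gain of zero between two snapshots means the phase in between had no prefix cache hits.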

Test Result

Test case 1 (ibm-granite/granite-4.0-tiny-preview):
Main: always cache misses, as partial blocks of the two prompts evict each other's valid cache entries

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 27.01it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.96it/s, est. speed input: 6350.58 toks/s, output: 19.83 toks/s]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 276.41it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  7.67it/s, est. speed input: 12326.12 toks/s, output: 38.47 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 271.34it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  6.95it/s, est. speed input: 11156.71 toks/s, output: 34.84 toks/s]
[...]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 285.60it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  6.98it/s, est. speed input: 11247.76 toks/s, output: 35.10 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 0
Took --- 2.9889769554138184 seconds ---

This PR: cache hits, as LRU is applied only to valid cache blocks

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 55.78it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  5.82it/s, est. speed input: 9364.15 toks/s, output: 29.24 toks/s]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 236.81it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00, 10.61it/s, est. speed input: 17096.17 toks/s, output: 53.34 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 298.38it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  9.98it/s, est. speed input: 16075.32 toks/s, output: 50.18 toks/s]
[...]
Prompt 1: Generated text: '10. The pres'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 223.46it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00, 13.01it/s, est. speed input: 20984.68 toks/s, output: 65.47 toks/s]
Prompt 2: Generated text: ' The president of the'
vllm:prefix_cache_hits 30208
Took --- 1.6180028915405273 seconds ---

Test case 1 (google/gemma-4-31B):
Main: partial blocks of the two prompts evict each other's valid cache entries, reducing the cache hits (the SWA architecture is somewhat more robust than Mamba, as some cache entries still partially match)

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 16.52it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.59it/s, est. speed input: 5047.25 toks/s, output: 18.00 toks/s]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 199.55it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.70it/s, est. speed input: 5206.13 toks/s, output: 18.55 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 245.73it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  4.06it/s, est. speed input: 5713.19 toks/s, output: 20.37 toks/s]
[...]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 230.48it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  4.07it/s, est. speed input: 5729.38 toks/s, output: 20.42 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 5120
Took --- 5.051628112792969 seconds ---

This PR: proper full cache hits

Warm-up
Rendering prompts: 100%|-----------| 1/1 [00:00<00:00, 16.70it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.56it/s, est. speed input: 4992.74 toks/s, output: 17.80 toks/s]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 235.12it/s]
Processed prompts: 100%|---------| 1/1 [00:00<00:00,  3.68it/s, est. speed input: 5174.89 toks/s, output: 18.44 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 0
Measuring
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 247.99it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  7.51it/s, est. speed input: 10548.47 toks/s, output: 37.62 toks/s]
[...]
Prompt 1: Generated text: '\n\nThe president of the'
Rendering prompts: 100%|----------| 1/1 [00:00<00:00, 247.39it/s]
Processed prompts: 100%|--------| 1/1 [00:00<00:00,  7.46it/s, est. speed input: 10528.95 toks/s, output: 37.51 toks/s]
Prompt 2: Generated text: '\n\nThe president of the'
vllm:prefix_cache_hits 27520
Took --- 2.811150074005127 seconds ---

Test case 2 (Qwen/Qwen2.5-32B):
Main: Prompts A get a cache miss, as the entire cache is evicted by partial blocks (the vllm:prefix_cache_hits counter does not increase beyond 115200)

Warm-up
Rendering prompts: 100%|--------------| 2/2 [00:00<00:00, 17.78it/s]
Processed prompts: 100%|-------------------| 2/2 [00:03<00:00,  1.58s/it, est. speed input: 8845.22 toks/s, output: 0.63 toks/s]
...
Warm-up took --- 3.763427972793579 seconds --- A and B are in cache
Prompts B
Rendering prompts: 100%|----------| 60/60 [00:00<00:00, 2764.29it/s]
Processed prompts: 100%|-------------| 60/60 [00:00<00:00, 499.90it/s, est. speed input: 36963.12 toks/s, output: 500.60 toks/s]
...
vllm:prefix_cache_hits 115200
Running repeatedly Prompts B took --- 4.006891965866089 seconds --- This shouldn't affect the cache state.
Prompts A
Rendering prompts: 100%|--------------| 2/2 [00:00<00:00, 32.58it/s]
Processed prompts: 100%|-------------------| 2/2 [00:03<00:00,  1.58s/it, est. speed input: 8855.25 toks/s, output: 0.63 toks/s]
vllm:prefix_cache_hits 115200
Running prompts A took --- 3.2269678115844727 seconds --- We should get the KV cache hit. (But we don't)

This PR: Prompts A get a cache hit (vllm:prefix_cache_hits increased from 115200 to 129200)

Warm-up
Rendering prompts: 100%|-----------| 2/2 [00:00<00:00, 19.60it/s]
Processed prompts: 100%|----------------| 2/2 [00:03<00:00,  1.58s/it, est. speed input: 8884.42 toks/s, output: 0.63 toks/s]
...
Warm-up took --- 3.7505805492401123 seconds --- A and B are in cache
Prompts B
Rendering prompts: 100%|-------| 60/60 [00:00<00:00, 2752.68it/s]
Processed prompts: 100%|----------| 60/60 [00:00<00:00, 495.13it/s, est. speed input: 36667.98 toks/s, output: 496.57 toks/s]
...
vllm:prefix_cache_hits 115200
Running repeatedly Prompts B took --- 3.807586431503296 seconds --- This shouldn't affect the cache state.
Prompts A
Rendering prompts: 100%|-----------| 2/2 [00:00<00:00, 32.40it/s]
Processed prompts: 100%|-------------| 2/2 [00:00<00:00, 45.63it/s, est. speed input: 642011.74 toks/s, output: 45.83 toks/s]
vllm:prefix_cache_hits 129200
Running prompts A took --- 0.10813021659851074 seconds --- We should get the KV cache hit. (And we do)

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the block freeing process in the KV cache by distinguishing between blocks with and without hashes. Blocks without hashes are now prepended to the free block queue for immediate reallocation, while blocks with hashes are appended to maintain LRU order for potential reuse. To support this, a prepend_n method was added to the free list utility. I have no feedback to provide.
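
The prepend operation mentioned in this summary can be sketched on a doubly linked list with sentinel nodes. This is an illustrative sketch only: the `Node` and `FreeList` classes below are hypothetical, and the real free list utility in vLLM and the `prepend_n` added by this PR may differ in naming, signature, and the order in which prepended blocks end up.

```python
class Node:
    """A free-list entry identified by its block id."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None


class FreeList:
    """Doubly linked list with sentinel head and tail nodes."""

    def __init__(self):
        self.head = Node(-1)
        self.tail = Node(-1)
        self.head.next = self.tail
        self.tail.prev = self.head
        self.size = 0

    def append_n(self, nodes):
        """Link nodes, in order, just before the tail sentinel."""
        last = self.tail.prev
        for node in nodes:
            node.prev = last
            last.next = node
            last = node
        last.next = self.tail
        self.tail.prev = last
        self.size += len(nodes)

    def prepend_n(self, nodes):
        """Link nodes, in order, just after the head sentinel."""
        first = self.head.next
        prev = self.head
        for node in nodes:
            node.prev = prev
            prev.next = node
            prev = node
        prev.next = first
        first.prev = prev
        self.size += len(nodes)

    def popleft(self):
        """Remove and return the node right after the head sentinel."""
        if self.size == 0:
            raise IndexError("pop from an empty free list")
        node = self.head.next
        self.head.next = node.next
        node.next.prev = self.head
        node.prev = node.next = None
        self.size -= 1
        return node

    def ids(self):
        """Return block ids from head to tail."""
        out = []
        node = self.head.next
        while node is not self.tail:
            out.append(node.block_id)
            node = node.next
        return out


free_list = FreeList()
free_list.append_n([Node(1), Node(2)])   # hashed blocks join the tail
free_list.prepend_n([Node(8), Node(9)])  # unhashed blocks join the head
print(free_list.ids())
```

Both bulk operations touch each inserted node once and relink the sentinel side a single time, so freeing n blocks stays linear in n.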

@mergify

mergify Bot commented May 14, 2026

Hi @s3woz, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify mergify Bot added the v1 label May 14, 2026
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@mergify

mergify Bot commented Jun 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @s3woz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 4, 2026
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@mergify mergify Bot removed the needs-rebase label Jun 10, 2026
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@tdoublep

@njhill Could you please take a look at this one? I believe it generalizes the approach of #43447 to cover more cases.
