[Attention] Cache attention metadata builds across hybrid KV-cache groups#29627

Merged
LucasWilkinson merged 10 commits into vllm-project:main from neuralmagic:lwilkinson/cache-2
Dec 16, 2025
Conversation

Collaborator

@LucasWilkinson LucasWilkinson commented Nov 27, 2025

Scaled-back version of #22788 and a generalization of #29444.

Depends on #29628; land that first.

Test Plan

Decode-heavy workload with APC enabled => latency reduction through this PR: 13.7%

vllm bench latency --model ibm-granite/granite-4.0-tiny-preview --input-len 128  --output-len 2048 \
    --batch-size 8 --enable-prefix-caching --num-iters 10 --num-iters-warmup 2

Decode-heavy workload with APC disabled => latency reduction through this PR: 2.4%

vllm bench latency --model ibm-granite/granite-4.0-tiny-preview --input-len 128  --output-len 2048 \
    --batch-size 8 --num-iters 10 --num-iters-warmup 2

Test Result

APC enabled:

Before:

Avg latency: 17.656723083648830 seconds
10% percentile latency: 17.629154206998646 seconds
25% percentile latency: 17.638833652948960 seconds
50% percentile latency: 17.659772212151438 seconds
75% percentile latency: 17.681711453711614 seconds
90% percentile latency: 17.693322991393508 seconds
99% percentile latency: 17.694964875970037 seconds

This PR:

Avg latency: 15.236061835102737 seconds
10% percentile latency: 15.214241081103683 seconds
25% percentile latency: 15.218232876388356 seconds
50% percentile latency: 15.237253282219172 seconds
75% percentile latency: 15.245128114242107 seconds
90% percentile latency: 15.250385893788188 seconds
99% percentile latency: 15.278700626296922 seconds

APC disabled:

Before:

Avg latency: 14.836663954611868 seconds
10% percentile latency: 14.762344243377447 seconds
25% percentile latency: 14.770147853763774 seconds
50% percentile latency: 14.822804148774594 seconds
75% percentile latency: 14.853949831798673 seconds
90% percentile latency: 14.937904387619346 seconds
99% percentile latency: 15.010869848066940 seconds

This PR:

Avg latency: 14.485000116657465 seconds
10% percentile latency: 14.469038359075785 seconds
25% percentile latency: 14.475818350212649 seconds
50% percentile latency: 14.479432722553610 seconds
75% percentile latency: 14.504760635318235 seconds
90% percentile latency: 14.514666751492769 seconds
99% percentile latency: 14.517034560767934 seconds
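
The quoted reductions can be checked against the average latencies above. The snippet below is a small sketch of that arithmetic; the helper name `latency_reduction_pct` is made up for illustration and is not part of `vllm bench`.

```python
# Average latencies in seconds, copied from the benchmark output above:
# (before, with this PR).
RESULTS = {
    "APC enabled": (17.656723083648830, 15.236061835102737),
    "APC disabled": (14.836663954611868, 14.485000116657465),
}


def latency_reduction_pct(before: float, after: float) -> float:
    """Relative latency reduction in percent: (before - after) / before."""
    return (before - after) / before * 100.0


for name, (before, after) in RESULTS.items():
    pct = latency_reduction_pct(before, after)
    print(f"{name}: {pct:.1f}% lower average latency")
```

This reproduces the 13.7% (APC enabled) and 2.4% (APC disabled) figures from the test plan.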

Co-authored by: @s3woz (original PR: #29444)

@mergify mergify bot added the v1 label Nov 27, 2025

mergify bot commented Nov 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 10, 2025
@mergify mergify bot removed the needs-rebase label Dec 10, 2025
@LucasWilkinson LucasWilkinson marked this pull request as ready for review December 11, 2025 05:13

mergify bot commented Dec 11, 2025

Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@LucasWilkinson LucasWilkinson changed the title [Attention] Cache attention metadata builds across hybrid KV-cache groups [Do not land][Attention] Cache attention metadata builds across hybrid KV-cache groups Dec 11, 2025
@mergify mergify bot added the needs-rebase label Dec 12, 2025
LucasWilkinson and others added 5 commits December 12, 2025 04:30
…oups

Manually applied PR vllm-project#22788 from neuralmagic:lwilkinson/cache-metadata-builds,
excluding FlashInfer changes.

This PR caches metadata across kv-cache groups and does a lightweight
update_block_table instead of a full rebuild. When using the hybrid kv-cache
manager we build new attention metadata from scratch for each kv-cache
group, even when the kv-cache groups differ only by the block table. This
happens when we have a repeated n:1 pattern like Llama4, where we have 3
local chunked attention layers for every one full attention layer.
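
A minimal sketch of the idea described above, using toy stand-ins rather than vLLM's real classes. `ToyAttnMetadata`, `ToyBuilder`, and `build_for_groups` are hypothetical names; only `update_block_table` is named in the source, and the signature used here is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyAttnMetadata:
    """Hypothetical stand-in for a backend's attention metadata."""
    seq_lens: tuple[int, ...]
    block_table: tuple[tuple[int, ...], ...]


class ToyBuilder:
    """Hypothetical builder that counts how often the expensive path runs."""

    def __init__(self) -> None:
        self.full_builds = 0

    def build(self, seq_lens, block_table) -> ToyAttnMetadata:
        # Stands in for the costly from-scratch metadata build.
        self.full_builds += 1
        return ToyAttnMetadata(tuple(seq_lens), block_table)

    def update_block_table(self, metadata, block_table) -> ToyAttnMetadata:
        # Cheap path: copy the cached metadata and swap only the block table.
        return replace(metadata, block_table=block_table)


def build_for_groups(builder, seq_lens, group_block_tables):
    """Build once, then reuse the result for groups that differ only by block table."""
    cached = None
    per_group = []
    for block_table in group_block_tables:
        if cached is None:
            cached = builder.build(seq_lens, block_table)
            per_group.append(cached)
        else:
            per_group.append(builder.update_block_table(cached, block_table))
    return per_group
```

With four groups that share everything except the block table, the toy builder runs one full build and three block-table updates.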

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 12, 2025
@mergify mergify bot removed the needs-rebase label Dec 12, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson changed the title [Do not land][Attention] Cache attention metadata builds across hybrid KV-cache groups [Attention] Cache attention metadata builds across hybrid KV-cache groups Dec 12, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Member

@tdoublep tdoublep left a comment

Looks great, thank you for doing this - left one nit

common_attn_metadata=common_attn_metadata,
**extra_attn_metadata_args,
)
cached_attn_metadata[cache_key] = attn_metadata_i
Member

nit: should we do it all the time, or also condition it on builder.supports_update_block_table?

Collaborator Author

done 👍
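
One way to read the nit above is sketched below: only store an entry in the cache when the builder can later reuse it. This is an illustration, not the merged code. `FakeBuilder` and `maybe_cache` are hypothetical; only `cached_attn_metadata`, `cache_key`, `attn_metadata_i`, and `supports_update_block_table` appear in the review thread.

```python
class FakeBuilder:
    """Hypothetical builder exposing only the flag discussed in the review."""

    def __init__(self, supports_update_block_table: bool) -> None:
        self.supports_update_block_table = supports_update_block_table


def maybe_cache(cached_attn_metadata: dict, cache_key, builder, attn_metadata_i) -> None:
    # Skip the cache write when the builder could never take the
    # update_block_table fast path on a later cache hit.
    if builder.supports_update_block_table:
        cached_attn_metadata[cache_key] = attn_metadata_i
```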

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson merged commit 9fec0e1 into vllm-project:main Dec 16, 2025
54 checks passed
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Dec 17, 2025
…oups (vllm-project#29627)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Contributor

Josephasafg commented Dec 18, 2025

@LucasWilkinson is there a reason not to add this support for Mamba1 in mamba1_attn.py?

I have this PR which should consolidate most of the common logic between mamba1 and mamba2 so features like that won't be overlooked. I can add this logic to mamba_attn.py in my open PR, so both models can benefit from it. @tdoublep what do you think?

@tdoublep
Member

@Josephasafg Yes, fully agreed on that. Will prioritize reviewing + merging your cleanup PR.

@Josephasafg
Contributor

One interesting observation I made while adding supports_update_block_table to both Mamba models in my PR is that the tests for Mamba1 hybrid models using ai21labs/Jamba-tiny-dev started failing.

After some debugging, I noticed that the Mamba2 model used in the hybrid APC tests is hmellor/tiny-random-BambaForCausalLM, and for this model the test passes. Looking deeper, it turns out this model has only a single Mamba layer and a single attention layer, so builder.update_block_table is never invoked.

When I replaced the model in the APC test with ibm-granite/granite-4.0-tiny-preview, the test failed with the same issue observed in Jamba, specifically a logprobs mismatch.

@LucasWilkinson do you have any insight into why this might be happening?

cc: @tdoublep
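
The point above about the single-Mamba-layer test model can be illustrated with a toy counter. It assumes, purely for illustration, that metadata is cached per builder kind and that each kv-cache group maps to one kind; the group layouts below are made up and are not the real layouts of any model mentioned here. The sketch says nothing about the cause of the logprobs mismatch.

```python
from collections import Counter


def count_update_calls(group_kinds: list[str]) -> Counter:
    """Count full builds vs. block-table updates for a list of kv-cache groups.

    Assumption: the first group of each kind triggers a full build, and every
    later group of the same kind is a cache hit served by update_block_table.
    """
    seen: set[str] = set()
    calls: Counter = Counter()
    for kind in group_kinds:
        if kind in seen:
            calls["update_block_table"] += 1
        else:
            seen.add(kind)
            calls["build"] += 1
    return calls


# One group per kind: no cache hits, so update_block_table never runs.
print(count_update_calls(["mamba", "attention"]))

# Several groups of the same kind: the update path is exercised.
print(count_update_calls(["mamba", "mamba", "mamba", "attention"]))
```

Under these assumptions, a test model with one group per kind never exercises the update path, which is consistent with the test passing for that model and failing only for models with repeated groups.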

@LucasWilkinson
Collaborator Author

Can you give me a repro command? Happy to help out.

@Josephasafg
Contributor

Josephasafg commented Dec 18, 2025

@LucasWilkinson Thanks.
You can run this test:
pytest tests/models/language/generation/test_hybrid.py::test_apc_single_prompt[1-5-2-64-ibm-granite/granite-4.0-tiny-preview]
but you would need to add HYBRID_MODELS[4] to the list of allowed models in the test's parametrize.

@Josephasafg
Contributor

@LucasWilkinson have you managed to reproduce it?

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
…oups (vllm-project#29627)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…oups (vllm-project#29627)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…oups (vllm-project#29627)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
