[Attention] Cache attention metadata builds across hybrid KV-cache groups #29627
LucasWilkinson merged 10 commits into vllm-project:main
Conversation
Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
…oups

Manually applied PR vllm-project#22788 from neuralmagic:lwilkinson/cache-metadata-builds, excluding the FlashInfer changes. This PR caches metadata across KV-cache groups and does a lightweight update_block_table instead of a full rebuild. When using the hybrid KV-cache manager, we build new attention metadata from scratch for each KV-cache group even when the groups differ only by the block table. This happens when we have a repeated n:1 pattern like Llama4, where there are 3 local chunked-attention layers for every full-attention layer.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
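The caching described in the commit message can be sketched roughly as follows. This is an illustrative stand-in, not vLLM's actual implementation: `AttnMetadata`, `build_for_groups`, and the `(cache_key, block_table)` shape are invented for the sketch; only the idea of "build once per distinct group shape, then swap block tables" comes from the PR.

```python
# Hypothetical sketch of caching metadata builds across KV-cache groups.
# Groups that share a cache_key differ only by their block table, so a
# full rebuild is wasted work for every group after the first.
from dataclasses import dataclass, replace


@dataclass
class AttnMetadata:
    block_table: list  # the only field that differs between cached groups
    seq_lens: list     # everything else is shared across groups


def build_for_groups(groups, build_fn):
    """groups: list of (cache_key, block_table) pairs.

    Build metadata once per distinct cache_key via build_fn (expensive),
    then produce subsequent groups with a lightweight block-table swap.
    """
    cached = {}
    out = []
    for cache_key, block_table in groups:
        md = cached.get(cache_key)
        if md is None:
            md = build_fn(block_table)                  # full build
            cached[cache_key] = md
        else:
            md = replace(md, block_table=block_table)   # cheap update
        out.append(md)
    return out
```

For a Llama4-style 3:1 pattern (three local chunked-attention groups per full-attention group), this turns four full builds into two builds plus two cheap updates per step.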
tdoublep left a comment:

Looks great, thanks for doing this - left one nit.
vllm/v1/worker/gpu_model_runner.py (outdated):

        common_attn_metadata=common_attn_metadata,
        **extra_attn_metadata_args,
    )
    cached_attn_metadata[cache_key] = attn_metadata_i
nit: should we do it all the time, or also condition it on builder.supports_update_block_table?
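One way to address this nit is to populate (and consult) the cache only when the builder advertises lightweight updates. A hedged sketch: the `Builder` class, dict-based metadata, and `get_metadata` helper are invented for illustration; only the `supports_update_block_table` flag and the `update_block_table` call come from the comment above, and the merged code may gate differently.

```python
# Sketch: only cache metadata when the builder can do lightweight
# block-table updates; otherwise the cache entry would never be reused.
class Builder:
    supports_update_block_table = False

    def build(self, key):
        # Stand-in for the expensive full metadata build.
        return {"key": key, "block_table": None}

    def update_block_table(self, metadata, block_table):
        # Stand-in for the cheap path: copy and swap only the block table.
        new = dict(metadata)
        new["block_table"] = block_table
        return new


def get_metadata(builder, cache, cache_key, block_table):
    if builder.supports_update_block_table and cache_key in cache:
        return builder.update_block_table(cache[cache_key], block_table)
    metadata = builder.build(cache_key)
    metadata["block_table"] = block_table
    if builder.supports_update_block_table:
        cache[cache_key] = metadata
    return metadata
```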
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…oups (vllm-project#29627) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@LucasWilkinson is there a reason not to add this support for Mamba1? I have a PR which should consolidate most of the common logic between mamba1 and mamba2, so features like this won't be overlooked. I can add this logic there.
@Josephasafg Yes, fully agreed on that. Will prioritize reviewing + merging your cleanup PR.
One interesting observation I made while adding this: after some debugging, I noticed that the Mamba2 model used in the hybrid APC tests is … When I replaced the model in the APC test with … @LucasWilkinson, do you have any insight into why this might be happening? cc: @tdoublep
Can you give me a repro command? Happy to help out.

@LucasWilkinson Thanks.
@LucasWilkinson have you managed to reproduce it? |
Scaled-back version of #22788 and generalization of #29444.
Depends on #29628; land that first.
Test Plan
Decode-heavy workload with APC enabled => latency reduction through this PR: 13.7%
Decode-heavy workload with APC disabled => latency reduction through this PR: 2.4%
Test Result

APC enabled: before vs. this PR (benchmark output omitted)

APC disabled: before vs. this PR (benchmark output omitted)
Co-authored-by: @s3woz (original PR: #29444)