[Attention] Cache attention metadata builds across hybrid KV-cache groups #29627
LucasWilkinson merged 10 commits into vllm-project:main
Conversation
Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
…oups

Manually applied PR vllm-project#22788 from neuralmagic:lwilkinson/cache-metadata-builds, excluding the FlashInfer changes. This PR caches metadata across KV-cache groups and does a lightweight update_block_table instead of a full rebuild. When using the hybrid KV-cache manager, we build new attention metadata from scratch for each KV-cache group even when the groups differ only by the block table. This happens when we have a repeated n:1 pattern like Llama4, where there are 3 local chunked-attention layers for every full-attention layer.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
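The caching described in the commit message can be sketched roughly as follows. This is an illustrative stand-in, not vLLM's actual implementation: `AttnMetadata`, `build_for_groups`, and the `(cache_key, block_table)` shape are invented for the sketch; only the idea of "build once per distinct group shape, then swap block tables" comes from the PR.

```python
# Hypothetical sketch of caching metadata builds across KV-cache groups.
# Groups that share a cache_key differ only by their block table, so a
# full rebuild is wasted work for every group after the first.
from dataclasses import dataclass, replace


@dataclass
class AttnMetadata:
    block_table: list  # the only field that differs between cached groups
    seq_lens: list     # everything else is shared across groups


def build_for_groups(groups, build_fn):
    """groups: list of (cache_key, block_table) pairs.

    Build metadata once per distinct cache_key via build_fn (expensive),
    then produce subsequent groups with a lightweight block-table swap.
    """
    cached = {}
    out = []
    for cache_key, block_table in groups:
        md = cached.get(cache_key)
        if md is None:
            md = build_fn(block_table)                  # full build
            cached[cache_key] = md
        else:
            md = replace(md, block_table=block_table)   # cheap update
        out.append(md)
    return out
```

For a Llama4-style 3:1 pattern (three local chunked-attention groups per full-attention group), this turns four full builds into two builds plus two cheap updates per step.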
tdoublep left a comment:

Looks great, thanks for doing this - left one nit.
vllm/v1/worker/gpu_model_runner.py (outdated):

        common_attn_metadata=common_attn_metadata,
        **extra_attn_metadata_args,
    )
    cached_attn_metadata[cache_key] = attn_metadata_i
nit: should we do it all the time, or also condition it on builder.supports_update_block_table?
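One way to address this nit is to populate (and consult) the cache only when the builder advertises lightweight updates. A hedged sketch: the `Builder` class, dict-based metadata, and `get_metadata` helper are invented for illustration; only the `supports_update_block_table` flag and the `update_block_table` call come from the comment above, and the merged code may gate differently.

```python
# Sketch: only cache metadata when the builder can do lightweight
# block-table updates; otherwise the cache entry would never be reused.
class Builder:
    supports_update_block_table = False

    def build(self, key):
        # Stand-in for the expensive full metadata build.
        return {"key": key, "block_table": None}

    def update_block_table(self, metadata, block_table):
        # Stand-in for the cheap path: copy and swap only the block table.
        new = dict(metadata)
        new["block_table"] = block_table
        return new


def get_metadata(builder, cache, cache_key, block_table):
    if builder.supports_update_block_table and cache_key in cache:
        return builder.update_block_table(cache[cache_key], block_table)
    metadata = builder.build(cache_key)
    metadata["block_table"] = block_table
    if builder.supports_update_block_table:
        cache[cache_key] = metadata
    return metadata
```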
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…oups (vllm-project#29627) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@LucasWilkinson is there a reason not to add this support for Mamba1? I have a PR which should consolidate most of the common logic between mamba1 and mamba2, so features like this won't be overlooked. I can add this logic there.
@Josephasafg Yes, fully agreed on that. Will prioritize reviewing + merging your cleanup PR.
One interesting observation I made while adding this: after some debugging, I noticed that the Mamba2 model used in the hybrid APC tests is … When I replaced the model in the APC test with … @LucasWilkinson, do you have any insight into why this might be happening? cc: @tdoublep
Can you give me a repro command? Happy to help out.

@LucasWilkinson Thanks.
@LucasWilkinson have you managed to reproduce it? |
Scaled-back version of #22788 and generalization of #29444.
Depends on #29628; land that first.
Test Plan
Decode-heavy workload with APC enabled => latency reduction through this PR: 13.7%
Decode-heavy workload with APC disabled => latency reduction through this PR: 2.4%
Test Result

APC enabled: before vs. this PR (benchmark output omitted)

APC disabled: before vs. this PR (benchmark output omitted)
Co-authored-by: @s3woz (original PR: #29444)