[Hybrid] Marconi-style admission policy for hybrid cache by s3woz · Pull Request #37898 · vllm-project/vllm

s3woz · 2026-03-23T15:05:46Z

Purpose

This PR implements a Marconi-style admission policy for the hybrid cache used by, e.g., Qwen models. The Marconi paper proposes an effective cache admission policy that caches state in two cases:

Last state - important for chat workloads
Shared prefix (cache after the 2nd hit: observe prompts, if a shared prefix is detected, then cache it) - important for system prompts, instructions, few-shot examples, self-consistency, long-document Q&A, RL workloads
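
The two admission cases above can be illustrated with a toy sketch. This is not the Marconi implementation (which uses a radix tree) and not vLLM code; the class `ToyAdmissionPolicy` and its method names are hypothetical, and prefixes are modeled as plain token tuples purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAdmissionPolicy:
    """Toy model of the two admission cases; not vLLM or Marconi code."""

    # Prefixes (as token tuples) that were observed once but are not cached.
    seen_once: set = field(default_factory=set)
    # Prefixes whose state has been admitted to the cache.
    cached: set = field(default_factory=set)

    def on_request_finished(self, tokens: tuple) -> None:
        # Case 1: always admit the last state (helps multi-turn chat).
        self.cached.add(tokens)

    def on_prefix_observed(self, prefix: tuple) -> bool:
        """Return True if the prefix is admitted on this observation."""
        if prefix in self.cached:
            return False  # already cached, nothing to admit
        if prefix in self.seen_once:
            # Case 2: a second sighting means the prefix is shared, so admit it.
            self.seen_once.discard(prefix)
            self.cached.add(prefix)
            return True
        self.seen_once.add(prefix)  # first sighting: only remember it
        return False
```

In this sketch a prefix is only admitted the second time it is observed, so one-off prompts never occupy cache space, while the final state of every finished request is always kept.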

Logic before this PR:

Only the state at the last block-aligned position is cached, which conceptually corresponds to last-state caching.

Logic with this PR:

Additionally enable shared-prefix caching.
The implementation in this PR differs from Marconi, which uses a radix tree:
- The existing regular attention KVCache is used to detect a shared prefix: a non-cached shared prefix exists if the standard attention KVCache has hits but the SSM cache does not, i.e. the SSM cache lags behind the KVCache.
- If a shared prefix is detected, it is cached at the last block-aligned position (to avoid any kernel adjustments and to minimize the impact on vLLM logic).
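
The detection rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the function name `shared_prefix_checkpoint` and its parameters are hypothetical and do not correspond to the actual scheduler or cache coordinator code in this PR.

```python
from typing import Optional


def shared_prefix_checkpoint(
    attn_hit_tokens: int, ssm_hit_tokens: int, block_size: int
) -> Optional[int]:
    """Return the token position at which to cache the shared prefix.

    Hypothetical helper: attn_hit_tokens is the number of prompt tokens that
    hit in the regular attention KV cache, ssm_hit_tokens the number covered
    by a cached SSM state.
    """
    if attn_hit_tokens <= ssm_hit_tokens:
        # The SSM cache does not lag behind: no uncached shared prefix.
        return None
    # Round the attention hit length down to a block boundary.
    aligned = (attn_hit_tokens // block_size) * block_size
    # Only worth caching if it extends beyond what the SSM cache already has.
    return aligned if aligned > ssm_hit_tokens else None
```

For example, with a block size of 16, 100 attention hit tokens, and no SSM hit, the sketch picks position 96 as the checkpoint; when both caches agree, it returns `None`.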

A synthetic test below demonstrates:

Cache hits for the shared prefix with this PR, and none on main.
With APC enabled, main is up to 40% (Qwen/Qwen3.5-0.8B) or 66% (Qwen/Qwen3.5-35B-A3B) slower than this PR in terms of latency.
With APC disabled, there is no noticeable overhead.

@tdoublep @bohnstingl

Test Plan

if __name__ == "__main__":
    from vllm import LLM, SamplingParams
    from vllm.distributed import cleanup_dist_env_and_memory
    import time, string
    MODEL = "Qwen/Qwen3.5-0.8B"
    sampling_params = SamplingParams(temperature=0.0)
    prefix1 = ( # examples/offline_inference/prefix_caching.py
        "You are an expert school principal, skilled in effectively managing "
        "faculty and staff. Draft 10-15 questions for a potential first grade "
        "Head Teacher for my K-12, all-girls', independent school that emphasizes "
        "community, joyful discovery, and life-long learning. The candidate is "
        "coming in for a first-round panel interview for a 8th grade Math "
        "teaching role. They have 5 years of previous teaching experience "
        "as an assistant teacher at a co-ed, public school with experience "
        "in middle school math teaching. ")
    prefix2 = ("Based on these information, fulfill "
                "the following paragraph: ")
    # Long context
    SHARED_LONG_PREFIX_MULTIPLE = 150
    prefix = SHARED_LONG_PREFIX_MULTIPLE * prefix1
    # How many times to diverge after Long context above
    DIVERGENT_PROMPTS = 20    
    # Moderately-sized prompt that spans more than 1 cache block
    MULTIPLE = 15
    prompt = MULTIPLE * prefix1 + prefix2 + ("Hello, my name is")

    for APC in [True, False]:
        engine = LLM(model=MODEL, enable_prefix_caching=APC, 
            gpu_memory_utilization=0.4, disable_log_stats=False)
        # Initial prompt
        outputs = engine.generate(prefix + prompt, sampling_params)
        # Measure diverging prompts:
        start_time = time.time()
        for i in range(DIVERGENT_PROMPTS):
            divergence = (' ' + string.ascii_letters[i] + ' ')
            outputs = engine.generate(prefix + divergence + prompt, sampling_params)
            # print(f"Generated text: {outputs[0].outputs[0].text!r}")
        total_time = time.time() - start_time
        # Summary
        print('Execution with APC:', APC, "took --- %s seconds ---" % total_time)
        for m in engine.llm_engine.get_metrics():
            if 'vllm:prompt_tokens_cached' in m.name:
                print(m.name, m.value)
        del engine
        cleanup_dist_env_and_memory()

Test Result

Main (Qwen/Qwen3.5-0.8B):

Execution with APC: True took --- 3.2461838722229004 seconds ---
vllm:prompt_tokens_cached 0
Execution with APC: False took --- 3.215266704559326 seconds ---
vllm:prompt_tokens_cached 0

This PR (Qwen/Qwen3.5-0.8B):

Execution with APC: True took --- 2.3055942058563232 seconds ---
vllm:prompt_tokens_cached 289408
Execution with APC: False took --- 3.2060539722442627 seconds ---
vllm:prompt_tokens_cached 0

This PR (for Qwen/Qwen3.5-35B-A3B):

Execution with APC: True took --- 5.365646839141846 seconds ---
vllm:prompt_tokens_cached 280896
Execution with APC: False took --- 8.88869309425354 seconds ---
vllm:prompt_tokens_cached 0
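
As a rough check of the quoted percentages, the slowdown can be recomputed from the timings listed above. The helper `slowdown_percent` is a hypothetical name. Since no main run is listed for Qwen/Qwen3.5-35B-A3B, this sketch assumes the APC-disabled run of this PR stands in for main there.

```python
def slowdown_percent(baseline_s: float, improved_s: float) -> float:
    """Percentage by which the baseline is slower than the improved run."""
    return (baseline_s / improved_s - 1.0) * 100.0


# Main with APC enabled vs. this PR with APC enabled (Qwen/Qwen3.5-0.8B).
small = slowdown_percent(3.2461838722229004, 2.3055942058563232)

# This PR with APC disabled (assumed stand-in for main) vs. APC enabled
# (Qwen/Qwen3.5-35B-A3B).
large = slowdown_percent(8.88869309425354, 5.365646839141846)

print(f"Qwen/Qwen3.5-0.8B: baseline is {small:.1f}% slower")
print(f"Qwen/Qwen3.5-35B-A3B: baseline is {large:.1f}% slower")
```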

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

gemini-code-assist

Code Review

This pull request introduces a Marconi-style admission policy for the hybrid cache, which is a valuable performance optimization. The implementation correctly identifies shared prefixes by comparing cache hits between different attention mechanisms and forces caching at divergence points. However, I've identified a few areas where the code could be made more robust and maintainable. Specifically, there's a potential for a critical UnboundLocalError, some confusing variable reuse that hinders readability, and a piece of code with an uncertain assertion that could lead to subtle bugs. Addressing these points will improve the overall quality and stability of this new feature.

tdoublep · 2026-03-26T18:13:44Z

cc @peakcrosser7 @heheda12345

This is how we can catch system prompts using align mode

mergify · 2026-04-22T15:48:54Z

Hi @s3woz, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

tdoublep

LGTM but would like @heheda12345 to also review

s3woz · 2026-06-02T20:54:36Z

Re: @QilaiZhang: "@s3woz @heheda12345 Hi! Is there any progress on this PR?"

I've just fixed a merge conflict. Otherwise, waiting for feedback / reviews. FYI: @tdoublep

mergify · 2026-06-08T04:56:29Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @s3woz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

heheda12345

LGTM! Thanks for this improvement. Please simplify the comments to only include necessary ones.

mergify · 2026-06-09T07:39:33Z

Hi @s3woz, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

s3woz added 2 commits March 20, 2026 09:01

Marconi admission policy for hybrid cache.

ad830a2

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Cleanup.

d250f8d

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

mergify Bot added the v1 label Mar 23, 2026

gemini-code-assist Bot reviewed Mar 23, 2026

Comment thread vllm/v1/core/sched/scheduler.py Outdated

Comment thread vllm/v1/core/sched/scheduler.py Outdated

Comment thread vllm/v1/core/sched/scheduler.py Outdated

s3woz added 2 commits March 26, 2026 13:51

Feedback from Gemini

54eedc2

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Pre-commit fixes

72809a4

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

mergify Bot added the intel-gpu Related to Intel GPU label Mar 27, 2026

s3woz marked this pull request as ready for review April 15, 2026 16:29

s3woz requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners April 15, 2026 16:29

tdoublep reviewed Apr 17, 2026

Comment thread vllm/v1/core/sched/scheduler.py Outdated

Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated

Comment thread vllm/v1/core/sched/scheduler.py

Comment thread vllm/v1/core/sched/scheduler.py Outdated

yannicks1 reviewed Apr 21, 2026

Comment thread vllm/v1/core/sched/scheduler.py

Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated

Comment thread vllm/v1/core/kv_cache_coordinator.py

Comment thread vllm/v1/core/sched/scheduler.py Outdated

s3woz added 2 commits April 22, 2026 09:58

Small naming changes

e67adb2

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Cleaner version returning common prefix from cache coordinator

8ae4013

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

s3woz and others added 4 commits April 24, 2026 00:10

Merge branch 'vllm-project:main' into hybrid_cache_optimization

0bc5430

Pre-commit fixes

0e3af1d

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Adapt tests to pass with new return structure

9cc01e2

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Test for Cache Coordinator and Scheduler

f913446

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

tdoublep approved these changes May 8, 2026

tdoublep mentioned this pull request May 11, 2026

[Bug]: Prefix cache align-mode has a 0% cache hit rate for Qwen3.6-35B-A3B #42317

Closed

1 task

heheda12345 reviewed May 13, 2026

Comment thread vllm/v1/core/sched/scheduler.py

NickLucche mentioned this pull request May 13, 2026

[PD][Nixl] Mamba prefix caching mode support #42554

Merged

Merge

5a0cd8f

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

NeoKactus added a commit to NeoKactus/vllm that referenced this pull request Jun 3, 2026

[Hybrid] Marconi-style admission policy for hybrid cache (vllm-projec…

069d911

…t#37898) Co-authored-by: s3woz <stw@zurich.ibm.com>

mergify Bot added the needs-rebase label Jun 8, 2026

s3woz added 3 commits June 8, 2026 03:31

Changed to attribute per tdoublep's suggestion

ea7bbad

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Small fixes.

73830b5

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Merge branch 'main' into hybrid_cache_optimization

366acc2

mergify Bot removed the needs-rebase label Jun 8, 2026

heheda12345 approved these changes Jun 9, 2026

s3woz added 2 commits June 9, 2026 03:32

Less verbose comments

376ade3

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Less verbose comments

feb7563

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

tdoublep added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2026

tdoublep enabled auto-merge (squash) June 9, 2026 07:39

depthfirst-app Bot reviewed Jun 9, 2026

Comment thread vllm/v1/core/sched/scheduler.py

Initialize value

99640b3

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

auto-merge was automatically disabled June 9, 2026 07:53
Head branch was pushed to by a user without write access

tdoublep enabled auto-merge (squash) June 9, 2026 08:03

This was referenced Jun 9, 2026

[Tracking Issue]: Prefix Caching for Hybrid Models #26201

Open

[V1][Hybrid] GatedDeltaNet Automatic Prefix Caching (all-mode) #26807

Open

vllm-bot merged commit dc66e01 into vllm-project:main Jun 10, 2026
62 of 65 checks passed

github-project-automation Bot moved this from In progress to Done in Qwen3.5 Jun 10, 2026

anishesg mentioned this pull request Jun 10, 2026

[Bugfix] Fix scheduling deadlock in _mamba_block_aligned_split with large multimodal inputs #40709

Open

wcynb1023 pushed a commit to wcynb1023/vllm that referenced this pull request Jun 11, 2026

[Hybrid] Marconi-style admission policy for hybrid cache (vllm-projec…

27e18d1

…t#37898) Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Sahil170595 mentioned this pull request Jun 11, 2026

[Bug]: Hybrid-model prefix caching silently drops to 0% when the align-mode Mamba checkpoint lands in request-unique tokens #45238

Open

tdoublep mentioned this pull request Jun 12, 2026

[Model Runner V2] support mamba hybrid models align prefix cache #42406

Open

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[Hybrid] Marconi-style admission policy for hybrid cache (vllm-projec…

c6765c1

…t#37898) Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

peakcrosser7 mentioned this pull request Jun 14, 2026

[Hybrid] Map multiple FullAttn layers to a single page #35703

Open

5 tasks
