
[Hybrid] Marconi-style admission policy for hybrid cache#37898

Merged
vllm-bot merged 18 commits into
vllm-project:mainfrom
s3woz:hybrid_cache_optimization
Jun 10, 2026

Conversation

@s3woz

@s3woz s3woz commented Mar 23, 2026

Contributor

Purpose

This PR implements a Marconi-style admission policy for the hybrid cache used, for example, by Qwen models. The Marconi paper proposes an effective cache admission policy that caches state in two cases:

  1. Last state - important for chat workloads
  2. Shared prefix (cache after the 2nd hit: observe prompts, if a shared prefix is detected, then cache it) - important for system prompts, instructions, few-shot examples, self-consistency, long-document Q&A, RL workloads
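The two admission cases can be illustrated with a toy policy. This is only a sketch under stated assumptions: the class name `ToyAdmissionPolicy`, the dictionary of seen prefixes, and the block size are hypothetical and stand in for Marconi's radix tree; none of it is taken from the paper's or vLLM's code.

```python
from collections import defaultdict


class ToyAdmissionPolicy:
    """Illustrative only: picks the prefix lengths whose state is worth caching."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # Number of earlier prompts that contained each block-aligned prefix.
        self.seen = defaultdict(int)

    def admit(self, tokens: list) -> set:
        """Return the prefix lengths (in tokens) to cache for this prompt."""
        admitted = set()
        # Case 1: always keep the last block-aligned state (chat continuation).
        last_aligned = (len(tokens) // self.block_size) * self.block_size
        if last_aligned > 0:
            admitted.add(last_aligned)
        # Case 2: keep the longest block-aligned prefix that an earlier prompt
        # also contained, i.e. a prefix observed for the second time.
        longest_shared = 0
        for end in range(self.block_size, last_aligned + 1, self.block_size):
            key = tuple(tokens[:end])
            if self.seen[key] >= 1:
                longest_shared = end
            self.seen[key] += 1
        if longest_shared > 0:
            admitted.add(longest_shared)
        return admitted


policy = ToyAdmissionPolicy(block_size=4)
first = policy.admit(list(range(10)))
second = policy.admit([0, 1, 2, 3, 99, 98, 97, 96, 5])
print(first, second)
```

The first prompt only admits its last state; the second prompt shares one block with the first, so that shared block is admitted in addition to its own last state. Storing every prefix as a tuple is quadratic in prompt length and is only acceptable for a toy.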

Logic before this PR:

  • Only the state at the last block-aligned position is cached, which conceptually corresponds to last-state caching.

Logic with this PR:

  • Additionally enable shared-prefix caching.
  • The implementation in this PR differs from Marconi, which uses a radix tree:
    • Here, the existing regular attention KV cache is used to detect a shared prefix: a shared prefix that is not yet cached exists if the standard attention KV cache has hits but the SSM layers do not, i.e. the SSM cache lags behind the KV cache.
    • If a shared prefix is detected, it is cached at the last block-aligned position, which avoids any kernel adjustments and minimizes the impact on vLLM logic.
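The detection rule described above can be sketched as a small helper. This is a minimal sketch, not the scheduler code from this PR: the function name `shared_prefix_cache_position` and its parameters are hypothetical, and hit lengths are assumed to be given in tokens.

```python
from typing import Optional


def shared_prefix_cache_position(
    attn_hit_tokens: int,
    ssm_hit_tokens: int,
    block_size: int,
) -> Optional[int]:
    """Return the token position at which to cache SSM state, or None.

    Hypothetical helper: a shared prefix is assumed to exist when the
    attention KV cache reports more cached tokens than the SSM cache does.
    """
    if attn_hit_tokens <= ssm_hit_tokens:
        # The SSM cache is not lagging, so there is nothing new to admit.
        return None
    # Round down to the last block-aligned position inside the shared prefix.
    aligned = (attn_hit_tokens // block_size) * block_size
    if aligned <= ssm_hit_tokens:
        # The aligned position is already covered by the SSM cache.
        return None
    return aligned


print(shared_prefix_cache_position(1030, 0, 16))
print(shared_prefix_cache_position(512, 512, 16))
```

Rounding down to a block boundary mirrors the stated design choice of caching only at block-aligned positions, so no kernel has to write state at an arbitrary token.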

A synthetic test below demonstrates:

  • Cache hits for the shared prefix with this PR, and none on main.
  • With automatic prefix caching (APC) enabled, main is up to 40% (Qwen/Qwen3.5-0.8B) or 66% (Qwen/Qwen3.5-35B-A3B) slower than this PR in terms of latency.
  • With APC disabled, there is no noticeable overhead.

@tdoublep @bohnstingl

Test Plan

if __name__ == "__main__":
    from vllm import LLM, SamplingParams
    from vllm.distributed import cleanup_dist_env_and_memory
    import time, string
    MODEL = "Qwen/Qwen3.5-0.8B"
    sampling_params = SamplingParams(temperature=0.0)
    prefix1 = ( # examples/offline_inference/prefix_caching.py
        "You are an expert school principal, skilled in effectively managing "
        "faculty and staff. Draft 10-15 questions for a potential first grade "
        "Head Teacher for my K-12, all-girls', independent school that emphasizes "
        "community, joyful discovery, and life-long learning. The candidate is "
        "coming in for a first-round panel interview for a 8th grade Math "
        "teaching role. They have 5 years of previous teaching experience "
        "as an assistant teacher at a co-ed, public school with experience "
        "in middle school math teaching. ")
    prefix2 = ("Based on these information, fulfill "
                "the following paragraph: ")
    # Long context
    SHARED_LONG_PREFIX_MULTIPLE = 150
    prefix = SHARED_LONG_PREFIX_MULTIPLE * prefix1
    # How many times to diverge after Long context above
    DIVERGENT_PROMPTS = 20    
    # Moderately-sized prompt that spans more than 1 cache block
    MULTIPLE = 15
    prompt = MULTIPLE * prefix1 + prefix2 + ("Hello, my name is")

    for APC in [True, False]:
        engine = LLM(model=MODEL, enable_prefix_caching=APC, 
            gpu_memory_utilization=0.4, disable_log_stats=False)
        # Initial prompt
        outputs = engine.generate(prefix + prompt, sampling_params)
        # Measure diverging prompts:
        start_time = time.time()
        for i in range(DIVERGENT_PROMPTS):
            divergence = (' ' + string.ascii_letters[i] + ' ')
            outputs = engine.generate(prefix + divergence + prompt, sampling_params)
            # print(f"Generated text: {outputs[0].outputs[0].text!r}")
        total_time = time.time() - start_time
        # Summary
        print('Execution with APC:', APC, "took --- %s seconds ---" % total_time)
        for m in engine.llm_engine.get_metrics():
            if 'vllm:prompt_tokens_cached' in m.name:
                print(m.name, m.value)
        del engine
        cleanup_dist_env_and_memory()

Test Result

Main (Qwen/Qwen3.5-0.8B):

Execution with APC: True took --- 3.2461838722229004 seconds ---
vllm:prompt_tokens_cached 0
Execution with APC: False took --- 3.215266704559326 seconds ---
vllm:prompt_tokens_cached 0

This PR (Qwen/Qwen3.5-0.8B):

Execution with APC: True took --- 2.3055942058563232 seconds ---
vllm:prompt_tokens_cached 289408
Execution with APC: False took --- 3.2060539722442627 seconds ---
vllm:prompt_tokens_cached 0

This PR (for Qwen/Qwen3.5-35B-A3B):

Execution with APC: True took --- 5.365646839141846 seconds ---
vllm:prompt_tokens_cached 280896
Execution with APC: False took --- 8.88869309425354 seconds ---
vllm:prompt_tokens_cached 0
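
The 40% and 66% figures quoted in the Purpose section can be reproduced from the timings above. The sketch below assumes that, for Qwen/Qwen3.5-35B-A3B, the APC-disabled run of this PR stands in for main, since no main run is listed for that model; the helper name `slowdown_pct` is hypothetical.

```python
def slowdown_pct(baseline_s: float, improved_s: float) -> float:
    """Percentage by which the baseline is slower than the improved run."""
    return (baseline_s / improved_s - 1.0) * 100.0


# Qwen/Qwen3.5-0.8B: main with APC vs. this PR with APC.
small = slowdown_pct(3.2461838722229004, 2.3055942058563232)
# Qwen/Qwen3.5-35B-A3B: this PR without APC (assumed proxy for main)
# vs. this PR with APC.
large = slowdown_pct(8.88869309425354, 5.365646839141846)
print(f"0.8B: {small:.1f}%  35B-A3B: {large:.1f}%")
```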

s3woz added 2 commits March 20, 2026 09:01
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@mergify mergify Bot added the v1 label Mar 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a Marconi-style admission policy for the hybrid cache, which is a valuable performance optimization. The implementation correctly identifies shared prefixes by comparing cache hits between different attention mechanisms and forces caching at divergence points. However, I've identified a few areas where the code could be made more robust and maintainable. Specifically, there's a potential for a critical UnboundLocalError, some confusing variable reuse that hinders readability, and a piece of code with an uncertain assertion that could lead to subtle bugs. Addressing these points will improve the overall quality and stability of this new feature.

Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py Outdated
s3woz added 2 commits March 26, 2026 13:51
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@tdoublep

Member

cc @peakcrosser7 @heheda12345

This is how we can catch system prompts using align mode

@mergify mergify Bot added the intel-gpu Related to Intel GPU label Mar 27, 2026
@s3woz s3woz marked this pull request as ready for review April 15, 2026 16:29
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py
Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated
Comment thread vllm/v1/core/kv_cache_coordinator.py
Comment thread vllm/v1/core/sched/scheduler.py Outdated
s3woz added 2 commits April 22, 2026 09:58
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@mergify

mergify Bot commented Apr 22, 2026

Contributor

Hi @s3woz, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

s3woz and others added 4 commits April 24, 2026 00:10
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

@tdoublep tdoublep left a comment

Member

LGTM but would like @heheda12345 to also review

Comment thread vllm/v1/core/sched/scheduler.py
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@s3woz

s3woz commented Jun 2, 2026

Contributor Author

Re: @QilaiZhang :

@s3woz @heheda12345 Hi! Is there any progress on this PR?

I've just fixed a merge conflict. Otherwise, waiting for feedback / reviews. FYI: @tdoublep

NeoKactus added a commit to NeoKactus/vllm that referenced this pull request Jun 3, 2026
@mergify

mergify Bot commented Jun 8, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @s3woz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 8, 2026
s3woz added 3 commits June 8, 2026 03:31
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@mergify mergify Bot removed the needs-rebase label Jun 8, 2026
NeoKactus added a commit to NeoKactus/vllm that referenced this pull request Jun 9, 2026
…m-project#37898)

Implement Marconi-style cache admission policy for hybrid cache.
Caches last state and shared prefixes for Qwen/Hybrid models.

Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>

@heheda12345 heheda12345 left a comment

Collaborator

LGTM! Thanks for this improvement. Please simplify the comments to only include necessary ones.

s3woz added 2 commits June 9, 2026 03:32
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
@tdoublep tdoublep added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2026
@tdoublep tdoublep enabled auto-merge (squash) June 9, 2026 07:39
@mergify

mergify Bot commented Jun 9, 2026

Contributor

Hi @s3woz, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Comment thread vllm/v1/core/sched/scheduler.py
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
auto-merge was automatically disabled June 9, 2026 07:53

Head branch was pushed to by a user without write access

@tdoublep tdoublep enabled auto-merge (squash) June 9, 2026 08:03
@vllm-bot vllm-bot merged commit dc66e01 into vllm-project:main Jun 10, 2026
62 of 65 checks passed
@github-project-automation github-project-automation Bot moved this from In progress to Done in Qwen3.5 Jun 10, 2026
wcynb1023 pushed a commit to wcynb1023/vllm that referenced this pull request Jun 11, 2026
…t#37898)

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…t#37898)

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

Labels

  • intel-gpu: Related to Intel GPU
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • v1

Projects

Status: Done

8 participants