[Hybrid] Marconi-style admission policy for hybrid cache #37898
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Code Review
This pull request introduces a Marconi-style admission policy for the hybrid cache, which is a valuable performance optimization. The implementation correctly identifies shared prefixes by comparing cache hits between different attention mechanisms and forces caching at divergence points. However, I've identified a few areas where the code could be made more robust and maintainable. Specifically, there's a potential for a critical UnboundLocalError, some confusing variable reuse that hinders readability, and a piece of code with an uncertain assertion that could lead to subtle bugs. Addressing these points will improve the overall quality and stability of this new feature.
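The shared-prefix detection described in this review can be illustrated with a minimal sketch. It assumes that attention KV blocks are cached at every block boundary while recurrent (linear-attention or Mamba-style) state is cached only at selected positions, so a longer attention hit suggests an earlier request shared the prefix and then diverged. The function name and parameters are hypothetical and are not taken from the PR or from vLLM.

```python
from typing import Optional


def forced_checkpoint_position(
    attn_hit_tokens: int,
    state_hit_tokens: int,
    block_size: int,
) -> Optional[int]:
    """Return a token position at which to force-cache recurrent state.

    Hypothetical helper: names and signature are illustrative only.
    Returns None when no divergence point is detected.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Round the attention hit down to a whole number of blocks, since
    # state can only be stored at block boundaries in this sketch.
    aligned_attn_hit = (attn_hit_tokens // block_size) * block_size

    # Attention matched further than any cached recurrent state: the
    # extra matched span is a prefix shared with an earlier request.
    if aligned_attn_hit > state_hit_tokens:
        return aligned_attn_hit
    return None
```

For example, with a block size of 16, an attention hit of 50 tokens and a recurrent-state hit of 16 tokens would yield a forced checkpoint at token 48, while equal hit lengths would yield no forced checkpoint.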
This is how we can catch system prompts using align mode.
Hi @s3woz, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
tdoublep left a comment
LGTM but would like @heheda12345 to also review
Re: @QilaiZhang :
I've just fixed a merge conflict. Otherwise, waiting for feedback / reviews. FYI: @tdoublep
…t#37898) Co-authored-by: s3woz <stw@zurich.ibm.com>
This pull request has merge conflicts that must be resolved before it can be merged.
…m-project#37898) Implement Marconi-style cache admission policy for hybrid cache. Caches last state and shared prefixes for Qwen/Hybrid models. Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
heheda12345 left a comment
LGTM! Thanks for this improvement. Please simplify the comments to only include necessary ones.
Head branch was pushed to by a user without write access
…t#37898) Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Purpose
This PR implements a Marconi-style admission policy for the hybrid cache used, e.g., for Qwen models. The Marconi paper proposes an effective cache admission policy that caches in two cases; as implemented here, these are the last state and shared prefixes.
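As a rough illustration of such a two-case admission policy, the sketch below admits a checkpoint at the end of a sequence and at the point where it diverges from a previously seen sequence. It is a simplification under stated assumptions: it scans a plain list of earlier sequences rather than a prefix tree, it ignores block alignment, and `common_prefix_len` and `admit_positions` are hypothetical names, not functions from the PR.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def admit_positions(
    tokens: Sequence[int],
    history: List[Sequence[int]],
) -> List[int]:
    """Token counts at which state would be admitted to the cache.

    Hypothetical sketch of a two-case policy:
      1. the last state (end of the sequence), and
      2. the end of a prefix shared with an earlier sequence.
    """
    n = len(tokens)
    longest_shared = max(
        (common_prefix_len(tokens, old) for old in history),
        default=0,
    )

    positions = set()
    # Case 2: a strict, non-empty shared prefix marks a divergence point.
    if 0 < longest_shared < n:
        positions.add(longest_shared)
    # Case 1: always keep the final state of a non-empty sequence.
    if n > 0:
        positions.add(n)
    return sorted(positions)


history = [[1, 2, 3, 4, 5]]
print(admit_positions([1, 2, 3, 9, 9, 9], history))  # [3, 6]
```

A request that shares no prefix with earlier requests would only get its last state admitted, which avoids storing intermediate states that are unlikely to be reused.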
Logic before this PR:
Logic with this PR:
A synthetic test below demonstrates:
Qwen/Qwen3.5-0.8B) or 66% (Qwen/Qwen3.5-35B-A3B) in terms of latency over the proposed PR.
@tdoublep @bohnstingl
Test Plan
Test Result
Main (Qwen/Qwen3.5-0.8B):
This PR (Qwen/Qwen3.5-0.8B):
This PR (for Qwen/Qwen3.5-35B-A3B):