🚀 The feature, motivation and pitch
This paper might be of interest: https://arxiv.org/pdf/2306.14048.pdf
The paper shows that evicting a small portion of the KV cache does not noticeably affect results while improving memory efficiency: with a KV cache budget of only 20% heavy hitters (H2), throughput increases by up to 29x and latency drops by up to 1.9x.
When attention scores are computed, a small fraction of tokens contributes most of the attention value. The paper proposes the Heavy Hitter Oracle (H2O), a KV cache eviction strategy that dynamically balances the retention of recent tokens and H2 tokens, framing KV cache eviction as a dynamic submodular problem.
Trade-off:
Evicting KV cache entries deemed unimportant may raise accuracy concerns, but it reduces memory usage and thereby improves throughput.
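A minimal sketch of the eviction rule described above (not vLLM code; function and variable names are illustrative): keep the most recent tokens plus the tokens with the largest accumulated attention scores, and evict everything else.

```python
import torch

def h2o_keep_mask(attn_scores_sum: torch.Tensor,
                  num_heavy: int,
                  num_recent: int) -> torch.Tensor:
    """Select which cached positions to keep under an H2O-style budget.

    attn_scores_sum: [seq_len] accumulated attention each cached token has
        received so far (summed over query steps and heads).
    Returns a boolean mask of positions to retain: the `num_recent` newest
    tokens plus the `num_heavy` tokens with the largest accumulated scores.
    """
    seq_len = attn_scores_sum.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)

    # Always keep the most recent tokens (the local window).
    keep[max(0, seq_len - num_recent):] = True

    # Among the older tokens, keep the heavy hitters: those with the
    # highest accumulated attention score.
    older = attn_scores_sum.clone()
    older[keep] = float("-inf")  # exclude already-kept recent tokens
    if num_heavy > 0 and seq_len > num_recent:
        k = min(num_heavy, int((~keep).sum().item()))
        heavy_idx = older.topk(k).indices
        keep[heavy_idx] = True
    return keep
```

At each decoding step one would accumulate the new query's attention probabilities into `attn_scores_sum` and call this whenever the cache exceeds its budget.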
@simon-mo Is this a feature you'd like to see implemented?
Alternatives
No response
Additional context
No response
H2O needs the attention scores to decide which entries to evict. In both prompt processing (xops.memory_efficient_attention_forward, vLLM 0.2.7) and the decoding phase, it is not easy to get the attention scores out. Any ideas?
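One possible workaround, as a sketch only (not an existing vLLM hook): during decoding, recompute the new query's attention over the cached keys outside the fused kernel. It costs roughly O(seq_len x head_dim) per step and yields the probabilities needed to update the H2O statistics.

```python
import torch

def decode_step_attn_scores(q: torch.Tensor, k_cache: torch.Tensor) -> torch.Tensor:
    """Recompute attention probabilities for one decoding step.

    q:       [num_heads, head_dim]        query of the newly generated token
    k_cache: [seq_len, num_heads, head_dim] cached keys
    Returns: [seq_len] attention mass per cached position, summed over heads,
             suitable for accumulating into running H2O statistics.
    """
    head_dim = q.shape[-1]
    # [num_heads, seq_len]: per-head scores of the new query against every cached key
    logits = torch.einsum("hd,shd->hs", q, k_cache) / head_dim ** 0.5
    probs = logits.softmax(dim=-1)
    return probs.sum(dim=0)
```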
Very interesting idea! The paper uses wiki-text-103 as the test dataset to reach its conclusion; I suspect the conclusion might be different on a dataset of pure mathematical formulas.
There's a similar feature request for StreamingLLM, issue here.
Meanwhile, FastGen and Scissorhands are also highly related KV cache compression methods.
It would be better if the implementation design could be more general, e.g., flexible head-wise/layer-wise KV cache management. @simon-mo @chizhang118 @WoosukKwon
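For example, a pluggable policy interface along these lines (purely illustrative, not an existing vLLM API) could let H2O, StreamingLLM, Scissorhands, or FastGen sit behind the same hooks, with per-layer and per-head budgets:

```python
from abc import ABC, abstractmethod
import torch

class KVCacheEvictionPolicy(ABC):
    """Illustrative interface for pluggable KV cache compression.

    A policy observes per-layer attention statistics during decoding and
    decides, per layer and per head, which cache slots to retain.
    """

    @abstractmethod
    def update(self, layer: int, attn_probs: torch.Tensor) -> None:
        """Accumulate statistics from one decoding step.

        attn_probs: [num_heads, seq_len] attention over cached positions.
        """

    @abstractmethod
    def keep_indices(self, layer: int, budget: int) -> torch.Tensor:
        """Return [num_heads, budget] indices of slots to keep in this layer,
        allowing different heads (and layers) to retain different tokens."""
```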
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!