
[Feature]: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models #3532

Open
Tracked by #2
chizhang118 opened this issue Mar 20, 2024 · 5 comments

Comments

@chizhang118

🚀 The feature, motivation and pitch

This paper might be of interest: https://arxiv.org/pdf/2306.14048.pdf

The paper shows that evicting a small portion of the KV cache barely affects output quality while substantially improving memory efficiency: with a KV cache budget of only 20% Heavy Hitters (H2), it reports throughput gains of up to 29x and latency reductions of up to 1.9x.
When computing attention, a small portion of the tokens contributes the majority of the attention score mass. The paper proposes the Heavy-Hitter Oracle (H2O), a KV cache eviction policy that dynamically balances retention between recent tokens and H2 tokens, and frames KV cache eviction as a dynamic submodular problem.
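For concreteness, here is a minimal sketch of the greedy eviction rule described above (not vLLM code; the function and tensor names are illustrative): keep a window of recent tokens plus the positions with the highest accumulated attention scores, and evict the rest.

```python
import torch

def h2o_keep_mask(acc_attn: torch.Tensor,
                  heavy_budget: int,
                  recent_budget: int) -> torch.Tensor:
    """Return a boolean mask over cached positions to keep.

    acc_attn:      [seq_len] accumulated attention each cached token has
                   received so far.
    heavy_budget:  number of heavy-hitter (H2) tokens to retain.
    recent_budget: number of most recent tokens to always retain.
    """
    seq_len = acc_attn.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)

    # Always keep the local window of most recent tokens.
    keep[max(0, seq_len - recent_budget):] = True

    # Among the older tokens, keep those with the highest accumulated
    # attention scores -- the heavy hitters.
    older_scores = acc_attn.clone()
    older_scores[keep] = float("-inf")
    num_heavy = min(heavy_budget, max(0, seq_len - recent_budget))
    if num_heavy > 0:
        keep[torch.topk(older_scores, num_heavy).indices] = True

    return keep
```

At each decoding step the mask would be applied to drop the evicted K/V rows (and their accumulated scores) before appending the newly generated token, keeping the cache at roughly heavy_budget + recent_budget entries.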

Trade-off:
Evicting KV cache entries deemed unimportant may raise accuracy concerns, but it reduces memory usage and thereby improves throughput.

@simon-mo Is this a feature you'd like to see implemented?

Alternatives

No response

Additional context

No response

@ChuanhongLi

H2O needs attention scores to decide which entries should be evicted. In both prompt processing (xops.memory_efficient_attention_forward, vllm = 0.2.7) and the decoding phase, it is not easy to get the attention scores out of the kernels. Any ideas?
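One possible workaround, sketched here outside of vLLM's kernels (the names below are illustrative, not vLLM APIs): recompute the attention weights from the current queries and the cached keys just to accumulate per-position H2 scores, while still using the fused kernel for the actual attention output. This duplicates the QK^T work, so the cleaner option would likely be modifying the attention kernel itself to emit a reduced score per cached position.

```python
import torch

def accumulate_h2_scores(q: torch.Tensor, k: torch.Tensor,
                         acc_attn: torch.Tensor) -> torch.Tensor:
    """Recompute attention weights for the current step's query token(s) and
    add them to a running per-position score, since fused kernels do not
    return the attention matrix.

    q:        [num_new_tokens, num_heads, head_dim] queries for this step.
    k:        [seq_len, num_heads, head_dim] cached keys (new tokens included).
    acc_attn: [seq_len] running accumulated attention per cached position.
    """
    scale = q.shape[-1] ** -0.5
    # Attention logits per head: [num_heads, num_new_tokens, seq_len].
    logits = torch.einsum("qhd,khd->hqk", q, k) * scale
    # NOTE: causal masking for multi-token prompt processing is omitted here
    # for brevity; during single-token decoding it is not needed.
    probs = torch.softmax(logits, dim=-1)
    # Sum over heads and query tokens to get one score per cached position.
    return acc_attn + probs.sum(dim=(0, 1))
```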

@laneeeee
Contributor

Very interesting idea! The paper uses WikiText-103 as the test dataset to reach its conclusions; I suspect the conclusion could be different on a dataset of pure mathematical formulas.

@beagleski
Contributor

beagleski commented Mar 26, 2024

There's a similar feature request for StreamingLLM, issue here.
Meanwhile, FastGen and Scissorhands are also highly related KV cache compression methods.
It would be better if the implementation design could be made more general, e.g., flexible head-wise/layer-wise KV cache management @simon-mo @chizhang118 @WoosukKwon.
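To illustrate that last point, a rough sketch of what a more general hook could look like (purely hypothetical names, not an existing vLLM interface): the cache manager asks a pluggable policy, per (layer, head) slot, which positions to keep.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import torch


@dataclass
class CacheBudget:
    """Retention budget for one (layer, head) KV slot; -1 means keep all."""
    layer: int
    head: int
    max_positions: int


class KVCacheEvictionPolicy(ABC):
    """Pluggable eviction policy, so H2O, StreamingLLM, FastGen or
    Scissorhands could sit behind the same cache-manager hook."""

    @abstractmethod
    def keep_mask(self, budget: CacheBudget,
                  acc_attn: torch.Tensor) -> torch.Tensor:
        """Return a boolean mask over the cached positions of one
        (layer, head) slot, marking which entries to retain."""
```

An H2O policy would implement keep_mask roughly as in the earlier sketch; a StreamingLLM-style policy would keep only the attention-sink tokens plus the recent window and could ignore acc_attn entirely.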

@PiotrNawrot

Another promising KV Cache Compression method (this time learned).


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Oct 29, 2024