
[Feature]: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models #3532

Open
Tracked by #2
chizhang118 opened this issue Mar 20, 2024 · 5 comments

Comments

@chizhang118

🚀 The feature, motivation and pitch

This paper might be of interest: https://arxiv.org/pdf/2306.14048.pdf

The paper shows that evicting a small portion of the KV cache barely affects output quality while substantially improving memory efficiency: with a KV cache budget of only 20% Heavy Hitters (H2), it reports throughput gains of up to 29x and latency reductions of up to 1.9x.
When computing attention, a small portion of the tokens contributes the majority of the attention score mass. The paper proposes the Heavy-Hitter Oracle (H2O), a KV cache eviction policy that dynamically balances retention between recent tokens and H2 tokens, and frames KV cache eviction as a dynamic submodular problem.
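For concreteness, here is a minimal sketch of the greedy eviction rule described above (not vLLM code; the function and tensor names are illustrative): keep a window of recent tokens plus the positions with the highest accumulated attention scores, and evict the rest.

```python
import torch

def h2o_keep_mask(acc_attn: torch.Tensor,
                  heavy_budget: int,
                  recent_budget: int) -> torch.Tensor:
    """Return a boolean mask over cached positions to keep.

    acc_attn:      [seq_len] accumulated attention each cached token has
                   received so far.
    heavy_budget:  number of heavy-hitter (H2) tokens to retain.
    recent_budget: number of most recent tokens to always retain.
    """
    seq_len = acc_attn.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)

    # Always keep the local window of most recent tokens.
    keep[max(0, seq_len - recent_budget):] = True

    # Among the older tokens, keep those with the highest accumulated
    # attention scores -- the heavy hitters.
    older_scores = acc_attn.clone()
    older_scores[keep] = float("-inf")
    num_heavy = min(heavy_budget, max(0, seq_len - recent_budget))
    if num_heavy > 0:
        keep[torch.topk(older_scores, num_heavy).indices] = True

    return keep
```

At each decoding step the mask would be applied to drop the evicted K/V rows (and their accumulated scores) before appending the newly generated token, keeping the cache at roughly heavy_budget + recent_budget entries.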

Trade-off:
Evicting KV cache entries deemed unimportant may raise accuracy concerns, but it reduces memory usage and thereby improves throughput.

@simon-mo Is this a feature you'd like to see implemented?

Alternatives

No response

Additional context

No response

@ChuanhongLi

H2O needs attention scores to decide which entries should be evicted. In both prompt processing (xops.memory_efficient_attention_forward, vllm = 0.2.7) and the decoding phase, it is not easy to get the attention scores out of the kernels. Any ideas?
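One possible workaround, sketched here outside of vLLM's kernels (the names below are illustrative, not vLLM APIs): recompute the attention weights from the current queries and the cached keys just to accumulate per-position H2 scores, while still using the fused kernel for the actual attention output. This duplicates the QK^T work, so the cleaner option would likely be modifying the attention kernel itself to emit a reduced score per cached position.

```python
import torch

def accumulate_h2_scores(q: torch.Tensor, k: torch.Tensor,
                         acc_attn: torch.Tensor) -> torch.Tensor:
    """Recompute attention weights for the current step's query token(s) and
    add them to a running per-position score, since fused kernels do not
    return the attention matrix.

    q:        [num_new_tokens, num_heads, head_dim] queries for this step.
    k:        [seq_len, num_heads, head_dim] cached keys (new tokens included).
    acc_attn: [seq_len] running accumulated attention per cached position.
    """
    scale = q.shape[-1] ** -0.5
    # Attention logits per head: [num_heads, num_new_tokens, seq_len].
    logits = torch.einsum("qhd,khd->hqk", q, k) * scale
    # NOTE: causal masking for multi-token prompt processing is omitted here
    # for brevity; during single-token decoding it is not needed.
    probs = torch.softmax(logits, dim=-1)
    # Sum over heads and query tokens to get one score per cached position.
    return acc_attn + probs.sum(dim=(0, 1))
```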

@laneeeee
Contributor

Very interesting idea! The paper uses WikiText-103 as the test dataset to reach its conclusions; I suspect the conclusion could be different on a dataset of pure mathematical formulas.

@beagleski
Contributor

beagleski commented Mar 26, 2024

There's a similar feature request for StreamingLLM, issue here.
Meanwhile, FastGen and Scissorhands are also highly related KV cache compression methods.
It would be better if the implementation design could be made more general, e.g., flexible head-wise/layer-wise KV cache management @simon-mo @chizhang118 @WoosukKwon.
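To illustrate that last point, a rough sketch of what a more general hook could look like (purely hypothetical names, not an existing vLLM interface): the cache manager asks a pluggable policy, per (layer, head) slot, which positions to keep.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import torch


@dataclass
class CacheBudget:
    """Retention budget for one (layer, head) KV slot; -1 means keep all."""
    layer: int
    head: int
    max_positions: int


class KVCacheEvictionPolicy(ABC):
    """Pluggable eviction policy, so H2O, StreamingLLM, FastGen or
    Scissorhands could sit behind the same cache-manager hook."""

    @abstractmethod
    def keep_mask(self, budget: CacheBudget,
                  acc_attn: torch.Tensor) -> torch.Tensor:
        """Return a boolean mask over the cached positions of one
        (layer, head) slot, marking which entries to retain."""
```

An H2O policy would implement keep_mask roughly as in the earlier sketch; a StreamingLLM-style policy would keep only the attention-sink tokens plus the recent window and could ignore acc_attn entirely.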

@PiotrNawrot

Another promising KV Cache Compression method (this time learned).


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Oct 29, 2024