Implement Scissorhands KV-cache compression & SnapKV prompt compression #11
Conversation
cache.py
Outdated
attn = attn.squeeze()
keys = attn.shape[1]
attn_is_low = (attn < 1 / keys).int()
self.attn_history[:, :, :keys, self.attn_counter % self.history_window_size] = (
Circular queue: % self.history_window_size ensures we always insert the latest attention value into the stalest slot.
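For reference, a minimal standalone sketch of the circular-buffer update described above (shapes and names here are illustrative, not the actual cache.py code):

```python
import torch

# Illustrative shapes only: (batch=1, 8 heads, 1024 cache slots, 400-step window).
history_window_size = 400
attn_history = torch.zeros(1, 8, 1024, history_window_size, dtype=torch.int32)

def record_step(attn_is_low: torch.Tensor, counter: int) -> int:
    """attn_is_low: (batch, heads, num_keys) 0/1 indicators for the current step."""
    keys = attn_is_low.shape[-1]
    # Modulo indexing turns the last dim into a circular queue: once the window is
    # full, each new observation overwrites the stalest slot.
    attn_history[:, :, :keys, counter % history_window_size] = attn_is_low
    return counter + 1
```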
)
self.attn_counter += 1
def refill_eviction_queue(self, input_pos: int):
This is not in the Scissorhands paper but I explain it in the main PR comment
k = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
v = v.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
k_rep = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
Change to k_rep instead of k since we don't want the repeated k, v passed to the attention callback
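A hedged sketch of the intent behind this change (the helper and its signature are hypothetical, not code from this PR):

```python
import torch
import torch.nn.functional as F

def attend_with_gqa(q, k, v, n_head, n_local_heads, mask=None):
    """Hypothetical helper: repeat KV heads only for the attention call.

    The un-repeated k, v are returned untouched so the KV cache and any
    attention callback keep seeing n_local_heads heads, not n_head copies.
    """
    k_rep = k.repeat_interleave(n_head // n_local_heads, dim=1)
    v_rep = v.repeat_interleave(n_head // n_local_heads, dim=1)
    y = F.scaled_dot_product_attention(q, k_rep, v_rep, attn_mask=mask)
    return y, k, v
```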
cache.py
Outdated
| """ | ||
| attn = attn.squeeze() | ||
| keys = attn.shape[1] | ||
| attn_is_low = (attn < 1 / keys).int() |
Treats a token as unimportant if its attention score is lower than uniform attention (1 / keys).
cache.py
Outdated
num_insertions = k_val.shape[2]
# Update global tokens to the prompt size if set to -1
if self.insertions == 0 and self.global_tokens == -1:
    self.global_tokens = num_insertions
I added this so that we have a setting where we always keep the prompt tokens -- by setting global_tokens=-1
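A tiny sketch of how the sentinel behaves (helper name is illustrative only):

```python
def resolve_global_tokens(global_tokens: int, insertions: int, prompt_len: int) -> int:
    # global_tokens=-1 is resolved to the prompt length on the very first insertion,
    # so every prompt token is pinned and never considered for eviction afterwards.
    if insertions == 0 and global_tokens == -1:
        return prompt_len
    return global_tokens
```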
if prompt_overflow:
    return k_val, v_val, self.compress_prompt

# If the cache requires attention weights to manage evictions, we need to pass self.update_attn_history as a callback
todo (me): check if we have a way to only update attn history every m steps, as done by Scissorhands
I don't think I saw this. I suspect we'll want to add it, as otherwise the overhead is quite high for this method.
I am recording the attention history every step but only aggregating and computing which tokens to evict every drop_amount (m) steps.
I feel like we have to store attention probs at each step (otherwise we'd have to recompute them when we need them) -- is there a better way?
Specifically, Scissorhands seems to say they only check and record attn probs every M steps for some hparam M.
So you'd only, every M steps, check the attention prob and appropriately increment the numerator and denominator based on just that step. Probably more brittle due to fewer observations over which to average / for which future-important tokens have the opportunity to reach higher-than-uniform attn probs, but it would reduce the computational overhead a lot.
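For concreteness, a rough sketch of that alternative every-M-steps bookkeeping under this reading of the paper (nothing here is in the PR; names and the value of M are placeholders):

```python
import torch

M = 10  # sampling period; the paper's hparam, value here is arbitrary

def maybe_record(step: int, attn: torch.Tensor,
                 numerator: torch.Tensor, denominator: torch.Tensor):
    """Every M-th step, count tokens whose attention fell below uniform."""
    if step % M != 0:
        return numerator, denominator          # skip most steps to cut overhead
    keys = attn.shape[-1]
    low = (attn < 1.0 / keys).int()            # (heads, keys) low-attention indicator
    numerator[..., :keys] += low               # times observed as unimportant
    denominator[..., :keys] += 1               # times observed at all
    return numerator, denominator

# Eviction would then drop the tokens with the highest numerator / denominator ratio.
```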
Let's keep discussing over Discord! Closing for now.
)

if attn_callback:
    # Mean pool over the grouped queries (average over self.n_head // self.n_local_heads)
is this choice documented in Scissorhands / SnapKV?
I couldn't find any mention of GQA... probably best to just check which models they test on ... they might not be trained with GQA
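For reference, a minimal sketch of the mean pooling being discussed (the shapes are assumptions, not the exact cache.py tensors):

```python
import torch

def pool_grouped_queries(attn: torch.Tensor, n_local_heads: int) -> torch.Tensor:
    """attn: (batch, n_head, q_len, num_keys) attention probabilities.

    With GQA, each group of n_head // n_local_heads query heads shares one KV head,
    so we average the attention within each group before updating the per-KV-head
    attention history.
    """
    b, n_head, q_len, num_keys = attn.shape
    group = n_head // n_local_heads
    return attn.view(b, n_local_heads, group, q_len, num_keys).mean(dim=2)
```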
Scissorhands records the number of times a token in the KV cache had a low attention score (< uniform probability) over a history window (defaulted to 400 tokens). It evicts the tokens with the highest fraction of unimportant attention scores per attention head. To avoid aggregating attention unimportance every step, they perform bulk evictions, which leaves empty slots in the KV cache. I modified this slightly: instead of periodically evicting in bulk, I periodically update an "eviction queue" which is then used to perform evictions at each step. This change allows us to avoid expensive re-calculation without having to perform bulk evictions; a rough sketch follows.
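A condensed sketch of that eviction-queue modification (single head, illustrative names, not the actual cache.py class):

```python
import torch

class EvictionQueueSketch:
    """Illustrative only: periodically rank tokens by how often their attention was
    below uniform, then pop one eviction index per decoding step."""

    def __init__(self, drop_amount: int, global_tokens: int):
        self.drop_amount = drop_amount        # refill period (the paper's m)
        self.global_tokens = global_tokens    # leading tokens that are never evicted
        self.eviction_queue: list[int] = []

    def refill(self, attn_history: torch.Tensor, valid_keys: int) -> None:
        # attn_history: (num_keys, history_window) binary "was attention low?" records
        unimportant_frac = attn_history[:valid_keys].float().mean(dim=-1)
        unimportant_frac[: self.global_tokens] = -1.0   # protect global / prompt tokens
        # Queue the drop_amount most-unimportant positions, worst first.
        self.eviction_queue = unimportant_frac.topk(self.drop_amount).indices.tolist()

    def next_eviction(self) -> int:
        # One eviction per decoding step; refill() is called every drop_amount steps.
        return self.eviction_queue.pop(0)
```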
SnapKV compresses long prompts by separating a prompt into a "prefix" and an "observation window". The method keeps every token in the observation window and compresses the prefix based on the attention scores from the observation window. This method is only called when prompt length > max cache length. In this case, we can't just insert into the KV cache in the update method. To compress, we need the attention scores, which we only get after running sdpa. So, in line 160 of cache.py, we pass a "compress_prompt" callback which returns the attention scores to the cache after sdpa is performed. In turn, this method first compresses the prompt and then calls the standard update method to insert it into the cache.
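A hedged sketch of that prompt-compression step (function name, shapes, and the simple sum-based scoring are assumptions for illustration, not the PR's exact implementation):

```python
import torch

def snapkv_compress_sketch(k, v, attn, max_cache_length, obs_window):
    """k, v: (batch, heads, prompt_len, head_dim); attn: (batch, heads, prompt_len, prompt_len)
    attention probabilities from the prompt's sdpa call.

    Keeps the observation window (last obs_window tokens) plus the prefix tokens that
    receive the most attention from that window, so the result fits in max_cache_length.
    """
    prompt_len = k.shape[2]
    prefix_len = prompt_len - obs_window
    keep_prefix = max_cache_length - obs_window
    # Score each prefix token by the attention it receives from the observation window.
    scores = attn[:, :, -obs_window:, :prefix_len].sum(dim=2)        # (batch, heads, prefix_len)
    keep_idx = scores.topk(keep_prefix, dim=-1).indices.sort(dim=-1).values
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, k.shape[-1])
    k_prefix = k[:, :, :prefix_len].gather(2, keep_idx)
    v_prefix = v[:, :, :prefix_len].gather(2, keep_idx)
    # Re-attach the (uncompressed) observation window; the result is what would then
    # be handed to the standard update path.
    k_out = torch.cat([k_prefix, k[:, :, prefix_len:]], dim=2)
    v_out = torch.cat([v_prefix, v[:, :, prefix_len:]], dim=2)
    return k_out, v_out
```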