Implements windowed KV-Cache Compression Strategy #9
Conversation

haileyschoelkopf left a comment:
@griff4692 left some comments but haven't finished going through everything just yet! Will also follow up with you to ensure I'm on the same page for certain design decisions.
generate.py (Outdated)
parser.add_argument(
    "--max_cache_length",
    type=float,
    default=512,
Suggested change:
-    default=512,
+    default=1,
default to not-windowed? Or I suppose this is ignored unless using "window" for cache_strategy?
Not windowed makes sense (1.0). If cache_strategy == full, max_cache_length has to be 512.
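To make the fractional semantics concrete, here is a sketch of how a float max_cache_length could be resolved (resolve_cache_length is a hypothetical helper, not the repo's actual code): values in (0, 1] are read as a fraction of the model's maximum sequence length, so 1.0 means the full, un-windowed context, while values above 1 are taken as an absolute slot count.

def resolve_cache_length(max_cache_length: float, max_seq_length: int) -> int:
    # Hypothetical helper: 1.0 keeps the full context (not windowed), 0.25
    # would keep a quarter of max_seq_length, and e.g. 256 is used directly
    # as an absolute number of cache slots.
    if 0 < max_cache_length <= 1:
        return round(max_cache_length * max_seq_length)
    return int(max_cache_length)

Under this reading, resolve_cache_length(1.0, 512) returns 512, which lines up with the cache_strategy == full case above.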
generate.py (Outdated)
# Optional Cache Kwargs depending on cache_strategy
parser.add_argument(
    "--global_tokens",
    default=128,
might 4 be a more reasonable default?
Yes - sorry, the defaults right now were somewhat random but I'll fix it to 4. I figured they'd all be adjusted during experimentation!
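For context, global_tokens appears to pin the first few positions (e.g. BOS and early prompt tokens) so they are never evicted, while the remaining slots act as a sliding window. A minimal sketch of the eviction arithmetic under that assumption (slot_to_evict is a hypothetical illustration, not the repo's code):

def slot_to_evict(num_inserted: int, max_cache_length: int, global_tokens: int) -> int:
    # Once the cache is full, overwrite the non-global region as a ring
    # buffer; the first `global_tokens` slots are left permanently in place.
    assert num_inserted >= max_cache_length, "cache not full yet; just append"
    window = max_cache_length - global_tokens
    return global_tokens + (num_inserted - max_cache_length) % window

Treating the window as a ring keeps each update O(1) per token, which is consistent with the speed-parity goal mentioned in the squashed commit below.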
self.updates = 0
...
def is_prefill(self):
I think I'm confused about the role of is_prefill here. Is this for certain methods which won't be using any fancy approaches during the prefill stage?
I didn't design this very well - is_prefill should probably exist outside of the Cache class. The only place it's used is in generate.py:
...
q, k, v = map(lambda x: x.transpose(1, 2), (q, k, v))
if self.kv_cache is not None:
    cache_k, cache_v, cache_mask = self.kv_cache.update(input_pos, k, v)
    # If we are in the prefill stage, we use the existing prompt kv-pairs
    if not self.kv_cache.is_prefill():
        k = cache_k
        v = cache_v
        mask = cache_mask.to(k.device)
...
The reason I added a switch for prefill is that during the prefill stage, we typically use full self-attention. If there's compression required to initialize the cache (|prompt| > max_cache_length), then this won't be full self-attention. I can explain further if it's unclear!
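Given the self.updates = 0 initialization shown earlier, one plausible implementation of the switch (an assumption, not necessarily the repo's exact code) is:

def is_prefill(self):
    # The first update() call ingests the whole prompt at once; until a
    # decoding step has been recorded, the cache is still in the prefill
    # stage and the caller keeps using the prompt's own kv-pairs.
    return self.updates == 0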
Do you think it's ok for the cache to essentially record the generation step (self.updates) or would you put that logic in generate.py?
# input_pos: [B, S]
logits = model(x, input_pos)
# Fix GPU
causal_mask = (
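The hunk is cut off at causal_mask = (. A typical construction, sketched here as a guess at the intent rather than the repo's exact code, builds the lower-triangular mask directly on the target device, which is presumably the point of the # Fix GPU note:

import torch

max_seq_length = 512
device = "cuda" if torch.cuda.is_available() else "cpu"
# Allocating on `device` up front avoids a CPU-resident mask that would need
# a .to(device) copy on every decoding step.
causal_mask = torch.tril(
    torch.ones(max_seq_length, max_seq_length, dtype=torch.bool, device=device)
)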
note to self: haven't finished looking at this or terminator_ids yet
Thanks! No rush - I'll update as you make suggestions.
model.py (Outdated)
k = cache_k
v = cache_v
# We also need to ask the cache for its dynamic mask which changes based on updates and possibly evictions
# TODO - why is this not always loaded on GPU?
@haileyschoelkopf - I'm wondering, given that the caches are attached to the model (see setup_caches), why the mask, which is registered as a buffer, isn't loaded onto the same device as the model (which is cuda). I don't know enough about how this works but curious if you do!
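For what it's worth, the relevant behavior is general PyTorch rather than repo-specific: tensors registered via register_buffer do follow Module.to(device), so a CPU-resident mask usually means either the cache was constructed after the model had already been moved, or the cache isn't registered as a proper submodule (e.g. kept in a plain Python list instead of an nn.ModuleList). A minimal demonstration:

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffers are tracked by .to()/.cuda() just like parameters.
        self.register_buffer("mask", torch.zeros(4, dtype=torch.bool))
        # A plain tensor attribute is invisible to .to() and stays on CPU.
        self.plain = torch.zeros(4, dtype=torch.bool)

model = Toy()
if torch.cuda.is_available():
    model = model.to("cuda")
    print(model.mask.device)   # cuda:0 -- moved with the module
    print(model.plain.device)  # cpu    -- left behind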
Force-pushed from 6c9dd9f to 538314b.
Squashed everything into a single commit to make it easier to follow.
- Creates cache.py
- Introduces global_tokens
- Formats repo with ruff
- Speed parity with full KV-cache
@haileyschoelkopf - removed mutable python function args and made a few other minor edits. Going to merge now and rebase my heavy hitters code onto the new ruff-reformatted Python files.