enable internal kv bucket in llama by x574chen · Pull Request #658 · huggingface/optimum-habana

x574chen · 2024-01-23T14:51:22Z

What does this PR do?

To improve throughput when generating many new tokens, this PR slices the KV cache to a multiple of the bucket size and computes attention over that slice rather than over the entire KV cache. Below are some results for LLaMA2 7B/70B on Gaudi2:

| Model | TP | Input Length | Output Length | BS | Base Throughput | Throughput w/ internal KV bucket |
|---|---|---|---|---|---|---|
| LLaMA v2-7B | 1 | 128 | 2048 | 76 | 1402 | 2217 (bucket=128) |
| LLaMA v2-7B | 2 | 2048 | 2048 | 64 | 1417 | 1672 (bucket=128) |
| LLaMA v2-70B | 4 | 128 | 2048 | 240 | 2834 | 3638 (bucket=256) |

Add --bucket_size=128 --bucket_internal to the commands to enable the feature.
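
For illustration only, here is a minimal pure-Python sketch (not the code in this PR) of why attending over a bucket-aligned slice of the cache can give the same result as attending over the whole cache, assuming every slot past the slice is masked out. The helper `attend`, the toy sizes, and the values are all hypothetical.

```python
import math

def attend(query, keys, values, mask):
    """Single-query softmax attention; mask[i] is 0.0 (visible) or -inf (hidden)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) + m
              for key, m in zip(keys, mask)]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # hidden slots get weight 0.0
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

# Hypothetical toy cache: 8 preallocated slots, only the first 3 filled so far.
cache_len, bucket_size, filled = 8, 4, 3
keys = [[0.1 * (i + 1), -0.2 * i] for i in range(cache_len)]
values = [[float(i), 1.0 - i] for i in range(cache_len)]
mask = [0.0 if i < filled else float("-inf") for i in range(cache_len)]
query = [0.5, -0.3]

cache_idx = bucket_size  # first bucket boundary that covers the filled slots
full = attend(query, keys, values, mask)
sliced = attend(query, keys[:cache_idx], values[:cache_idx], mask[:cache_idx])
```

In this sketch `sliced` matches `full`, while the sliced call touches only `cache_idx` slots instead of `cache_len`. That is the saving the bucket is meant to capture when the output length is long.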

puneeshkhanna · 2024-01-24T10:18:22Z

+            if cache_idx is not None and q_len == 1:
+                key_states = key_states[:, :, :cache_idx, :]
+                value_states = value_states[:, :, :cache_idx, :]
+                attention_mask = attention_mask[:, :, :, :cache_idx]


Add a check that attention_mask is not None before slicing it.

puneeshkhanna · 2024-01-24T11:39:21Z

@ssarkar2 - Maybe we should remove the original bucketing logic in a separate PR later, to simplify the overall code, once we are convinced that the bucketing logic in this PR is best for all cases.

Btw everyone - I will also add a clear-cache option in utils.py (just an API call to release HPU graph memory) in a separate PR, to address some corner cases where memory may increase with the bucketing changes in this PR.

ssarkar2 · 2024-01-25T19:24:10Z

@puneeshkhanna, the original external bucketing is general: it works for any model and does not need model file changes. It might be useful for unknown/new/unoptimized models.
However, for max perf we have to modify the model files to use the internal cache (like this PR does).

Let me know if it's worth keeping the general external one that might work for any model.

HuggingFaceDocBuilderDev · 2024-01-26T04:40:18Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

puneeshkhanna · 2024-02-02T12:43:34Z

+        if cache_len and bucket_size > 0:
+            idx = torch.div(token_idx - 1, bucket_size, rounding_mode="floor")
+            if idx < (cache_len // bucket_size):
+                cache_idx = (idx.item() + 1) * bucket_size


@x574chen - Just one query here: do we need .item() here? It will cause a sync back to CPU and graph. Can it work without .item()?

@x574chen - One more query: can we move lines 823 to 829 into utils.py and pass cache_idx in kwargs, so that there is just one line here, cache_idx = kwargs.get("cache_idx")? That would make the bucketing changes easier to apply to other models. allocate_kv_cache() could maybe return the KV length to utils.py. Sorry for all these late review comments.

All the other changes look good to me. Basically, what I'm thinking is that we just pass cache_idx into modeling_llama.py, and the only additional change there is slicing the KV cache in the attention block code.

The .item() is used to ensure that cache_idx (one of the model inputs) is an integer, not a tensor. This prevents HPUGraph from calling replay when the value of cache_idx changes. Please correct me if my understanding of how HPU graphs are used is incorrect.

Also, I have tried not using .item() when calculating cache_idx at every step, but the performance doesn't appear to be significantly impacted. Therefore, I have not made further changes in the repo to avoid .item() usage here.
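
To make the discussion concrete, here is a small sketch of the boundary computation quoted above, written with plain Python integers instead of tensors. The function name `bucketed_cache_idx` and the `None` fallback when no bucket applies are assumptions for illustration, and the cache length assumes 128 prompt tokens plus 2048 new tokens, as in the first benchmark row.

```python
def bucketed_cache_idx(token_idx, bucket_size, cache_len):
    """Return the bucket-aligned slice length, or None if no slicing applies."""
    if not cache_len or bucket_size <= 0:
        return None
    # Same floor division as torch.div(token_idx - 1, bucket_size, rounding_mode="floor").
    idx = (token_idx - 1) // bucket_size
    if idx < cache_len // bucket_size:
        return (idx + 1) * bucket_size
    return None  # assumed fallback: attend over the full cache

# Hypothetical sizes: 128 prompt tokens + 2048 new tokens, bucket of 128.
cache_len, bucket_size = 128 + 2048, 128
per_step = [bucketed_cache_idx(t, bucket_size, cache_len)
            for t in range(129, cache_len + 1)]
distinct = sorted(set(per_step))
```

Under these assumptions `per_step` has 2048 entries but only 16 distinct values (256, 384, ..., 2176), so an integer cache_idx changes only when generation crosses a bucket boundary.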

puneeshkhanna · 2024-02-02T12:55:34Z

        assert generation_config.bucket_size > 0
    generation_config.kv_cache_fp8 = args.kv_cache_fp8
    generation_config.use_flash_attention = args.use_flash_attention
+    generation_config.bucket_internal = args.bucket_internal


We also need to initialize this in optimum-habana/optimum/habana/transformers/generation/configuration_utils.py. I think CI will fail without that change.
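
As a rough sketch of what such an initialization could look like, the class below is a hypothetical stand-in and not the actual optimum-habana generation config. The class name and the default values (`-1` and `None`) are assumptions chosen only for illustration.

```python
class SketchGenerationConfig:
    """Hypothetical stand-in for a generation config; not the optimum-habana class."""

    def __init__(self, **kwargs):
        # Assumed defaults: bucketing disabled unless the caller opts in.
        self.bucket_size = kwargs.get("bucket_size", -1)
        self.bucket_internal = kwargs.get("bucket_internal", None)

default_cfg = SketchGenerationConfig()
enabled_cfg = SketchGenerationConfig(bucket_size=128, bucket_internal=True)
```

With a declared default, code that reads `bucket_internal` from the config works even when the calling script never set the attribute.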

x574chen requested review from bhargaveede, libinta, mandy-li, ssarkar2 and vivekgoe as code owners January 23, 2024 14:51

x574chen requested a review from a user January 23, 2024 14:51

x574chen requested a review from regisss as a code owner January 23, 2024 14:51

puneeshkhanna reviewed Jan 24, 2024

ghost approved these changes Jan 29, 2024

libinta approved these changes Jan 29, 2024

ssarkar2 approved these changes Jan 29, 2024

puneeshkhanna reviewed Feb 2, 2024

puneeshkhanna mentioned this pull request Feb 5, 2024

enable internal kv bucket in llama HabanaAI/optimum-habana-fork#11

Closed

x574chen closed this Feb 5, 2024

x574chen force-pushed the llama_internal_bucket branch from f4dd4ba to f096980 Compare February 5, 2024 13:55

x574chen wants to merge 0 commit into huggingface:main from x574chen:llama_internal_bucket
