enable internal kv bucket in llama by xt574chen · Pull Request #24 · HabanaAI/optimum-habana-fork

xt574chen · 2024-02-05T17:00:14Z

What does this PR do?

To improve throughput when generating many new tokens, split the KV cache into buckets of a fixed width (--bucket_size). During decoding, attention is computed over only the filled part of the cache, rounded up to a multiple of the bucket width, rather than over the entire KV cache.

Add --bucket_size=128 --bucket_internal to the commands to enable the feature.
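
To give a feel for why this can help, here is a rough cost model, not code from this PR. It assumes the KV cache is pre-allocated to prompt length plus max_new_tokens and uses "cache positions read per decode step" as a proxy for attention cost. The function name attended_positions is made up for this sketch.

```python
def attended_positions(prompt_len, new_tokens, bucket_size=None):
    """Total KV-cache positions read over all decode steps (a crude cost proxy)."""
    kv_cache_len = prompt_len + new_tokens  # assumption: statically sized cache
    total = 0
    for step in range(new_tokens):
        filled = prompt_len + step + 1  # positions holding real keys/values
        if bucket_size is None:
            # No bucketing: every step attends over the whole buffer.
            total += kv_cache_len
        else:
            # Bucketing: round the filled length up to a multiple of bucket_size.
            buckets = -(-filled // bucket_size)  # ceiling division
            total += min(buckets * bucket_size, kv_cache_len)
    return total


full = attended_positions(128, 2048)
bucketed = attended_positions(128, 2048, bucket_size=128)
print(f"bucketed / full = {bucketed / full:.2f}")
```

Under these assumptions the bucketed run reads far fewer positions in total. Whether that shows up as throughput depends on graph recompilation and other overheads, which this model ignores.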

ghost · 2024-02-06T04:21:28Z

+    parser.add_argument(
+        "--bucket_internal",
+        action="store_true",
+        help="Split kv sequence into buckets in decode phase. It is useful for long new tokens.",


It improves throughput when max_new_tokens is large
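
For readers following along, a minimal standalone sketch of how the two flags could be wired up with argparse. The --bucket_internal help text is taken from the diff above. The default of -1 for --bucket_size and the validation step are assumptions made for this example and may not match run_generation.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Bucketing flags (illustrative only)")
    parser.add_argument(
        "--bucket_size",
        type=int,
        default=-1,  # assumption: a negative value means bucketing is disabled
        help="Width of each KV cache bucket, in tokens.",
    )
    parser.add_argument(
        "--bucket_internal",
        action="store_true",
        help="Split kv sequence into buckets in decode phase. It is useful for long new tokens.",
    )
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Hypothetical sanity check: internal bucketing needs a positive bucket width.
    if args.bucket_internal and args.bucket_size <= 0:
        parser.error("--bucket_internal requires a positive --bucket_size")
    return args


args = parse(["--bucket_size=128", "--bucket_internal"])
print(args.bucket_size, args.bucket_internal)
```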

puneeshkhanna · 2024-02-07T13:25:58Z

+                    if idx < (model_kwargs["kv_cache_len"] // bucket_size):
+                        cache_idx = (idx.item() + 1) * bucket_size
+                        model_kwargs["cache_idx"] = cache_idx



@xt574chen - this logic works only when the total generated length is a multiple of the bucket size. For example, take a total length of 2060: for tokens generated between positions 2048 and 2060, the KV cache is sliced only up to sequence length 2048, so the KV values between 2048 and 2060 are not considered.
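
The 2060 case can be reproduced with plain integers. This is a sketch of the behaviour described above, not the PR's code: it assumes idx is (token_idx - 1) // bucket_size and that cache_idx simply keeps its previous value when the condition is false. The helper name original_cache_idx is made up here.

```python
def original_cache_idx(token_idx, kv_cache_len, bucket_size, prev_cache_idx):
    idx = (token_idx - 1) // bucket_size  # assumed definition of idx
    if idx < kv_cache_len // bucket_size:
        return (idx + 1) * bucket_size
    return prev_cache_idx  # condition false: slice length is left unchanged


kv_cache_len, bucket_size = 2060, 128
cache_idx = bucket_size
uncovered = []
for token_idx in range(1, kv_cache_len + 1):
    cache_idx = original_cache_idx(token_idx, kv_cache_len, bucket_size, cache_idx)
    if cache_idx < token_idx:
        # Position token_idx - 1 lies outside the slice [0, cache_idx).
        uncovered.append(token_idx - 1)

print(len(uncovered), uncovered[0], uncovered[-1])
```

With these assumptions, the positions left outside the slice are exactly the tail beyond the last full bucket.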

Please find updated logic below (spent a lot of time reviewing all the changes today):

if model_kwargs.get("token_idx") <= (model_kwargs["kv_cache_len"] // bucket_size) * bucket_size:
    idx = torch.div(model_kwargs.get("token_idx") - 1, bucket_size, rounding_mode="floor")
    cache_idx = (idx.item() + 1) * bucket_size
    model_kwargs["cache_idx"] = cache_idx
else:
    model_kwargs["cache_idx"] = model_kwargs["kv_cache_len"]
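
The same logic can be checked with plain integers instead of tensors. This is a sketch for checking the arithmetic only: fixed_cache_idx is a made-up helper name, and it assumes token_idx - 1 is the cache position written at the current step.

```python
def fixed_cache_idx(token_idx, kv_cache_len, bucket_size):
    # Largest length that is still a whole number of buckets.
    full = (kv_cache_len // bucket_size) * bucket_size
    if token_idx <= full:
        # Round up to the end of the bucket containing position token_idx - 1.
        return ((token_idx - 1) // bucket_size + 1) * bucket_size
    # Past the last full bucket: use the whole cache.
    return kv_cache_len


kv_cache_len, bucket_size = 2060, 128
covered = all(
    token_idx <= fixed_cache_idx(token_idx, kv_cache_len, bucket_size) <= kv_cache_len
    for token_idx in range(1, kv_cache_len + 1)
)
print(covered)
```

The check confirms that, for every step, the slice includes the position being written and never exceeds the cache length.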

We can also enhance this a bit further by avoiding the .item() call when the idx tensor has not changed. But let's skip that minor enhancement for now; we can push it in a separate PR later.

More importantly, the above logic needs to go in first so that the model logic works correctly.

@xt574chen Further enhanced code can be as below. I will let you decide the best course of action.

# Declare prev_idx = None outside the while loop.
if model_kwargs.get("token_idx") <= (model_kwargs["kv_cache_len"] // bucket_size) * bucket_size:
    idx = torch.div(model_kwargs.get("token_idx") - 1, bucket_size, rounding_mode="floor")
    if idx != prev_idx:
        cache_idx = (idx.item() + 1) * bucket_size
        model_kwargs["cache_idx"] = cache_idx
        prev_idx = idx
else:
    model_kwargs["cache_idx"] = model_kwargs["kv_cache_len"]
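
To see how often cache_idx would actually be recomputed with this variant, here is an integer-only sketch. CacheIdxTracker is a hypothetical class written for this illustration. Note that in the tensor version the comparison idx != prev_idx inside an if may itself require a host read, so any real saving on device would need to be measured.

```python
class CacheIdxTracker:
    """Recompute cache_idx only when the bucket index changes (illustrative)."""

    def __init__(self, kv_cache_len, bucket_size):
        self.kv_cache_len = kv_cache_len
        self.bucket_size = bucket_size
        self.prev_idx = None
        self.cache_idx = None
        self.updates = 0  # number of times cache_idx was recomputed

    def step(self, token_idx):
        full = (self.kv_cache_len // self.bucket_size) * self.bucket_size
        if token_idx <= full:
            idx = (token_idx - 1) // self.bucket_size
            if idx != self.prev_idx:
                self.cache_idx = (idx + 1) * self.bucket_size
                self.prev_idx = idx
                self.updates += 1
        else:
            self.cache_idx = self.kv_cache_len
        return self.cache_idx


tracker = CacheIdxTracker(kv_cache_len=2060, bucket_size=128)
for token_idx in range(129, 2061):  # decode steps after a 128-token prompt
    tracker.step(token_idx)
print(tracker.updates, tracker.cache_idx)
```

In this run the recomputation happens once per bucket boundary rather than once per generated token.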

I tested the recommended logic (without the enhancement) quite a bit and it seems to be working fine. @xt574chen - please test from your side and feel free to update anything. But the original code has the issue I highlighted in earlier comments.

puneeshkhanna · 2024-02-07T13:28:04Z

@dvarshney-habana - please check the comment. Once it is addressed, we can merge.

xt574chen · 2024-02-08T02:34:00Z

@puneeshkhanna updated, thank you!

puneeshkhanna · 2024-02-08T04:33:17Z

@xt574chen - Thank you. I hope you also verified the changes and that we are not missing any corner cases.
Changes look good to me.

@dvarshney-habana - let's merge it so that we can start testing in nightly jobs too and see the performance impact of bucketing.

puneeshkhanna · 2024-02-12T07:46:35Z

@xt574chen -
I think there is still some issue with the bucketing logic in this PR.
Earlier I had tested your local patch, which had --use_kv_blocks. The logic was almost the same there, except that the bucket size was hardcoded as 128.
I did extensive testing with that patch, and all the configs performed better with it.

As an example, let's take the 2 configs below.
1st config:

With kv_blocks:
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --use_kv_blocks
Stats:

Throughput (including tokenization) = 2924.5992187903807 tokens/second
Number of HPU graphs = 46
Memory allocated = 21.75 GB
Max memory allocated = 42.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.13634231203469 seconds

Without bucketing (without kv blocks) irrespective of the changes in this PR:
Stats:

Throughput (including tokenization) = 2848.227319207161 tokens/second
Number of HPU graphs = 16
Memory allocated = 21.75 GB
Max memory allocated = 42.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 87.94648431398673 seconds

With the changes of this PR:
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_size 128 --bucket_internal
Stats:

Throughput (including tokenization) = 2647.6716598301537 tokens/second
Number of HPU graphs = 51
Memory allocated = 21.75 GB
Max memory allocated = 42.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 100.0808209030074 seconds

2nd config:

With kv_blocks
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --use_kv_blocks
Stats:

Throughput (including tokenization) = 2210.6107447412933 tokens/second
Number of HPU graphs = 46
Memory allocated = 33.53 GB
Max memory allocated = 64.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 118.95348056603689 seconds

Without bucketing or kv blocks irrespective of the changes in this PR:
Stats:

Throughput (including tokenization) = 2154.783151771037 tokens/second
Number of HPU graphs = 16
Memory allocated = 33.53 GB
Max memory allocated = 64.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 117.34948720003013 seconds

With the changes of this PR:
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_size 128 --bucket_internal
Stats:

Throughput (including tokenization) = 2036.5612704786631 tokens/second
Number of HPU graphs = 51
Memory allocated = 33.53 GB
Max memory allocated = 64.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 129.63119935599389 seconds

We should not see a degradation in any configuration with the bucketing logic, since in my opinion it should become the default config. I think something extra in this PR is causing the degradation.
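
For reference, the relative throughput changes implied by the figures above can be computed as follows. The numbers are the reported ones, rounded to four decimals. The dictionary layout and the helper name pct_change are just for this sketch.

```python
# Reported throughput in tokens/second, copied from the stats above.
throughput = {
    "config 1 (max_input_tokens 128)": {
        "baseline": 2848.2273,
        "use_kv_blocks": 2924.5992,
        "this PR": 2647.6717,
    },
    "config 2 (max_input_tokens 2048)": {
        "baseline": 2154.7832,
        "use_kv_blocks": 2210.6107,
        "this PR": 2036.5613,
    },
}


def pct_change(value, baseline):
    """Relative change versus the run without bucketing, in percent."""
    return (value - baseline) / baseline * 100.0


deltas = {}
for config, runs in throughput.items():
    for name in ("use_kv_blocks", "this PR"):
        deltas[(config, name)] = pct_change(runs[name], runs["baseline"])
        print(f"{config}: {name} {deltas[(config, name)]:+.2f}% vs baseline")
```

In both configs the earlier --use_kv_blocks patch is a few percent faster than the run without bucketing, while the run with this PR's flags is several percent slower.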

* enable internal kv bucket in llama
* initialize bucket_internal for CI
* make bucket_internal more clear
* further perf optim while max length is not multiple of bucket size

astachowiczhabana · 2024-06-07T14:19:09Z

huggingface#720

enable internal kv bucket in llama

30963f3

xt574chen requested review from bhargaveede, libinta, mandy-li, ssarkar2 and vivekgoe as code owners February 5, 2024 17:00

xt574chen requested a review from a user February 5, 2024 17:00

initialize bucket_internal for CI

5b1c158

xt574chen mentioned this pull request Feb 5, 2024

enable internal kv bucket in llama #11

Closed

ghost reviewed Feb 6, 2024

ghost approved these changes Feb 6, 2024

make bucket_internal more clear

c1fb4f4

puneeshkhanna reviewed Feb 7, 2024

xt574chen added 2 commits February 8, 2024 10:09

fix conflict

6362656

further perf optim while max length is not multiple of bucket size

8b628df

ghost merged commit d5291ae into HabanaAI:habana-main Feb 8, 2024

xt574chen mentioned this pull request Mar 1, 2024

extend bucket_internal to SAMPLE generation mode #84

Merged

astachowiczhabana pushed a commit that referenced this pull request Nov 22, 2024

Adding labels clone as workaround to avoid crash (#24)

7101122

xinyu-intel pushed a commit that referenced this pull request Mar 4, 2025

Adding labels clone as workaround to avoid crash (#24)

dc88a90

This pull request was closed.

enable internal kv bucket in llama #24
5 commits merged into HabanaAI:habana-main from xt574chen:llama_cache_bucket

3 participants
