enable internal kv bucket in llama #24

Merged: 5 commits merged into HabanaAI:habana-main from xt574chen:llama_cache_bucket on Feb 8, 2024

@xt574chen

What does this PR do?

To improve throughput when many new tokens are generated (large max_new_tokens), this PR slices the KV cache to a multiple of the bucket size during decoding and computes attention over that slice rather than over the entire KV cache.

Add --bucket_size=128 --bucket_internal to the commands to enable the feature.
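
To make the intended saving concrete, here is a rough sketch that counts how many KV positions attention would read over a whole decode run, with and without bucketing. It assumes the sliced length is the current token position rounded up to the next multiple of the bucket size and capped at the cache length. The helper names `bucketed_len` and `total_kv_reads` are hypothetical and not part of this PR, and the count ignores every other cost (graph recompilation, slicing overhead, and so on).

```python
from typing import Optional


def bucketed_len(token_idx: int, kv_cache_len: int, bucket_size: int) -> int:
    """Round token_idx up to a bucket boundary, capped at the cache length."""
    rounded = -(-token_idx // bucket_size) * bucket_size  # ceiling division
    return min(rounded, kv_cache_len)


def total_kv_reads(prompt_len: int, max_new_tokens: int,
                   bucket_size: Optional[int] = None) -> int:
    """Sum the KV positions read across all decode steps (illustrative only)."""
    kv_cache_len = prompt_len + max_new_tokens
    total = 0
    for token_idx in range(prompt_len + 1, kv_cache_len + 1):
        if bucket_size is None:
            total += kv_cache_len  # no bucketing: attend over the whole cache
        else:
            total += bucketed_len(token_idx, kv_cache_len, bucket_size)
    return total


full = total_kv_reads(128, 2048)
bucketed = total_kv_reads(128, 2048, bucket_size=128)
print(full, bucketed, bucketed / full)
```

For a 128-token prompt and 2048 new tokens, the bucketed count comes to a bit over half of the full-cache count. This is only a count of positions read, not a throughput prediction.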

parser.add_argument(
    "--bucket_internal",
    action="store_true",
    help="Split kv sequence into buckets in decode phase. It is useful for long new tokens.",
)

It improves throughput when max_new_tokens is large

if idx < (model_kwargs["kv_cache_len"] // bucket_size):
    cache_idx = (idx.item() + 1) * bucket_size
    model_kwargs["cache_idx"] = cache_idx

@puneeshkhanna commented on Feb 7, 2024

@xt574chen - this logic works only when the total generated length is a multiple of the bucket size. For example, take a total length of 2060: for tokens generated between positions 2048 and 2060, the KV cache is sliced at sequence length 2048, so the KV values between 2048 and 2060 are not considered.
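
The 2060 example can be replayed with plain integers. This is a minimal sketch, not the PR's code: it assumes `idx` is the index of the bucket holding the current token, and `original_cache_idx` is a hypothetical helper name.

```python
def original_cache_idx(token_idx: int, kv_cache_len: int,
                       bucket_size: int, prev_cache_idx: int) -> int:
    """Mimic the originally proposed condition using plain integers."""
    idx = (token_idx - 1) // bucket_size  # assumption: bucket of the current token
    if idx < kv_cache_len // bucket_size:
        return (idx + 1) * bucket_size
    return prev_cache_idx  # condition is false: cache_idx is left unchanged


kv_cache_len, bucket_size = 2060, 128
cache_idx = 0
for token_idx in range(1, kv_cache_len + 1):
    cache_idx = original_cache_idx(token_idx, kv_cache_len, bucket_size, cache_idx)

# cache_idx stops growing at the last full bucket boundary.
print(cache_idx, kv_cache_len)
```

Under these assumptions the slice length stays at 2048 for every position past 2048, which is the gap described above.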

Please find updated logic below (spent a lot of time reviewing all the changes today):

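# Bucket boundaries only cover the part of the cache that is a whole number of buckets.
if model_kwargs.get("token_idx") <= (model_kwargs["kv_cache_len"] // bucket_size) * bucket_size:
    # Round the current position up to the end of its bucket.
    idx = torch.div(model_kwargs.get("token_idx") - 1, bucket_size, rounding_mode="floor")
    cache_idx = (idx.item() + 1) * bucket_size
    model_kwargs["cache_idx"] = cache_idx
else:
    # Past the last full bucket: use the whole KV cache.
    model_kwargs["cache_idx"] = model_kwargs["kv_cache_len"]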
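
As a quick sanity check of the logic above, the same arithmetic can be written with plain integers and checked for every position. This is a sketch only: `updated_cache_idx` is a hypothetical stand-in that replaces the `torch.div(...).item()` call with integer floor division.

```python
def updated_cache_idx(token_idx: int, kv_cache_len: int, bucket_size: int) -> int:
    """Integer-only version of the updated bucketing condition."""
    last_full_bucket_end = (kv_cache_len // bucket_size) * bucket_size
    if token_idx <= last_full_bucket_end:
        idx = (token_idx - 1) // bucket_size
        return (idx + 1) * bucket_size
    return kv_cache_len  # tail beyond the last full bucket


kv_cache_len, bucket_size = 2060, 128
slices = [updated_cache_idx(t, kv_cache_len, bucket_size)
          for t in range(1, kv_cache_len + 1)]

# Every position must fall inside its slice, and no slice may exceed the cache.
covered = all(t <= s <= kv_cache_len for t, s in zip(range(1, kv_cache_len + 1), slices))
print(covered, slices[2047], slices[2048])
```

With a cache length of 2060, position 2048 still maps to a slice of 2048, and every later position maps to the full 2060.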

We could go further and avoid the .item() call when the idx tensor has not changed, but let's skip that minor enhancement for now and push it in a separate PR later.

More importantly, the logic above needs to go in first so that the model logic works correctly.

@xt574chen - The further enhanced code could look like the snippet below. I will let you decide the best course of action.

# Declare prev_idx = None outside the while loop.

if model_kwargs.get("token_idx") <= (model_kwargs["kv_cache_len"] // bucket_size) * bucket_size:
    idx = torch.div(model_kwargs.get("token_idx") - 1, bucket_size, rounding_mode="floor")
    if idx != prev_idx:
        cache_idx = (idx.item() + 1) * bucket_size
        model_kwargs["cache_idx"] = cache_idx
        prev_idx = idx
else:
    model_kwargs["cache_idx"] = model_kwargs["kv_cache_len"]
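
To get a feel for how often the cached variant would actually recompute the slice length, here is an integer-only sketch of the same loop. The function name `count_cache_idx_updates` and the choice of a 128-token prompt are assumptions for illustration; in the real code `idx` is a tensor, so the cost of the comparison itself is not modeled here.

```python
from typing import Optional, Tuple


def count_cache_idx_updates(first_token_idx: int, kv_cache_len: int,
                            bucket_size: int) -> Tuple[int, Optional[int]]:
    """Count how many times cache_idx is recomputed across a decode run."""
    prev_idx = None  # declared outside the loop, as the comment suggests
    cache_idx = None
    updates = 0
    for token_idx in range(first_token_idx, kv_cache_len + 1):
        if token_idx <= (kv_cache_len // bucket_size) * bucket_size:
            idx = (token_idx - 1) // bucket_size
            if idx != prev_idx:  # only recompute when the bucket changes
                cache_idx = (idx + 1) * bucket_size
                updates += 1
                prev_idx = idx
        else:
            cache_idx = kv_cache_len
    return updates, cache_idx


updates, final_cache_idx = count_cache_idx_updates(129, 2060, 128)
steps = 2060 - 129 + 1
print(updates, steps, final_cache_idx)
```

Under these assumptions the slice length is recomputed 15 times over 1932 decode steps, once per bucket crossed.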

I tested the recommended logic (without the enhancement) quite a bit and it seems to be working fine. @xt574chen - please test from your side and feel free to update anything. The original code, however, has the issue I highlighted in my earlier comments.

@puneeshkhanna

@dvarshney-habana - please check the comment. Once it is addressed, we can merge.

@xt574chen
Author

@puneeshkhanna updated, thank you!

@puneeshkhanna

@xt574chen - Thank you. I hope you also verified the changes and that we are not missing any corner cases.
Changes look good to me.

@dvarshney-habana - let's merge it so that we can start testing in nightly jobs too and see the performance impact of bucketing.

@ghost merged commit d5291ae into HabanaAI:habana-main on Feb 8, 2024

@puneeshkhanna

@xt574chen -
I think there is still an issue with the bucketing logic in this PR.
Earlier I tested your local patch, which had --use_kv_blocks. The logic there was almost the same, except that the bucket size was hardcoded to 128.
I did extensive testing with that patch, and all the configs performed better with it.

As an example, take the two configs below.
1st config:

  1. With kv_blocks:
    python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --use_kv_blocks
    Stats:

Throughput (including tokenization) = 2924.5992187903807 tokens/second
Number of HPU graphs = 46
Memory allocated = 21.75 GB
Max memory allocated = 42.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.13634231203469 seconds

  2. Without bucketing or kv blocks (same with or without the changes in this PR):
    Stats:

Throughput (including tokenization) = 2848.227319207161 tokens/second
Number of HPU graphs = 16
Memory allocated = 21.75 GB
Max memory allocated = 42.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 87.94648431398673 seconds

  3. With the changes of this PR:
    python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_size 128 --bucket_internal
    Stats:

Throughput (including tokenization) = 2647.6716598301537 tokens/second
Number of HPU graphs = 51
Memory allocated = 21.75 GB
Max memory allocated = 42.26 GB
Total memory available = 94.62 GB
Graph compilation duration = 100.0808209030074 seconds

2nd config:

  1. With kv_blocks:
    python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --use_kv_blocks
    Stats:

Throughput (including tokenization) = 2210.6107447412933 tokens/second
Number of HPU graphs = 46
Memory allocated = 33.53 GB
Max memory allocated = 64.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 118.95348056603689 seconds

  2. Without bucketing or kv blocks (same with or without the changes in this PR):
    Stats:

Throughput (including tokenization) = 2154.783151771037 tokens/second
Number of HPU graphs = 16
Memory allocated = 33.53 GB
Max memory allocated = 64.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 117.34948720003013 seconds

  3. With the changes of this PR:
    python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 60 --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_size 128 --bucket_internal
    Stats:

Throughput (including tokenization) = 2036.5612704786631 tokens/second
Number of HPU graphs = 51
Memory allocated = 33.53 GB
Max memory allocated = 64.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 129.63119935599389 seconds

We should not see a degradation for any configuration with the bucketing logic, since in my opinion it should become the default config. I think something extra in this PR is causing the degradation.
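
For reference, the relative differences implied by the throughput figures above can be computed as follows. This is just arithmetic on the numbers reported in this comment; the dictionary keys are labels chosen here, not names from the PR.

```python
# Throughput (tokens/second) copied from the stats above.
throughput = {
    "config_1": {  # --max_input_tokens 128
        "kv_blocks": 2924.5992187903807,
        "baseline": 2848.227319207161,
        "this_pr": 2647.6716598301537,
    },
    "config_2": {  # --max_input_tokens 2048
        "kv_blocks": 2210.6107447412933,
        "baseline": 2154.783151771037,
        "this_pr": 2036.5612704786631,
    },
}


def pct_change(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


deltas = {
    name: {
        run: pct_change(tps, runs["baseline"])
        for run, tps in runs.items()
        if run != "baseline"
    }
    for name, runs in throughput.items()
}

for name, runs in deltas.items():
    for run, delta in runs.items():
        print(f"{name} {run}: {delta:+.2f}% vs. no bucketing")
```

Relative to the run without bucketing, the kv_blocks patch comes out roughly 2.7% and 2.6% faster in the two configs, while this PR comes out roughly 7.0% and 5.5% slower.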

bhargaveede pushed a commit that referenced this pull request Feb 19, 2024
* enable internal kv bucket in llama

* initialize bucket_internal for CI

* make bucket_internal more clear

* further perf optim while max length is not multiple of bucket size
dudilester pushed a commit that referenced this pull request Feb 29, 2024
* enable internal kv bucket in llama

* initialize bucket_internal for CI

* make bucket_internal more clear

* further perf optim while max length is not multiple of bucket size
kalyanjk pushed a commit to kalyanjk/optimum-habana-fork that referenced this pull request Apr 12, 2024
* enable internal kv bucket in llama

* initialize bucket_internal for CI

* make bucket_internal more clear

* further perf optim while max length is not multiple of bucket size
kalyanjk pushed a commit to kalyanjk/optimum-habana-fork that referenced this pull request Apr 15, 2024
* enable internal kv bucket in llama

* initialize bucket_internal for CI

* make bucket_internal more clear

* further perf optim while max length is not multiple of bucket size
@astachowiczhabana

huggingface#720

This pull request was closed.