enable internal kv bucket in llama #720

Merged
regisss merged 4 commits into huggingface:main from xt574chen:bucket_internal on Feb 23, 2024

Conversation

@xt574chen
Contributor

What does this PR do?

To improve throughput when max_new_tokens is large, slice the KV cache to a length that is a multiple of the bucket width, and compute attention over that slice instead of the entire KV cache.

LLaMA v2 70B (8x, max_input_tokens 128, max_new_tokens 2048, batch_size 240):
5528 tps (original performance) -> 6378 tps (w/ internal bucket size 128)

Add --bucket_size=128 --bucket_internal to the commands to enable the feature.
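
The idea described above can be illustrated with a minimal pure-Python sketch. This is not the code from this PR: the helper names `bucketed_length` and `attention`, the single-query attention, and the toy sizes are assumptions chosen for illustration. It rounds the number of filled cache positions up to a multiple of the bucket size and checks that attention over that slice matches attention over the full, masked cache.

```python
import math
import random


def bucketed_length(num_valid, bucket_size, max_len):
    """Round the number of filled cache positions up to a multiple of bucket_size."""
    rounded = math.ceil(num_valid / bucket_size) * bucket_size
    return min(rounded, max_len)


def attention(q, keys, values, num_valid):
    """Single-query softmax attention; positions >= num_valid are masked out."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [
        sum(qi * ki for qi, ki in zip(q, k)) * scale if i < num_valid else -math.inf
        for i, k in enumerate(keys)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # masked positions get weight 0.0
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Toy sizes (hypothetical, far smaller than a real model).
random.seed(0)
max_len, head_dim, bucket_size = 512, 8, 128
k_cache = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(max_len)]
v_cache = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(max_len)]
q = [random.gauss(0, 1) for _ in range(head_dim)]

num_valid = 300  # cache positions written so far
used = bucketed_length(num_valid, bucket_size, max_len)  # 3 buckets of 128

full = attention(q, k_cache, v_cache, num_valid)
bucketed = attention(q, k_cache[:used], v_cache[:used], num_valid)
```

In this sketch the slice covers 384 of the 512 cache positions, so the score and weighted-sum loops do less work while producing the same output, because every position beyond the slice is masked anyway.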

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread optimum/habana/transformers/generation/configuration_utils.py
    if not generation_config.bucket_internal:
        assert generation_config.bucket_size <= 0, "reuse_cache and bucketing flags set together"
    else:
        assert generation_config.bucket_size >= 0, "bucket_internal and bucket_size flags set together"
Collaborator

Here, we are in the case where generation_config.bucket_internal is True, so if this assert fails (i.e. generation_config.bucket_size < 0), it means that bucket_size is not set, right? But the error message says otherwise.
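
One way the message could line up with the condition is sketched below. This is only an illustration, not the wording that was merged: `GenConfig` is a hypothetical stand-in for the real generation config, `check_bucket_flags` is a made-up helper, and the new message text for the `bucket_internal` branch is an assumption. The message in the other branch is copied unchanged from the snippet under review.

```python
from dataclasses import dataclass


@dataclass
class GenConfig:
    """Hypothetical stand-in holding only the two fields discussed here."""
    bucket_size: int = -1
    bucket_internal: bool = False


def check_bucket_flags(cfg):
    if not cfg.bucket_internal:
        # Message kept as in the reviewed snippet.
        assert cfg.bucket_size <= 0, "reuse_cache and bucketing flags set together"
    else:
        # Failing here means bucket_size was left negative, so say so (assumed wording).
        assert cfg.bucket_size >= 0, "bucket_internal requires bucket_size to be set (>= 0)"


check_bucket_flags(GenConfig(bucket_size=128, bucket_internal=True))  # passes silently
```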

Contributor Author

@puneeshkhanna I see you've corrected some error messages. I hope this update won't cause a conflict.

Contributor

@xt574chen - I will update my PR once this gets merged first

Comment thread optimum/habana/transformers/generation/utils.py Outdated
@regisss
Collaborator

regisss commented Feb 22, 2024

@xt574chen Can you run the following from the root of the repo to make the code style check pass please?

pip install -U ruff
make style

regisss added the run-test (Run CI for PRs from external contributors) label on Feb 23, 2024
regisss merged commit e328e21 into huggingface:main on Feb 23, 2024
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
xt574chen deleted the bucket_internal branch on March 1, 2024 01:03
HolyFalafel pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Mar 11, 2024