Update repeat KV llama logic for better TP-4 performance by puneeshkhanna · Pull Request #639 · huggingface/optimum-habana

puneeshkhanna · 2024-01-16T11:10:23Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

puneeshkhanna · 2024-01-16T11:12:54Z

@regisss - This PR should provide the same gains we saw in #626 for the TP-4 and TP-2 cases of Llama 70B inference.
I just need to test once more with the final changes I have pushed, and I will post the results in the comments.
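For context on the TP-4 and TP-2 cases, here is a rough sketch of how the attention heads would split across ranks. It assumes the public Llama-2-70B configuration (64 query heads, 8 key/value heads) and an even split of heads across ranks; neither is stated in this PR, and `heads_per_rank` is a hypothetical helper used only for illustration.

```python
# Assumed Llama-2-70B attention configuration (taken from the public model
# config, not from this PR): 64 query heads and 8 key/value heads.
NUM_ATTENTION_HEADS = 64
NUM_KEY_VALUE_HEADS = 8


def heads_per_rank(world_size):
    """Return (query heads, KV heads, repeat factor) per rank.

    Assumes heads are sharded evenly across tensor-parallel ranks.
    """
    q_heads = NUM_ATTENTION_HEADS // world_size
    kv_heads = NUM_KEY_VALUE_HEADS // world_size
    n_rep = q_heads // kv_heads  # query heads sharing each KV head
    return q_heads, kv_heads, n_rep


for world_size in (2, 4, 8):
    q_heads, kv_heads, n_rep = heads_per_rank(world_size)
    print(f"world_size={world_size}: {q_heads} query heads, "
          f"{kv_heads} KV heads, n_rep={n_rep}")
```

Under these assumptions, a rank at world size 8 holds a single KV head, while ranks at world size 4 and 2 hold several, so the KV heads have to be matched to groups of query heads. That may be why the change targets the TP-4 and TP-2 cases, though the PR does not say so explicitly.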

puneeshkhanna · 2024-01-16T11:13:57Z

@regisss - Also, can we label this one with synapse 1.14 as well, since it depends on #626?

puneeshkhanna · 2024-01-16T11:58:12Z

With this change, significant performance improvements are seen for world size 4 and world size 2, even without flash attention. A few readings are given below.

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 128 --batch_size 8 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2

Without fix -
Stats:
Throughput (including tokenization) = 277.43501473449646 tokens/second
Number of HPU graphs = 19
Memory allocated = 32.88 GB
Max memory allocated = 36.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 9.24757074000081 seconds

With fix -
Stats:
Throughput (including tokenization) = 330.3786981312745 tokens/second
Number of HPU graphs = 19
Memory allocated = 32.86 GB
Max memory allocated = 36.91 GB
Total memory available = 94.62 GB
Graph compilation duration = 8.079426162003074 seconds

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 512 --max_new_tokens 512 --batch_size 16 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2

Without fix -
Stats:
Throughput (including tokenization) = 288.127251016057 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.23 GB
Max memory allocated = 44.86 GB
Total memory available = 94.62 GB
Graph compilation duration = 58.593736156995874 seconds

With fix -
Stats:
Throughput (including tokenization) = 593.7646171567183 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.2 GB
Max memory allocated = 44.83 GB
Total memory available = 94.62 GB
Graph compilation duration = 29.501766333996784 seconds

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 1024 --max_new_tokens 1024 --batch_size 8 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2

Without fix -
Stats:
Throughput (including tokenization) = 140.24620942590755 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.28 GB
Max memory allocated = 44.92 GB
Total memory available = 94.62 GB
Graph compilation duration = 118.88620932200865 seconds

With fix -
Stats:
Throughput (including tokenization) = 315.3412322858002 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.22 GB
Max memory allocated = 44.86 GB
Total memory available = 94.62 GB
Graph compilation duration = 53.846886464976706 seconds
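The throughput readings above can be turned into speedup factors with a few lines of Python. This is only a convenience sketch: the numbers are copied from the three runs at world size 4, and the `speedup` helper and tuple layout are illustrative choices, not part of the PR.

```python
# Throughput in tokens/second, copied from the readings above (world size 4).
readings = [
    # (max_input_tokens, max_new_tokens, batch_size, without_fix, with_fix)
    (128, 128, 8, 277.43501473449646, 330.3786981312745),
    (512, 512, 16, 288.127251016057, 593.7646171567183),
    (1024, 1024, 8, 140.24620942590755, 315.3412322858002),
]


def speedup(before, after):
    """Ratio of throughput with the fix to throughput without it."""
    return after / before


speedups = [speedup(before, after) for _, _, _, before, after in readings]

for (inp, new, batch, before, after), ratio in zip(readings, speedups):
    print(f"input={inp} new={new} batch={batch}: "
          f"{before:.1f} -> {after:.1f} tokens/s ({ratio:.2f}x)")
```

The gain grows with sequence length: roughly 1.19x for the 128-token run, and a little over 2x for the 512-token and 1024-token runs.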

HuggingFaceDocBuilderDev · 2024-01-17T08:46:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss

LGTM!

Can you also add a disclaimer in the first post to say that this should not be merged before #626 (if I understood correctly) please?

puneeshkhanna · 2024-01-17T09:56:28Z

Yes @regisss - we should merge #626 first and then this one, which should hopefully avoid merge conflicts.

puneeshkhanna · 2024-01-17T09:58:22Z

@schoi-habana - Will you get a chance to verify finetuning as well with the changes in this PR? Note that these changes apply when flash attention is disabled.
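As background on what "repeat KV" refers to in the non-flash-attention path: in grouped-query attention, several query heads share one key/value head. The sketch below is a pure-Python illustration of the general idea, not the code changed in this PR. It compares materializing repeated key heads with indexing the shared head directly; the function names and the tiny sizes are made up for the example.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def scores_with_repeat(queries, keys, n_rep):
    # Materialize each KV head n_rep times in a row so that the keys line up
    # one-to-one with the query heads (repeat-interleave ordering).
    repeated = [key for key in keys for _ in range(n_rep)]
    return [dot(q, k) for q, k in zip(queries, repeated)]


def scores_grouped(queries, keys, n_rep):
    # No copy: query head h reads the shared KV head h // n_rep directly.
    return [dot(q, keys[h // n_rep]) for h, q in enumerate(queries)]


random.seed(0)
head_dim, num_kv_heads, n_rep = 4, 2, 8  # illustrative sizes only
keys = [[random.random() for _ in range(head_dim)]
        for _ in range(num_kv_heads)]
queries = [[random.random() for _ in range(head_dim)]
           for _ in range(num_kv_heads * n_rep)]

same = scores_with_repeat(queries, keys, n_rep) == scores_grouped(queries, keys, n_rep)
print("identical scores:", same)
```

Both variants compute the same attention scores; the grouped form simply avoids building the larger repeated tensor. Whether this matches the exact mechanism of the change here is an assumption, since the diff itself is not shown in this conversation.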

schoi-habana · 2024-01-17T18:18:06Z

@puneeshkhanna I tested this patch with 4x finetuning and flash attention disabled. No performance gain was observed with this patch.


Update repeat KV llama logic for better TP-4 performance

d92f449

puneeshkhanna requested review from libinta and mandy-li as code owners January 16, 2024 11:10

puneeshkhanna requested a review from a user January 16, 2024 11:10

Update repeat KV llama logic for better TP-4 performance

55b1b61

Update repeat KV llama logic for better TP-4 performance

2484189

regisss added the synapse1.14 label Jan 17, 2024

regisss approved these changes Jan 17, 2024


regisss added the run-test label (Run CI for PRs from external contributors) Jan 17, 2024

puneeshkhanna mentioned this pull request Jan 17, 2024

Flash attention enhancement of repeatKV #626

Merged

3 tasks

ghost approved these changes Jan 18, 2024


libinta approved these changes Jan 18, 2024


Merge branch 'oh_orig_main1' into repeatKVfix

b7b37af

libinta merged commit 8077ea5 into huggingface:main Jan 24, 2024

puneeshkhanna deleted the repeatKVfix branch January 24, 2024 03:59

jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024

Update repeat KV llama logic for better TP-4 performance (huggingface…

3ced167

…#639) Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>

libinta merged 4 commits into huggingface:main from puneeshkhanna:repeatKVfix


6 participants
