Update repeat KV llama logic for better TP-4 performance #639

Merged: libinta merged 4 commits into huggingface:main from puneeshkhanna:repeatKVfix on Jan 24, 2024
@puneeshkhanna
Contributor

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@puneeshkhanna puneeshkhanna requested a review from a user January 16, 2024 11:10
@puneeshkhanna
Contributor Author

@regisss - this PR should provide the same gains as we saw in #626 for the TP-4 and TP-2 cases of Llama 70B inference.
I just need to test once more with the final changes I have pushed, and I will post the results in the comments.

@puneeshkhanna
Contributor Author

@regisss - Also, can we label this with synapse 1.14 as well, since it depends on #626?

@puneeshkhanna
Contributor Author

With this change, significant performance improvements are seen for world size 4 or world size 2, even without flash attention. A few readings are given below.

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 128 --batch_size 8 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2

Without fix -
Stats:
Throughput (including tokenization) = 277.43501473449646 tokens/second
Number of HPU graphs = 19
Memory allocated = 32.88 GB
Max memory allocated = 36.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 9.24757074000081 seconds

With fix -
Stats:
Throughput (including tokenization) = 330.3786981312745 tokens/second
Number of HPU graphs = 19
Memory allocated = 32.86 GB
Max memory allocated = 36.91 GB
Total memory available = 94.62 GB
Graph compilation duration = 8.079426162003074 seconds

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 512 --max_new_tokens 512 --batch_size 16 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2

Without fix -
Stats:
Throughput (including tokenization) = 288.127251016057 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.23 GB
Max memory allocated = 44.86 GB
Total memory available = 94.62 GB
Graph compilation duration = 58.593736156995874 seconds

With fix -
Stats:
Throughput (including tokenization) = 593.7646171567183 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.2 GB
Max memory allocated = 44.83 GB
Total memory available = 94.62 GB
Graph compilation duration = 29.501766333996784 seconds

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 1024 --max_new_tokens 1024 --batch_size 8 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2

Without fix -
Stats:
Throughput (including tokenization) = 140.24620942590755 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.28 GB
Max memory allocated = 44.92 GB
Total memory available = 94.62 GB
Graph compilation duration = 118.88620932200865 seconds

With fix -
Stats:
Throughput (including tokenization) = 315.3412322858002 tokens/second
Number of HPU graphs = 19
Memory allocated = 34.22 GB
Max memory allocated = 44.86 GB
Total memory available = 94.62 GB
Graph compilation duration = 53.846886464976706 seconds
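For a quick read of the three runs above, the throughput and graph compilation figures can be reduced to before/after ratios. The sketch below only copies the numbers reported in this comment; the `runs` layout and the `ratio` helper are names made up here for illustration, not part of the PR.

```python
# Each entry: (throughput in tokens/s: without fix, with fix),
#             (graph compilation in s: without fix, with fix).
# All values are copied from the stats reported above.
runs = {
    "in=128, out=128, batch=8": (
        (277.43501473449646, 330.3786981312745),
        (9.24757074000081, 8.079426162003074),
    ),
    "in=512, out=512, batch=16": (
        (288.127251016057, 593.7646171567183),
        (58.593736156995874, 29.501766333996784),
    ),
    "in=1024, out=1024, batch=8": (
        (140.24620942590755, 315.3412322858002),
        (118.88620932200865, 53.846886464976706),
    ),
}


def ratio(numerator, denominator):
    """Plain ratio; hypothetical helper used only in this sketch."""
    return numerator / denominator


for name, (tps, compile_s) in runs.items():
    tps_gain = ratio(tps[1], tps[0])                  # higher is better
    compile_gain = ratio(compile_s[0], compile_s[1])  # how many times shorter
    print(f"{name}: throughput x{tps_gain:.2f}, compilation x{compile_gain:.2f} shorter")
```

On these readings the throughput gain is roughly 1.19x for the 128-token run, 2.06x for the 512-token run and 2.25x for the 1024-token run, so the benefit grows with sequence length in this small sample.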

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@regisss regisss left a comment

LGTM!

Can you also add a disclaimer in the first post saying that this should not be merged before #626 (if I understood correctly), please?

@regisss regisss added the run-test Run CI for PRs from external contributors label Jan 17, 2024
@puneeshkhanna
Contributor Author

Yes @regisss - we should merge #626 first and then this one, which should hopefully avoid merge conflicts.

@puneeshkhanna
Contributor Author

puneeshkhanna commented Jan 17, 2024

@schoi-habana - Will you get a chance to verify fine-tuning as well with the changes in this PR? Note that these changes are applicable when flash attention is disabled.
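For readers unfamiliar with the term, "repeat KV" refers to grouped-query attention, where several query heads share one key/value head. The diff itself is not shown in this thread, so the following is only a general pure-Python illustration of the idea, not the PR's implementation: attention scores computed after materializing repeated KV heads match scores computed by indexing the shared KV head per group. All function names and toy sizes here are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def repeat_kv(kv_heads, n_rep):
    """Baseline: list each KV head n_rep times in a row, one per query head."""
    return [head for head in kv_heads for _ in range(n_rep)]


def scores_with_repeat(q_heads, kv_heads, n_rep):
    """Expand the KV heads first, then pair them one-to-one with query heads."""
    expanded = repeat_kv(kv_heads, n_rep)
    return [dot(q, k) for q, k in zip(q_heads, expanded)]


def scores_grouped(q_heads, kv_heads, n_rep):
    """No expansion: query head h reads the shared KV head h // n_rep."""
    return [dot(q, kv_heads[h // n_rep]) for h, q in enumerate(q_heads)]


# Toy sizes (assumption for illustration): 4 query heads, 2 KV heads, head_dim 2.
q_heads = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]
kv_heads = [[1.0, 1.0], [2.0, -1.0]]
n_rep = len(q_heads) // len(kv_heads)

assert scores_with_repeat(q_heads, kv_heads, n_rep) == scores_grouped(q_heads, kv_heads, n_rep)
```

The grouped form gives the same result without building the expanded list, which is the general motivation for avoiding an explicit repeat; whether and how this maps onto the HPU kernels is outside what this thread documents.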

@schoi-habana
Collaborator

@puneeshkhanna I tested this patch with 4x fine-tuning and flash attention disabled. No performance gain was observed with this patch.

@libinta libinta merged commit 8077ea5 into huggingface:main Jan 24, 2024
@puneeshkhanna puneeshkhanna deleted the repeatKVfix branch January 24, 2024 03:59
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
…#639)



Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>