Update repeat KV llama logic for better TP-4 performance #639
This change gives significant performance improvements for world size 4 or world size 2 when flash attention is not used. A few readings are given below.

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 128 --batch_size 8 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2
Without fix -
With fix -

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 512 --max_new_tokens 512 --batch_size 16 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2
Without fix -
With fix -

python ../gaudi_spawn.py --use_deepspeed --world_size 4 run_generation.py --model_name_or_path /software/data/llama_inference/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 1024 --max_new_tokens 1024 --batch_size 8 --limit_hpu_graphs --attn_softmax_bf16 --trim_logits --bf16 --reuse_cache --warmup 2 --n_iterations 2
Without fix -
With fix -
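For readers unfamiliar with the "repeat KV" step named in the title, here is a minimal sketch of the general idea, not the code in this PR. In grouped-query attention, several query heads share one key/value head; a common implementation copies each KV head so the head counts match before the matmul. An alternative is to group the query heads and let matmul broadcasting do the sharing, so the copies are never materialized. The function names and shapes below are hypothetical, and NumPy stands in for the framework tensors. The head counts are illustrative: 16 query heads and 2 KV heads is what a per-rank split of a 64/8-head model across 4 ranks would look like.

```python
import numpy as np


def repeat_kv(x, n_rep):
    """Expand (batch, kv_heads, seq, dim) to (batch, kv_heads * n_rep, seq, dim)."""
    return np.repeat(x, n_rep, axis=1)


def scores_with_repeat(q, k, n_rep):
    # Materialize n_rep copies of each key head, then run one matmul per query head.
    k_full = repeat_kv(k, n_rep)
    return q @ k_full.transpose(0, 1, 3, 2)


def scores_with_broadcast(q, k, n_rep):
    batch, n_q_heads, q_len, dim = q.shape
    n_kv_heads, kv_len = k.shape[1], k.shape[2]
    # Group query heads by the KV head they share; give K a singleton group axis.
    q_grouped = q.reshape(batch, n_kv_heads, n_rep, q_len, dim)
    k_grouped = k.reshape(batch, n_kv_heads, 1, kv_len, dim)
    # matmul broadcasts the singleton axis over n_rep without copying K.
    scores = q_grouped @ k_grouped.transpose(0, 1, 2, 4, 3)
    return scores.reshape(batch, n_q_heads, q_len, kv_len)


# Illustrative shapes only (hypothetical values).
rng = np.random.default_rng(0)
n_rep = 8
q = rng.standard_normal((2, 16, 4, 32))
k = rng.standard_normal((2, 2, 6, 32))

same = np.allclose(scores_with_repeat(q, k, n_rep), scores_with_broadcast(q, k, n_rep))
```

Both paths produce the same attention scores; the broadcast form simply avoids allocating the repeated key tensor. Whether that translates into a speedup depends on the hardware and on how the compiler handles the broadcast.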
@schoi-habana - Will you get a chance to verify finetuning too with the changes in this PR? Note that these changes apply when flash attention is disabled.
@puneeshkhanna I tested this patch with 4x finetuning and flash attention disabled. No performance gain was observed with this patch.
…#639) Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>