Prefill kvcache upstream by puneeshkhanna · Pull Request #942 · huggingface/optimum-habana

puneeshkhanna · 2024-05-02T05:09:18Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

* Use KV cache till input seq len for prefill phase. Pad KV cache to full input + new tokens len for decode phase. Delete the KV cache used as inputs by HPU graphs after full prompt generation. Ensure KV cache is not returned as output tensor during decode phase. Deletion of KV cache input tensor used by HPU graphs needs to be protected by PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable. All the changes are protected by bucket internal flag. Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
* Revert initialization of KV cache
* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag
* remove os import
* remove commented print

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

…anaAI#161) * Sampling search UseKV cache till input seq len for prefill phase * Remove redundant line

puneeshkhanna · 2024-05-02T05:11:47Z

@regisss, @libinta, @dvarshney-habana - Please add the Synapse 1.16 release label to this.

puneeshkhanna · 2024-05-02T05:23:59Z

Description of the changes in this PR:

Use the KV cache only up to the input sequence length during the prefill phase, then pad it to the full length (input + new tokens) for the decode phase. Delete the KV cache tensors used as inputs by HPU graphs after full prompt generation. Ensure the KV cache is not returned as an output tensor during the decode phase. Deleting the KV cache input tensors used by HPU graphs must be guarded by the PT_HPUGRAPH_DISABLE_TENSOR_CACHE environment variable. All of these changes are currently gated behind the internal bucketing flag (--bucket_internal).
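The prefill/decode split described above can be pictured with a toy sketch. It is not the optimum-habana implementation: the helper names `allocate_prefill_cache` and `pad_for_decode` are hypothetical, and a single attention head's key cache is modeled as a plain list of rows (sequence length x head dimension) rather than an HPU tensor.

```python
# Toy sketch only: models one head's key cache as a list of rows.
# Function names are hypothetical and do not come from optimum-habana.

def allocate_prefill_cache(input_len, head_dim):
    """Allocate a cache that covers only the prompt tokens (prefill phase)."""
    return [[0.0] * head_dim for _ in range(input_len)]


def pad_for_decode(cache, max_new_tokens):
    """Extend the cache along the sequence axis to input_len + max_new_tokens."""
    head_dim = len(cache[0])
    padding = [[0.0] * head_dim for _ in range(max_new_tokens)]
    return cache + padding


prefill_cache = allocate_prefill_cache(input_len=4, head_dim=2)
decode_cache = pad_for_decode(prefill_cache, max_new_tokens=3)

# The prefill cache holds 4 rows; the decode cache holds 4 + 3 rows.
print(len(prefill_cache), len(decode_cache))
```

The point of the split is that the larger allocation only has to exist once decoding starts, so the prefill phase does not pay for slots that no token occupies yet.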

Updated command (remove --reuse_cache from all existing commands; PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1 is set automatically):
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 200 --attn_softmax_bf16 --trim_logits --bf16 --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_internal --bucket_size 128

With the changes in this PR, performance in existing configurations remains the same, but batch sizes can be scaled much higher because a lot of memory is saved during the prefill phase.
For example, with 2K input tokens + 2K new tokens and Llama 70B on 8x with flash attention, the maximum batch size is around 270 without this PR and up to 370 with it.
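To give a rough sense of why sizing the cache to the input length helps, here is a back-of-the-envelope estimate of the KV cache footprint. It assumes the public Llama 2 70B configuration (80 layers, 8 key/value heads, head dimension 128), bf16 storage (2 bytes per element), and the batch size of 200 from the command above. It ignores activations, weights, and any sharding across devices, so the figures are illustrative and are not measurements from this PR.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Total bytes for keys and values across all layers (factor 2 = K and V)."""
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed model shape: Llama 2 70B with bf16 cache entries.
LLAMA2_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
GIB = 1024 ** 3

# Cache sized to the 2048-token prompt vs. prompt + 2048 new tokens.
prefill = kv_cache_bytes(batch_size=200, seq_len=2048, **LLAMA2_70B)
full = kv_cache_bytes(batch_size=200, seq_len=4096, **LLAMA2_70B)

print(f"prefill-sized cache: {prefill / GIB:.1f} GiB")
print(f"full-sized cache:    {full / GIB:.1f} GiB")
```

Under these assumptions the prompt-sized cache is half the size of the full one, which is the headroom that lets the batch size grow during prefill.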

libinta · 2024-05-02T06:06:38Z

                    )
-                    past_key_value = (past_key, past_value)
+                    # Return list instead of tuple
+                    past_key_value = [past_key, past_value]


This could impact TGI, as TGI goes through this code path.
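The PR does not state why the tuple was replaced with a list. One plausible reading, offered here as an assumption, is that a list lets the caller replace or drop individual cache entries in place, which a tuple forbids. It also means any downstream consumer that expects a tuple sees a different type. A minimal illustration with placeholder objects standing in for the tensors:

```python
# Placeholder objects stand in for the key and value tensors.
past_key, past_value = object(), object()

# A tuple cannot be modified in place.
as_tuple = (past_key, past_value)
try:
    as_tuple[0] = None
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# A list can: this drops the container's reference to the key entry.
as_list = [past_key, past_value]
as_list[0] = None

# Callers that check the container type will notice the change.
print(tuple_is_mutable, isinstance(as_list, tuple))
```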

ssarkar2 · 2024-06-10T04:02:06Z

Merged through #1028.

Puneesh Khanna added 2 commits May 2, 2024 08:05

Sampling search UseKV cache till input seq len for prefill phase (Hab…

967fa47

…anaAI#161) * Sampling search UseKV cache till input seq len for prefill phase * Remove redundant line

puneeshkhanna requested review from bhargaveede, libinta, mandy-li, ssarkar2 and vivekgoe as code owners May 2, 2024 05:09

puneeshkhanna requested a review from a user May 2, 2024 05:09

puneeshkhanna requested a review from regisss as a code owner May 2, 2024 05:09

Puneesh Khanna added 3 commits May 2, 2024 10:47

Fix merge conflict

3fdd1f6

Fix merge conflict

3770d1c

Update modeling_llama.py

4a04bff

libinta added the synapse 1.16_dependency label May 2, 2024

libinta reviewed May 2, 2024

ssarkar2 reviewed May 6, 2024

Comment thread optimum/habana/transformers/models/llama/modeling_llama.py

fix review comment

7af3dce

ssarkar2 removed the synapse 1.16_dependency label May 31, 2024

libinta closed this Jun 8, 2024

puneeshkhanna wants to merge 6 commits into huggingface:main from puneeshkhanna:prefill_kvcache_upstream

3 participants
