Use KV cache till input seq len for prefill phase by puneeshkhanna · Pull Request #154 · HabanaAI/optimum-habana-fork

puneeshkhanna · 2024-04-10T08:24:36Z

Use the KV cache only up to the input sequence length during the prefill phase, then pad it to the full length (input + new tokens) for the decode phase. Delete the KV cache tensors used as HPU graph inputs after full prompt generation, and ensure the KV cache is not returned as an output tensor during the decode phase. Deleting the KV cache input tensors used by HPU graphs must be guarded by the PT_HPUGRAPH_DISABLE_TENSOR_CACHE environment variable.
All of these changes are gated behind the bucket internal flag (--bucket_internal).
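The prefill/decode split described above can be illustrated with a small, framework-free sketch. It is not the code from this PR: the names `kv_cache_len` and `pad_kv_cache` are hypothetical, and plain Python lists stand in for the real HPU tensors. It only shows the sizing idea, assuming the cache is kept at the input length during prefill and padded once to input + new tokens before decode.

```python
from typing import List

# One cached position: a flat vector standing in for a K or V slice.
# (Hypothetical stand-in for a tensor; the PR itself operates on HPU tensors.)
Row = List[float]


def kv_cache_len(phase: str, input_len: int, max_new_tokens: int) -> int:
    """Sequence length the KV cache needs in each phase (illustrative helper)."""
    if phase == "prefill":
        # Prefill only attends over the prompt, so the cache stops at input_len.
        return input_len
    if phase == "decode":
        # Decode appends one token per step, so reserve room for all new tokens.
        return input_len + max_new_tokens
    raise ValueError(f"unknown phase: {phase!r}")


def pad_kv_cache(cache: List[Row], target_len: int, pad_value: float = 0.0) -> List[Row]:
    """Return a copy of the cache padded along the sequence axis to target_len."""
    if target_len < len(cache):
        raise ValueError("target_len is shorter than the existing cache")
    width = len(cache[0]) if cache else 0
    padding = [[pad_value] * width for _ in range(target_len - len(cache))]
    return cache + padding


if __name__ == "__main__":
    input_len, max_new_tokens = 4, 3
    prefill_len = kv_cache_len("prefill", input_len, max_new_tokens)
    prefill_cache = [[float(i)] * 2 for i in range(prefill_len)]
    decode_cache = pad_kv_cache(
        prefill_cache, kv_cache_len("decode", input_len, max_new_tokens)
    )
    print(len(prefill_cache), len(decode_cache))
```

In this sketch the padded rows are zeros; how the real implementation fills or masks the padded region is not stated in the PR description.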

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

puneeshkhanna · 2024-04-10T08:57:38Z

Updated command (remove --reuse_cache; PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1 is now set automatically):

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 200 --attn_softmax_bf16 --trim_logits --bf16 --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_internal --bucket_size 128

Also requires pytorch-integration patch - https://gerrit.habana-labs.com/#/c/408363/
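To give a rough sense of why trimming the prefill cache matters at these settings, the following back-of-the-envelope sketch estimates KV cache size for the command above. The model dimensions (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the public Llama-2-70B configuration, not from this PR, and the helper `kv_cache_bytes` is hypothetical. It also assumes bf16 storage (2 bytes per element) and an even split across the 8 devices.

```python
def kv_cache_bytes(
    seq_len: int,
    batch_size: int,
    num_layers: int = 80,      # assumed: Llama-2-70B num_hidden_layers
    num_kv_heads: int = 8,     # assumed: Llama-2-70B num_key_value_heads (GQA)
    head_dim: int = 128,       # assumed: hidden_size 8192 / 64 attention heads
    bytes_per_elem: int = 2,   # bf16
) -> int:
    """Total bytes for keys and values across all layers (illustrative estimate)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    gib = 1024 ** 3
    batch_size, input_len, new_tokens, world_size = 200, 2048, 2048, 8

    prefill = kv_cache_bytes(input_len, batch_size)
    decode = kv_cache_bytes(input_len + new_tokens, batch_size)

    print(f"prefill cache: {prefill / gib:.2f} GiB total, "
          f"{prefill / gib / world_size:.3f} GiB per device")
    print(f"decode cache:  {decode / gib:.2f} GiB total, "
          f"{decode / gib / world_size:.3f} GiB per device")
```

Under these assumptions, a cache sized to the input length during prefill is half the size of one sized to input + new tokens. Actual memory use on HPU depends on details this estimate ignores, such as graph workspace and allocator behavior.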

* Use KV cache till input seq len for prefill phase. Pad KV cache to full input + new tokens len for decode phase. Delete the KV cache used as inputs by HPU graphs after full prompt generation. Ensure KV cache is not returned as output tensor during decode phase. Deletion of KV cache input tensor used by HPU graphs needs to be protected by PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable. All the changes are protected by bucket internal flag. Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
* Revert initialization of KV cache
* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag
* remove os import
* remove commented print

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

astachowiczhabana · 2024-06-11T07:09:42Z

huggingface#1028

puneeshkhanna requested review from bhargaveede, libinta, mandy-li, ssarkar2 and vivekgoe as code owners April 10, 2024 08:24

puneeshkhanna requested a review from a user April 10, 2024 08:24

Revert initialization of KV cache

057a60d

bgoldberg-habana reviewed Apr 10, 2024

Comment thread optimum/habana/transformers/models/llama/modeling_llama.py

Puneesh Khanna added 3 commits April 11, 2024 09:37

Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag

de71a47

remove os import

534e16a

remove commented print

ee5fe3a

ghost approved these changes Apr 11, 2024

ghost merged commit 60b5d9b into HabanaAI:habana-main Apr 11, 2024

astachowiczhabana pushed a commit that referenced this pull request Feb 14, 2025

Disabling timers synchronization (#154)

3327c79

xinyu-intel pushed a commit that referenced this pull request Mar 4, 2025

Disabling timers synchronization (#154)

5fa4c45

astachowiczhabana pushed a commit that referenced this pull request Apr 17, 2025

Disabling timers synchronization (#154) (huggingface#1879)

029f8fb

This pull request was closed.

5 commits merged into HabanaAI:habana-main from puneeshkhanna:prefill_kvcache
