Prefill kvcache upstream #942
* Use KV cache till input seq len for prefill phase.

  Pad KV cache to full input + new tokens len for decode phase.
  Delete the KV cache used as inputs by HPU graphs after full prompt generation.
  Ensure KV cache is not returned as output tensor during decode phase.
  Deletion of KV cache input tensor used by HPU graphs needs to be protected by PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable.
  All the changes are protected by bucket internal flag.

  Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
* Revert initialization of KV cache
* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag
* remove os import
* remove commented print

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
…anaAI#161)

* Sampling search Use KV cache till input seq len for prefill phase
* Remove redundant line
Description of the changes in this PR

- Pad the KV cache to the full input + new tokens length for the decode phase.
- Delete the KV cache used as input by HPU graphs after full prompt generation.
- Ensure the KV cache is not returned as an output tensor during the decode phase.
- Deletion of the KV cache input tensor used by HPU graphs needs to be protected by the PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable.
- All the changes are protected by the bucket internal flag right now.

Updated command: remove --reuse_cache from all existing commands. Setting PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1 is taken care of automatically.

With the changes in this PR, performance in existing configs remains the same, but batch sizes can be scaled much higher because a lot of memory is saved during the prefill phase.
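To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It is not code from this PR: the helper name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head size 128, 2-byte elements) are assumptions chosen only for illustration. It compares a KV cache sized to the prompt alone (prefill) with one padded to prompt plus new tokens (decode).

```python
def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """Bytes needed for the key and value caches of all layers.

    Hypothetical helper; the default dimensions are illustrative assumptions.
    """
    # The leading 2 accounts for storing both keys and values.
    return 2 * layers * batch * kv_heads * seq_len * head_dim * dtype_bytes


batch, input_len, new_tokens = 8, 128, 1024

# Prefill: the cache only needs to cover the input sequence length.
prefill_bytes = kv_cache_bytes(batch, input_len)

# Decode: the cache is padded to input length + new tokens.
decode_bytes = kv_cache_bytes(batch, input_len + new_tokens)

# Fraction of the full-size cache that is not allocated during prefill.
saved_fraction = 1 - prefill_bytes / decode_bytes

print(f"prefill: {prefill_bytes / 2**20:.0f} MiB")
print(f"decode:  {decode_bytes / 2**20:.0f} MiB")
print(f"saved during prefill: {saved_fraction:.1%}")
```

Under these assumed numbers the prefill cache is one ninth of the decode cache, which is the kind of headroom that would allow larger batch sizes. Actual savings depend on the model and on the ratio of prompt length to generated tokens.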
)
past_key_value = (past_key, past_value)
# Return list instead of tuple
past_key_value = [past_key, past_value]
This could impact TGI, as TGI goes through this route.
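As a general illustration of why switching the return type from a tuple to a list can matter to downstream callers, the sketch below uses plain strings as stand-ins for the key and value tensors. It shows generic Python behavior only; whether TGI depends on any of these properties is not stated in this thread.

```python
# Stand-ins for the real key/value tensors (assumption for illustration).
past_key, past_value = "key_tensor", "value_tensor"

as_tuple = (past_key, past_value)
as_list = [past_key, past_value]

# Unpacking and indexing behave the same for both containers.
k1, v1 = as_tuple
k2, v2 = as_list
same_unpacking = (k1, v1) == (k2, v2)

# A strict type check distinguishes them, so a caller that tests
# isinstance(..., tuple) would see different behavior.
passes_tuple_check = isinstance(as_list, tuple)

# Only the list allows an entry to be replaced in place,
# for example to drop a reference to a tensor.
as_list[0] = None
try:
    as_tuple[0] = None
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

print(same_unpacking, passes_tuple_check, tuple_is_mutable)
```

Callers that only unpack or index the pair are unaffected, while callers that type-check, hash, or compare against a tuple would need review.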
Merged through: #1028
What does this PR do?
Before submitting