-
Hi, sorry for another message. I am writing a backend for Tenstorrent's chips. I've been on commit 0fff7fd for 3 weeks and wanted to pull changes from upstream llama.cpp. Since about 2 weeks ago, I've been failing to pull upstream changes and make them work. My implementation of the CPY operator has some limitations; I did some initial debugging but cannot wiggle myself out of the problem. Even if I force-allow the operator to pass, I see a massive performance regression (5x slower) and the LLM is not coherent (expected, since I banned some copies for this reason). I am unsure whether this is a bug in my backend or in GGML. Even if my backend refuses to run CPY, I believe GGML could copy both tensors to the CPU, do the copy there, and write the result back to the device. I know that is slow, but it should work nevertheless. Can someone point me in some direction? And under what conditions does the error show up?
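(For context, a ggml backend reports which ops it can run through a supports_op callback, and returning false there is how a backend "refuses" an op. Below is a minimal sketch of what rejecting unsupported CPY cases might look like; the function name, the exact signature, and the contiguity restriction are illustrative assumptions, not the actual Tenstorrent backend code, and the callback signature has changed across ggml versions.)

```c
#include "ggml.h"
#include "ggml-backend.h"

// Illustrative sketch only: a backend's supports_op callback. Returning
// false tells the ggml scheduler that this backend cannot run the op, so
// the scheduler has to place it on another backend (e.g. the CPU).
static bool ggml_backend_tt_supports_op(ggml_backend_dev_t dev,
                                        const struct ggml_tensor * op) {
    (void) dev; // unused in this sketch

    switch (op->op) {
        case GGML_OP_CPY:
            // Hypothetical restriction: only same-type, contiguous copies.
            return op->type == op->src[0]->type && ggml_is_contiguous(op);
        default:
            return true;
    }
}
```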
-
I am not aware of any recent changes to this behavior, other than the occasional minor bug fix. It would be possible to copy the tensor to the CPU and copy it back again after the operation, but that's not implemented. You can try using `-nkvo` to avoid offloading the KV cache until you have implemented the CPY operation.
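For example, with the llama-cli example binary (the model path and prompt are placeholders, not from the thread):

```sh
# -nkvo (--no-kv-offload) keeps the KV cache in host memory, so the
# KV-cache copy ops never reach the device backend.
./llama-cli -m models/your-model.gguf -nkvo -p "Hello"
```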