Architecture: what if I want to optimize for llama.cpp? #3395
-
We have our model converted to GGUF with quantization, shout-out to @teleprint-me and @ds5t5. But it's still slow; our problem is the prompt. The prefill speed is about 500 tokens/s (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user has to wait 4 seconds for a typical 2000-token context. We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill? How do we make it 5-10x faster, besides making the network smaller?
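For reference, a minimal sketch of the latency arithmetic behind these numbers (the 2000-token context and 500 tokens/s baseline are taken from the question; this is a standalone snippet, not llama.cpp code):

```cpp
// Prefill latency = prompt tokens / prefill speed.
// Shows what the 5-10x speedup asked about would mean for the user-visible wait.
#include <cstdio>

int main() {
    const double prompt_tokens = 2000.0; // typical FIM context from the question
    const double baseline_tps  = 500.0;  // observed prefill speed on the M1

    const double speedups[] = {1.0, 5.0, 10.0};
    for (double speedup : speedups) {
        const double tps     = baseline_tps * speedup;
        const double latency = prompt_tokens / tps; // seconds before generation can start
        printf("%4.0fx faster -> %6.0f tok/s -> %.2f s prefill latency\n",
               speedup, tps, latency);
    }
    return 0;
}
```

At the baseline that is 4.0 s per completion request; a 5x speedup brings it to 0.8 s and 10x to 0.4 s, which is roughly the budget where inline completion starts to feel responsive.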
-
Do you mean that with Torch, for example, on the same model/hardware you do reach that 5-10x faster mark?
-
The M1 has 2.6 TFLOPS of compute, while the 3090 has 35.6 TFLOPS. The observed difference is within expectations.
There could be something we can do to squeeze out some extra perf from the M1 by writing faster Metal kernels, but I doubt there will be a significant jump. Currently, you get the highest prefill speed with F16 models, so if you are using a quantized model, you can try switching to F16.
Architecture-wise, I don't think you can optimize much: all the compute is in the matrix multiplications and you definitely need those.
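To make the "all the compute is in the matrix multiplications" point concrete, here is a sketch of the usual back-of-the-envelope bound: prefill costs roughly 2 FLOPs per model parameter per prompt token (matmuls dominate; the quadratic attention term is ignored for short contexts), so peak compute caps the achievable prefill speed. The 1.6e9 parameter count and the 50% utilization figure below are illustrative assumptions, not numbers from this thread; the peak TFLOPS figures are the ones quoted above.

```cpp
// Rough upper bound on prefill tokens/s from peak compute (a sketch, not llama.cpp code).
// Assumption: prefill cost ~ 2 FLOPs per parameter per prompt token.
#include <cstdio>

int main() {
    const double n_params      = 1.6e9;            // assumed model size -- substitute your own
    const double flops_per_tok = 2.0 * n_params;   // matmul-dominated estimate

    struct Device { const char *name; double peak_tflops; };
    const Device devices[] = {
        { "Apple M1 GPU", 2.6  },   // figures quoted in the reply above
        { "RTX 3090",     35.6 },
    };

    const double utilization = 0.5; // assumed fraction of peak actually sustained

    for (const Device &d : devices) {
        const double tps = d.peak_tflops * 1e12 * utilization / flops_per_tok;
        printf("%-12s ~%.0f prefill tok/s at %.0f%% of peak\n",
               d.name, tps, utilization * 100.0);
    }
    return 0;
}
```

Whatever the exact model size or utilization, the ratio of peak compute (35.6 / 2.6 ≈ 13.7x) sets the expected gap between the two devices, which is why the observed difference is within expectations and why shrinking the matmul work (i.e. the network) is the main architectural lever.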