Architecture: what if I want to optimize for llama.cpp? #3395
-
We have our model converted to GGUF with quantization, shout-out to @teleprint-me and @ds5t5. But it's still slow; our problem is the prompt. The prefill speed is about 500 tokens/s (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user has to wait 4 seconds for a typical 2000-token context. We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill? How do we make it 5-10x faster, besides making the network smaller?
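For reference, a minimal sketch of the latency arithmetic behind these numbers (the 2000-token context and 500 tokens/s baseline are taken from the question; this is a standalone snippet, not llama.cpp code):

```cpp
// Prefill latency = prompt tokens / prefill speed.
// Shows what the 5-10x speedup asked about would mean for the user-visible wait.
#include <cstdio>

int main() {
    const double prompt_tokens = 2000.0; // typical FIM context from the question
    const double baseline_tps  = 500.0;  // observed prefill speed on the M1

    const double speedups[] = {1.0, 5.0, 10.0};
    for (double speedup : speedups) {
        const double tps     = baseline_tps * speedup;
        const double latency = prompt_tokens / tps; // seconds before generation can start
        printf("%4.0fx faster -> %6.0f tok/s -> %.2f s prefill latency\n",
               speedup, tps, latency);
    }
    return 0;
}
```

At the baseline that is 4.0 s per completion request; a 5x speedup brings it to 0.8 s and 10x to 0.4 s, which is roughly the budget where inline completion starts to feel responsive.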
-
Do you mean that with Torch, for example, on the same model/hardware you do reach that 5-10x faster mark?
-
The M1 has 2.6 TFLOPS of compute, while the 3090 has 35.6 TFLOPS. The observed difference is within expectations.
There could be something we can do to squeeze out some extra perf from the M1 by writing faster Metal kernels, but I doubt there will be a significant jump. Currently, you get the highest prefill speed with F16 models, so if you are using a quantized model, you can try switching to F16.
Architecture-wise, I don't think you can optimize much: all the compute is in the matrix multiplications and you definitely need those.
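To make the "all the compute is in the matrix multiplications" point concrete, here is a sketch of the usual back-of-the-envelope bound: prefill costs roughly 2 FLOPs per model parameter per prompt token (matmuls dominate; the quadratic attention term is ignored for short contexts), so peak compute caps the achievable prefill speed. The 1.6e9 parameter count and the 50% utilization figure below are illustrative assumptions, not numbers from this thread; the peak TFLOPS figures are the ones quoted above.

```cpp
// Rough upper bound on prefill tokens/s from peak compute (a sketch, not llama.cpp code).
// Assumption: prefill cost ~ 2 FLOPs per parameter per prompt token.
#include <cstdio>

int main() {
    const double n_params      = 1.6e9;            // assumed model size -- substitute your own
    const double flops_per_tok = 2.0 * n_params;   // matmul-dominated estimate

    struct Device { const char *name; double peak_tflops; };
    const Device devices[] = {
        { "Apple M1 GPU", 2.6  },   // figures quoted in the reply above
        { "RTX 3090",     35.6 },
    };

    const double utilization = 0.5; // assumed fraction of peak actually sustained

    for (const Device &d : devices) {
        const double tps = d.peak_tflops * 1e12 * utilization / flops_per_tok;
        printf("%-12s ~%.0f prefill tok/s at %.0f%% of peak\n",
               d.name, tps, utilization * 100.0);
    }
    return 0;
}
```

Whatever the exact model size or utilization, the ratio of peak compute (35.6 / 2.6 ≈ 13.7x) sets the expected gap between the two devices, which is why the observed difference is within expectations and why shrinking the matmul work (i.e. the network) is the main architectural lever.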