Architecture: what if I want to optimize for llama.cpp? #3395

Answered by ggerganov
olegklimov asked this question in Q&A

The M1 has 2.6 TFLOPS of compute, while the 3090 has 35.6 TFLOPS, so there is roughly a 13-14x gap in raw throughput and the observed difference is within expectations.
We could probably squeeze some extra performance out of the M1 by writing faster Metal kernels, but I doubt it would be a significant jump. Currently, you get the highest prefill speed with F16 models, so if you are using a quantized model, you can try switching to F16.
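A quick back-of-the-envelope check of that gap, as a sketch assuming prefill is purely compute-bound (the TFLOPS figures are the ones quoted above):

```python
# Sanity check: if prefill is compute-bound, prefill speed should scale
# roughly with peak compute throughput.
m1_tflops = 2.6        # Apple M1 GPU peak throughput (figure quoted above)
rtx3090_tflops = 35.6  # NVIDIA RTX 3090 peak throughput (figure quoted above)

ratio = rtx3090_tflops / m1_tflops
print(f"expected prefill speed gap: ~{ratio:.1f}x")  # ~13.7x
```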

Architecture-wise, I don't think you can optimize much: all the compute is in the matrix multiplications, and you definitely need those.
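To put numbers on that, here is a minimal sketch using the common approximation that a transformer forward pass costs about 2 FLOPs per parameter per token; the 7B model size, 512-token prompt, and 50% efficiency below are illustrative assumptions, not measurements:

```python
# Minimal sketch: estimate compute-bound prefill time from model size and
# peak throughput, using the ~2 FLOPs/parameter/token approximation for
# the weight matmuls in a transformer forward pass.
def prefill_seconds(n_params: float, n_prompt_tokens: int,
                    peak_tflops: float, efficiency: float = 0.5) -> float:
    """Estimated wall time to prefill a prompt, assuming matmuls dominate.

    efficiency is the assumed fraction of peak FLOPS actually achieved.
    """
    flops = 2.0 * n_params * n_prompt_tokens          # total matmul FLOPs
    return flops / (peak_tflops * 1e12 * efficiency)  # seconds

# Illustrative: hypothetical 7B-parameter F16 model, 512-token prompt.
for name, tflops in [("M1", 2.6), ("RTX 3090", 35.6)]:
    t = prefill_seconds(7e9, 512, tflops)
    print(f"{name}: ~{t:.2f} s prefill (~{512 / t:.0f} tokens/s)")
```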

This discussion was converted from issue #3390 on September 29, 2023 10:45.