Performance of llama.cpp on Apple Silicon M-series #4167
Replies: 71 comments 138 replies
-
M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅
build: 8e672ef (1550)
-
M2 Max Studio, 8+4 CPU, 38 GPU ✅
build: 8e672ef (1550)
-
M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅
build: 8e672ef (1550)
-
M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅
build: 55978ce (1555)
Short Note: mostly similar to the one reported by @slaren. But for Q4_0
-
In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.
-
How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.
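As a rough starting point for the RAM question above, here is a minimal sketch that estimates the size of the weights alone for the three formats discussed in this thread. It assumes a nominal 7B parameter count, 16 / 8.5 / 4.5 bits per weight for F16 / Q8_0 / Q4_0, and a hypothetical `usable_fraction` of RAM. It ignores the KV cache, which grows with context length, so treat the result as a lower bound rather than a guarantee.

```python
# Bits per weight for three llama.cpp formats. Q8_0 and Q4_0 store blocks of
# 32 weights plus one fp16 scale, hence 8.5 and 4.5 bits per weight.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weights_gib(n_params: float, quant: str) -> float:
    """Approximate size of the model weights in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


def fits(n_params: float, quant: str, ram_gib: float,
         usable_fraction: float = 0.7) -> bool:
    """Rough check: do the weights alone fit in the usable share of RAM?

    usable_fraction is an assumed placeholder, not a measured limit, and the
    KV cache is not counted here.
    """
    return weights_gib(n_params, quant) <= ram_gib * usable_fraction


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B {quant}: {weights_gib(7e9, quant):.1f} GiB")
```

Calling `fits(70e9, "Q4_0", 64)` applies the same check to a nominal 70B model on a 64 GB machine. Real headroom depends on context length and on what else is running.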
-
M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅
build: e9c13ff (1560)
-
Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.
-
M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅
build: e9c13ff (1560)
-
M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅
build: e9c13ff (1560)
Note: M1 Max RAM bandwidth is 400 GB/s
-
Look at what I started
-
M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅
build: e9c13ff (1560)
-
### M2 MAX (MBP 16) 38 Core 32GB ✅
build: 795cd5a (1493)
-
I'm looking at the summary plot of "PP performance vs GPU cores" and I see evidence that the original unquantized fp16 model always delivers more performance than the quantized models.
-
M4 Max (Macbook Pro 16" 2024), 12+4 CPU, 40GPU, 128 GB Memory ✅
build: 8e672ef (1550)
-
Has anyone tried an M4 Pro with 64 GB? Is it possible to run a 70B model at a usable speed?
-
If I'm reading this correctly, the M3 Pro is slower than the M2 Pro?
-
M4 Pro, 8+4 CPU, 16 GPU, 24 GB Memory (MBP 14) ✅
build: 8e672ef (1550)
-
M4 Max (Macbook Pro 14" 2024), 12+4 CPU, 40 GPU, 128 GB Memory
build: 8e672ef (1550)
-
Which models can my M3 16GB MacBook Air support?
-
Why specifically is the M2 so cracked compared to the M3 and M4?
-
M4 Max (Macbook Pro 16" 2024), 16 CPU, 40 GPU, 128 GB Memory
Used the command below
-
Okay Apple... no M4 Ultra in the Mac Studio, but a M4 Max or M3 Ultra - ... planning a new Mac Pro, huh?! So let's add these guys to the table 😁
-
M3 Ultra 20+8 CPU, 60 GPU, 256GB RAM ✅
build: 8e672ef (1550)
-
M3 Ultra 24+8 CPU, 80 GPU, 512GB RAM ✅
build: 8e672ef (1550)
-
Can someone please test the base Mac Ultra M4 :) |
-
Cost Per Token April 2025 ~ Bang for Buck
Seems like the M4 Mac Mini is the cheapest instant win for now, with an M1 Max Studio coming in a close second.
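For anyone who wants to redo this kind of bang-for-buck ranking with their own numbers, here is a minimal sketch. It assumes the metric is simply purchase price divided by text-generation throughput, which may not be how the ranking above was computed. The device names, prices and t/s values are hypothetical placeholders, not measurements from this thread.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    price_usd: float   # purchase price (placeholder values below)
    tg_tps: float      # text-generation throughput in tokens per second


def usd_per_tps(entry: Entry) -> float:
    """Dollars paid per token/second of TG throughput (lower is better)."""
    return entry.price_usd / entry.tg_tps


def rank_by_value(entries: list[Entry]) -> list[Entry]:
    """Sort devices from best to worst value under this metric."""
    return sorted(entries, key=usd_per_tps)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    devices = [
        Entry("device_a", price_usd=600.0, tg_tps=20.0),
        Entry("device_b", price_usd=2000.0, tg_tps=50.0),
    ]
    for e in rank_by_value(devices):
        print(f"{e.name}: {usd_per_tps(e):.1f} USD per t/s")
```

Swapping `tg_tps` for a PP figure, or adding power draw to the cost, gives a different ranking, so it helps to state which metric a comparison uses.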
-
... cross-posted to the Vulkan thread:

Mac Pro 2013 🗑️ 12-core Xeon E5-2697 v2, Dual FirePro D700, 64 GB RAM, macOS Monterey

Note: I've updated this post -- when I posted the first time I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and checked that the models were working correctly before re-benchmarking. When the system was spitting garbage, it was reporting about 30% higher t/s rates across the board.

Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

build: d3bd719 (5092)

The FP16 model was throwing garbage, so I did not include it here -- it will require some unique flags to run correctly.

Additionally, here are the 8- and 4-bit llama 2 7B runs on the CPU alone (using the -ngl 0 flag):

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null

build: d3bd719 (5092)
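The two runs above differ only in the `-ngl` value, so a small helper can keep them consistent. This is a sketch, not part of llama.cpp: `build_bench_command` is a hypothetical name, and it uses only the flags that appear in the commands above (`-m`, `-p`, `-n`, `-ngl`). The binary and model paths are taken from this post and will differ on other machines.

```python
import shlex


def build_bench_command(models, n_prompt=512, n_gen=128, n_gpu_layers=99,
                        binary="./build/bin/llama-bench"):
    """Build the llama-bench argument list used in the runs above."""
    cmd = [binary]
    for model in models:
        cmd += ["-m", str(model)]  # one -m per model, as in the post
    cmd += ["-p", str(n_prompt), "-n", str(n_gen), "-ngl", str(n_gpu_layers)]
    return cmd


if __name__ == "__main__":
    models = [
        "../llm-models/llama2-7b-chat-q8_0.gguf",
        "../llm-models/llama-2-7b-chat.Q4_0.gguf",
    ]
    # GPU run (-ngl 99) and CPU-only run (-ngl 0); print them for copy/paste.
    for ngl in (99, 0):
        print(shlex.join(build_bench_command(models, n_gpu_layers=ngl)))
```

To execute a command instead of printing it, the list can be passed to `subprocess.run(cmd, stderr=subprocess.DEVNULL)`, which mirrors the `2> /dev/null` redirection.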
-
Just saying... shouldn't the OP be edited with the actual used bandwidth numbers, rather than the BS figures Apple gave to the press?
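One way to approach the question above is to back an "effective" bandwidth out of a TG result. The sketch below assumes that generating one token streams all model weights from memory once, which is a common approximation and not something llama.cpp reports. The function names are hypothetical and the example inputs are placeholders, not results from this thread.

```python
def effective_bandwidth_gbs(tg_tokens_per_s: float, model_size_gb: float) -> float:
    """Bandwidth implied by a TG result, assuming one full pass over the
    weights per generated token."""
    return tg_tokens_per_s * model_size_gb


def tg_upper_bound(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Best-case TG tokens/s if memory bandwidth were the only limit."""
    return bandwidth_gbs / model_size_gb


if __name__ == "__main__":
    # Placeholder inputs: a 4 GB model generating 50 t/s on a chip whose
    # advertised bandwidth is 400 GB/s.
    implied = effective_bandwidth_gbs(50.0, 4.0)
    ceiling = tg_upper_bound(400.0, 4.0)
    print(f"implied bandwidth: {implied:.0f} GB/s")
    print(f"bandwidth-bound TG ceiling: {ceiling:.0f} t/s")
```

Comparing the implied figure against the advertised one gives a rough utilization ratio. The model ignores KV-cache reads and compute limits, so it applies to TG rather than PP.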
-
Summary
LLaMA 7B results table: memory bandwidth [GB/s], GPU cores, and six throughput columns [t/s] (values not preserved).
plot.py
Description
This is a collection of short `llama.cpp` benchmarks on various Apple Silicon hardware. It can be useful to compare the performance that `llama.cpp` achieves across the M-series chips and hopefully answer questions of people wondering if they should upgrade or not. Collecting info here just for Apple Silicon for simplicity. Similar collection for A-series chips is available here: #4508

If you are a collaborator to the project and have an Apple Silicon device, please add your device, results and optionally username for the following command directly into this post (requires LLaMA 7B v2):

`PP` means "prompt processing" (`bs = 512`), `TG` means "text-generation" (`bs = 1`), `t/s` means "tokens per second"

Note that in this benchmark we are evaluating the performance against the same build 8e672ef (2023 Nov 13) in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares against the build 86ed72d (2024 Nov 21) on M2 Ultra:
(Comparison table: memory bandwidth [GB/s], GPU cores, and six throughput columns [t/s]; values not preserved.)
M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅
build: d103d93 (1553)
Footnotes
https://en.wikipedia.org/wiki/Apple_M1#Variants
https://en.wikipedia.org/wiki/Apple_M2#Variants
https://en.wikipedia.org/wiki/Apple_M3#Variants
https://en.wikipedia.org/wiki/Apple_M4#Variants