Performance of llama.cpp on Apple Silicon M-series #4167

ggerganov · 2023-11-22T09:46:54Z

ggerganov
Nov 22, 2023
Maintainer

Summary

LLaMA 7B

	BW [GB/s]	GPU Cores	F16 PP [t/s]	F16 TG [t/s]	Q8_0 PP [t/s]	Q8_0 TG [t/s]	Q4_0 PP [t/s]	Q4_0 TG [t/s]
✅ M1 ¹	68	7			108.21	7.92	107.81	14.19
✅ M1 ¹	68	8			117.25	7.91	117.96	14.15
✅ M1 Pro ¹	200	14	262.65	12.75	235.16	21.95	232.55	35.52
✅ M1 Pro ¹	200	16	302.14	12.75	270.37	22.34	266.25	36.41
✅ M1 Max ¹	400	24	453.03	22.55	405.87	37.81	400.26	54.61
✅ M1 Max ¹	400	32	599.53	23.03	537.37	40.2	530.06	61.19
✅ M1 Ultra ¹	800	48	875.81	33.92	783.45	55.69	772.24	74.93
✅ M1 Ultra ¹	800	64	1168.89	37.01	1042.95	59.87	1030.04	83.73

✅ M2 ²	100	8			147.27	12.18	145.91	21.7
✅ M2 ²	100	10	201.34	6.72	181.4	12.21	179.57	21.91
✅ M2 Pro ²	200	16	312.65	12.47	288.46	22.7	294.24	37.87
✅ M2 Pro ²	200	19	384.38	13.06	344.5	23.01	341.19	38.86
✅ M2 Max ²	400	30	600.46	24.16	540.15	39.97	537.6	60.99
✅ M2 Max ²	400	38	755.67	24.65	677.91	41.83	671.31	65.95
✅ M2 Ultra ²	800	60	1128.59	39.86	1003.16	62.14	1013.81	88.64
✅ M2 Ultra ²	800	76	1401.85	41.02	1248.59	66.64	1238.48	94.27

🟥 M3 ³	100	8
🟨 M3 ³	100	10			187.52	12.27	186.75	21.34
🟨 M3 Pro ³	150	14			272.11	17.44	269.49	30.65
✅ M3 Pro ³	150	18	357.45	9.89	344.66	17.53	341.67	30.74
✅ M3 Max ³	300	30	589.41	19.54	566.4	34.3	567.59	56.58
✅ M3 Max ³	400	40	779.17	25.09	757.64	42.75	759.7	66.31
✅ M3 Ultra ³	800	60	1121.80	42.24	1085.76	63.55	1073.09	88.40
✅ M3 Ultra ³	800	80	1538.34	39.78	1487.51	63.93	1471.24	92.14

🟥 M4 ⁴	120	8
✅ M4 ⁴	120	10	230.18	7.43	223.64	13.54	221.29	24.11
✅ M4 Pro ⁴	273	16	381.14	17.19	367.13	30.54	364.06	49.64
✅ M4 Pro ⁴	273	20	464.48	17.18	449.62	30.69	439.78	50.74
🟥 M4 Max ⁴	410	32
✅ M4 Max ⁴	546	40	922.83	31.64	891.94	54.05	885.68	83.06
🟥 M4 Ultra	820	64
🟥 M4 Ultra	1092	80

plot.py

# GPT-4 Generated Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating DataFrame from the provided data
data = {
    "Chip": ["M1", "M1", "M1 Pro", "M1 Pro", "M1 Max", "M1 Max", "M1 Ultra", "M2", "M2 Pro", "M2 Pro", "M2 Max", "M2 Max", "M2 Ultra", "M2 Ultra", "M3", "M3 Pro", "M3 Pro", "M3 Max"],
    "BW (GB/s)":     [68, 68, 200, 200, 400, 400, 800, 100, 200, 200, 400, 400, 800, 800, 100, 150, 150, 400],
    "GPU Cores":     [7, 8, 14, 16, 24, 32, 48, 10, 16, 19, 30, 38, 60, 76, 10, 14, 18, 40],
    "F16 PP (t/s)":  [None, None, None, 302.14, 453.03, 599.53, 875.81, 201.34, 312.65, 384.38, 600.46, 755.67, 1128.59, 1401.85, None, None, 357.45, 779.17],
    "F16 TG (t/s)":  [None, None, None, 12.75, 22.55, 23.03, 33.92, 6.72, 12.47, 13.06, 24.16, 24.65, 39.86, 41.02, None, None, 9.89, 25.09],
    "Q8_0 PP (t/s)": [108.21, 117.25, 235.16, 270.37, 405.87, 537.37, 783.45, 181.4, 288.46, 344.5, 540.15, 677.91, 1003.16, 1248.59, 187.52, 272.11, 344.66, 757.64],
    "Q8_0 TG (t/s)": [7.92, 7.91, 21.95, 22.34, 37.81, 40.2, 55.69, 12.21, 22.7, 23.01, 39.97, 41.83, 62.14, 66.64, 12.27, 17.44, 17.53, 42.75],
    "Q4_0 PP (t/s)": [107.81, 117.96, 232.55, 266.25, 400.26, 530.06, 772.24, 179.57, 294.24, 341.19, 537.6, 671.31, 1013.81, 1238.48, 186.75, 269.49, 341.67, 759.7],
    "Q4_0 TG (t/s)": [14.19, 14.15, 35.52, 36.41, 54.61, 61.19, 74.93, 21.91, 37.87, 38.86, 60.99, 65.95, 88.64, 94.27, 21.34, 30.65, 30.74, 66.31]
}
df = pd.DataFrame(data)

# Helper function to plot and annotate multiple data series in the same plot
def plot_multi_series(ax, x, y_series, labels, xlabel, ylabel, title, poly_power=1):
    colors = ['r', 'g', 'b']  # Colors for different series

    # Converting to float arrays so that positional indexing and NaN masking
    # do not depend on the pandas index (None entries become NaN)
    x = np.asarray(x, dtype=float)
    y_series = [np.asarray(y, dtype=float) for y in y_series]

    for i, y in enumerate(y_series):
        # Masking NaN values (chips without a result for this series)
        mask = ~np.isnan(y)
        x_valid = x[mask]
        y_valid = y[mask]

        # Fitting a polynomial regression model
        coefficients = np.polyfit(x_valid, y_valid, poly_power)
        polynomial = np.poly1d(coefficients)

        # Creating a range of x-values for a smoother trendline
        x_range = np.linspace(x_valid.min(), x_valid.max(), 500)
        trendline = polynomial(x_range)

        # Plotting
        ax.scatter(x_valid, y_valid, color=colors[i], label=labels[i], s=20)
        ax.plot(x_range, trendline, f"{colors[i]}-", linewidth=1)  # Trendline in the same color

    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()

    # Annotating points with the number of GPU cores and Bandwidth.
    # Each label is anchored at the first series that has a value for that chip,
    # because some chips have no F16 result and a NaN position is not drawn.
    for i in range(len(x)):
        y_anchor = next((y[i] for y in y_series if not np.isnan(y[i])), None)
        if y_anchor is not None:
            ax.annotate(f"{df['GPU Cores'][i]} Cores, {df['BW (GB/s)'][i]} GB/s", (x[i], y_anchor))


# Creating plots for PP vs Cores and TG vs Bandwidth
fig, axs = plt.subplots(1, 2, figsize=(15, 6))
fig.suptitle('PP vs GPU Cores and TG vs Bandwidth for F16, Q8_0, and Q4_0')

# PP vs GPU Cores
y_series_cores_pp = [df["F16 PP (t/s)"], df["Q8_0 PP (t/s)"], df["Q4_0 PP (t/s)"]]
plot_multi_series(axs[0], df["GPU Cores"], y_series_cores_pp,
                  ['F16 PP', 'Q8_0 PP', 'Q4_0 PP'], 'GPU Cores', 'Performance (t/s)',
                  'PP Performance vs GPU Cores', 1)

# TG vs Bandwidth
y_series_bw_tg = [df["F16 TG (t/s)"], df["Q8_0 TG (t/s)"], df["Q4_0 TG (t/s)"]]
plot_multi_series(axs[1], df["BW (GB/s)"], y_series_bw_tg,
                  ['F16 TG', 'Q8_0 TG', 'Q4_0 TG'], 'Bandwidth (GB/s)', 'Performance (t/s)',
                  'TG Performance vs Bandwidth', 2)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance that llama.cpp achieves across the M-series chips and will hopefully answer questions from people wondering whether they should upgrade. For simplicity, only Apple Silicon results are collected here. A similar collection for A-series chips is available here: #4508

If you are a collaborator on the project and have an Apple Silicon device, please add your device, results and optionally your username for the following command directly into this post (requires LLaMA 7B v2):

git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

Make sure to run the benchmark on commit 8e672ef
Please also include the F16 model as shown, not just the quantum models
Contributors can post the same results in the comments below
If a device is already benchmarked and your results are comparable, there is no need to add it again
PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"
✅ means the data has been added to the summary

Note that in this benchmark we are evaluating the performance against the same build 8e672ef (2023 Nov 13) in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares against the build 86ed72d (2024 Nov 21) on M2 Ultra:

	BW [GB/s]	GPU Cores	F16 PP [t/s]	F16 TG [t/s]	Q8_0 PP [t/s]	Q8_0 TG [t/s]	Q4_0 PP [t/s]	Q4_0 TG [t/s]
M2 Ultra `8e672ef`	800	76	1401.85	41.02	1248.59	66.64	1238.48	94.27
M2 Ultra `86ed72d` + FA	800	76	1525.95	43.15	1368.18	73.11	1391.78	108.80

M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	302.14 ± 0.07
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	12.75 ± 0.00
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	270.37 ± 0.02
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	22.34 ± 0.00
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	266.25 ± 0.07
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	36.41 ± 0.01

build: 8e672ef (1550)
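For anyone folding results like the ones above into the summary table, the rows can be parsed mechanically. This is a minimal sketch, assuming the rows are pasted as tab-separated text in the 7-column layout shown in this thread; `parse_bench_rows` and `summary_cells` are hypothetical helper names, not part of llama.cpp.

```python
def parse_bench_rows(text):
    """Parse tab-separated llama-bench rows into dicts (assumes the 7-column layout shown in this thread)."""
    rows = []
    for line in text.strip().splitlines():
        cols = [c.strip() for c in line.split("\t")]
        if len(cols) != 7 or cols[0] == "model":
            continue  # skip the header row and anything that is not a result row
        mean, _, stddev = cols[6].partition("±")
        rows.append({
            "quant": cols[0].split()[-1],        # e.g. "F16", "Q8_0", "Q4_0"
            "test": cols[5].split()[0].upper(),  # "PP" or "TG"
            "tps": float(mean),
            "stddev": float(stddev) if stddev.strip() else 0.0,
        })
    return rows


def summary_cells(rows):
    """Map results to summary column names such as 'F16 PP [t/s]'."""
    return {f"{r['quant']} {r['test']} [t/s]": r["tps"] for r in rows}


sample = (
    "model\tsize\tparams\tbackend\tngl\ttest\tt/s\n"
    "llama 7B mostly F16\t12.55 GiB\t6.74 B\tMetal\t99\tpp 512\t302.14 ± 0.07\n"
    "llama 7B mostly F16\t12.55 GiB\t6.74 B\tMetal\t99\ttg 128\t12.75 ± 0.00\n"
)
cells = summary_cells(parse_bench_rows(sample))
print(cells)
```

The standard deviation is kept in the parsed rows so that noisy runs can be spotted before a value is copied into the summary.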

M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	1401.85 ± 1.75
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	41.02 ± 0.02
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	1248.59 ± 0.73
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	66.64 ± 0.02
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	1238.48 ± 0.76
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	94.27 ± 0.05

build: 8e672ef (1550)

M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	794.26 ± 3.16
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	25.27 ± 0.07
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	749.37 ± 8.35
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	43.00 ± 0.12
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	690.99 ± 33.76
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	65.85 ± 0.22

build: d103d93 (1553)

QueryType · 2023-11-22T17:17:06Z

QueryType
Nov 22, 2023

M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	201.34 ± 0.21
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	6.72 ± 0.01
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	181.40 ± 0.05
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	12.21 ± 0.04
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	179.57 ± 0.04
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	21.91 ± 0.02

build: 8e672ef (1550)

0 replies

brozkrut · 2023-11-23T15:50:17Z

brozkrut
Nov 23, 2023

M2 Max Studio, 8+4 CPU, 38 GPU ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	755.67 ± 0.11
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	24.65 ± 0.02
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	677.91 ± 0.26
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	41.83 ± 0.03
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	671.31 ± 0.20
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	65.95 ± 0.08

build: 8e672ef (1550)

8 replies

maver1ck Dec 16, 2023

Wow. I wasn't aware that 4090 is so fast.

vitali-fridman Dec 26, 2023

This is from hardware that is one or two generations old, but it is for a 70B model, which might be of interest.

CPU: AMD 3995WX, GPU: 2x Nvidia 3090, Ubuntu 23.10, Kernel 6.5.0-14, NV Driver: 545.23.08, CUDA: 12.3.1

model	size	params	backend	ngl	test	t/s
llama 70B Q4_0	36.20 GiB	68.98 B	CUDA	99	pp 512	179.29 ± 2.83
llama 70B Q4_0	36.20 GiB	68.98 B	CUDA	99	tg 128	21.17 ± 0.04

For comparison, 7B model on the same hardware

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp 512	1178.60 ± 88.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg 128	87.34 ± 0.89

zotona Dec 27, 2023

Could you try a 7B model for a correct comparison? Thanks!

pukhrajvansh Feb 16, 2025

what the hell this is lower than m4 max, i mean 2x 3090 whatt..??

atlas5301 Feb 17, 2025

> what the hell this is lower than m4 max, i mean 2x 3090 whatt..??

Probably because llama.cpp is not well optimized on gpus. You can expect significantly better throughput with sglang and vllm.

crasm · 2023-11-23T19:02:44Z

crasm
Nov 23, 2023

M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	1128.59 ± 0.82
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	39.86 ± 0.01
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	1003.16 ± 0.39
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	62.14 ± 0.03
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	1013.81 ± 0.92
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	88.64 ± 0.06

build: 8e672ef (1550)

0 replies

ymcui · 2023-11-24T03:17:39Z

ymcui
Nov 24, 2023

M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	779.17 ± 0.49
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	25.09 ± 0.01
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	757.64 ± 1.03
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	42.75 ± 0.06
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	759.70 ± 2.26
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	66.31 ± 0.12

build: 55978ce (1555)

Short note: mostly similar to the results reported by @slaren. But for Q4_0 pp 512, my result is 759.70 ± 2.26, while the one in the main post is 690.99 ± 33.76. Not sure about the source of the difference.

1 reply

slaren Nov 24, 2023
Maintainer

I am not sure why, but the results that I get are not very consistent. I suspect that it may be due to the cooling limitations of the smaller laptop. I repeated the test now and the results are very similar to yours.

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	787.24 ± 0.84
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	25.15 ± 0.02
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	755.88 ± 1.56
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	42.64 ± 0.04
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	760.65 ± 0.77
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	66.35 ± 0.24
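The inconsistency discussed here is visible in the reported standard deviations. Below is a small sketch that flags runs whose spread is large relative to the mean. The 2% threshold is an arbitrary assumption for illustration, not a project rule, and the function names are hypothetical.

```python
def relative_spread(mean, stddev):
    """Standard deviation as a fraction of the mean."""
    return stddev / mean


def is_noisy(mean, stddev, threshold=0.02):
    """Flag a run whose relative spread exceeds the threshold (2% is an arbitrary choice)."""
    return relative_spread(mean, stddev) > threshold


# Q4_0 pp 512 on the M3 Max (MBP 14), values taken from this thread
runs = {
    "first run (main post)": (690.99, 33.76),
    "repeated run": (760.65, 0.77),
}
for name, (mean, stddev) in runs.items():
    print(f"{name}: {relative_spread(mean, stddev):.1%} noisy={is_noisy(mean, stddev)}")
```

A run flagged this way is a candidate for repeating, for example after the machine has cooled down.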

Azirine · 2023-11-24T09:08:33Z

Azirine
Nov 24, 2023

In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.

0 replies

Azirine · 2023-11-24T15:08:41Z

Azirine
Nov 24, 2023

How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.

1 reply

ggerganov Nov 24, 2023
Maintainer Author

You can compute these. By default, you can use ~75% of the total RAM with the GPU. You can use more if you do some tricks
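A rough sketch of that computation follows. It assumes the ~75% default mentioned above and approximate storage costs of 16, 8.5 and 4.5 bits per weight for F16, Q8_0 and Q4_0 (one 16-bit scale per block of 32 weights). It counts weights only and ignores the KV cache and other runtime buffers, so real limits are tighter. All names are hypothetical.

```python
GIB = 1024 ** 3

# Approximate storage cost per weight; Q8_0 and Q4_0 store one 16-bit scale per 32 weights.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def model_size_gib(n_params, quant):
    """Estimated weight size in GiB (weights only, no KV cache or scratch buffers)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / GIB


def gpu_budget_gib(ram_gib, usable_fraction=0.75):
    """Memory the GPU can use by default, per the ~75% rule of thumb above."""
    return ram_gib * usable_fraction


def fits(n_params, quant, ram_gib):
    """True if the estimated weights fit into the default GPU budget."""
    return model_size_gib(n_params, quant) <= gpu_budget_gib(ram_gib)


# The benchmark tables in this thread list the 7B F16 model at 12.55 GiB
print(f"7B F16: {model_size_gib(6.74e9, 'F16'):.2f} GiB")
print(f"70B Q4_0 on 64 GB: {fits(68.98e9, 'Q4_0', 64)}")
print(f"70B Q4_0 on 32 GB: {fits(68.98e9, 'Q4_0', 32)}")
```

The estimate for Q4_0 comes out slightly below the sizes listed in the tables, which is expected because "mostly Q4_0" models keep some tensors at higher precision.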

minosvasilias · 2023-11-24T20:36:12Z

minosvasilias
Nov 24, 2023

M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	312.65 ± 15.75
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	12.47 ± 0.71
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	288.46 ± 0.06
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	22.70 ± 0.12
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	294.24 ± 0.10
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	37.87 ± 0.10

build: e9c13ff (1560)

0 replies

to3d · 2023-11-24T22:06:23Z

to3d
Nov 24, 2023

Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.

0 replies

MrSparc · 2023-11-25T00:11:27Z

MrSparc
Nov 25, 2023

M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	674.50 ± 0.58
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	41.79 ± 0.04
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	669.51 ± 1.17
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	64.55 ± 1.36

build: e9c13ff (1560)

2 replies

rlippmann Nov 26, 2023

I'm also using a MBP16 M2Max with the same CPU/GPU specs, but only 32 gb ram and my results are roughly the same:

M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 32 GB RAM ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	747.99 ± 0.28
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	24.54 ± 0.22
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	674.37 ± 0.63
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	40.67 ± 0.05
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	668.28 ± 0.24
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	62.98 ± 0.06

build: 22da055 (1566)

MrSparc Nov 26, 2023

Yes, machines with the same CPU/GPU spec are expected to show similar performance for the same model regardless of RAM, as long as the model fits into memory.
The amount of RAM limits the size of the model that can be loaded, since by default only 75% of the unified memory can be used as VRAM by the GPU.
https://github.com/ggerganov/llama.cpp#memorydisk-requirements

CedricYauLBD · 2023-11-25T00:16:50Z

CedricYauLBD
Nov 25, 2023

M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	599.53 ± 0.86
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	23.03 ± 0.09
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	537.37 ± 0.19
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	40.20 ± 0.03
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	530.06 ± 0.17
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	61.19 ± 0.15

build: e9c13ff (1560)

Note: M1 Max RAM Bandwidth is 400GB/s

0 replies

philipturner · 2023-11-25T03:32:09Z

philipturner
Nov 25, 2023

Look at what I started

1 reply

yxzwayne Nov 25, 2023

off topic, but your benchmark output is my desktop rn :D

paramaggarwal · 2023-11-25T03:47:44Z

paramaggarwal
Nov 25, 2023

M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	272.11 ± 1.40
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	17.44 ± 0.42
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	269.49 ± 1.14
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	30.65 ± 0.20

build: e9c13ff (1560)

5 replies

ggerganov Nov 25, 2023
Maintainer Author

This one has 150 GB/s memory bandwidth, correct?

paramaggarwal Nov 25, 2023

Yes, that's correct. (source)

Kaszebe May 30, 2024

Could it run a Q5 quant of llama3 70b Instruct at ~2 tokens per second?

mladencucakSYN Mar 22, 2025

I'm also interested to see if it can run a somewhat bigger model with a reasonable outcome. I just don't want to spend MBP Max money.

bagobones Mar 22, 2025

The old models give excellent comparative numbers but I wonder if the benchmark needs to be re-based around the current most popular models at some point.

Not just bigger ones for finding the biggest but popular sets / distillations that go from small to very large.

It looks like 96-128 ish gigs of shared memory will be practical on Apple / AMD / nvidia digits going forward.

brozkrut · 2023-11-25T14:50:23Z

brozkrut
Nov 25, 2023

Chip (vs. Predecessor)	F16 PP	F16 TG	Q8_0 PP	Q8_0 TG	Q4_0 PP	Q4_0 TG
M2 Pro (16) vs. M1 Pro (16)	312.65 302.14	12.47 12.75	288.46 270.37	22.7 22.34	294.24 266.25	37.87 36.41
	+3.48%	-2.20%	+6.69%	+1.61%	+10.51%	+4.01%
M2 Max (38) vs. M1 Max (32)	755.67 599.53	24.65 23.03	677.91 537.37	41.83 40.2	671.31 530.06	65.95 61.19
	+26.04%	+7.03%	+26.15%	+4.05%	+26.65%	+7.78%
M2 Ultra (60) vs. M2 Max (38)	1128.59 755.67	39.86 24.65	1003.16 677.91	62.14 41.83	1013.81 671.31	88.64 65.95
	+49.34%	+61.90%	+48.04%	+48.48%	+51.03%	+34.41%
M2 Ultra (76) vs. M2 Max (38)	1401.85 755.67	41.02 24.65	1248.59 677.91	66.64 41.83	1238.48 671.31	94.27 65.95
	+85.67%	+66.45%	+84.24%	+59.47%	+84.53%	+43.06%
M2 Ultra (76) vs. M2 Ultra (60)	1401.85 1128.59	41.02 39.86	1248.59 1003.16	66.64 62.14	1238.48 1013.81	94.27 88.64
	+24.25%	+2.91%	+24.43%	+7.23%	+22.19%	+6.33%
M3 Pro (14) vs. M2 Pro (16)			272.11 288.46	17.44 22.7	269.49 294.24	30.65 37.87
			-5.67%	-23.17%	-8.41%	-19.07%
M3 Max (40) vs. M2 Max (38)	779.17 755.67	25.09 24.65	757.64 677.91	42.75 41.83	759.7 671.31	66.31 65.95
	+3.11%	+1.78%	+11.76%	+2.20%	+13.17%	+0.55%
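The percentages in this table are plain relative changes between the two values in each cell. A minimal sketch for recomputing them from the summary values is shown below, using the M2 Pro vs. M1 Pro row as the example; `pct_change` is a hypothetical helper name.

```python
def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# (new, old) pairs copied from the summary: M2 Pro (16 GPU cores) vs. M1 Pro (16 GPU cores)
pairs = {
    "F16 PP": (312.65, 302.14),
    "F16 TG": (12.47, 12.75),
    "Q8_0 PP": (288.46, 270.37),
    "Q8_0 TG": (22.70, 22.34),
    "Q4_0 PP": (294.24, 266.25),
    "Q4_0 TG": (37.87, 36.41),
}
for name, (new, old) in pairs.items():
    print(f"{name}: {pct_change(new, old):+.2f}%")
```

Running the same helper on other rows is a quick way to check a percentage whenever a summary value is updated.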

0 replies

pudepiedj · 2023-11-25T17:33:00Z

pudepiedj
Nov 25, 2023

M2 MAX (MBP 16) 38 Core 32GB ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	754.39 ± 0.36
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	24.31 ± 0.38
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	671.33 ± 2.65
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	40.85 ± 0.32
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	664.07 ± 9.11
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	63.29 ± 0.15

build: 795cd5a (1493)

0 replies

MrSparc · 2023-11-25T21:49:00Z

MrSparc
Nov 25, 2023

I'm looking at the summary plot of "PP performance vs GPU cores", and it shows that the original unquantized F16 model always delivers more performance than the quantized models.
Sorry if my question is silly, I'm new to this area, but can someone explain why the original model delivers more performance than the quantized models? Thanks

1 reply

ggerganov Nov 26, 2023
Maintainer Author

The question is not silly - the observation is expected. At large batch size (PP means batch size of 512) the computation is compute bound. I.e. the speed depends on how many FLOPS you can utilize. For quantum models, the existing kernels require extra compute to dequantize the data compared to F16 models where the data is already in F16 format.
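To illustrate where the extra compute comes from, here is a toy sketch of block-wise dequantization in the spirit of Q8_0 (one scale per block of 32 signed 8-bit values). It is plain Python for illustration only and does not reflect how the Metal kernels in llama.cpp are written; all function names are hypothetical.

```python
BLOCK = 32  # weights per block, as in Q8_0


def quantize_block(values):
    """Quantize 32 floats to (scale, list of ints in [-127, 127]) with a symmetric per-block scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dot_f16_style(weights, x):
    """Weights already in floating point: one multiply-add per weight."""
    return sum(w * xi for w, xi in zip(weights, x))


def dot_q8_style(scale, qs, x):
    """Quantized weights: each value is first dequantized (q * scale), then multiply-added."""
    return sum((q * scale) * xi for q, xi in zip(qs, x))


weights = [(i - 16) / 16.0 for i in range(BLOCK)]  # toy weights in [-1, 1)
x = [1.0] * BLOCK
scale, qs = quantize_block(weights)
print(dot_f16_style(weights, x), dot_q8_style(scale, qs, x))
```

Both functions produce nearly the same result, but the quantized path performs an additional multiplication per weight before the multiply-add, which only matters when the computation is compute bound.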

sekstini · 2024-11-13T00:30:51Z

sekstini
Nov 13, 2024

M4 Max (Macbook Pro 16" 2024), 12+4 CPU, 40GPU, 128 GB Memory ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	922.83 ± 1.12
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	31.64 ± 0.08
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	891.94 ± 0.28
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	54.05 ± 0.16
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	885.68 ± 1.11
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	83.06 ± 0.22

build: 8e672ef (1550)

13 replies

bluemoehre Nov 19, 2024

@ggerganov insane configuration having 192GB RAM. I was thinking on getting a used Studio M2 Max (32-64GB) or 2x Mac Mini M4 (16GB + 24GB Pro) to run a Chat Model + a Vision Model / Embedding Model (RAG for docs) for daily tasks. Right now it is a similar price tag. Would you recommend getting a huge machine over multiple small ones? How is your subjective experience in quality of answers with larger models / better Q?

ggerganov Nov 19, 2024
Maintainer Author

The M2 Max has 400GB/s memory bandwidth while the M4 Pro has 273GB/s, so ~1.5x faster for text generation. M2 Max with 38 GPU cores would be ~1.5x faster than M4 Pro with 20 GPU cores for prompt processing (see the table above). But with the Minis, you will have 2 of them which can work in parallel. I don't have experience with Vision/Embeddings and RAG so can't really recommend anything.

> How is your subjective experience in quality of answers with larger models / better Q?

Can't give you much info here as I am not a heavy chat user and don't have a good base for comparison. I'm mainly using the Qwen 2.5 Coder models and my primary usage is FIM completion. Regarding Q - I've settled on Q8 because prompt processing speed is much more important for FIM and since I have practically infinite VRAM there is no need to use lower quantizations as they don't improve the prompt processing speed. I would probably even use F16 for the extra PP speed, but there is an annoying bug in MacOS that is kind of a deal breaker at the moment (#10119).
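The ~1.5x estimates in this reply can be checked against the summary table. The sketch below compares the bandwidth ratio with the measured Q8_0 ratios for the M2 Max (38 GPU cores) and the M4 Pro (20 GPU cores). The dictionary layout and the `ratio` helper are hypothetical, while the numbers are copied from the table.

```python
# Values copied from the summary table (LLaMA 7B, Q8_0)
m2_max = {"bw": 400, "gpu_cores": 38, "pp": 677.91, "tg": 41.83}
m4_pro = {"bw": 273, "gpu_cores": 20, "pp": 449.62, "tg": 30.69}


def ratio(a, b, key):
    """How many times larger the value of `key` is for chip a than for chip b."""
    return a[key] / b[key]


print(f"bandwidth ratio:   {ratio(m2_max, m4_pro, 'bw'):.2f}x")
print(f"measured TG ratio: {ratio(m2_max, m4_pro, 'tg'):.2f}x")
print(f"measured PP ratio: {ratio(m2_max, m4_pro, 'pp'):.2f}x")
```

The measured text-generation ratio tracks the bandwidth ratio only approximately, so the bandwidth figure is best read as a first-order guide.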

bluemoehre Nov 19, 2024

Thanks for your feedback. Yeah, I saw that bandwidth difference, but right now I am on an M2 Air and text generation speed is already fine after the latest OS updates. The M4 Pro would make it >2x faster, so I would like to focus more on flexibility and the ability to run more than one model in parallel 24/7. For heavy loads I am still going to use the cloud; later on (after RTX 5000 series release next weeks) I am considering getting some little brick and attaching my current RTX 4070 Ti S to it. Unfortunately a Studio M2 (>128GB) is still too expensive.

gardner Nov 20, 2024

> after RTX 5000 series release next weeks

Can you tell me more about next week?

maciejjedrzejczyk Jan 26, 2025

Has anyone tried to perform this test on 14'' MBP M4 Pro Max?

liyimeng · 2024-12-04T17:58:16Z

liyimeng
Dec 4, 2024

Has anyone tried the M4 Pro with 64 GB? Is it possible to run a 70B model at a usable speed?

7 replies

gsgxnet Dec 8, 2024

I own a 128GB Macbook Pro Max. Today I tested it with Llama 3.3-70B Q6 for writing some python code.
Got: 4.36 tok/sec, 788 tokens, 2.96s to first token. Similar results achieved in several runs with other models. So that machine might fit your needs.

liyimeng Dec 10, 2024

Thanks! I just ordered a Mac mini M4 Pro, dedicated to running LLMs for personal use. 4-5 t/s would be OK.
@gsgxnet your MacBook Pro Max is an M4, right?

gsgxnet Dec 10, 2024

Not a Pro Max M4 but one year older, an M3.
What matters is memory bandwidth; see the original post by Georgi Gerganov above. From the numbers there I would assume you will get 60 to 70 percent of the speed of the M3 Max.

liyimeng Dec 12, 2024

Thanks! I'll check when I have it on hands.

liyimeng Jan 16, 2025

M4 Pro, 64 GB: I got 4.5 tokens/s with Llama 3.3 Q4. Happy with it!

cktang88 · 2024-12-05T03:26:09Z

cktang88
Dec 5, 2024

If i'm reading correctly, the m3 pro is slower than the m2 pro??

1 reply

atlas5301 Dec 5, 2024

It is. That's what 'ONLY APPLE CAN DO'.

Hanneseh · 2024-12-07T11:02:10Z

Hanneseh
Dec 7, 2024

M4 Pro, 8+4 CPU, 16 GPU, 24 GB Memory (MBP 14) ✅

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	381.14 ± 0.06
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	17.19 ± 0.04
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	367.13 ± 0.06
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	30.54 ± 0.01
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	364.06 ± 0.11
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	49.64 ± 0.01

build: 8e672ef (1550)

0 replies

eightpigs · 2024-12-17T16:18:34Z

eightpigs
Dec 17, 2024

M4 Max (Macbook Pro 14" 2024), 12+4 CPU, 40 GPU, 128 GB Memory

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	923.55 ± 0.12
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	31.61 ± 0.10
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	852.47 ± 48.37
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	53.06 ± 0.48
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	746.09 ± 29.30
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	82.52 ± 0.13

build: 8e672ef (1550)

4 replies

maciejjedrzejczyk Jan 26, 2025

@eightpigs can you please confirm that this is indeed a 14'' MBP that you used for testing? The spec you provided is only available for the 16'' MBP M4 Pro Max, which has a higher memory bandwidth than the 14'' MBP M4 Pro Max model (546 GB/s vs 410 GB/s).

eightpigs Jan 26, 2025

@maciejjedrzejczyk This is the result from my testing on a 14’’ MBP. The specs I provided are correct, and here are the details:

> system_profiler SPHardwareDataType SPDisplaysDataType 
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac16,6
      Chip: Apple M4 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      ...

Graphics/Displays:

    Apple M4 Max:

      Chipset Model: Apple M4 Max
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 40
      ...

The M4 Max memory bandwidth can go up to 546GB/s: https://support.apple.com/en-us/121553

maciejjedrzejczyk Jan 26, 2025

Thank you for the confirmation. My confusion arose because I used the Apple Store configurator for my country, which only showed a single available spec for the 14'' MBP M4 Pro Max, with a lower memory bandwidth (36GB RAM). I like the 14'' form factor much more than the 16'' version, and that was the only missing part in my research :) Just a follow-up question: would you consider the thermals (temperature while on the lap, fan noise, multitasking etc.) on this machine to be at an acceptable level during LLM inference?

eightpigs Jan 27, 2025

I also prefer the 14-inch MBP.

As for noise, I mostly run 7B or 14B models, so noise hasn’t been an issue for me. Here’s some data I tracked with my Apple Watch for reference:

DeepSeek-R1-Distill-Llama-70B-8bit: Fan noise is around 56dB.
DeepSeek-R1-Distill-Qwen-32B-MLX-8bit: Fan noise is around 48dB.
DeepSeek-R1-Distill-Qwen-14B-8bit: Fan is almost silent.

On multitasking, I haven’t run into any scenarios where I felt the machine was under pressure. Performance has been more than enough for me.

DeconstructingAmbiguity · 2025-01-11T19:00:39Z

DeconstructingAmbiguity
Jan 11, 2025

Which models can my M3 16GB MacBook Air support?

3 replies

gsgxnet Jan 12, 2025

Depends on how much RAM you want to dedicate to AI inferencing. I think if you tweak your MacOS you might have the option to use up to 12GB for the model. So a model with 20B parameters quantised down to 4bit might just work. If it is a plain M3 you have, inferencing speed might be too slow, so you probably would stay with a smaller model.

DeconstructingAmbiguity Jan 12, 2025

Thank you for the thoughtful response. I am keeping an eye out for any tests here that most closely resemble my system.

Crear12 Feb 23, 2025

You can try Deepseek-R1:14b Q4_K_M from ollama, it's only 9.0GB:
ollama run deepseek-r1:14b

gcr · 2025-01-17T11:43:07Z

gcr
Jan 17, 2025

why specifically is the M2 so cracked compared to the M3 and M4?

1 reply

byrongibson Jan 28, 2025

I think it's primarily due to memory bandwidth (first column). Where they have the same bandwidth, the results are close. But in cases where the M2 Max or Ultra has substantially higher bandwidth, it outperforms the equivalent M3 or M4.
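One way to see the bandwidth dependence is a simple ceiling estimate: if generating each token requires reading all weights from memory once, then tokens per second cannot exceed bandwidth divided by model size. The sketch below applies this assumption to the F16 results from the summary (12.55 GiB of weights). It is a back-of-the-envelope model, not a description of how llama.cpp schedules work, and `tg_ceiling` is a hypothetical name.

```python
GIB = 1024 ** 3


def tg_ceiling(bw_gb_s, model_gib):
    """Upper bound on tokens/s if every token must read all weights once from memory."""
    return bw_gb_s * 1e9 / (model_gib * GIB)


# (bandwidth in GB/s, measured F16 TG in t/s) copied from the summary table
chips = {
    "M2 Max (38)": (400, 24.65),
    "M3 Max (40)": (400, 25.09),
    "M3 Max (30)": (300, 19.54),
    "M4 Max (40)": (546, 31.64),
}
for name, (bw, measured) in chips.items():
    ceiling = tg_ceiling(bw, 12.55)
    print(f"{name}: ceiling {ceiling:.1f} t/s, measured {measured} t/s ({measured / ceiling:.0%})")
```

Under this model, chips with equal bandwidth share the same ceiling, which is consistent with the close text-generation results of the M2 Max and the 40-core M3 Max.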

kaush4l · 2025-02-19T04:00:16Z

kaush4l
Feb 19, 2025

M4 Max (Macbook Pro 16" 2024), 16 CPU, 40 GPU, 128 GB Memory

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	920.48 ± 3.25
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	31.56 ± 0.07
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	891.32 ± 0.95
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	53.75 ± 0.04
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	884.59 ± 0.78
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	82.36 ± 0.22

Used the command below
git checkout 8e672ef
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/Llama-2-7b-chat-f16.gguf \
  -m ./models/llama-2-7b-chat.Q8_0.gguf \
  -m ./models/llama-2-7b-chat.Q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
HEAD is now at 8e672ef stablelm : simplify + speedup generation (#4153)

0 replies

bluemoehre · 2025-03-05T17:53:06Z

bluemoehre
Mar 5, 2025

Okay Apple... no M4 Ultra in the Mac Studio, but a M4 Max or M3 Ultra - ... planning a new Mac Pro, huh?!
https://www.apple.com/shop/buy-mac/mac-studio

So let's add these guys to the table 😁

14 replies

gsgxnet Mar 6, 2025

You know, "UP TO 16,9" is marketing. Maybe Apple ran that speed comparison with a model that previously did not fit into the unified memory and now does. We have to tweak memory settings to make a larger-than-default part of the RAM available to the GPU, but never all of the RAM can be used as unified RAM. See #2182 (comment) and all messages above.

Or Apple might offer a much more performant MLX option with M3 Ultra. Who knows? Benchmarks will tell in a few days I assume.

bluemoehre Mar 7, 2025

Will this end up like NVIDIA's RTX series? Today some M4s (mostly those with maxed RAM) are no longer listed on the website for several countries, if you are able to load it at all. Apple seems to have had major backend issues for hours. Mmh.

fairydreaming Mar 11, 2025
Collaborator

Found some numbers: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/
Seems to be only tg, no pp.

Thireus Mar 12, 2025

Nobody seems to have posted any pp so far... and I wonder why.

Edit: https://www.reddit.com/r/LocalLLaMA/comments/1j9jfbt/comment/mhe1ku9/

netrunnereve Mar 12, 2025
Collaborator

> Nobody seems to have posted any pp so far... and I wonder why.

Considering the article's bias I'm not surprised. The 5090's going to destroy the Mac when the model fully fits in VRAM, so the author uses a 128k context and swapping on the 5090 (not even partial offloading) to make the Mac appear more effective. For prompt processing I think the 5090 might actually beat the Mac even with partial offloading. IMO he should have also done tests with smaller contexts to show the distinction between a model that fits in VRAM (5090 wins) and one that doesn't (Mac wins).

Our llama-bench should be the standard for testing llama.cpp but sadly a lot of people don't know about it.

kelvinyangis · 2025-03-14T02:05:31Z

kelvinyangis
Mar 14, 2025

M3 Ultra 20+8 CPU, 60 GPU, 256GB RAM ✅

./llama-bench -m models/Llama-2-7b-chat-f16.gguf -m models/llama-2-7b-chat.Q8_0.gguf -m models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	1121.80 ± 2.33
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	42.24 ± 0.05
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	1085.76 ± 0.90
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	63.55 ± 0.04
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	1073.09 ± 1.29
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	88.40 ± 0.44

build: 8e672ef (1550)

4 replies

arty-hlr Mar 14, 2025

So identical speeds to the M2 ultra it looks like, because same bandwidth... Not worth buying imo!
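The comparison can be checked against the summary table. A minimal sketch, using the 60-GPU-core M2 Ultra row from the table at the top and the M3 Ultra figures posted above; the helper name `relative_change` is illustrative only:

```python
# LLaMA 7B figures in t/s, copied from the summary table (both 60 GPU cores, 800 GB/s).
m2_ultra = {"F16 PP": 1128.59, "F16 TG": 39.86, "Q8_0 PP": 1003.16,
            "Q8_0 TG": 62.14, "Q4_0 PP": 1013.81, "Q4_0 TG": 88.64}
m3_ultra = {"F16 PP": 1121.80, "F16 TG": 42.24, "Q8_0 PP": 1085.76,
            "Q8_0 TG": 63.55, "Q4_0 PP": 1073.09, "Q4_0 TG": 88.40}


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Per-test change of the M3 Ultra over the M2 Ultra.
changes = {test: relative_change(m3_ultra[test], m2_ultra[test]) for test in m2_ultra}

for test, pct in changes.items():
    print(f"{test:8s} {pct:+.1f}%")
```

Every test lands within single-digit percentages of the M2 Ultra, which is in line with the "same bandwidth" reading for TG.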

marcingomulkiewicz Mar 14, 2025

Well, the entry-level model with 256GB of RAM costs exactly the same as the 192GB model cost previously, plus (if one's rich) there's a 512GB version, so even though speed seems similar, there's still an argument to be made in favour of those.

bluemoehre Mar 14, 2025

I've already seen several benchmarks that say the M3 only makes sense in terms of memory and not performance. It seems you get better value for money with an M4 Max.

Still I wonder what is the TG performance with a ~500GB model vs a ~50GB model on the same machine.

marcingomulkiewicz Mar 16, 2025

All else equal - probably 10x slower, as there are 10x as many weights to go through, no matter whether it's memory or compute bound. But it's not that simple: >600B DeepSeek/R1 are MoEs with iirc ~37B active parameters per token, so I'd expect them to run much (2x?) faster than 70B Llama.
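The reasoning above can be put into numbers with a rough bandwidth-bound ceiling: if every generated token has to stream the weights it uses once, TG cannot exceed bandwidth divided by bytes read per token. This is only a sketch under stated assumptions: the 800 GB/s nominal bandwidth comes from the summary table, the bytes-per-parameter figure is derived from the Q4_0 rows in this thread (3.56 GiB for 6.74 B params), and the ~37B of ~671B active parameters for the MoE case is the from-memory figure, not a measured one. The function name `tg_ceiling` is hypothetical.

```python
GIB = 2**30

# Assumption: Q4_0 density taken from the llama-bench rows (3.56 GiB / 6.74 B params).
BYTES_PER_PARAM = 3.56 * GIB / 6.74e9

# Assumption: nominal (marketing) bandwidth of the Ultra chips, in GB/s.
BANDWIDTH_GB_S = 800.0


def tg_ceiling(bandwidth_gb_s: float, params_read_billion: float) -> float:
    """Upper bound on tokens/s when each token reads `params_read_billion` params once."""
    bytes_per_token = params_read_billion * 1e9 * BYTES_PER_PARAM
    return bandwidth_gb_s * 1e9 / bytes_per_token


llama_7b = tg_ceiling(BANDWIDTH_GB_S, 7.0)      # dense: all weights read per token
llama_70b = tg_ceiling(BANDWIDTH_GB_S, 70.0)    # dense, 10x the weights
moe_dense_equiv = tg_ceiling(BANDWIDTH_GB_S, 671.0)  # if all ~671B were read per token
moe_active = tg_ceiling(BANDWIDTH_GB_S, 37.0)   # MoE: only ~37B active params read

print(f"7B vs 70B dense ratio: {llama_7b / llama_70b:.1f}x")
print(f"MoE (37B active) vs 70B dense ratio: {moe_active / llama_70b:.2f}x")
```

The ceiling ignores the KV cache, compute limits and routing overhead, so measured numbers will sit well below it; what the sketch is good for is the ratios, which only depend on how many bytes are read per token.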

ivanfioravanti · 2025-03-23T19:13:55Z

ivanfioravanti
Mar 23, 2025

M3 Ultra 24+8 CPU, 80 GPU, 512GB RAM ✅

./llama-bench -m ./models/llama-7b-v2/ggml-model-f16.gguf -m ./models/llama-7b-v2/ggml-model-q8_0.gguf -m ./models/llama-7b-v2/ggml-model-q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	1538.34 ± 2.14
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	39.78 ± 0.06
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	1487.51 ± 1.57
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	63.93 ± 0.24
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	1471.24 ± 1.05
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	92.14 ± 0.66

build: 8e672ef (1550)

0 replies

ilcommm · 2025-03-24T16:55:14Z

ilcommm
Mar 24, 2025

Can someone please test the base Mac Ultra M4 :)

5 replies

marcingomulkiewicz Mar 24, 2025

Doubtful. M4 Ultra does not exist, at least not yet.

ilcommm Mar 25, 2025

Of course, you're right. I obviously meant the base Mac Studio with M4 Max.

ilcommm Mar 25, 2025

I’m just choosing between a Mac Studio M2 Max with 32GB for 1,888 and a Mac Studio M4 Max with 36GB for 2,705 (in our shops). Trying to figure out if the performance boost is worth the extra cost.

mladencucakSYN Mar 25, 2025

which shops would that be? :D

ilcommm Mar 25, 2025

sad shops in Russia))). Those are USD prices.

shimza · 2025-04-03T00:05:36Z

shimza
Apr 3, 2025

Cost Per Token April 2025 ~ Bang for Buck

Seems like the M4 Mac Mini is cheapest instant win for now, with an M1 Max Studio coming in close second.

Product	Amazon Link	Price	Tokens/sec (t/s)	Token Cost
M1 MacBook Air	https://amzn.to/42gTl9Z	584	14.19	$ 41.16
M1 MacBook Pro	https://amzn.to/426VINP	777	14.15	$ 54.91
M1 Pro MacBook Pro 14"	https://amzn.to/3FST7hK	823	35.52	$ 23.17
M1 Pro MacBook Pro 16"	https://amzn.to/4leWSOC	901	36.41	$ 24.75
M1 Max MacBook Pro 14"	https://amzn.to/3XImWaV	1299	54.61	$ 23.79
M1 Max MacBook Pro 16"	https://amzn.to/3XImWaV	1551	61.19	$ 25.35
M1 Max Mac Studio	https://amzn.to/41UMFiJ	1385	61.19	$ 22.63
M1 Ultra Mac Studio	https://amzn.to/429gScF	1980	74.93	$ 26.42
M2 MacBook Air	https://amzn.to/3YcCIe4	749	21.7	$ 34.52
M2 MacBook Pro	https://amzn.to/3XInrBP	835	21.91	$ 38.11
M2 Pro MacBook Pro 14"	https://amzn.to/4cq9Y7O	1180	37.87	$ 31.16
M2 Pro MacBook Pro 16"	https://amzn.to/4i0pIzt	1502	38.86	$ 38.65
M2 Max MacBook Pro 14"	https://amzn.to/4lf04K0	1885	60.99	$ 30.91
M2 Max MacBook Pro 16"	https://amzn.to/4hVTm97	2014	65.95	$ 30.54
M2 Max Mac Studio	https://amzn.to/4hWJ7Bm	1799	60.99	$ 29.50
M2 Ultra Mac Studio	https://amzn.to/4jiIVgN	3889	88.64	$ 43.87
M2 Ultra Mac Studio	https://amzn.to/4jiIVgN	3889	94.27	$ 41.25
M3 Pro MacBook Pro 14	https://amzn.to/4jfqqda	1286	30.74	$ 41.83
M3 Pro MacBook Pro 16	https://amzn.to/4llV2M3	1976	30.74	$ 64.28
M3 Max MacBook Pro	https://amzn.to/3R3jWlD	2959	56.58	$ 52.30
M3 Ultra Mac Studio	https://www.cornellstore.com/Mac-Studio-M3-Ultra	3599	88.4	$ 40.71
M4 Mac Mini	https://amzn.to/43Eb1Pa	549	24.11	$ 22.77
M4 MacBook Air	https://amzn.to/4cl0ISi	949	24.11	$ 39.36
M4 Pro MacBook Pro 14"	https://amzn.to/3G2TVjW	1786	49.64	$ 35.98
M4 Pro MacBook Pro 16"	https://amzn.to/4hYXogP	1880	50.74	$ 37.05
M4 Max MacBook Pro	https://amzn.to/43xoCYr	2849	83.06	$ 34.30
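For anyone who wants to redo this with current prices: the "Token Cost" column is the price divided by the Q4_0 TG speed from the summary table, i.e. dollars per token-per-second. A minimal sketch with a few rows copied from the table above; the helper name `usd_per_tps` is made up for illustration:

```python
# (product, price in USD, Q4_0 TG in t/s) copied from the table above.
rows = [
    ("M1 Max Mac Studio", 1385, 61.19),
    ("M4 Mac Mini", 549, 24.11),
    ('M1 Pro MacBook Pro 14"', 823, 35.52),
    ("M3 Pro MacBook Pro 16", 1976, 30.74),
]


def usd_per_tps(price_usd: float, tokens_per_s: float) -> float:
    """Dollars paid per token/s of generation speed (lower is better)."""
    return price_usd / tokens_per_s


# Sort from best to worst value.
ranked = sorted(rows, key=lambda r: usd_per_tps(r[1], r[2]))

for product, price, tps in ranked:
    print(f"{product:24s} ${usd_per_tps(price, tps):6.2f} per t/s")
```

By this metric the M1 Max Mac Studio and the M4 Mac Mini are within about 1% of each other, so small price changes flip the order.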

2 replies

arty-hlr Apr 3, 2025

Is this really the place to post a bunch of sponsored links??
You are not computing the "token cost", which doesn't make sense, but the token speed cost.

shimza Apr 4, 2025

Point taken, but I've looked at this chart for months and really wanted to know which Mac to buy, wanted to optimize my spend for obvious reasons.

Yeh yeh nah - it is the Token Cost ... it's self-explanatory given the speed reference in the prior column.

lukewp · 2025-04-03T15:11:13Z

lukewp
Apr 3, 2025

... cross-posted to the Vulkan thread:

Mac Pro 2013 🗑️ 12-core Xeon E5-2697 v2, Dual FirePro D700, 64 GB RAM, MacOS Monterey

Note: I've updated this post -- I realized when I posted the first time I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and checked that the models were working correctly before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s rates across the board.

Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model	size	params	backend	threads	test	t/s
llama 7B Q8_0	6.67 GiB	6.74 B	Vulkan,BLAS	12	pp512	68.55 ± 0.25
llama 7B Q8_0	6.67 GiB	6.74 B	Vulkan,BLAS	12	tg128	11.05 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	12	pp512	68.86 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	12	tg128	16.73 ± 0.05

build: d3bd719 (5092)

The FP16 model was throwing garbage, so I did not include it here -- it will require some unique flags to run correctly. Additionally, here are the 8- and 4-bit llama 2 7B runs on the CPU alone (using the -ngl 0 flag):

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null

model	size	params	backend	threads	test	t/s
llama 7B Q8_0	6.67 GiB	6.74 B	Vulkan,BLAS	12	pp512	25.87 ± 0.56
llama 7B Q8_0	6.67 GiB	6.74 B	Vulkan,BLAS	12	tg128	6.85 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	12	pp512	26.17 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	12	tg128	10.85 ± 0.01

build: d3bd719 (5092)

(proof-of-life images of the GPU test and the CPU test were attached here)

0 replies

mirh · 2025-04-12T02:52:32Z

mirh
Apr 12, 2025

Just saying.. Shouldn't the OP be edited with the bandwidth numbers actually achieved, rather than the BS figures Apple gave to the press?
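One rough way to get at an "achieved" figure from the numbers already in this thread: multiply the model size by the TG rate. This is a sketch that assumes each generated token streams the full weights exactly once and ignores KV cache traffic and compute stalls, so it is a lower bound on what the memory system delivers, not a proper bandwidth measurement. The inputs are the M3 Ultra (80 GPU cores) rows posted above; the function name is illustrative.

```python
GIB = 2**30  # llama-bench reports model size in GiB

# (model size in GiB, tg128 in t/s) from the M3 Ultra 80-GPU-core run above.
runs = {
    "F16": (12.55, 39.78),
    "Q8_0": (6.67, 63.93),
    "Q4_0": (3.56, 92.14),
}

NOMINAL_GB_S = 800.0  # BW column of the summary table


def effective_bandwidth_gb_s(size_gib: float, tokens_per_s: float) -> float:
    """Weights streamed per second in GB/s, assuming one full pass per token."""
    return size_gib * GIB * tokens_per_s / 1e9


effective = {name: effective_bandwidth_gb_s(*run) for name, run in runs.items()}

for name, bw in effective.items():
    print(f"{name:5s} {bw:6.1f} GB/s ({bw / NOMINAL_GB_S:.0%} of nominal)")
```

Under these assumptions all three quantizations come out below the nominal 800 GB/s, with F16 closest to it and Q4_0 furthest away.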

0 replies


Replies: 71 comments · 138 replies
