Performance of llama.cpp on NVIDIA DGX Spark #16578

ggerganov · 2025-10-14T14:28:54Z

ggerganov
Oct 14, 2025
Maintainer

Overview

This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.

Benchmarks include:

Prefill (pp) and generation (tg) at various context depths (d)
Batch sizes of 1, 2, 4, 8, 16, 32 typical for local environments

Models:

gpt-oss-20b
gpt-oss-120b
Qwen3 Coder 30B A3B
Qwen2.5 Coder 7B
Gemma 3 4B QAT
GLM 4.5 Air

Feel free to request additional benchmarks for models and use cases.

Benchmarks

Using the following commands:

# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32

History

2025 Oct 14 (b6761) 7ea15bb Initial numbers
2025 Oct 15 (b6767) 5acd455 Improved decode via CUDA: Changing the CUDA scheduling strategy to spin #16585

gpt-oss-20b

Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

llama-bench

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model	size	params	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048	3610.56 ± 15.16
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32	79.74 ± 0.43
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d4096	3361.11 ± 12.95
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d4096	74.63 ± 0.15
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d8192	3147.73 ± 15.77
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d8192	69.49 ± 1.12
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d16384	2685.54 ± 5.76
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d16384	64.02 ± 0.72
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d32768	2055.34 ± 20.43
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d32768	55.96 ± 0.07

build: 5acd455 (6767)

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	1.124	3644.93	0.431	74.24	1.555	2654.99
4096	32	2	8256	2.252	3637.13	0.766	83.59	3.018	2735.57
4096	32	4	16512	4.485	3652.80	0.960	133.37	5.445	3032.46
4096	32	8	33024	8.958	3657.89	1.228	208.45	10.186	3242.01
4096	32	16	66048	17.883	3664.62	1.695	302.07	19.578	3373.52
4096	32	32	132096	35.738	3667.60	2.403	426.22	38.140	3463.42
8192	32	1	8224	2.285	3584.99	0.458	69.81	2.743	2997.64
8192	32	2	16448	4.547	3603.29	0.797	80.29	5.344	3077.82
8192	32	4	32896	9.117	3594.24	1.004	127.47	10.121	3250.29
8192	32	8	65792	18.248	3591.40	1.356	188.77	19.604	3356.01
8192	32	16	131584	36.389	3601.99	1.951	262.37	38.340	3432.02
8192	32	32	263168	72.880	3596.94	2.937	348.69	75.816	3471.12

gpt-oss-120b

Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

llama-bench

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model	size	params	test	t/s
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048	1689.47 ± 107.67
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32	52.87 ± 1.70
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d4096	1733.41 ± 5.19
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d4096	51.02 ± 0.65
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d8192	1705.93 ± 7.89
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d8192	48.46 ± 0.53
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d16384	1514.78 ± 5.66
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d16384	44.78 ± 0.07
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d32768	1221.23 ± 7.85
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d32768	38.76 ± 0.06

build: 5acd455 (6767)

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	2.345	1746.43	0.751	42.59	3.097	1333.05
4096	32	2	8256	4.442	1844.38	1.252	51.13	5.693	1450.15
4096	32	4	16512	8.816	1858.46	1.583	80.86	10.399	1587.86
4096	32	8	33024	17.570	1865.00	2.124	120.55	19.694	1676.88
4096	32	16	66048	35.112	1866.50	3.083	166.05	38.195	1729.22
4096	32	32	132096	70.158	1868.23	4.581	223.53	74.739	1767.42
8192	32	1	8224	4.462	1835.87	0.773	41.42	5.235	1571.02
8192	32	2	16448	8.960	1828.57	1.304	49.10	10.264	1602.57
8192	32	4	32896	17.800	1840.87	1.669	76.70	19.469	1689.65
8192	32	8	65792	35.706	1835.41	2.339	109.44	38.046	1729.29
8192	32	16	131584	71.322	1837.75	3.507	145.98	74.829	1758.46
8192	32	32	263168	142.658	1837.57	5.593	183.08	148.251	1775.15

Qwen3 Coder 30B A3B

Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

llama-bench

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model	size	params	test	t/s
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048	2933.39 ± 9.43
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32	59.95 ± 0.26
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d4096	2537.98 ± 7.17
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d4096	52.70 ± 0.75
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d8192	2246.86 ± 6.45
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d8192	44.48 ± 0.34
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d16384	1772.41 ± 10.58
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d16384	37.10 ± 0.05
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d32768	1252.10 ± 2.16
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d32768	27.82 ± 0.01

build: 5acd455 (6767)

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	1.446	2831.93	0.609	52.58	2.055	2008.85
4096	32	2	8256	2.890	2834.16	1.117	57.27	4.008	2059.94
4096	32	4	16512	5.778	2835.64	1.546	82.78	7.324	2254.48
4096	32	8	33024	11.505	2848.21	2.195	116.63	13.700	2410.57
4096	32	16	66048	23.016	2847.43	3.218	159.11	26.234	2517.67
4096	32	32	132096	46.022	2848.01	4.926	207.86	50.949	2592.73
8192	32	1	8224	3.075	2664.32	0.724	44.18	3.799	2164.77
8192	32	2	16448	6.114	2679.65	1.267	50.50	7.382	2228.27
8192	32	4	32896	12.251	2674.64	1.807	70.85	14.058	2340.03
8192	32	8	65792	24.435	2682.01	2.729	93.82	27.164	2422.02
8192	32	16	131584	48.952	2677.56	4.322	118.47	53.274	2469.97
8192	32	32	263168	97.879	2678.26	7.057	145.10	104.936	2507.89

Qwen2.5 Coder

Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

llama-bench

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model	size	params	test	t/s
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048	2267.08 ± 6.38
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32	29.40 ± 0.02
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d4096	2094.87 ± 11.61
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d4096	28.31 ± 0.10
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d8192	1906.26 ± 4.45
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d8192	27.53 ± 0.04
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d16384	1634.82 ± 6.67
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d16384	26.03 ± 0.03
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d32768	1302.32 ± 4.58
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d32768	22.08 ± 0.03

build: 5acd455 (6767)

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	1.818	2252.90	1.131	28.30	2.949	1399.87
4096	32	2	8256	3.629	2257.46	1.273	50.29	4.901	1684.40
4096	32	4	16512	7.267	2254.46	1.393	91.86	8.661	1906.54
4096	32	8	33024	14.516	2257.44	1.598	160.22	16.113	2049.48
4096	32	16	66048	29.025	2257.90	2.092	244.69	31.118	2122.53
4096	32	32	132096	58.059	2257.55	2.764	370.44	60.824	2171.79
8192	32	1	8224	3.748	2185.91	1.171	27.33	4.918	1672.09
8192	32	2	16448	7.502	2183.95	1.354	47.28	8.856	1857.35
8192	32	4	32896	15.018	2181.92	1.556	82.27	16.574	1984.82
8192	32	8	65792	30.024	2182.77	1.908	134.16	31.933	2060.34
8192	32	16	131584	60.044	2182.93	2.673	191.55	62.717	2098.06
8192	32	32	263168	120.112	2182.49	3.903	262.39	124.015	2122.07

Gemma 3 4B QAT

Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

llama-bench

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model	size	params	test	t/s
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048	5694.21 ± 13.18
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32	79.83 ± 0.18
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d4096	5228.77 ± 20.56
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d4096	67.49 ± 1.17
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d8192	4882.66 ± 37.61
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d8192	66.87 ± 0.80
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d16384	4491.42 ± 44.60
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d16384	63.36 ± 0.66
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d32768	3840.09 ± 14.52
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d32768	57.67 ± 0.09

build: 5acd455 (6767)

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	0.704	5819.48	0.467	68.59	1.170	3527.14
4096	32	2	8256	1.403	5837.53	0.585	109.49	1.988	4153.21
4096	32	4	16512	2.790	5871.91	0.675	189.62	3.465	4764.98
4096	32	8	33024	5.554	5899.92	0.907	282.33	6.461	5111.51
4096	32	16	66048	11.106	5900.73	1.349	379.54	12.455	5302.76
4096	32	32	132096	22.194	5905.76	2.113	484.68	24.307	5434.56
8192	32	1	8224	1.425	5750.56	0.477	67.12	1.901	4325.44
8192	32	2	16448	2.818	5814.33	0.653	98.07	3.470	4739.43
8192	32	4	32896	5.628	5822.18	0.809	158.21	6.437	5110.32
8192	32	8	65792	11.256	5822.26	1.183	216.33	12.439	5288.96
8192	32	16	131584	22.470	5833.08	1.913	267.69	24.383	5396.51
8192	32	32	263168	44.856	5844.16	3.251	314.98	48.107	5470.50

GLM 4.5 Air

Model: https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main

llama-bench

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model	size	params	test	t/s
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048	841.44 ± 12.67
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32	22.59 ± 0.11
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d4096	749.08 ± 2.10
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d4096	20.10 ± 0.01
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d8192	680.95 ± 1.38
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d8192	18.78 ± 0.07
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d16384	565.44 ± 1.47
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d16384	16.47 ± 0.01
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d32768	418.84 ± 0.53
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d32768	13.19 ± 0.01

build: 5acd455 (6767)

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	5.554	737.54	1.660	19.28	7.214	572.25
4096	32	2	8256	9.781	837.53	2.696	23.74	12.477	661.68
4096	32	4	16512	19.548	838.12	3.812	33.58	23.361	706.83
4096	32	8	33024	38.980	840.64	6.407	39.95	45.387	727.61
4096	32	16	66048	77.938	840.88	12.197	41.98	90.134	732.77
8192	32	1	8224	10.343	792.02	1.803	17.75	12.146	677.11
8192	32	2	16448	20.678	792.34	3.161	20.25	23.839	689.96
8192	32	4	32896	41.320	793.04	4.496	28.47	45.816	718.01
8192	32	8	65792	82.728	792.19	8.426	30.38	91.153	721.77

More info

Saren-Arterius · 2025-10-14T16:55:30Z

Saren-Arterius
Oct 14, 2025

Thanks for the benchmark! I would like to request additional benchmark for a very popular model GLM-4.5-Air-FP8:
https://huggingface.co/zai-org/GLM-4.5-Air-FP8

and quants for it:

Q4_K_M
Q6_K
Q8 (if possible)
https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main

1 reply

Saren-Arterius Oct 15, 2025

Saw the benchmark results. Thank you so much for the work! Appreciate very much.

SinaYa · 2025-10-14T20:27:27Z

SinaYa
Oct 14, 2025

Hi. It would be great to see a Qwen Next 80B benchmark for these two models:

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Official quants)

Thanks.

2 replies

sorasoras Oct 14, 2025

Not support yet with open pr currently

icsy7867 Oct 14, 2025

Hi. It would be great to see a Qwen Next 80B benchmark for these two models:

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Official quants)

Thanks.

Yeah I really want to see the performance of a specific model comparing full 16 bit precision, Q8, Q4, FP4 and FP8.

None the less, thank you for the wonderful data!

mfarme · 2025-10-15T00:46:52Z

mfarme
Oct 15, 2025

Getting similar performance with my Farmework Desktop. Thanks for helping my FOMO.

11 replies

woachk Oct 15, 2025

Same chip so should be the same perf. Asus says shipping date of November 3 here.

LucidityCrash Oct 15, 2025

Someone please help explain this to me? I am not trying to bash on this machine, I am just trying to understand the justification for paying almost twice as much for the same performance with similar specs.

I'm sure the connectx-7 200GB networking has something to do with the pricing difference :)

cocoderss Oct 15, 2025

btw it's most likely a much better choice if you want GB10 to buy the Asus GB10 systems for $1k less (at least that's what I did) - DGX Spark is more expensive but it's not the only choice

Interesting, the Asus GB10 seems to run with 240W power adapter, much higher than the DGX Spark. I wonder if you will get more performance given the higher power intake.

icsy7867 Oct 15, 2025

btw it's most likely a much better choice if you want GB10 to buy the Asus GB10 systems for $1k less (at least that's what I did) - DGX Spark is more expensive but it's not the only choice

Interesting, the Asus GB10 seems to run with 240W power adapter, much higher than the DGX Spark. I wonder if you will get more performance given the higher power intake.

I havent seen the specs. But its possible ASUS just used a power adapter with a high enough rating for the device? For example, I can plug a 90watt compatible power adapter into my 45watt laptop. It will pull what it needs to.

geerlingguy Oct 16, 2025

@bartlettroscoe i benched gpt-oss 120b on Framework Desktop a couple months ago: geerlingguy/ai-benchmarks#21 (comment)

netrunnereve · 2025-10-15T03:39:48Z

netrunnereve
Oct 15, 2025
Collaborator

Can you run the classic llama 2 7B Q4_0 so it can be compared on the chart?

0 replies

atsyplikhin · 2025-10-15T05:21:38Z

atsyplikhin
Oct 15, 2025

Super interesting, thanks for sharing, Georgi!

llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

Could you please help me understand: Does "-d" mean KV cache length before the "-p" prefill happens? What does "-ub" define, eg batch size?

1 reply

ggerganov Oct 15, 2025
Maintainer Author

Does "-d" mean KV cache length before the "-p" prefill happens?

Yes.

What does "-ub" define, eg batch size?

Yes.

beebopkim · 2025-10-15T05:32:14Z

beebopkim
Oct 15, 2025

Could you add llama2-7b result to #15013?

0 replies

Ramalama2 · 2025-10-15T07:46:39Z

Ramalama2
Oct 15, 2025

Awesome, thank you!
So for gpt-oss-120B around 35 tokens/s on dgx spark.
On vllm im getting with 131k context and at almost any length around 180 tokens/s on a 300W RTX6000 96gb Max-Q edition.

So whats the sense of a dgx spark? I mean sure it has 128gb memory, but i can offload bigger models between 96gb vram and the rest to normal Ram (CPU)...
So in the end i can run even bigger models and even faster as the dgx could.

Its too expensive for what it offers. If the DGX Spark would be around 2k, like the Ryzen Max 395+ Mini-PC's it would be fine and okay.
But for 4k usd/eur its absolutely senseless...

PS: And a Mac Mini/Studio is a much better option at 4k usd/eur, compared to a DGX Sparc.

7 replies

icsy7867 Oct 15, 2025

Ryzen AI Max 395+ Cannot do FP8 or FP4. The DGX Spark will have a HUGE speedup when using FP8 (Theoretically 2x for FP8 and 4x for FP4), which is what a lot of benchmarking posts/videos arent addressing.

I also do not consider the RTX 6000 Pro (Blackwell) a fair comparison. That's probbably an $8,000+ GPU, plus the additional cost of the system ($1000 if being fairly modest...). Someone might need that much VRAM to finetune or run/test a model, but not have $9,000 to spend.

Dont get me wrong it is a niche. Even a Mac Mini M4 Max has 546 GB/s memory, versus the 273 GB/s on the DGX spark, but the FP8 and FP4 capabilities should also be factored in here and have significant value. This is why I would love to see the benchmarks for running the Q4/Q8 vs FP4FP8

cocoderss Oct 15, 2025

Ryzen AI Max 395+ Cannot do FP8 or FP4. The DGX Spark will have a HUGE speedup when using FP8 (Theoretically 2x for FP8 and 4x for FP4), which is what a lot of benchmarking posts/videos arent addressing.

The benchmarks clearly show that the gpt-oss-120B at MXFP4 is comparable between the Ryzen AI Max 395+ and DGX Spark. That's FP4 model, unless I am missing something. Also Q4 and Q8 (GUUF) are INT and not FP, and thus would not benefit from the sparse "FP" speed up claimed by Nvidia..

Ramalama2 Oct 16, 2025

Guys, please dont take fp4 or fp8 as a win.

Let me explain:
I do compare embedding models in different quantisations (for my project @work).

Comparing embedding Models is actually great, because you can simply query the resulting vector database and see the quantisation impacts.

From my tests, no matter which Model, be it Qwen3-Embedding or BGE-M3 or anything else, the impact of Quantisation is Huge!

FP32 is Amazing
BF16 is still Amazing
int8/Q8 = you see already a degradation because the results start to differ, but only 5-10% of the results are different.
Q4 = 50% of the results are different, almost unusable Model

So you Guys want to tell me that FP4 is a win?
In my Opinion FP8 is fine and usable, but FP4 will be unusable crap.
No Matter what the Marketing says, 1% quality loss is a huge lie!!!

I didnt tested fp4 tho, not even fp8, so i cant say for sure.
But from my experience with all other quantisations fp4 should be crap.

Cheers!

icsy7867 Oct 16, 2025

It depends on the model. In many cases, in my experience FP4 does a fantastic job. Also NVFP4 has the potential to be amazing.

So is it situational? Sure, it can be. But I don't think it's something that can be ignored.

Also, FP8 is also great, I have found little reason to not use it.

lhl Oct 18, 2025

I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants on my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than IQ3_M.

@icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below, but there is zero throughput benefit from NVFP4. Maybe related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build so I just used the latest docker for my tests (tensorrt-llm/release:1.2.0rc0) but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want to check any configuration/variation for themselves.

#16578 (reply in thread)

cocoderss · 2025-10-15T13:58:22Z

cocoderss
Oct 15, 2025

@ggerganov Are there llama.cpp benchmarks for the AGX Thor? It seems it's similar offering but Nvidia markets it as twice as fast.

There are no official detailed spec sheet for the DGX Spark to make a comparison to the Thor (2560 cuda cores and 92 tensor cores), but Nvidia claims 2PLOPS (sparse FP4) for the Thor and 1PFLOPS (sparse FP4) for the Spark.
I guess this might only affect batching, but it would be interesting to know given that Thor is cheaper than Spark.

5 replies

ggerganov Oct 15, 2025
Maintainer Author

I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

woachk Oct 15, 2025

Quick tldr:

Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory. And no raytracing cores. While Spark is sm_121 with the full consumer Blackwell feature set.

Thor and Spark have relatively similar memory bandwidth. The Thor CPU is much slower.

Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.

Thor has 4 cursed Synopsys 25GbE NICs (set to 10GbE by default, see https://docs.nvidia.com/jetson/archives/r38.2/DeveloperGuide/SD/Kernel/Enable25GbEthernetOnQSFP.html as it doesn't have auto-negociation of the link rate) exposed via a QSFP connector providing 4x25GbE while Spark systems have regular ConnectX-7.

Thor uses a downstream L4T stack instead of regular NVIDIA drivers unlike Spark. But at least the CUDA SDK is the same unlike prior Tegras. Oh and you get less other IO too.

Side note: might be better to also consider GB10 systems from OEMs. Those are available for cheaper than AGX Thor devkits too.

cocoderss Oct 15, 2025

I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

I don't have one unfortunately, hoping whoever does will run those benchmarks.

Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.

This is a very weird and interesting tradeoff.

yf225 Oct 15, 2025

Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory

@woachk does "tensor memory" here refer to TMEM?

woachk Oct 16, 2025

Yes.

eous · 2025-10-15T18:45:22Z

eous
Oct 15, 2025

For those curious about Thor performance
(All models are the same as linked in the original benchmark with the same command)
llama.cpp git commit: f9fb33f
Jetpack 7.0 [L4T 38.2.2]
Docker container: nvcr.io/nvidia/pytorch:25.09-py3
MAXN and jetson_clocks enabled

gpt-oss-20b-gguf

# ./bin/llama-bench -m /workspace/models/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |          pp2048 |       2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |            tg32 |         60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d4096 |       1862.13 ± 4.80 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d4096 |         55.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d8192 |       1740.90 ± 3.24 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d8192 |         53.58 ± 0.18 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d16384 |       1446.75 ± 3.01 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d16384 |         52.49 ± 1.94 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d32768 |       1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d32768 |         48.33 ± 0.04 |

build: f9fb33f2 (6771)

Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

# ./bin/llama-bench -m /workspace/models/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |          pp2048 |       1654.25 ± 1.80 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |            tg32 |         44.26 ± 0.11 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d4096 |       1410.87 ± 2.22 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d4096 |         39.46 ± 0.04 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d8192 |       1228.69 ± 1.78 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d8192 |         36.88 ± 0.13 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d16384 |        985.39 ± 7.04 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d16384 |         33.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d32768 |        686.45 ± 0.93 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d32768 |         26.92 ± 0.05 |

build: f9fb33f2 (6771)

gpt-oss-120b

# ./bin/llama-bench -m /workspace/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp2048 |        967.20 ± 6.04 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         42.00 ± 0.09 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d4096 |        932.85 ± 2.33 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d4096 |         38.81 ± 0.04 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d8192 |        892.28 ± 2.88 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d8192 |         39.22 ± 1.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d16384 |        827.57 ± 1.28 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d16384 |         37.77 ± 0.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d32768 |        677.70 ± 1.06 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d32768 |         34.02 ± 0.02 |

build: f9fb33f2 (6771)

9 replies

woachk Oct 16, 2025

That commit only applies the change to if (prop.major == 12 && prop.minor == 1) {, wonder if also adding it to 11.0 changes things

eous Oct 16, 2025

I did a quick one off build where I removed the conditional around the scheduling block to force spin and I do see a consistent improvement. Just looking at power draw there is a probably at least another 10-20% performance untapped on thor beyond moving it to the spin scheduler. Currently looks like we are mostly cpu bound.

Llama-bench Test Results (Qwen3moe 30B)

schedule Default Spin Improvement (%)
test
pp2048 1654.25 1700.05 2.77
pp2048 @ d16384 985.39 992.37 0.71
pp2048 @ d32768 686.45 687.30 0.12
pp2048 @ d4096 1410.87 1446.22 2.51
pp2048 @ d8192 1228.69 1257.35 2.33
tg32 44.26 45.67 3.19
tg32 @ d16384 33.55 33.62 0.21
tg32 @ d32768 26.92 27.05 0.48
tg32 @ d4096 39.46 40.64 2.99
tg32 @ d8192 36.88 38.09 3.28

Average improvement: 1.86%
Best improvement: 3.28% (tg32 @ d8192)
Worst improvement: 0.12% (pp2048 @ d32768)

Llama-batched-bench Test Results

PP=4096:
Average throughput improvement: 2.03%
Best batch size improvement: B2 (4.48%)
Worst batch size improvement: B16 (0.06%)

PP=8192:
Average throughput improvement: 0.05%
Best batch size improvement: B32 (0.07%)
Worst batch size improvement: B16 (0.03%)

Spin schedule
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
Test: llama-bench
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |          pp2048 |       1700.05 ± 2.02 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |            tg32 |         45.67 ± 0.11 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d4096 |       1446.22 ± 3.54 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d4096 |         40.64 ± 0.05 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |  pp2048 @ d8192 |       1257.35 ± 0.75 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    tg32 @ d8192 |         38.09 ± 0.09 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d16384 |        992.37 ± 1.89 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d16384 |         33.62 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 | pp2048 @ d32768 |        687.30 ± 0.48 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |   tg32 @ d32768 |         27.05 ± 0.03 |
Test: llama-batched-bench
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  4096 |     32 |    1 |   4128 |    2.537 |  1614.38 |    0.789 |    40.54 |    3.327 |  1240.92 |
|  4096 |     32 |    2 |   8256 |    4.949 |  1655.30 |    1.301 |    49.18 |    6.250 |  1320.87 |
|  4096 |     32 |    4 |  16512 |    9.887 |  1657.09 |    1.663 |    76.98 |   11.550 |  1429.62 |
|  4096 |     32 |    8 |  33024 |   19.739 |  1660.11 |    2.289 |   111.86 |   22.027 |  1499.25 |
|  4096 |     32 |   16 |  66048 |   39.464 |  1660.65 |    3.279 |   156.14 |   42.743 |  1545.23 |
|  4096 |     32 |   32 | 132096 |   78.936 |  1660.49 |    5.033 |   203.46 |   83.968 |  1573.16 |
|  8192 |     32 |    1 |   8224 |    5.314 |  1541.47 |    0.839 |    38.14 |    6.153 |  1336.50 |
|  8192 |     32 |    2 |  16448 |   10.614 |  1543.68 |    1.396 |    45.86 |   12.009 |  1369.61 |
|  8192 |     32 |    4 |  32896 |   21.220 |  1544.24 |    1.888 |    67.79 |   23.108 |  1423.59 |
|  8192 |     32 |    8 |  65792 |   42.394 |  1545.87 |    2.792 |    91.68 |   45.187 |  1456.01 |
|  8192 |     32 |   16 | 131584 |   84.800 |  1545.66 |    4.206 |   121.73 |   89.006 |  1478.37 |
|  8192 |     32 |   32 | 263168 |  169.577 |  1545.87 |    6.867 |   149.11 |  176.444 |  1491.51 |

woachk Oct 16, 2025

For prompt processing there's a lot more on the table but that means switching to tcgen05 MMA instructions. (Which is a separate instruction set than the regular tensor core one)

And there's also the matter of using lower precision MMAs in general

aazzolini Oct 17, 2025

I believe that Thor doesn't support tcgen05 because it doesn't have tensor-memory

woachk Oct 17, 2025

Thor does have tensor memory - it uses the data centre tensor cores (it's sm_110[a]), Spark does not.

See https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma

qdrddr · 2025-10-16T21:42:49Z

qdrddr
Oct 16, 2025

Would love to see accuracy of the same models on main banchmarks running in DGX as they will vary on different HW & FW in addition to the speed.

As its clearly sing here https://artificialanalysis.ai/models/gpt-oss-120b/providers

0 replies

mightcouldb1 · 2025-10-17T15:00:50Z

mightcouldb1
Oct 17, 2025

Please bench the full Qwen3 coder model

2 replies

ggerganov Oct 17, 2025
Maintainer Author

There isn't any measurable benefits in terms of quality compared to Q8_0, so don't think there is any point in benching that as it is most likely going to perform worse in terms of speed.

mightcouldb1 Oct 17, 2025

I am just impressed that it might run at all. It's there any bench on fine-tuning?

qdrddr · 2025-10-17T16:46:30Z

qdrddr
Oct 17, 2025

Would love to see this this cluster setup in the comparison table too
EXO Lab cluster with 2xDGX + MacStudio
https://blog.exolabs.net/nvidia-dgx-spark/

1 reply

ggerganov Oct 17, 2025
Maintainer Author

AFAICT this is vaporware.

aazzolini · 2025-10-17T18:04:23Z

aazzolini
Oct 17, 2025

On the subject of Spark and Thor, I have been looking for alternatives to TensorRT for python-free and community driven inference engine. I'm looking to leverage nvfp4 tensor cores , and wonder if there's any project or folks working to support those in llama.cpp?

6 replies

woachk Oct 17, 2025

The whole Blackwell product range, from the RTX 5050 onwards to the B200/300 through iGPUs

woachk Oct 17, 2025

That said: NVIDIA/TransformerEngine#2255

lhl Oct 17, 2025

Just as an FYI, I don't have a Spark but I tested NVP4 on an RTX PRO 6000 (Llama 3.1 8B Instruct). NVP4 w/ TensorRT does not perform better than llama.cpp at bs=1, and at higher concurency, doesn't take a lead until c=32.

I didn't test quality loss, but from a pure throughput perspective, I don't think the current NVFP4 implementation is particularly good. Certainly not worth all the custom quanting and other hassles...

Config	Req/s	Prefill Tok/s	Decode Tok/s	Total Tok/s	Max Out Tok/s	TTFT mean	TTFT med	TTFT p99	TPOT mean	TPOT med	TPOT p99
llama.cpp.q4_k_m	1.65	1683.45	207.16	1890.61	223.00	74.17	75.75	85.71	4.36	4.22	8.40
sglang.fp8-auto	1.15	1173.85	142.83	1316.68	146.00	54.88	55.31	55.79	6.61	6.62	6.62
sglang.fp8-dynamic	1.04	1065.99	130.29	1196.28	132.00	55.91	56.30	57.13	7.28	7.29	7.29
sglang.w4a16	1.56	1590.93	194.85	1785.78	204.00	53.69	54.10	54.79	4.74	4.75	4.76
trt.fp8	0.59	605.67	74.33	680.01	76.00	39.94	40.24	40.76	13.24	13.24	13.27
trt.nvfp4	0.60	608.22	74.38	682.61	76.00	30.91	31.05	31.31	13.30	13.30	13.34
vllm.fp8-dynamic	0.77	789.55	94.90	884.45	98.00	34.94	35.12	36.43	10.34	10.34	10.36
vllm.w4a16	1.52	1549.83	189.81	1739.64	196.00	49.09	49.39	50.30	4.92	4.92	4.96

aazzolini Oct 17, 2025

@lhl what's the prefill sequence length in the profiles above?
my usecase is pre-fill only at seqlen > 300

lhl Oct 18, 2025

This is using a standard vLLM bench - ShareGPT w/ prefill 1024 and decode 128 I believe. If you have a specific use case it's probably best to just trying the device directly - I think they're available for a buck or two on Vast or Runpod.

I think the compute is particularly strong for a client card. For example, the PRO 6000 actually beats an H100 on our Whisper inference sweeps. (Still trains much slower though)

Here's my LLM sweep scripts (and raw results) btw: https://github.com/AUGMXNT/speed-benchmarking/tree/main/nvfp4

Performance of llama.cpp on NVIDIA DGX Spark #16578

Uh oh!

Uh oh!

ggerganov Oct 14, 2025 Maintainer

Overview

Benchmarks

History

gpt-oss-20b

gpt-oss-120b

Qwen3 Coder 30B A3B

Qwen2.5 Coder

Gemma 3 4B QAT

GLM 4.5 Air

More info

Replies: 13 comments · 45 replies

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

netrunnereve Oct 15, 2025 Collaborator

Uh oh!

Uh oh!

Uh oh!

ggerganov Oct 15, 2025 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ggerganov Oct 15, 2025 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ggerganov
Oct 14, 2025
Maintainer

Replies: 13 comments 45 replies

netrunnereve
Oct 15, 2025
Collaborator

ggerganov Oct 15, 2025
Maintainer Author

ggerganov Oct 15, 2025
Maintainer Author