Performance of llama.cpp on NVIDIA DGX Spark #16578
-
Thanks for the benchmark! I would like to request an additional benchmark for a very popular model, GLM-4.5-Air-FP8, and quants of it:
-
Hi. It would be great to see a Qwen3 Next 80B benchmark for this model: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 Thanks.
-
Getting similar performance with my Framework Desktop. Thanks for helping with my FOMO.
-
Can you run the classic Llama 2 7B Q4_0 so it can be compared on the chart?
-
Super interesting, thanks for sharing, Georgi!
Could you please help me understand: does "-d" mean the KV cache length before the "-p" prefill happens? And what does "-ub" define, e.g. the batch size?
-
Could you add a llama2-7b result to #15013?
-
Awesome, thank you! So what's the point of a DGX Spark? I mean, sure, it has 128 GB of memory, but I can split bigger models between 96 GB of VRAM and the rest in normal RAM (CPU)... It's too expensive for what it offers. If the DGX Spark were around 2k, like the Ryzen Max 395+ mini-PCs, it would be fine. PS: A Mac Mini/Studio is a much better option at 4k USD/EUR compared to a DGX Spark.
-
@ggerganov Are there llama.cpp benchmarks for the AGX Thor? It seems to be a similar offering, but Nvidia markets it as twice as fast. There is no official detailed spec sheet for the DGX Spark to compare it against the Thor (2560 CUDA cores and 92 tensor cores), but Nvidia claims 2 PFLOPS (sparse FP4) for the Thor and 1 PFLOPS (sparse FP4) for the Spark.
-
For those curious about Thor performance:

gpt-oss-20b-gguf# ./bin/llama-bench -m /workspace/models/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1862.13 ± 4.80 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 55.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1740.90 ± 3.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 53.58 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1446.75 ± 3.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 52.49 ± 1.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 48.33 ± 0.04 |
build: f9fb33f2 (6771)

Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF# ./bin/llama-bench -m /workspace/models/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1654.25 ± 1.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 44.26 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1410.87 ± 2.22 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 39.46 ± 0.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1228.69 ± 1.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.88 ± 0.13 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 985.39 ± 7.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 686.45 ± 0.93 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 26.92 ± 0.05 |
build: f9fb33f2 (6771)

gpt-oss-120b# ./bin/llama-bench -m /workspace/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 967.20 ± 6.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 42.00 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 932.85 ± 2.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.81 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 892.28 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 39.22 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 827.57 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 37.77 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 677.70 ± 1.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.02 ± 0.02 |
build: f9fb33f2 (6771)
-
Would love to see the accuracy of the same models on the main benchmarks when running on the DGX, since accuracy varies with different HW & FW in addition to speed, as is clearly visible here: https://artificialanalysis.ai/models/gpt-oss-120b/providers
-
Please bench the full Qwen3 Coder model.
-
Would love to see this cluster setup in the comparison table too.
-
On the subject of Spark and Thor, I have been looking for alternatives to TensorRT for a Python-free, community-driven inference engine. I'm looking to leverage NVFP4 tensor cores, and wonder if there are any projects or folks working to support those in llama.cpp?
-
Overview

This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.

Benchmarks include prompt processing (pp) and generation (tg) at various context depths (d).

Models:
gpt-oss-20b
gpt-oss-120b
Qwen3 Coder 30B A3B
Qwen2.5 Coder 7B
Gemma 3 4B QAT
GLM 4.5 Air
Feel free to request additional benchmarks for models and use cases.
Benchmarks
Using the following commands:
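The original command listing did not survive here, so the sketch below is reconstructed from the llama-bench invocation shown in the Thor comment above; the model path and the llama-batched-bench values are illustrative assumptions, not necessarily the exact flags used for the Spark runs.

```sh
# llama-bench: prompt processing (pp2048) and generation (tg32), repeated at
# increasing KV-cache depths (-d). -ub sets the physical batch size and
# -fa 1 enables flash attention. The model path is a placeholder.
./bin/llama-bench -m ./models/gpt-oss-20b-mxfp4.gguf \
    -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

# llama-batched-bench: parallel-request throughput. Flag names follow the
# tool's README (-npp prompt tokens, -ntg generated tokens, -npl number of
# parallel sequences); the values here are illustrative, not the ones used
# for the tables below.
./bin/llama-batched-bench -m ./models/gpt-oss-20b-mxfp4.gguf \
    -c 32768 -fa 1 -ub 2048 -npp 2048 -ntg 32 -npl 1,2,4,8,16,32
```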
History

2025 Oct 14 (b6761)
- 7ea15bb Initial numbers

2025 Oct 15 (b6767)
- 5acd455 Improved decode via CUDA: Changing the CUDA scheduling strategy to spin #16585

gpt-oss-20b
Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
llama-bench
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: 5acd455 (6767)
llama-batched-bench
gpt-oss-120b
Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
llama-bench
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: 5acd455 (6767)
llama-batched-bench
Qwen3 Coder 30B A3B
Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF
llama-bench
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: 5acd455 (6767)
llama-batched-bench
Qwen2.5 Coder
Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF
llama-bench
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: 5acd455 (6767)
llama-batched-bench
Gemma 3 4B QAT
Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF
llama-bench
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: 5acd455 (6767)
llama-batched-bench
GLM 4.5 Air
Model: https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main
llama-bench
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: 5acd455 (6767)
llama-batched-bench
More info