CUDA Performance Regression on Jetson AGX Orin #16815

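I noticed a roughly 10% drop in token generation (tg) throughput on the Jetson AGX Orin this week. A bisect led me to commit f77c13b91f4d25754b6a0b857f98a6bc922a0aa7 (https://github.com/ggml-org/llama.cpp/pull/16715). The llama-bench results below are from the preceding build (3cfa9c3f1, 6840) and then from the regressing build (f77c13b91, 6841).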

---
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |            tg32 |         37.09 ± 0.58 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |            tg64 |         37.31 ± 0.05 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |           tg128 |         37.33 ± 0.02 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |           tg512 |         37.20 ± 0.01 |

build: 3cfa9c3f1 (6840)

---
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |            tg32 |         33.21 ± 0.44 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |            tg64 |         33.39 ± 0.04 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |           tg128 |         33.40 ± 0.02 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 |    0 |           tg512 |         33.29 ± 0.01 |

build: f77c13b91 (6841)
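
To put a number on the gap, here is a minimal sketch that computes the relative change per test from the t/s means in the two tables above. The values are copied by hand from those tables; the helper name `percent_change` is just an illustrative choice, and the ± spread reported by llama-bench is ignored.

```python
# Mean throughput in tokens/s, copied by hand from the two llama-bench tables.
before = {"tg32": 37.09, "tg64": 37.31, "tg128": 37.33, "tg512": 37.20}  # build 3cfa9c3f1 (6840)
after = {"tg32": 33.21, "tg64": 33.39, "tg128": 33.40, "tg512": 33.29}   # build f77c13b91 (6841)


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means slower)."""
    return (new - old) / old * 100.0


# Per-test change between the two builds.
changes = {test: percent_change(before[test], after[test]) for test in before}

for test, pct in changes.items():
    print(f"{test:>6}: {before[test]:.2f} -> {after[test]:.2f} t/s ({pct:+.1f}%)")
```

Every test comes out at about -10.5%, so the slowdown looks independent of the number of generated tokens.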
