Faster prompt processing on CUDA by ikawrakow · Pull Request #1687 · ikawrakow/ik_llama.cpp

ikawrakow · 2026-04-24T15:59:08Z

This is a port of PR 22298 in llama.cpp.

Noticeable performance gains (~10%) for MoE models, especially when using split mode graph.

Much more modest gains (2-3%) for dense models.

markaalonzo · 2026-04-25T02:21:54Z

Single-GPU MoE data point for reference.

Setup: RTX 3080 Ti, CUDA 13.1, clean worktree checkout of PR branch. Model: Qwen3.6-35B-A3B IQ4_K_R4 (18.37 GiB). Command: `llama-bench -fa 1 -ger 1 -ser 4,0 --fit 1`, 2 reps.

| prompt tokens | base (4436, `8e7a2b5c`) | this PR (4438, `eb550ad`) | delta |
| ---: | ---: | ---: | ---: |
| 3,000 | 337.31 ± 4.47 tok/s | 334.38 ± 10.92 tok/s | −0.87% |
| 32,000 | 335.93 ± 0.25 tok/s | 335.65 ± 0.74 tok/s | −0.08% |

Both deltas sit inside the measurement noise. That is consistent with the PR body framing the ~10% MoE gain as being tied to split mode graph (multi-GPU). Single-GPU MoE looks essentially flat at these prompt lengths.
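As a rough illustration of the "inside the measurement noise" reading, the sketch below recomputes the deltas from the table and compares each absolute difference with the two ± values combined in quadrature. It assumes the ± figures are standard deviations across the reps and that quadrature is a reasonable way to combine them; both are assumptions of this sketch, not something stated in the comment, and with only 2 reps the check is indicative at best. The function names are hypothetical.

```python
import math

def relative_delta_percent(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new / base - 1.0) * 100.0

def within_noise(base: float, base_err: float, new: float, new_err: float) -> bool:
    """True if |new - base| does not exceed the two ± values combined in quadrature.

    Assumption: the ± values are standard deviations of independent runs.
    """
    combined = math.hypot(base_err, new_err)
    return abs(new - base) <= combined

# (base t/s, base ±, PR t/s, PR ±) taken from the table above
runs = {
    3000: (337.31, 4.47, 334.38, 10.92),
    32000: (335.93, 0.25, 335.65, 0.74),
}

for n_prompt, (base, base_err, new, new_err) in runs.items():
    delta = relative_delta_percent(base, new)
    noisy = within_noise(base, base_err, new, new_err)
    print(f"{n_prompt} tokens: {delta:+.2f}%  within noise: {noisy}")
```

Under these assumptions both rows come out as within noise, which matches the reading above.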

Build is clean (no conflicts, CMake Release, CUDA arch 86). Happy to re-run with different flags if that would narrow down the conditions where the gain kicks in.

ikawrakow · 2026-04-25T06:46:26Z

@markaalonzo

Your GPU has 12 GB VRAM, so many MoE layers will end up on the CPU. With the default u-batch size of 512 the MoE computation for these will be done on the CPU¹. Observed performance will be totally dominated by that. Here is what I get with full offload for this model on a single 3090 GPU:

Main branch

| model | backend | ngl | threads | n_ubatch | test | t/s |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B IQ4_K_R4 | CUDA | 100 | 1 | 2048 | pp2048 | 4246.78 ± 64.81 |

PR

| model | backend | ngl | threads | n_ubatch | test | t/s |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B IQ4_K_R4 | CUDA | 100 | 1 | 2048 | pp2048 | 4931.29 ± 68.43 |

I.e., the PR is about 16% faster at zero context.
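For reference, the figure follows directly from the two pp2048 results above:

$$
\frac{4931.29}{4246.78} - 1 \approx 0.161
$$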

Running this model CPU-only on my Ryzen-3995WX CPU I get

| model | backend | ngl | threads | n_ubatch | test | t/s |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B IQ4_K_R4 | CUDA | 100 | 1 | 2048 | pp2048 | 573.64 ± 20.18 |

I.e., faster than your hybrid CPU/GPU result. I think you should buy a better CPU to go along with your 12 GB GPU.

¹ For MoE models, the decision whether to upload a tensor residing in RAM to the GPU to perform the matrix multiplication depends on the batch size, the total number of experts (N_tot), and the number of active experts (N_active). By default it is uploaded if batch size > 32 * N_tot / N_active. For Qwen3.6-35B-A3B we have N_tot = 256, and you have set N_active = 4 via -ser 4,0. Hence, the batch size must be greater than 2048 to offload the MoE matrix multiplications to the GPU.
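The rule in the footnote can be written out as a small sketch. This is only an illustration of the stated inequality, not the actual ik_llama.cpp code: the function names are hypothetical, and the factor 32 is taken from the footnote's description of the default.

```python
def moe_offload_threshold(n_total_experts: int, n_active_experts: int, factor: int = 32) -> float:
    """Batch size that must be exceeded before RAM-resident MoE tensors
    are uploaded to the GPU, per the rule: batch > factor * N_tot / N_active."""
    if n_total_experts <= 0 or n_active_experts <= 0:
        raise ValueError("expert counts must be positive")
    return factor * n_total_experts / n_active_experts

def uploads_to_gpu(batch_size: int, n_total_experts: int, n_active_experts: int) -> bool:
    """True if the batch is large enough for the MoE matmul to run on the GPU."""
    # Strict inequality, as in the footnote: a batch equal to the threshold stays on the CPU.
    return batch_size > moe_offload_threshold(n_total_experts, n_active_experts)

# Values from the footnote: N_tot = 256, N_active = 4 (set via -ser 4,0)
print(moe_offload_threshold(256, 4))   # 2048.0
print(uploads_to_gpu(512, 256, 4))     # False: default u-batch of 512 stays on the CPU
print(uploads_to_gpu(2048, 256, 4))    # False: must be strictly greater than 2048
print(uploads_to_gpu(4096, 256, 4))    # True
```

Because N_active sits in the denominator, lowering the number of active experts raises the threshold, so under this rule a smaller `-ser` expert count needs a larger batch before the upload happens.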


Ph0rk0z · 2026-04-27T11:07:16Z

I was doing a bunch of home improvement so I didn't have time to test this. On dense gemma 31B I end up with 1/4 of the prompt processing. I revert the commit and it's fine and 2000 t/s again.

ikawrakow · 2026-04-27T11:16:46Z

@Ph0rk0z

> I was doing a bunch of home improvement so I didn't have time to test this. On dense gemma 31B I end up with 1/4 of the prompt processing. I revert the commit and it's fine and 2000 t/s again.

And this is with your usual 4x3090 setup using -sm graph?

Ph0rk0z · 2026-04-27T12:56:56Z

Yes, same gemma 31b. I didn't try mistral large or any MoE yet. I first thought it was my hardware but I went back to pre autoparser local branch and then went to main and reverted the commit.


ikawrakow added 3 commits April 24, 2026 15:00

Better fixup_stream_k

9d8edab

ggml_cuda_op_mul_mat_q -> ggml_cuda_mul_mat_q_id

f8dd677

Adding forgotten file

eb550ad

ikawrakow merged commit 3a945af into main Apr 25, 2026

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 25, 2026

Revert "Faster prompt processing on CUDA (ikawrakow#1687)"

158a350

This reverts commit 3a945af.

ikawrakow mentioned this pull request Apr 27, 2026

Revert "Faster prompt processing on CUDA (#1687)" #1700

Merged

ikawrakow added a commit that referenced this pull request Apr 28, 2026

Revert "Faster prompt processing on CUDA (#1687)" (#1700)

aae9b8d

This reverts commit 3a945af.

ikawrakow merged 3 commits into main from ik/better_fixup_stream_k
