Faster prompt processing on CUDA #1687

Merged: ikawrakow merged 3 commits into main from ik/better_fixup_stream_k on Apr 25, 2026

@ikawrakow
Owner

This is a port of PR 22298 in llama.cpp.

Noticeable performance gains (~10%) for MoE models, especially when using split mode graph.

Much more modest gains (2-3%) for dense models.

@markaalonzo
Contributor

Single-GPU MoE data point for reference.

Setup: RTX 3080 Ti, CUDA 13.1, clean worktree checkout of PR branch. Model: Qwen3.6-35B-A3B IQ4_K_R4 (18.37 GiB). llama-bench -fa 1 -ger 1 -ser 4,0 --fit 1, 2 reps.

| prompt tokens | base (4436, 8e7a2b5c) | this PR (4438, eb550ad) | delta |
|---|---|---|---|
| 3,000 | 337.31 ± 4.47 tok/s | 334.38 ± 10.92 tok/s | −0.87% |
| 32,000 | 335.93 ± 0.25 tok/s | 335.65 ± 0.74 tok/s | −0.08% |

Both deltas sit within the measurement noise, which is consistent with the PR body framing the ~10% MoE gain as being tied to split mode graph (multi-GPU). Single-GPU MoE looks essentially flat at these prompt lengths.

Build is clean (no conflicts, CMake Release, CUDA arch 86). Happy to re-run with different flags if that would narrow down the conditions where the gain kicks in.
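To make the "inside the noise" reading concrete, here is a minimal Python sketch that recomputes the relative deltas from the table above and compares each absolute difference against the combined standard deviation of the two runs. The function names and the choice of noise criterion (root sum of squares of the two reported deviations) are assumptions made for illustration, not anything llama-bench itself reports, and with only 2 reps the deviations are rough estimates.

```python
import math

def relative_delta_pct(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0

def within_noise(base: float, base_sd: float, new: float, new_sd: float) -> bool:
    """True if |new - base| does not exceed the combined deviation.

    The combined deviation is sqrt(base_sd^2 + new_sd^2); this criterion
    is an assumption chosen for illustration.
    """
    return abs(new - base) <= math.hypot(base_sd, new_sd)

# (base t/s, base sd, PR t/s, PR sd), keyed by prompt length, from the table
runs = {
    3000: (337.31, 4.47, 334.38, 10.92),
    32000: (335.93, 0.25, 335.65, 0.74),
}

for n_prompt, (base, base_sd, pr, pr_sd) in runs.items():
    delta = relative_delta_pct(base, pr)
    noisy = within_noise(base, base_sd, pr, pr_sd)
    print(f"{n_prompt:>6} tokens: delta = {delta:+.2f}%, within noise: {noisy}")
```

Both rows come out as within noise under this criterion, matching the deltas quoted in the table.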

@ikawrakow
Owner Author

@markaalonzo

Your GPU has 12 GB VRAM, so many MoE layers will end up on the CPU. With the default u-batch size of 512 the MoE computation for these will be done on the CPU [1]. Observed performance will be totally dominated by that. Here is what I get with full offload for this model on a single 3090 GPU:

Main branch

| model | backend | ngl | threads | n_ubatch | test | t/s |
|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B IQ4_K_R4 | CUDA | 100 | 1 | 2048 | pp2048 | 4246.78 ± 64.81 |

PR

| model | backend | ngl | threads | n_ubatch | test | t/s |
|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B IQ4_K_R4 | CUDA | 100 | 1 | 2048 | pp2048 | 4931.29 ± 68.43 |

I.e., the PR is about 16% faster at zero context.

Running this model CPU-only on my Ryzen-3995WX CPU I get

| model | backend | ngl | threads | n_ubatch | test | t/s |
|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B IQ4_K_R4 | CUDA | 100 | 1 | 2048 | pp2048 | 573.64 ± 20.18 |

I.e., faster than your hybrid CPU/GPU result. I think you should buy a better CPU to go along with your 12 GB GPU.


[1] For MoE models, the decision whether to upload a tensor residing in RAM to the GPU to perform the matrix multiplication depends on the batch size, the total number of experts (N_tot), and the number of active experts (N_active). By default the tensor is uploaded if batch size > 32 * N_tot / N_active. For Qwen3.6-35B-A3B we have N_tot = 256, and you have set N_active = 4 via -ser 4,0. Hence, the batch size must be greater than 2048 to offload the MoE matrix multiplications to the GPU.
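The arithmetic in the footnote can be checked with a short Python sketch. This is only an illustration of the rule as stated above (batch size > 32 * N_tot / N_active); the function names and the `factor` parameter are hypothetical and do not correspond to identifiers in the ik_llama.cpp source.

```python
def min_batch_for_gpu_offload(n_tot: int, n_active: int, factor: int = 32) -> float:
    """Threshold from the footnote: factor * N_tot / N_active.

    `factor=32` is the default multiplier quoted in the footnote; the name
    is an assumption made for this sketch.
    """
    return factor * n_tot / n_active

def moe_matmul_runs_on_gpu(batch_size: int, n_tot: int, n_active: int) -> bool:
    """True if a RAM-resident MoE tensor would be uploaded to the GPU.

    The comparison is strict: the batch size must exceed the threshold.
    """
    return batch_size > min_batch_for_gpu_offload(n_tot, n_active)

# Values from the thread: N_tot = 256, N_active = 4 (set via -ser 4,0)
threshold = min_batch_for_gpu_offload(256, 4)
print(f"threshold: {threshold:.0f}")

for batch in (512, 2048, 4096):
    on_gpu = moe_matmul_runs_on_gpu(batch, 256, 4)
    print(f"batch {batch:>4}: {'GPU' if on_gpu else 'CPU'}")
```

With the default u-batch of 512, and even with a u-batch of exactly 2048, the comparison fails, so under this rule the MoE matrix multiplications for CPU-resident layers stay on the CPU.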

@ikawrakow ikawrakow merged commit 3a945af into main Apr 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 25, 2026
@Ph0rk0z

Ph0rk0z commented Apr 27, 2026

I was doing a bunch of home improvement so I didn't have time to test this. On dense gemma 31B I end up with 1/4 of the prompt processing speed. If I revert the commit, it's fine and back at 2000 t/s.

@ikawrakow
Owner Author

@Ph0rk0z

> I was doing a bunch of home improvement so I didn't have time to test this. On dense gemma 31B I end up with 1/4 of the prompt processing speed. If I revert the commit, it's fine and back at 2000 t/s.

And this is with your usual 4x3090 setup using -sm graph?

@Ph0rk0z

Ph0rk0z commented Apr 27, 2026

Yes, same gemma 31B. I haven't tried mistral large or any MoE yet. At first I thought it was my hardware, but I went back to a pre-autoparser local branch, then went to main and reverted the commit.

ikawrakow added a commit that referenced this pull request Apr 28, 2026