Faster prompt processing on CUDA #1687
|
Single-GPU MoE data point for reference. Setup: RTX 3080 Ti, CUDA 13.1, clean worktree checkout of PR branch. Model: Qwen3.6-35B-A3B IQ4_K_R4 (18.37 GiB).
Both deltas sit inside the measurement noise, consistent with the PR body framing the ~10% MoE gain as being tied to split mode. Build is clean (no conflicts, CMake Release, CUDA arch 86). Happy to re-run with different flags if that would narrow down the conditions where the gain kicks in. |
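For anyone who wants to check "inside the measurement noise" on their own runs, here is a minimal sketch of one way to do it. It assumes each benchmark run reports a mean throughput in t/s plus a standard deviation; the function name `compare_runs`, the threshold factor `k`, and the numbers in the usage part are all made up for illustration and are not taken from this thread.

```python
from math import sqrt


def compare_runs(base_tps, base_std, pr_tps, pr_std, k=2.0):
    """Compare two throughput measurements (tokens/second).

    Returns (percent_change, within_noise). The change counts as noise
    when the absolute difference is no larger than k times the combined
    standard deviation of the two runs. k=2.0 is an arbitrary choice here.
    """
    delta = pr_tps - base_tps
    percent_change = 100.0 * delta / base_tps
    noise = k * sqrt(base_std ** 2 + pr_std ** 2)
    return percent_change, abs(delta) <= noise


if __name__ == "__main__":
    # Hypothetical numbers, only to show how the helper is called.
    pct, noisy = compare_runs(1000.0, 15.0, 1012.0, 14.0)
    verdict = "within noise" if noisy else "outside noise"
    print(f"{pct:+.1f}% ({verdict})")
```

A small change between runs with comparable spread gets flagged as noise, while a gain much larger than the spread does not.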
|
Your GPU has 12 GB of VRAM, so many MoE layers will end up on the CPU. With the default u-batch size of 512, the MoE computation for these layers will be done on the CPU [1], and the observed performance will be totally dominated by that. Here is what I get with full offload for this model on a single 3090 GPU:
Main branch
PR
I.e., the PR is about 16% faster at zero context. Running this model CPU-only on my Ryzen-3995WX CPU I get
I.e., faster than your hybrid CPU/GPU result. I think you should buy a better CPU to go along with your 12 GB GPU.
[1] For MoE models, the decision whether to upload a tensor residing in RAM to the GPU to perform the matrix multiplication depends on the batch size, the number of total experts ( |
This reverts commit 3a945af.
|
I was doing a bunch of home improvement so I didn't have time to test this. On dense gemma 31B I end up with 1/4 of the prompt processing speed. After reverting the commit it's fine and back at 2000 t/s. |
And this is with your usual 4x3090 setup using |
|
Yes, same gemma 31b. I didn't try mistral large or any MoE yet. I first thought it was my hardware, but I went back to a pre-autoparser local branch, and then went to main and reverted the commit. |
This is a port of PR 22298 in llama.cpp.
Noticeable performance gains (~10%) for MoE models, especially when using split mode graph.
Much more modest gains (2-3%) for dense models.