
Merge ffn_up and ffn_gate experts tensors#1137

Merged
ikawrakow merged 13 commits into main from ik/fuse_merge_up_gate_exps
Jan 12, 2026
Conversation

@ikawrakow
Owner

I have been thinking about merging the ffn_up and ffn_gate experts tensors into a single tensor for a while. But, this being a fairly intrusive change, and me not being sure how much performance improvement it would bring, I have been reluctant to make the necessary changes. Now there is PR 18470 in mainline llama.cpp, which claims a 10% PP performance improvement for GPT-OSS-20B, so I became curious to see whether the merge would be as beneficial in ik_llama.cpp as it is in mainline. This PR is the result.

Unlike PR 18470, where the merge is done during model conversion, and hence requires everyone to re-download many gigabytes of data, in this PR the merge is done on-the-fly during model loading. The implementation of the merging is not ideal at this point, but I became tired of fighting against the machine (the "machine" being the llama.cpp model loading machinery and ggml-backend limitations in this case), so I can have something working to test. Basically, the ffn_up_exps and ffn_gate_exps tensors are copied into temporary host buffers (possibly from a GPU), the merge is prepared on the host, and then the merged tensor is copied back to the corresponding device. This may add significant additional model loading time for very large models.

Limitations:

  • The PR is limited to just GPT-OSS and Qwen3-MoE. If feedback is positive, it will not take much to extend to all other supported MoE models (mainline's PR is limited to GPT-OSS only)
  • The merge is enabled only for split mode layer. As above, if feedback is positive, I think it will not be hard to also add to split mode graph.
  • Merge is disabled in layers with tensor overrides if the experts have biases (only applies to GPT-OSS at this point).

The merge is disabled by default. To enable it, add the command-line argument

`-muge` or `--merge-up-gate-experts`

Below are some performance comparisons for Qwen3-30B-A3B-IQ2_XXS and GPT-OSS-20B-MXFP4 with full offload on an RTX-3090 GPU. We observe ~7% (GPT-OSS-20B) or ~10% (Qwen3-30B-A3B) better PP.

GPT-OSS-20B

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.298 | 6872.48 | 0.558 | 229.22 |
| 2048 | 128 | 2048 | 0.272 | 7538.23 | 0.571 | 224.14 |
| 2048 | 128 | 4096 | 0.281 | 7284.91 | 0.585 | 218.63 |
| 2048 | 128 | 6144 | 0.292 | 7008.54 | 0.598 | 214.16 |
| 2048 | 128 | 8192 | 0.303 | 6767.76 | 0.608 | 210.65 |
| 2048 | 128 | 10240 | 0.315 | 6501.17 | 0.613 | 208.90 |
| 2048 | 128 | 12288 | 0.327 | 6261.50 | 0.622 | 205.66 |
| 2048 | 128 | 14336 | 0.337 | 6069.26 | 0.633 | 202.05 |
| 2048 | 128 | 16384 | 0.351 | 5842.60 | 0.644 | 198.75 |
| 2048 | 128 | 18432 | 0.360 | 5689.51 | 0.654 | 195.64 |
| 2048 | 128 | 20480 | 0.374 | 5482.09 | 0.658 | 194.39 |
| 2048 | 128 | 22528 | 0.383 | 5341.48 | 0.670 | 191.11 |
| 2048 | 128 | 24576 | 0.395 | 5190.30 | 0.681 | 187.95 |
| 2048 | 128 | 26624 | 0.407 | 5027.64 | 0.692 | 185.03 |
| 2048 | 128 | 28672 | 0.420 | 4881.91 | 0.696 | 183.96 |
| 2048 | 128 | 30720 | 0.432 | 4739.47 | 0.713 | 179.54 |
| 2048 | 128 | 32768 | 0.440 | 4650.04 | 0.722 | 177.25 |

GPT-OSS-20B with -muge (this PR)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.278 | 7360.68 | 0.554 | 231.05 |
| 2048 | 128 | 2048 | 0.254 | 8069.16 | 0.571 | 224.02 |
| 2048 | 128 | 4096 | 0.262 | 7801.99 | 0.586 | 218.48 |
| 2048 | 128 | 6144 | 0.274 | 7464.10 | 0.598 | 214.13 |
| 2048 | 128 | 8192 | 0.286 | 7164.32 | 0.607 | 210.79 |
| 2048 | 128 | 10240 | 0.298 | 6871.61 | 0.612 | 208.98 |
| 2048 | 128 | 12288 | 0.312 | 6559.41 | 0.623 | 205.47 |
| 2048 | 128 | 14336 | 0.320 | 6397.94 | 0.636 | 201.34 |
| 2048 | 128 | 16384 | 0.333 | 6155.10 | 0.645 | 198.59 |
| 2048 | 128 | 18432 | 0.343 | 5963.86 | 0.655 | 195.53 |
| 2048 | 128 | 20480 | 0.355 | 5770.04 | 0.659 | 194.23 |
| 2048 | 128 | 22528 | 0.366 | 5596.01 | 0.672 | 190.53 |
| 2048 | 128 | 24576 | 0.381 | 5380.94 | 0.688 | 186.11 |
| 2048 | 128 | 26624 | 0.390 | 5257.79 | 0.698 | 183.48 |
| 2048 | 128 | 28672 | 0.401 | 5112.25 | 0.696 | 183.86 |
| 2048 | 128 | 30720 | 0.415 | 4931.78 | 0.707 | 180.93 |
| 2048 | 128 | 32768 | 0.425 | 4820.84 | 0.724 | 176.79 |

Qwen3-30B-A3B-IQ2_XXS

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.393 | 5208.25 | 0.636 | 201.16 |
| 2048 | 128 | 2048 | 0.383 | 5345.54 | 0.632 | 202.49 |
| 2048 | 128 | 4096 | 0.426 | 4810.45 | 0.676 | 189.44 |
| 2048 | 128 | 6144 | 0.465 | 4403.54 | 0.733 | 174.59 |
| 2048 | 128 | 8192 | 0.503 | 4073.75 | 0.736 | 173.88 |
| 2048 | 128 | 10240 | 0.545 | 3756.13 | 0.779 | 164.29 |
| 2048 | 128 | 12288 | 0.585 | 3502.74 | 0.817 | 156.73 |
| 2048 | 128 | 14336 | 0.624 | 3282.38 | 0.834 | 153.52 |
| 2048 | 128 | 16384 | 0.664 | 3084.09 | 0.879 | 145.56 |
| 2048 | 128 | 18432 | 0.703 | 2914.87 | 0.888 | 144.12 |
| 2048 | 128 | 20480 | 0.745 | 2749.70 | 0.913 | 140.27 |
| 2048 | 128 | 22528 | 0.783 | 2614.42 | 0.952 | 134.48 |
| 2048 | 128 | 24576 | 0.822 | 2490.25 | 0.973 | 131.57 |
| 2048 | 128 | 26624 | 0.859 | 2384.98 | 1.010 | 126.68 |
| 2048 | 128 | 28672 | 0.901 | 2272.91 | 1.023 | 125.17 |
| 2048 | 128 | 30720 | 0.939 | 2182.05 | 1.045 | 122.44 |
| 2048 | 128 | 32768 | 0.982 | 2085.91 | 1.098 | 116.60 |
| 2048 | 128 | 34816 | 1.023 | 2002.47 | 1.110 | 115.34 |
| 2048 | 128 | 36864 | 1.063 | 1926.89 | 1.157 | 110.61 |
| 2048 | 128 | 38912 | 1.107 | 1850.49 | 1.171 | 109.28 |

Qwen3-30B-A3B-IQ2_XXS with -muge (this PR)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.334 | 6132.49 | 0.607 | 210.72 |
| 2048 | 128 | 2048 | 0.348 | 5878.47 | 0.638 | 200.48 |
| 2048 | 128 | 4096 | 0.387 | 5286.58 | 0.670 | 190.92 |
| 2048 | 128 | 6144 | 0.425 | 4820.12 | 0.727 | 176.06 |
| 2048 | 128 | 8192 | 0.465 | 4407.45 | 0.736 | 173.88 |
| 2048 | 128 | 10240 | 0.504 | 4065.60 | 0.771 | 166.05 |
| 2048 | 128 | 12288 | 0.547 | 3742.51 | 0.820 | 156.17 |
| 2048 | 128 | 14336 | 0.587 | 3487.47 | 0.835 | 153.37 |
| 2048 | 128 | 16384 | 0.627 | 3267.30 | 0.872 | 146.73 |
| 2048 | 128 | 18432 | 0.667 | 3072.32 | 0.890 | 143.87 |
| 2048 | 128 | 20480 | 0.707 | 2897.05 | 0.910 | 140.73 |
| 2048 | 128 | 22528 | 0.745 | 2747.78 | 0.954 | 134.20 |
| 2048 | 128 | 24576 | 0.786 | 2604.78 | 0.976 | 131.15 |
| 2048 | 128 | 26624 | 0.827 | 2476.13 | 1.014 | 126.21 |
| 2048 | 128 | 28672 | 0.868 | 2359.36 | 1.032 | 124.09 |
| 2048 | 128 | 30720 | 0.908 | 2255.11 | 1.046 | 122.37 |
| 2048 | 128 | 32768 | 0.949 | 2158.36 | 1.099 | 116.42 |
| 2048 | 128 | 34816 | 0.992 | 2065.28 | 1.119 | 114.39 |
| 2048 | 128 | 36864 | 1.028 | 1991.84 | 1.157 | 110.67 |
| 2048 | 128 | 38912 | 1.070 | 1913.25 | 1.168 | 109.58 |

The implementation is, however, extremely crude. The only way I could correctly repack the up/gate experts is to copy up and gate into host buffers, repack into another host buffer, and copy the result back into the ffn_up_gate_exps tensor. This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when using the small-batch-size implementation.

Other models are not yet working on CUDA, as I need to fix the fused mul-unary implementation. And when I say "working" here and in the previous commit, I mean that PP is working; TG is still broken.
@am17an

am17an commented Jan 12, 2026

I tried the merging approach at run-time for QKV merge (ggml-org/llama.cpp#16813), it turned out to be quite a messy thing. I have not seen how you implemented this PR, but glad to see you also see similar improvement!

@ikawrakow
Owner Author

> I tried the merging approach at run-time for QKV merge (ggml-org/llama.cpp#16813), it turned out to be quite a messy thing. I have not seen how you implemented this PR, but glad to see you also see similar improvement!

The Q, K, V merge at run time is actually quite simple, and has been available here for a while. Merging ffn_up and ffn_gate at run time is much messier.

@ikawrakow ikawrakow merged commit c03c2d7 into main Jan 12, 2026
@MrHills-rs

Does this work with qwen3vl too? I can test with mixed inference if it does

@ikawrakow
Owner Author

> Does this work with qwen3vl too? I can test with mixed inference if it does

Added in PR #1139. I haven't tested Qwen3VL-MoE, so would appreciate your feedback.
