Merge `ffn_up` and `ffn_gate` experts tensors #1137
Conversation
However, it is extremely stupid. The only way I could correctly repack the up/gate experts is to copy up and gate into host buffers, repack into another host buffer, and copy back into the `ffn_up_gate_exps` tensor. This is going to be very slow for giant 500 GB models. My attempts to do this via a compute graph on the backend holding the tensors were unsuccessful. For GPT-OSS-20B I see ~6-7% better PP when using the original ik_llama.cpp fused up/gate CUDA implementation, and ~10% when using the small batch size implementation. Other models are not working yet on CUDA, as I need to fix the fused mul-unary implementation.
But when I say "working" here and in the previous commit, I mean that PP is working. TG is still broken.
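The host-side repack described above can be sketched in plain C++. This is a minimal illustration under assumed layout, not the actual ik_llama.cpp code: both expert tensors are taken to be contiguous row-major `[n_expert][n_ff][n_embd]` slabs, and the fused tensor stacks the gate rows after the up rows within each expert.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side merge of up/gate expert tensors.
// up, gate: row-major [n_expert][n_ff][n_embd] slabs already copied to host.
// Result: [n_expert][2*n_ff][n_embd], up rows first, then gate rows, per expert.
static std::vector<float> merge_up_gate(const std::vector<float> & up,
                                        const std::vector<float> & gate,
                                        size_t n_expert, size_t n_ff, size_t n_embd) {
    assert(up.size() == n_expert * n_ff * n_embd && up.size() == gate.size());
    std::vector<float> fused(2 * up.size());
    const size_t per_exp = n_ff * n_embd; // elements per expert in each source tensor
    for (size_t e = 0; e < n_expert; ++e) {
        std::copy(up.begin()   + e * per_exp, up.begin()   + (e + 1) * per_exp,
                  fused.begin() + (2 * e)     * per_exp);
        std::copy(gate.begin() + e * per_exp, gate.begin() + (e + 1) * per_exp,
                  fused.begin() + (2 * e + 1) * per_exp);
    }
    return fused;
}
```

In the actual flow this buffer would then be copied back into the device-resident fused tensor, which is exactly the copy-out/copy-back round trip that makes loading slow for huge models.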
It is not yet implemented
I tried the merging approach at run time for the QKV merge (ggml-org/llama.cpp#16813), and it turned out to be quite messy. I have not looked at how you implemented this PR, but glad to see you also observe a similar improvement!
The Q, K, V merge at run time is actually quite simple, and has been available here for a while. Merging …
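The run-time Q, K, V merge mentioned above amounts to stacking the three projection matrices row-wise so a single matmul produces the concatenated outputs. A minimal plain-C++ sketch under that assumption (hypothetical names, not the actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fused QKV projection: wqkv is Wq, Wk, Wv stacked row-wise,
// shape [(dq + dk + dv) x n_embd], row-major. One matvec yields [q; k; v],
// which callers then slice by row offset instead of running three matmuls.
static std::vector<float> qkv_matvec(const std::vector<float> & wqkv,
                                     const std::vector<float> & x, // [n_embd]
                                     size_t n_out) {               // dq + dk + dv
    const size_t n_embd = x.size();
    assert(wqkv.size() == n_out * n_embd);
    std::vector<float> y(n_out, 0.0f);
    for (size_t r = 0; r < n_out; ++r)
        for (size_t c = 0; c < n_embd; ++c)
            y[r] += wqkv[r * n_embd + c] * x[c];
    return y; // q = y[0..dq), k = y[dq..dq+dk), v = y[dq+dk..n_out)
}
```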
Does this work with Qwen3VL too? I can test with mixed inference if it does.
Added in PR #1139. I haven't tested Qwen3VL-MoE, so would appreciate your feedback.
I have been thinking about merging the `ffn_up` and `ffn_gate` experts tensors into a single tensor for a while. But, this being a fairly intrusive change, and me not being sure how much performance improvement one might get from it, I have been reluctant to make the necessary changes. But now there is PR 18470 in mainline llama.cpp, which claims a 10% PP performance improvement for GPT-OSS-20B, so I became curious to see if the merge would be as beneficial in ik_llama.cpp as it is in mainline. This PR is the result.

Unlike PR 18470, where the merge is done during model conversion, which requires everyone to re-download many gigabytes of data, in this PR the merge is done on-the-fly during model loading. The implementation of the merging is not ideal at this point, but I became tired of fighting against the machine (the "machine" being the llama.cpp model loading machinery and ggml-backend limitations in this case), so I can have something working to test. Basically, the `ffn_up_exps` and `ffn_gate_exps` tensors are copied into temporary buffers (possibly from a GPU), the merge is prepared on the host, and the merged tensor is then copied back to the corresponding device. This may add significant additional model loading time for very large models.

Limitations:

…layer. As above, if feedback is positive, I think it will not be hard to also add it to split mode `graph`.

The merge is disabled by default. To enable it one needs to add the command-line argument `-muge`.
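Once the tensors are merged, the fused mul-unary step the first comment refers to can be illustrated as follows. This is a hedged sketch, not the CUDA kernel: a single matmul against the merged tensor is assumed to yield, per token, `2*n_ff` values with the up half first and the gate half second (matching the merge order above), and the FFN epilogue computes `silu(gate) * up` in one pass instead of two separate ops.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical fused mul-unary epilogue for one token.
// z: output of the single fused matmul, [up(0..n_ff) | gate(n_ff..2*n_ff)].
// Returns silu(gate) * up elementwise.
static std::vector<float> fused_silu_mul(const std::vector<float> & z, size_t n_ff) {
    assert(z.size() == 2 * n_ff);
    std::vector<float> out(n_ff);
    for (size_t i = 0; i < n_ff; ++i) {
        const float up   = z[i];
        const float gate = z[n_ff + i];
        out[i] = up * (gate / (1.0f + std::exp(-gate))); // silu(gate) * up
    }
    return out;
}
```

The payoff of the merge is that the up and gate halves arrive in one matmul output, so this epilogue replaces the separate gate-activation and multiply kernels.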
Below are some performance comparisons for Qwen3-30B-A3B-IQ2_XXS and GPT-OSS-20B-MXFP4 with full offload on an RTX-3090 GPU. We observe ~7% (GPT-OSS-20B) or ~10% (Qwen3-30B-A3B) better PP.
GPT-OSS-20B
GPT-OSS-20B with -muge (this PR)
Qwen3-30B-A3B-IQ2_XXS
Qwen3-30B-A3B-IQ2_XXS with -muge (this PR)