
Add ability to merge up/gate expert tensors for Qwen3.5-MoE/Qwen3-Next #1403

Merged
ikawrakow merged 1 commit into main from ik/qwen35moe_muge on Mar 11, 2026

Conversation

@ikawrakow (Owner)

See #1137 for more details. Enabled via -muge on the command line. Only works with split mode layer (-sm layer).

We get a few percent PP performance improvement. For instance, for Qwen3.5-35B-A3B on a 3090 GPU:

ffn_up_exps and ffn_gate_exps not merged

PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
2048 128 0 0.456 4491.79 0.880 145.44
2048 128 2048 0.424 4827.55 0.861 148.64
2048 128 4096 0.433 4733.82 0.856 149.61
2048 128 6144 0.445 4604.26 0.869 147.35
2048 128 8192 0.451 4541.44 0.879 145.69
2048 128 10240 0.461 4443.35 0.889 144.05
2048 128 12288 0.471 4351.20 0.899 142.41
2048 128 14336 0.482 4250.46 0.904 141.61
2048 128 16384 0.492 4165.51 0.907 141.07
2048 128 18432 0.502 4075.81 0.915 139.84
2048 128 20480 0.510 4016.34 0.928 137.98
2048 128 22528 0.520 3936.46 0.938 136.40
2048 128 24576 0.529 3869.91 0.940 136.11
2048 128 26624 0.542 3779.71 0.944 135.57
2048 128 28672 0.551 3718.36 0.950 134.75
2048 128 30720 0.559 3661.77 0.955 134.09

ffn_up_exps and ffn_gate_exps merged

PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
2048 128 0 0.424 4834.69 0.845 151.42
2048 128 2048 0.405 5057.39 0.856 149.58
2048 128 4096 0.412 4969.45 0.860 148.77
2048 128 6144 0.423 4843.44 0.864 148.12
2048 128 8192 0.431 4750.75 0.874 146.53
2048 128 10240 0.440 4649.82 0.885 144.69
2048 128 12288 0.451 4537.33 0.901 142.11
2048 128 14336 0.462 4435.39 0.903 141.67
2048 128 16384 0.470 4357.42 0.907 141.09
2048 128 18432 0.481 4253.48 0.913 140.19
2048 128 20480 0.489 4187.76 0.919 139.27
2048 128 22528 0.499 4101.41 0.937 136.56
2048 128 24576 0.510 4016.21 0.939 136.25
2048 128 26624 0.520 3942.09 0.943 135.67
2048 128 28672 0.529 3868.80 0.949 134.94
2048 128 30720 0.539 3798.25 0.952 134.47

Note: in ik_llama.cpp the merge happens on-the-fly as the model gets loaded. This is unlike mainline llama.cpp, which requires a pre-merged model. Such pre-merged models will not work with ik_llama.cpp.
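The on-the-fly merge can be sketched as follows. This is a minimal NumPy illustration, not ik_llama.cpp code; the single-expert shapes and the stacking order (up rows first, then gate rows) are assumptions made for the example. The point is that stacking the two weight matrices at load time turns two matrix multiplications over the same activations into one twice-as-large multiplication, with identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_embd, n_ff, n_tok = 64, 256, 8

# Stand-ins for one expert's ffn_up_exps / ffn_gate_exps weights.
w_up = rng.standard_normal((n_ff, n_embd)).astype(np.float32)
w_gate = rng.standard_normal((n_ff, n_embd)).astype(np.float32)
x = rng.standard_normal((n_embd, n_tok)).astype(np.float32)

def silu(v):
    return v / (1.0 + np.exp(-v))

# Unmerged path: two separate matmuls over the same activations.
ref = silu(w_gate @ x) * (w_up @ x)

# Merged path: stack the weights row-wise once at load time,
# run one 2x-larger matmul per token batch, then split the result.
w_merged = np.concatenate([w_up, w_gate], axis=0)  # (2*n_ff, n_embd)
y = w_merged @ x
up, gate = y[:n_ff], y[n_ff:]
fused = silu(gate) * up

assert np.allclose(ref, fused, atol=1e-5)
```

Since the merge is a pure reshuffling of weights, it can be done while loading any existing GGUF, which is why ik_llama.cpp needs no pre-merged model files.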

@hksdpc255 (Contributor)

Why don’t the mainline maintainers implement merge-up/gate in a similar way to yours? Your implementation seems much simpler than theirs.

@ikawrakow (Owner, Author)

Why don’t the mainline maintainers implement merge-up/gate in a similar way to yours? Your implementation seems much simpler than theirs.

You can go and ask them 😜

Their up/gate merge PR happened after #1137 here.

@ubergarm (Contributor) commented Mar 11, 2026

oof, I was worried about this (as well as imatrix compatibility) and hoping mainline would opt for a "runtime fusion" approach (like ik uses), as I mentioned here: ggml-org/llama.cpp#19139 (comment)

Such pre-merged models will not work with ik_llama.cpp.

yes, I just confirmed by testing @AesSedai's re-uploads here: https://huggingface.co/AesSedai/Qwen3.5-35B-A3B-GGUF

the new pre-fused quants throw this error:

print_info: max token length = 256
llm_load_tensors: ggml ctx size =    4.10 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate_exps.weight' not found
llama_model_load_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/raid/models/AesSedai/Qwen3.5-35B-A3B-GGUF/IQ4_XS/Qwen3.5-35B-A3B-IQ4_XS-00001-of-00002.gguf'
 ERR [              load_model] unable to load model | tid="136757531189248" timestamp=1773240905 model="/mnt/raid/models/AesSedai/Qwen3.5-35B-A3B-GGUF/IQ4_XS/Qwen3.5-35B-A3B-IQ4_XS-00001-of-00002.gguf"

what a bummer

@ubergarm (Contributor)

Can confirm that using this PR with -sm layer --merge-qkv -muge shows about a 2-6% increase in PP over the baseline -sm layer. Also showing -sm graph here for comparison.

[Plot: sweep-bench, Qwen3.5-122B-A10B IQ4_KSS, PR1403]

title: "ik_llama.cpp PR1403 qwen35moe_muge@d62e8e5d"
subtitle: "ubergarm/Qwen3.5-122B-A10B IQ4_KSS 61.219 GiB (4.306 BPW)"
hardware: "2x RTX A6000 (48GB VRAM each) Driver: 580.105.08 CUDA: 13.0 P2P: OK NCCL found!\n"

-sm graph

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm graph \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 128 0 1.319 3104.62 1.639 78.10
4096 128 4096 1.372 2986.31 1.644 77.88
4096 128 8192 1.416 2893.66 1.661 77.07
4096 128 12288 1.473 2781.05 1.694 75.57
4096 128 16384 1.531 2674.73 1.699 75.36
4096 128 20480 1.592 2572.56 1.705 75.06
4096 128 24576 1.649 2483.42 1.731 73.96
4096 128 28672 1.699 2411.32 1.736 73.74
4096 128 32768 1.764 2322.46 1.765 72.53
4096 128 36864 1.821 2249.65 1.769 72.35
4096 128 40960 1.877 2182.48 1.774 72.15
4096 128 45056 1.932 2120.15 1.797 71.23
4096 128 49152 1.986 2062.78 1.806 70.86
4096 128 53248 2.032 2015.42 1.813 70.60
4096 128 57344 2.088 1961.55 1.836 69.74
4096 128 61440 2.135 1918.67 1.842 69.48
4096 128 65536 2.192 1868.38 1.866 68.58
4096 128 69632 2.245 1824.58 1.879 68.14
4096 128 73728 2.303 1778.89 1.880 68.08
4096 128 77824 2.357 1737.79 1.905 67.20
4096 128 81920 2.400 1706.38 1.911 66.98
4096 128 86016 2.456 1668.01 1.922 66.61
4096 128 90112 2.510 1631.76 1.938 66.03
4096 128 94208 2.574 1591.24 1.945 65.80
4096 128 98304 2.623 1561.31 1.970 64.97
4096 128 102400 2.679 1529.01 1.977 64.74
4096 128 106496 2.735 1497.65 1.988 64.40
4096 128 110592 2.781 1472.92 2.010 63.67
4096 128 114688 2.846 1439.05 2.016 63.49
4096 128 118784 2.884 1420.08 2.041 62.72
4096 128 122880 2.941 1392.68 2.051 62.40
4096 128 126976 2.993 1368.40 2.054 62.32
4096 128 131072 3.044 1345.58 2.082 61.47

-sm layer

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm layer \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 128 0 2.073 1975.47 2.118 60.44
4096 128 4096 2.170 1887.89 2.136 59.93
4096 128 8192 2.277 1798.78 2.168 59.04
4096 128 12288 2.380 1720.86 2.192 58.40
4096 128 16384 2.485 1648.18 2.221 57.64
4096 128 20480 2.568 1594.75 2.230 57.40
4096 128 24576 2.673 1532.13 2.255 56.77
4096 128 28672 2.765 1481.16 2.280 56.14
4096 128 32768 2.863 1430.67 2.308 55.46
4096 128 36864 2.959 1384.42 2.312 55.37
4096 128 40960 3.058 1339.44 2.337 54.77
4096 128 45056 3.147 1301.58 2.364 54.16
4096 128 49152 3.250 1260.34 2.391 53.53
4096 128 53248 3.337 1227.58 2.400 53.33
4096 128 57344 3.433 1193.22 2.422 52.85
4096 128 61440 3.530 1160.39 2.450 52.25
4096 128 65536 3.617 1132.27 2.476 51.69
4096 128 69632 3.721 1100.72 2.486 51.49
4096 128 73728 3.813 1074.35 2.512 50.96
4096 128 77824 3.914 1046.41 2.536 50.48
4096 128 81920 4.001 1023.71 2.562 49.96
4096 128 86016 4.089 1001.83 2.573 49.75
4096 128 90112 4.200 975.32 2.595 49.33
4096 128 94208 4.276 957.89 2.622 48.81
4096 128 98304 4.367 938.00 2.646 48.37
4096 128 102400 4.465 917.29 2.669 47.97
4096 128 106496 4.571 896.17 2.676 47.83
4096 128 110592 4.655 879.84 2.702 47.37
4096 128 114688 4.740 864.14 2.729 46.91
4096 128 118784 4.836 847.05 2.756 46.44
4096 128 122880 4.931 830.67 2.762 46.34
4096 128 126976 5.021 815.75 2.784 45.98
4096 128 131072 5.112 801.19 2.814 45.49

-sm layer --merge-qkv -muge

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm layer \
  --merge-qkv \
  -muge \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 128 0 1.948 2102.43 2.114 60.54
4096 128 4096 2.042 2006.37 2.128 60.14
4096 128 8192 2.145 1909.84 2.155 59.41
4096 128 12288 2.244 1824.99 2.183 58.65
4096 128 16384 2.343 1748.14 2.213 57.85
4096 128 20480 2.438 1679.79 2.219 57.69
4096 128 24576 2.530 1618.95 2.242 57.08
4096 128 28672 2.634 1554.99 2.271 56.36
4096 128 32768 2.724 1503.67 2.299 55.67
4096 128 36864 2.825 1450.16 2.307 55.48
4096 128 40960 2.915 1405.16 2.330 54.94
4096 128 45056 3.014 1358.91 2.358 54.27
4096 128 49152 3.118 1313.70 2.385 53.67
4096 128 53248 3.203 1278.96 2.392 53.52
4096 128 57344 3.305 1239.45 2.417 52.95
4096 128 61440 3.400 1204.54 2.442 52.41
4096 128 65536 3.492 1173.10 2.467 51.87
4096 128 69632 3.591 1140.54 2.484 51.53
4096 128 73728 3.687 1110.88 2.505 51.10
4096 128 77824 3.781 1083.20 2.533 50.54
4096 128 81920 3.874 1057.25 2.556 50.08
4096 128 86016 3.976 1030.10 2.567 49.86
4096 128 90112 4.070 1006.40 2.588 49.46
4096 128 94208 4.162 984.26 2.615 48.95
4096 128 98304 4.250 963.88 2.640 48.48
4096 128 102400 4.355 940.56 2.662 48.09
4096 128 106496 4.453 919.87 2.672 47.90
4096 128 110592 4.538 902.69 2.696 47.47
4096 128 114688 4.631 884.52 2.724 46.99
4096 128 118784 4.721 867.68 2.752 46.52
4096 128 122880 4.824 849.12 2.758 46.42
4096 128 126976 4.904 835.19 2.781 46.03
4096 128 131072 5.000 819.20 2.809 45.57

@Ph0rk0z commented Mar 11, 2026

I was adding this to my command lines and wondering why it wasn't helping. Now I know.

@ikawrakow ikawrakow merged commit 1f4dcab into main Mar 11, 2026
@ikawrakow (Owner, Author)

Oh, if somebody is wondering why in mainline land they talk about a 10% performance improvement due to the merge, while in ik_llama.cpp we only see a 3-5% improvement: the performance gains come from two factors

  • Not quantizing the same activations twice. This has been in ik_llama.cpp for a very long time, and it represents the larger portion of the performance gain.
  • The matrix multiplication becomes 2x larger. In many MoE models the expert tensors are quite small, so CUDA matrix multiplications cannot fully utilize the available compute capacity. Merging ffn_up_exps and ffn_gate_exps helps with that, but it is a less important ingredient in the observed performance improvement.
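The first factor can be illustrated with a toy NumPy sketch (not ik_llama.cpp code; the symmetric 8-bit quantization and the counter are illustrative assumptions). With separate up and gate matmuls, a quantized backend must quantize the same activation vector once per matmul; with the merged tensor, the activations are quantized once and the results are unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_embd, n_ff = 64, 256
x = rng.standard_normal(n_embd).astype(np.float32)
w_up = rng.standard_normal((n_ff, n_embd)).astype(np.float32)
w_gate = rng.standard_normal((n_ff, n_embd)).astype(np.float32)

quant_calls = 0

def matmul_q8(w, v):
    """Toy quantized matmul: quantize activations to int8, then multiply."""
    global quant_calls
    quant_calls += 1
    scale = np.abs(v).max() / 127.0
    q = np.round(v / scale).astype(np.int8)
    return w @ (q.astype(np.float32) * scale)

# Unmerged path: each matmul quantizes the same activations independently.
up = matmul_q8(w_up, x)
gate = matmul_q8(w_gate, x)
assert quant_calls == 2  # x was quantized twice

# Merged path: one matmul, activations quantized once.
quant_calls = 0
w_merged = np.concatenate([w_up, w_gate], axis=0)
y = matmul_q8(w_merged, x)
assert quant_calls == 1
assert np.allclose(y[:n_ff], up, atol=1e-3)
assert np.allclose(y[n_ff:], gate, atol=1e-3)
```

Mainline gets both benefits at once from its pre-merge, which is why its reported gain is larger than the 3-5% seen here on top of the already-present quantize-once optimization.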
