
Add ability to merge up/gate expert tensors for Qwen3.5-MoE/Qwen3-Next #1403

Merged
ikawrakow merged 1 commit into main from ik/qwen35moe_muge on Mar 11, 2026

Conversation

@ikawrakow (Owner)

See #1137 for more details. Enabled via -muge on the command line. Only works with split mode layer (-sm layer).

We get a few percent PP performance improvement. For instance, for Qwen3.5-35B-A3B on a 3090 GPU:

ffn_up_exps and ffn_gate_exps not merged

PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
2048 128 0 0.456 4491.79 0.880 145.44
2048 128 2048 0.424 4827.55 0.861 148.64
2048 128 4096 0.433 4733.82 0.856 149.61
2048 128 6144 0.445 4604.26 0.869 147.35
2048 128 8192 0.451 4541.44 0.879 145.69
2048 128 10240 0.461 4443.35 0.889 144.05
2048 128 12288 0.471 4351.20 0.899 142.41
2048 128 14336 0.482 4250.46 0.904 141.61
2048 128 16384 0.492 4165.51 0.907 141.07
2048 128 18432 0.502 4075.81 0.915 139.84
2048 128 20480 0.510 4016.34 0.928 137.98
2048 128 22528 0.520 3936.46 0.938 136.40
2048 128 24576 0.529 3869.91 0.940 136.11
2048 128 26624 0.542 3779.71 0.944 135.57
2048 128 28672 0.551 3718.36 0.950 134.75
2048 128 30720 0.559 3661.77 0.955 134.09

ffn_up_exps and ffn_gate_exps merged

PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
2048 128 0 0.424 4834.69 0.845 151.42
2048 128 2048 0.405 5057.39 0.856 149.58
2048 128 4096 0.412 4969.45 0.860 148.77
2048 128 6144 0.423 4843.44 0.864 148.12
2048 128 8192 0.431 4750.75 0.874 146.53
2048 128 10240 0.440 4649.82 0.885 144.69
2048 128 12288 0.451 4537.33 0.901 142.11
2048 128 14336 0.462 4435.39 0.903 141.67
2048 128 16384 0.470 4357.42 0.907 141.09
2048 128 18432 0.481 4253.48 0.913 140.19
2048 128 20480 0.489 4187.76 0.919 139.27
2048 128 22528 0.499 4101.41 0.937 136.56
2048 128 24576 0.510 4016.21 0.939 136.25
2048 128 26624 0.520 3942.09 0.943 135.67
2048 128 28672 0.529 3868.80 0.949 134.94
2048 128 30720 0.539 3798.25 0.952 134.47

Note: in ik_llama.cpp the merge happens on-the-fly as the model gets loaded. This is unlike mainline llama.cpp, which requires a pre-merged model. Such pre-merged models will not work with ik_llama.cpp.
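The on-the-fly merge can be sketched as follows. This is a minimal NumPy illustration, not ik_llama.cpp code; the single-expert shapes and the stacking order (up rows first, then gate rows) are assumptions made for the example. The point is that stacking the two weight matrices at load time turns two matrix multiplications over the same activations into one twice-as-large multiplication, with identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_embd, n_ff, n_tok = 64, 256, 8

# Stand-ins for one expert's ffn_up_exps / ffn_gate_exps weights.
w_up = rng.standard_normal((n_ff, n_embd)).astype(np.float32)
w_gate = rng.standard_normal((n_ff, n_embd)).astype(np.float32)
x = rng.standard_normal((n_embd, n_tok)).astype(np.float32)

def silu(v):
    return v / (1.0 + np.exp(-v))

# Unmerged path: two separate matmuls over the same activations.
ref = silu(w_gate @ x) * (w_up @ x)

# Merged path: stack the weights row-wise once at load time,
# run one 2x-larger matmul per token batch, then split the result.
w_merged = np.concatenate([w_up, w_gate], axis=0)  # (2*n_ff, n_embd)
y = w_merged @ x
up, gate = y[:n_ff], y[n_ff:]
fused = silu(gate) * up

assert np.allclose(ref, fused, atol=1e-5)
```

Since the merge is a pure reshuffling of weights, it can be done while loading any existing GGUF, which is why ik_llama.cpp needs no pre-merged model files.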

@hksdpc255 (Contributor)

Why don’t the mainline maintainers implement merge-up/gate in a similar way to yours? Your implementation seems much simpler than theirs.

@ikawrakow (Owner, Author)

Why don’t the mainline maintainers implement merge-up/gate in a similar way to yours? Your implementation seems much simpler than theirs.

You can go and ask them 😜

Their up/gate merge PR happened after #1137 here.

@ubergarm (Contributor) commented Mar 11, 2026

oof, I was worried about this (as well as imatrix compatibility) and hoping mainline would opt for a "runtime fusion" approach (like ik uses), as I mentioned here: ggml-org/llama.cpp#19139 (comment)

Such pre-merged models will not work with ik_llama.cpp.

yes, I just confirmed by testing @AesSedai's re-uploads here: https://huggingface.co/AesSedai/Qwen3.5-35B-A3B-GGUF

the new pre-fused quants throw this error:

print_info: max token length = 256
llm_load_tensors: ggml ctx size =    4.10 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate_exps.weight' not found
llama_model_load_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/raid/models/AesSedai/Qwen3.5-35B-A3B-GGUF/IQ4_XS/Qwen3.5-35B-A3B-IQ4_XS-00001-of-00002.gguf'
 ERR [              load_model] unable to load model | tid="136757531189248" timestamp=1773240905 model="/mnt/raid/models/AesSedai/Qwen3.5-35B-A3B-GGUF/IQ4_XS/Qwen3.5-35B-A3B-IQ4_XS-00001-of-00002.gguf"

what a bummer

@ubergarm (Contributor)

Can confirm that using this PR with -sm layer --merge-qkv -muge shows about a 2-6% increase in PP over the baseline -sm layer. Also showing -sm graph here for comparison.

[Plot: sweep-bench, Qwen3.5-122B-A10B IQ4_KSS, PR1403]

title: "ik_llama.cpp PR1403 qwen35moe_muge@d62e8e5d"
subtitle: "ubergarm/Qwen3.5-122B-A10B IQ4_KSS 61.219 GiB (4.306 BPW)"
hardware: "2x RTX A6000 (48GB VRAM each) Driver: 580.105.08 CUDA: 13.0 P2P: OK NCCL found!\n"

-sm graph

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm graph \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 128 0 1.319 3104.62 1.639 78.10
4096 128 4096 1.372 2986.31 1.644 77.88
4096 128 8192 1.416 2893.66 1.661 77.07
4096 128 12288 1.473 2781.05 1.694 75.57
4096 128 16384 1.531 2674.73 1.699 75.36
4096 128 20480 1.592 2572.56 1.705 75.06
4096 128 24576 1.649 2483.42 1.731 73.96
4096 128 28672 1.699 2411.32 1.736 73.74
4096 128 32768 1.764 2322.46 1.765 72.53
4096 128 36864 1.821 2249.65 1.769 72.35
4096 128 40960 1.877 2182.48 1.774 72.15
4096 128 45056 1.932 2120.15 1.797 71.23
4096 128 49152 1.986 2062.78 1.806 70.86
4096 128 53248 2.032 2015.42 1.813 70.60
4096 128 57344 2.088 1961.55 1.836 69.74
4096 128 61440 2.135 1918.67 1.842 69.48
4096 128 65536 2.192 1868.38 1.866 68.58
4096 128 69632 2.245 1824.58 1.879 68.14
4096 128 73728 2.303 1778.89 1.880 68.08
4096 128 77824 2.357 1737.79 1.905 67.20
4096 128 81920 2.400 1706.38 1.911 66.98
4096 128 86016 2.456 1668.01 1.922 66.61
4096 128 90112 2.510 1631.76 1.938 66.03
4096 128 94208 2.574 1591.24 1.945 65.80
4096 128 98304 2.623 1561.31 1.970 64.97
4096 128 102400 2.679 1529.01 1.977 64.74
4096 128 106496 2.735 1497.65 1.988 64.40
4096 128 110592 2.781 1472.92 2.010 63.67
4096 128 114688 2.846 1439.05 2.016 63.49
4096 128 118784 2.884 1420.08 2.041 62.72
4096 128 122880 2.941 1392.68 2.051 62.40
4096 128 126976 2.993 1368.40 2.054 62.32
4096 128 131072 3.044 1345.58 2.082 61.47

-sm layer

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm layer \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 128 0 2.073 1975.47 2.118 60.44
4096 128 4096 2.170 1887.89 2.136 59.93
4096 128 8192 2.277 1798.78 2.168 59.04
4096 128 12288 2.380 1720.86 2.192 58.40
4096 128 16384 2.485 1648.18 2.221 57.64
4096 128 20480 2.568 1594.75 2.230 57.40
4096 128 24576 2.673 1532.13 2.255 56.77
4096 128 28672 2.765 1481.16 2.280 56.14
4096 128 32768 2.863 1430.67 2.308 55.46
4096 128 36864 2.959 1384.42 2.312 55.37
4096 128 40960 3.058 1339.44 2.337 54.77
4096 128 45056 3.147 1301.58 2.364 54.16
4096 128 49152 3.250 1260.34 2.391 53.53
4096 128 53248 3.337 1227.58 2.400 53.33
4096 128 57344 3.433 1193.22 2.422 52.85
4096 128 61440 3.530 1160.39 2.450 52.25
4096 128 65536 3.617 1132.27 2.476 51.69
4096 128 69632 3.721 1100.72 2.486 51.49
4096 128 73728 3.813 1074.35 2.512 50.96
4096 128 77824 3.914 1046.41 2.536 50.48
4096 128 81920 4.001 1023.71 2.562 49.96
4096 128 86016 4.089 1001.83 2.573 49.75
4096 128 90112 4.200 975.32 2.595 49.33
4096 128 94208 4.276 957.89 2.622 48.81
4096 128 98304 4.367 938.00 2.646 48.37
4096 128 102400 4.465 917.29 2.669 47.97
4096 128 106496 4.571 896.17 2.676 47.83
4096 128 110592 4.655 879.84 2.702 47.37
4096 128 114688 4.740 864.14 2.729 46.91
4096 128 118784 4.836 847.05 2.756 46.44
4096 128 122880 4.931 830.67 2.762 46.34
4096 128 126976 5.021 815.75 2.784 45.98
4096 128 131072 5.112 801.19 2.814 45.49

-sm layer --merge-qkv -muge

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm layer \
  --merge-qkv \
  -muge \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 128 0 1.948 2102.43 2.114 60.54
4096 128 4096 2.042 2006.37 2.128 60.14
4096 128 8192 2.145 1909.84 2.155 59.41
4096 128 12288 2.244 1824.99 2.183 58.65
4096 128 16384 2.343 1748.14 2.213 57.85
4096 128 20480 2.438 1679.79 2.219 57.69
4096 128 24576 2.530 1618.95 2.242 57.08
4096 128 28672 2.634 1554.99 2.271 56.36
4096 128 32768 2.724 1503.67 2.299 55.67
4096 128 36864 2.825 1450.16 2.307 55.48
4096 128 40960 2.915 1405.16 2.330 54.94
4096 128 45056 3.014 1358.91 2.358 54.27
4096 128 49152 3.118 1313.70 2.385 53.67
4096 128 53248 3.203 1278.96 2.392 53.52
4096 128 57344 3.305 1239.45 2.417 52.95
4096 128 61440 3.400 1204.54 2.442 52.41
4096 128 65536 3.492 1173.10 2.467 51.87
4096 128 69632 3.591 1140.54 2.484 51.53
4096 128 73728 3.687 1110.88 2.505 51.10
4096 128 77824 3.781 1083.20 2.533 50.54
4096 128 81920 3.874 1057.25 2.556 50.08
4096 128 86016 3.976 1030.10 2.567 49.86
4096 128 90112 4.070 1006.40 2.588 49.46
4096 128 94208 4.162 984.26 2.615 48.95
4096 128 98304 4.250 963.88 2.640 48.48
4096 128 102400 4.355 940.56 2.662 48.09
4096 128 106496 4.453 919.87 2.672 47.90
4096 128 110592 4.538 902.69 2.696 47.47
4096 128 114688 4.631 884.52 2.724 46.99
4096 128 118784 4.721 867.68 2.752 46.52
4096 128 122880 4.824 849.12 2.758 46.42
4096 128 126976 4.904 835.19 2.781 46.03
4096 128 131072 5.000 819.20 2.809 45.57

@Ph0rk0z commented Mar 11, 2026

I was adding this to my command lines and wondering why it wasn't helping. Now I know.

@ikawrakow ikawrakow merged commit 1f4dcab into main Mar 11, 2026
@ikawrakow (Owner, Author)

Oh, if somebody is wondering why in mainline land they talk about a 10% performance improvement due to the merge, while in ik_llama.cpp we only see a 3-5% improvement: the performance gains come from two factors

  • Not quantizing the same activations twice. This has been in ik_llama.cpp for a very long time, and it represents the larger portion of the performance gain.
  • The matrix multiplication becomes 2x larger. In many MoE models the expert tensors are quite small, so CUDA matrix multiplications cannot fully utilize the available compute capacity. Merging ffn_up_exps and ffn_gate_exps helps with that, but it is a less important ingredient in the observed performance improvement.
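The first factor can be illustrated with a toy NumPy sketch (not ik_llama.cpp code; the symmetric 8-bit quantization and the counter are illustrative assumptions). With separate up and gate matmuls, a quantized backend must quantize the same activation vector once per matmul; with the merged tensor, the activations are quantized once and the results are unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_embd, n_ff = 64, 256
x = rng.standard_normal(n_embd).astype(np.float32)
w_up = rng.standard_normal((n_ff, n_embd)).astype(np.float32)
w_gate = rng.standard_normal((n_ff, n_embd)).astype(np.float32)

quant_calls = 0

def matmul_q8(w, v):
    """Toy quantized matmul: quantize activations to int8, then multiply."""
    global quant_calls
    quant_calls += 1
    scale = np.abs(v).max() / 127.0
    q = np.round(v / scale).astype(np.int8)
    return w @ (q.astype(np.float32) * scale)

# Unmerged path: each matmul quantizes the same activations independently.
up = matmul_q8(w_up, x)
gate = matmul_q8(w_gate, x)
assert quant_calls == 2  # x was quantized twice

# Merged path: one matmul, activations quantized once.
quant_calls = 0
w_merged = np.concatenate([w_up, w_gate], axis=0)
y = matmul_q8(w_merged, x)
assert quant_calls == 1
assert np.allclose(y[:n_ff], up, atol=1e-3)
assert np.allclose(y[n_ff:], gate, atol=1e-3)
```

Mainline gets both benefits at once from its pre-merge, which is why its reported gain is larger than the 3-5% seen here on top of the already-present quantize-once optimization.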
