Add ability to merge up/gate expert tensors to Qwen3.5-MoE/Qwen3-Next #1403
Conversation
Why don’t the mainline maintainers implement the merge-up/gate in a similar way to yours? Your implementation seems much simpler than theirs.
You can go and ask them 😜 Their up/gate merge PR happened after #1137 here. |
Oof, I was worried about this (as well as imatrix compatibility) and was hoping mainline would opt for a "runtime fusion" approach (like ik uses), as I mentioned here: ggml-org/llama.cpp#19139 (comment)
Yes, I just confirmed by testing @AesSedai's re-uploads here: https://huggingface.co/AesSedai/Qwen3.5-35B-A3B-GGUF. The new pre-fused quants throw this error:

```
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 4.10 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate_exps.weight' not found
llama_model_load_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/raid/models/AesSedai/Qwen3.5-35B-A3B-GGUF/IQ4_XS/Qwen3.5-35B-A3B-IQ4_XS-00001-of-00002.gguf'
ERR [ load_model] unable to load model | tid="136757531189248" timestamp=1773240905 model="/mnt/raid/models/AesSedai/Qwen3.5-35B-A3B-GGUF/IQ4_XS/Qwen3.5-35B-A3B-IQ4_XS-00001-of-00002.gguf"
```

What a bummer.
I was adding this to my command lines and wondering why it wasn't helping. Now I know.
Oh, if somebody is wondering why in mainline land they talk about a 10% performance improvement due to the merge, while in …

See #1137 for more details. Enabled via `-muge` on the command line. Only works for split mode `layer`. We get a few percent PP performance improvement. For instance, for Qwen3.5-35B-A3B on a 3090 GPU:
ffn_up_exps and ffn_gate_exps not merged
ffn_up_exps and ffn_gate_exps merged
Note: in `ik_llama.cpp` the merge happens on-the-fly as the model gets loaded. This is unlike mainline `llama.cpp`, which requires a pre-merged model. Such pre-merged models will not work with `ik_llama.cpp`.
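The idea behind the merge (whether done on-the-fly at load time or baked into the GGUF) can be sketched in plain NumPy: concatenating the gate and up projection weights lets a single, larger matmul produce both halves of the SwiGLU activation, instead of two smaller matmuls per expert. This is an illustrative sketch, not the actual `ik_llama.cpp` code; all variable names below are made up.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal(d_model)

# Separate gate/up projections, as stored in an unmerged GGUF
# (blk.N.ffn_gate_exps.weight / blk.N.ffn_up_exps.weight per expert).
W_gate = rng.standard_normal((d_ff, d_model))
W_up = rng.standard_normal((d_ff, d_model))
y_separate = silu(W_gate @ x) * (W_up @ x)

# "Runtime fusion": concatenate the two weight matrices once when the
# model is loaded, then issue a single matmul per token and split the
# result into its gate and up halves.
W_fused = np.concatenate([W_gate, W_up], axis=0)
h = W_fused @ x
y_fused = silu(h[:d_ff]) * h[d_ff:]

assert np.allclose(y_separate, y_fused)
```

The speedup comes from launching one GEMM instead of two per expert; doing the concatenation at load time (rather than at quantization time) keeps the on-disk tensor names unchanged, which is why unmerged GGUFs still load here while mainline's pre-merged ones do not.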