Merge `ffn_up` and `ffn_gate` experts tensors #1137
Conversation
However, it is extremely stupid. The only way I could correctly repack the up/gate experts is to copy up and gate into host buffers, repack into another host buffer, and copy back into the `ffn_up_gate_exps` tensor. This is going to be very slow for giant 500 GB models. My attempts to do this via a compute graph on the backend holding the tensors were unsuccessful. For GPT-OSS-20B I see ~6-7% better PP when using the original ik_llama.cpp fused up/gate CUDA implementation, and ~10% when using the small batch size implementation. Other models are not working yet on CUDA, as I need to fix the fused mul-unary implementation.
But when I say "working" here and in the previous commit, I mean that PP is working. TG is still broken.
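The host-side repack described above can be sketched in plain C++. This is a minimal illustration under assumed layout, not the actual ik_llama.cpp code: both expert tensors are taken to be contiguous row-major `[n_expert][n_ff][n_embd]` slabs, and the fused tensor stacks the gate rows after the up rows within each expert.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side merge of up/gate expert tensors.
// up, gate: row-major [n_expert][n_ff][n_embd] slabs already copied to host.
// Result: [n_expert][2*n_ff][n_embd], up rows first, then gate rows, per expert.
static std::vector<float> merge_up_gate(const std::vector<float> & up,
                                        const std::vector<float> & gate,
                                        size_t n_expert, size_t n_ff, size_t n_embd) {
    assert(up.size() == n_expert * n_ff * n_embd && up.size() == gate.size());
    std::vector<float> fused(2 * up.size());
    const size_t per_exp = n_ff * n_embd; // elements per expert in each source tensor
    for (size_t e = 0; e < n_expert; ++e) {
        std::copy(up.begin()   + e * per_exp, up.begin()   + (e + 1) * per_exp,
                  fused.begin() + (2 * e)     * per_exp);
        std::copy(gate.begin() + e * per_exp, gate.begin() + (e + 1) * per_exp,
                  fused.begin() + (2 * e + 1) * per_exp);
    }
    return fused;
}
```

In the actual flow this buffer would then be copied back into the device-resident fused tensor, which is exactly the copy-out/copy-back round trip that makes loading slow for huge models.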
It is not yet implemented
I tried the merging approach at run time for the QKV merge (ggml-org/llama.cpp#16813), and it turned out to be quite messy. I have not looked at how you implemented this PR, but glad to see you also observe a similar improvement!
The Q, K, V merge at run time is actually quite simple, and has been available here for a while. Merging …
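The run-time Q, K, V merge mentioned above amounts to stacking the three projection matrices row-wise so a single matmul produces the concatenated outputs. A minimal plain-C++ sketch under that assumption (hypothetical names, not the actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fused QKV projection: wqkv is Wq, Wk, Wv stacked row-wise,
// shape [(dq + dk + dv) x n_embd], row-major. One matvec yields [q; k; v],
// which callers then slice by row offset instead of running three matmuls.
static std::vector<float> qkv_matvec(const std::vector<float> & wqkv,
                                     const std::vector<float> & x, // [n_embd]
                                     size_t n_out) {               // dq + dk + dv
    const size_t n_embd = x.size();
    assert(wqkv.size() == n_out * n_embd);
    std::vector<float> y(n_out, 0.0f);
    for (size_t r = 0; r < n_out; ++r)
        for (size_t c = 0; c < n_embd; ++c)
            y[r] += wqkv[r * n_embd + c] * x[c];
    return y; // q = y[0..dq), k = y[dq..dq+dk), v = y[dq+dk..n_out)
}
```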
Does this work with Qwen3VL too? I can test with mixed inference if it does.
Added in PR #1139. I haven't tested Qwen3VL-MoE, so would appreciate your feedback.
I have been thinking about merging the `ffn_up` and `ffn_gate` experts tensors into a single tensor for a while. But, this being a fairly intrusive change, and me not being sure how much performance improvement one might get from it, I have been reluctant to make the necessary changes. But now there is PR 18470 in mainline llama.cpp, which claims a 10% PP performance improvement for GPT-OSS-20B, so I became curious to see if the merge would be as beneficial in ik_llama.cpp as it is in mainline. This PR is the result.

Unlike PR 18470, where the merge is done during model conversion, which requires everyone to re-download many gigabytes of data, in this PR the merge is done on-the-fly during model loading. The implementation of the merging is not ideal at this point, but I became tired of fighting against the machine (the "machine" being the llama.cpp model loading machinery and ggml-backend limitations in this case), so I can have something working to test. Basically, the `ffn_up_exps` and `ffn_gate_exps` tensors are copied into temporary buffers (possibly from a GPU), the merge is prepared on the host, and the merged tensor is then copied back to the corresponding device. This may add significant additional model loading time for very large models.

Limitations:

…layer. As above, if feedback is positive, I think it will not be hard to also add it to split mode `graph`.

The merge is disabled by default. To enable it one needs to add the command-line argument `-muge`.
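Once the tensors are merged, the fused mul-unary step the first comment refers to can be illustrated as follows. This is a hedged sketch, not the CUDA kernel: a single matmul against the merged tensor is assumed to yield, per token, `2*n_ff` values with the up half first and the gate half second (matching the merge order above), and the FFN epilogue computes `silu(gate) * up` in one pass instead of two separate ops.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical fused mul-unary epilogue for one token.
// z: output of the single fused matmul, [up(0..n_ff) | gate(n_ff..2*n_ff)].
// Returns silu(gate) * up elementwise.
static std::vector<float> fused_silu_mul(const std::vector<float> & z, size_t n_ff) {
    assert(z.size() == 2 * n_ff);
    std::vector<float> out(n_ff);
    for (size_t i = 0; i < n_ff; ++i) {
        const float up   = z[i];
        const float gate = z[n_ff + i];
        out[i] = up * (gate / (1.0f + std::exp(-gate))); // silu(gate) * up
    }
    return out;
}
```

The payoff of the merge is that the up and gate halves arrive in one matmul output, so this epilogue replaces the separate gate-activation and multiply kernels.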
Below are some performance comparisons for Qwen3-30B-A3B-IQ2_XXS and GPT-OSS-20B-MXFP4 with full offload on an RTX-3090 GPU. We observe ~7% (GPT-OSS-20B) or ~10% (Qwen3-30B-A3B) better PP.
GPT-OSS-20B
GPT-OSS-20B with -muge (this PR)
Qwen3-30B-A3B-IQ2_XXS
Qwen3-30B-A3B-IQ2_XXS with -muge (this PR)