Fused MoE ffn_up and ffn_gate #229
Merged
In all MoE models one has the following sequence of operations as part of the feed forward network (simplified):
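(A sketch using standard `ggml` calls; the tensor names and the activation function, SiLU here, are illustrative and vary by model.)

```c++
// cur              : current hidden state (second operand shared by up and gate)
// selected_experts : ids of the experts chosen by the router
// *_exps           : stacked per-expert weight tensors
struct ggml_tensor * up   = ggml_mul_mat_id(ctx, up_exps,   cur, selected_experts); // "up"
struct ggml_tensor * gate = ggml_mul_mat_id(ctx, gate_exps, cur, selected_experts); // "gate"
struct ggml_tensor * act  = ggml_silu(ctx, gate);                                   // "act"
struct ggml_tensor * par  = ggml_mul(ctx, up, act);                                 // "par"
struct ggml_tensor * down = ggml_mul_mat_id(ctx, down_exps, par, selected_experts); // "down"
```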
Each of the `ggml_mul_mat_id` operations requires a search of the activated experts (which is the same for all 3). Also, `up` and `gate` have the same second operand, so that, if they are quantized, the quantization is unnecessarily repeated. There is a barrier after each operation. On CUDA there is no implementation of indirect matrix multiplication, so each `ggml_mul_mat_id` op triggers a copy of the rows of the second operand to a contiguous memory block, the actual matrix multiplication, and then another copy from the contiguous matrix multiplication result to the non-contiguous op result. All of this adds overhead, thus reducing performance.
This PR adds a new `ggml` op that fuses the `up`, `gate` and `act` operations. On CUDA, if the next op in the computation graph is the `par` op, it is auto-fused as well. The `down` operation is not included for now, but a future PR may do so.

This is relevant for the performance of the large DeepSeekV3/R1 models. I don't have the means to run DeepSeekV3/R1, hence I'm using DeepSeek-Lite (very similar architecture, but only 16B parameters with 2.4B active parameters). For this model we gain ~3-4% in prompt processing (PP) speed and 1-2% in token generation (TG) speed when running on the CPU. The performance gains are much more significant on CUDA: about 26% speedup for PP and 7% for TG. On my RTX-4080 I now get `PP-512 = 4400 t/s` for DeepSeek-Lite. This is still much too low compared to a dense model with 2.4B parameters (one should get in the range of 15,000 t/s), but quite a bit better than the 3450 t/s one gets on the main branch (and also in mainline `llama.cpp`).
As the new op is not implemented on all platforms (Metal is missing), it is enabled via a command line option that is off by default. To turn it on, use `-fmoe` or `--fused-moe`.
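For example, append the flag to a typical invocation (binary name and model path are placeholders; only the `-fmoe`/`--fused-moe` flag itself comes from this PR): `./llama-cli -m deepseek-lite.gguf -fmoe`.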
Obviously this option cannot be used when computing an imatrix, because then the intermediate results remain in temporary work buffers and hence will not be propagated to collect activation statistics for the `up_exps` and `gate_exps` tensors.