Fused MoE ffn_up and ffn_gate #229
Merged
In all MoE models one has the following sequence of operations as part of the feed forward network (simplified):
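(A sketch using standard `ggml` calls; the tensor names and the activation function, SiLU here, are illustrative and vary by model.)

```c++
// cur              : current hidden state (second operand shared by up and gate)
// selected_experts : ids of the experts chosen by the router
// *_exps           : stacked per-expert weight tensors
struct ggml_tensor * up   = ggml_mul_mat_id(ctx, up_exps,   cur, selected_experts); // "up"
struct ggml_tensor * gate = ggml_mul_mat_id(ctx, gate_exps, cur, selected_experts); // "gate"
struct ggml_tensor * act  = ggml_silu(ctx, gate);                                   // "act"
struct ggml_tensor * par  = ggml_mul(ctx, up, act);                                 // "par"
struct ggml_tensor * down = ggml_mul_mat_id(ctx, down_exps, par, selected_experts); // "down"
```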
Each of the `ggml_mul_mat_id` operations requires a search of the activated experts (which is the same for all 3). Also, `up` and `gate` have the same second operand, so that, if they are quantized, the quantization is unnecessarily repeated. There is a barrier after each operation. On CUDA there is no implementation of indirect matrix multiplication, so each `ggml_mul_mat_id` op triggers a copy of the rows of the second operand to a contiguous memory block, the actual matrix multiplication, and then another copy from the contiguous matrix multiplication result to the non-contiguous op result. All of this adds overhead, thus reducing performance.
This PR adds a new `ggml` op that fuses the `up`, `gate` and `act` operations. On CUDA, if the next op in the computation graph is the `par` op, it is auto-fused as well. The `down` operation is not included for now, but a future PR may do so.

This is relevant for the performance of the large DeepSeekV3/R1 models. I don't have the means to run DeepSeekV3/R1, hence I'm using DeepSeek-Lite (very similar architecture, but only 16B parameters with 2.4B active parameters). For this model we gain ~3-4% in prompt processing (PP) speed and 1-2% in token generation (TG) speed when running on the CPU. The performance gains are much more significant on CUDA: about 26% speedup for PP and 7% for TG. On my RTX-4080 I now get `PP-512 = 4400 t/s` for DeepSeek-Lite. This is still much too low compared to a dense model with 2.4B parameters (one should get in the range of 15,000 t/s), but quite a bit better than the 3450 t/s one gets on the main branch (and also in mainline `llama.cpp`).
As the new op is not implemented on all platforms (Metal is missing), it is enabled via a command line option that is off by default. To turn it on, use `-fmoe` or `--fused-moe`.
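For example, append the flag to a typical invocation (binary name and model path are placeholders; only the `-fmoe`/`--fused-moe` flag itself comes from this PR): `./llama-cli -m deepseek-lite.gguf -fmoe`.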
Obviously this option cannot be used when computing an imatrix, because then the intermediate results remain in temporary work buffers and hence will not be propagated to collect activation statistics for the `up_exps` and `gate_exps` tensors.