
Split mode graph for models with pre-merged ffn_up/ffn_gate experts#1412

Merged: ikawrakow merged 7 commits into main from ik/sm_graph_pre_merged_up_gate, Mar 12, 2026

Conversation

@ikawrakow (Owner)

GGUFs where the ffn_up_exps and ffn_gate_exps tensors have been merged into a combined ffn_gate_up_exps tensor have been popping up on HF, see for instance https://huggingface.co/AesSedai/Qwen3.5-35B-A3B-GGUF. This PR adds the ability to use split mode graph (a.k.a. tensor parallel) with such models.
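To illustrate the idea behind the merged tensor (a minimal plain-Python sketch, not the actual ggml code; the row layout below — gate rows first, then up rows, per expert — is an assumption for illustration): one mat-vec over the merged tensor produces both the gate and the up activations, which are then split back into two halves.

```python
def merge_gate_up(gate_rows, up_rows):
    # Hypothetical merged layout: the n_ff gate rows, then the n_ff up rows.
    return gate_rows + up_rows

def fused_matvec(merged, x):
    # A single mat-vec over the merged tensor computes gate and up together.
    return [sum(w * v for w, v in zip(row, x)) for row in merged]

def split_fused(fused, n_ff):
    # First n_ff entries are gate activations, the remaining n_ff are up.
    return fused[:n_ff], fused[n_ff:]

# Toy expert with n_ff = 2, n_embd = 2.
gate = [[1.0, 2.0], [3.0, 4.0]]
up   = [[5.0, 6.0], [7.0, 8.0]]
x    = [1.0, 1.0]

g, u = split_fused(fused_matvec(merge_gate_up(gate, up), x), n_ff=2)
```

The result is identical to running two separate mat-vecs with the un-merged tensors; the win is one larger matmul instead of two smaller ones.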

In ik_llama.cpp the ability to merge ffn_up_exps and ffn_gate_exps has been around for a while (see #1137, enabled via -muge), but I have preferred the merge to be done on-the-fly while loading the model rather than the tensors being pre-merged. Split mode graph, however, did not support merged ffn_up_exps and ffn_gate_exps tensors, mainly because I did not consider the minor performance benefit to be worth the added complexity: splitting the merged tensors between GPUs is quite a bit more complicated, and the compute-graph building logic needs extra checks for the presence of the merged tensor. But now that mainline maintainers have forced my hand by releasing this incompatible change, I decided to bite the bullet and add the ability to use split mode graph with pre-merged models. The PR follows in the footsteps of #1408, which added the ability to use pre-merged models in the first place.
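To see why splitting the merged tensor is more involved, here is a hedged sketch (plain Python; the real splitting logic lives in C++ and handles alignment, quantization block sizes, and more). With separate tensors, each GPU takes one contiguous row range of ffn_gate_exps and the same range of ffn_up_exps. With the merged tensor, each GPU needs two non-contiguous slices: its share of the gate half plus the matching share of the up half.

```python
def split_ranges_plain(n_ff, n_gpu):
    # Separate tensors: GPU i gets one contiguous row range [lo, hi).
    step = (n_ff + n_gpu - 1) // n_gpu
    return [(i * step, min((i + 1) * step, n_ff)) for i in range(n_gpu)]

def split_ranges_merged(n_ff, n_gpu):
    # Merged tensor (gate rows 0..n_ff, up rows n_ff..2*n_ff):
    # GPU i needs TWO slices - its gate share and the matching up share -
    # so each GPU's portion is no longer a single contiguous block.
    plain = split_ranges_plain(n_ff, n_gpu)
    return [((lo, hi), (n_ff + lo, n_ff + hi)) for lo, hi in plain]
```

For example, with n_ff = 8 split across 2 GPUs, GPU 0 owns rows (0, 4) and (8, 12) of the merged tensor, while with separate tensors it would simply own rows (0, 4) of each.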

I have used the Qwen3.5-35B-A3B IQ4_XS variant from AesSedai for testing on a 2x3090 system. llama.cpp Qwen-3.5 PP performance has improved quite a bit since I last checked. This is clearly due to PR 20304 from yesterday, which enables the fused delta net implementation for PP as well (OK, they call it "gated delta net", but it is the same thing as the "fused delta net" added here in #1315 and further optimized in #1320, #1333, #1340). To verify, I have added llama.cpp PP performance results with the commit just before 20340 in orange. What does #20340 do? The same things as here: keep the state in shared memory instead of reading/writing global memory for each token, avoid the repeat of Q and K (#1373), etc. The "fused delta net" (a.k.a. "gated delta net") story in llama.cpp is quite interesting. They had had PR 18102 open since Dec 16 and never looked at it for 2+ months. PR #1315 here started from that PR, and boom, fused delta net became a thing in mainline. Yet another totally random coincidence, I guess.
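For context, here is the recurrence those kernels implement, sketched in plain Python (this is the mathematical gated delta rule only, not the actual CUDA code; variable names are mine). Per token, the d_k x d_v state S is decayed by a gate g, corrected via the delta rule with step size beta, and read out with the query. The kernel-level optimization discussed above amounts to keeping S resident in shared memory across all tokens of a chunk instead of round-tripping it through global memory per token.

```python
def gated_delta_step(S, q, k, v, g, beta):
    """One recurrent step of the gated delta rule; S is mutated in place."""
    d_k, d_v = len(S), len(S[0])
    # 1. Decay the state by the gate.
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] *= g
    # 2. Delta-rule correction: move the state's prediction for key k
    #    toward the true value v, scaled by beta.
    v_pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * (v[j] - v_pred[j])
    # 3. Read out with the query.
    return [sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]

S = [[0.0, 0.0], [0.0, 0.0]]  # d_k = d_v = 2, zero-initialized state
o = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
                     g=1.0, beta=1.0)
```

Because S must be carried from token to token, a naive implementation loads and stores it from global memory at every step; keeping it in shared memory removes that traffic entirely.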

Anyway, here are the graphs. The llama.cpp results stop earlier because I get this error

```
init_batch: failed to prepare attention ubatches
decode: failed to find a memory slot for batch of size 2048
failed to decode the batch, n_batch = 2048, ret = 1
main: llama_decode() failed
```

before the N_KV = 61440 result is produced.

[Graphs: q35_35b_a3b_pp (prompt processing), q35_35b_a3b_tg (token generation)]

ikawrakow merged commit c85361f into main on Mar 12, 2026
ikawrakow added a commit that referenced this pull request Mar 12, 2026
