
Split mode graph for models with pre-merged ffn_up/ffn_gate experts#1412

Merged: ikawrakow merged 7 commits into main from ik/sm_graph_pre_merged_up_gate, Mar 12, 2026

Conversation

@ikawrakow (Owner)

GGUFs where the ffn_up_exps and ffn_gate_exps tensors have been merged into a combined ffn_gate_up_exps tensor have been popping up on HF, see for instance https://huggingface.co/AesSedai/Qwen3.5-35B-A3B-GGUF. This PR adds the ability to use split mode graph (a.k.a. tensor parallel) with such models.
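To illustrate the idea behind the merged tensor (a minimal plain-Python sketch, not the actual ggml code; the row layout below — gate rows first, then up rows, per expert — is an assumption for illustration): one mat-vec over the merged tensor produces both the gate and the up activations, which are then split back into two halves.

```python
def merge_gate_up(gate_rows, up_rows):
    # Hypothetical merged layout: the n_ff gate rows, then the n_ff up rows.
    return gate_rows + up_rows

def fused_matvec(merged, x):
    # A single mat-vec over the merged tensor computes gate and up together.
    return [sum(w * v for w, v in zip(row, x)) for row in merged]

def split_fused(fused, n_ff):
    # First n_ff entries are gate activations, the remaining n_ff are up.
    return fused[:n_ff], fused[n_ff:]

# Toy expert with n_ff = 2, n_embd = 2.
gate = [[1.0, 2.0], [3.0, 4.0]]
up   = [[5.0, 6.0], [7.0, 8.0]]
x    = [1.0, 1.0]

g, u = split_fused(fused_matvec(merge_gate_up(gate, up), x), n_ff=2)
```

The result is identical to running two separate mat-vecs with the un-merged tensors; the win is one larger matmul instead of two smaller ones.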

In ik_llama.cpp the ability to merge ffn_up_exps and ffn_gate_exps has been around for a while (see #1137, enabled via -muge), but I have preferred the merge to be done on-the-fly while loading the model rather than the tensors being pre-merged. Split mode graph, however, did not support merged ffn_up_exps and ffn_gate_exps tensors, mainly because I did not consider the minor performance benefit to be worth the added complexity: splitting the merged tensors between GPUs is quite a bit more complicated, and the compute-graph building logic needs extra checks for the presence of the merged tensor. But now that mainline maintainers have forced my hand by releasing this incompatible change, I decided to bite the bullet and add the ability to use split mode graph with pre-merged models. The PR follows in the footsteps of #1408, which added the ability to use pre-merged models in the first place.
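To see why splitting the merged tensor is more involved, here is a hedged sketch (plain Python; the real splitting logic lives in C++ and handles alignment, quantization block sizes, and more). With separate tensors, each GPU takes one contiguous row range of ffn_gate_exps and the same range of ffn_up_exps. With the merged tensor, each GPU needs two non-contiguous slices: its share of the gate half plus the matching share of the up half.

```python
def split_ranges_plain(n_ff, n_gpu):
    # Separate tensors: GPU i gets one contiguous row range [lo, hi).
    step = (n_ff + n_gpu - 1) // n_gpu
    return [(i * step, min((i + 1) * step, n_ff)) for i in range(n_gpu)]

def split_ranges_merged(n_ff, n_gpu):
    # Merged tensor (gate rows 0..n_ff, up rows n_ff..2*n_ff):
    # GPU i needs TWO slices - its gate share and the matching up share -
    # so each GPU's portion is no longer a single contiguous block.
    plain = split_ranges_plain(n_ff, n_gpu)
    return [((lo, hi), (n_ff + lo, n_ff + hi)) for lo, hi in plain]
```

For example, with n_ff = 8 split across 2 GPUs, GPU 0 owns rows (0, 4) and (8, 12) of the merged tensor, while with separate tensors it would simply own rows (0, 4) of each.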

I have used the Qwen3.5-35B-A3B IQ4_XS variant from AesSedai for testing on a 2x3090 system. llama.cpp Qwen-3.5 PP performance has improved quite a bit since I last checked. This is clearly due to PR 20304 from yesterday, which enables the fused delta net implementation for PP as well (OK, they call it "gated delta net", but it is the same thing as the "fused delta net" added here in #1315 and further optimized in #1320, #1333, #1340). To verify, I have added llama.cpp PP performance results with the commit just before 20340 in orange. What does #20340 do? The same things as here: keep the state in shared memory instead of reading/writing global memory for each token, avoid the repeat of Q and K (#1373), etc. The "fused delta net" (a.k.a. "gated delta net") story in llama.cpp is quite interesting. They had had PR 18102 open since Dec 16 and never looked at it for 2+ months. PR #1315 here started from that PR, and boom, fused delta net became a thing in mainline. Yet another totally random coincidence, I guess.
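For context, here is the recurrence those kernels implement, sketched in plain Python (this is the mathematical gated delta rule only, not the actual CUDA code; variable names are mine). Per token, the d_k x d_v state S is decayed by a gate g, corrected via the delta rule with step size beta, and read out with the query. The kernel-level optimization discussed above amounts to keeping S resident in shared memory across all tokens of a chunk instead of round-tripping it through global memory per token.

```python
def gated_delta_step(S, q, k, v, g, beta):
    """One recurrent step of the gated delta rule; S is mutated in place."""
    d_k, d_v = len(S), len(S[0])
    # 1. Decay the state by the gate.
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] *= g
    # 2. Delta-rule correction: move the state's prediction for key k
    #    toward the true value v, scaled by beta.
    v_pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * (v[j] - v_pred[j])
    # 3. Read out with the query.
    return [sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]

S = [[0.0, 0.0], [0.0, 0.0]]  # d_k = d_v = 2, zero-initialized state
o = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
                     g=1.0, beta=1.0)
```

Because S must be carried from token to token, a naive implementation loads and stores it from global memory at every step; keeping it in shared memory removes that traffic entirely.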

Anyway, here are the graphs. The llama.cpp results stop earlier because I get this error

```
init_batch: failed to prepare attention ubatches
decode: failed to find a memory slot for batch of size 2048
failed to decode the batch, n_batch = 2048, ret = 1
main: llama_decode() failed
```

before the N_KV = 61440 result is produced.

[Graphs: q35_35b_a3b_pp (prompt processing), q35_35b_a3b_tg (token generation)]

ikawrakow merged commit c85361f into main on Mar 12, 2026
ikawrakow added a commit that referenced this pull request Mar 12, 2026
