Split mode graph for models with pre-merged ffn_up/ffn_gate experts#1412
Haha, mainline has elected to arrange the merged tensors the other way around compared to what I had done in the on-the-fly merge.
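For illustration only (a hypothetical sketch with made-up names, not the actual ggml merge code): the merge amounts to concatenating the two projection matrices along the output (row) dimension, and the incompatibility is simply which block occupies the first rows:

```cpp
#include <vector>

// Hypothetical sketch: ffn_up_exps and ffn_gate_exps for one expert are
// row-major [n_ff, n_embd] matrices, and the merged ffn_gate_up_exps
// tensor is [2*n_ff, n_embd]. The only question is which block comes
// first; the two codebases made opposite choices.
std::vector<float> merge_gate_up(const std::vector<float> & gate, // n_ff*n_embd values
                                 const std::vector<float> & up,   // n_ff*n_embd values
                                 bool gate_first) {
    std::vector<float> merged;
    merged.reserve(gate.size() + up.size());
    const auto & a = gate_first ? gate : up;
    const auto & b = gate_first ? up   : gate;
    merged.insert(merged.end(), a.begin(), a.end());
    merged.insert(merged.end(), b.begin(), b.end());
    return merged;
}
```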
GGUFs where the `ffn_up_exps` and `ffn_gate_exps` tensors have been merged into a combined `ffn_gate_up_exps` tensor have been popping up on HF, see for instance https://huggingface.co/AesSedai/Qwen3.5-35B-A3B-GGUF. This PR adds the ability to use split mode `graph` (a.k.a. tensor parallel) with such models.

In `ik_llama.cpp` the ability to merge `ffn_up_exps` and `ffn_gate_exps` has been around for a while (see #1137, enabled via `-muge`), but I have preferred the merge to be done on-the-fly while loading the model rather than the tensors being pre-merged. However, split mode `graph` did not support merging `ffn_up_exps` and `ffn_gate_exps`, the main reason being that I did not consider the minor performance benefit to be worth the added complexity (splitting the merged tensors between GPUs is quite a bit more complicated, and one then needs to complicate the compute graph building logic with checks for the presence of the merged tensor). But now that the mainline maintainers have forced my hand by releasing this incompatible change, I decided to bite the bullet and add the ability to use split mode `graph` with pre-merged models. The PR follows in the footsteps of #1408, which added the ability to use pre-merged models in the first place.

I have used the Qwen3.5-35B-A3B IQ4_XS variant from AesSedai for testing on a 2x3090 system.
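To make the added splitting complexity concrete, here is a minimal sketch (hypothetical names and layout, not the actual implementation): when the merged tensor is split across GPUs for tensor parallelism, one cannot simply hand each GPU a contiguous chunk of rows, because each GPU needs its share of both the gate block and the up block:

```cpp
#include <utility>

// Hypothetical sketch: splitting a merged [2*n_ff, n_embd]
// ffn_gate_up_exps tensor across n_gpu devices. An unmerged
// [n_ff, n_embd] tensor splits into one contiguous row range per
// device, but the merged tensor requires two disjoint row intervals
// per device: one from the gate block, one from the up block.
struct RowRange { int first, count; };

// Returns the (gate, up) row ranges device g owns, assuming the gate
// block occupies rows [0, n_ff) and the up block rows [n_ff, 2*n_ff),
// and that n_ff divides evenly by n_gpu.
std::pair<RowRange, RowRange> merged_split(int n_ff, int n_gpu, int g) {
    const int per = n_ff / n_gpu;
    const RowRange gate{ g*per,        per };
    const RowRange up  { n_ff + g*per, per };
    return { gate, up };
}
```

With the unmerged tensors, the same helper would be a single `RowRange` per device, which is why the merged case forces extra branching in the graph-building code.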
`llama.cpp` Qwen-3.5 PP performance has improved quite a bit since I last checked. This seems to be clearly due to PR 20304 from yesterday, which enables the fused delta net implementation also for PP (OK, they call it "gated delta net", but it is the same thing as the "fused delta net" added here in #1315 and then further optimized in #1320, #1333, #1340). To verify, I have added `llama.cpp` PP performance results with the commit just before 20340 in orange. What does #20340 do? The same things as here: keep the state in shared memory instead of reading/writing to global memory for each token, avoid the repeat of `Q` and `K` (#1373), etc.

The "fused delta net" (a.k.a. "gated delta net") story in `llama.cpp` is quite interesting. They had PR 18102 since Dec 16, but never looked at it for 2+ months. PR #1315 here started from that PR, and boom, fused delta net became a thing in mainline. I guess, yet another totally random coincidence.

Anyway, here are the graphs. The `llama.cpp` curves stop earlier because I get this error before the `N_KV = 61440` result is produced.