This is quite usable, I think.
It seems it is faster than the chunked implementation!
About +6% in decode (Qwen3.5-IQ4_KSS) on a 3975WX with two 3090s:
It's very nice. (A note to myself: it would be nice to test the speed comparison further.)

[EDIT]: it's crazy one can get such performance with DDR4, lol. 33 t/s -> 38 t/s

[EDIT2]: speed comparison to a similar quant and, apparently, regular llama.cpp: https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF/discussions/4#6993f9542c659709a88d4cc2 Damn ... 🥇

It's very nice. It's almost 40 t/s.
Similarly for the 3975WX with 2933 MT/s non-ECC RAM:
Is it possible to set the value differently for the GPU and the CPU? Qwen3.5-397B has more always-active parameters (about 9.8B) than sparsely activated parameters (about 7.5B), so it is quite ideal for a Strix Halo + eGPU setup with a slow PCIe link, where both PP and TG are done separately on the GPU and the CPU without weight transfers. Some numbers with the above setup using the PR branch:

baseline:

with `-fdn 512`:

with `-fdn 16`:
Not sure I understand the request. Can you share your command line so we can understand where the delta-net tensors are stored?
This is the command line:

With the default of

I probably misunderstood how it works 😅
@ikawrakow I believe most of the implementation came from either the reference implementation or one of the subsequent optimization PRs (ggml-org/llama.cpp#18102, still not merged btw). Happy to see it came in handy, especially with the release of Qwen3.5-122B-A10B and Qwen3.5-35B-A3B (https://unsloth.ai/docs/models/qwen3.5#qwen3.5-35b-a3b); I haven't tested them yet, though. Exciting times. Thank you for your hard work!
With that command all recurrent attention tensors are on the GPU and the delta-net gets computed there, so
@YurkoHoshko Thanks!
Not sure if it's worth mentioning, but the prompt cache for the `--reasoning-tokens none \`
For the speculative decoding of the
This PR adds a fused delta-net implementation for Qwen3-Next and Qwen3.5-MoE. We observe very significant performance gains for CPU-only inference (PP and TG), and a more modest TG performance improvement on CUDA.
I started from the fused delta-net implementation that was included in an early version of @YurkoHoshko's PR #1251 (@YurkoHoshko: where did this implementation come from?). It wasn't functioning correctly there, not because of the delta-net implementation itself but due to other factors that I later corrected in the Qwen3-Next PR #1266. That wasn't clear at the time, however, so @YurkoHoshko removed the fused delta-net implementation before I got involved. In any case, I added many performance optimizations for this PR, so the resulting implementation is quite different from where I started.
For now I have left the fused delta-net off by default. It can be turned on using

`-fdn N`

where `N` is an integer value, and the fused delta-net gets used for u-batch sizes `<= N`. The main reason it is not turned on by default is that the performance characteristics on the CPU and on CUDA are quite different:
* On CUDA, the fused implementation is only beneficial for `u_batch <= 16`.
* On the CPU, it is faster for `u_batch <= 512` (and possibly beyond).

Here are `llama-bench` results for PP-512 and TG-128 on the Ryzen-3995WX CPU and the 3090 GPU for Qwen3-Next quantized with `IQ4_XS`. On CUDA, as mentioned above, it is best to use `-fdn 16`, so PP performance does not change.

Mainline `llama.cpp` with today's build (da426cb25 (8145)) has TG-128 = 10.15 t/s and PP-512 = 96.00 t/s on the Ryzen-3995WX CPU. I.e., with this PR the performance gap has widened to 4.2X (PP) and 3.06X (TG).

The CPU implementation is SIMD-ified only for `x86-64` (using vanilla `AVX2`). It will not be a big effort to add `AVX512` and `ARM_NEON` implementations, but I'm leaving this for a future PR.