Fused delta-net#1315

Merged
ikawrakow merged 10 commits into main from ik/fused_delta_net
Feb 25, 2026

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Feb 24, 2026

This PR adds a fused delta-net implementation for Qwen3-Next and Qwen3.5-MoE. We observe very significant performance gains for CPU-only inference (PP and TG), and a more modest TG performance improvement on CUDA.

I started from the fused delta-net implementation that was included in an early version of @YurkoHoshko's PR #1251 (@YurkoHoshko: where did this implementation come from?). It wasn't functioning correctly there, not because of the delta-net implementation itself but due to other factors that I later corrected in the Qwen3-Next PR #1266. That wasn't clear at the time, however, so @YurkoHoshko removed the fused delta-net implementation before I got involved. In any case, for this PR I added many performance optimizations, so the resulting implementation is quite different from where I started.
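For context, the recurrence that "delta-net" refers to can be sketched in a few lines of numpy. This is a minimal, ungated sketch of the delta-rule state update only; the actual fused kernel in this PR additionally handles gating, multiple heads, normalization, and the chunked/fused dispatch, so treat the function and variable names here as illustrative, not as the PR's code:

```python
import numpy as np

def delta_net_step(S, k, v, beta):
    """One autoregressive delta-rule update: S <- S (I - beta k k^T) + beta v k^T.
    S: (d_v, d_k) state, k: (d_k,) unit-norm key, v: (d_v,) value, beta: scalar gate.
    Equivalent form used below: S <- S + beta * (v - S k) k^T (gated error correction)."""
    v_pred = S @ k                              # what the current state predicts for v
    return S + np.outer(beta * (v - v_pred), k)  # correct the state toward v along k

# toy check: after updating with (k, v) at beta = 1 and ||k|| = 1,
# the state reproduces v exactly when queried with k
d_k, d_v = 4, 3
rng = np.random.default_rng(0)
S = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
S = delta_net_step(S, k, v, beta=1.0)
assert np.allclose(S @ k, v)
```

The fused implementation applies this token-by-token update in a single kernel rather than materializing the intermediate tensors of the chunked formulation.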

For now I have left the fused delta-net to be off by default. It can be turned on using

-fdn | --fused-delta-net N

where N is an integer value; the fused delta-net is used for u-batch sizes <= N. The main reason it is not turned on by default is that the performance characteristics on the CPU and on CUDA are quite different:

  • On CUDA (or at least on the 3090 GPU I'm using for testing), the fused delta-net implementation is faster than the chunked delta-net for u_batch <= 16.
  • On the CPU (or at least on the CPU I'm testing with, a Ryzen-3995WX), the fused delta-net outperforms the chunked version for u_batch <= 512 (and possibly beyond).
  • In both cases (CPU and GPU), the fused delta-net is faster than the autoregressive version (used for TG on the main branch).
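The threshold semantics above can be sketched as follows (`use_fused_delta_net` is a hypothetical helper name for illustration, not the actual code in this PR):

```python
def use_fused_delta_net(n_ubatch: int, fdn_threshold: int) -> bool:
    # Mirrors the -fdn semantics described above: the fused path is taken
    # for u-batch sizes up to and including the threshold; 0 disables it.
    return 0 < n_ubatch <= fdn_threshold

# With -fdn 16 (a good CUDA setting per the discussion), TG (n_ubatch = 1)
# uses the fused path while large PP u-batches fall back to the chunked version.
assert use_fused_delta_net(1, 16)
assert not use_fused_delta_net(512, 16)
```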

Here are llama-bench results for PP-512 and TG-128 on the Ryzen-3995WX CPU and the 3090 GPU for Qwen3-Next quantized with IQ4_XS. On CUDA, as mentioned above, it is best to use -fdn 16, so PP performance does not change.

| model | backend | test | t/s (main) | t/s (fdn) | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B IQ4_XS | CPU | pp512 | 323.56 ± 8.26 | 402.95 ± 9.07 | 1.245 |
| qwen3next 80B.A3B IQ4_XS | CPU | tg128 | 23.69 ± 0.08 | 31.05 ± 0.19 | 1.312 |
| qwen3next 80B.A3B IQ4_XS | CUDA | tg128 | 113.96 ± 0.01 | 124.50 ± 0.21 | 1.092 |

Mainline llama.cpp with today's build (da426cb25 (8145)) has TG-128 = 10.15 t/s and PP-512 = 96.00 t/s on the Ryzen-3995WX CPU. I.e., with this PR the performance gap has widened to 4.2X (PP) and 3.06X (TG).
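The quoted gaps follow directly from the numbers above; a quick check:

```python
# Reproduce the quoted performance gaps vs mainline llama.cpp on the
# Ryzen-3995WX, using the numbers from the table and text above.
pp_gap = 402.95 / 96.00   # fdn PP-512 vs mainline PP-512
tg_gap = 31.05 / 10.15    # fdn TG-128 vs mainline TG-128
assert round(pp_gap, 1) == 4.2
assert round(tg_gap, 2) == 3.06
```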

The CPU implementation is SIMD-ified only for x86-64 (using vanilla AVX2). It will not be a big effort to add AVX512 and ARM_NEON implementations, but I'm leaving this for a future PR.

@ikawrakow
Owner Author

ikawrakow commented Feb 24, 2026

Here are llama-bench results for Qwen3.5-MoE quantized with IQ4_XS, running CPU-only on a Ryzen-3995WX CPU:

| model | size | params | backend | threads | rtr | fdn | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B IQ4_XS | 197.12 GiB | 396.35 B | CPU | 64 | 1 | 512 | pp512 | 119.81 ± 1.91 |
| qwen35moe 397B.A17B IQ4_XS | 197.12 GiB | 396.35 B | CPU | 64 | 1 | 512 | tg128 | 9.79 ± 0.02 |

This is quite usable, I think.

llama.cpp delivers a whopping 3.8 t/s TG.

@magikRUKKOLA

magikRUKKOLA commented Feb 24, 2026

About +6% in decode (Qwen3.5-IQ4_KSS) with a 3975WX and two 3090s:

-fdn 16
main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 13.364 306.49 54.495 18.79
4096 1024 4096 13.575 301.73 54.528 18.78
4096 1024 8192 13.692 299.16 54.933 18.64
4096 1024 12288 13.729 298.34 55.318 18.51
4096 1024 16384 13.804 296.72 55.681 18.39

It's very nice. (A note to myself: it would be nice to test the FIM capabilities.) Qwen3.5 by itself is really nice and capable!

[EDIT2]:

Speed comparison to a similar quant on, apparently, regular llama.cpp: https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF/discussions/4#6993f9542c659709a88d4cc2

I'm getting similar perf for UD-Q4_K_XL and 72GB VRAM:

RTX 4090D 48GB
RTX 3090 24GB
Intel Xeon W5-3425 with 256GB DDR5-4800
prompt eval time =   13726.21 ms /   512 tokens (   26.81 ms per token,    37.30 tokens per second)
       eval time =   64585.92 ms /   857 tokens (   75.36 ms per token,    13.27 tokens per second)

Damn ... 🥇

[EDIT]: it's crazy that one can get such performance with DDR4, lol. The full offload of IQ2_KL is only about 60% faster. Uh oh. There is a boost for the full GPU offload too.

33 tps -> 38 tps

Details
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 4.641 882.66 27.141 37.73
4096 1024 4096 4.752 861.87 27.066 37.83
4096 1024 8192 4.889 837.75 27.616 37.08
4096 1024 12288 5.023 815.50 28.147 36.38
4096 1024 16384 5.149 795.50 28.677 35.71
4096 1024 20480 5.263 778.24 28.960 35.36
4096 1024 24576 5.403 758.05 29.251 35.01
4096 1024 28672 5.529 740.81 29.552 34.65
4096 1024 32768 5.657 724.07 30.053 34.07
4096 1024 36864 5.786 707.91 30.568 33.50
4096 1024 40960 5.918 692.15 30.732 33.32
4096 1024 45056 6.047 677.38 31.113 32.91
4096 1024 49152 6.176 663.23 31.579 32.43
4096 1024 53248 6.301 650.04 31.910 32.09
4096 1024 57344 6.432 636.85 32.195 31.81
4096 1024 61440 6.590 621.53 32.598 31.41
4096 1024 65536 6.687 612.55 32.992 31.04
4096 1024 69632 6.814 601.14 33.332 30.72
4096 1024 73728 6.945 589.81 33.693 30.39
4096 1024 77824 7.064 579.88 33.945 30.17
4096 1024 81920 7.190 569.67 34.342 29.82
4096 1024 86016 7.337 558.25 34.591 29.60
4096 1024 90112 7.468 548.48 35.222 29.07
4096 1024 94208 7.579 540.43 35.580 28.78
4096 1024 98304 7.712 531.09 35.980 28.46
4096 1024 102400 7.856 521.37 36.221 28.27
4096 1024 106496 7.976 513.52 36.794 27.83
4096 1024 110592 8.111 505.02 36.976 27.69
4096 1024 114688 8.229 497.74 37.411 27.37
4096 1024 118784 8.361 489.87 37.665 27.19
4096 1024 122880 8.479 483.08 38.085 26.89
4096 1024 126976 8.659 473.03 38.342 26.71
4096 1024 131072 8.757 467.75 38.848 26.36
4096 1024 135168 8.884 461.04 39.145 26.16
4096 1024 139264 9.012 454.52 39.442 25.96
4096 1024 143360 9.152 447.53 39.992 25.60
4096 1024 147456 9.270 441.84 40.400 25.35
4096 1024 151552 9.412 435.21 40.736 25.14
4096 1024 155648 9.551 428.88 40.993 24.98

It's very nice. That's almost 40 t/s.

@magikRUKKOLA

magikRUKKOLA commented Feb 24, 2026

> Here are llama-bench results for Qwen3.5-MoE quantized with IQ4_XS, running CPU-only on a Ryzen-3995WX CPU:

Similarly for a 3975WX with 2933 MT/s non-ECC RAM:

main: n_kv_max = 262144, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 0, n_threads = 32, n_threads_batch = 32
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 7.030 72.83 19.112 6.70
512 128 512 7.141 71.70 17.754 7.21
512 128 1024 7.206 71.05 17.900 7.15
512 128 1536 7.268 70.45 17.989 7.12
512 128 2048 7.395 69.24 17.966 7.12
512 128 2560 7.405 69.15 17.955 7.13
512 128 3072 7.830 65.39 18.044 7.09

@sayap
Contributor

sayap commented Feb 25, 2026

Is it possible to set the value differently for GPU and CPU? Qwen3.5 397B has more always-active parameters (about 9.8B) than sparsely activated parameters (about 7.5B), so it is quite ideal for a Strix Halo + eGPU setup with a slow PCIe link, where PP and TG are done separately on GPU and CPU without weight transfers.
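(A rough sanity check on the split quoted above: the two counts indeed add up to roughly the 17B active parameters implied by the "A17B" in the model name.)

```python
# The 9.8B always-active and 7.5B sparsely-activated figures are quoted from
# the comment above; together they should roughly match the ~17B "active"
# parameters in the name Qwen3.5-397B-A17B.
dense_b, sparse_b = 9.8, 7.5
active_b = dense_b + sparse_b
assert 17.0 <= active_b <= 17.5
```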

Some numbers with the above setup using the PR branch...

baseline:

prompt eval time =   41744.87 ms /  5743 tokens (    7.27 ms per token,   137.57 tokens per second)
       eval time =   29973.34 ms /   588 tokens (   50.98 ms per token,    19.62 tokens per second)

with -fdn 512:

prompt eval time =   43873.05 ms /  5743 tokens (    7.64 ms per token,   130.90 tokens per second)
       eval time =   28286.41 ms /   588 tokens (   48.11 ms per token,    20.79 tokens per second)

with -fdn 16:

prompt eval time =   42106.33 ms /  5743 tokens (    7.33 ms per token,   136.39 tokens per second)
       eval time =   28160.44 ms /   588 tokens (   47.89 ms per token,    20.88 tokens per second)

@ikawrakow
Owner Author

@sayap

Not sure I understand the request. Can you share your command line so we can understand where the delta net tensors are stored?

@sayap
Contributor

sayap commented Feb 25, 2026

This is the command line:

-ngl 999 -ncmoe 59 -cram 0 --no-mmap -c 262144 --jinja -ctk q8_0 -ctv q8_0 -fdn 16
...
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors:        CPU buffer size = 122130.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   545.62 MiB
llm_load_tensors:      CUDA0 buffer size = 10749.78 MiB

With the default of -ub 512, there is no GPU offload for prompt processing.

I probably misunderstood how it works 😅

@YurkoHoshko
Contributor

YurkoHoshko commented Feb 25, 2026

@ikawrakow I believe most of the implementation came from either the reference implementation or one of the subsequent optimization PRs (ggml-org/llama.cpp#18102 - still not merged, btw).

Happy to see it came in handy, especially with the release of Qwen3.5-122B-A10B and Qwen3.5-35B-A3B (https://unsloth.ai/docs/models/qwen3.5#qwen3.5-35b-a3b) - haven't tested them yet though.

Exciting times - thank you for your hard work!

@ikawrakow
Owner Author

@sayap

With that command, all recurrent attention tensors are on the GPU and the delta-net gets computed there, so -fdn 16 is what you need. There is no reason for PP to go down with that, so the change from 137.6 to 136.4 t/s is most likely measurement noise.

@ikawrakow
Copy link
Owner Author

@YurkoHoshko Thanks!

@magikRUKKOLA

Not sure if it's worth mentioning, but the prompt cache for Qwen3.5 works only if this option is enabled:

--reasoning-tokens none \

@magikRUKKOLA

@YurkoHoshko

> (https://unsloth.ai/docs/models/qwen3.5#qwen3.5-35b-a3b) - haven't tested them yet though.

For speculative decoding of Qwen3.5-397B-A17B?

@ikawrakow ikawrakow merged commit c77ec4b into main Feb 25, 2026
@magikRUKKOLA magikRUKKOLA mentioned this pull request Feb 25, 2026
abc-nix added a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently portcopied --dry-run feature.
- Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overriden during the tensor loading (ikawrakow#1318)

* Display the size of the tensors overriden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And pass in debug the later displayed size of the unnamed buffer overrides.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display is cluttering the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen3.5 models (ikawrakow#1326)

add directio to llama-bench