Fused delta-net#1315

Merged
ikawrakow merged 10 commits into main from ik/fused_delta_net
Feb 25, 2026

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Feb 24, 2026

This PR adds a fused delta-net implementation for Qwen3-Next and Qwen3.5-MoE. We observe very significant performance gains for CPU-only inference (PP and TG), and a more modest TG performance improvement on CUDA.

I started from the fused delta-net implementation that was included in an early version of @YurkoHoshko's PR #1251 (@YurkoHoshko: where did this implementation come from?). It wasn't functioning correctly there, not because of the delta-net implementation itself but due to other factors that I later corrected in the Qwen3-Next PR #1266. That wasn't clear at the time, however, so @YurkoHoshko removed the fused delta-net implementation before I got involved. In any case, for this PR I added many performance optimizations, so the resulting implementation is quite different from where I started.
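For context, the recurrence that "delta-net" refers to can be sketched in a few lines of numpy. This is a minimal, ungated sketch of the delta-rule state update only; the actual fused kernel in this PR additionally handles gating, multiple heads, normalization, and the chunked/fused dispatch, so treat the function and variable names here as illustrative, not as the PR's code:

```python
import numpy as np

def delta_net_step(S, k, v, beta):
    """One autoregressive delta-rule update: S <- S (I - beta k k^T) + beta v k^T.
    S: (d_v, d_k) state, k: (d_k,) unit-norm key, v: (d_v,) value, beta: scalar gate.
    Equivalent form used below: S <- S + beta * (v - S k) k^T (gated error correction)."""
    v_pred = S @ k                              # what the current state predicts for v
    return S + np.outer(beta * (v - v_pred), k)  # correct the state toward v along k

# toy check: after updating with (k, v) at beta = 1 and ||k|| = 1,
# the state reproduces v exactly when queried with k
d_k, d_v = 4, 3
rng = np.random.default_rng(0)
S = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
S = delta_net_step(S, k, v, beta=1.0)
assert np.allclose(S @ k, v)
```

The fused implementation applies this token-by-token update in a single kernel rather than materializing the intermediate tensors of the chunked formulation.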

For now I have left the fused delta-net to be off by default. It can be turned on using

-fdn | --fused-delta-net N

where N is an integer value; the fused delta-net is used for u-batch sizes <= N. The main reason it is not turned on by default is that the performance characteristics on the CPU and on CUDA are quite different:

  • On CUDA (or at least on the 3090 GPU I'm using for testing), the fused delta-net implementation is faster than the chunked delta-net for u_batch <= 16.
  • On the CPU (or at least on the CPU I'm testing with, a Ryzen-3995WX), the fused delta-net outperforms the chunked version for u_batch <= 512 (and possibly beyond).
  • In both cases (CPU and GPU), the fused delta-net is faster than the autoregressive version (used for TG on the main branch).
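The threshold semantics above can be sketched as follows (`use_fused_delta_net` is a hypothetical helper name for illustration, not the actual code in this PR):

```python
def use_fused_delta_net(n_ubatch: int, fdn_threshold: int) -> bool:
    # Mirrors the -fdn semantics described above: the fused path is taken
    # for u-batch sizes up to and including the threshold; 0 disables it.
    return 0 < n_ubatch <= fdn_threshold

# With -fdn 16 (a good CUDA setting per the discussion), TG (n_ubatch = 1)
# uses the fused path while large PP u-batches fall back to the chunked version.
assert use_fused_delta_net(1, 16)
assert not use_fused_delta_net(512, 16)
```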

Here are llama-bench results for PP-512 and TG-128 on the Ryzen-3995WX CPU and the 3090 GPU for Qwen3-Next quantized with IQ4_XS. On CUDA, as mentioned above, it is best to use -fdn 16, so PP performance does not change.

| model | backend | test | t/s (main) | t/s (fdn) | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B IQ4_XS | CPU | pp512 | 323.56 ± 8.26 | 402.95 ± 9.07 | 1.245 |
| qwen3next 80B.A3B IQ4_XS | CPU | tg128 | 23.69 ± 0.08 | 31.05 ± 0.19 | 1.312 |
| qwen3next 80B.A3B IQ4_XS | CUDA | tg128 | 113.96 ± 0.01 | 124.50 ± 0.21 | 1.092 |

Mainline llama.cpp with today's build (da426cb25 (8145)) has TG-128 = 10.15 t/s and PP-512 = 96.00 t/s on the Ryzen-3995WX CPU. I.e., with this PR the performance gap has widened to 4.2X (PP) and 3.06X (TG).
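The quoted gaps follow directly from the numbers above; a quick check:

```python
# Reproduce the quoted performance gaps vs mainline llama.cpp on the
# Ryzen-3995WX, using the numbers from the table and text above.
pp_gap = 402.95 / 96.00   # fdn PP-512 vs mainline PP-512
tg_gap = 31.05 / 10.15    # fdn TG-128 vs mainline TG-128
assert round(pp_gap, 1) == 4.2
assert round(tg_gap, 2) == 3.06
```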

The CPU implementation is SIMD-ified only for x86-64 (using vanilla AVX2). It will not be a big effort to add AVX512 and ARM_NEON implementations, but I'm leaving this for a future PR.

@ikawrakow
Owner Author

ikawrakow commented Feb 24, 2026

Here are llama-bench results for Qwen3.5-MoE quantized with IQ4_XS, running CPU-only on a Ryzen-3995WX CPU:

| model | size | params | backend | threads | rtr | fdn | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B IQ4_XS | 197.12 GiB | 396.35 B | CPU | 64 | 1 | 512 | pp512 | 119.81 ± 1.91 |
| qwen35moe 397B.A17B IQ4_XS | 197.12 GiB | 396.35 B | CPU | 64 | 1 | 512 | tg128 | 9.79 ± 0.02 |

This is quite usable, I think.

llama.cpp delivers a whopping 3.8 t/s TG.

@magikRUKKOLA

magikRUKKOLA commented Feb 24, 2026

About +6% in decode (Qwen3.5-IQ4_KSS) with a 3975WX and two 3090s:

-fdn 16
main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 32, n_threads_batch = 32
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 13.364 306.49 54.495 18.79
4096 1024 4096 13.575 301.73 54.528 18.78
4096 1024 8192 13.692 299.16 54.933 18.64
4096 1024 12288 13.729 298.34 55.318 18.51
4096 1024 16384 13.804 296.72 55.681 18.39

It's very nice. (A note to myself: it would be nice to test the FIM capabilities.) Qwen3.5 by itself is really nice and capable!

[EDIT2]:

Speed comparison to a similar quant on, apparently, regular llama.cpp: https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF/discussions/4#6993f9542c659709a88d4cc2

I'm getting similar perf for UD-Q4_K_XL and 72GB VRAM:

RTX 4090D 48GB
RTX 3090 24GB
Intel Xeon W5-3425 with 256GB DDR5-4800
prompt eval time =   13726.21 ms /   512 tokens (   26.81 ms per token,    37.30 tokens per second)
       eval time =   64585.92 ms /   857 tokens (   75.36 ms per token,    13.27 tokens per second)

Damn ... 🥇

[EDIT]: it's crazy that one can get such performance with DDR4, lol. The full offload of IQ2_KL is only about 60% faster. Uh oh. There is a boost for the full GPU offload too.

33 tps -> 38 tps

Details
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 4.641 882.66 27.141 37.73
4096 1024 4096 4.752 861.87 27.066 37.83
4096 1024 8192 4.889 837.75 27.616 37.08
4096 1024 12288 5.023 815.50 28.147 36.38
4096 1024 16384 5.149 795.50 28.677 35.71
4096 1024 20480 5.263 778.24 28.960 35.36
4096 1024 24576 5.403 758.05 29.251 35.01
4096 1024 28672 5.529 740.81 29.552 34.65
4096 1024 32768 5.657 724.07 30.053 34.07
4096 1024 36864 5.786 707.91 30.568 33.50
4096 1024 40960 5.918 692.15 30.732 33.32
4096 1024 45056 6.047 677.38 31.113 32.91
4096 1024 49152 6.176 663.23 31.579 32.43
4096 1024 53248 6.301 650.04 31.910 32.09
4096 1024 57344 6.432 636.85 32.195 31.81
4096 1024 61440 6.590 621.53 32.598 31.41
4096 1024 65536 6.687 612.55 32.992 31.04
4096 1024 69632 6.814 601.14 33.332 30.72
4096 1024 73728 6.945 589.81 33.693 30.39
4096 1024 77824 7.064 579.88 33.945 30.17
4096 1024 81920 7.190 569.67 34.342 29.82
4096 1024 86016 7.337 558.25 34.591 29.60
4096 1024 90112 7.468 548.48 35.222 29.07
4096 1024 94208 7.579 540.43 35.580 28.78
4096 1024 98304 7.712 531.09 35.980 28.46
4096 1024 102400 7.856 521.37 36.221 28.27
4096 1024 106496 7.976 513.52 36.794 27.83
4096 1024 110592 8.111 505.02 36.976 27.69
4096 1024 114688 8.229 497.74 37.411 27.37
4096 1024 118784 8.361 489.87 37.665 27.19
4096 1024 122880 8.479 483.08 38.085 26.89
4096 1024 126976 8.659 473.03 38.342 26.71
4096 1024 131072 8.757 467.75 38.848 26.36
4096 1024 135168 8.884 461.04 39.145 26.16
4096 1024 139264 9.012 454.52 39.442 25.96
4096 1024 143360 9.152 447.53 39.992 25.60
4096 1024 147456 9.270 441.84 40.400 25.35
4096 1024 151552 9.412 435.21 40.736 25.14
4096 1024 155648 9.551 428.88 40.993 24.98

It's very nice. That's almost 40 t/s.

@magikRUKKOLA

magikRUKKOLA commented Feb 24, 2026

> Here are llama-bench results for Qwen3.5-MoE quantized with IQ4_XS, running CPU-only on a Ryzen-3995WX CPU:

Similarly for a 3975WX with 2933 MT/s non-ECC RAM:

main: n_kv_max = 262144, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 0, n_threads = 32, n_threads_batch = 32
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 7.030 72.83 19.112 6.70
512 128 512 7.141 71.70 17.754 7.21
512 128 1024 7.206 71.05 17.900 7.15
512 128 1536 7.268 70.45 17.989 7.12
512 128 2048 7.395 69.24 17.966 7.12
512 128 2560 7.405 69.15 17.955 7.13
512 128 3072 7.830 65.39 18.044 7.09

@sayap
Contributor

sayap commented Feb 25, 2026

Is it possible to set the value differently for GPU and CPU? Qwen3.5 397B has more always-active parameters (about 9.8B) than sparsely activated parameters (about 7.5B), so it is quite ideal for a Strix Halo + eGPU setup with a slow PCIe link, where PP and TG are done separately on GPU and CPU without weight transfers.
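(A rough sanity check on the split quoted above: the two counts indeed add up to roughly the 17B active parameters implied by the "A17B" in the model name.)

```python
# The 9.8B always-active and 7.5B sparsely-activated figures are quoted from
# the comment above; together they should roughly match the ~17B "active"
# parameters in the name Qwen3.5-397B-A17B.
dense_b, sparse_b = 9.8, 7.5
active_b = dense_b + sparse_b
assert 17.0 <= active_b <= 17.5
```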

Some numbers with the above setup using the PR branch...

baseline:

prompt eval time =   41744.87 ms /  5743 tokens (    7.27 ms per token,   137.57 tokens per second)
       eval time =   29973.34 ms /   588 tokens (   50.98 ms per token,    19.62 tokens per second)

with -fdn 512:

prompt eval time =   43873.05 ms /  5743 tokens (    7.64 ms per token,   130.90 tokens per second)
       eval time =   28286.41 ms /   588 tokens (   48.11 ms per token,    20.79 tokens per second)

with -fdn 16:

prompt eval time =   42106.33 ms /  5743 tokens (    7.33 ms per token,   136.39 tokens per second)
       eval time =   28160.44 ms /   588 tokens (   47.89 ms per token,    20.88 tokens per second)

@ikawrakow
Owner Author

@sayap

Not sure I understand the request. Can you share your command line so we can understand where the delta net tensors are stored?

@sayap
Contributor

sayap commented Feb 25, 2026

This is the command line:

-ngl 999 -ncmoe 59 -cram 0 --no-mmap -c 262144 --jinja -ctk q8_0 -ctv q8_0 -fdn 16
...
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors:        CPU buffer size = 122130.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   545.62 MiB
llm_load_tensors:      CUDA0 buffer size = 10749.78 MiB

With the default of -ub 512, there is no GPU offload for prompt processing.

I probably misunderstood how it works 😅

@YurkoHoshko
Contributor

YurkoHoshko commented Feb 25, 2026

@ikawrakow I believe most of the implementation came from either the reference implementation or one of the subsequent optimization PRs (ggml-org/llama.cpp#18102 - still not merged, btw).

Happy to see it came in handy, especially with the release of Qwen3.5-122B-A10B and Qwen3.5-35B-A3B (https://unsloth.ai/docs/models/qwen3.5#qwen3.5-35b-a3b) - haven't tested them yet though.

Exciting times - thank you for your hard work!

@ikawrakow
Owner Author

@sayap

With that command, all recurrent attention tensors are on the GPU and the delta-net gets computed there, so -fdn 16 is what you need. There is no reason for PP to go down with that, so the change from 137.6 to 136.4 t/s is most likely measurement noise.

@ikawrakow
Copy link
Owner Author

@YurkoHoshko Thanks!

@magikRUKKOLA

Not sure if it's worth mentioning, but the prompt cache for Qwen3.5 works only if this option is enabled:

--reasoning-tokens none \

@magikRUKKOLA

@YurkoHoshko

> (https://unsloth.ai/docs/models/qwen3.5#qwen3.5-35b-a3b) - haven't tested them yet though.

For speculative decoding of Qwen3.5-397B-A17B?

@ikawrakow ikawrakow merged commit c77ec4b into main Feb 25, 2026
@magikRUKKOLA magikRUKKOLA mentioned this pull request Feb 25, 2026
abc-nix added a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently portcopied --dry-run feature.
- Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overriden during the tensor loading (ikawrakow#1318)

* Display the size of the tensors overriden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And pass in debug the later displayed size of the unnamed buffer overrides.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display is cluttering the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen3.5 models (ikawrakow#1326)

add directio to llama-bench