
Fused delta net 2 #1320

Merged
ikawrakow merged 15 commits into main from ik/fused_delta_net_2
Feb 26, 2026

Conversation

@ikawrakow (Owner)

This PR adds further optimizations to the CUDA fused delta net implementation.

For Qwen3-Next fully offloaded to the GPU, this results in ~6-7% better TG.

More importantly, fused delta net PP performance is now almost on par with the chunked implementation. Why is this more important? Because I can see a path forward to graph parallel with the fused delta net, while I consider the chunked implementation basically hopeless for graph parallel.
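For context, the delta rule recurrence that the fused kernel evaluates token by token can be sketched roughly as below. This is a hypothetical, heavily simplified pure-Python illustration; the actual CUDA kernels also handle gating, L2 normalization, multiple heads, and batching, and as I understand it the chunked implementation computes the same recurrence but batches the within-chunk algebra into matrix multiplications, which is why the two agree on results while differing in PP throughput.

```python
# Sketch of the delta rule state update (illustrative only):
#   S <- S + beta_t * k_t (v_t - S^T k_t)^T    state update, S is d_k x d_v
#   o_t = S^T q_t                              output for token t
def delta_net_fused(qs, ks, vs, betas, d_k, d_v):
    S = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for q, k, v, beta in zip(qs, ks, vs, betas):
        # S^T k : the state's current "prediction" for this key, length d_v
        kS = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
        # rank-1 correction of the state toward v, scaled by beta
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += beta * k[i] * (v[j] - kS[j])
        # read the state out with the query
        outs.append([sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)])
    return outs
```

The sequential dependence of S on every previous token is what makes the fused path attractive for graph parallel: the per-token update is small and local, whereas the chunked formulation ties large spans of tokens together.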

Anyway, the table below shows PP-X performance as a function of prompt length X for the chunked and fused delta net implementations on a 2x3090 system running Qwen3-Next fully offloaded to the GPUs:

| test  | t/s (chunked)   | t/s (fused, PR)  | Speedup |
|-------|-----------------|------------------|---------|
| pp2   | 69.66 ± 7.71    | 100.88 ± 11.92   | 1.448   |
| pp4   | 125.50 ± 2.95   | 174.56 ± 4.12    | 1.391   |
| pp8   | 228.45 ± 5.92   | 296.60 ± 7.67    | 1.298   |
| pp16  | 389.50 ± 24.63  | 463.12 ± 30.34   | 1.189   |
| pp32  | 626.37 ± 14.78  | 679.66 ± 16.18   | 1.085   |
| pp64  | 915.86 ± 17.08  | 902.35 ± 16.38   | 0.985   |
| pp128 | 1056.49 ± 16.53 | 1001.80 ± 69.70  | 0.948   |
| pp256 | 1529.79 ± 17.49 | 1422.41 ± 14.17  | 0.930   |
| pp512 | 2014.66 ± 16.04 | 1833.90 ± 8.86   | 0.910   |
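The Speedup column is simply the fused t/s divided by the chunked t/s; checking the pp2 row from the table above:

```python
# Speedup = fused t/s / chunked t/s, here for the pp2 row
chunked, fused = 69.66, 100.88
print(f"{fused / chunked:.3f}")  # -> 1.448
```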

In comparison, here is what we had on the main branch:

| test  | t/s (chunked)   | t/s (fused, main) | Speedup |
|-------|-----------------|-------------------|---------|
| pp2   | 69.66 ± 7.71    | 97.82 ± 11.44     | 1.404   |
| pp4   | 125.50 ± 2.95   | 164.99 ± 3.71     | 1.315   |
| pp8   | 228.45 ± 5.92   | 272.36 ± 6.22     | 1.192   |
| pp16  | 389.50 ± 24.63  | 405.20 ± 24.37    | 1.040   |
| pp32  | 626.37 ± 14.78  | 562.87 ± 12.07    | 0.899   |
| pp64  | 915.86 ± 17.08  | 710.33 ± 9.63     | 0.775   |
| pp128 | 1056.49 ± 16.53 | 773.17 ± 8.34     | 0.732   |
| pp256 | 1529.79 ± 17.49 | 998.91 ± 6.90     | 0.653   |
| pp512 | 2014.66 ± 16.04 | 1186.01 ± 5.37    | 0.589   |

@magikRUKKOLA

Qwen3.5 IQ2_KL 8x3090:

+ ~2% (decode)

Details
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 4.641 882.66 26.550 38.57
4096 1024 4096 4.759 860.76 26.633 38.45
4096 1024 8192 4.879 839.56 27.088 37.80
4096 1024 12288 5.021 815.70 27.479 37.27
4096 1024 16384 5.148 795.64 27.979 36.60
4096 1024 20480 5.276 776.28 28.165 36.36
4096 1024 24576 5.414 756.50 28.522 35.90
4096 1024 28672 5.531 740.56 29.032 35.27
4096 1024 32768 5.669 722.48 29.308 34.94
4096 1024 36864 5.789 707.51 29.922 34.22
4096 1024 40960 5.917 692.24 30.066 34.06
4096 1024 45056 6.047 677.40 30.414 33.67
4096 1024 49152 6.172 663.66 30.992 33.04
4096 1024 53248 6.291 651.12 31.258 32.76
4096 1024 57344 6.423 637.74 31.467 32.54
4096 1024 61440 6.580 622.54 31.938 32.06
4096 1024 65536 6.693 611.96 32.208 31.79
4096 1024 69632 6.812 601.32 32.743 31.27
4096 1024 73728 6.937 590.49 33.201 30.84
4096 1024 77824 7.068 579.48 33.366 30.69
4096 1024 81920 7.190 569.69 33.702 30.38
4096 1024 86016 7.336 558.33 34.125 30.01
4096 1024 90112 7.458 549.24 34.843 29.39
4096 1024 94208 7.584 540.07 35.088 29.18
4096 1024 98304 7.722 530.44 35.141 29.14

IQ4_KSS (DDR4 + 2x3090):

+ ~1.3% (decode)

Details
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 13.340 307.04 54.068 18.94
4096 1024 4096 13.557 302.13 53.770 19.04
4096 1024 8192 13.788 297.07 54.150 18.91
4096 1024 12288 13.848 295.79 55.297 18.52
4096 1024 16384 13.771 297.45 55.258 18.53
4096 1024 20480 14.047 291.60 56.212 18.22
4096 1024 24576 14.311 286.22 55.819 18.34

References: previous test: #1315 (comment)

@ubergarm (Contributor) commented Feb 25, 2026

I only tried with and without -fdn 4096, which is likely much too high given the original PR #1315 used -fdn 16... Not sure of the best values to try for CUDA offload or hybrid CPU. (A quick -fdn 16 test gave results similar to -fdn 4096, FWIW.)

It still shows that this PR achieves the fastest TG speeds.
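For reference, -fdn is a threshold rather than a boolean (per the "Change meaning of fdn from bool flag to threshold value" commit in this PR). A plausible reading of the dispatch, sketched here as an assumption rather than the actual code, is:

```python
def pick_delta_net_impl(n_tokens: int, fdn_threshold: int) -> str:
    # Hypothetical dispatch rule: batches up to the -fdn threshold take the
    # fused delta-net path; larger batches fall back to the chunked one.
    return "fused" if n_tokens <= fdn_threshold else "chunked"
```

Under this reading, TG (single-token) batches take the fused path at any positive threshold, which would explain why -fdn 16 and -fdn 4096 gave similar TG numbers here.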

sweep-bench-Qwen3-Coder-Next-PR1320
Details

main@216f4436 -sm layer

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 1.859 2203.45 0.780 82.01
4096 64 4096 1.899 2156.44 0.770 83.07
4096 64 8192 1.941 2110.53 0.780 82.00
4096 64 12288 1.995 2053.33 0.800 80.03
4096 64 16384 2.061 1987.05 0.806 79.42
4096 64 20480 2.129 1923.86 0.816 78.39
4096 64 24576 2.181 1877.89 0.830 77.13
4096 64 28672 2.239 1829.02 0.840 76.15
4096 64 32768 2.300 1780.82 0.856 74.73
4096 64 36864 2.358 1736.76 0.859 74.46
4096 64 40960 2.414 1696.51 0.871 73.50
4096 64 45056 2.474 1655.46 0.885 72.33
4096 64 49152 2.499 1639.05 0.891 71.79
4096 64 53248 2.584 1585.19 0.902 70.94
4096 64 57344 2.630 1557.45 0.913 70.11
4096 64 61440 2.701 1516.64 0.922 69.40
4096 64 65536 2.748 1490.37 0.939 68.18
4096 64 69632 2.798 1463.95 0.944 67.83
4096 64 73728 2.864 1430.01 0.953 67.12
4096 64 77824 2.915 1404.91 0.966 66.26
4096 64 81920 2.979 1374.98 0.972 65.82
4096 64 86016 3.043 1346.19 0.989 64.69
4096 64 90112 3.098 1321.97 0.996 64.26
4096 64 94208 3.158 1297.09 1.005 63.70
4096 64 98304 3.234 1266.48 1.021 62.66
4096 64 102400 3.298 1241.83 1.027 62.34
4096 64 106496 3.348 1223.50 1.036 61.78
4096 64 110592 3.413 1200.13 1.050 60.96
4096 64 114688 3.468 1181.14 1.057 60.55
4096 64 118784 3.534 1159.11 1.076 59.48
4096 64 122880 3.588 1141.74 1.083 59.12
4096 64 126976 3.650 1122.22 1.091 58.66
4096 64 131072 3.721 1100.84 1.106 57.86

main@216f4436 -sm graph

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 2.029 2019.19 0.911 70.22
4096 64 4096 2.051 1996.76 0.902 70.98
4096 64 8192 2.067 1982.04 0.902 70.99
4096 64 12288 2.093 1956.92 0.912 70.15
4096 64 16384 2.118 1933.81 0.923 69.30
4096 64 20480 2.152 1903.54 0.933 68.60
4096 64 24576 2.173 1884.80 0.946 67.65
4096 64 28672 2.210 1853.21 0.949 67.42
4096 64 32768 2.233 1833.92 0.954 67.07
4096 64 36864 2.264 1809.25 0.961 66.58
4096 64 40960 2.295 1785.01 0.967 66.19
4096 64 45056 2.329 1758.68 0.980 65.34
4096 64 49152 2.354 1739.72 0.985 65.01
4096 64 53248 2.385 1717.45 0.990 64.66
4096 64 57344 2.421 1692.08 0.993 64.42
4096 64 61440 2.467 1659.99 0.996 64.24
4096 64 65536 2.500 1638.72 1.012 63.22
4096 64 69632 2.512 1630.65 1.013 63.17
4096 64 73728 2.546 1608.83 1.019 62.82
4096 64 77824 2.581 1586.70 1.028 62.24
4096 64 81920 2.615 1566.27 1.030 62.12
4096 64 86016 2.644 1549.41 1.043 61.37
4096 64 90112 2.683 1526.80 1.049 61.00
4096 64 94208 2.711 1511.01 1.051 60.89
4096 64 98304 2.756 1486.23 1.055 60.67
4096 64 102400 2.762 1483.11 1.060 60.40
4096 64 106496 2.810 1457.76 1.064 60.13
4096 64 110592 2.838 1443.03 1.075 59.53
4096 64 114688 2.870 1427.25 1.080 59.27
4096 64 118784 2.898 1413.41 1.082 59.15
4096 64 122880 2.928 1398.78 1.088 58.85
4096 64 126976 2.954 1386.45 1.093 58.53
4096 64 131072 2.989 1370.35 1.103 58.03

main@216f4436 -sm layer -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -fdn 4096 \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 3.207 1277.01 0.725 88.24
4096 64 4096 3.245 1262.07 0.715 89.56
4096 64 8192 3.294 1243.32 0.724 88.39
4096 64 12288 3.357 1220.19 0.742 86.22
4096 64 16384 3.407 1202.30 0.748 85.60
4096 64 20480 3.469 1180.74 0.756 84.63
4096 64 24576 3.531 1160.07 0.771 83.01
4096 64 28672 3.587 1141.76 0.782 81.81
4096 64 32768 3.661 1118.96 0.800 80.02
4096 64 36864 3.728 1098.58 0.803 79.70
4096 64 40960 3.792 1080.14 0.816 78.42
4096 64 45056 3.837 1067.57 0.830 77.07
4096 64 49152 3.877 1056.58 0.835 76.64
4096 64 53248 3.943 1038.91 0.846 75.63
4096 64 57344 4.008 1022.03 0.859 74.50
4096 64 61440 4.077 1004.76 0.866 73.94
4096 64 65536 4.114 995.53 0.883 72.45
4096 64 69632 4.180 979.92 0.890 71.94
4096 64 73728 4.263 960.76 0.897 71.37
4096 64 77824 4.313 949.59 0.912 70.21
4096 64 81920 4.352 941.17 0.919 69.62
4096 64 86016 4.424 925.85 0.932 68.64
4096 64 90112 4.495 911.28 0.941 68.03
4096 64 94208 4.539 902.43 0.949 67.40
4096 64 98304 4.622 886.24 0.966 66.26
4096 64 102400 4.699 871.69 0.973 65.75
4096 64 106496 4.740 864.21 0.982 65.17
4096 64 110592 4.798 853.77 0.995 64.29
4096 64 114688 4.865 841.97 1.003 63.79
4096 64 118784 4.905 835.07 1.019 62.81
4096 64 122880 4.962 825.47 1.024 62.49
4096 64 126976 5.042 812.43 1.035 61.83
4096 64 131072 5.089 804.90 1.048 61.06

main@216f4436 -sm graph -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -fdn 4096 \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 3.382 1211.08 0.843 75.93
4096 64 4096 3.422 1197.02 0.829 77.23
4096 64 8192 3.440 1190.69 0.835 76.64
4096 64 12288 3.460 1183.89 0.846 75.69
4096 64 16384 3.492 1172.95 0.854 74.92
4096 64 20480 3.527 1161.37 0.865 74.00
4096 64 24576 3.550 1153.92 0.878 72.89
4096 64 28672 3.586 1142.27 0.880 72.70
4096 64 32768 3.625 1129.93 0.884 72.37
4096 64 36864 3.673 1115.07 0.893 71.65
4096 64 40960 3.694 1108.94 0.900 71.10
4096 64 45056 3.727 1099.00 0.913 70.10
4096 64 49152 3.763 1088.38 0.916 69.88
4096 64 53248 3.787 1081.52 0.921 69.52
4096 64 57344 3.827 1070.34 0.925 69.21
4096 64 61440 3.864 1060.04 0.931 68.75
4096 64 65536 3.888 1053.56 0.943 67.88
4096 64 69632 3.932 1041.74 0.948 67.51
4096 64 73728 3.956 1035.34 0.949 67.41
4096 64 77824 3.979 1029.47 0.956 66.93
4096 64 81920 4.008 1022.03 0.960 66.66
4096 64 86016 4.038 1014.38 0.973 65.81
4096 64 90112 4.075 1005.22 0.975 65.65
4096 64 94208 4.123 993.42 0.978 65.42
4096 64 98304 4.135 990.68 0.983 65.12
4096 64 102400 4.194 976.59 0.989 64.73
4096 64 106496 4.224 969.74 0.995 64.30
4096 64 110592 4.256 962.30 1.007 63.54
4096 64 114688 4.282 956.66 1.010 63.38
4096 64 118784 4.304 951.68 1.014 63.14
4096 64 122880 4.349 941.81 1.019 62.79
4096 64 126976 4.365 938.40 1.024 62.50
4096 64 131072 4.411 928.59 1.036 61.75

PR1320 ik/fused_delta_net_2@0579a868 -sm layer

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 1.858 2205.07 0.780 82.03
4096 64 4096 1.896 2159.80 0.769 83.21
4096 64 8192 1.946 2104.85 0.781 81.95
4096 64 12288 2.002 2046.02 0.799 80.06
4096 64 16384 2.057 1990.99 0.806 79.43
4096 64 20480 2.123 1929.44 0.816 78.48
4096 64 24576 2.172 1885.83 0.830 77.11
4096 64 28672 2.229 1837.29 0.840 76.19
4096 64 32768 2.299 1782.02 0.855 74.84
4096 64 36864 2.356 1738.35 0.859 74.54
4096 64 40960 2.422 1691.15 0.869 73.63
4096 64 45056 2.477 1653.42 0.883 72.50
4096 64 49152 2.524 1623.10 0.889 72.01
4096 64 53248 2.578 1589.05 0.900 71.12
4096 64 57344 2.631 1556.56 0.912 70.16
4096 64 61440 2.695 1519.81 0.922 69.39
4096 64 65536 2.749 1490.06 0.939 68.13
4096 64 69632 2.799 1463.42 0.943 67.89
4096 64 73728 2.866 1429.07 0.955 67.03
4096 64 77824 2.913 1406.23 0.966 66.28
4096 64 81920 2.979 1374.82 0.972 65.82
4096 64 86016 3.051 1342.71 0.990 64.67
4096 64 90112 3.105 1318.98 0.996 64.29
4096 64 94208 3.175 1290.05 1.004 63.77
4096 64 98304 3.223 1270.68 1.021 62.69
4096 64 102400 3.314 1235.87 1.026 62.36
4096 64 106496 3.336 1227.80 1.036 61.77
4096 64 110592 3.401 1204.23 1.050 60.98
4096 64 114688 3.470 1180.32 1.057 60.55
4096 64 118784 3.531 1159.85 1.075 59.55
4096 64 122880 3.605 1136.15 1.081 59.20
4096 64 126976 3.639 1125.55 1.090 58.73
4096 64 131072 3.714 1102.89 1.107 57.80

PR1320 ik/fused_delta_net_2@0579a868 -sm graph

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 2.015 2032.48 0.910 70.34
4096 64 4096 2.034 2013.88 0.899 71.17
4096 64 8192 2.050 1998.27 0.901 71.06
4096 64 12288 2.073 1976.08 0.910 70.35
4096 64 16384 2.099 1951.35 0.921 69.51
4096 64 20480 2.121 1931.20 0.929 68.89
4096 64 24576 2.148 1906.83 0.941 68.01
4096 64 28672 2.174 1884.23 0.945 67.71
4096 64 32768 2.215 1849.06 0.949 67.41
4096 64 36864 2.231 1835.82 0.957 66.87
4096 64 40960 2.265 1808.31 0.962 66.52
4096 64 45056 2.300 1780.81 0.977 65.48
4096 64 49152 2.333 1755.74 0.978 65.44
4096 64 53248 2.369 1728.69 0.982 65.18
4096 64 57344 2.400 1706.65 0.988 64.80
4096 64 61440 2.439 1679.43 0.996 64.25
4096 64 65536 2.463 1663.00 1.012 63.27
4096 64 69632 2.502 1637.21 1.012 63.25
4096 64 73728 2.534 1616.42 1.014 63.13
4096 64 77824 2.561 1599.49 1.021 62.67
4096 64 81920 2.596 1578.05 1.026 62.35
4096 64 86016 2.628 1558.40 1.042 61.42
4096 64 90112 2.657 1541.65 1.047 61.11
4096 64 94208 2.692 1521.79 1.048 61.08
4096 64 98304 2.739 1495.33 1.053 60.79
4096 64 102400 2.775 1476.23 1.059 60.45
4096 64 106496 2.788 1468.92 1.061 60.31
4096 64 110592 2.824 1450.39 1.075 59.56
4096 64 114688 2.851 1436.56 1.078 59.34
4096 64 118784 2.888 1418.34 1.081 59.19
4096 64 122880 2.924 1400.62 1.088 58.84
4096 64 126976 2.949 1389.08 1.095 58.44
4096 64 131072 2.984 1372.64 1.105 57.94

PR1320 ik/fused_delta_net_2@0579a868 -sm layer -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -fdn 4096 \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 1.861 2200.59 0.688 92.97
4096 64 4096 1.912 2142.22 0.678 94.43
4096 64 8192 1.964 2085.05 0.688 92.97
4096 64 12288 2.026 2022.11 0.706 90.65
4096 64 16384 2.092 1957.71 0.711 89.99
4096 64 20480 2.156 1899.75 0.721 88.73
4096 64 24576 2.206 1857.13 0.737 86.85
4096 64 28672 2.263 1810.05 0.745 85.93
4096 64 32768 2.332 1756.65 0.763 83.85
4096 64 36864 2.394 1710.65 0.767 83.49
4096 64 40960 2.468 1659.43 0.776 82.49
4096 64 45056 2.505 1635.07 0.791 80.88
4096 64 49152 2.561 1599.20 0.799 80.14
4096 64 53248 2.618 1564.42 0.807 79.28
4096 64 57344 2.675 1531.19 0.821 77.94
4096 64 61440 2.744 1492.69 0.829 77.16
4096 64 65536 2.793 1466.77 0.844 75.81
4096 64 69632 2.838 1443.38 0.851 75.24
4096 64 73728 2.912 1406.77 0.859 74.48
4096 64 77824 2.971 1378.78 0.874 73.25
4096 64 81920 3.016 1358.25 0.881 72.63
4096 64 86016 3.082 1329.12 0.895 71.49
4096 64 90112 3.144 1302.82 0.904 70.77
4096 64 94208 3.203 1278.82 0.911 70.25
4096 64 98304 3.266 1254.01 0.926 69.12
4096 64 102400 3.329 1230.23 0.932 68.66
4096 64 106496 3.372 1214.88 0.942 67.94
4096 64 110592 3.460 1183.85 0.957 66.88
4096 64 114688 3.513 1165.87 0.964 66.39
4096 64 118784 3.544 1155.62 0.979 65.34
4096 64 122880 3.641 1125.06 0.986 64.89
4096 64 126976 3.688 1110.62 0.995 64.30
4096 64 131072 3.729 1098.32 1.011 63.31

PR1320 ik/fused_delta_net_2@0579a868 -sm graph -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -fdn 4096 \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 2.033 2014.60 0.807 79.28
4096 64 4096 2.057 1991.14 0.795 80.49
4096 64 8192 2.078 1970.85 0.800 80.05
4096 64 12288 2.105 1945.93 0.808 79.19
4096 64 16384 2.139 1914.84 0.818 78.28
4096 64 20480 2.176 1882.05 0.829 77.24
4096 64 24576 2.188 1872.31 0.842 76.04
4096 64 28672 2.229 1837.97 0.846 75.67
4096 64 32768 2.262 1810.50 0.850 75.28
4096 64 36864 2.294 1785.61 0.856 74.77
4096 64 40960 2.331 1756.96 0.863 74.19
4096 64 45056 2.367 1730.37 0.876 73.06
4096 64 49152 2.392 1712.40 0.879 72.85
4096 64 53248 2.430 1685.79 0.882 72.57
4096 64 57344 2.456 1667.85 0.887 72.12
4096 64 61440 2.502 1636.81 0.893 71.66
4096 64 65536 2.527 1620.73 0.907 70.56
4096 64 69632 2.566 1596.36 0.910 70.32
4096 64 73728 2.591 1580.79 0.912 70.14
4096 64 77824 2.622 1562.01 0.918 69.70
4096 64 81920 2.654 1543.58 0.923 69.31
4096 64 86016 2.696 1519.03 0.936 68.35
4096 64 90112 2.731 1499.84 0.940 68.08
4096 64 94208 2.754 1487.14 0.943 67.84
4096 64 98304 2.783 1471.97 0.948 67.51
4096 64 102400 2.831 1446.81 0.953 67.16
4096 64 106496 2.852 1436.01 0.958 66.83
4096 64 110592 2.890 1417.07 0.969 66.06
4096 64 114688 2.915 1405.34 0.971 65.90
4096 64 118784 2.960 1383.95 0.975 65.61
4096 64 122880 2.973 1377.96 0.981 65.23
4096 64 126976 2.997 1366.71 0.986 64.92
4096 64 131072 3.028 1352.88 0.997 64.22

ikawrakow merged commit 2616efa into main on Feb 26, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor and for those split by groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And demote to debug level the later display of the unnamed buffer override sizes.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench
