
Fused delta net 2 #1320

Merged
ikawrakow merged 15 commits into main from ik/fused_delta_net_2
Feb 26, 2026

Conversation

@ikawrakow (Owner)

This PR adds further optimizations to the CUDA fused delta net implementation.

For Qwen3-Next fully offloaded to the GPU, this results in ~6-7% better TG.

More importantly, fused delta net PP performance is now almost on par with the chunked implementation. Why is this more important? Because I can see a path forward to graph parallel with the fused delta net, while I consider the chunked implementation basically hopeless for graph parallel.
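For context, the delta rule recurrence that the fused kernel evaluates token by token can be sketched roughly as below. This is a hypothetical, heavily simplified pure-Python illustration; the actual CUDA kernels also handle gating, L2 normalization, multiple heads, and batching, and as I understand it the chunked implementation computes the same recurrence but batches the within-chunk algebra into matrix multiplications, which is why the two agree on results while differing in PP throughput.

```python
# Sketch of the delta rule state update (illustrative only):
#   S <- S + beta_t * k_t (v_t - S^T k_t)^T    state update, S is d_k x d_v
#   o_t = S^T q_t                              output for token t
def delta_net_fused(qs, ks, vs, betas, d_k, d_v):
    S = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for q, k, v, beta in zip(qs, ks, vs, betas):
        # S^T k : the state's current "prediction" for this key, length d_v
        kS = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
        # rank-1 correction of the state toward v, scaled by beta
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += beta * k[i] * (v[j] - kS[j])
        # read the state out with the query
        outs.append([sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)])
    return outs
```

The sequential dependence of S on every previous token is what makes the fused path attractive for graph parallel: the per-token update is small and local, whereas the chunked formulation ties large spans of tokens together.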

Anyway, the table below shows PP-X performance as a function of prompt length X for the chunked and fused delta net implementations on a 2x3090 system running Qwen3-Next fully offloaded to the GPUs:

| test  | t/s (chunked)   | t/s (fused, PR)  | Speedup |
|-------|-----------------|------------------|---------|
| pp2   | 69.66 ± 7.71    | 100.88 ± 11.92   | 1.448   |
| pp4   | 125.50 ± 2.95   | 174.56 ± 4.12    | 1.391   |
| pp8   | 228.45 ± 5.92   | 296.60 ± 7.67    | 1.298   |
| pp16  | 389.50 ± 24.63  | 463.12 ± 30.34   | 1.189   |
| pp32  | 626.37 ± 14.78  | 679.66 ± 16.18   | 1.085   |
| pp64  | 915.86 ± 17.08  | 902.35 ± 16.38   | 0.985   |
| pp128 | 1056.49 ± 16.53 | 1001.80 ± 69.70  | 0.948   |
| pp256 | 1529.79 ± 17.49 | 1422.41 ± 14.17  | 0.930   |
| pp512 | 2014.66 ± 16.04 | 1833.90 ± 8.86   | 0.910   |
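The Speedup column is simply the fused t/s divided by the chunked t/s; checking the pp2 row from the table above:

```python
# Speedup = fused t/s / chunked t/s, here for the pp2 row
chunked, fused = 69.66, 100.88
print(f"{fused / chunked:.3f}")  # -> 1.448
```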

In comparison, here is what we had on the main branch:

| test  | t/s (chunked)   | t/s (fused, main) | Speedup |
|-------|-----------------|-------------------|---------|
| pp2   | 69.66 ± 7.71    | 97.82 ± 11.44     | 1.404   |
| pp4   | 125.50 ± 2.95   | 164.99 ± 3.71     | 1.315   |
| pp8   | 228.45 ± 5.92   | 272.36 ± 6.22     | 1.192   |
| pp16  | 389.50 ± 24.63  | 405.20 ± 24.37    | 1.040   |
| pp32  | 626.37 ± 14.78  | 562.87 ± 12.07    | 0.899   |
| pp64  | 915.86 ± 17.08  | 710.33 ± 9.63     | 0.775   |
| pp128 | 1056.49 ± 16.53 | 773.17 ± 8.34     | 0.732   |
| pp256 | 1529.79 ± 17.49 | 998.91 ± 6.90     | 0.653   |
| pp512 | 2014.66 ± 16.04 | 1186.01 ± 5.37    | 0.589   |

@magikRUKKOLA

Qwen3.5 IQ2_KL 8x3090:

+ ~2% (decode)

Details
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 4.641 882.66 26.550 38.57
4096 1024 4096 4.759 860.76 26.633 38.45
4096 1024 8192 4.879 839.56 27.088 37.80
4096 1024 12288 5.021 815.70 27.479 37.27
4096 1024 16384 5.148 795.64 27.979 36.60
4096 1024 20480 5.276 776.28 28.165 36.36
4096 1024 24576 5.414 756.50 28.522 35.90
4096 1024 28672 5.531 740.56 29.032 35.27
4096 1024 32768 5.669 722.48 29.308 34.94
4096 1024 36864 5.789 707.51 29.922 34.22
4096 1024 40960 5.917 692.24 30.066 34.06
4096 1024 45056 6.047 677.40 30.414 33.67
4096 1024 49152 6.172 663.66 30.992 33.04
4096 1024 53248 6.291 651.12 31.258 32.76
4096 1024 57344 6.423 637.74 31.467 32.54
4096 1024 61440 6.580 622.54 31.938 32.06
4096 1024 65536 6.693 611.96 32.208 31.79
4096 1024 69632 6.812 601.32 32.743 31.27
4096 1024 73728 6.937 590.49 33.201 30.84
4096 1024 77824 7.068 579.48 33.366 30.69
4096 1024 81920 7.190 569.69 33.702 30.38
4096 1024 86016 7.336 558.33 34.125 30.01
4096 1024 90112 7.458 549.24 34.843 29.39
4096 1024 94208 7.584 540.07 35.088 29.18
4096 1024 98304 7.722 530.44 35.141 29.14

IQ4_KSS (DDR4 + 2x3090):

+ ~1.3% (decode)

Details
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 1024 0 13.340 307.04 54.068 18.94
4096 1024 4096 13.557 302.13 53.770 19.04
4096 1024 8192 13.788 297.07 54.150 18.91
4096 1024 12288 13.848 295.79 55.297 18.52
4096 1024 16384 13.771 297.45 55.258 18.53
4096 1024 20480 14.047 291.60 56.212 18.22
4096 1024 24576 14.311 286.22 55.819 18.34

References: previous test: #1315 (comment)

@ubergarm (Contributor) commented Feb 25, 2026

I only tried with and without -fdn 4096, which is likely much too high given the original PR #1315 used -fdn 16... Not sure of the best values to try for CUDA offload or hybrid CPU. (A quick -fdn 16 test gave results similar to -fdn 4096, FWIW.)

It still shows that this PR achieves the fastest TG speeds.
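For reference, -fdn is a threshold rather than a boolean (per the "Change meaning of fdn from bool flag to threshold value" commit in this PR). A plausible reading of the dispatch, sketched here as an assumption rather than the actual code, is:

```python
def pick_delta_net_impl(n_tokens: int, fdn_threshold: int) -> str:
    # Hypothetical dispatch rule: batches up to the -fdn threshold take the
    # fused delta-net path; larger batches fall back to the chunked one.
    return "fused" if n_tokens <= fdn_threshold else "chunked"
```

Under this reading, TG (single-token) batches take the fused path at any positive threshold, which would explain why -fdn 16 and -fdn 4096 gave similar TG numbers here.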

sweep-bench-Qwen3-Coder-Next-PR1320
Details

main@216f4436 -sm layer

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 1.859 2203.45 0.780 82.01
4096 64 4096 1.899 2156.44 0.770 83.07
4096 64 8192 1.941 2110.53 0.780 82.00
4096 64 12288 1.995 2053.33 0.800 80.03
4096 64 16384 2.061 1987.05 0.806 79.42
4096 64 20480 2.129 1923.86 0.816 78.39
4096 64 24576 2.181 1877.89 0.830 77.13
4096 64 28672 2.239 1829.02 0.840 76.15
4096 64 32768 2.300 1780.82 0.856 74.73
4096 64 36864 2.358 1736.76 0.859 74.46
4096 64 40960 2.414 1696.51 0.871 73.50
4096 64 45056 2.474 1655.46 0.885 72.33
4096 64 49152 2.499 1639.05 0.891 71.79
4096 64 53248 2.584 1585.19 0.902 70.94
4096 64 57344 2.630 1557.45 0.913 70.11
4096 64 61440 2.701 1516.64 0.922 69.40
4096 64 65536 2.748 1490.37 0.939 68.18
4096 64 69632 2.798 1463.95 0.944 67.83
4096 64 73728 2.864 1430.01 0.953 67.12
4096 64 77824 2.915 1404.91 0.966 66.26
4096 64 81920 2.979 1374.98 0.972 65.82
4096 64 86016 3.043 1346.19 0.989 64.69
4096 64 90112 3.098 1321.97 0.996 64.26
4096 64 94208 3.158 1297.09 1.005 63.70
4096 64 98304 3.234 1266.48 1.021 62.66
4096 64 102400 3.298 1241.83 1.027 62.34
4096 64 106496 3.348 1223.50 1.036 61.78
4096 64 110592 3.413 1200.13 1.050 60.96
4096 64 114688 3.468 1181.14 1.057 60.55
4096 64 118784 3.534 1159.11 1.076 59.48
4096 64 122880 3.588 1141.74 1.083 59.12
4096 64 126976 3.650 1122.22 1.091 58.66
4096 64 131072 3.721 1100.84 1.106 57.86

main@216f4436 -sm graph

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 2.029 2019.19 0.911 70.22
4096 64 4096 2.051 1996.76 0.902 70.98
4096 64 8192 2.067 1982.04 0.902 70.99
4096 64 12288 2.093 1956.92 0.912 70.15
4096 64 16384 2.118 1933.81 0.923 69.30
4096 64 20480 2.152 1903.54 0.933 68.60
4096 64 24576 2.173 1884.80 0.946 67.65
4096 64 28672 2.210 1853.21 0.949 67.42
4096 64 32768 2.233 1833.92 0.954 67.07
4096 64 36864 2.264 1809.25 0.961 66.58
4096 64 40960 2.295 1785.01 0.967 66.19
4096 64 45056 2.329 1758.68 0.980 65.34
4096 64 49152 2.354 1739.72 0.985 65.01
4096 64 53248 2.385 1717.45 0.990 64.66
4096 64 57344 2.421 1692.08 0.993 64.42
4096 64 61440 2.467 1659.99 0.996 64.24
4096 64 65536 2.500 1638.72 1.012 63.22
4096 64 69632 2.512 1630.65 1.013 63.17
4096 64 73728 2.546 1608.83 1.019 62.82
4096 64 77824 2.581 1586.70 1.028 62.24
4096 64 81920 2.615 1566.27 1.030 62.12
4096 64 86016 2.644 1549.41 1.043 61.37
4096 64 90112 2.683 1526.80 1.049 61.00
4096 64 94208 2.711 1511.01 1.051 60.89
4096 64 98304 2.756 1486.23 1.055 60.67
4096 64 102400 2.762 1483.11 1.060 60.40
4096 64 106496 2.810 1457.76 1.064 60.13
4096 64 110592 2.838 1443.03 1.075 59.53
4096 64 114688 2.870 1427.25 1.080 59.27
4096 64 118784 2.898 1413.41 1.082 59.15
4096 64 122880 2.928 1398.78 1.088 58.85
4096 64 126976 2.954 1386.45 1.093 58.53
4096 64 131072 2.989 1370.35 1.103 58.03

main@216f4436 -sm layer -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -fdn 4096 \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 3.207 1277.01 0.725 88.24
4096 64 4096 3.245 1262.07 0.715 89.56
4096 64 8192 3.294 1243.32 0.724 88.39
4096 64 12288 3.357 1220.19 0.742 86.22
4096 64 16384 3.407 1202.30 0.748 85.60
4096 64 20480 3.469 1180.74 0.756 84.63
4096 64 24576 3.531 1160.07 0.771 83.01
4096 64 28672 3.587 1141.76 0.782 81.81
4096 64 32768 3.661 1118.96 0.800 80.02
4096 64 36864 3.728 1098.58 0.803 79.70
4096 64 40960 3.792 1080.14 0.816 78.42
4096 64 45056 3.837 1067.57 0.830 77.07
4096 64 49152 3.877 1056.58 0.835 76.64
4096 64 53248 3.943 1038.91 0.846 75.63
4096 64 57344 4.008 1022.03 0.859 74.50
4096 64 61440 4.077 1004.76 0.866 73.94
4096 64 65536 4.114 995.53 0.883 72.45
4096 64 69632 4.180 979.92 0.890 71.94
4096 64 73728 4.263 960.76 0.897 71.37
4096 64 77824 4.313 949.59 0.912 70.21
4096 64 81920 4.352 941.17 0.919 69.62
4096 64 86016 4.424 925.85 0.932 68.64
4096 64 90112 4.495 911.28 0.941 68.03
4096 64 94208 4.539 902.43 0.949 67.40
4096 64 98304 4.622 886.24 0.966 66.26
4096 64 102400 4.699 871.69 0.973 65.75
4096 64 106496 4.740 864.21 0.982 65.17
4096 64 110592 4.798 853.77 0.995 64.29
4096 64 114688 4.865 841.97 1.003 63.79
4096 64 118784 4.905 835.07 1.019 62.81
4096 64 122880 4.962 825.47 1.024 62.49
4096 64 126976 5.042 812.43 1.035 61.83
4096 64 131072 5.089 804.90 1.048 61.06

main@216f4436 -sm graph -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -fdn 4096 \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 3.382 1211.08 0.843 75.93
4096 64 4096 3.422 1197.02 0.829 77.23
4096 64 8192 3.440 1190.69 0.835 76.64
4096 64 12288 3.460 1183.89 0.846 75.69
4096 64 16384 3.492 1172.95 0.854 74.92
4096 64 20480 3.527 1161.37 0.865 74.00
4096 64 24576 3.550 1153.92 0.878 72.89
4096 64 28672 3.586 1142.27 0.880 72.70
4096 64 32768 3.625 1129.93 0.884 72.37
4096 64 36864 3.673 1115.07 0.893 71.65
4096 64 40960 3.694 1108.94 0.900 71.10
4096 64 45056 3.727 1099.00 0.913 70.10
4096 64 49152 3.763 1088.38 0.916 69.88
4096 64 53248 3.787 1081.52 0.921 69.52
4096 64 57344 3.827 1070.34 0.925 69.21
4096 64 61440 3.864 1060.04 0.931 68.75
4096 64 65536 3.888 1053.56 0.943 67.88
4096 64 69632 3.932 1041.74 0.948 67.51
4096 64 73728 3.956 1035.34 0.949 67.41
4096 64 77824 3.979 1029.47 0.956 66.93
4096 64 81920 4.008 1022.03 0.960 66.66
4096 64 86016 4.038 1014.38 0.973 65.81
4096 64 90112 4.075 1005.22 0.975 65.65
4096 64 94208 4.123 993.42 0.978 65.42
4096 64 98304 4.135 990.68 0.983 65.12
4096 64 102400 4.194 976.59 0.989 64.73
4096 64 106496 4.224 969.74 0.995 64.30
4096 64 110592 4.256 962.30 1.007 63.54
4096 64 114688 4.282 956.66 1.010 63.38
4096 64 118784 4.304 951.68 1.014 63.14
4096 64 122880 4.349 941.81 1.019 62.79
4096 64 126976 4.365 938.40 1.024 62.50
4096 64 131072 4.411 928.59 1.036 61.75

PR1320 ik/fused_delta_net_2@0579a868 -sm layer

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 1.858 2205.07 0.780 82.03
4096 64 4096 1.896 2159.80 0.769 83.21
4096 64 8192 1.946 2104.85 0.781 81.95
4096 64 12288 2.002 2046.02 0.799 80.06
4096 64 16384 2.057 1990.99 0.806 79.43
4096 64 20480 2.123 1929.44 0.816 78.48
4096 64 24576 2.172 1885.83 0.830 77.11
4096 64 28672 2.229 1837.29 0.840 76.19
4096 64 32768 2.299 1782.02 0.855 74.84
4096 64 36864 2.356 1738.35 0.859 74.54
4096 64 40960 2.422 1691.15 0.869 73.63
4096 64 45056 2.477 1653.42 0.883 72.50
4096 64 49152 2.524 1623.10 0.889 72.01
4096 64 53248 2.578 1589.05 0.900 71.12
4096 64 57344 2.631 1556.56 0.912 70.16
4096 64 61440 2.695 1519.81 0.922 69.39
4096 64 65536 2.749 1490.06 0.939 68.13
4096 64 69632 2.799 1463.42 0.943 67.89
4096 64 73728 2.866 1429.07 0.955 67.03
4096 64 77824 2.913 1406.23 0.966 66.28
4096 64 81920 2.979 1374.82 0.972 65.82
4096 64 86016 3.051 1342.71 0.990 64.67
4096 64 90112 3.105 1318.98 0.996 64.29
4096 64 94208 3.175 1290.05 1.004 63.77
4096 64 98304 3.223 1270.68 1.021 62.69
4096 64 102400 3.314 1235.87 1.026 62.36
4096 64 106496 3.336 1227.80 1.036 61.77
4096 64 110592 3.401 1204.23 1.050 60.98
4096 64 114688 3.470 1180.32 1.057 60.55
4096 64 118784 3.531 1159.85 1.075 59.55
4096 64 122880 3.605 1136.15 1.081 59.20
4096 64 126976 3.639 1125.55 1.090 58.73
4096 64 131072 3.714 1102.89 1.107 57.80

PR1320 ik/fused_delta_net_2@0579a868 -sm graph

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 2.015 2032.48 0.910 70.34
4096 64 4096 2.034 2013.88 0.899 71.17
4096 64 8192 2.050 1998.27 0.901 71.06
4096 64 12288 2.073 1976.08 0.910 70.35
4096 64 16384 2.099 1951.35 0.921 69.51
4096 64 20480 2.121 1931.20 0.929 68.89
4096 64 24576 2.148 1906.83 0.941 68.01
4096 64 28672 2.174 1884.23 0.945 67.71
4096 64 32768 2.215 1849.06 0.949 67.41
4096 64 36864 2.231 1835.82 0.957 66.87
4096 64 40960 2.265 1808.31 0.962 66.52
4096 64 45056 2.300 1780.81 0.977 65.48
4096 64 49152 2.333 1755.74 0.978 65.44
4096 64 53248 2.369 1728.69 0.982 65.18
4096 64 57344 2.400 1706.65 0.988 64.80
4096 64 61440 2.439 1679.43 0.996 64.25
4096 64 65536 2.463 1663.00 1.012 63.27
4096 64 69632 2.502 1637.21 1.012 63.25
4096 64 73728 2.534 1616.42 1.014 63.13
4096 64 77824 2.561 1599.49 1.021 62.67
4096 64 81920 2.596 1578.05 1.026 62.35
4096 64 86016 2.628 1558.40 1.042 61.42
4096 64 90112 2.657 1541.65 1.047 61.11
4096 64 94208 2.692 1521.79 1.048 61.08
4096 64 98304 2.739 1495.33 1.053 60.79
4096 64 102400 2.775 1476.23 1.059 60.45
4096 64 106496 2.788 1468.92 1.061 60.31
4096 64 110592 2.824 1450.39 1.075 59.56
4096 64 114688 2.851 1436.56 1.078 59.34
4096 64 118784 2.888 1418.34 1.081 59.19
4096 64 122880 2.924 1400.62 1.088 58.84
4096 64 126976 2.949 1389.08 1.095 58.44
4096 64 131072 2.984 1372.64 1.105 57.94

PR1320 ik/fused_delta_net_2@0579a868 -sm layer -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -fdn 4096 \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 1.861 2200.59 0.688 92.97
4096 64 4096 1.912 2142.22 0.678 94.43
4096 64 8192 1.964 2085.05 0.688 92.97
4096 64 12288 2.026 2022.11 0.706 90.65
4096 64 16384 2.092 1957.71 0.711 89.99
4096 64 20480 2.156 1899.75 0.721 88.73
4096 64 24576 2.206 1857.13 0.737 86.85
4096 64 28672 2.263 1810.05 0.745 85.93
4096 64 32768 2.332 1756.65 0.763 83.85
4096 64 36864 2.394 1710.65 0.767 83.49
4096 64 40960 2.468 1659.43 0.776 82.49
4096 64 45056 2.505 1635.07 0.791 80.88
4096 64 49152 2.561 1599.20 0.799 80.14
4096 64 53248 2.618 1564.42 0.807 79.28
4096 64 57344 2.675 1531.19 0.821 77.94
4096 64 61440 2.744 1492.69 0.829 77.16
4096 64 65536 2.793 1466.77 0.844 75.81
4096 64 69632 2.838 1443.38 0.851 75.24
4096 64 73728 2.912 1406.77 0.859 74.48
4096 64 77824 2.971 1378.78 0.874 73.25
4096 64 81920 3.016 1358.25 0.881 72.63
4096 64 86016 3.082 1329.12 0.895 71.49
4096 64 90112 3.144 1302.82 0.904 70.77
4096 64 94208 3.203 1278.82 0.911 70.25
4096 64 98304 3.266 1254.01 0.926 69.12
4096 64 102400 3.329 1230.23 0.932 68.66
4096 64 106496 3.372 1214.88 0.942 67.94
4096 64 110592 3.460 1183.85 0.957 66.88
4096 64 114688 3.513 1165.87 0.964 66.39
4096 64 118784 3.544 1155.62 0.979 65.34
4096 64 122880 3.641 1125.06 0.986 64.89
4096 64 126976 3.688 1110.62 0.995 64.30
4096 64 131072 3.729 1098.32 1.011 63.31

PR1320 ik/fused_delta_net_2@0579a868 -sm graph -fdn 4096

CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -fdn 4096 \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 2.033 2014.60 0.807 79.28
4096 64 4096 2.057 1991.14 0.795 80.49
4096 64 8192 2.078 1970.85 0.800 80.05
4096 64 12288 2.105 1945.93 0.808 79.19
4096 64 16384 2.139 1914.84 0.818 78.28
4096 64 20480 2.176 1882.05 0.829 77.24
4096 64 24576 2.188 1872.31 0.842 76.04
4096 64 28672 2.229 1837.97 0.846 75.67
4096 64 32768 2.262 1810.50 0.850 75.28
4096 64 36864 2.294 1785.61 0.856 74.77
4096 64 40960 2.331 1756.96 0.863 74.19
4096 64 45056 2.367 1730.37 0.876 73.06
4096 64 49152 2.392 1712.40 0.879 72.85
4096 64 53248 2.430 1685.79 0.882 72.57
4096 64 57344 2.456 1667.85 0.887 72.12
4096 64 61440 2.502 1636.81 0.893 71.66
4096 64 65536 2.527 1620.73 0.907 70.56
4096 64 69632 2.566 1596.36 0.910 70.32
4096 64 73728 2.591 1580.79 0.912 70.14
4096 64 77824 2.622 1562.01 0.918 69.70
4096 64 81920 2.654 1543.58 0.923 69.31
4096 64 86016 2.696 1519.03 0.936 68.35
4096 64 90112 2.731 1499.84 0.940 68.08
4096 64 94208 2.754 1487.14 0.943 67.84
4096 64 98304 2.783 1471.97 0.948 67.51
4096 64 102400 2.831 1446.81 0.953 67.16
4096 64 106496 2.852 1436.01 0.958 66.83
4096 64 110592 2.890 1417.07 0.969 66.06
4096 64 114688 2.915 1405.34 0.971 65.90
4096 64 118784 2.960 1383.95 0.975 65.61
4096 64 122880 2.973 1377.96 0.981 65.23
4096 64 126976 2.997 1366.71 0.986 64.92
4096 64 131072 3.028 1352.88 0.997 64.22

ikawrakow merged commit 2616efa into main on Feb 26, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor and for those split by groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And demote to debug level the later display of the unnamed buffer override sizes.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench
