
Minor delta-net tweak #1308

Merged: ikawrakow merged 3 commits into main from ik/minor_delta_tweak on Feb 24, 2026

Conversation

@ikawrakow
Owner

Use the fused multiply-unary op instead of SILU->MUL.

Worth ~1% better TG performance for Qwen3-Next with full GPU offload.
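For context, here is a minimal numpy sketch (function names are illustrative, not the actual ggml kernels) of what the fusion replaces: instead of a SILU op that materializes an intermediate tensor which a MUL op then reads back, a single fused multiply-unary op computes `silu(gate) * up` in one pass.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def silu_then_mul(gate, up):
    # Unfused path: SILU writes an intermediate tensor, MUL reads it back.
    tmp = silu(gate)
    return tmp * up

def fused_mul_silu(gate, up):
    # Fused multiply-unary op: one kernel, no intermediate tensor traffic.
    return silu(gate) * up

gate = np.random.randn(4, 8).astype(np.float32)
up = np.random.randn(4, 8).astype(np.float32)
# The two paths are numerically identical; the win is memory traffic
# and kernel-launch overhead, hence the ~1% TG uplift.
assert np.allclose(silu_then_mul(gate, up), fused_mul_silu(gate, up))
```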

@ubergarm
Contributor

ubergarm commented Feb 23, 2026

I merged this into PR1307 and tested; it does indeed look like a slight uplift in TG.

[Plot: sweep-bench-Qwen3-Coder-Next-PR1308]

Details:

-sm graph PR1307 ik/graph_parallel_tweak@35da97d5

```bash
model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
```
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 64 0 2.019 2028.63 0.910 70.30
4096 64 4096 2.038 2010.05 0.897 71.35
4096 64 8192 2.051 1996.71 0.898 71.28
4096 64 12288 2.068 1980.88 0.906 70.62
4096 64 16384 2.096 1954.19 0.918 69.73
4096 64 20480 2.125 1927.30 0.929 68.91
4096 64 24576 2.150 1905.39 0.941 68.05
4096 64 28672 2.187 1872.52 0.947 67.60
4096 64 32768 2.215 1848.85 0.951 67.29
4096 64 36864 2.230 1836.76 0.958 66.83
4096 64 40960 2.267 1807.15 0.961 66.63
4096 64 45056 2.300 1780.66 0.974 65.68
4096 64 49152 2.326 1760.63 0.977 65.50
4096 64 53248 2.361 1735.13 0.984 65.03
4096 64 57344 2.401 1705.98 0.991 64.56
4096 64 61440 2.428 1687.29 0.994 64.37
4096 64 65536 2.458 1666.43 1.011 63.31
4096 64 69632 2.491 1644.53 1.012 63.26
4096 64 73728 2.525 1622.12 1.015 63.06
4096 64 77824 2.561 1599.08 1.025 62.47
4096 64 81920 2.596 1577.83 1.027 62.29
4096 64 86016 2.637 1553.41 1.042 61.42
4096 64 90112 2.664 1537.81 1.045 61.23
4096 64 94208 2.711 1511.03 1.047 61.13
4096 64 98304 2.722 1504.53 1.051 60.91
4096 64 102400 2.768 1479.87 1.058 60.48
4096 64 106496 2.792 1467.23 1.061 60.30
4096 64 110592 2.834 1445.06 1.075 59.56
4096 64 114688 2.856 1434.32 1.078 59.34
4096 64 118784 2.882 1421.25 1.082 59.16
4096 64 122880 2.930 1397.73 1.086 58.91
4096 64 126976 2.947 1390.05 1.094 58.50
4096 64 131072 2.976 1376.35 1.104 57.96

-sm graph PR1307 ik/graph_parallel_tweak@35da97d5 + PR1308 ik/minor_delta_tweak

```bash
model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
```
PP    TG    N_KV    T_PP(s)    S_PP(t/s)    T_TG(s)    S_TG(t/s)
4096 64 0 2.013 2034.57 0.908 70.50
4096 64 4096 2.032 2015.96 0.898 71.29
4096 64 8192 2.044 2004.38 0.896 71.39
4096 64 12288 2.062 1986.57 0.904 70.81
4096 64 16384 2.086 1963.28 0.914 70.02
4096 64 20480 2.119 1933.29 0.924 69.30
4096 64 24576 2.143 1911.05 0.939 68.18
4096 64 28672 2.170 1887.18 0.941 68.00
4096 64 32768 2.214 1850.03 0.947 67.57
4096 64 36864 2.233 1834.43 0.954 67.08
4096 64 40960 2.263 1809.96 0.959 66.73
4096 64 45056 2.304 1777.41 0.971 65.90
4096 64 49152 2.327 1760.31 0.976 65.60
4096 64 53248 2.363 1733.35 0.981 65.26
4096 64 57344 2.403 1704.28 0.986 64.89
4096 64 61440 2.425 1688.82 0.992 64.48
4096 64 65536 2.469 1659.19 1.006 63.61
4096 64 69632 2.483 1649.55 1.009 63.44
4096 64 73728 2.523 1623.29 1.013 63.19
4096 64 77824 2.557 1602.15 1.018 62.87
4096 64 81920 2.589 1581.84 1.022 62.64
4096 64 86016 2.626 1559.89 1.033 61.93
4096 64 90112 2.660 1539.87 1.041 61.48
4096 64 94208 2.687 1524.65 1.042 61.39
4096 64 98304 2.723 1504.02 1.047 61.11
4096 64 102400 2.754 1487.31 1.055 60.68
4096 64 106496 2.789 1468.45 1.058 60.48
4096 64 110592 2.817 1453.83 1.071 59.75
4096 64 114688 2.857 1433.48 1.073 59.64
4096 64 118784 2.893 1416.00 1.077 59.41
4096 64 122880 2.908 1408.64 1.082 59.13
4096 64 126976 2.954 1386.62 1.088 58.82
4096 64 131072 2.967 1380.35 1.097 58.35
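As a rough sanity check on the "slight uplift", one can compare the S_TG columns of the two runs at a few sampled depths (values taken directly from the tables above; the averaging itself is just an illustration, not part of the PR):

```python
# Sampled S_TG (t/s) at N_KV = 0, 65536, 131072 from the two sweeps above.
baseline = [70.30, 63.31, 57.96]   # PR1307 only
patched  = [70.50, 63.61, 58.35]   # PR1307 + PR1308

uplift = [100.0 * (p - b) / b for b, p in zip(baseline, patched)]
mean_uplift = sum(uplift) / len(uplift)
# Averages to roughly half a percent over these depths, growing with
# context, consistent with the claimed ~1% at full GPU offload.
print(f"mean TG uplift over sampled depths: {mean_uplift:.2f}%")
```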

@magikRUKKOLA

magikRUKKOLA commented Feb 23, 2026

// offtopic

But does that LLM (Qwen3-Coder-Next) actually work? I prefilled 120k+ tokens of a long conversation about some bugs and how to fix them. Here is what I got as a response:

Assistant:
rei-13c? No, the original implementation didn't handle the case properly. Let me fix the logic properly and ensure the correct behavior. Let me fix the logic properly and ensure the correct behavior. Let me fix the logic properly and ensure the correct behavior. Let me fix the logic properly and ensure the correct

And that's it. 🤣

@ikawrakow ikawrakow merged commit 38ca19d into main Feb 24, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
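A minimal sketch of why the index matters when a batch of prompts fans out to server slots (hypothetical dict-based structures, not the actual ik_llama.cpp server code, which sets `res->index = slot.task->index`):

```python
# Hypothetical mini-model of batched /v1/completions responses.
def make_responses(prompts, task_indices, set_index=True):
    responses = []
    for prompt, idx in zip(prompts, task_indices):
        res = {"text": prompt.upper()}          # stand-in for generation
        res["index"] = idx if set_index else 0  # the bug: index left at 0
        responses.append(res)
    return responses

prompts = ["a", "b", "c"]
buggy = make_responses(prompts, [0, 1, 2], set_index=False)
fixed = make_responses(prompts, [0, 1, 2], set_index=True)
# Buggy: every response claims index 0, so the client cannot match
# results back to prompts. Fixed: indices line up with request order.
assert [r["index"] for r in buggy] == [0, 0, 0]
assert [r["index"] for r in fixed] == [0, 1, 2]
```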

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs split tensor by tensor as well as in groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

Also demote to debug level the later display of the unnamed buffer override sizes.

Ex: `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
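The bytes-to-MiB change in that commit amounts to a simple unit conversion; e.g. for the 668467200-byte tensor quoted above:

```python
# Convert the raw byte size from the example log line to MiB (2^20 bytes).
size_bytes = 668_467_200  # blk.60.ffn_up_exps.weight from the example above
size_mib = size_bytes / (1024 * 1024)
print(f"size = {size_mib:.2f} MiB")  # prints "size = 637.50 MiB"
```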

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add the same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split into 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench