
Do not repeat yourself #1373

Merged

ikawrakow merged 4 commits into main from ik/delta_dry on Mar 6, 2026

Conversation

@ikawrakow
Owner

In the Qwen3-Next and Qwen-3.5 series of models, the number of K and Q attention heads in the recurrent (linear delta-net) layers is smaller than the number of V heads. To make them match the V heads, a GGML_OP_REPEAT operation is applied to the K and Q activations before invoking the linear delta-net, which comes at a non-negligible performance cost.

This PR removes the repetition and adds to the CPU and CUDA delta net implementations the ability to deal with a different number of K/Q and V attention heads.

Depending on compute configuration (CPU-only, CUDA-only, or hybrid GPU/CPU) we gain up to 5% in PP, and up to 3% in TG performance.

@Ph0rk0z

Ph0rk0z commented Mar 6, 2026

I didn't gain a whooole lot but I did a little:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 5.660 | 180.90 | 11.910 | 21.49 |
| 1024 | 256 | 1024 | 5.422 | 188.87 | 11.823 | 21.65 |
| 1024 | 256 | 2048 | 5.524 | 185.37 | 11.833 | 21.63 |
| 1024 | 256 | 3072 | 5.468 | 187.29 | 11.887 | 21.54 |
| 1024 | 256 | 4096 | 5.486 | 186.65 | 11.933 | 21.45 |
| 1024 | 256 | 5120 | 5.894 | 173.73 | 12.072 | 21.21 |
| 1024 | 256 | 6144 | 5.471 | 187.18 | 12.171 | 21.03 |
| 1024 | 256 | 7168 | 5.806 | 176.36 | 12.320 | 20.78 |
| 1024 | 256 | 8192 | 5.456 | 187.67 | 11.970 | 21.39 |

Unlike every model I've tried in the past, increasing the batch size increased prompt processing speed here, even with RTR enabled.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 10.310 | 198.65 | 23.116 | 22.15 |
| 2048 | 512 | 2048 | 10.343 | 198.00 | 23.330 | 21.95 |
| 2048 | 512 | 4096 | 10.349 | 197.89 | 23.496 | 21.79 |
| 2048 | 512 | 6144 | 10.467 | 195.66 | 23.715 | 21.59 |
| 2048 | 512 | 8192 | 10.418 | 196.58 | 23.904 | 21.42 |

@ikawrakow ikawrakow merged commit 277fc1d into main Mar 6, 2026
@ubergarm
Contributor

ubergarm commented Mar 6, 2026

I didn't do a before/after of this PR, but did include this PR while gathering some comparisons between ik and mainline forks.

The first one is a 3-way CPU-only comparison between ik, mainline, and a mainline PR with a chunked vector delta-net implementation:

The others were run without their improved chunked delta-net implementation, but include hybrid CPU+GPU and full single-GPU offload:

Performance with ik is looking good on my gaming rig with the recent Qwen gated delta-net models, thanks!

@ikawrakow
Owner Author

@ubergarm

Thank you for these comparisons. Most of it is similar to what I see with my various configurations (but I tend to see a larger ik_llama.cpp advantage on CUDA for long context). The most notable exception is CPU-only TG performance for Qwen-3.5-35B-A3B: there I see a nearly 3X difference between ik_llama.cpp and llama.cpp without their PR 19504 (31.5 t/s vs 11.4 t/s), vs your ~1.5X. That's on a Ryzen-3995WX, so maybe the higher memory bandwidth allows ik_llama.cpp to pull further ahead, while the older Zen2 core makes llama.cpp even more sluggish.

@magikRUKKOLA

magikRUKKOLA commented Mar 6, 2026

[EDIT]: note that the black line is not part of the main vs. PR comparison -- it's there only because I realized after the fact that I can actually use a batch size of 640 (the max VRAM consumption on the last GPU was about 23.9 GB).

[image: prefill-qwen35]

50 tps!

[image: decode-qwen35]

@magikRUKKOLA

magikRUKKOLA commented Mar 7, 2026

[image: qwen35-combined-prefill]

[image: qwen35-combined-decode]
