
Do not repeat yourself #1373

Merged

ikawrakow merged 4 commits into main from ik/delta_dry on Mar 6, 2026

Conversation

@ikawrakow
Owner

In the Qwen3-Next and Qwen-3.5 series of models, the number of K and Q attention heads in the recurrent (linear delta-net) layers is smaller than the number of V heads. To make them match the V heads, a GGML_OP_REPEAT operation is applied to the K and Q activations before invoking the linear delta-net, which comes at a non-negligible performance cost.

This PR removes the repetition and adds to the CPU and CUDA delta net implementations the ability to deal with a different number of K/Q and V attention heads.

Depending on compute configuration (CPU-only, CUDA-only, or hybrid GPU/CPU) we gain up to 5% in PP, and up to 3% in TG performance.

@Ph0rk0z

Ph0rk0z commented Mar 6, 2026

I didn't gain a whooole lot but I did a little:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 5.660 | 180.90 | 11.910 | 21.49 |
| 1024 | 256 | 1024 | 5.422 | 188.87 | 11.823 | 21.65 |
| 1024 | 256 | 2048 | 5.524 | 185.37 | 11.833 | 21.63 |
| 1024 | 256 | 3072 | 5.468 | 187.29 | 11.887 | 21.54 |
| 1024 | 256 | 4096 | 5.486 | 186.65 | 11.933 | 21.45 |
| 1024 | 256 | 5120 | 5.894 | 173.73 | 12.072 | 21.21 |
| 1024 | 256 | 6144 | 5.471 | 187.18 | 12.171 | 21.03 |
| 1024 | 256 | 7168 | 5.806 | 176.36 | 12.320 | 20.78 |
| 1024 | 256 | 8192 | 5.456 | 187.67 | 11.970 | 21.39 |

Unlike every model I've tried in the past, increasing the batch size increased prompt processing speed here, even with RTR enabled.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 10.310 | 198.65 | 23.116 | 22.15 |
| 2048 | 512 | 2048 | 10.343 | 198.00 | 23.330 | 21.95 |
| 2048 | 512 | 4096 | 10.349 | 197.89 | 23.496 | 21.79 |
| 2048 | 512 | 6144 | 10.467 | 195.66 | 23.715 | 21.59 |
| 2048 | 512 | 8192 | 10.418 | 196.58 | 23.904 | 21.42 |

@ikawrakow ikawrakow merged commit 277fc1d into main Mar 6, 2026
@ubergarm
Contributor

ubergarm commented Mar 6, 2026

I didn't do a before/after of this PR, but did include this PR while gathering some comparisons between ik and mainline forks.

The first one is a 3-way CPU-only comparison between ik, mainline, and a mainline PR with a chunked vector delta-net implementation:

The others were run without their improved chunked delta-net implementation, but include hybrid CPU+GPU and full single-GPU offload:

Performance with ik is looking good on my gaming rig with the recent Qwen gated delta-net models, thanks!

@ikawrakow
Owner Author

@ubergarm

Thank you for these comparisons. Most of it is similar to what I see with my various configurations (but I tend to see a larger ik_llama.cpp advantage on CUDA for long context). The most notable exception is CPU-only TG performance for Qwen-3.5-35B-A3B: there I see a nearly 3X difference between ik_llama.cpp and llama.cpp without their PR 19504 (31.5 t/s vs 11.4 t/s), vs your ~1.5X. That's on a Ryzen-3995WX, so maybe the higher memory bandwidth allows ik_llama.cpp to pull further ahead, while the older Zen2 core makes llama.cpp even more sluggish.

@magikRUKKOLA

magikRUKKOLA commented Mar 6, 2026

[EDIT]: note that the black line is not part of the main vs. PR comparison -- it's there only because I realized after the fact that I can actually use a batch size of 640 (the max VRAM consumption on the last GPU was about 23.9 GB).

[image: prefill-qwen35]

50 tps!

[image: decode-qwen35]

@magikRUKKOLA

magikRUKKOLA commented Mar 7, 2026

[image: qwen35-combined-prefill]

[image: qwen35-combined-decode]
