Conversation
I didn't gain a whooole lot but I did a little:
unlike every model I've tried in the past, increasing the batch size increased prompt processing speed here even with RTR enabled.
I didn't do a before/after of this PR, but I did include it while gathering some comparisons between ik and mainline forks. The first one is a 3-way CPU-only comparison between ik, mainline, and a mainline PR. The others were without their improved chunked delta net implementation, but include hybrid CPU+GPU and full single-GPU offload. Performance with ik is looking good on my gaming rig with the recent Qwen gated delta net models, thanks!
Thank you for these comparisons. Most of this is similar to what I see with my various configurations (but I tend to see a larger
In the Qwen3-Next and Qwen-3.5 series of models the number of recurrent `K` and `Q` attention heads is smaller than the number of `V` heads. To make them match the `V` heads, a `GGML_OP_REPEAT` operation is applied to the `K` and `Q` activations before invoking the linear delta-net, which comes at a non-negligible performance cost.

This PR removes the repetition and adds to the CPU and CUDA delta-net implementations the ability to deal with a different number of `K/Q` and `V` attention heads. Depending on compute configuration (CPU-only, CUDA-only, or hybrid GPU/CPU) we gain up to 5% in PP and up to 3% in TG performance.
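To illustrate the idea (not the actual CPU/CUDA kernels in this PR), here is a minimal NumPy sketch of one ungated delta-rule step where each `V` head indexes its shared `K/Q` head directly, instead of first materializing repeated `K`/`Q` tensors. The function name, shapes, and grouping convention (`V` head `h` maps to `K/Q` head `h // group`) are assumptions for the sake of the example:

```python
import numpy as np

def delta_net_step_no_repeat(q, k, v, state):
    """One recurrent delta-rule step with fewer K/Q heads than V heads.

    q, k: (n_kq_heads, d_k)
    v:    (n_v_heads, d_v)
    state: (n_v_heads, d_k, d_v) per-V-head recurrent state, updated in place.

    Rather than repeating q/k up to n_v_heads (the GGML_OP_REPEAT this PR
    removes), each V head looks up its shared K/Q head by index.
    """
    n_kq, n_v = k.shape[0], v.shape[0]
    assert n_v % n_kq == 0, "V heads must be a multiple of K/Q heads"
    group = n_v // n_kq
    out = np.empty_like(v)
    for h in range(n_v):
        hk = h // group              # shared K/Q head for this V head
        kh, qh = k[hk], q[hk]
        pred = kh @ state[h]         # current prediction for v[h], shape (d_v,)
        state[h] += np.outer(kh, v[h] - pred)   # delta-rule state update
        out[h] = qh @ state[h]       # readout with the shared query head
    return out
```

The output is identical to first repeating `q`/`k` to `n_v_heads` and running a one-head-per-head kernel; the indexed version just skips the extra tensor materialization, which is where the PP/TG gains come from.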