Conversation
This reverts commit edadb80.
It was single-threaded and was taking ~25% of the computation time during TG. It is now down to 2%. Strangely enough, I measure 13.6 t/s with llama-bench, but if I let the model give me an actual response with llama-cli, I get close to 17 t/s.
For Qwen3Next there is a scale op on a largish tensor (548k elements) that has a single row for TG, so it was done in a single thread. We now simply split the work into blocks of 1024 elements.
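The idea can be sketched as follows. This is a minimal illustration, not the actual ggml code (ggml uses its own thread pool and op scheduling); the point is that partitioning by fixed-size element blocks instead of by rows keeps every thread busy even when the tensor has a single row:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Sketch: block-wise parallel scale. With row-based partitioning, a
// single-row tensor lands entirely on one thread; with 1024-element
// blocks, all threads share the 548k elements.
constexpr int64_t kBlockSize = 1024;

void scale_blocked(float * data, int64_t n_elements, float s, int n_threads) {
    const int64_t n_blocks = (n_elements + kBlockSize - 1) / kBlockSize;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            // Thread t handles blocks t, t + n_threads, t + 2*n_threads, ...
            for (int64_t b = t; b < n_blocks; b += n_threads) {
                const int64_t i0 = b * kBlockSize;
                const int64_t i1 = std::min(i0 + kBlockSize, n_elements);
                for (int64_t i = i0; i < i1; ++i) data[i] *= s;
            }
        });
    }
    for (auto & w : workers) w.join();
}
```

The blocks are disjoint, so no synchronization is needed beyond the final join.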
|
THANK YOU, Ikawrakow and YurkoHoshko!!! An 80B model running at 12.7 t/s, CPU-only on my dual Xeon!!!!! Christmas came early this year!!! 👍 |
|
CPU Perplexity Regression — DeltaNet (Qwen3-Coder-Next Q4_K_M)

We've been digging into the CPU backend performance on DeltaNet models and found a significant PPL gap we can't fully explain. Sharing our findings in case you can spot what we're missing — we're still learning the IK fork's ggml internals.

Setup: Qwen3-Coder-Next Q4_K_M, wikitext-2-raw, 2 chunks (512 tokens)
Two issues we see:
We've eliminated IQK mul_mat, cont fusion, mul broadcast, repeat fast paths, and the PR-specific ggml.c changes (sub revert, repeat ne[0]==1, fused delta-net op removal) as causes. The regression appears to come from pre-existing IK main branch ggml.c differences vs ggml-org. Is this a known issue with DeltaNet on CPU, or can you point us toward what we should be looking at? Happy to run any tests you'd find useful — we have the model and benchmark set up on a dedicated machine. Don't want to keep chasing this if it's already on your radar or if we're fundamentally misunderstanding something. |
Needs a non-contiguous variant of sum_rows. On the CPU this gave a 30+% improvement in TG performance; on CUDA it is a disappointing 6-7%. I guess this is because Georgi's cont CPU implementation was so bad that skipping it made such a big difference.
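A non-contiguous sum_rows can be sketched like this. This is a simplified 2D illustration of the idea, not the actual ik_llama.cpp implementation: by reading through the row byte stride (ggml's nb convention) directly, the preceding cont copy that would otherwise make the tensor contiguous becomes unnecessary:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: sum each row of a 2D float tensor whose rows are not tightly
// packed. ne0 = elements per row, ne1 = number of rows, nb1 = byte
// stride between rows (may exceed ne0 * sizeof(float)).
void sum_rows_noncont(const char * src, float * dst,
                      int64_t ne0, int64_t ne1, size_t nb1) {
    for (int64_t i1 = 0; i1 < ne1; ++i1) {
        const float * row = (const float *)(src + i1 * nb1);
        float sum = 0.0f;
        for (int64_t i0 = 0; i0 < ne0; ++i0) sum += row[i0];
        dst[i1] = sum;
    }
}
```

For a contiguous tensor, nb1 == ne0 * sizeof(float) and this reduces to the ordinary case, so one code path can serve both.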
Worth 1% in TG
|
@ProgenyAlpha Yes, I know there is still something wrong in the chunked delta-net implementation on the CPU. I said that in the PR description, no? But the PR is still usable because:
I have checked all ops in the chunked delta-net part, and I don't immediately see anything wrong. I guess, I need to invest more time and go over all ops one-by-one. |
|
Impressive work, @ikawrakow - thank you so much for picking this up and applying your magic to this semi-working PR of mine! This is a great unlock (I believe it can be a base for Kimi Linear, and the future Qwen 3.5 also has a very similar architecture) and also a demonstration of how capable Codex has become. From my side, to give credit where credit is due - @pwilkin did a lot of work to build the initial implementation for llama.cpp - and I don't think Codex would've been able to do so much without a reference implementation in llama.cpp. |
|
@ikawrakow @YurkoHoshko Thanks! Wasn't trying to say you weren't aware, we just wanted to share our findings after our deep dive. No expectation of a fix at all, just wanted the forensics on record since we'd already spent a full day tracing it down. Honestly, massive thanks to both of you for making this happen. Running an 80B MoE with DeltaNet attention on a Radeon 890M is something we didn't think was possible a month ago. Vulkan performance has been rock solid for us. Really excited to see where this goes with Kimi Linear and Qwen 3.5. If there's anything we can help test or benchmark on the AMD/Vulkan side, we're happy to put cycles into it. This project deserves more contributors and we'd love to give back where we can. |
Starting from PR #1251, this is WIP to integrate Qwen3Next.
Massive upgrade to CPU-only TG performance (7.2 -> 24 t/s on a Ryzen-3995WX); seems to be working correctly.
CPU batch processing is still not fully correct; I haven't yet seen where the issue is. After various optimizations, CPU PP-512 performance went up from 160 t/s on PR #1251 to 240 t/s in this PR. Zero changes to CUDA compared to #1251, so CUDA PP is still massively slower than llama.cpp. Marking as draft for now.
Update: Spent a few hours fixing the bad CUDA performance. PP is now better than mainline by about 25%, but TG is about 5% lower than mainline. So, I'll remove the draft label, but I think it needs some more work.
Update 2: TG on CUDA should now be on par with mainline.
Update 3: PP on the CPU is now fixed, so the PR is fully functional.
While I was fixing this PR, PR-19375 in mainline was merged. That PR optimizes the Qwen3-Next compute graph, achieving a non-negligible performance improvement. Nothing along these lines has happened in ik_llama.cpp yet, so it is interesting to compare performance. Using llama.cpp build 27b93cbd1 (8064) for the following. Note that sweep-bench does not work correctly for a recurrent model, so just a plain llama-bench comparison. The CUDA run is with full offload on 2x3090; the CPU-only run is on a Ryzen-3995WX CPU. I think this is not too bad considering that mainline developers have been optimizing Qwen3-Next since last November, while this is the very first ik_llama.cpp PR, which was generated by @YurkoHoshko with the help of Codex-5.3, and I started looking at it more seriously just 2 days ago.