Conversation
this doc was used solely by AI during development and I included it just to keep track of what was going on - please ignore.
scripts/qwen3next-eval.sh
this was used to compare to mainline as coding agent was advancing - please ignore
noticed some stability issues, investigating
### CPU fused DeltaNet threading bug — narrowed down

Hardware: AMD Ryzen AI 9 HX PRO 370 (12c/24t, AVX-512), 96GB DDR5, CPU-only, Q4_K GGUF via Ollama-converted blob.

I've been debugging the CPU-only fused path and can confirm @ikawrakow's finding that it's broken. Here's where it's broken: the race condition is NOT inside the kernel itself. Proof:
The last row is the key finding: even with the kernel itself forced single-threaded, the bug persists. Additional verification:
The non-fused path avoids this entirely — it uses standard GGML ops (mul_mat, transpose, sum_rows) with no custom threading.

Separate issue — Ollama GGUF compat: Ollama-converted Q3CN GGUFs omit part of the expected metadata.

Detailed notes and reproduction Dockerfiles: https://github.com/ProgenyAlpha/ik-deltanet-fix
More observations for the CPU implementation:
Fused delta-net on, FA on:

Fused delta-net off, FA on:

So, basically one needs to turn fused delta-net on for prefill, and turn it off for generation.
### Follow-up: hybrid dispatch test (fused prefill + autoregressive generation)

Based on @ikawrakow's observation that fused gives correct PPL for prefill but broken generation, I patched the dispatch:

```cpp
// BEFORE: fused handles everything when enabled
if (use_fused_delta_net) {
    attn_out = build_delta_net_fused(...);
} else {
    attn_out = n_tok == 1 ? build_delta_net_autoregressive(...) : build_delta_net_chunking(...);
}

// AFTER: fused for prefill only, autoregressive for generation
if (use_fused_delta_net && n_tok > 1) {
    attn_out = build_delta_net_fused(...);
} else if (n_tok == 1) {
    attn_out = build_delta_net_autoregressive(...);
} else {
    attn_out = build_delta_net_chunking(...);
}
```

Result: partial fix. Garbage output is eliminated — the first ~100 tokens are correct and on-topic. But generation degrades after that, producing repetition and hallucinated HTML tags. Comparing the same prompt ("Explain how a CPU cache works in 3 paragraphs"):
The fused kernel and the autoregressive path compute DeltaNet state differently, so when the autoregressive path picks up the state from fused prefill, the mismatch accumulates during generation.

Conclusion: the real fix needs to address the fused kernel's T=1 threading bug directly.

Updated repo with v15 Dockerfile and results: https://github.com/ProgenyAlpha/ik-deltanet-fix
### Summary: three distinct CPU bugs in the DeltaNet implementation

For tracking purposes — there are three separate bugs, not one:
Bug 1 blocks using fused for generation. Bug 3 means the chunked fallback isn't equivalent. Together they explain why no single flag combination gives both correct PPL and correct generation on CPU.
### Mainline llama.cpp + Vulkan iGPU: working alternative while CPU DeltaNet is debugged

While the three CPU bugs above are being sorted, we got Q3CN running well on mainline llama.cpp with Vulkan on the same hardware (N5 Pro, HX370, Radeon 890M iGPU, 96GB DDR5). Setup:
Results (no flash attention):
Results (flash attention on):
Flash attention on Vulkan RDNA 3.5: slight TG boost for short responses, slight PP regression for longer prompts, but it halves KV cache memory - enabling 8K-16K context without hitting memory walls.
Thank you - very good advice re. running the model that fully fits on GPU to cover all scenarios - should've done it myself. I also saw other people using tiny models of the same architecture just for numerical verification - a bit above my head, but it would surely speed up iteration cycles :) Again - really appreciate your feedback. I noticed elevated perplexity levels and attempted to fix it yesterday - no progress there yet, so I will convert this PR back to draft for the time being.
- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches
- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member
- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init
Alrighty, I removed a lot of code to keep things a little easier to review and tried to reduce the number of changes to existing files. At the moment, perplexity seems fixed:
CPU, w/o fa:

CPU, w/ fa:

Hybrid, w/ fa:

And benches:

with offload

```
CUDA_VISIBLE_DEVICES=0 build/bin/llama-sweep-bench -m /models/qwen3-next-coder.gguf -c 8192 -b 2048 -ub 512 -t 8 -fa on -ngl 999 --n-cpu-moe 35 -rtr --temp 1 --top-p 0.95 --top-k 40 --min-p 0.01
```

without offload
The model seems to work and has no major problems with tool calling. The fix to perplexity seemed to be this one: aaa1b12. I am now following up with more things on a separate branch - just wanted to leave something more or less stable here to review.
This looks much better than the previous version. However:
Do you mind if I take your branch and try to fix it? It will be too tedious to do this via PR comments.
Please, by all means - I would really appreciate it! Just for my own education: to test CPU-only inference, is it sufficient to set `--dev none`, or should I recompile with CUDA disabled? I may have been running my tests incorrectly 🤦.
I ran it side by side with gpt oss 20b for that exact reason - seemed to work (though I need to review my setup to make sure I am doing things correctly - new docker guide is 🔥). Thank you for your time!
You can run with `--dev none`. In my case, I prefer to have two separate build folders, one with CUDA enabled and one CPU-only.
Interested to know your thoughts and approach for
Closing this PR in favour of #1266 |
@YurkoHoshko I was hoping you would figure out what is wrong with the CPU chunked delta net ;-)
There will be no graph parallel for Qwen3Next. The attention architecture is completely different, and I don't see how one can effectively parallelize it over multiple GPUs. |
Will look into it over the weekend - apologies for the delay! |
Disclaimer
This PR was fully AI generated as a test of Codex 5.3 capabilities - it is by no means an optimized version that follows `ik_llama.cpp` best practices and is not meant to be a contribution in its current shape. Opening this PR per request from ikawrakow on the issue (#1229 (comment)); it is meant to serve mostly as a reference for affected code paths.

Testing methodology
Mainline was cross-referenced throughout the development, with all ops being tested and compared for correctness. Perplexity also seemed within the norm.
I ran this model with OpenCode / Pi agents and it seems to work, tool calls are good.
Original bench
**Benchmark**
Command:
```
CUDA_VISIBLE_DEVICES=0,1 build/bin/llama-sweep-bench -m /models/qwen-3-coder-next-mxfp4.gguf -c 8192 -t 8 -fa on --jinja -ngl 999 --n-cpu-moe 25 -rtr --temp 1 --top-p 0.95 --top-k 40 --min-p 0.01
```

Mainline:
This PR:
There are a few more PRs in mainline that might be potentially interesting / useful:
Update from Feb 7th
Spent some more time throwing Codex at this problem / PR, moved over some more code - it seems to work, but I can't vouch for the changes because it is over my head :) Got some gains.
New benchmark
### Qwen3Next Benchmark: PP 16384 / TG 128 (`ik_llama.cpp` vs `llama.cpp`)

Date: 2026-02-08
Common settings: `ik` `test2` `/models/qwen3-next-coder.gguf`, `-p 16384 -n 128 -b 3072 -ub 768 -t 8 -r 1 -mmp 0`

CUDA runs: `CUDA_VISIBLE_DEVICES=0`, `-fa 1 -ngl 999 --n-cpu-moe 47`

CPU-only runs: `-fa 0 -ngl 0 --n-cpu-moe 0`

Hardware note:

- NVIDIA GeForce RTX 5060 Ti, `16311 MiB` total (`CUDA_VISIBLE_DEVICES=0` for CUDA runs).
- NVIDIA GeForce RTX 3060, `12288 MiB` total.
- `ik` CUDA run (`p=8192`, `b=2048`, `ub=512`, `n-cpu-moe=45`): GPU0 memory used `~12074 MiB` (`~3775 MiB` free), from `nvidia-smi`.

Results
Columns: `ik_llama.cpp`, `llama.cpp`, `ik_llama.cpp`, `llama.cpp`.

Relative (`ik` vs `llama.cpp`): +11.91%, +12.91%, -4.38%, +74.10%

#### Additional CUDA rerun (requested lower `n-cpu-moe` ballpark)

Adjusted config: `-p 8192 -n 128 -b 2048 -ub 512 -t 8 -fa 1 -ngl 999 -mmp 0`, `CUDA_VISIBLE_DEVICES=0`

Fit checks on `ik`:

- `--n-cpu-moe 25` -> fail to load model
- `--n-cpu-moe 40` -> fail to create context
- `--n-cpu-moe 45` -> works

Working comparison at `--n-cpu-moe 45`: `ik_llama.cpp` vs `llama.cpp`

`ik` rerun with `-rtr 1` at the same config (`--n-cpu-moe 45`): `ik_llama.cpp` (`-rtr 1`)