Conversation
This reverts commit edadb80.
It was single-threaded and was taking ~25% of the computation time during TG. It is now down to 2%. Strangely enough, I measure 13.6 t/s with llama-bench, but if I let the model give me an actual response with llama-cli, I get close to 17 t/s.
For Qwen3Next there is a scale op on a largish tensor (548k elements) that has a single row for TG, so it was done in a single thread. We now simply split the work into blocks of 1024 elements.
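The idea can be sketched as follows. This is a minimal illustration, not the actual ggml code (ggml uses its own thread pool and op scheduling); the point is that partitioning by fixed-size element blocks instead of by rows keeps every thread busy even when the tensor has a single row:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Sketch: block-wise parallel scale. With row-based partitioning, a
// single-row tensor lands entirely on one thread; with 1024-element
// blocks, all threads share the 548k elements.
constexpr int64_t kBlockSize = 1024;

void scale_blocked(float * data, int64_t n_elements, float s, int n_threads) {
    const int64_t n_blocks = (n_elements + kBlockSize - 1) / kBlockSize;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            // Thread t handles blocks t, t + n_threads, t + 2*n_threads, ...
            for (int64_t b = t; b < n_blocks; b += n_threads) {
                const int64_t i0 = b * kBlockSize;
                const int64_t i1 = std::min(i0 + kBlockSize, n_elements);
                for (int64_t i = i0; i < i1; ++i) data[i] *= s;
            }
        });
    }
    for (auto & w : workers) w.join();
}
```

The blocks are disjoint, so no synchronization is needed beyond the final join.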
|
THANK YOU, Ikawrakow and YurkoHoshko!!! An 80B model running at 12.7 t/s, CPU-only on my dual Xeon!!!!! Christmas came early this year!!! 👍 |
|
CPU Perplexity Regression — DeltaNet (Qwen3-Coder-Next Q4_K_M)

We've been digging into the CPU backend performance on DeltaNet models and found a significant PPL gap we can't fully explain. Sharing our findings in case you can spot what we're missing — we're still learning the IK fork's ggml internals.

Setup: Qwen3-Coder-Next Q4_K_M, wikitext-2-raw, 2 chunks (512 tokens)
Two issues we see:
We've eliminated IQK mul_mat, cont fusion, mul broadcast, repeat fast paths, and the PR-specific ggml.c changes (sub revert, repeat ne[0]==1, fused delta-net op removal) as causes. The regression appears to come from pre-existing IK main branch ggml.c differences vs ggml-org. Is this a known issue with DeltaNet on CPU, or can you point us toward what we should be looking at? Happy to run any tests you'd find useful — we have the model and benchmark set up on a dedicated machine. Don't want to keep chasing this if it's already on your radar or if we're fundamentally misunderstanding something. |
Needs a non-contiguous variant of sum_rows. On the CPU this gave a 30+% improvement in TG performance; on CUDA it is a disappointing 6-7%. I guess this is because Georgi's cont CPU implementation was so bad that skipping it made such a big difference.
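A non-contiguous sum_rows can be sketched like this. This is a simplified 2D illustration of the idea, not the actual ik_llama.cpp implementation: by reading through the row byte stride (ggml's nb convention) directly, the preceding cont copy that would otherwise make the tensor contiguous becomes unnecessary:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: sum each row of a 2D float tensor whose rows are not tightly
// packed. ne0 = elements per row, ne1 = number of rows, nb1 = byte
// stride between rows (may exceed ne0 * sizeof(float)).
void sum_rows_noncont(const char * src, float * dst,
                      int64_t ne0, int64_t ne1, size_t nb1) {
    for (int64_t i1 = 0; i1 < ne1; ++i1) {
        const float * row = (const float *)(src + i1 * nb1);
        float sum = 0.0f;
        for (int64_t i0 = 0; i0 < ne0; ++i0) sum += row[i0];
        dst[i1] = sum;
    }
}
```

For a contiguous tensor, nb1 == ne0 * sizeof(float) and this reduces to the ordinary case, so one code path can serve both.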
Worth 1% in TG
|
@ProgenyAlpha Yes, I know there is still something wrong in the chunked delta-net implementation on the CPU. I said that in the PR description, no? But the PR is still usable because:
I have checked all ops in the chunked delta-net part, and I don't immediately see anything wrong. I guess, I need to invest more time and go over all ops one-by-one. |
|
Impressive work, @ikawrakow - thank you so much for picking this up and applying your magic to this semi-working PR of mine! This is a great unlock (I believe it can be a base for Kimi Linear, and the future Qwen 3.5 also has a very similar architecture) and also a demonstration of how capable Codex has become. From my side, to give credit where credit is due - @pwilkin did a lot of work to build the initial implementation for llama.cpp - and I don't think Codex would've been able to do so much without a reference implementation in llama.cpp. |
|
@ikawrakow @YurkoHoshko Thanks! Wasn't trying to say you weren't aware, we just wanted to share our findings after our deep dive. No expectation of a fix at all, just wanted the forensics on record since we'd already spent a full day tracing it down. Honestly, massive thanks to both of you for making this happen. Running an 80B MoE with DeltaNet attention on a Radeon 890M is something we didn't think was possible a month ago. Vulkan performance has been rock solid for us. Really excited to see where this goes with Kimi Linear and Qwen 3.5. If there's anything we can help test or benchmark on the AMD/Vulkan side, we're happy to put cycles into it. This project deserves more contributors and we'd love to give back where we can. |
Starting from PR #1251, this is WIP to integrate Qwen3Next.
Massive upgrade to CPU-only TG performance (7.2 -> 24 t/s on a Ryzen-3995WX); seems to be working correctly.
CPU batch processing is still not fully correct; I haven't yet seen where the issue is. After various optimizations, CPU PP-512 performance went up from 160 t/s on PR #1251 to 240 t/s in this PR. Zero changes to CUDA compared to #1251, so CUDA PP is still massively slower than llama.cpp. Marking as draft for now.
Update: Spent a few hours fixing the bad CUDA performance. PP is now better than mainline by about 25%, but TG is about 5% lower than mainline. So, I'll remove the draft label, but I think it needs some more work.
Update 2: TG on CUDA should now be on par with mainline.
Update 3: PP on the CPU is now fixed, so the PR is fully functional.
While I was fixing this PR, PR-19375 in mainline was merged. That PR optimizes the Qwen3-Next compute graph, achieving a non-negligible performance improvement. Nothing along these lines has happened in ik_llama.cpp yet, so it is interesting to compare performance. Using llama.cpp build 27b93cbd1 (8064) for the following. Note that sweep-bench does not work correctly for a recurrent model, so just a plain llama-bench comparison. The CUDA run is with full offload on 2x3090; the CPU-only run is on a Ryzen-3995WX CPU. I think this is not too bad considering that mainline developers have been optimizing Qwen3-Next since last November, while this is the very first ik_llama.cpp PR, which was generated by @YurkoHoshko with the help of Codex-5.3, and I started looking at it more seriously just 2 days ago.