UPSTREAM PR #18102: ggml-cuda: Delta-Net linear attention for Qwen3-Next#593

Open
loci-dev wants to merge 3 commits into main from upstream-PR18102-branch_hauhaut-deltanet-cuda

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18102

CUDA kernel for the Delta-Net linear attention layers in Qwen3-Next.

Adds GGML_OP_DELTA_NET plus a recurrent kernel for decode, and a Blackwell path (sm 12.0+) for prefill using 64 KB of shared memory. Also improves solve_tri for the chunked prefill path.
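For readers unfamiliar with the op: the recurrent decode path corresponds, per head, to the standard delta-rule state update used by DeltaNet-style linear attention. A minimal NumPy sketch of that recurrence (illustrative only; function and variable names are mine, not the PR's, and the actual kernel operates on batched, gated, multi-head tensors):

```python
import numpy as np

def delta_net_decode_step(S, q, k, v, beta):
    """One recurrent decode step of the delta rule (single head).

    S    : (d_k, d_v) running state matrix
    q, k : (d_k,) query/key vectors (k assumed L2-normalized)
    v    : (d_v,) value vector
    beta : scalar in (0, 1], per-token write strength
    Returns (o, S_new) where o is the (d_v,) output.
    """
    # Delta rule: subtract the state's current prediction for k,
    # then write the new value:
    #   S_new = S - beta * k (k^T S) + beta * k v^T
    #         = S + beta * outer(k, v - S^T k)
    S_new = S + beta * np.outer(k, v - S.T @ k)
    o = S_new.T @ q
    return o, S_new

# Toy check: after writing (k, v) with beta=1 into an empty state,
# querying with q = k recovers v exactly (k is unit-norm here).
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.5, -1.0, 2.0])
o, S = delta_net_decode_step(S, k, k, v, beta=1.0)
print(np.allclose(o, v))  # True
```

Decode runs this step token by token; prefill instead processes chunks in parallel, which is where the triangular solves come in.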

Getting ~45-55 t/s on Q4/MXFP4 and ~40 t/s on BF16 with 80B-A3B on Blackwell. Pre-Blackwell cards reach ~38-40 t/s from the solve_tri improvements alone (baseline was the original ~20 t/s).
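The solve_tri speedup mentioned above targets the batched lower-triangular solve used by the chunked prefill path. A reference forward-substitution sketch of what that op computes (illustrative; the CUDA kernel tiles and batches this, but the math is the same):

```python
import numpy as np

def solve_tri_lower(L, B):
    """Solve L X = B for X, where L is (n, n) lower-triangular
    and B is (n, m). Plain forward substitution, row by row."""
    n = L.shape[0]
    X = np.zeros_like(B, dtype=float)
    for i in range(n):
        # Row i depends only on already-solved rows 0..i-1.
        X[i] = (B[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X

# Sanity check against the definition L @ X == B.
rng = np.random.default_rng(0)
L = np.tril(rng.normal(size=(8, 8))) + 8.0 * np.eye(8)  # well-conditioned
B = rng.normal(size=(8, 2))
X = solve_tri_lower(L, B)
print(np.allclose(L @ X, B))  # True
```

The sequential row dependence is what makes this op a bottleneck on GPU, and why a better-parallelized solve_tri lifts prefill throughput even on pre-Blackwell cards.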

Edit: omitted some small bits.

@loci-review

loci-review bot commented Dec 16, 2025

Explore the complete analysis in Version Insights.

@loci-dev loci-dev force-pushed the main branch 23 times, most recently from ab5b02c to 2f30a3d Compare December 18, 2025 17:11
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 086f09a to 15838f1 Compare December 24, 2025 22:08
