UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel by loci-dev · Pull Request #1256 · auroralabs-loci/llama.cpp

loci-dev · 2026-03-15T03:09:59Z

Note

Source pull request: ggml-org/llama.cpp#20571

Summary

Port the Gated Delta Net (GDN) recurrence kernel from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend
Enables Qwen3.5 and other delta-net architecture models to run efficiently on Intel GPUs via oneAPI
Previously, the SYCL backend had no GDN support, causing fallback to CPU for these operations

Implementation

New files:

ggml/src/ggml-sycl/gdn.cpp — fused kernel implementation
ggml/src/ggml-sycl/gdn.hpp — header

Modified files:

ggml/src/ggml-sycl/backend.hpp — add include
ggml/src/ggml-sycl/ggml-sycl.cpp — add dispatch case and supports_op entry

Kernel features:

Both GDA (scalar gate) and KDA (key-dependent / vector gate) variants
Head sizes 32, 64, 128 via compile-time templates
GQA/MQA support through stride-based tensor access (matching Vulkan push constants layout)
sycl::float4 vectorized inner loops (same pattern as existing gla.cpp)
One workgroup per (head, seq), S_V threads per workgroup, state held in registers

Benchmark

Tested on Intel Arc 140V (Lunar Lake iGPU) with Qwen3.5-0.8B-Q4_K_M, -ngl 99:

Metric	Before (no GDN kernel)	After (this PR)	Change
Decode (tok/s)	22.0	54.0	+145%
Prompt (tok/s)	8.6	23.1	+169%

The decode improvement comes from GDN layers now running as a fused kernel on GPU instead of falling back to per-op CPU execution.

Test plan

Builds successfully with oneAPI 2025.3 + Ninja on Windows
End-to-end inference with Qwen3.5-0.8B produces coherent output
test-backend-ops passes GATED_DELTA_NET tests (test cases already exist in upstream)
CI with SYCL build

🤖 Generated with Claude Code

Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend, enabling Qwen3.5 and other delta-net models to run on Intel GPUs via oneAPI. Kernel features: - Supports both GDA (scalar gate) and KDA (vector gate / key-dependent) modes - Head sizes 32, 64, 128 via compile-time templates - GQA/MQA support through stride-based tensor access - Float4 vectorized inner loops matching the GLA kernel pattern - One workgroup per (head, seq) with S_V threads; state held in registers Tested on Intel Arc 140V (Lunar Lake) with Qwen3.5-0.8B-Q4_K_M: - Before (GDN fallback to CPU): 22.0 tok/s decode - After (GDN fused on GPU): 54.0 tok/s decode (+145%) - Prompt eval: 23.1 tok/s (vs Vulkan 2.0 tok/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

loci-dev force-pushed the main branch 12 times, most recently from e6c519b to 59f2b25 Compare March 23, 2026 02:17

loci-dev force-pushed the main branch 9 times, most recently from 89a1190 to 8fec234 Compare March 30, 2026 02:18

loci-dev force-pushed the main branch 8 times, most recently from 6ef937b to 3655621 Compare April 5, 2026 02:18

loci-dev force-pushed the main branch 8 times, most recently from 55afbee to ef0eff4 Compare April 12, 2026 02:18

loci-dev force-pushed the main branch 9 times, most recently from 63ab8d1 to 7638ab4 Compare April 19, 2026 02:19

loci-dev force-pushed the main branch from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel#1256

UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel#1256
loci-dev wants to merge 1 commit into
mainfrom
loci/pr-20571-sycl-gated-delta-net

loci-dev commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Mar 15, 2026

Summary

Implementation

Benchmark

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants