Skip to content

UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel#1256

Open
loci-dev wants to merge 1 commit into
mainfrom
loci/pr-20571-sycl-gated-delta-net
Open

UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel#1256
loci-dev wants to merge 1 commit into
mainfrom
loci/pr-20571-sycl-gated-delta-net

Conversation

@loci-dev
Copy link
Copy Markdown

Note

Source pull request: ggml-org/llama.cpp#20571

Summary

  • Port the Gated Delta Net (GDN) recurrence kernel from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend
  • Enables Qwen3.5 and other delta-net architecture models to run efficiently on Intel GPUs via oneAPI
  • Previously, the SYCL backend had no GDN support, causing fallback to CPU for these operations

Implementation

New files:

  • ggml/src/ggml-sycl/gdn.cpp — fused kernel implementation
  • ggml/src/ggml-sycl/gdn.hpp — header

Modified files:

  • ggml/src/ggml-sycl/backend.hpp — add include
  • ggml/src/ggml-sycl/ggml-sycl.cpp — add dispatch case and supports_op entry

Kernel features:

  • Both GDA (scalar gate) and KDA (key-dependent / vector gate) variants
  • Head sizes 32, 64, 128 via compile-time templates
  • GQA/MQA support through stride-based tensor access (matching Vulkan push constants layout)
  • sycl::float4 vectorized inner loops (same pattern as existing gla.cpp)
  • One workgroup per (head, seq), S_V threads per workgroup, state held in registers

Benchmark

Tested on Intel Arc 140V (Lunar Lake iGPU) with Qwen3.5-0.8B-Q4_K_M, -ngl 99:

Metric Before (no GDN kernel) After (this PR) Change
Decode (tok/s) 22.0 54.0 +145%
Prompt (tok/s) 8.6 23.1 +169%

The decode improvement comes from GDN layers now running as a fused kernel on GPU instead of falling back to per-op CPU execution.

Test plan

  • Builds successfully with oneAPI 2025.3 + Ninja on Windows
  • End-to-end inference with Qwen3.5-0.8B produces coherent output
  • test-backend-ops passes GATED_DELTA_NET tests (test cases already exist in upstream)
  • CI with SYCL build

🤖 Generated with Claude Code

Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader
(gated_delta_net.comp) to the SYCL backend, enabling Qwen3.5 and other
delta-net models to run on Intel GPUs via oneAPI.

Kernel features:
- Supports both GDA (scalar gate) and KDA (vector gate / key-dependent) modes
- Head sizes 32, 64, 128 via compile-time templates
- GQA/MQA support through stride-based tensor access
- Float4 vectorized inner loops matching the GLA kernel pattern
- One workgroup per (head, seq) with S_V threads; state held in registers

Tested on Intel Arc 140V (Lunar Lake) with Qwen3.5-0.8B-Q4_K_M:
- Before (GDN fallback to CPU): 22.0 tok/s decode
- After  (GDN fused on GPU):    54.0 tok/s decode  (+145%)
- Prompt eval: 23.1 tok/s (vs Vulkan 2.0 tok/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@loci-dev loci-dev force-pushed the main branch 12 times, most recently from e6c519b to 59f2b25 Compare March 23, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 89a1190 to 8fec234 Compare March 30, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 6ef937b to 3655621 Compare April 5, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 55afbee to ef0eff4 Compare April 12, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 63ab8d1 to 7638ab4 Compare April 19, 2026 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants