UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel #1256
Open
loci-dev wants to merge 1 commit into
Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend, enabling Qwen3.5 and other delta-net models to run on Intel GPUs via oneAPI.

Kernel features:
- Supports both GDA (scalar gate) and KDA (vector gate / key-dependent) modes
- Head sizes 32, 64, 128 via compile-time templates
- GQA/MQA support through stride-based tensor access
- Float4 vectorized inner loops matching the GLA kernel pattern
- One workgroup per (head, seq) with S_V threads; state held in registers

Tested on Intel Arc 140V (Lunar Lake) with Qwen3.5-0.8B-Q4_K_M:
- Before (GDN fallback to CPU): 22.0 tok/s decode
- After (GDN fused on GPU): 54.0 tok/s decode (+145%)
- Prompt eval: 23.1 tok/s (vs Vulkan 2.0 tok/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note
Source pull request: ggml-org/llama.cpp#20571
Summary
- Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend

Implementation
New files:
- ggml/src/ggml-sycl/gdn.cpp: fused kernel implementation
- ggml/src/ggml-sycl/gdn.hpp: header

Modified files:
- ggml/src/ggml-sycl/backend.hpp: add include
- ggml/src/ggml-sycl/ggml-sycl.cpp: add dispatch case and supports_op entry

Kernel features:
- sycl::float4 vectorized inner loops (same pattern as existing gla.cpp)
- S_V threads per workgroup, state held in registers

Benchmark
Tested on Intel Arc 140V (Lunar Lake iGPU) with Qwen3.5-0.8B-Q4_K_M, -ngl 99.

The decode improvement comes from GDN layers now running as a fused kernel on GPU instead of falling back to per-op CPU execution.
Test plan
- test-backend-ops passes GATED_DELTA_NET tests (test cases already exist in upstream)

🤖 Generated with Claude Code