metal: add fused Flux RoPE and direct conv2d kernels by gianni-cor · Pull Request #9 · tetherto/qvac-ext-ggml

gianni-cor · 2026-05-13T23:41:09Z

Summary

Fused RoPE Metal kernel (GGML_OP_ROPE_FLUX): applies rotary embedding + layout permute in a single Metal dispatch, replacing 13 ggml ops per Q/K. Eliminates 175 CONT memory copy operations per Flux denoising step.
Fused V permute kernel (kernel_permute_cont_021): single dispatch for V tensor preparation in flash attention.
Implicit GEMM conv2d kernel: replaces naive per-pixel conv2d with simdgroup MMA-based kernel (64×64 tiles). 17% faster than im2col+matmul, saves ~1GB VRAM on VAE.
Flash attention NQPTG>8 fix: adds query block loops to QK and O accumulation sections so NQPTG>8 produces correct results (though NQPTG=8 remains optimal).
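
To make the fusion in the first item concrete, here is a minimal CPU reference sketch of "rotate, then write in the permuted layout" done in one pass. It is not the PR's kernel. The function name `rope_permute_ref`, the cos/sin table layout, and the tensor layouts in the comments are assumptions made for illustration. Only the interleaved pairing and the permute(0,2,1,3) output order are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: rotate interleaved pairs (x[2i], x[2i+1]) and
// write the result directly in the permuted layout, so no separate CONT copy
// is needed afterwards.
//
// Assumed layouts (illustrative only, not taken from the PR):
//   x   : [n_tokens][n_head][head_dim]   (head_dim fastest)
//   cs  : [n_tokens][head_dim / 2]       cos values per token and pair
//   sn  : [n_tokens][head_dim / 2]       sin values per token and pair
//   out : [n_head][n_tokens][head_dim]   (permute(0,2,1,3) of x)
std::vector<float> rope_permute_ref(const std::vector<float>& x,
                                    const std::vector<float>& cs,
                                    const std::vector<float>& sn,
                                    std::size_t head_dim,
                                    std::size_t n_head,
                                    std::size_t n_tokens) {
    assert(head_dim % 2 == 0);
    assert(x.size() == head_dim * n_head * n_tokens);
    assert(cs.size() == n_tokens * (head_dim / 2));
    assert(sn.size() == cs.size());

    std::vector<float> out(x.size());
    const std::size_t half = head_dim / 2;
    for (std::size_t t = 0; t < n_tokens; ++t) {
        for (std::size_t h = 0; h < n_head; ++h) {
            const float* src = &x[(t * n_head + h) * head_dim];
            float*       dst = &out[(h * n_tokens + t) * head_dim]; // permuted write
            for (std::size_t i = 0; i < half; ++i) {
                const float c  = cs[t * half + i];
                const float s  = sn[t * half + i];
                const float x0 = src[2 * i];
                const float x1 = src[2 * i + 1];
                dst[2 * i]     = x0 * c - x1 * s; // 2D rotation of the pair
                dst[2 * i + 1] = x0 * s + x1 * c;
            }
        }
    }
    return out;
}
```

The point of the sketch is the destination index: because the rotation result is stored at `(h * n_tokens + t)` instead of `(t * n_head + h)`, the layout change costs no extra pass over memory.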

Benchmarks (Flux2 Klein 4B, Q8_0)

Apple M4 (10 GPU cores, 16GB)

	Original	Optimized	Speedup
Per step	8.1 s/it	4.95 s/it	1.64×
Total (512×512)	39.1s	24.5s	1.60×

Apple M3 Ultra (60 GPU cores, 96GB)

Resolution	Original	Optimized	iris.c	Optimized vs original
512×512	6.12s	2.85s	2.79s	2.14×
1024×1024	19.24s	9.42s	9.80s	2.04×
1792×1792	63.0s	35.1s	36.1s	1.80×

The optimized sd.cpp build matches or beats iris.c on denoising at ≥1024×1024; at 512×512 iris.c remains marginally faster (2.79s vs 2.85s).

Test plan

test-rope-flux: bit-exact correctness across 4 configurations (small, medium, flux_klein, batch)
test-conv2d-direct: 14 conv2d configurations with max_abs=0.0000
End-to-end image generation verified on M4 and M3 Ultra
Quality comparison: perceptually identical to original pipeline
GGML_ROPE_FLUX_DISABLE=1 env var for fallback to original code path
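
A fallback switch like the one in the last item is typically a one-time environment lookup. The sketch below shows one way to write it. How the PR actually parses `GGML_ROPE_FLUX_DISABLE` is not shown in this thread, so the rule used here (any non-empty value other than "0" disables the fused path) and both function names are assumptions.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Assumed parsing rule (not confirmed by the PR): unset, empty, or "0" means
// "keep the fused path"; anything else means "fall back".
inline bool flag_is_set(const char* value) {
    return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Hypothetical helper: read the variable once and cache the answer, since the
// environment is not expected to change while graphs are being built.
inline bool rope_flux_disabled() {
    static const bool disabled = flag_is_set(std::getenv("GGML_ROPE_FLUX_DISABLE"));
    return disabled;
}
```

Keeping the string test separate from the `getenv` call makes the rule easy to check without touching the process environment.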

Files changed

include/ggml.h: added GGML_OP_ROPE_FLUX
src/ggml.c: ggml_rope_flux() function
src/ggml-metal/ggml-metal.metal: kernel_rope_flux, kernel_permute_cont_021, kernel_conv_2d implicit GEMM, flash attention query block fix
src/ggml-metal/ggml-metal-ops.cpp: dispatch logic
src/ggml-metal/ggml-metal-impl.h: kargs structs
src/ggml-metal/ggml-metal-device.m: supports_op entries
src/ggml-metal/ggml-metal-device.cpp: conv2d pipeline/smem
tests/: test-rope-flux.cpp, test-conv2d-direct.cpp, test-mul-mat-bench.cpp
SPEED.md, SPEED-FLUX.md: optimization logs with benchmarks and images

…E + conv2d)

Adds a vcpkg overlay port for ggml that points to gianni-cor/ggml@feat/metal-conv2d-implicit-gemm (tetherto/qvac-ext-ggml PR tetherto#9). This overlay overrides the registry ggml port with the optimized version for testing.

Changes in the ggml overlay:
- Fused RoPE Metal kernel (GGML_OP_ROPE_FLUX): 36% faster Flux2 denoising on M4
- Fused V permute kernel (kernel_permute_cont_021)
- Implicit GEMM conv2d (17% faster than im2col, saves ~1GB VRAM)
- Flash attention NQPTG>8 query block fix

Benchmarks: see tetherto/qvac-ext-ggml#9

Co-authored-by: Cursor <cursoragent@cursor.com>

1. Fused RoPE (GGML_OP_ROPE_FLUX)
- New op: ggml_rope_flux(a, b) in ggml.h / ggml.c
- Metal kernel: kernel_rope_flux: applies interleaved rotary embedding and permutes output layout in a single dispatch
- Metal kernel: kernel_permute_cont_021: fused permute(0,2,1,3)+cont for V tensor preparation (called via ggml_rope_flux(v, NULL))
- Dispatch: ggml-metal-ops.cpp selects kernel based on PE presence
- Supports_op: ggml-metal-device.m for F32 inputs
- Test: test-rope-flux.cpp: bit-exact correctness across 4 configs

2. Implicit GEMM conv2d (kernel_conv_2d)
- Rewrote naive per-pixel conv2d as implicit GEMM with simdgroup MMA
- 64x64 output tiles, half-precision loads, float accumulators
- 1x1 conv fast path, incremental k-decomposition
- Dispatch: ggml-metal-ops.cpp updated for 256 threads, threadgroup mem
- Pipeline: ggml-metal-device.cpp updated smem and bc_out condition
- Test: test-conv2d-direct.cpp: 14 configs, exact match vs im2col

3. Flash attention NQPTG>8 fix
- Added query block loops to QK and O accumulation sections
- Enables correct output for NQPTG>8 (though NQPTG=8 remains optimal)

Co-authored-by: Cursor <cursoragent@cursor.com>
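
The implicit-GEMM idea in item 2 can be sketched in scalar C++. The convolution is treated as a matrix product out[OC][N] = W[OC][K] × B[K][N], with K = IC·KH·KW and N = OH·OW, but the im2col matrix B is never stored: each element is gathered from the input when it is needed. The code below is only an illustration of that index decomposition. It has no tiling, no simdgroup MMA and no half-precision loads, it assumes batch size 1 and no dilation, and the names `Conv2dShape`, `im2col_at` and `conv2d_implicit_gemm` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor: batch size 1, no dilation,
// same stride and padding in both spatial dimensions.
struct Conv2dShape {
    int IC, IH, IW;   // input channels, height, width
    int OC, KH, KW;   // output channels, kernel height, width
    int stride, pad;
    int OH() const { return (IH + 2 * pad - KH) / stride + 1; }
    int OW() const { return (IW + 2 * pad - KW) / stride + 1; }
};

// Element B[k][n] of the virtual im2col matrix, gathered on the fly.
// k decomposes into (ic, ky, kx), n decomposes into (oy, ox).
inline float im2col_at(const std::vector<float>& in, const Conv2dShape& s, int k, int n) {
    const int ic = k / (s.KH * s.KW);
    const int ky = (k / s.KW) % s.KH;
    const int kx = k % s.KW;
    const int oy = n / s.OW();
    const int ox = n % s.OW();
    const int iy = oy * s.stride - s.pad + ky;
    const int ix = ox * s.stride - s.pad + kx;
    if (iy < 0 || iy >= s.IH || ix < 0 || ix >= s.IW) {
        return 0.0f; // zero padding outside the input
    }
    return in[(ic * s.IH + iy) * s.IW + ix];
}

// in: [IC][IH][IW], w: [OC][IC][KH][KW] (flattened to [OC][K]), out: [OC][OH*OW]
std::vector<float> conv2d_implicit_gemm(const std::vector<float>& in,
                                        const std::vector<float>& w,
                                        const Conv2dShape& s) {
    const int K = s.IC * s.KH * s.KW;
    const int N = s.OH() * s.OW();
    assert(in.size() == static_cast<std::size_t>(s.IC * s.IH * s.IW));
    assert(w.size() == static_cast<std::size_t>(s.OC * K));

    std::vector<float> out(static_cast<std::size_t>(s.OC * N), 0.0f);
    for (int oc = 0; oc < s.OC; ++oc) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f; // float accumulator, as in the description above
            for (int k = 0; k < K; ++k) {
                acc += w[oc * K + k] * im2col_at(in, s, k, n);
            }
            out[oc * N + n] = acc;
        }
    }
    return out;
}
```

In this sketch, a 1×1 kernel with stride 1 and no padding reduces `im2col_at` to a plain read of `in[k * IH * IW + n]`, which suggests why a dedicated 1x1 path can skip the gather arithmetic entirely.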

jpgaribotti · 2026-05-20T14:50:33Z

Tests don't cover non-contiguous source
The Metal kernel and CPU forward both honor src0->nb[*], but test-rope-flux.cpp only allocates a fresh ggml_new_tensor_4d. The whole point of the fusion is that the real Flux graph passes a non-contiguous Q/K directly; please add a test case where the input arrives as a ggml_permute(...) result so we exercise non-unit-stride loads through the kernel.

test-conv2d-direct.cpp has good shape coverage but only F16 weights with F32 input. The supports_op allows F32+F32 too — please add at least one F32 weight case.
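
The non-contiguous case requested above comes down to addressing through byte strides instead of assuming a packed layout. The sketch below mimics ggml's ne/nb convention (element counts and byte strides per dimension) with a simplified stand-in. `View4` and its helpers are hypothetical and are not ggml API; they only show why a permuted view exercises different loads than a freshly allocated tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-in for a 4D F32 tensor view (hypothetical, not ggml API).
struct View4 {
    const char* data;
    int64_t     ne[4]; // number of elements per dimension
    std::size_t nb[4]; // stride in bytes per dimension
};

View4 make_contiguous(const float* data, int64_t n0, int64_t n1, int64_t n2, int64_t n3) {
    View4 v{reinterpret_cast<const char*>(data), {n0, n1, n2, n3}, {}};
    v.nb[0] = sizeof(float);
    for (int i = 1; i < 4; ++i) {
        v.nb[i] = v.nb[i - 1] * static_cast<std::size_t>(v.ne[i - 1]);
    }
    return v;
}

// Swap dimensions 1 and 2 without moving data: only shape and strides change.
View4 swap_dims_1_2(View4 v) {
    std::swap(v.ne[1], v.ne[2]);
    std::swap(v.nb[1], v.nb[2]);
    return v;
}

bool is_contiguous(const View4& v) {
    std::size_t expect = sizeof(float);
    for (int i = 0; i < 4; ++i) {
        if (v.nb[i] != expect) return false;
        expect *= static_cast<std::size_t>(v.ne[i]);
    }
    return true;
}

// Stride-honoring load: correct for both packed and permuted views.
float load(const View4& v, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    assert(i0 >= 0 && i0 < v.ne[0] && i1 >= 0 && i1 < v.ne[1]);
    assert(i2 >= 0 && i2 < v.ne[2] && i3 >= 0 && i3 < v.ne[3]);
    const char* p = v.data
        + static_cast<std::size_t>(i0) * v.nb[0]
        + static_cast<std::size_t>(i1) * v.nb[1]
        + static_cast<std::size_t>(i2) * v.nb[2]
        + static_cast<std::size_t>(i3) * v.nb[3];
    return *reinterpret_cast<const float*>(p);
}
```

A test that only allocates fresh tensors never leaves the `is_contiguous` branch, so a kernel that silently assumed packed rows would still pass it.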

gianni-cor · 2026-05-21T10:51:53Z

/review

gianni-cor mentioned this pull request May 13, 2026

diffusion-cpp: accelerate Flux2 and Stable Diffusion Metal inference tetherto/qvac#2044

Merged

gianni-cor force-pushed the feat/metal-conv2d-implicit-gemm branch 2 times, most recently from 32f93bc to fb25b0f, May 14, 2026 21:59

gianni-cor force-pushed the feat/metal-conv2d-implicit-gemm branch from fb25b0f to 75c1904, May 14, 2026 22:05

gianni-cor and others added 4 commits May 15, 2026 00:26

metal: move RoPE flux dispatch to helper

7c31d1f

Co-authored-by: Cursor <cursoragent@cursor.com>

metal: harden RoPE flux and conv2d direct

c53d008

Tighten fused RoPE validation and fallback behavior while covering the conv2d 1x1 edge case that could read out of bounds. Co-authored-by: Cursor <cursoragent@cursor.com>

metal: gate conv2d support and test RoPE flux

c83e9f0

Require simdgroup matrix multiply before advertising Metal conv2d support and add ROPE_FLUX to backend-op coverage. Co-authored-by: Cursor <cursoragent@cursor.com>

metal: reject oversized RoPE flux dispatches

615fc5f

Keep Metal ROPE_FLUX support checks aligned with the int32 dispatch limit so oversized tensors fall back instead of asserting at runtime. Co-authored-by: Cursor <cursoragent@cursor.com>
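
A support check of the kind this commit describes can be sketched as a guard that returns false before any dispatch is attempted, so the caller can fall back. Which quantities the real check bounds is not shown in this thread. The version below assumes that each dimension and the total element count must fit in a signed 32-bit integer, and the name `fits_int32_dispatch` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical guard: true only if every dimension and the total element
// count fit in int32, checked without overflowing the 64-bit running product.
bool fits_int32_dispatch(const int64_t ne[4]) {
    const int64_t limit = std::numeric_limits<int32_t>::max();
    int64_t total = 1;
    for (int i = 0; i < 4; ++i) {
        if (ne[i] <= 0 || ne[i] > limit) return false;
        if (total > limit / ne[i]) return false; // total * ne[i] would exceed the limit
        total *= ne[i];
    }
    return true;
}
```

Returning false from a support query, rather than asserting inside the dispatch, is what lets an oversized tensor take the original code path instead of aborting at runtime.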

gianni-cor changed the title ~~metal: fused RoPE kernel + conv2d implicit GEMM — 2× faster Flux2 denoising~~ metal: add fused Flux RoPE and direct conv2d kernels May 20, 2026

aegioscy previously approved these changes May 20, 2026

jpgaribotti reviewed May 20, 2026

Comment thread src/ggml-metal/ggml-metal.metal Outdated

Comment thread src/ggml-metal/ggml-metal-impl.h Outdated

jpgaribotti requested changes May 20, 2026

Comment thread src/ggml-cpu/ops.cpp

gianni-cor and others added 3 commits May 20, 2026 17:40

metal: sync conv2d simdgroup store

acf0a21

Co-authored-by: Cursor <cursoragent@cursor.com>

metal: remove unused conv2d tile args

ebc8042

Co-authored-by: Cursor <cursoragent@cursor.com>

cpu: optimize Flux RoPE row traversal

4557f4c

Co-authored-by: Cursor <cursoragent@cursor.com>

gianni-cor dismissed aegioscy’s stale review via 4557f4c May 20, 2026 15:59

gianni-cor and others added 2 commits May 20, 2026 18:21

test: cover Flux RoPE strides and F32 conv2d

a3767e8

Co-authored-by: Cursor <cursoragent@cursor.com>

backend: reject unsupported IM2COL_3D cases

36255dd

Keep backend correctness tests on executable IM2COL_3D type combinations and make CPU/Vulkan supports_op reject combinations that would otherwise assert during compute. Co-authored-by: Cursor <cursoragent@cursor.com>

jpgaribotti approved these changes May 21, 2026

aegioscy approved these changes May 21, 2026

gianni-cor merged commit 3409834 into tetherto:2026-01-30 May 21, 2026

gianni-cor merged 10 commits into tetherto:2026-01-30 from gianni-cor:feat/metal-conv2d-implicit-gemm
