metal: add fused Flux RoPE and direct conv2d kernels#9

Merged
gianni-cor merged 10 commits into tetherto:2026-01-30 from gianni-cor:feat/metal-conv2d-implicit-gemm on May 21, 2026

gianni-cor commented on May 13, 2026


Summary

  • Fused RoPE Metal kernel (GGML_OP_ROPE_FLUX): applies rotary embedding + layout permute in a single Metal dispatch, replacing 13 ggml ops per Q/K. Eliminates 175 CONT memory copy operations per Flux denoising step.
  • Fused V permute kernel (kernel_permute_cont_021): single dispatch for V tensor preparation in flash attention.
  • Implicit GEMM conv2d kernel: replaces naive per-pixel conv2d with simdgroup MMA-based kernel (64×64 tiles). 17% faster than im2col+matmul, saves ~1GB VRAM on VAE.
  • Flash attention NQPTG>8 fix: adds query block loops to QK and O accumulation sections so NQPTG>8 produces correct results (though NQPTG=8 remains optimal).
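
To make the first bullet concrete, here is a minimal CPU sketch of what "rotary embedding + layout permute in a single pass" can look like. It is not the Metal kernel from this PR. The function name `rope_permute_ref`, the separate `cos_t`/`sin_t` tables, and the token-major input / head-major output layouts are all assumptions made for illustration; the actual layout of the PE tensor passed to `ggml_rope_flux` is not described in this summary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for "rotate, then permute" in one pass.
// Assumed layouts (not confirmed by the PR):
//   x      : [n_tokens][n_heads][head_dim], elements (2p, 2p+1) form one pair
//   cos_t  : [n_tokens][head_dim / 2]
//   sin_t  : [n_tokens][head_dim / 2]
//   return : [n_heads][n_tokens][head_dim]
std::vector<float> rope_permute_ref(const std::vector<float>& x,
                                    const std::vector<float>& cos_t,
                                    const std::vector<float>& sin_t,
                                    std::size_t n_tokens,
                                    std::size_t n_heads,
                                    std::size_t head_dim) {
    assert(head_dim % 2 == 0);
    const std::size_t n_pairs = head_dim / 2;
    assert(x.size() == n_tokens * n_heads * head_dim);
    assert(cos_t.size() == n_tokens * n_pairs);
    assert(sin_t.size() == n_tokens * n_pairs);

    std::vector<float> y(x.size());
    for (std::size_t t = 0; t < n_tokens; ++t) {
        for (std::size_t h = 0; h < n_heads; ++h) {
            for (std::size_t p = 0; p < n_pairs; ++p) {
                // Read from the token-major source ...
                const std::size_t src = (t * n_heads + h) * head_dim + 2 * p;
                // ... and write straight into the head-major destination,
                // so no separate permute/cont copy is needed afterwards.
                const std::size_t dst = (h * n_tokens + t) * head_dim + 2 * p;
                const float c  = cos_t[t * n_pairs + p];
                const float s  = sin_t[t * n_pairs + p];
                const float x0 = x[src];
                const float x1 = x[src + 1];
                y[dst]     = x0 * c - x1 * s;
                y[dst + 1] = x0 * s + x1 * c;
            }
        }
    }
    return y;
}
```

The point of the sketch is only that the rotation and the index remapping share one loop nest, which is why a fused kernel can skip the intermediate copies that a chain of separate ops would need.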

Benchmarks (Flux2 Klein 4B, Q8_0)

Apple M4 (10 GPU cores, 16GB)

| Metric | Original | Optimized | Speedup |
|---|---|---|---|
| Per step | 8.1 s/it | 4.95 s/it | 1.64× |
| Total (512×512) | 39.1s | 24.5s | 1.60× |

Apple M3 Ultra (60 GPU cores, 96GB)

| Resolution | Original | Optimized | iris.c | Optimized vs original |
|---|---|---|---|---|
| 512×512 | 6.12s | 2.85s | 2.79s | 2.14× |
| 1024×1024 | 19.24s | 9.42s | 9.80s | 2.04× |
| 1792×1792 | 63.0s | 35.1s | 36.1s | 1.80× |

The optimized sd.cpp build matches or beats iris.c on denoising at 1024×1024 and above; at 512×512, iris.c remains slightly faster (2.79s vs 2.85s).

Test plan

  • test-rope-flux: bit-exact correctness across 4 configurations (small, medium, flux_klein, batch)
  • test-conv2d-direct: 14 conv2d configurations with max_abs=0.0000
  • End-to-end image generation verified on M4 and M3 Ultra
  • Quality comparison: perceptually identical to original pipeline
  • GGML_ROPE_FLUX_DISABLE=1 env var for fallback to original code path
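
As an illustration of the last item, the fallback switch could be read along these lines. This is a sketch under assumptions: the helper names are hypothetical, and treating only the exact value `1` as "disabled" is a guess, since the PR does not say how the variable is parsed.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: decides whether the fused path should be skipped.
// Assumption: only the exact string "1" disables the fused kernel; an unset
// variable or any other value keeps it enabled.
bool rope_flux_disabled(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical wrapper that reads the variable named in the test plan.
bool rope_flux_disabled_from_env() {
    return rope_flux_disabled(std::getenv("GGML_ROPE_FLUX_DISABLE"));
}
```

Splitting the parsing from the `std::getenv` call keeps the decision testable without touching the process environment.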

Files changed

  • include/ggml.h: added GGML_OP_ROPE_FLUX
  • src/ggml.c: ggml_rope_flux() function
  • src/ggml-metal/ggml-metal.metal: kernel_rope_flux, kernel_permute_cont_021, kernel_conv_2d implicit GEMM, flash attention query block fix
  • src/ggml-metal/ggml-metal-ops.cpp: dispatch logic
  • src/ggml-metal/ggml-metal-impl.h: kargs structs
  • src/ggml-metal/ggml-metal-device.m: supports_op entries
  • src/ggml-metal/ggml-metal-device.cpp: conv2d pipeline/smem
  • tests/: test-rope-flux.cpp, test-conv2d-direct.cpp, test-mul-mat-bench.cpp
  • SPEED.md, SPEED-FLUX.md: optimization logs with benchmarks and images

gianni-cor pushed a commit to gianni-cor/qvac that referenced this pull request May 13, 2026
…E + conv2d)

Adds a vcpkg overlay port for ggml that points to gianni-cor/ggml@feat/metal-conv2d-implicit-gemm
(tetherto/qvac-ext-ggml PR #9). This overlay overrides the registry ggml port
with the optimized version for testing.

Changes in the ggml overlay:
- Fused RoPE Metal kernel (GGML_OP_ROPE_FLUX): 36% faster Flux2 denoising on M4
- Fused V permute kernel (kernel_permute_cont_021)
- Implicit GEMM conv2d (17% faster than im2col, saves ~1GB VRAM)
- Flash attention NQPTG>8 query block fix

Benchmarks: see tetherto/qvac-ext-ggml#9
Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor force-pushed the feat/metal-conv2d-implicit-gemm branch 2 times, most recently from 32f93bc to fb25b0f, on May 14, 2026 21:59
1. Fused RoPE (GGML_OP_ROPE_FLUX)
   - New op: ggml_rope_flux(a, b) in ggml.h / ggml.c
   - Metal kernel: kernel_rope_flux — applies interleaved rotary embedding
     and permutes output layout in a single dispatch
   - Metal kernel: kernel_permute_cont_021 — fused permute(0,2,1,3)+cont
     for V tensor preparation (called via ggml_rope_flux(v, NULL))
   - Dispatch: ggml-metal-ops.cpp selects kernel based on PE presence
   - Supports_op: ggml-metal-device.m for F32 inputs
   - Test: test-rope-flux.cpp — bit-exact correctness across 4 configs

2. Implicit GEMM conv2d (kernel_conv_2d)
   - Rewrote naive per-pixel conv2d as implicit GEMM with simdgroup MMA
   - 64x64 output tiles, half-precision loads, float accumulators
   - 1x1 conv fast path, incremental k-decomposition
   - Dispatch: ggml-metal-ops.cpp updated for 256 threads, threadgroup mem
   - Pipeline: ggml-metal-device.cpp updated smem and bc_out condition
   - Test: test-conv2d-direct.cpp — 14 configs, exact match vs im2col

3. Flash attention NQPTG>8 fix
   - Added query block loops to QK and O accumulation sections
   - Enables correct output for NQPTG>8 (though NQPTG=8 remains optimal)

Co-authored-by: Cursor <cursoragent@cursor.com>
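
For readers unfamiliar with the term in item 2 of the commit message above, the following scalar sketch shows the index mapping behind an implicit GEMM convolution: the convolution is treated as a matrix product with M = output channels, N = batch × output pixels, and K = input channels × kernel height × kernel width, but the im2col matrix is never materialized. It does not reproduce the simdgroup MMA tiling, half-precision loads, or the 1x1 fast path of `kernel_conv_2d`. The struct, function names, and the NCHW / OIHW layouts are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor (names are illustrative only).
struct Conv2dShape {
    int N, IC, IH, IW;        // input  laid out as [N][IC][IH][IW]
    int OC, KH, KW;           // weight laid out as [OC][IC][KH][KW]
    int stride, pad, dilation;
};

inline int conv_out_dim(int in, int k, int stride, int pad, int dil) {
    return (in + 2 * pad - dil * (k - 1) - 1) / stride + 1;
}

// Scalar reference: output laid out as [N][OC][OH][OW].
std::vector<float> conv2d_implicit_gemm_ref(const std::vector<float>& x,
                                            const std::vector<float>& w,
                                            const Conv2dShape& s) {
    const int OH = conv_out_dim(s.IH, s.KH, s.stride, s.pad, s.dilation);
    const int OW = conv_out_dim(s.IW, s.KW, s.stride, s.pad, s.dilation);
    const int M = s.OC;               // GEMM rows
    const int Ncols = s.N * OH * OW;  // GEMM columns
    const int K = s.IC * s.KH * s.KW; // GEMM reduction dimension

    std::vector<float> y(static_cast<std::size_t>(M) * Ncols, 0.0f);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < Ncols; ++n) {
            // Decompose the GEMM column into (batch, output row, output col).
            const int b  = n / (OH * OW);
            const int oh = (n / OW) % OH;
            const int ow = n % OW;
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                // Decompose the reduction index into (channel, kernel row, kernel col).
                const int ic = k / (s.KH * s.KW);
                const int kh = (k / s.KW) % s.KH;
                const int kw = k % s.KW;
                const int ih = oh * s.stride - s.pad + kh * s.dilation;
                const int iw = ow * s.stride - s.pad + kw * s.dilation;
                if (ih < 0 || ih >= s.IH || iw < 0 || iw >= s.IW) {
                    continue; // zero padding contributes nothing
                }
                const std::size_t xi =
                    ((static_cast<std::size_t>(b) * s.IC + ic) * s.IH + ih) * s.IW + iw;
                acc += w[static_cast<std::size_t>(m) * K + k] * x[xi];
            }
            y[((static_cast<std::size_t>(b) * s.OC + m) * OH + oh) * OW + ow] = acc;
        }
    }
    return y;
}
```

Because the input element is gathered on the fly from `(ic, kh, kw)`, no im2col buffer is allocated, which is the general reason this formulation can save memory compared with im2col followed by a matmul.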
gianni-cor force-pushed the feat/metal-conv2d-implicit-gemm branch from fb25b0f to 75c1904 on May 14, 2026 22:05
gianni-cor and others added 4 commits May 15, 2026 00:26
Co-authored-by: Cursor <cursoragent@cursor.com>
Tighten fused RoPE validation and fallback behavior while covering the conv2d 1x1 edge case that could read out of bounds.

Co-authored-by: Cursor <cursoragent@cursor.com>
Require simdgroup matrix multiply before advertising Metal conv2d support and add ROPE_FLUX to backend-op coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep Metal ROPE_FLUX support checks aligned with the int32 dispatch limit so oversized tensors fall back instead of asserting at runtime.

Co-authored-by: Cursor <cursoragent@cursor.com>
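
One plausible reading of the "int32 dispatch limit" in the commit message above is that the total element count must fit in a signed 32-bit integer before the op is advertised as supported. The sketch below illustrates that reading only; the function name is hypothetical and the real condition in the supports_op code is not shown in this conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical support check: returns true when the product of the four
// dimensions fits in int32, so a kernel indexing with int32 cannot overflow.
// Rejecting here lets the caller fall back instead of asserting later.
bool fits_int32_dispatch(const std::int64_t ne[4]) {
    const std::int64_t limit = std::numeric_limits<std::int32_t>::max();
    std::int64_t total = 1;
    for (int i = 0; i < 4; ++i) {
        if (ne[i] <= 0) {
            return false; // empty or invalid dimension
        }
        if (total > limit / ne[i]) {
            return false; // total * ne[i] would exceed the limit
        }
        total *= ne[i];
    }
    return true;
}
```

The division-based comparison avoids overflowing the 64-bit accumulator while checking, which matters if the same helper is reused for very large shapes.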
gianni-cor changed the title from "metal: fused RoPE kernel + conv2d implicit GEMM — 2× faster Flux2 denoising" to "metal: add fused Flux RoPE and direct conv2d kernels" on May 20, 2026
aegioscy previously approved these changes on May 20, 2026
Comment thread src/ggml-metal/ggml-metal.metal Outdated
Comment thread src/ggml-metal/ggml-metal-impl.h Outdated
@jpgaribotti


Tests don't cover non-contiguous sources
The Metal kernel and the CPU forward pass both honor src0->nb[*], but test-rope-flux.cpp only allocates a fresh tensor with ggml_new_tensor_4d. The whole point of the fusion is that the real Flux graph passes a non-contiguous Q/K directly. Please add a test case where the input arrives as a ggml_permute(...) result, so that the kernel is exercised with non-unit-stride loads.

test-conv2d-direct.cpp has good shape coverage, but only with F16 weights and F32 input. supports_op allows F32+F32 too, so please add at least one F32 weight case.
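
To illustrate what the first request is getting at, here is a self-contained sketch of how a permuted view differs from a fresh allocation. It mimics the ggml convention of per-dimension element counts (`ne`) and byte strides (`nb`) without using ggml itself; `View4D` and every function below are hypothetical stand-ins, not the ggml API. A `permute(0,2,1,3)` only swaps the metadata of dimensions 1 and 2, so any code that assumes packed strides reads the wrong elements, while code that honors `nb` does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a 4D tensor view (dimension 0 is the fastest).
struct View4D {
    const float* data;
    std::int64_t ne[4]; // elements per dimension
    std::size_t  nb[4]; // byte stride per dimension
};

View4D make_contiguous(const float* data, std::int64_t ne0, std::int64_t ne1,
                       std::int64_t ne2, std::int64_t ne3) {
    View4D v{data, {ne0, ne1, ne2, ne3}, {0, 0, 0, 0}};
    v.nb[0] = sizeof(float);
    for (int i = 1; i < 4; ++i) {
        v.nb[i] = v.nb[i - 1] * static_cast<std::size_t>(v.ne[i - 1]);
    }
    return v;
}

// Swap dimensions 1 and 2 without moving data, like a permute(0,2,1,3) view.
View4D permute_0213(View4D v) {
    std::swap(v.ne[1], v.ne[2]);
    std::swap(v.nb[1], v.nb[2]);
    return v;
}

bool is_contiguous(const View4D& v) {
    if (v.nb[0] != sizeof(float)) return false;
    for (int i = 1; i < 4; ++i) {
        if (v.nb[i] != v.nb[i - 1] * static_cast<std::size_t>(v.ne[i - 1])) return false;
    }
    return true;
}

// Stride-aware load: the only correct way to read a non-contiguous view.
float load(const View4D& v, std::int64_t i0, std::int64_t i1,
           std::int64_t i2, std::int64_t i3) {
    const char* base = reinterpret_cast<const char*>(v.data);
    const std::size_t off = static_cast<std::size_t>(i0) * v.nb[0] +
                            static_cast<std::size_t>(i1) * v.nb[1] +
                            static_cast<std::size_t>(i2) * v.nb[2] +
                            static_cast<std::size_t>(i3) * v.nb[3];
    return *reinterpret_cast<const float*>(base + off);
}

// Materialize the view into a packed buffer (the "cont" step).
std::vector<float> cont(const View4D& v) {
    std::vector<float> out;
    out.reserve(static_cast<std::size_t>(v.ne[0] * v.ne[1] * v.ne[2] * v.ne[3]));
    for (std::int64_t i3 = 0; i3 < v.ne[3]; ++i3)
        for (std::int64_t i2 = 0; i2 < v.ne[2]; ++i2)
            for (std::int64_t i1 = 0; i1 < v.ne[1]; ++i1)
                for (std::int64_t i0 = 0; i0 < v.ne[0]; ++i0)
                    out.push_back(load(v, i0, i1, i2, i3));
    return out;
}
```

A test in this spirit would feed the permuted view to the kernel under test and compare the result against the same kernel run on the materialized `cont` copy.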

Comment thread src/ggml-cpu/ops.cpp
gianni-cor and others added 3 commits May 20, 2026 17:40
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor and others added 2 commits May 20, 2026 18:21
Co-authored-by: Cursor <cursoragent@cursor.com>
Keep backend correctness tests on executable IM2COL_3D type combinations and make CPU/Vulkan supports_op reject combinations that would otherwise assert during compute.

Co-authored-by: Cursor <cursoragent@cursor.com>
@gianni-cor

Author

/review

gianni-cor merged commit 3409834 into tetherto:2026-01-30 on May 21, 2026

3 participants