ggml-cuda: warp-cooperative conv_transpose_1d, MUL_MAT_VEC + ADD + ADD fusion, GGML_CUDA_PERF_LOGGER env var by Zbig9000 · Pull Request #1465 · ggml-org/ggml

Zbig9000 · 2026-04-27T15:23:16Z

ggml-cuda: three independent perf / diagnostic improvements

This PR bundles three logically-distinct ggml-cuda changes that came
out of profiling / optimising chatterbox.cpp text-to-speech on RTX
5090 (Blackwell, sm_120, CUDA Toolkit 12.8). Each one is a separate
commit so it can be cherry-picked / squashed individually if you
prefer to land them one at a time.

| # | Commit | Headline |
|---|--------|----------|
| 1 | `test: add HiFT-realistic shapes to test_conv_transpose_1d` | +5 test-backend-ops cases at HiFT-vocoder shapes |
| 2 | `ggml-cuda: warp-cooperative conv_transpose_1d kernel` | ~42× faster at HiFT shapes |
| 3 | `test: extend test_mul_mat_vec_fusion with decoder-realistic shapes` | +100 test-backend-ops cases (`m=1, n>=1024, k>=1024`) |
| 4 | `ggml-cuda: fuse MUL_MAT_VEC + ADD(bias) + ADD(residual)` | −12 % total GPU time on transformer decoders |
| 5 | `ggml-cuda: add GGML_CUDA_PERF_LOGGER env var` | Cross-backend perf diagnostic mirroring `GGML_VK_PERF_LOGGER` |

Total: +571 / -43 lines across src/ggml-cuda/ and
tests/test-backend-ops.cpp.

1. Warp-cooperative `conv_transpose_1d` kernel (commits 1, 2)

Problem

Current scalar kernel: one CUDA thread per output pixel, scanning all
IC × IL inputs with a per-iteration skip branch:

for (int c = 0; c < src0_ne2; c++) {                // c in [0, IC)
    for (int i = 0; i < src1_ne0; i++) {            // i in [0, IL)
        if (!(idx >= i*s0 && idx < i*s0 + src0_ne0)) {
            continue;                               // skip ~99% of iterations
        }
        accumulator += src0[...] * src1[...];
    }
}

At HiFT-vocoder-realistic shapes (L=303, IC=80, K=16, s0=8) this is
~24 K loop iterations per output pixel with the skip branch firing on
~99 % of iterations. HiFT decode spends 67 % of total GPU time in
this kernel (4 700 µs / call) on RTX 5090.

Solution

- Narrow the input range to what actually contributes:
  i ∈ [⌈(ol - K + 1)/s0⌉, ⌊ol/s0⌋] ∩ [0, IL-1]. At K=16, s0=8
  this is 2 iterations of i instead of IL=O(100). The skip
  branch is eliminated entirely.
- Parallelise IC reduction across the warp (32 threads); reduce
  via 5 stages of __shfl_xor_sync.
- Block size: 256 → 32 (one warp per output pixel).
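
To make the range narrowing concrete, here is a minimal host-side sketch in plain C++ (not the CUDA kernel from this PR). It compares a legacy-style full scan against a scan restricted to the contributing input positions for a single output channel. The `Shape` struct, the function names and the flat `w` / `x` layouts are assumptions made for illustration; they are not ggml's actual tensor layout.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical shape for ONE output channel; names and layouts are illustrative only.
// w[c*K + k] is the kernel tap, x[c*IL + i] is the input sample.
struct Shape { int IC, IL, K, s0; };

// Legacy-style scan: visit every (c, i) pair and skip the ones that do not touch `ol`.
float out_full_scan(const std::vector<float>& w, const std::vector<float>& x, Shape s, int ol) {
    float acc = 0.0f;
    for (int c = 0; c < s.IC; ++c) {
        for (int i = 0; i < s.IL; ++i) {
            if (!(ol >= i * s.s0 && ol < i * s.s0 + s.K)) {
                continue;
            }
            acc += w[c * s.K + (ol - i * s.s0)] * x[c * s.IL + i];
        }
    }
    return acc;
}

// Smallest i with i*s0 + K > ol, clamped to 0: ceil((ol - K + 1) / s0).
int i_first(Shape s, int ol) {
    const int n = ol - s.K + 1;
    return n <= 0 ? 0 : (n + s.s0 - 1) / s.s0;
}

// Largest i with i*s0 <= ol, clamped to IL-1: floor(ol / s0).
int i_last(Shape s, int ol) {
    return std::min(s.IL - 1, ol / s.s0);
}

// Narrowed scan: only the contributing input positions, no skip branch.
float out_narrowed(const std::vector<float>& w, const std::vector<float>& x, Shape s, int ol) {
    assert(ol >= 0);
    const int lo = i_first(s, ol);
    const int hi = i_last(s, ol);
    float acc = 0.0f;
    for (int c = 0; c < s.IC; ++c) {
        for (int i = lo; i <= hi; ++i) {
            acc += w[c * s.K + (ol - i * s.s0)] * x[c * s.IL + i];
        }
    }
    return acc;
}
```

In this sketch both functions visit the contributing terms in the same order, so they return the same value; at K=16, s0=8 the narrowed loop runs `i_last - i_first + 1 = 2` times per channel for an interior output position.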

Perf (RTX 5090, CUDA 12.8)

conv_transpose_1d kernel time: 4 700 µs → 110 µs per call (~42×)
HiFT decode total GPU time: 67 % → 1.6 %
End-to-end speedup on chatterbox 232-token utterance: 4.7× faster HiFT

Tests

Existing 116 test_conv_transpose_1d cases continue to pass. Commit
1 of this PR adds 5 HiFT-realistic shapes that exercise the
warp-cooperative reduction at scale (IC > 32 multi-warp accumulation,
K > s0 inner-loop unroll). These pass on both the legacy
scalar kernel and the new warp-cooperative kernel — they're a
regression guard for future changes, not a "fail before, pass after"
pattern.

| Test phase | Result |
|------------|--------|
| `test-backend-ops -o CONV_TRANSPOSE_1D -b CUDA0` baseline | 116/116 PASS |
| `…` after commit 1 (test-only) | 121/121 PASS |
| `…` after commit 2 (kernel rewrite) | 121/121 PASS |

Files: src/ggml-cuda/conv-transpose-1d.cu,
src/ggml-cuda/conv-transpose-1d.cuh, tests/test-backend-ops.cpp.

2. `MUL_MAT_VEC + ADD(bias) + ADD(residual)` 3-op fusion (commits 3, 4)

Problem

ggml-cuda's fusion engine knows the 2-op pattern MUL_MAT_VEC + ADD(bias) but not the 3-op pattern that includes the residual. The
graph shape ((mat * y) + bias) + residual is common in transformer
attention-output and FFN-output blocks; ggml-vulkan already fuses it
via the MUL_MAT_ADD_ADD shader. Profiling chatterbox vs Vulkan
showed CUDA paid ~67 ms / utterance more in stand-alone ADD launches.

Solution

- Extend ggml_cuda_mm_fusion_args_* (in common.cuh) with an
  x_residual field. When set together with x_bias, the matmul-vec
  kernel performs dst = mat * y + bias + residual in a single
  dispatch.
- Detect the {MUL_MAT, ADD, ADD} pattern in
  ggml_cuda_graph_evaluate_and_capture (placed above the
  existing 2-op fusion so the greedy match prefers the larger).
  Resolve bias_tensor / residual_tensor from the cgraph;
  reject patterns with broadcasting on either ADD (matches the 2-op
  fusion's existing constraint).
- Update mul_mat_vec_q (mmvq.cu) and mul_mat_vec_f (mmvf.cu)
  templates to consume x_residual, which is added after bias and any GLU,
  matching ggml-vulkan's MUL_MAT_ADD_ADD execution order.

Only MUL_MAT (not MUL_MAT_ID) is handled — the residual ADD
pattern doesn't apply to MoE expert routing. The new code path
falls back gracefully to plain dispatch when the fusion isn't
applicable.
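
As a rough illustration of the "larger pattern first" matching and the write-back order, the following C++ sketch works on a toy op list. Everything in it (`Op`, `Node`, `fused_len`, `fused_writeback`) is hypothetical and far simpler than the real graph code: it only checks op types and a broadcast flag, and it ignores the data-dependency checks a real fusion pass needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for graph nodes; not ggml types.
enum class Op { MUL_MAT, ADD, OTHER };

struct Node {
    Op   op;
    bool broadcast;  // true if this ADD would need broadcasting
};

// Number of nodes starting at index i that one fused dispatch could cover:
// 3 for {MUL_MAT, ADD, ADD}, 2 for {MUL_MAT, ADD}, otherwise 1 (plain dispatch).
std::size_t fused_len(const std::vector<Node>& g, std::size_t i) {
    assert(i < g.size());
    if (g[i].op != Op::MUL_MAT) {
        return 1;
    }
    const auto plain_add = [&](std::size_t j) {
        return j < g.size() && g[j].op == Op::ADD && !g[j].broadcast;
    };
    // Try the 3-op pattern before the 2-op one so the greedy match prefers it.
    if (plain_add(i + 1) && plain_add(i + 2)) {
        return 3;
    }
    if (plain_add(i + 1)) {
        return 2;
    }
    return 1;
}

// Write-back order for one output element: bias first, residual last.
inline float fused_writeback(float dot, float bias, float residual) {
    return (dot + bias) + residual;
}
```

A dispatch loop built on this would advance by `fused_len(g, i)` nodes per iteration, which is also how the fallback to plain dispatch falls out naturally when the result is 1.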

Perf (RTX 5090, CUDA 12.8, chatterbox Turbo Q4_0, 232-token prompt)

Total GPU time per utterance: −12 %
MUL_MAT_VEC q4_0 op-bucket time: −47 ms / utterance
CUDA ↔ Vulkan gap on long prompts: 1.29× → 1.13×

Tests

test_mul_mat_vec_fusion::build_graph already exercises the
((mm * y) + bias) + residual pattern via:

ggml_tensor * ffn_up = ggml_mul_mat(ctx, up, cur);
if (with_bias) {
    ffn_up = ggml_add(ctx, ffn_up, up_bias);          // 2nd op
}
ggml_tensor * out = with_gate ? build_gate(ctx, ffn_gate, ffn_up) : ffn_up;
out = ggml_add(ctx, out, bias2);                      // 3rd op  ← residual

Commit 3 adds 100 new cases at decoder-realistic shapes (m=1, n ∈ {1024, 3072}, k=1024) that catch FP-precision and per-row stride
issues that don't surface at the existing m=1, n=32, k=256 cases.

| Test phase | Result |
|------------|--------|
| `test-backend-ops -o MUL_MAT_VEC_FUSION -b CUDA0` baseline | 300/300 PASS |
| `…` after commit 3 (test-only) | 400/400 PASS |
| `…` after commit 4 (fusion patch, first attempt) | 2/400 FAIL at `batch_dims=[4,2]` |
| Bug found: `x_residual` missing `(sample_dst, channel_bias)` offset in `mmvf.cu` | - |
| `…` after fix | 400/400 PASS |

The bug was latent in the original prototype patch (developed against
chatterbox where ne[2]==ne[3]==1 makes the offset 0). Upstream's
broader test coverage caught it immediately when stacked against
batch_dims=[4,2] — a good demonstration of why the test-before-
change discipline matters. Fixed in the same commit (mmvf.cu now
applies the offset symmetric to x_bias).
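
The failure mode can be reproduced on the host with a small C++ sketch. It assumes a hypothetical contiguous `[samples][channels][rows]` layout and made-up names (`Dims`, `batch_offset`, `writeback_batched`); the real strides and index variables in mmvf.cu differ. The point is only that a residual read without the per-batch offset is harmless when there is a single batch slice and wrong otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical contiguous layout: element (row, channel, sample) lives at
// (sample * channels + channel) * rows + row.
struct Dims { int rows, channels, samples; };

inline std::size_t batch_offset(Dims d, int channel, int sample) {
    return (static_cast<std::size_t>(sample) * d.channels + channel) * d.rows;
}

// dst = dot + bias + residual for every batch slice.
// offset_residual = false reproduces the prototype bug: residual is read
// from slice 0 regardless of which (channel, sample) is being written.
std::vector<float> writeback_batched(const std::vector<float>& dot,
                                     const std::vector<float>& bias,
                                     const std::vector<float>& residual,
                                     Dims d, bool offset_residual) {
    const std::size_t total = static_cast<std::size_t>(d.rows) * d.channels * d.samples;
    assert(dot.size() == total && bias.size() == total && residual.size() == total);
    std::vector<float> dst(total);
    for (int s = 0; s < d.samples; ++s) {
        for (int c = 0; c < d.channels; ++c) {
            const std::size_t off     = batch_offset(d, c, s);
            const std::size_t res_off = offset_residual ? off : 0;
            for (int r = 0; r < d.rows; ++r) {
                dst[off + r] = dot[off + r] + bias[off + r] + residual[res_off + r];
            }
        }
    }
    return dst;
}
```

With `Dims{rows, 1, 1}` both variants agree because the only offset is 0; with 4 channels and 2 samples they diverge on every slice except the first, which is the kind of case `batch_dims=[4,2]` exercises.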

Files: src/ggml-cuda/common.cuh, src/ggml-cuda/ggml-cuda.cu,
src/ggml-cuda/mmvf.cu, src/ggml-cuda/mmvq.cu,
tests/test-backend-ops.cpp.

3. `GGML_CUDA_PERF_LOGGER` env var (commit 5)

Problem

ggml-vulkan ships GGML_VK_PERF_LOGGER=1 and prints a structured
table after every compute graph; ggml-cuda has no equivalent.
Cross-backend perf characterisation today requires nsys (heavy,
NVIDIA-only, sometimes needs root for hardware counters) or one-off
manual instrumentation.

Solution

A symmetric env var that prints the same output format as
vk_perf_logger:

----------------
CUDA Timings:
MUL_MAT q4_0 m=3072 n=383 k=1024: 24 x 241.979 us = 5807.507 us
FLASH_ATTN_EXT (64,16,411,1): 24 x 5996.571 us = 143917.704 us
…
Total time: 22480.220 us.

Same prefix structure (----------------, <Backend> Timings:,
per-op rows, Total time: summary), same number formatting — so
existing cross-backend grep / awk one-liners work unchanged.
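
For tooling that prefers to consume the rows programmatically rather than via grep / awk, a per-op row of the form shown above can be split with a few lines of C++. This is a sketch under the assumption that rows always end in `<count> x <avg> us = <total> us`; `TimingRow`, `parse_timing_row` and `check_sample_row` are illustrative names, not part of ggml.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// One parsed per-op row, e.g. "MUL_MAT q4_0 ...: 24 x 241.979 us = 5807.507 us".
struct TimingRow {
    std::string name;
    int         count;
    double      avg_us;
    double      total_us;
};

// Returns true and fills `row` if `line` is a per-op row; header, separator
// and "Total time:" lines return false.
bool parse_timing_row(const std::string& line, TimingRow& row) {
    // The op name may itself contain ':' or ',', so split at the LAST ": ".
    const std::size_t sep = line.rfind(": ");
    if (sep == std::string::npos) {
        return false;
    }
    int    count = 0;
    double avg   = 0.0;
    double total = 0.0;
    if (std::sscanf(line.c_str() + sep + 2, "%d x %lf us = %lf us", &count, &avg, &total) != 3) {
        return false;
    }
    row = TimingRow{line.substr(0, sep), count, avg, total};
    return true;
}

// Usage: the sample row from the description parses into its three numbers.
inline void check_sample_row() {
    TimingRow row;
    const bool ok = parse_timing_row(
        "MUL_MAT q4_0 m=3072 n=383 k=1024: 24 x 241.979 us = 5807.507 us", row);
    assert(ok && row.count == 24);
    (void) ok;
}
```

Because both backends are described as emitting the same row format, the same parser would apply to Vulkan and CUDA logs alike.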

Implementation notes

- Meyers-singleton ggml_cuda_perf_logger, off by default. Only the
  function-local-static getenv check runs in the hot path when the
  env var is unset.
- RAII scope helper records cudaEventRecord(start) on
  construction and cudaEventRecord(end) on destruction, with one
  scope per dispatched op, including before any continue taken
  by the fusion fast-paths.
- cudaStreamSynchronize + flush_and_print is called at the end of
  every ggml_backend_cuda_graph_compute so elapsed times are
  readable before the events are re-used.
- CUDA Graphs auto-disable when the env var is set (otherwise events
  would either re-record over still-pending events or get hidden
  inside cudaGraphLaunch).
- Destructor flushes any pending data but does not call
  cudaEventDestroy: the singleton's dtor runs at static
  destruction time (after main()), where the CUDA driver may
  already be torn down. OS reclaim is safe; pool is bounded.
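
The overall shape (Meyers singleton, function-local-static env check, RAII scope, aggregate-then-print) can be sketched on the host as below. This is not the code in this PR: it swaps `cudaEvent_t` timing for `std::chrono::steady_clock`, the class names `perf_logger_sketch` and `perf_scope` are made up, and treating any set value of the env var as "enabled" is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>

class perf_logger_sketch {
public:
    // Meyers singleton: constructed on first use, destroyed at static destruction.
    static perf_logger_sketch& instance() {
        static perf_logger_sketch logger;
        return logger;
    }
    // Function-local static: getenv runs once, later calls only read a bool.
    static bool enabled() {
        static const bool on = std::getenv("GGML_CUDA_PERF_LOGGER") != nullptr;
        return on;
    }
    void record(const std::string& op, double us) {
        assert(us >= 0.0);
        entry& e = stats_[op];
        e.count    += 1;
        e.total_us += us;
    }
    int count(const std::string& op) const {
        const auto it = stats_.find(op);
        return it == stats_.end() ? 0 : it->second.count;
    }
    // Prints "<name>: <count> x <avg> us = <total> us" rows, then resets.
    void flush_and_print(std::FILE* out) {
        if (stats_.empty()) {
            return;
        }
        std::fprintf(out, "----------------\nSketch Timings:\n");
        double total = 0.0;
        for (const auto& kv : stats_) {
            const entry& e = kv.second;
            std::fprintf(out, "%s: %d x %.3f us = %.3f us\n",
                         kv.first.c_str(), e.count, e.total_us / e.count, e.total_us);
            total += e.total_us;
        }
        std::fprintf(out, "Total time: %.3f us.\n", total);
        stats_.clear();
    }
private:
    struct entry { int count = 0; double total_us = 0.0; };
    perf_logger_sketch() = default;
    std::map<std::string, entry> stats_;
};

// RAII scope: start time on construction, record on destruction.
class perf_scope {
public:
    explicit perf_scope(std::string op, bool active = perf_logger_sketch::enabled())
        : op_(std::move(op)), active_(active), start_(std::chrono::steady_clock::now()) {}
    ~perf_scope() {
        if (!active_) {
            return;
        }
        const std::chrono::duration<double, std::micro> dt =
            std::chrono::steady_clock::now() - start_;
        perf_logger_sketch::instance().record(op_, dt.count());
    }
    perf_scope(const perf_scope&)            = delete;
    perf_scope& operator=(const perf_scope&) = delete;
private:
    std::string op_;
    bool        active_;
    std::chrono::steady_clock::time_point start_;
};
```

In a dispatch loop one would open a `perf_scope` per op and call `flush_and_print(stderr)` once per graph. Because the scope records in its destructor, an early `continue` still closes the measurement, which is the property the RAII helper in the notes above relies on.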

Tests

The env var is a stderr-only diagnostic that doesn't change op
output, so there's no natural insertion point in test-backend-ops
for testing it. Manual smoke verifies GGML_CUDA_PERF_LOGGER=1 test-backend-ops test -o ADD -b CUDA0 produces well-formed CUDA Timings: blocks (99 blocks across the ADD test sweep, matching
vk_perf_logger's format byte-for-byte).

Functional / integration testing of the env var path lives downstream
in chatterbox.cpp's scripts/test-cuda-perf-logger.sh (4 phases:
default-silent, env-on-output-format, env+graphs-mutual-exclusion,
aggregate-time-bound) — happy to port a stripped-down version if
you'd prefer a unit test in this PR.

Files: src/ggml-cuda/ggml-cuda.cu, src/ggml-cuda/common.cuh.

Combined regression — 12189/12189 PASS

The three changes are independent (different files / different code
paths), but they're stacked on the same branch in this PR. Final
sanity check on the combined pr-cuda-bundle:

$ ./build/bin/test-backend-ops test -b CUDA0
…
  12189/12189 tests passed
  Backend CUDA0: OK
2/2 backends passed
OK

That's 12084 (baseline @ 8be60f8) + 5 (HiFT cases) + 100 (decoder-shape cases) = 12189, no regressions. Bench logs
(per-PR + combined) are available on request.

Reproduction

git fetch origin pr-cuda-bundle
git checkout pr-cuda-bundle
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BUILD_TESTS=ON
cmake --build build --target test-backend-ops -j
./build/bin/test-backend-ops test -b CUDA0

# verify the perf logger:
GGML_CUDA_PERF_LOGGER=1 ./build/bin/test-backend-ops test -o ADD -b CUDA0

Splitting if preferred

If you'd rather review / land these one at a time, the 5 commits map
cleanly onto three separable PRs:

commits 1+2 → conv_transpose_1d kernel rewrite
commits 3+4 → 3-op fusion
commit 5 → perf logger

The corresponding individual PR descriptions (with a bit more detail
per piece) are at the following branches in the same fork, ready to
push:
pr1-conv-transpose-1d-cuda-warp-cooperative,
pr2-mul-mat-vec-add-add-fusion-cuda,
pr3-cuda-perf-logger.
Happy to convert this PR into three separate ones if reviewers
prefer.

Companion findings (not in this PR)

While preparing this work, profiling on RTX 5090 (Blackwell, sm_120)
revealed that ggml_cuda_fattn_mma_get_config has no Blackwell entry
— ampere_mma_available(cc) returns true for any cc >= 800, so
sm_120 silently uses the Ampere config tuned for sm_80. This is the
single biggest remaining FLASH_ATTN_EXT perf gap to ggml-vulkan on
chatterbox-style workloads (~67 ms / utterance). Tuning a Blackwell
config empirically requires either ncu hardware counters or a
multi-day parameter sweep on Blackwell hardware — best done by
maintainers with multi-Blackwell access. Happy to file a separate
issue / discussion thread for this if there's interest.

The new GGML_CUDA_PERF_LOGGER shipped in this PR makes such an
A/B sweep convenient (no rebuild required to switch variants when
combined with a small follow-up env var to override the picker).

Adds 5 HiFT-vocoder-realistic test cases to exercise the warp-cooperative CUDA kernel paths (IC > 32 multi-warp accumulation, K > s0 multi-touch, etc). These pass on the existing scalar kernel and serve as a regression guard for future kernel rewrites. Made-with: Cursor

The current scalar kernel allocates one CUDA thread per output pixel and has each thread loop over all IC*IL input values, with a per-iteration branch that only triggers on a small fraction of iterations. At HiFT-vocoder-realistic shapes (L=303, IC=80, K=16, s0=8) this is ~24 K loop iterations per output pixel (80 * 303 = 24240), ~99% skip-branch overhead.

Warp-cooperative version:
* One warp (32 threads) cooperatively computes one output pixel.
* Input position range narrowed to i in [ceil((ol-K+1)/s0), floor(ol/s0)] - typically 2 iterations of i instead of IL=O(100). Skip branch eliminated entirely.
* IC reduction parallelised across warp lanes; partial sums reduced via __shfl_xor_sync.
* Block size 256 -> 32 (one warp per pixel).

Measured perf (RTX 5090 + CUDA Toolkit 12.8 + chatterbox HiFT vocoder on a 232-token prompt):
* conv_transpose_1d kernel time: 4700 us -> 110 us per call (~42x).
* HiFT decode total GPU time: 67% -> 1.6%.

Test coverage:
* 121/121 cases in test-backend-ops CONV_TRANSPOSE_1D pass on both the legacy scalar kernel and the new warp-cooperative kernel, including 5 new HiFT-realistic shapes added in the previous commit.
* No API change; output is identical up to FP-reduction order (sequential -> tree butterfly). Within ggml's default 1e-7 NMSE / 1e-3 abs tolerance for f32.

Made-with: Cursor
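
The change in reduction order mentioned in that commit message can be emulated on the host. The sketch below is plain C++, not device code: it mimics what a lane-strided accumulation followed by a 5-stage XOR butterfly would compute, with an array standing in for the 32 lanes. `warp_style_sum` is a hypothetical name, and the exchange `lane[l] + lane[l ^ offset]` is only a stand-in for what `__shfl_xor_sync` provides on the GPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int WARP = 32;

// Host emulation of a warp-style sum over v (e.g. one partial product per input channel).
float warp_style_sum(const std::vector<float>& v) {
    // Step 1: lane l accumulates elements l, l + 32, l + 64, ...
    std::array<float, WARP> lane{};
    for (std::size_t c = 0; c < v.size(); ++c) {
        lane[c % WARP] += v[c];
    }
    // Step 2: XOR butterfly. At each stage every lane adds the value held by
    // the lane whose index differs in exactly one bit.
    int stages = 0;
    for (int offset = WARP / 2; offset > 0; offset >>= 1) {
        const std::array<float, WARP> prev = lane;  // all lanes exchange simultaneously
        for (int l = 0; l < WARP; ++l) {
            lane[l] = prev[l] + prev[l ^ offset];
        }
        ++stages;
    }
    assert(stages == 5);  // log2(32) exchange steps
    return lane[0];       // after the butterfly every lane holds the full sum
}
```

For inputs whose partial sums are exactly representable the result matches a sequential sum; for general floats the two orders can differ in the last bits, which is why the commit message compares against a tolerance rather than bitwise equality.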

Adds larger-shape cases (m=1, n=1024, k=1024 and n=3072) to the existing test_mul_mat_vec_fusion sweep. The existing test already exercises the mul_mat + add(bias) + GLU + add(residual) pattern via build_graph; this commit just runs it at autoregressive-decoder scale where:
* k>=1024 catches accumulator FP-precision issues that don't show up at k=256
* n>=1024 catches per-row stride bugs that don't show up at n=32
* This is a regression guard for backends that fuse all three ops into a single mul_mat_vec kernel writeback (ggml-vulkan today, ggml-cuda after the matching PR)

400/400 cases pass on current ggml-cuda (which fuses only the first ADD and runs the residual ADD as a separate launch).

Made-with: Cursor

Mirrors ggml-vulkan's MUL_MAT_ADD_ADD shader. Pattern is `((mat * y) + bias) + residual`, common in transformer attention-output and FFN-output blocks where a projection is followed by a bias add and a residual connection.

Without the fusion, ggml-cuda runs three separate kernels per such block: matmul-vec, bias-ADD, residual-ADD. Each ADD pays ~3-4 us of dispatch overhead on RTX-class hardware in addition to the kernel time itself. Folding the residual ADD into the matmul-vec writeback saves the launch overhead of one stand-alone GGML_OP_ADD per residual.

Implementation:
* common.cuh: add x_residual field to ggml_cuda_mm_fusion_args_*. When set together with x_bias the kernel performs dst = mat * y + bias + residual in a single dispatch. Same shape rules as x_bias (ne[0] == dst->ne[0], no broadcasting; host-side detection enforces).
* ggml-cuda.cu: detect the {MUL_MAT, ADD, ADD} pattern in ggml_cuda_graph_evaluate_and_capture, placed above the existing 2-op {MUL_MAT, ADD} fusion so the greedy match prefers the larger fusion when both apply. Only MUL_MAT (not MUL_MAT_ID) is handled.
* mmvf.cu, mmvq.cu: add x_residual processing in the kernel templates and host wrappers. Residual is added AFTER bias and (if any) GLU, matching ggml-vulkan's MUL_MAT_ADD_ADD execution order and the natural graph semantics ((mm + bias) + residual).

Measured perf (RTX 5090 + CUDA Toolkit 12.8 + chatterbox text-to-speech, Turbo Q4_0, 232-token prompt):
* Total GPU time per utterance: -12 %
* MUL_MAT_VEC q4_0 bucket: -47 ms / utterance (residual ADDs folded into matmul-vec writeback)
* CUDA <-> Vulkan gap: 1.29x -> 1.13x on long prompts

Test coverage:
* test_mul_mat_vec_fusion already exercises the ((mm * y) + bias) + residual graph pattern via build_graph when with_bias=true. 100 new test cases added in the previous commit run that pattern at decoder-realistic shapes (m=1, n>=1024, k>=1024).
* 400/400 MUL_MAT_VEC_FUSION cases pass on CUDA.
* 12184/12184 cases pass across the full CUDA test-backend-ops suite (zero regressions).

Bug found and fixed during upstream-prep testing: the original prototype omitted the (sample_dst, channel_bias) offset for x_residual in mmvf.cu (mmvq.cu had it). That was fine for the chatterbox-internal use case where ne[2]==ne[3]==1, but it was caught immediately by test_mul_mat_vec_fusion's batch_dims=[4,2] cases. Fixed by mirroring the existing x_bias offset path.

Made-with: Cursor

Mirrors ggml-vulkan's GGML_VK_PERF_LOGGER=1. When set, prints aggregate per-op GPU time + dispatch count after every ggml_backend_cuda_graph_compute() call. Output format intentionally matches ggml-vulkan's so existing cross-backend grep/awk one-liners work for both backends:

    ----------------
    CUDA Timings:
    MUL_MAT q4_0 m=3072 n=383 k=1024: 24 x 241.979 us = 5807.507 us
    ...
    Total time: 22480.220 us.

Implementation:
* New ggml_cuda_perf_logger class (Meyers singleton, RAII scope helper, cudaEvent_t pool with on-demand growth, aggregation map, sorted print). Per-op scope guard added in the dispatch loop in ggml_cuda_graph_evaluate_and_capture. flush_and_print hook added at the end of ggml_backend_cuda_graph_compute.
* common.cuh: ggml_cuda_graph::is_enabled() extended to disable CUDA Graphs when GGML_CUDA_PERF_LOGGER=1. Graph capture would either hide individual-op timings inside cudaGraphLaunch or re-record over still-pending events on subsequent launches.

Off by default: zero overhead in normal builds. Only the function-local-static getenv check runs in the hot path when the env var is unset.

Useful for cross-backend perf characterisation without needing nsys (heavyweight, NVIDIA-only, sometimes needs root for hardware counters). Same diagnostic value as Vulkan's logger; the identical output format means existing FINDINGS.md-style "which op is the bottleneck on backend X?" tables work for both.

Notes on lifetime:
* The logger is a Meyers-singleton; its destructor runs at static destruction time (after main() returns and possibly after libcudart's own statics tear down). The destructor flushes any pending data but DOES NOT call cudaEventDestroy, because that can crash on a torn-down driver. Letting the OS reclaim the events is safe: this is opt-in via env var, and the event pool is bounded.
* If a long-running daemon ever wants to reset the logger mid-run, add an explicit reset() that's called while CUDA is still alive.

Test coverage:
* 12084/12084 cases pass across the full CUDA test-backend-ops suite with the env var unset (normal path: zero overhead).
* Manual smoke: GGML_CUDA_PERF_LOGGER=1 `test-backend-ops test -o ADD -b CUDA0` produces well-formed 'CUDA Timings:' blocks with per-op timings and 'Total time:' summary, format matches ggml-vulkan's vk_perf_logger.
* Functional / integration testing of the env var path is in the chatterbox.cpp tree (scripts/test-cuda-perf-logger.sh, 4 phases covering default/env-on/env+graphs/aggregate-bound). Not added here because the env var doesn't change op output, only stderr.

Made-with: Cursor

Zbig9000 · 2026-04-27T15:30:20Z

Filed companion issue #1466 documenting the Blackwell flash-attn config gap that I mentioned in this PR's last section. The GGML_CUDA_PERF_LOGGER shipped here is what makes that A/B sweep tractable for maintainers with multi-Blackwell hardware.

The 3-op MUL_MAT_VEC + ADD(bias) + ADD(residual) fusion's mmvf.cu kernel template was missing the (sample_dst, channel_bias) offset for x_residual that x_bias has. Latent for chatterbox where ne[2]==ne[3]==1 makes the offset zero, but exposed by upstream test_mul_mat_vec_fusion(batch_dims=[4,2]) when porting the patch to ggml-org/ggml. Fix mirrors the existing x_bias offset path. Same fix applied upstream in ggml-org/ggml#1465. Made-with: Cursor

QVAC-17873 staging added 5 commits April 27, 2026 17:16

Zbig9000 mentioned this pull request Apr 27, 2026

ggml-cuda: flash-attn MMA picker has no Blackwell (sm_120) entry — silently uses Ampere config #1466

Open

Zbig9000 mentioned this pull request Apr 27, 2026

Chatterbox optimize cpp backend multilingual model for cuda GustavoA1604/chatterbox.cpp#2

Closed

Zbig9000 closed this by deleting the head repository May 18, 2026

Zbig9000 wants to merge 5 commits into ggml-org:master from Zbig9000:pr-cuda-bundle
