ggml-cuda: warp-cooperative conv_transpose_1d, MUL_MAT_VEC + ADD + ADD fusion, GGML_CUDA_PERF_LOGGER env var#1465

Closed
Zbig9000 wants to merge 5 commits into ggml-org:master from Zbig9000:pr-cuda-bundle

@Zbig9000

ggml-cuda: three independent perf / diagnostic improvements

This PR bundles three logically-distinct ggml-cuda changes that came
out of profiling / optimising chatterbox.cpp text-to-speech on RTX
5090 (Blackwell, sm_120, CUDA Toolkit 12.8). Each one is a separate
commit so it can be cherry-picked / squashed individually if you
prefer to land them one at a time.

| # | Commit | Headline |
|---|--------|----------|
| 1 | test: add HiFT-realistic shapes to test_conv_transpose_1d | +5 test-backend-ops cases at HiFT-vocoder shapes |
| 2 | ggml-cuda: warp-cooperative conv_transpose_1d kernel | ~42× faster at HiFT shapes |
| 3 | test: extend test_mul_mat_vec_fusion with decoder-realistic shapes | +100 test-backend-ops cases (m=1, n>=1024, k>=1024) |
| 4 | ggml-cuda: fuse MUL_MAT_VEC + ADD(bias) + ADD(residual) | −12 % total GPU time on transformer decoders |
| 5 | ggml-cuda: add GGML_CUDA_PERF_LOGGER env var | Cross-backend perf diagnostic mirroring GGML_VK_PERF_LOGGER |

Total: +571 / -43 lines across src/ggml-cuda/ and
tests/test-backend-ops.cpp.


1. Warp-cooperative conv_transpose_1d kernel (commits 1, 2)

Problem

Current scalar kernel: one CUDA thread per output pixel, scanning all
IC × IL inputs with a per-iteration skip branch:

for (int c = 0; c < src0_ne2; c++) {                // c in [0, IC)
    for (int i = 0; i < src1_ne0; i++) {            // i in [0, IL)
        if (!(idx >= i*s0 && idx < i*s0 + src0_ne0)) {
            continue;                               // skip ~99% of iterations
        }
        accumulator += src0[...] * src1[...];
    }
}

At HiFT-vocoder-realistic shapes (L=303, IC=80, K=16, s0=8) this is
~24 K loop iterations per output pixel with the skip branch firing on
~99 % of iterations. HiFT decode spends 67 % of total GPU time in
this kernel (4 700 µs / call) on RTX 5090.
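
The iteration count quoted above can be checked on the host. The following is a minimal C++ sketch, not the kernel itself: `SkipStats` and `count_iterations` are hypothetical names, and it assumes that L=303 is the input length IL and that K plays the role of `src0_ne0`.

```cpp
#include <cassert>

// Hypothetical helper type: iterations visited vs. iterations that pass
// the skip test.
struct SkipStats {
    long total;
    long contributing;
};

// Walks the same (c, i) loop nest as the scalar kernel for one output index
// `idx` and counts how many iterations survive the skip branch.
SkipStats count_iterations(int idx, int IC, int IL, int K, int s0) {
    assert(IC > 0 && IL > 0 && K > 0 && s0 > 0);
    SkipStats st{0, 0};
    for (int c = 0; c < IC; c++) {
        for (int i = 0; i < IL; i++) {
            st.total++;
            if (idx >= i * s0 && idx < i * s0 + K) {
                st.contributing++;
            }
        }
    }
    return st;
}
```

For `count_iterations(1000, 80, 303, 16, 8)` the loop visits 24 240 iterations, of which 160 contribute (two input positions × 80 channels), so roughly 99.3 % are skipped.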

Solution

  1. Narrow the input range to what actually contributes:
    i ∈ [⌈(ol - K + 1)/s0⌉, ⌊ol/s0⌋] ∩ [0, IL-1]. At K=16, s0=8
    this is 2 iterations of i instead of IL=O(100). Skip
    branch eliminated entirely.
  2. Parallelise IC reduction across the warp (32 threads); reduce
    via 5 stages of __shfl_xor_sync.
  3. Block size: 256 → 32 (one warp per output pixel).
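
The three steps above can be sketched on the host as follows. This is an illustrative C++ emulation under stated assumptions, not the CUDA kernel from the PR: the names (`contributing_inputs`, `butterfly_sum`, `output_pixel`) are hypothetical, the tensor layouts are simplified to a single output channel, and the 32-lane array stands in for a warp, with the XOR loop playing the role of `__shfl_xor_sync`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Inclusive range of input positions that can reach output position `ol`;
// the range is empty when first > last.
struct InputRange {
    int first;
    int last;
};

// Integer division rounding toward negative infinity (b > 0).
static int floor_div(int a, int b) {
    assert(b > 0);
    int q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

InputRange contributing_inputs(int ol, int K, int s0, int IL) {
    int first = floor_div(ol - K + 1 + s0 - 1, s0);  // ceil((ol - K + 1) / s0)
    int last  = floor_div(ol, s0);                   // floor(ol / s0)
    return {std::max(first, 0), std::min(last, IL - 1)};
}

// Host emulation of a 32-lane XOR butterfly: after 5 stages every lane
// holds the sum of all 32 inputs.
float butterfly_sum(std::array<float, 32> lane) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        std::array<float, 32> next{};
        for (int l = 0; l < 32; l++) {
            next[l] = lane[l] + lane[l ^ offset];
        }
        lane = next;
    }
    return lane[0];
}

// One output value for a single output channel. Assumed layouts (illustrative,
// not ggml's): x[c * IL + i] for the input, w[c * K + k] for the kernel.
float output_pixel(const std::vector<float> & x, const std::vector<float> & w,
                   int ol, int IC, int IL, int K, int s0) {
    InputRange r = contributing_inputs(ol, K, s0, IL);
    std::array<float, 32> partial{};
    for (int lane = 0; lane < 32; lane++) {
        for (int c = lane; c < IC; c += 32) {  // lanes stride over channels
            for (int i = r.first; i <= r.last; i++) {
                partial[lane] += x[c * IL + i] * w[c * K + (ol - i * s0)];
            }
        }
    }
    return butterfly_sum(partial);
}
```

With K=16 and s0=8, `contributing_inputs(1000, 16, 8, 303)` yields the two positions 124 and 125, and the kernel tap index `ol - i * s0` always falls inside `[0, K-1]`, so no skip test is needed inside the loop.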

Perf (RTX 5090, CUDA 12.8)

  • conv_transpose_1d kernel time: 4 700 µs → 110 µs per call (~42×)
  • Share of HiFT decode GPU time spent in this kernel: 67 % → 1.6 %
  • End-to-end HiFT speedup on a chatterbox 232-token utterance: 4.7×

Tests

Existing 116 test_conv_transpose_1d cases continue to pass. Commit
1 of this PR adds 5 HiFT-realistic shapes that exercise the
warp-cooperative reduction at scale (IC > 32 multi-warp accumulation,
K > s0 inner-loop unroll). These pass on both the legacy
scalar kernel and the new warp-cooperative kernel — they're a
regression guard for future changes, not a "fail before, pass after"
pattern.

| Test phase | Result |
|------------|--------|
| test-backend-ops -o CONV_TRANSPOSE_1D -b CUDA0 baseline | 116/116 PASS |
| after commit 1 (test-only) | 121/121 PASS |
| after commit 2 (kernel rewrite) | 121/121 PASS |

Files: src/ggml-cuda/conv-transpose-1d.cu,
src/ggml-cuda/conv-transpose-1d.cuh, tests/test-backend-ops.cpp.


2. MUL_MAT_VEC + ADD(bias) + ADD(residual) 3-op fusion (commits 3, 4)

Problem

ggml-cuda's fusion engine knows the 2-op pattern MUL_MAT_VEC + ADD(bias) but not the 3-op pattern that includes the residual. The
graph shape ((mat * y) + bias) + residual is common in transformer
attention-output and FFN-output blocks; ggml-vulkan already fuses it
via the MUL_MAT_ADD_ADD shader. Profiling chatterbox vs Vulkan
showed CUDA paid ~67 ms / utterance more in stand-alone ADD launches.

Solution

  1. Extend ggml_cuda_mm_fusion_args_* (in common.cuh) with an
    x_residual field. When set together with x_bias, the matmul-vec
    kernel performs dst = mat * y + bias + residual in a single
    dispatch.
  2. Detect the {MUL_MAT, ADD, ADD} pattern in
    ggml_cuda_graph_evaluate_and_capture (placed above the
    existing 2-op fusion so the greedy match prefers the larger).
    Resolve bias_tensor / residual_tensor from the cgraph;
    reject patterns with broadcasting on either ADD (matches the 2-op
    fusion's existing constraint).
  3. Update mul_mat_vec_q (mmvq.cu) and mul_mat_vec_f (mmvf.cu)
    templates to consume x_residual — added after bias and any GLU,
    matching ggml-vulkan's MUL_MAT_ADD_ADD execution order.

Only MUL_MAT (not MUL_MAT_ID) is handled — the residual ADD
pattern doesn't apply to MoE expert routing. The new code path
falls back gracefully to plain dispatch when the fusion isn't
applicable.
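
The host-side detection in step 2 can be pictured with a toy graph. The sketch below is an assumption-laden illustration in C++, not the code in ggml-cuda.cu: `Node`, `Op`, `FusionMatch` and `match_mul_mat_add_add` are hypothetical stand-ins for ggml's types, "no broadcasting" is read here as "the extra operand has exactly the ADD's shape", and real checks such as intermediate results having no other consumers are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-ins for graph nodes; these are NOT ggml's types.
enum class Op { MUL_MAT, ADD, OTHER };

struct Node {
    Op op;
    int64_t ne[4];        // shape, ggml-style: ne[0] is the fastest dimension
    const Node * src[2];  // operands (nullptr for leaves)
};

struct FusionMatch {
    bool ok = false;
    const Node * bias = nullptr;
    const Node * residual = nullptr;
};

static bool same_shape(const Node * a, const Node * b) {
    for (int d = 0; d < 4; d++) {
        if (a->ne[d] != b->ne[d]) return false;
    }
    return true;
}

// Returns the operand of `add` that is not `prev`, or nullptr when `add`
// does not consume `prev` or the other operand would need broadcasting.
static const Node * other_operand(const Node * add, const Node * prev) {
    if (add->op != Op::ADD) return nullptr;
    const Node * other = nullptr;
    if (add->src[0] == prev) {
        other = add->src[1];
    } else if (add->src[1] == prev) {
        other = add->src[0];
    }
    return (other != nullptr && same_shape(other, add)) ? other : nullptr;
}

// Tries to match nodes[i], nodes[i+1], nodes[i+2] as MUL_MAT -> ADD -> ADD.
FusionMatch match_mul_mat_add_add(const std::vector<const Node *> & nodes, size_t i) {
    FusionMatch m;
    if (i + 2 >= nodes.size() || nodes[i]->op != Op::MUL_MAT) return m;
    m.bias     = other_operand(nodes[i + 1], nodes[i]);
    m.residual = other_operand(nodes[i + 2], nodes[i + 1]);
    m.ok       = m.bias != nullptr && m.residual != nullptr;
    return m;
}
```

Trying this 3-op matcher before a 2-op matcher at the same node index is what makes a greedy scan prefer the larger fusion; when it returns `ok == false`, the caller can fall through to the smaller pattern or to plain dispatch.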

Perf (RTX 5090, CUDA 12.8, chatterbox Turbo Q4_0, 232-token prompt)

  • Total GPU time per utterance: −12 %
  • MUL_MAT_VEC q4_0 op-bucket time: −47 ms / utterance
  • CUDA ↔ Vulkan gap on long prompts: 1.29× → 1.13×

Tests

test_mul_mat_vec_fusion::build_graph already exercises the
((mm * y) + bias) + residual pattern via:

ggml_tensor * ffn_up = ggml_mul_mat(ctx, up, cur);
if (with_bias) {
    ffn_up = ggml_add(ctx, ffn_up, up_bias);          // 2nd op
}
ggml_tensor * out = with_gate ? build_gate(ctx, ffn_gate, ffn_up) : ffn_up;
out = ggml_add(ctx, out, bias2);                      // 3rd op  ← residual

Commit 3 adds 100 new cases at decoder-realistic shapes (m=1, n ∈ {1024, 3072}, k=1024) that catch FP-precision and per-row stride
issues that don't surface at the existing m=1, n=32, k=256 cases.

| Test phase | Result |
|------------|--------|
| test-backend-ops -o MUL_MAT_VEC_FUSION -b CUDA0 baseline | 300/300 PASS |
| after commit 3 (test-only) | 400/400 PASS |
| after commit 4 (fusion patch, first attempt) | 2/400 FAIL at batch_dims=[4,2] |
| | Bug found: x_residual missing (sample_dst, channel_bias) offset in mmvf.cu |
| after fix | 400/400 PASS |

The bug was latent in the original prototype patch, which was developed
against chatterbox, where ne[2]==ne[3]==1 makes the offset 0. Upstream's
broader test coverage caught it immediately through the
batch_dims=[4,2] cases, a good demonstration of why the
test-before-change discipline matters. Fixed in the same commit
(mmvf.cu now applies the offset the same way as for x_bias).
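
Why the missing offset only shows up with batch dimensions can be reproduced with a small host-side reference. This is a hedged C++ sketch, not the mmvf.cu kernel: `fused_mat_vec` is a hypothetical name, the contiguous layouts are assumptions chosen for illustration, and the `offset_residual` flag exists only to mimic the bug.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for dst = mat * y + bias + residual with two batch dimensions.
// Assumed contiguous layouts (illustrative only): mat is [nrows][ncols] and
// shared by all batches; y is [sample][channel][ncols]; bias, residual and
// dst are [sample][channel][nrows].
std::vector<float> fused_mat_vec(const std::vector<float> & mat, const std::vector<float> & y,
                                 const std::vector<float> & bias, const std::vector<float> & residual,
                                 int nrows, int ncols, int nchannels, int nsamples,
                                 bool offset_residual = true) {
    assert((int) mat.size() == nrows * ncols);
    std::vector<float> dst((size_t) nsamples * nchannels * nrows);
    for (int s = 0; s < nsamples; s++) {
        for (int ch = 0; ch < nchannels; ch++) {
            const int batch   = s * nchannels + ch;
            const int off     = batch * nrows;                // per-(sample, channel) offset
            const int res_off = offset_residual ? off : 0;    // 0 reproduces the bug
            for (int r = 0; r < nrows; r++) {
                float acc = 0.0f;
                for (int k = 0; k < ncols; k++) {
                    acc += mat[r * ncols + k] * y[batch * ncols + k];
                }
                // Residual is added after bias, in graph order.
                dst[off + r] = acc + bias[off + r] + residual[res_off + r];
            }
        }
    }
    return dst;
}
```

With `nchannels == nsamples == 1` the offset is zero and both settings of `offset_residual` agree, which is why a single-batch workload cannot expose the problem; with 4 × 2 batches the two results diverge from the second batch onward.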

Files: src/ggml-cuda/common.cuh, src/ggml-cuda/ggml-cuda.cu,
src/ggml-cuda/mmvf.cu, src/ggml-cuda/mmvq.cu,
tests/test-backend-ops.cpp.


3. GGML_CUDA_PERF_LOGGER env var (commit 5)

Problem

ggml-vulkan ships GGML_VK_PERF_LOGGER=1 and prints a structured
table after every compute graph; ggml-cuda has no equivalent.
Cross-backend perf characterisation today requires nsys (heavy,
NVIDIA-only, sometimes needs root for hardware counters) or one-off
manual instrumentation.

Solution

A symmetric env var that prints the same output format as
vk_perf_logger:

----------------
CUDA Timings:
MUL_MAT q4_0 m=3072 n=383 k=1024: 24 x 241.979 us = 5807.507 us
FLASH_ATTN_EXT (64,16,411,1): 24 x 5996.571 us = 143917.704 us
…
Total time: 22480.220 us.

Same prefix structure (----------------, <Backend> Timings:,
per-op rows, Total time: summary), same number formatting — so
existing cross-backend grep / awk one-liners work unchanged.
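
As an illustration of the row format, the sketch below formats one aggregated row. It is inferred only from the sample output shown above (average printed as total divided by count, three decimals), not from the logger's source; `format_timing_row` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats one aggregated row as "<label>: <count> x <avg> us = <total> us".
// The average is derived as total / count (an assumption based on the
// sample rows).
std::string format_timing_row(const std::string & label, int count, double total_us) {
    assert(count > 0);
    char buf[256];
    std::snprintf(buf, sizeof(buf), "%s: %d x %.3f us = %.3f us",
                  label.c_str(), count, total_us / count, total_us);
    return std::string(buf);
}
```

Calling `format_timing_row("MUL_MAT q4_0 m=3072 n=383 k=1024", 24, 5807.507)` reproduces the first sample row, which is the property the grep / awk compatibility relies on.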

Implementation notes

  • Meyers-singleton ggml_cuda_perf_logger, off by default. Only the
    function-local-static getenv check runs in the hot path when the
    env var is unset.
  • RAII scope helper records cudaEventRecord(start) on
    construction and cudaEventRecord(end) on destruction — one
    scope per dispatched op, including before any continue taken
    by the fusion fast-paths.
  • cudaStreamSynchronize + flush_and_print is called at the end of
    every ggml_backend_cuda_graph_compute so elapsed times are
    readable before the events are re-used.
  • CUDA Graphs auto-disable when the env var is set (otherwise events
    would either re-record over still-pending events or get hidden
    inside cudaGraphLaunch).
  • Destructor flushes any pending data but does not call
    cudaEventDestroy — the singleton's dtor runs at static
    destruction time (after main()), where the CUDA driver may
    already be torn down. OS reclaim is safe; pool is bounded.
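
The singleton and scope-guard structure described in these notes can be sketched in plain C++. This is a host-only illustration under stated assumptions, not the PR's implementation: `std::chrono` replaces the `cudaEventRecord` pair so the sketch runs without a GPU, the class names are hypothetical, and treating any set value of the env var as "on" is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>

class perf_logger_sketch {
public:
    // Meyers singleton: constructed on first use, destroyed at static
    // destruction time.
    static perf_logger_sketch & instance() {
        static perf_logger_sketch logger;
        return logger;
    }
    // The environment is read once; later calls only test the cached flag.
    static bool enabled() {
        static const bool on = std::getenv("GGML_CUDA_PERF_LOGGER") != nullptr;
        return on;
    }
    void record(const std::string & op, double us) {
        assert(us >= 0.0);
        entry & e = totals_[op];
        e.count++;
        e.total_us += us;
    }
    int count(const std::string & op) const {
        auto it = totals_.find(op);
        return it == totals_.end() ? 0 : it->second.count;
    }
private:
    perf_logger_sketch() = default;
    struct entry {
        int count = 0;
        double total_us = 0.0;
    };
    std::map<std::string, entry> totals_;
};

// RAII scope: starts timing on construction, records on destruction, so an
// early `continue` in the dispatch loop still closes the measurement.
class perf_scope_sketch {
public:
    explicit perf_scope_sketch(std::string op)
        : op_(std::move(op)), active_(perf_logger_sketch::enabled()) {
        if (active_) start_ = clock_type::now();
    }
    ~perf_scope_sketch() {
        if (!active_) return;
        std::chrono::duration<double, std::micro> d = clock_type::now() - start_;
        perf_logger_sketch::instance().record(op_, d.count());
    }
private:
    using clock_type = std::chrono::steady_clock;
    std::string op_;
    bool active_;
    clock_type::time_point start_;
};
```

When the variable is unset, constructing a `perf_scope_sketch` costs one cached boolean test and nothing is recorded, which mirrors the "off by default" behaviour described above.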

Tests

The env var is a stderr-only diagnostic that doesn't change op
output, so there's no natural insertion point in test-backend-ops
for testing it. A manual smoke test confirms that

GGML_CUDA_PERF_LOGGER=1 test-backend-ops test -o ADD -b CUDA0

produces well-formed CUDA Timings: blocks (99 blocks across the ADD
test sweep, matching vk_perf_logger's format byte-for-byte).

Functional / integration testing of the env var path lives downstream
in chatterbox.cpp's scripts/test-cuda-perf-logger.sh (4 phases:
default-silent, env-on-output-format, env+graphs-mutual-exclusion,
aggregate-time-bound) — happy to port a stripped-down version if
you'd prefer a unit test in this PR.

Files: src/ggml-cuda/ggml-cuda.cu, src/ggml-cuda/common.cuh.


Combined regression — 12189/12189 PASS

The three changes are independent (different files / different code
paths), but they're stacked on the same branch in this PR. Final
sanity check on the combined pr-cuda-bundle:

$ ./build/bin/test-backend-ops test -b CUDA0
…
  12189/12189 tests passed
  Backend CUDA0: OK
2/2 backends passed
OK

That's 12084 (baseline @ 8be60f8) + 5 (HiFT cases) + 100 (decoder-shape cases) = 12189, no regressions. Bench logs
(per-PR + combined) are available on request.


Reproduction

git fetch origin pr-cuda-bundle
git checkout pr-cuda-bundle
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BUILD_TESTS=ON
cmake --build build --target test-backend-ops -j
./build/bin/test-backend-ops test -b CUDA0

# verify the perf logger:
GGML_CUDA_PERF_LOGGER=1 ./build/bin/test-backend-ops test -o ADD -b CUDA0

Splitting if preferred

If you'd rather review / land these one at a time, the 5 commits group
cleanly into three separable PRs:

  • commits 1+2 → conv_transpose_1d kernel rewrite
  • commits 3+4 → 3-op fusion
  • commit 5 → perf logger

The corresponding individual PR descriptions (with a bit more detail
per piece) are at the following branches in the same fork, ready to
push:
pr1-conv-transpose-1d-cuda-warp-cooperative,
pr2-mul-mat-vec-add-add-fusion-cuda,
pr3-cuda-perf-logger.
Happy to convert this PR into three separate ones if reviewers
prefer.


Companion findings (not in this PR)

While preparing this work, profiling on RTX 5090 (Blackwell, sm_120)
revealed that ggml_cuda_fattn_mma_get_config has no Blackwell entry:
ampere_mma_available(cc) returns true for any cc >= 800, so
sm_120 silently uses the Ampere config tuned for sm_80. This is the
single biggest remaining FLASH_ATTN_EXT perf gap to ggml-vulkan on
chatterbox-style workloads (~67 ms / utterance). Tuning a Blackwell
config empirically requires either ncu hardware counters or a
multi-day parameter sweep on Blackwell hardware — best done by
maintainers with multi-Blackwell access. Happy to file a separate
issue / discussion thread for this if there's interest.

The new GGML_CUDA_PERF_LOGGER shipped in this PR makes such an
A/B sweep convenient (no rebuild required to switch variants when
combined with a small follow-up env var to override the picker).

QVAC-17873 staging added 5 commits April 27, 2026 17:16

Commit 1: test: add HiFT-realistic shapes to test_conv_transpose_1d

Adds 5 HiFT-vocoder-realistic test cases to exercise the
warp-cooperative CUDA kernel paths (IC > 32 multi-warp
accumulation, K > s0 multi-touch, etc).  These pass on the
existing scalar kernel and serve as a regression guard for
future kernel rewrites.

Made-with: Cursor

Commit 2: ggml-cuda: warp-cooperative conv_transpose_1d kernel

The current scalar kernel allocates one CUDA thread per output pixel
and has each thread loop over all IC*IL input values, with a
per-iteration branch that only triggers on a small fraction of
iterations.  At HiFT-vocoder-realistic shapes (L=303, IC=80, K=16,
s0=8) this is ~24 K loop iterations per output pixel (IC × IL =
80 × 303), ~99% skip-branch overhead.

Warp-cooperative version:
* One warp (32 threads) cooperatively computes one output pixel.
* Input position range narrowed to i in [ceil((ol-K+1)/s0),
  floor(ol/s0)] - typically 2 iterations of i instead of IL=O(100).
  Skip branch eliminated entirely.
* IC reduction parallelised across warp lanes; partial sums reduced
  via __shfl_xor_sync.
* Block size 256 -> 32 (one warp per pixel).

Measured perf (RTX 5090 + CUDA Toolkit 12.8 + chatterbox HiFT vocoder
on a 232-token prompt):
* conv_transpose_1d kernel time: 4700 us -> 110 us per call (~42x).
* HiFT decode total GPU time: 67% -> 1.6%.

Test coverage:
* 121/121 cases in test-backend-ops CONV_TRANSPOSE_1D pass on both
  the legacy scalar kernel and the new warp-cooperative kernel,
  including 5 new HiFT-realistic shapes added in the previous
  commit.
* No API change; output is identical up to FP-reduction order
  (sequential -> tree butterfly).  The difference stays within
  ggml's default 1e-7 NMSE / 1e-3 abs tolerance for f32.

Made-with: Cursor

Commit 3: test: extend test_mul_mat_vec_fusion with decoder-realistic shapes

Adds larger-shape cases (m=1, n=1024, k=1024 and n=3072) to the
existing test_mul_mat_vec_fusion sweep.  The existing test already
exercises the mul_mat + add(bias) + GLU + add(residual) pattern via
build_graph; this commit just runs it at autoregressive-decoder
scale where:

* k>=1024 catches accumulator FP-precision issues that don't show up
  at k=256
* n>=1024 catches per-row stride bugs that don't show up at n=32
* This is a regression guard for backends that fuse all three ops
  into a single mul_mat_vec kernel writeback (ggml-vulkan today,
  ggml-cuda after the matching PR)

400/400 cases pass on current ggml-cuda (which fuses only the first
ADD and runs the residual ADD as a separate launch).

Made-with: Cursor

Commit 4: ggml-cuda: fuse MUL_MAT_VEC + ADD(bias) + ADD(residual)

Mirrors ggml-vulkan's MUL_MAT_ADD_ADD shader.  Pattern is
`((mat * y) + bias) + residual`, common in transformer
attention-output and FFN-output blocks where a projection is
followed by a bias add and a residual connection.

Without the fusion, ggml-cuda runs three separate kernels per
such block: matmul-vec, bias-ADD, residual-ADD.  Each ADD pays
~3-4 us of dispatch overhead on RTX-class hardware in addition
to the kernel time itself.  Folding the residual ADD into the
matmul-vec writeback saves the launch overhead of one
stand-alone GGML_OP_ADD per residual.

Implementation:

* common.cuh: add x_residual field to ggml_cuda_mm_fusion_args_*.
  When set together with x_bias the kernel performs
  dst = mat * y + bias + residual in a single dispatch.
  Same shape rules as x_bias (ne[0] == dst->ne[0],
  no broadcasting; host-side detection enforces).
* ggml-cuda.cu: detect the {MUL_MAT, ADD, ADD} pattern in
  ggml_cuda_graph_evaluate_and_capture, placed above the
  existing 2-op {MUL_MAT, ADD} fusion so the greedy match
  prefers the larger fusion when both apply.  Only MUL_MAT
  (not MUL_MAT_ID) is handled.
* mmvf.cu, mmvq.cu: add x_residual processing in the kernel
  templates and host wrappers.  Residual is added AFTER bias
  and (if any) GLU, matching ggml-vulkan's
  MUL_MAT_ADD_ADD execution order and the natural graph
  semantics ((mm + bias) + residual).

Measured perf (RTX 5090 + CUDA Toolkit 12.8 + chatterbox
text-to-speech, Turbo Q4_0, 232-token prompt):
* Total GPU time per utterance: -12 %
* MUL_MAT_VEC q4_0 bucket: -47 ms / utterance (residual ADDs
  folded into matmul-vec writeback)
* CUDA <-> Vulkan gap: 1.29x -> 1.13x on long prompts

Test coverage:
* test_mul_mat_vec_fusion already exercises the
  ((mm * y) + bias) + residual graph pattern via build_graph
  when with_bias=true.  100 new test cases added in the previous
  commit run that pattern at decoder-realistic shapes
  (m=1, n>=1024, k>=1024).
* 400/400 MUL_MAT_VEC_FUSION cases pass on CUDA.
* 12184/12184 cases pass across the full CUDA test-backend-ops
  suite (zero regressions).

Bug found and fixed during upstream-prep testing: the original
prototype omitted the (sample_dst, channel_bias) offset for
x_residual in mmvf.cu (mmvq.cu had it) — fine for the
chatterbox-internal use case where ne[2]==ne[3]==1, but caught
immediately by test_mul_mat_vec_fusion's batch_dims=[4,2]
cases.  Fixed by mirroring the existing x_bias offset path.

Made-with: Cursor

Commit 5: ggml-cuda: add GGML_CUDA_PERF_LOGGER env var

Mirrors ggml-vulkan's GGML_VK_PERF_LOGGER=1.  When set, prints
aggregate per-op GPU time + dispatch count after every
ggml_backend_cuda_graph_compute() call.  Output format
intentionally matches ggml-vulkan's so existing cross-backend
grep/awk one-liners work for both backends:

    ----------------
    CUDA Timings:
    MUL_MAT q4_0 m=3072 n=383 k=1024: 24 x 241.979 us = 5807.507 us
    ...
    Total time: 22480.220 us.

Implementation:

* New ggml_cuda_perf_logger class (Meyers singleton, RAII scope
  helper, cudaEvent_t pool with on-demand growth, aggregation map,
  sorted print).  Per-op scope guard added in the dispatch loop in
  ggml_cuda_graph_evaluate_and_capture.  flush_and_print hook
  added at the end of ggml_backend_cuda_graph_compute.
* common.cuh: ggml_cuda_graph::is_enabled() extended to disable
  CUDA Graphs when GGML_CUDA_PERF_LOGGER=1.  Graph capture would
  either hide individual-op timings inside cudaGraphLaunch or
  re-record over still-pending events on subsequent launches.

Off by default: zero overhead in normal builds.  Only the
function-local-static getenv check runs in the hot path when the
env var is unset.

Useful for cross-backend perf characterisation without needing
nsys (heavyweight, NVIDIA-only, sometimes needs root for hardware
counters).  Same diagnostic value as Vulkan's logger; the
identical output format means existing FINDINGS.md-style
"which op is the bottleneck on backend X?" tables work for both.

Notes on lifetime:

* The logger is a Meyers-singleton; its destructor runs at static
  destruction time (after main() returns and possibly after
  libcudart's own statics tear down).  The destructor flushes any
  pending data but DOES NOT call cudaEventDestroy — that can crash
  on a torn-down driver.  Letting the OS reclaim the events is
  safe: this is opt-in via env var, and the event pool is bounded.
* If a long-running daemon ever wants to reset the logger mid-run,
  add an explicit reset() that's called while CUDA is still alive.

Test coverage:

* 12084/12084 cases pass across the full CUDA test-backend-ops
  suite with the env var unset (normal path: zero overhead).
* Manual smoke: GGML_CUDA_PERF_LOGGER=1 `test-backend-ops test
  -o ADD -b CUDA0` produces well-formed 'CUDA Timings:' blocks
  with per-op timings and 'Total time:' summary, format matches
  ggml-vulkan's vk_perf_logger.
* Functional / integration testing of the env var path is in the
  chatterbox.cpp tree (scripts/test-cuda-perf-logger.sh, 4 phases
  covering default/env-on/env+graphs/aggregate-bound).  Not added
  here because the env var doesn't change op output, only stderr.

Made-with: Cursor
@Zbig9000

Filed companion issue #1466 documenting the Blackwell flash-attn config gap that I mentioned in this PR's last section. The GGML_CUDA_PERF_LOGGER shipped here is what makes that A/B sweep tractable for maintainers with multi-Blackwell hardware.

Zbig9000 added a commit to Zbig9000/chatterbox.cpp that referenced this pull request Apr 27, 2026
The 3-op MUL_MAT_VEC + ADD(bias) + ADD(residual) fusion's mmvf.cu
kernel template was missing the (sample_dst, channel_bias) offset
for x_residual that x_bias has. Latent for chatterbox where
ne[2]==ne[3]==1 makes the offset zero, but exposed by upstream
test_mul_mat_vec_fusion(batch_dims=[4,2]) when porting the patch
to ggml-org/ggml.

Fix mirrors the existing x_bias offset path. Same fix applied
upstream in ggml-org/ggml#1465.

Made-with: Cursor
@Zbig9000 closed this by deleting the head repository on May 18, 2026