ggml-cuda: warp-cooperative conv_transpose_1d, MUL_MAT_VEC + ADD + ADD fusion, GGML_CUDA_PERF_LOGGER env var #1465
Closed
Zbig9000 wants to merge 5 commits into
added 5 commits
April 27, 2026 17:16
Adds 5 HiFT-vocoder-realistic test cases to exercise the
warp-cooperative CUDA kernel paths (IC > 32 multi-warp accumulation,
K > s0 multi-touch, etc.). These pass on the existing scalar kernel
and serve as a regression guard for future kernel rewrites.
Made-with: Cursor
The current scalar kernel allocates one CUDA thread per output pixel
and has each thread loop over all IC*IL input values, with a
per-iteration branch that only triggers on a small fraction of
iterations. At HiFT-vocoder-realistic shapes (L=303, IC=80, K=16,
s0=8) this is ~24 K loop iterations per output pixel
(IC*IL = 80*303 = 24240), ~99% of them skip-branch overhead.
Warp-cooperative version:
* One warp (32 threads) cooperatively computes one output pixel.
* Input position range narrowed to
i in [ceil((ol-K+1)/s0), floor(ol/s0)], typically 2 iterations of
i instead of IL=O(100). Skip branch eliminated entirely.
* IC reduction parallelised across warp lanes; partial sums reduced
via __shfl_xor_sync.
* Block size 256 -> 32 (one warp per pixel).
Measured perf (RTX 5090 + CUDA Toolkit 12.8 + chatterbox HiFT vocoder
on a 232-token prompt):
* conv_transpose_1d kernel time: 4700 us -> 110 us per call (~42x).
* HiFT decode total GPU time: 67% -> 1.6%.
Test coverage:
* 121/121 cases in test-backend-ops CONV_TRANSPOSE_1D pass on both
the legacy scalar kernel and the new warp-cooperative kernel,
including 5 new HiFT-realistic shapes added in the previous commit.
* No API change; output is identical up to FP-reduction order
(sequential -> tree butterfly). Within ggml's default 1e-7 NMSE /
1e-3 abs tolerance for f32.
Made-with: Cursor
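The range narrowing described in this commit message can be checked on the host. The following is a minimal C++ sketch, not the ggml-cuda kernel: the `Shape` struct, both function names, and the memory layouts are assumptions made for illustration. It computes the same transposed convolution once by scanning every (ic, i) pair with a skip branch and once over the narrowed range of i.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side reference only; layouts and names are assumptions for this sketch:
//   in[ic*IL + i], w[(ic*OC + oc)*K + k], out[oc*OL + ol], OL = (IL - 1)*s0 + K
struct Shape {
    int IL, IC, OC, K, s0;
    int OL() const { return (IL - 1) * s0 + K; }
};

// Scan formulation: visit every (ic, i) pair and skip the ones that miss ol.
std::vector<float> conv_transpose_1d_scan(const Shape & s,
                                          const std::vector<float> & in,
                                          const std::vector<float> & w) {
    assert((int) in.size() == s.IC * s.IL && (int) w.size() == s.IC * s.OC * s.K);
    std::vector<float> out(s.OC * s.OL(), 0.0f);
    for (int oc = 0; oc < s.OC; ++oc) {
        for (int ol = 0; ol < s.OL(); ++ol) {
            float acc = 0.0f;
            for (int ic = 0; ic < s.IC; ++ic) {
                for (int i = 0; i < s.IL; ++i) {
                    const int k = ol - i * s.s0;
                    if (k < 0 || k >= s.K) continue;  // the skip branch
                    acc += in[ic * s.IL + i] * w[(ic * s.OC + oc) * s.K + k];
                }
            }
            out[oc * s.OL() + ol] = acc;
        }
    }
    return out;
}

// Narrowed formulation: only i in [ceil((ol-K+1)/s0), floor(ol/s0)],
// clamped to [0, IL-1], so no skip branch is needed.
std::vector<float> conv_transpose_1d_narrow(const Shape & s,
                                            const std::vector<float> & in,
                                            const std::vector<float> & w) {
    assert((int) in.size() == s.IC * s.IL && (int) w.size() == s.IC * s.OC * s.K);
    std::vector<float> out(s.OC * s.OL(), 0.0f);
    for (int oc = 0; oc < s.OC; ++oc) {
        for (int ol = 0; ol < s.OL(); ++ol) {
            const int n     = ol - s.K + 1;
            const int i_min = n > 0 ? (n + s.s0 - 1) / s.s0 : 0;  // ceil, clamped at 0
            const int i_max = std::min(s.IL - 1, ol / s.s0);      // floor, clamped at IL-1
            float acc = 0.0f;
            for (int ic = 0; ic < s.IC; ++ic) {
                for (int i = i_min; i <= i_max; ++i) {
                    const int k = ol - i * s.s0;  // always in [0, K) here
                    acc += in[ic * s.IL + i] * w[(ic * s.OC + oc) * s.K + k];
                }
            }
            out[oc * s.OL() + ol] = acc;
        }
    }
    return out;
}
```

Both functions accumulate the contributing terms in the same order, so on the host they agree exactly; the reduction-order caveat in the commit message applies only once the IC sum is split across warp lanes.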
Adds larger-shape cases (m=1, n=1024, k=1024 and n=3072) to the
existing test_mul_mat_vec_fusion sweep. The existing test already
exercises the mul_mat + add(bias) + GLU + add(residual) pattern via
build_graph; this commit just runs it at autoregressive-decoder scale
where:
* k>=1024 catches accumulator FP-precision issues that don't show up
at k=256
* n>=1024 catches per-row stride bugs that don't show up at n=32
* This is a regression guard for backends that fuse all three ops
into a single mul_mat_vec kernel writeback (ggml-vulkan today,
ggml-cuda after the matching PR)
400/400 cases pass on current ggml-cuda (which fuses only the first
ADD and runs the residual ADD as a separate launch).
Made-with: Cursor
Mirrors ggml-vulkan's MUL_MAT_ADD_ADD shader. Pattern is
`((mat * y) + bias) + residual`, common in transformer
attention-output and FFN-output blocks where a projection is
followed by a bias add and a residual connection.
Without the fusion, ggml-cuda runs three separate kernels per
such block: matmul-vec, bias-ADD, residual-ADD. Each ADD pays
~3-4 us of dispatch overhead on RTX-class hardware in addition
to the kernel time itself. Folding the residual ADD into the
matmul-vec writeback saves the launch overhead of one
stand-alone GGML_OP_ADD per residual.
Implementation:
* common.cuh: add x_residual field to ggml_cuda_mm_fusion_args_*.
When set together with x_bias the kernel performs
dst = mat * y + bias + residual in a single dispatch.
Same shape rules as x_bias (ne[0] == dst->ne[0],
no broadcasting; host-side detection enforces).
* ggml-cuda.cu: detect the {MUL_MAT, ADD, ADD} pattern in
ggml_cuda_graph_evaluate_and_capture, placed above the
existing 2-op {MUL_MAT, ADD} fusion so the greedy match
prefers the larger fusion when both apply. Only MUL_MAT
(not MUL_MAT_ID) is handled.
* mmvf.cu, mmvq.cu: add x_residual processing in the kernel
templates and host wrappers. Residual is added AFTER bias
and (if any) GLU, matching ggml-vulkan's
MUL_MAT_ADD_ADD execution order and the natural graph
semantics ((mm + bias) + residual).
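To make the writeback order concrete, here is a minimal host-side C++ sketch of the fused semantics, assuming a dense row-major f32 matrix. The function name and signature are hypothetical; the real kernels in mmvf.cu / mmvq.cu operate on quantized or half-precision data and also handle GLU, which this sketch omits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference for dst = (mat * y + bias) + residual.
// `bias` and `residual` are optional (nullptr = absent) and, when present,
// must have one element per output row (no broadcasting).
std::vector<float> mul_mat_vec_fused(const std::vector<float> & mat,
                                     const std::vector<float> & y,
                                     size_t nrows,
                                     const float * bias,
                                     const float * residual) {
    const size_t ncols = y.size();
    assert(mat.size() == nrows * ncols);
    std::vector<float> dst(nrows);
    for (size_t r = 0; r < nrows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < ncols; ++c) {
            acc += mat[r * ncols + c] * y[c];
        }
        if (bias)     acc += bias[r];      // first ADD
        if (residual) acc += residual[r];  // second ADD, applied after bias
        dst[r] = acc;                      // single writeback per row
    }
    return dst;
}
```

The point of the sketch is that each output element is written once, after both additions, instead of being written by the matmul and then re-read and re-written by two separate ADD passes.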
Measured perf (RTX 5090 + CUDA Toolkit 12.8 + chatterbox
text-to-speech, Turbo Q4_0, 232-token prompt):
* Total GPU time per utterance: -12 %
* MUL_MAT_VEC q4_0 bucket: -47 ms / utterance (residual ADDs
folded into matmul-vec writeback)
* CUDA <-> Vulkan gap: 1.29x -> 1.13x on long prompts
Test coverage:
* test_mul_mat_vec_fusion already exercises the
((mm * y) + bias) + residual graph pattern via build_graph
when with_bias=true. 100 new test cases added in the previous
commit run that pattern at decoder-realistic shapes
(m=1, n>=1024, k>=1024).
* 400/400 MUL_MAT_VEC_FUSION cases pass on CUDA.
* 12184/12184 cases pass across the full CUDA test-backend-ops
suite (zero regressions).
Bug found and fixed during upstream-prep testing: the original
prototype omitted the (sample_dst, channel_bias) offset for
x_residual in mmvf.cu (mmvq.cu had it) — fine for the
chatterbox-internal use case where ne[2]==ne[3]==1, but caught
immediately by test_mul_mat_vec_fusion's batch_dims=[4,2]
cases. Fixed by mirroring the existing x_bias offset path.
Made-with: Cursor
Mirrors ggml-vulkan's GGML_VK_PERF_LOGGER=1. When set, prints
aggregate per-op GPU time + dispatch count after every
ggml_backend_cuda_graph_compute() call. Output format
intentionally matches ggml-vulkan's so existing cross-backend
grep/awk one-liners work for both backends:
----------------
CUDA Timings:
MUL_MAT q4_0 m=3072 n=383 k=1024: 24 x 241.979 us = 5807.507 us
...
Total time: 22480.220 us.
Implementation:
* New ggml_cuda_perf_logger class (Meyers singleton, RAII scope
helper, cudaEvent_t pool with on-demand growth, aggregation map,
sorted print). Per-op scope guard added in the dispatch loop in
ggml_cuda_graph_evaluate_and_capture. flush_and_print hook
added at the end of ggml_backend_cuda_graph_compute.
* common.cuh: ggml_cuda_graph::is_enabled() extended to disable
CUDA Graphs when GGML_CUDA_PERF_LOGGER=1. Graph capture would
either hide individual-op timings inside cudaGraphLaunch or
re-record over still-pending events on subsequent launches.
Off by default: zero overhead in normal builds. Only the
function-local-static getenv check runs in the hot path when the
env var is unset.
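The "function-local-static getenv check" can be sketched as follows. This is an illustrative C++ fragment, not the code from the PR: the function name is hypothetical, and treating any set value as "enabled" is an assumption (the source only shows the variable being set to 1).

```cpp
#include <cstdlib>

// Hypothetical gate: getenv runs once, on the first call; later calls only
// read the cached bool. Assumption: any value of the variable enables logging.
static bool perf_logger_enabled() {
    static const bool enabled = std::getenv("GGML_CUDA_PERF_LOGGER") != nullptr;
    return enabled;
}
```

Because the result is cached, changing the environment after the first call has no effect, which matches a diagnostic that is chosen at process start.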
Useful for cross-backend perf characterisation without needing
nsys (heavyweight, NVIDIA-only, sometimes needs root for hardware
counters). Same diagnostic value as Vulkan's logger; the
identical output format means existing FINDINGS.md-style
"which op is the bottleneck on backend X?" tables work for both.
Notes on lifetime:
* The logger is a Meyers-singleton; its destructor runs at static
destruction time (after main() returns and possibly after
libcudart's own statics tear down). The destructor flushes any
pending data but DOES NOT call cudaEventDestroy — that can crash
on a torn-down driver. Letting the OS reclaim the events is
safe: this is opt-in via env var, and the event pool is bounded.
* If a long-running daemon ever wants to reset the logger mid-run,
add an explicit reset() that's called while CUDA is still alive.
Test coverage:
* 12084/12084 cases pass across the full CUDA test-backend-ops
suite with the env var unset (normal path: zero overhead).
* Manual smoke: GGML_CUDA_PERF_LOGGER=1 `test-backend-ops test
-o ADD -b CUDA0` produces well-formed 'CUDA Timings:' blocks
with per-op timings and 'Total time:' summary, format matches
ggml-vulkan's vk_perf_logger.
* Functional / integration testing of the env var path is in the
chatterbox.cpp tree (scripts/test-cuda-perf-logger.sh, 4 phases
covering default/env-on/env+graphs/aggregate-bound). Not added
here because the env var doesn't change op output, only stderr.
Made-with: Cursor
Author
Filed companion issue #1466 documenting the Blackwell flash-attn config gap that I mentioned in this PR's last section. The |
Zbig9000 added a commit to Zbig9000/chatterbox.cpp that referenced this pull request on Apr 27, 2026
The 3-op MUL_MAT_VEC + ADD(bias) + ADD(residual) fusion's mmvf.cu
kernel template was missing the (sample_dst, channel_bias) offset for
x_residual that x_bias has. Latent for chatterbox where
ne[2]==ne[3]==1 makes the offset zero, but exposed by upstream
test_mul_mat_vec_fusion(batch_dims=[4,2]) when porting the patch to
ggml-org/ggml. Fix mirrors the existing x_bias offset path. Same fix
applied upstream in ggml-org/ggml#1465.
Made-with: Cursor
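Why the missing offset stayed hidden when ne[2]==ne[3]==1 can be shown with a small host-side C++ sketch. The flattened layout, the function name, and the `apply_offset` switch are all assumptions for illustration; the real kernel advances a pointer per (sample, channel) rather than taking a flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened layout: dst[row + channel*ne0 + sample*ne0*ne2].
// apply_offset = false mimics the behaviour described above: the residual
// pointer is never advanced, so every (channel, sample) slice reads
// residual[0 .. ne0).
void add_residual(std::vector<float> & dst, const std::vector<float> & residual,
                  size_t ne0, size_t ne2, size_t ne3, bool apply_offset) {
    assert(dst.size() == ne0 * ne2 * ne3 && residual.size() == dst.size());
    for (size_t sample = 0; sample < ne3; ++sample) {
        for (size_t channel = 0; channel < ne2; ++channel) {
            const size_t slice = (sample * ne2 + channel) * ne0;
            const float * r = residual.data() + (apply_offset ? slice : 0);
            for (size_t row = 0; row < ne0; ++row) {
                dst[slice + row] += r[row];
            }
        }
    }
}
```

With ne2 = ne3 = 1 there is only one slice and its offset is zero, so both variants agree; with batch dimensions of 4 and 2 the variant without the offset adds the first slice of the residual everywhere.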
ggml-cuda: three independent perf / diagnostic improvements
This PR bundles three logically-distinct ggml-cuda changes that came
out of profiling / optimising chatterbox.cpp text-to-speech on RTX
5090 (Blackwell, sm_120, CUDA Toolkit 12.8). Each one is a separate
commit so it can be cherry-picked / squashed individually if you
prefer to land them one at a time.
* Commit 1: test: add HiFT-realistic shapes to test_conv_transpose_1d
* Commit 2: ggml-cuda: warp-cooperative conv_transpose_1d kernel
* Commit 3: test: extend test_mul_mat_vec_fusion with
decoder-realistic shapes (m=1, n>=1024, k>=1024)
* Commit 4: ggml-cuda: fuse MUL_MAT_VEC + ADD(bias) + ADD(residual)
* Commit 5: ggml-cuda: add GGML_CUDA_PERF_LOGGER env var (mirrors
GGML_VK_PERF_LOGGER)
Total: +571 / -43 lines across src/ggml-cuda/ and
tests/test-backend-ops.cpp.
1. Warp-cooperative conv_transpose_1d kernel (commits 1, 2)
Problem
Current scalar kernel: one CUDA thread per output pixel, scanning all
IC × IL inputs with a per-iteration skip branch. At
HiFT-vocoder-realistic shapes (L=303, IC=80, K=16, s0=8) this is
~24 K loop iterations per output pixel with the skip branch firing on
~99 % of iterations. HiFT decode spends 67 % of total GPU time in
this kernel (4 700 µs / call) on RTX 5090.
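The iteration count and skip rate quoted above can be reproduced with a few lines of host code. This is a minimal C++ sketch, assuming that L in the shape list denotes the input length IL; the struct and function names are hypothetical.

```cpp
#include <cassert>

struct ScanStats {
    long long iterations;  // (ic, i) pairs visited by the scan loop
    long long hits;        // pairs that pass the skip branch
};

// Counts, for a single output position ol, how many iterations the scan form
// performs and how many of them actually contribute to the output.
ScanStats scan_stats(int IL, int IC, int K, int s0, int ol) {
    ScanStats st{0, 0};
    for (int ic = 0; ic < IC; ++ic) {
        for (int i = 0; i < IL; ++i) {
            ++st.iterations;
            const int k = ol - i * s0;
            if (k >= 0 && k < K) ++st.hits;
        }
    }
    return st;
}
```

For IL=303, IC=80, K=16, s0=8 and an interior position such as ol=1000, this gives 24240 iterations of which 160 contribute (2 values of i for each of the 80 channels), so about 99.3 % of iterations take the skip branch.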
Solution
* One warp (32 threads) cooperatively computes one output pixel.
* Input position range narrowed to
i ∈ [⌈(ol - K + 1)/s0⌉, ⌊ol/s0⌋] ∩ [0, IL-1]. At K=16, s0=8 this
is 2 iterations of i instead of IL=O(100). Skip branch eliminated
entirely.
* IC reduction parallelised across warp lanes; partial sums reduced
via 5 stages of __shfl_xor_sync.
Perf (RTX 5090, CUDA 12.8)
* conv_transpose_1d kernel time: 4 700 µs → 110 µs per call (~42×)
Tests
Existing 116 test_conv_transpose_1d cases continue to pass. Commit
1 of this PR adds 5 HiFT-realistic shapes that exercise the
warp-cooperative reduction at scale (IC > 32 multi-warp accumulation,
K > s0 inner-loop unroll). These pass on both the legacy scalar
kernel and the new warp-cooperative kernel: they're a regression
guard for future changes, not a "fail before, pass after" pattern.
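The IC > 32 case and the 5-stage XOR butterfly can be simulated on the host. The C++ sketch below emulates one warp with a 32-element array; the lane-strided channel assignment and both function names are assumptions for illustration, and the inner loop stands in for `v += __shfl_xor_sync(mask, v, offset)` on a real GPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int WARP_SIZE = 32;

// Assumed work split: lane l accumulates channels l, l+32, l+64, ...
// per_channel[c] is the contribution of input channel c to one output pixel.
std::array<float, WARP_SIZE> lane_partials(const std::vector<float> & per_channel) {
    std::array<float, WARP_SIZE> partial{};
    for (size_t c = 0; c < per_channel.size(); ++c) {
        partial[c % WARP_SIZE] += per_channel[c];
    }
    return partial;
}

// Host emulation of the butterfly: offsets 16, 8, 4, 2, 1 (5 stages).
// After the last stage every lane holds the full sum.
std::array<float, WARP_SIZE> butterfly_reduce(std::array<float, WARP_SIZE> v) {
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        std::array<float, WARP_SIZE> next = v;
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            next[lane] = v[lane] + v[lane ^ offset];  // exchange with partner lane
        }
        v = next;
    }
    return v;
}
```

With IC=80 the lanes hold 3, 3 or 2 channels each before the butterfly runs, which is the path the new IC > 32 test shapes are meant to cover. The tree-shaped summation is also why results can differ from the sequential kernel in the last bits.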
test-backend-ops -o CONV_TRANSPOSE_1D -b CUDA0:
* baseline: 116/116 pass
* …after commit 1 (test-only): 121/121 pass
* …after commit 2 (kernel rewrite): 121/121 pass
Files: src/ggml-cuda/conv-transpose-1d.cu,
src/ggml-cuda/conv-transpose-1d.cuh, tests/test-backend-ops.cpp.
2. MUL_MAT_VEC + ADD(bias) + ADD(residual) 3-op fusion (commits 3, 4)
Problem
ggml-cuda's fusion engine knows the 2-op pattern
MUL_MAT_VEC + ADD(bias) but not the 3-op pattern that includes the
residual. The graph shape ((mat * y) + bias) + residual is common in
transformer attention-output and FFN-output blocks; ggml-vulkan
already fuses it via the MUL_MAT_ADD_ADD shader. Profiling chatterbox
vs Vulkan showed CUDA paid ~67 ms / utterance more in stand-alone ADD
launches.
Solution
* Extend ggml_cuda_mm_fusion_args_* (in common.cuh) with an
x_residual field. When set together with x_bias, the matmul-vec
kernel performs dst = mat * y + bias + residual in a single
dispatch.
* Detect the {MUL_MAT, ADD, ADD} pattern in
ggml_cuda_graph_evaluate_and_capture (placed above the existing
2-op fusion so the greedy match prefers the larger). Resolve
bias_tensor / residual_tensor from the cgraph; reject patterns with
broadcasting on either ADD (matches the 2-op fusion's existing
constraint).
* Extend the mul_mat_vec_q (mmvq.cu) and mul_mat_vec_f (mmvf.cu)
templates to consume x_residual, added after bias and any GLU,
matching ggml-vulkan's MUL_MAT_ADD_ADD execution order.
Only MUL_MAT (not MUL_MAT_ID) is handled: the residual ADD pattern
doesn't apply to MoE expert routing. The new code path falls back
gracefully to plain dispatch when the fusion isn't applicable.
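The "larger pattern first" ordering can be illustrated with a toy matcher. This C++ sketch looks only at op types in a linear node list; the `Op` enum and both function names are hypothetical, and the real detection additionally checks data dependencies, tensor shapes and the no-broadcast constraint.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { MUL_MAT, ADD, OTHER };

// True if `pattern` occurs in `graph` starting at node i.
static bool matches_at(const std::vector<Op> & graph, size_t i,
                       const std::vector<Op> & pattern) {
    if (i + pattern.size() > graph.size()) return false;
    for (size_t j = 0; j < pattern.size(); ++j) {
        if (graph[i + j] != pattern[j]) return false;
    }
    return true;
}

// Returns how many graph nodes each dispatch consumes. The 3-op pattern is
// tried before the 2-op pattern so the greedy match prefers the larger fusion.
std::vector<int> plan_dispatches(const std::vector<Op> & graph) {
    std::vector<int> sizes;
    size_t i = 0;
    while (i < graph.size()) {
        int n = 1;  // fallback: plain single-op dispatch
        if (matches_at(graph, i, {Op::MUL_MAT, Op::ADD, Op::ADD})) {
            n = 3;
        } else if (matches_at(graph, i, {Op::MUL_MAT, Op::ADD})) {
            n = 2;
        }
        sizes.push_back(n);
        i += n;
    }
    return sizes;
}
```

If the 2-op check came first, a MUL_MAT, ADD, ADD run would be split into a fused pair plus a stand-alone ADD, which is exactly the launch the fusion is meant to remove.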
Perf (RTX 5090, CUDA 12.8, chatterbox Turbo Q4_0, 232-token prompt)
* MUL_MAT_VEC q4_0 op-bucket time: −47 ms / utterance
Tests
test_mul_mat_vec_fusion::build_graph already exercises the
((mm * y) + bias) + residual pattern when with_bias=true. Commit 3
adds 100 new cases at decoder-realistic shapes
(m=1, n ∈ {1024, 3072}, k=1024) that catch FP-precision and per-row
stride issues that don't surface at the existing m=1, n=32, k=256
cases.
test-backend-ops -o MUL_MAT_VEC_FUSION -b CUDA0:
* baseline
* …after commit 3 (test-only): 400/400 pass
* …after commit 4 (fusion patch, first attempt): batch_dims=[4,2]
cases fail; x_residual missing (sample_dst, channel_bias) offset in
mmvf.cu
* …after fix: 400/400 pass
The bug was latent in the original prototype patch (developed against
chatterbox where ne[2]==ne[3]==1 makes the offset 0). Upstream's
broader test coverage caught it immediately when stacked against
batch_dims=[4,2], a good demonstration of why the test-before-change
discipline matters. Fixed in the same commit (mmvf.cu now applies the
offset symmetric to x_bias).
Files: src/ggml-cuda/common.cuh, src/ggml-cuda/ggml-cuda.cu,
src/ggml-cuda/mmvf.cu, src/ggml-cuda/mmvq.cu,
tests/test-backend-ops.cpp.
3. GGML_CUDA_PERF_LOGGER env var (commit 5)
Problem
ggml-vulkan ships GGML_VK_PERF_LOGGER=1 and prints a structured
table after every compute graph; ggml-cuda has no equivalent.
Cross-backend perf characterisation today requires nsys (heavy,
NVIDIA-only, sometimes needs root for hardware counters) or one-off
manual instrumentation.
Solution
A symmetric env var that prints the same output format as
vk_perf_logger. Same prefix structure (----------------,
<Backend> Timings:, per-op rows, Total time: summary), same number
formatting, so existing cross-backend grep / awk one-liners work
unchanged.
Implementation notes
* Meyers-singleton ggml_cuda_perf_logger, off by default. Only the
function-local-static getenv check runs in the hot path when the
env var is unset.
* RAII scope helper records cudaEventRecord(start) on construction
and cudaEventRecord(end) on destruction: one scope per dispatched
op, including before any continue taken by the fusion fast-paths.
* cudaStreamSynchronize + flush_and_print is called at the end of
every ggml_backend_cuda_graph_compute so elapsed times are readable
before the events are re-used.
* CUDA Graphs are disabled when GGML_CUDA_PERF_LOGGER=1 (under graph
capture, per-op timings would either be hidden inside
cudaGraphLaunch or events would be re-recorded while still
pending).
* The destructor does not call cudaEventDestroy: the singleton's dtor
runs at static destruction time (after main()), where the CUDA
driver may already be torn down. OS reclaim is safe; pool is
bounded.
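The aggregation and printing step can be sketched without any CUDA calls. The C++ class below is illustrative only: the class and member names are hypothetical, sorting by key is an assumption (the source says "sorted print" without naming the sort key), and the elapsed times are passed in directly instead of being read from cudaEvent_t pairs.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical aggregation map: one entry per op description.
struct OpStats {
    int    count    = 0;
    double total_us = 0.0;
};

class PerfAggregator {
public:
    // In the real logger the elapsed time would come from a start/end event pair.
    void record(const std::string & key, double elapsed_us) {
        OpStats & s = stats_[key];
        s.count    += 1;
        s.total_us += elapsed_us;
    }

    // Renders "<key>: <count> x <avg> us = <total> us" rows between the
    // separator/header and the "Total time:" footer.
    std::string format() const {
        std::string out = "----------------\nCUDA Timings:\n";
        double total = 0.0;
        char buf[256];
        for (const auto & kv : stats_) {  // std::map iterates in key order
            const OpStats & s = kv.second;
            std::snprintf(buf, sizeof(buf), "%s: %d x %.3f us = %.3f us\n",
                          kv.first.c_str(), s.count, s.total_us / s.count, s.total_us);
            out   += buf;
            total += s.total_us;
        }
        std::snprintf(buf, sizeof(buf), "Total time: %.3f us.\n", total);
        out += buf;
        return out;
    }

private:
    std::map<std::string, OpStats> stats_;
};
```

Keying the map on the full op description (op name, type and shape) is what lets repeated dispatches of the same shape collapse into a single "count x average" row.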
Tests
The env var is a stderr-only diagnostic that doesn't change op
output, so there's no natural insertion point in test-backend-ops
for testing it. Manual smoke verifies
GGML_CUDA_PERF_LOGGER=1 test-backend-ops test -o ADD -b CUDA0
produces well-formed CUDA Timings: blocks (99 blocks across the ADD
test sweep, matching vk_perf_logger's format byte-for-byte).
Functional / integration testing of the env var path lives downstream
in chatterbox.cpp's scripts/test-cuda-perf-logger.sh (4 phases:
default-silent, env-on-output-format, env+graphs-mutual-exclusion,
aggregate-time-bound); happy to port a stripped-down version if
you'd prefer a unit test in this PR.
Files: src/ggml-cuda/ggml-cuda.cu, src/ggml-cuda/common.cuh.
Combined regression: 12189/12189 PASS
The three changes are independent (different files / different code
paths), but they're stacked on the same branch in this PR. Final
sanity check on the combined pr-cuda-bundle: 12189/12189 pass. That's
12084 (baseline @ 8be60f8) + 5 (HiFT cases) + 100 (decoder-shape
cases) = 12189, no regressions. Bench logs (per-PR + combined) are
available on request.
Reproduction
Splitting if preferred
If you'd rather review / land these one at a time, the 5 commits map
onto three separable PRs:
* commits 1, 2: pr1-conv-transpose-1d-cuda-warp-cooperative
* commits 3, 4: pr2-mul-mat-vec-add-add-fusion-cuda
* commit 5: pr3-cuda-perf-logger
The corresponding individual PR descriptions (with a bit more detail
per piece) are at these branches in the same fork, ready to push.
Happy to convert this PR into three separate ones if reviewers
prefer.
Companion findings (not in this PR)
While preparing this work, profiling on RTX 5090 (Blackwell, sm_120)
revealed that ggml_cuda_fattn_mma_get_config has no Blackwell entry:
ampere_mma_available(cc) returns true for any cc >= 800, so sm_120
silently uses the Ampere config tuned for sm_80. This is the single
biggest remaining FLASH_ATTN_EXT perf gap to ggml-vulkan on
chatterbox-style workloads (~67 ms / utterance). Tuning a Blackwell
config empirically requires either ncu hardware counters or a
multi-day parameter sweep on Blackwell hardware, best done by
maintainers with multi-Blackwell access. Happy to file a separate
issue / discussion thread for this if there's interest.
The new GGML_CUDA_PERF_LOGGER shipped in this PR makes such an A/B
sweep convenient (no rebuild required to switch variants when
combined with a small follow-up env var to override the picker).