Chatterbox optimize cpp backend multilingual model for cuda #2
Closed
Zbig9000 wants to merge 9 commits into
The 3-op MUL_MAT_VEC + ADD(bias) + ADD(residual) fusion's mmvf.cu kernel template was missing the (sample_dst, channel_bias) offset for x_residual that x_bias has. Latent for chatterbox where ne[2]==ne[3]==1 makes the offset zero, but exposed by upstream test_mul_mat_vec_fusion(batch_dims=[4,2]) when porting the patch to ggml-org/ggml. Fix mirrors the existing x_bias offset path. Same fix applied upstream in ggml-org/ggml#1465. Made-with: Cursor
Zbig9000
pushed a commit
to Zbig9000/chatterbox.cpp
that referenced
this pull request
Apr 28, 2026
feat: expose tts-cpp as a library and make it consumable via vcpkg
Zbig9000
added a commit
to Zbig9000/chatterbox.cpp
that referenced
this pull request
May 5, 2026
…d 4) PROGRESS.md §3.35: T3 step-graph cache (multilingual CFG token decode) opt-in via CHATTERBOX_T3_STEP_CACHE. Per-(n_past, is_uncond) std::list-LRU cache (cap 256) for build_step_graph_mtl; saves ~3 ms per cache hit. Single-utterance default-OFF (no hits-to-amortise on synth #1) keeps the existing path regression-free; server-mode opt-in shows ~15 % per-pass speedup (~256 ms / synth #2 of multilingual at 136 tokens). Tests: src/test_t3_caches.cpp NEW with 99 checks (lifecycle + bit-exact cold/warm logits + multi-synth amortisation timing). Lifecycle wired into free_t3 (CLI, both paths), Impl::free_model (Engine), and an atexit fallback, all firing BEFORE ggml_backend_free. Total cache test suite green: 80 + 99 + 6 + 99 = 284 / 284.
Zbig9000
added a commit
to Zbig9000/chatterbox.cpp
that referenced
this pull request
May 6, 2026
…d 4) PROGRESS.md §3.35: T3 step-graph cache (multilingual CFG token decode) opt-in via CHATTERBOX_T3_STEP_CACHE. Per-(n_past, is_uncond) std::list-LRU cache (cap 256) for build_step_graph_mtl; saves ~3 ms per cache hit. Single-utterance default-OFF (no hits-to-amortise on synth #1) keeps the existing path regression-free; server-mode opt-in shows ~15 % per-pass speedup (~256 ms / synth #2 of multilingual at 136 tokens). Tests: src/test_t3_caches.cpp NEW with 99 checks (lifecycle + bit-exact cold/warm logits + multi-synth amortisation timing). Lifecycle wired into free_t3 (CLI, both paths), Impl::free_model (Engine), and an atexit fallback, all firing BEFORE ggml_backend_free. Total cache test suite green: 80 + 99 + 6 + 99 = 284 / 284.
Owner
Closing as we won't target CUDA for now.
# PR: QVAC-17873 [TTS GGML] Optimize cpp backend multilingual model for CUDA

- Source: `Zbig9000/chatterbox.cpp:chatterbox-Optimize-cpp-backend-multilingual-model-for-CUDA`
- Target: `GustavoA1604/chatterbox.cpp:main` (flows into `multilingual` on the next merge-down, the same way the Metal patches did and the QVAC-17872 Vulkan patch does)
- Companion upstream PR: `ggml-org/ggml#1465`, open and mergeable, 12189/12189 `test-backend-ops -b CUDA0` PASS
- Companion upstream issue: `ggml-org/ggml#1466`, Blackwell flash-attn config gap, the documented next step
- Compare: `main…chatterbox-Optimize-cpp-backend-multilingual-model-for-CUDA`

## What problem does this PR solve?
Five related ggml-cuda issues that show up on chatterbox-style workloads, plus the test/diagnostic infrastructure to keep them fixed. Sibling work to the QVAC-17872 Vulkan PR; together they take CUDA from 1.40× slower than Vulkan (pre-patch) to 1.13× slower on long prompts (post-patch) on the same RTX 5090.
### (a) Slow `conv_transpose_1d` kernel: HiFT vocoder bottleneck

ggml-cuda's `conv_transpose_1d` ships a textbook scalar kernel: one CUDA thread per output pixel scanning the full `IC × IL` input grid, with a per-iteration skip conditional that passes on only ~`K/s0` of the iterations. For HiFT shapes (L=303, IC=80, K=16, s0=8) this is 67 % of total GPU time in the entire S3Gen graph (135.98 ms across 4 calls on RTX 5090; the single biggest call alone runs 101 ms). HiFT total = 144 ms vs Vulkan's 34 ms on the same hardware, the single biggest backend-vs-backend perf gap in any stage of chatterbox.
### (b) `ggml_backend_cuda_graph_compute` warmup never lands for autoregressive decode

The graph cache is gated by a 2-call warmup that requires every property of every node to be byte-identical. That is the right default for llama.cpp but the wrong default for chatterbox T3, which builds a fresh-but-topologically-identical cgraph per token with growing K/V views. K's `ne[1]` grows by 1 per token and view offsets shift, so `warmup_complete` keeps resetting and the captured graph is never used. The ~90 ms gap between T3 GPU time (~70 ms) and T3 wall (163 ms) on RTX 5090 lives here.
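The effect of the warmup gate on a growing-KV decode loop can be illustrated with a toy model. This is a minimal sketch, not the ggml-cuda implementation: `NodeProps`, `WarmupGate` and `count_graph_replays` are hypothetical names, and the gate is simplified to "properties matched the previous call twice in a row".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-node properties a graph cache compares.
struct NodeProps {
    std::int64_t ne1;         // e.g. K's ne[1], which grows by 1 per decoded token
    std::size_t  view_offset; // byte offset of the K/V view
    bool operator==(const NodeProps& o) const {
        return ne1 == o.ne1 && view_offset == o.view_offset;
    }
};

// Toy warmup gate: the captured graph is replayed only after the properties
// have matched the previous call twice in a row, unless `force` is set.
struct WarmupGate {
    bool      force       = false;
    bool      have_last   = false;
    int       stable_runs = 0;
    NodeProps last{};

    bool use_captured_graph(const NodeProps& p) {
        if (force) return true; // opt-in path: bypass the gate entirely
        stable_runs = (have_last && p == last) ? stable_runs + 1 : 0;
        last = p;
        have_last = true;
        return stable_runs >= 2;
    }
};

// Counts how many of n_tokens decode steps would replay the captured graph.
int count_graph_replays(bool force, bool growing_kv, int n_tokens) {
    WarmupGate gate;
    gate.force = force;
    int replays = 0;
    for (int t = 0; t < n_tokens; ++t) {
        const std::int64_t ne1 = growing_kv ? 1 + t : 1;
        const std::size_t  off = growing_kv ? static_cast<std::size_t>(t) * 64 : 0;
        if (gate.use_captured_graph(NodeProps{ne1, off})) ++replays;
    }
    return replays;
}
```

In this model a fixed-shape workload starts replaying after the warmup calls, while a growing-KV workload never does unless the gate is bypassed.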
### (c) No cross-backend per-op timing logger on CUDA

ggml-vulkan ships `GGML_VK_PERF_LOGGER=1` and we already have parsing scripts for its output baked into FINDINGS.md. ggml-cuda has no equivalent: characterising backend-level perf required `nsys` (heavyweight, NVIDIA-only, sometimes needs root for hardware counters) or one-off manual instrumentation in a debug build.
### (d) `MUL_MAT_VEC + ADD + ADD` 3-op fusion missing

ggml-vulkan already fuses `MUL_MAT_VEC + ADD(bias) + ADD(residual)` via the `MUL_MAT_ADD_ADD` shader; ggml-cuda only fuses the 2-op pattern, so the residual ADD runs as a separate launch. At chatterbox shapes (24 layers × 2 ADDs/layer per token) this is a measurable ~67 ms / utterance difference, the top remaining gap to ggml-vulkan after (a) lands.
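What the 3-op pattern computes can be shown with a CPU reference. The sketch below is illustrative only: the function names are hypothetical, the matrix is plain row-major `float`, and nothing here reflects the actual CUDA or Vulkan kernels. It only shows that folding bias and residual into the matvec writeback gives the same result as three separate passes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused reference: matvec, then a bias pass, then a residual pass.
// W is row-major with rows == bias.size() and cols == x.size().
std::vector<float> matvec_add_add_unfused(const std::vector<float>& W,
                                          const std::vector<float>& x,
                                          const std::vector<float>& bias,
                                          const std::vector<float>& residual) {
    const std::size_t cols = x.size(), rows = bias.size();
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
    for (std::size_t r = 0; r < rows; ++r) y[r] += bias[r];     // 2nd pass
    for (std::size_t r = 0; r < rows; ++r) y[r] += residual[r]; // 3rd pass
    return y;
}

// Fused variant: bias first, then residual, applied at writeback time.
std::vector<float> matvec_add_add_fused(const std::vector<float>& W,
                                        const std::vector<float>& x,
                                        const std::vector<float>& bias,
                                        const std::vector<float>& residual) {
    const std::size_t cols = x.size(), rows = bias.size();
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) acc += W[r * cols + c] * x[c];
        y[r] = (acc + bias[r]) + residual[r];
    }
    return y;
}
```

On a GPU each of the extra passes in the unfused form corresponds to a separate kernel launch, which is where the per-token cost described above comes from.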
### (e) FlashAttention picker has no per-shape diagnostic override

The `MMA_F16` variant ggml-cuda picks for chatterbox prompt-phase attention is ~2× slower than ggml-vulkan's flash-attn shader on the same shape. To rule this out as a picker-choice issue (vs a kernel-quality issue) we needed a safe way to A/B the four variants (`tile`, `mma`, `wmma`, `vec`) without rebuilding, including arch / shape fall-back so the dispatcher doesn't `ABORT` when an unsupported variant is forced.
## How does it solve it?
All five fixes ship as a single vendored patch (`patches/ggml-cuda-chatterbox-ops.patch`, 1 046 lines) applied on top of the same pinned `ggml@58c38058` the Vulkan / Metal patches target.
### Fix for (a): warp-cooperative `conv_transpose_1d` kernel

Modelled on the Metal-patch design (one threadgroup per output pixel + simdgroup reduction across input channels), translated to CUDA primitives:

- Grid `(OL, OC, 1)` × block `(32, 1, 1)`: one CUDA warp per output pixel. Block-size constant drops 256 → 32.
- `i_start = ⌈(ol − K + 1) / s0⌉` and `i_end = ⌊ol / s0⌋` are computed analytically; the skip conditional is eliminated entirely. The inner `i` loop iterates over at most `K/s0 + 1 = 3` positions instead of `IL = 100+`.
- `IC` reduction across the warp: each lane handles a strided slice (`ic = tid, tid+32, …`), and the partial sums are reduced across the warp with `__shfl_xor_sync(0xFFFFFFFFu, v, …)`. Thread 0 writes the output pixel.

~110 lines diff in `ggml/src/ggml-cuda/conv-transpose-1d.{cu,cuh}`, no API change.
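The analytic input range can be checked on the CPU against the scan-and-skip formulation. The following is a minimal single-channel sketch under stated assumptions (no padding, dilation 1, hypothetical function names); it does not model the warp reduction over `IC`, only the index arithmetic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Floor and ceiling division that stay correct for negative numerators
// (plain '/' on ints truncates toward zero).
static int floor_div(int a, int b) { return a >= 0 ? a / b : -((-a + b - 1) / b); }
static int ceil_div(int a, int b)  { return -floor_div(-a, b); }

// Scan every input position and skip those whose kernel tap is out of range.
std::vector<float> conv_transpose_1d_scan(const std::vector<float>& x,
                                          const std::vector<float>& w, int s0) {
    const int IL = static_cast<int>(x.size());
    const int K  = static_cast<int>(w.size());
    const int OL = (IL - 1) * s0 + K;
    std::vector<float> y(OL, 0.0f);
    for (int ol = 0; ol < OL; ++ol) {
        for (int i = 0; i < IL; ++i) {
            const int k = ol - i * s0;
            if (k < 0 || k >= K) continue; // the per-iteration skip conditional
            y[ol] += x[i] * w[k];
        }
    }
    return y;
}

// Visit only the inputs that can contribute: i in [i_start, i_end].
std::vector<float> conv_transpose_1d_tight(const std::vector<float>& x,
                                           const std::vector<float>& w, int s0) {
    const int IL = static_cast<int>(x.size());
    const int K  = static_cast<int>(w.size());
    const int OL = (IL - 1) * s0 + K;
    std::vector<float> y(OL, 0.0f);
    for (int ol = 0; ol < OL; ++ol) {
        const int i_start = std::max(0, ceil_div(ol - K + 1, s0));
        const int i_end   = std::min(IL - 1, floor_div(ol, s0));
        for (int i = i_start; i <= i_end; ++i) {
            y[ol] += x[i] * w[ol - i * s0];
        }
    }
    return y;
}
```

Both variants accumulate the contributing terms in the same ascending-`i` order, so the only difference is how many loop iterations are spent finding them.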
### Fix for (b): `GGML_CUDA_FORCE_GRAPHS=1` opt-in

Extends the early-exit branch in `ggml_backend_cuda_graph_compute` with an opt-in path that always uses the captured graph and relies on the existing `cudaGraphExecUpdate` (with re-instantiate-on-failure) wiring to absorb per-call data-pointer changes.

Default behaviour is unchanged when the env var is unset (every non-chatterbox consumer is byte-identical to today). ~25 line addition to `ggml/src/ggml-cuda/ggml-cuda.cu`.

### Fix for (c): `GGML_CUDA_PERF_LOGGER=1` opt-in

Mirrors `GGML_VK_PERF_LOGGER` byte-for-byte in output format so existing cross-backend grep / awk one-liners (`FINDINGS.md` / `FINDINGS_CUDA.md` reproduction recipes) work for both backends.

Implementation: ~280-line `ggml_cuda_perf_logger` Meyers-singleton class with RAII `scope` helper around per-op dispatches, `cudaEventRecord` pairs, aggregation by `(op, dtype, shape)` key, sorted print at the end of each `ggml_backend_cuda_graph_compute`. CUDA Graphs auto-disable when the env var is set (events would either re-record on subsequent launches or hide inside `cudaGraphLaunch`). Off by default; zero overhead in normal builds.

### Fix for (d): `MUL_MAT_VEC + ADD(bias) + ADD(residual)` fusion

Direct port of the ggml-vulkan `MUL_MAT_ADD_ADD` shader fusion:

- `x_residual` field on `ggml_cuda_mm_fusion_args_*` (same shape rules as `x_bias`; broadcasting rejected by the host-side detection logic).
- Pattern detection in `ggml_cuda_graph_evaluate_and_capture`, placed above the existing 2-op `{MUL_MAT, ADD}` fusion so the greedy match prefers the larger fusion when both apply.
- `mmvq.cu` / `mmvf.cu` kernel templates extended to fold the residual into the matmul-vec writeback after bias and any GLU, matching ggml-vulkan's execution order.

Only `MUL_MAT` (not `MUL_MAT_ID`) is handled: the residual ADD pattern doesn't appear in MoE expert routing in any model the author has seen. ~150 line addition across `common.cuh`, `mmvq.cu`, `mmvf.cu`, `ggml-cuda.cu`.

### Fix for (e): `GGML_CUDA_FATTN_KERNEL=tile|mma|wmma|vec` opt-in

Wraps the existing FlashAttention picker (`ggml_cuda_get_best_fattn_kernel`, renamed to `_default`) and applies the env var only when the default heuristic chose `MMA_F16` (the documented A/B target). Two safety gates apply before the override takes effect:

- Arch gate: `turing_mma_available` / `volta_mma_available` / `should_use_wmma_fattn` checks. WMMA on Blackwell falls back to default with a one-shot `GGML_LOG_WARN` instead of tripping the dispatcher's `GGML_ABORT` for "no compiled SASS".
- Shape gate: the VEC kernels only instantiate for `Q.ne[1] <= 2 && K.ne[1] % FATTN_KQ_STRIDE == 0`. Forcing VEC on chatterbox's prompt-phase or growing-KV step-decode shapes would otherwise trip `CUDA error: invalid configuration argument`. Falls back instead.
The empirical finding from the variant sweep on RTX 5090 + Turbo Q4_0: `default` and overrides to `mma` / `wmma` / `vec` are all bit-identical (only `tile` actually changes kernel choice, and it's 4 % slower than MMA). Conclusion: the picker is already optimal for chatterbox on Blackwell; the remaining 67 ms / utterance flash-attn gap is kernel-quality intrinsic to `MMA_F16`, not a picker-selection issue. Documented as upstream issue `ggml-org/ggml#1466` (Blackwell-tuned config table missing: the picker uses Ampere's sm_80 config on Blackwell because `ampere_mma_available(cc)` returns true for any `cc >= 800`).

~140 line addition to `ggml/src/ggml-cuda/fattn.cu`.

### One non-obvious design decision worth calling out
`ggml_cuda_perf_logger`'s destructor runs at static destruction time (after `main()`, possibly after libcudart's own statics tear down). The dtor flushes any pending data but does not call `cudaEventDestroy`, which can crash on a torn-down driver. Letting the OS reclaim the events is safe: the logger is opt-in via env var, the leaked memory is process-lifetime regardless, and the event pool is bounded. Same lifetime model as the Vulkan-pipeline-cache flush from the QVAC-17872 PR.
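The lifetime model can be sketched in plain C++ without the CUDA runtime. This is a hypothetical stand-in, not the `ggml_cuda_perf_logger` class: `PerfLogger`, its key strings and its output format are illustrative assumptions. It shows a Meyers singleton that aggregates timings by key, prints them slowest first, and whose destructor flushes but deliberately releases no driver-owned handles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

class PerfLogger {
public:
    // Meyers singleton: constructed on first use, destroyed during static
    // destruction after main() returns.
    static PerfLogger& instance() {
        static PerfLogger logger;
        return logger;
    }

    // Accumulates elapsed time under a key such as "MUL_MAT f16 [512,1]".
    void record(const std::string& key, double ms) { totals_[key] += ms; }

    // Returns (key, total ms) pairs, slowest first.
    std::vector<std::pair<std::string, double>> sorted() const {
        std::vector<std::pair<std::string, double>> rows(totals_.begin(), totals_.end());
        std::stable_sort(rows.begin(), rows.end(),
                         [](const std::pair<std::string, double>& a,
                            const std::pair<std::string, double>& b) {
                             return a.second > b.second;
                         });
        return rows;
    }

    // Prints and clears the pending data.
    void flush() {
        for (const auto& row : sorted()) {
            std::fprintf(stderr, "%s: %.3f ms\n", row.first.c_str(), row.second);
        }
        totals_.clear();
    }

    ~PerfLogger() {
        flush();
        // Deliberately no release of driver-owned handles here: at static
        // destruction time the driver may already be torn down, so the
        // handles are left for the OS to reclaim at process exit.
    }

    PerfLogger(const PerfLogger&) = delete;
    PerfLogger& operator=(const PerfLogger&) = delete;

private:
    PerfLogger() = default;
    std::map<std::string, double> totals_;
};
```

The trade-off is a bounded, process-lifetime leak in exchange for never calling into a library whose own statics may already be gone.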
## Build system changes
- `scripts/setup-ggml.sh`: now iterates over a `PATCHES=(…)` array (metal + cuda), stacks them on the same pinned commit `58c38058`, and remains idempotent. The idempotency check uses `git apply --reverse --check` (more discriminating than plain `--check`; survives manually-corrupted working trees).
- `patches/README.md`: refreshed to list both patches, document the five CUDA opt-ins, and reflect the 7 modified files under `src/ggml-cuda/` (was 2 in earlier rounds).
- `CMakeLists.txt`: adds the new `test-cuda-ops` target, a kernel-level CPU-vs-CUDA correctness test in the style of Apple's `test-metal-ops`.
- `ggml/` is not checked in (gitignored); `setup-ggml.sh` applies both patches to a pristine clone, same model as the Metal patch.
## Risk assessment
- … -58 dBFS / SNR 58.5 dB: the same kind of FP-reduction-order variance that Metal's `simd_sum` kernel introduces. Below perceptual tolerance.
- `mmvf.cu`'s `x_residual` field was missing the `(sample_dst, channel_bias)` offset that `x_bias` has. Latent for chatterbox (`ne[2]==ne[3]==1` makes the offset zero) but exposed immediately by upstream's `test_mul_mat_vec_fusion(batch_dims=[4,2])` test. Fixed in commit `bd37318`, a back-port from `ggml-org/ggml#1465`.
- … (`__shfl_xor_sync`, `__restrict__`, `cudaEvent_t`, `cudaGraphExecUpdate`) have been in CUDA since Compute Capability 3.0 (Kepler, 2012) at oldest, sm_70+ for the rest.
- … sm_120 SASS eliminates a 27 s cold-start PTX-JIT compile. CUDA Toolkit 12.0 still works at runtime via driver JIT but is a regression on first-launch latency.
- … `cudaEventCreate`, `cudaGraphExecUpdate`, the existing fusion-engine paths are all already single-threaded per `ggml_backend_cuda_context`; no new sync introduced.
- … `GGML_CUDA_FORCE_GRAPHS`, `GGML_CUDA_PERF_LOGGER`, `GGML_CUDA_FATTN_KERNEL` are all read once on first use via function-local statics. Unset = byte-identical to today.
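The `x_residual` offset issue from the list above is easy to reproduce in a toy CPU model. The sketch below is not the `mmvf.cu` code: `add_bias_residual` and its `offset_residual` switch are hypothetical, and the layout is a plain contiguous `[samples][channels][rows]` array. It shows why the missing offset is invisible with a single sample and channel but wrong with batch dims such as `[4,2]`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dst[s][c][r] = dot[s][c][r] + bias[s][c][r] + residual[s][c][r].
// With offset_residual == false the residual pointer is never advanced per
// (sample, channel), which reproduces the class of bug described above.
std::vector<float> add_bias_residual(const std::vector<float>& dot,
                                     const std::vector<float>& bias,
                                     const std::vector<float>& residual,
                                     int rows, int channels, int samples,
                                     bool offset_residual) {
    std::vector<float> dst(dot.size(), 0.0f);
    for (int s = 0; s < samples; ++s) {
        for (int c = 0; c < channels; ++c) {
            const std::size_t off =
                (static_cast<std::size_t>(s) * channels + c) * rows;
            const float* b   = bias.data() + off;
            const float* res = residual.data() + (offset_residual ? off : 0);
            for (int r = 0; r < rows; ++r) {
                dst[off + r] = dot[off + r] + b[r] + res[r];
            }
        }
    }
    return dst;
}
```

With `channels == samples == 1` the offset is always zero, so both settings agree; any larger batch makes the un-offset variant reuse the first slice of the residual for every block.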
## How was it tested?
End-to-end on Linux x86-64, Ryzen 9 9950X3D, RTX 5090 32 GB,
NVIDIA driver 590.48.01, CUDA Toolkit 12.8 (Blackwell sm_120
native SASS).
### Validation harness: 7 test artefacts, ~54 assertions, ~5 min total

All 7 PASS on the post-merge head of this branch (`bench-logs-cuda/regression-phase4-r9.log`).

### Performance: round 1 + round 2 + round 6 cumulative
5 fresh-process runs each, median of runs 2-5 (NVIDIA driver cache warm), Turbo Q4_0.

(Results table not recoverable in this copy; surviving row labels: `conv_transpose_1d_kernel` (HiFT), `[hift_total]`, `S3GEN_INFER_MS`, `FORCE_GRAPHS=1` T3.)

End-to-end audio output is bit-identical with vs without the `FORCE_GRAPHS` env var (`md5sum` matches across 50-run soak).

CUDA ↔ Vulkan gap on long-prompt utterance went from 1.40× pre-patch → 1.13× post-patch. The remaining gap is `FLASH_ATTN_EXT` kernel quality at chatterbox shapes, documented as the upstream issue `ggml-org/ggml#1466`, with the diagnostic infrastructure (`GGML_CUDA_PERF_LOGGER`, `GGML_CUDA_FATTN_KERNEL`) shipped here so multi-Blackwell maintainers can A/B candidate Blackwell configs without a rebuild.
### CUDA Toolkit 12.0 → 12.8 cold-start delta (validated 2026-04-27)

(Timing table not recoverable in this copy; the one surviving fragment mentions `~/.nv/ComputeCache`.)

12.0 still works at runtime; 12.8 is strongly recommended as the vcpkg `ggml` build dependency to eliminate the 27 s tax.

### Platforms not tested locally
- … (no Blackwell-specific code path); same caveat as the Vulkan PR's "Android Vulkan should show 15-25 % T3 win": needs a follow-up bench on at least one mobile RTX SKU before claiming universal applicability.
- … conv_transpose rewrite (the kernel is bandwidth-starved on Orin too) and from (b) FORCE_GRAPHS on autoregressive workloads. Other ops are at parity with the Ampere config.
### Out of scope (documented as follow-ups, not shipped here)

The companion `inputFilesForAI/qvac-17872-findings/FINDINGS_CUDA.md` captures the full investigation including:

- The Blackwell-tuned flash-attn config table is missing (`ggml_cuda_fattn_mma_get_config_blackwell` doesn't exist; sm_120 silently uses the Ampere sm_80 config). Filed as upstream issue `ggml-org/ggml#1466` with full code references and reproducer. The 67 ms / utterance gap to ggml-vulkan flash-attn lives here and requires either NVIDIA Nsight Compute hardware counters or a multi-Blackwell A/B sweep best done by upstream maintainers.
- … (speculative, large; not justified at chatterbox's current desktop perf level: RTF 0.09 on RTX 5090).
- … change, just bench data needed).
- … `test-stability.sh`): DONE in this PR.

The companion `bench-logs-cuda/` directory contains every raw log referenced above (`warm-run-*.log`, `cold-run-*.log`, `nsys-kernels-*.csv`, `perf-logger-sample.log`, `fattn-variants-bench.log`, `stability-soak50.log`, `diversity-test.log`, `regression-{baseline-r8,phase4-r9}.log`, `upstream-{pr1,pr2,pr3,bundle}-*.log`).

## Companion upstream work (`ggml-org/ggml`)

Three of the five fixes were also prepared as a single upstream PR to ggml-org so they benefit everyone, not just chatterbox:
- PR `#1465`: `ggml-cuda: warp-cooperative conv_transpose_1d, MUL_MAT_VEC + ADD + ADD fusion, GGML_CUDA_PERF_LOGGER env var`. 5 commits, +575 / -43, 12189/12189 `test-backend-ops -b CUDA0` PASS including 105 new test cases added by the PR itself. OPEN, MERGEABLE, awaiting maintainer review. The `mmvf.cu` `x_residual` offset bug fix that's commit `bd37318` here is also in this upstream PR (caught and fixed during the upstream test-before-change cycle).
- Issue `#1466`: `ggml-cuda: flash-attn MMA picker has no Blackwell (sm_120) entry — silently uses Ampere config`. Documents the remaining biggest gap with code references and reproducer.
`GGML_CUDA_FORCE_GRAPHS` and `GGML_CUDA_FATTN_KERNEL` are intentionally kept chatterbox-local for now: both are niche (growing-KV step decode and diagnostic A/B respectively) and upstream maintainers may reasonably ask for a different framing before they ship as default. They're available behind the env vars in this PR, and we can split off upstream PRs later if there's interest after `#1465` lands.