QVAC-17872 [TTS GGML] Optimize cpp backend multilingual model for Vulkan by Zbig9000 · Pull Request #8 · GustavoA1604/chatterbox.cpp

Zbig9000 · 2026-05-06T12:03:56Z

multilingual on Vulkan, RTX 5090, warm-state seg 2..6

Metric	upstream/main baseline	this PR (3 commits)	Δ
S3GEN_INFER	169.9 ms	140.8 ms	−29.1 ms (−17.1 %)
cfm_total	132.5 ms	118.7 ms	−13.8 ms (−10.4 %)
cfm_step0	24.1 ms	13.2 ms	−10.9 ms (−45.2 %)
Cold-start	~2.7 s	~250 ms	−2.4 s (−91 %)

Bit-exactness is preserved on multilingual: the locked MD5 invariants (single-shot c65d98f1…, 6-segment multi-synth 0b374c74…, Turbo single-shot 6219f433…) match byte-for-byte across 4 successive iterations of the test-first regression harness.

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm_total) is the actual GPU CFM compute — not host-cacheable. Closing that requires shader-side work (e.g. tensor-core engagement via cooperative_matrix2); listed as a deferred follow-up below.

What this PR does

Round 1 — `f6893b2` (squashed source port from the closed PR #1)

Component	What it is	Multilingual benefit
`patches/ggml-vulkan-pipeline-cache.patch` (199 lines, NEW)	Persistent `VkPipelineCache` keyed by `<vendorID>-<deviceID>-<driverVersion>`, opt-in via `GGML_VK_PIPELINE_CACHE_DIR`.	Cold-start ~2.7 s → ~250 ms. Mesa / Adreno / Mali biggest target (no driver-side binary cache).
`patches/ggml-vulkan-eager-cache-save.patch` (104 lines, NEW)	Crash-safe pipeline-cache flush. Skips disk write when cache size unchanged.	Stacks on the first patch; perf-neutral on warm runs.
`g_cfm_estimator_cache` (process-wide global)	`cfm_estimator_cache` was the last graph-builder still local-scope in `s3gen_synthesize_to_wav` — every synth call paid the full ~50 ms graph rebuild cost.	`cache.b2` flag handles Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode flip transparently. `cfm_step0` 24 → 13 ms.
`g_time_mlp_results` + `g_time_emb_results`	Two-layer cache by t-value (Turbo + multilingual) and (t, r) pair (Turbo only).	Multilingual fires more: 10 distinct t-values per inference (cosine schedule) vs Turbo's 2-3. 9–19 graph submissions / inf → 0.
`g_weight_cpu_mirror` (`cached_cpu_weights_f32`)	CPU mirror of `flow/input_embedding` + `flow/spk_embed_affine/{w,b}`.	Multilingual is bigger: ~28 MB embedding vs Turbo's ~13 MB. Saves the device→host transfer on every synth.
3 HiFT `ggml_cont` removals	`conv_transpose_1d_f32` exit, ISTFT `y_trim` exit, `f0_predictor` `xp` permute.	Code-quality + future-proofing; CONT total in HiFT is only 0.13 % of HiFT runtime per the perf logger.
G2 dump-script gap closure (`scripts/dump-s3gen-reference.py` + 1-line `test_s3gen.cpp` fix)	`regress-tensor-compare.sh` was aborting at stage G2 with `cannot open cfm_concat.npy`.	Full Python ↔ C++ stage-compare pipeline now runs end-to-end through G2 / G3 / G4 / H1 / H3 / H4 / H5; max rel err 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected), max ≤ 4.7e-5 elsewhere; final waveform `max_abs = 8.20e-08`.

Verification — `5084ee4` (PROGRESS.md only)

After QVAC-18422 §3.34's converter shipped chatterbox-s3gen-mtl-q4_0.gguf (788 MB) + chatterbox-t3-mtl-q4_0.gguf (345 MB), I built a fresh upstream/multilingual_merged HEAD baseline (no Vulkan patches) and ran identical multilingual synthesis on both:

Bit-exact — single-shot MD5 c65d98f15a59b8fe9cad98e46eb3fb30 and 6-segment multi-synth MD5 0b374c7474895a3387b9f1df10b3c1b8 match between this PR and the upstream baseline byte-for-byte. These are the first locked multilingual F32 invariants on the multilingual_merged/main Vulkan base.
Multilingual perf (n=15 warm-state samples per build): −9.5 % S3GEN_INFER, −13.4 % cfm_total, −47.7 % cfm_step0 vs upstream baseline (round-1-only measurement).

Round 2 — `d5c261c` (multilingual-targeted host-side caches)

Targets the per-synth host-CPU overhead that round 1 didn't address. All seven caches sit alongside the round-1 caches; same destroy()-before-ggml_backend_free discipline.

Cache	Keyed on	Purpose / multilingual benefit
`g_encoder_graph_cache`	`T` (encoder input length)	Full `run_encoder` graph + gallocator. Multilingual T~350+ → bigger encoder graph rebuild was being repeated every synth.
`g_hift_graph_cache` (+ `g_hift_inv_alpha_entries`)	`pack(T_mel, T_stft)`	Full `run_hift_decode` graph + gallocator. Parallel `(graph-input-name, source-tensor-ptr)` metadata lets cache hits re-feed every alpha-input slot from `g_inv_alpha_results` without rebuilding. HiFT audio length scales with prompt length → multilingual is the biggest beneficiary.
`g_f0_graph_cache`	`T_mel`	Full `run_f0_predictor` graph + gallocator.
`g_pos_emb_results` (`cached_pos_emb`)	`pack(T, D)`	`compute_pos_emb` output. Pure CPU compute (~`T × D × 5` trig ops); fired twice per encoder run (`T` and `2T`). Multilingual T=350+ at D=512 was the dominant scaffolding cost.
`g_inv_alpha_results` (`cached_inv_alpha`)	`ggml_tensor *`	HiFT calls `invert_alpha_cpu` ~72× per synth (12 ResBlocks × 6 alpha tensors); each is a `tensor_get` + per-element reciprocal. Alpha tensors are constant for the model lifetime.
`g_hann_window_cache` / `g_istft_kernel_cache`	`n_fft`	Pure functions of `n_fft` (constant 16).
`g_window_sum_cache` (`cached_window_sum`)	`pack(n_fft, hop, T_stft)`	Stable across same-shape synth calls.

A new graph_cache struct (used by encoder / HiFT / F0) and a pack_hift_key helper centralise the explicit destroy()-on-teardown pattern, so future per-stage caches plug in with one struct + one mutex acquisition. The destroy path is unified into s3gen_release_synth_caches() (which replaces the old g_cfm_estimator_cache_destroy()) and is called from s3gen_model_cache_release, the cache-miss backend-swap path, and s3gen_unload.
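The shape-keyed memoisation used by the scaffolding caches can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `pack_key`, `expensive_pure_fn`, and `ShapeKeyedCache` are hypothetical names, and the placeholder function does not reproduce the real `compute_pos_emb` math.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical helper: pack two 32-bit shape values into one 64-bit map key.
inline std::uint64_t pack_key(std::uint32_t hi, std::uint32_t lo) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Placeholder for an expensive, pure CPU computation that depends only on (T, D).
// It is NOT the real positional-embedding code.
inline std::vector<float> expensive_pure_fn(std::uint32_t T, std::uint32_t D) {
    assert(T > 0 && D > 0);
    std::vector<float> out(static_cast<std::size_t>(T) * D);
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = static_cast<float>(i % D);
    return out;
}

// Process-lifetime cache with an explicit destroy() for teardown.
class ShapeKeyedCache {
public:
    using Value = std::shared_ptr<const std::vector<float>>;

    Value get(std::uint32_t T, std::uint32_t D) {
        const std::uint64_t key = pack_key(T, D);
        std::lock_guard<std::mutex> lock(mu_);
        auto it = entries_.find(key);
        if (it != entries_.end()) {
            ++hits_;              // same shape seen before: no recompute
            return it->second;
        }
        // The compute step calls no other cache helper, so holding mu_ here is safe.
        Value v = std::make_shared<const std::vector<float>>(expensive_pure_fn(T, D));
        entries_.emplace(key, v);
        return v;
    }

    // Explicit teardown hook, to be called before the backend is freed.
    void destroy() {
        std::lock_guard<std::mutex> lock(mu_);
        entries_.clear();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return entries_.size();
    }

    std::size_t hits() const {
        std::lock_guard<std::mutex> lock(mu_);
        return hits_;
    }

private:
    mutable std::mutex mu_;
    std::unordered_map<std::uint64_t, Value> entries_;
    std::size_t hits_ = 0;
};
```

Returning a `shared_ptr` to immutable data is one possible design choice: a caller that still holds a result stays valid even if `destroy()` runs in between. The number of entries grows with the number of distinct `(T, D)` pairs seen, which matches the memory-cap note in the risk assessment.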

Within-process win on top of round 1 (warm-state seg 2..6, n=5 per build):

Metric	round 1 alone	+ round 2	Δ
S3GEN_INFER	159.8 ms	140.8 ms	−19.0 ms (−11.9 %)
hift_total	17.96 ms	16.30 ms	−1.7 ms (−9.4 %)

Negative result documented (bug caught and fixed during dev)

First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held g_synth_caches_mu while calling cached_inv_alpha, which itself takes the same mutex internally — classic re-entrant deadlock. Fix: snapshot g_hift_inv_alpha_entries under the mutex into a local vector, then iterate without the lock. Inline comment in run_hift_decode documents the rule for future investigators: never hold a cache-state mutex while calling any other cached_* helper.
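The snapshot-then-iterate fix can be sketched as below. This is an illustration only: `g_mu`, `g_entries`, `g_results`, `cached_value`, and `refresh_inputs` are hypothetical stand-ins for the PR's mutex, entry list, result map, and helpers, and the reciprocal is a placeholder for the real per-tensor work.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

static std::mutex g_mu;  // one non-recursive mutex guarding all cache state
static std::vector<std::pair<std::string, int>> g_entries;  // (input name, source id)
static std::unordered_map<int, float> g_results;            // memoised per-source value

// Helper that takes g_mu internally. Calling it while the same thread already
// holds g_mu is undefined behaviour for std::mutex and typically hangs.
float cached_value(int source_id) {
    assert(source_id != 0);
    std::lock_guard<std::mutex> lock(g_mu);
    auto it = g_results.find(source_id);
    if (it != g_results.end()) return it->second;
    const float v = 1.0f / static_cast<float>(source_id);  // placeholder compute
    g_results.emplace(source_id, v);
    return v;
}

// Refresh loop: copy the entry list under the lock, then call helpers unlocked.
std::vector<float> refresh_inputs() {
    std::vector<std::pair<std::string, int>> snapshot;
    {
        std::lock_guard<std::mutex> lock(g_mu);
        snapshot = g_entries;
    }  // lock released here, before any cached_* call
    std::vector<float> fed;
    fed.reserve(snapshot.size());
    for (const auto& entry : snapshot) {
        fed.push_back(cached_value(entry.second));  // re-acquires g_mu, no nesting
    }
    return fed;
}
```

The copy costs one small vector per call, which is the price of never nesting the lock. Switching to `std::recursive_mutex` would also avoid the hang, but it would hide the nesting rather than remove it.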

Bit-exactness — locked invariants

Locked F32 MD5s on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275:

Test	Locked MD5	Verification
Multilingual single-shot (seed 42)	`c65d98f15a59b8fe9cad98e46eb3fb30`	✓ across 4 iters of `regress-mtl-vk.sh verify`
Multilingual 6-segment multi-synth	`0b374c7474895a3387b9f1df10b3c1b8`	✓ across 4 iters
Turbo single-shot (seed 42)	`6219f4338b1b4fb9dc60481216153b49`	✓ across 4 iters

The test-first regression harness bench-logs-vk-mtl/regress-mtl-vk.sh (in the qvac monorepo, out-of-tree) locks a snapshot before any change, then runs verify after every cache addition. Every round-2 cache passed verify immediately after it was added.

Tensor-level Python ↔ C++ stage compare (added via the round-1 G2 fix) runs end-to-end through G2 / G3 / G4 / H1 / H3 / H4 / H5 with max rel err 7.92e-3 on STFT (expected — PyTorch torch.stft vs hand-built DFT-via-conv1d), max ≤ 4.7e-5 everywhere else, final waveform max_abs = 8.20e-08.

How to validate

cd <chatterbox.cpp>

# 1. Apply the Metal + OpenCL + 2 new Vulkan patches
bash scripts/setup-ggml.sh

# 2. Build (Vulkan)
cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vk -j --target tts-cli test-s3gen

# 3. Bit-exactness (locked MD5 invariants — multilingual + Turbo)
#    Snapshot lives in inputFilesForAI/qvac-17872-findings/bench-logs-vk-mtl/
#    in the qvac monorepo; the harness verifies all 3 invariants in one shot.
bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-mtl/regress-mtl-vk.sh \
     build-vk verify-pr8 verify
# Expected:
#   PASS  multilingual single-shot  md5=c65d98f15a59b8fe9cad98e46eb3fb30
#   PASS  multilingual auto-split    md5=0b374c7474895a3387b9f1df10b3c1b8
#   PASS  turbo single-shot          md5=6219f4338b1b4fb9dc60481216153b49

# 4. Cold-start (round-1 VkPipelineCache patch)
rm -rf ~/.cache/ggml/vulkan
./build-vk/tts-cli ...   # first run: ~2.7 s cold
./build-vk/tts-cli ...   # second run: ~250 ms (ggml cache warm)

# 5. Multilingual perf — 6-segment auto-split (within-process warm caches)
./build-vk/tts-cli \
    --model models/chatterbox-t3-mtl-q4_0.gguf \
    --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0.gguf \
    --language en \
    --text "Hello from ggml first synthesis. Second synthesis run here now. \
            Third sentence here. Fourth sentence runs too. Fifth sentence wraps." \
    --max-sentence-chars 32 --out /tmp/mtl.wav \
    --n-gpu-layers 99 --threads 4 --seed 42 --temp 0 --top-k 1 --verbose
# Expected segments 2..6 average:
#   S3GEN_INFER ~141 ms, cfm_total ~119 ms, cfm_step0 ~13 ms, hift_total ~16 ms
# Compare a fresh upstream/main baseline build via the same recipe; multilingual
# WAV must produce IDENTICAL md5sum (PR makes synthesis faster, not different).

The full reproduction recipe (including how to build the upstream baseline in a separate git worktree for comparison) is in PROGRESS.md §3.32 ► "Round 2 ► Reproduction".

Risk assessment

Bit-exactness preserved: the multilingual and Turbo invariants were verified across 4 successive iterations of the test-first harness.
Default behaviour unchanged unless explicit env vars are set:
- Round-1 cache reads/writes from $XDG_CACHE_HOME/ggml/vulkan / $HOME/.cache/ggml/vulkan — opt-out via empty GGML_VK_PIPELINE_CACHE_DIR="".
- All seven round-2 caches are always-on (no env var gate). They never change output, only avoid recomputing it.
No GGUF format change — existing chatterbox-s3gen-{turbo,mtl-q4_0}.gguf work as-is.
No public-API change — include/tts-cpp/chatterbox/*.h untouched.
Two ggml-vulkan patches (round-1 + round-2) shipped under patches/, applied in scripts/setup-ggml.sh (same vendoring model as the existing Metal + OpenCL patches).
Same cmake -DGGML_VULKAN=ON invocation as before — no new dependencies.
Memory cap: every cache is bounded by the number of distinct shape keys it sees across the process lifetime (typically 1-2 entries each). Steady-state per-process overhead: ~280 MB total (the bulk is the three 64 MB graph arenas). Streaming sessions with many distinct chunk sizes can grow these caches; a future LRU bound is documented in PROGRESS.md §3.32 as a deferred follow-up.
Teardown ordering: s3gen_release_synth_caches() runs before ggml_backend_free (same constraint as the pre-existing thread_local time_mlp_cache); registered via atexit() on first cache insertion + called explicitly from the cache-miss / s3gen_unload paths.

Files

PROGRESS.md                                ~+540 / -10  (§3.32 entry: round-1 + verification + round-2)
src/chatterbox_tts.cpp                     ~+625 / -98  (round-1 + round-HIFT + round-2 graph + scaffolding caches)
patches/ggml-vulkan-pipeline-cache.patch    +199         (NEW)
patches/ggml-vulkan-eager-cache-save.patch  +104         (NEW)
scripts/dump-s3gen-reference.py             +65
scripts/setup-ggml.sh                       +20 / -8     (applies the two new Vulkan patches)
patches/README.md                           +13 / -8     (documents the new patches)
src/test_s3gen.cpp                          +6           (G2 set_output(xc) fix)
                                            -----------
Total: 8 files, ~+1500 / −115, 3 commits on upstream/main.

CHANGELOG.md deliberately not added (per PR #1 review feedback) — the investigation entry lives in PROGRESS.md §3.32 instead.

The inputFilesForAI/qvac-17872-findings/{FINDINGS,PR_DESCRIPTION,bench-logs-vk-mtl/}* companion docs stay in the qvac monorepo (out-of-tree) — same arrangement as the QVAC-18422 sister PR.

Deferred follow-ups (separate PRs)

Candidate	Estimated multilingual win	Why deferred
C1 — F16 CFM matmul weights (opt-in `CHATTERBOX_F16_CFM`)	~125 MB device-memory + bandwidth-bound mobile win	Multilingual CFM is ~6× larger than Turbo, so this is the bandwidth-bound mobile lever. `multilingual_merged`'s `load_s3gen_gguf` uses `ggml_dup_tensor + ggml_backend_alloc_ctx_tensors` — different from the `main`-base path our F16 conversion was written against. ~100 lines of adaptation + new locked MD5 baselines (NVIDIA + AMD, F32 + F16).
`cooperative_matrix2` (CM2) tensor-core engagement for narrow CFM matmuls	−8.6 % cfm_total measured in the prior `main`-base CM2 Tier-3 close-out	Requires LunarG SDK 1.4.341.1 + `glslc` 2026.1; politically blocked behind a cmake flag pending project-wide baseline-set sign-off. See `inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND_CM2.md`.
Round-4 / 6 Q/K/V batched matmul fusion composition with `multilingual_merged`'s zero-cont strided 3D Q/K/V views (`849507a`)	~1.3 ms RTX 5090 + larger on bandwidth-starved targets	Pick-one-approach decision deferred; needs Vulkan `flash_attn_ext` stride-tolerance verification.
T3 step-graph cache (multilingual fires `build_step_graph_mtl` 2× per token via CFG)	~12 % T3 wall reduction on synth #2+ in long-running processes	Covered by QVAC-18422 §3.35 on the CPU branch; same pattern would port.
Mobile validation (Adreno / Mali / Apple)	unknown — biggest remaining evidence gap	Hardware-bound. AMD/RADV proxy refuted the original `main`-base mobile-bandwidth projections on rounds 2 / 3 / 5 / 6 / C1, so real mobile runs would either confirm or force revision.
CI integration of `regress-mtl-vk.sh` + the existing `regress-c1.sh` / `regress-amd.sh` / `regress-tensor-compare.sh`	n/a — test-infra	Now unblocked since round-1 closed the G2 gap. Catches future regressions like the deadlock that surfaced (and was fixed) during round-2 dev.

Re-bases the closed PR GustavoA1604#1 work onto upstream/multilingual_merged (was previously on upstream/main).

Addresses the PR GustavoA1604#1 review:

1. Base is now multilingual_merged (was main).
2. CHANGELOG.md dropped; the investigation entry lives in PROGRESS.md §3.32 instead.
3. Optimisations are model-agnostic by construction, so they benefit BOTH the Turbo (meanflow) and the multilingual (standard CFM with CFG) variants; see PROGRESS.md §3.32 "Why this is model-agnostic by construction".

Two ggml-vulkan patches + four host-side optimisations in src/chatterbox_tts.cpp. All bit-exact on F32 across NVIDIA + AMD/RADV. No public-API change, no GGUF format change, no new build-system requirement.

Round-coverage on top of multilingual_merged
---------------------------------------------

This squashed port carries only the optimisations that remain measurable on the multilingual_merged base. The full per-round investigation (8 rounds + AMD validation + LunarG SDK / coopmat2 Tier-3 close-out) is preserved in the qvac monorepo at inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND*.md and PR_DESCRIPTION_FULL.md.

Carried forward (in this commit):

* patches/ggml-vulkan-pipeline-cache.patch (199 lines NEW)
  Persistent VkPipelineCache, opt-in via GGML_VK_PIPELINE_CACHE_DIR. Recovers ~91 % of the cold→warm gap on the first warm run.
* patches/ggml-vulkan-eager-cache-save.patch (104 lines NEW)
  Crash-safe pipeline-cache flush, stacks on the first patch.
* Persistent CFM estimator graph cache (g_cfm_estimator_cache)
  Was the last graph-builder still local-scope in s3gen_synthesize_to_wav. cache.b2 flag handles the Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode switch. Per-step verbose: chunk 1 cfm_total=80 ms; chunks 2..16 cfm_total=30 ms. Also eliminates a latent process-exit crash risk (Vulkan dylib static-destructor ordering).
* Time-embedding result memoisation (g_time_mlp_results, g_time_emb_results)
  Two-layer cache by t-value (Turbo + multilingual) and (t, r) pair (Turbo only). 6 graph submissions/inf → 0 for Turbo; 9–19 → 0 for the multilingual 10-step cosine schedule.
* CPU mirror cache for large per-synth weight downloads (g_weight_cpu_mirror)
  flow/input_embedding (~13.4 MB Turbo / ~28 MB multilingual) + spk_embed_affine/{w,b} were re-downloaded GPU→CPU on every synth. Cleared on backend-swap and model-cache release.
* 3 HiFT cont sites removed (perf-neutral, code quality)
  conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp permute. All consumers tolerate strided sources.
* G2 dump-script gap closure (regress-tensor-compare.sh now runs end-to-end through G2/G3/G4/H1/H3/H4/H5)
  cfm_concat / cfm_h_conv / cfm_h_ln / hift_s_stft .npy files now produced; ggml_set_output(xc) added to stage_G2 so the gallocator preserves the diagnostic intermediate.

Deferred (separate follow-ups):

* C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM). multilingual_merged's load_s3gen_gguf uses ggml_dup_tensor + ggml_backend_alloc_ctx_tensors; needs ~100 lines adapting our F32→F16 conversion path + new MD5 baselines (NVIDIA + AMD, F32 + F16).
* Round-4 / 6 Q/K/V batched matmul fusion. multilingual_merged uses zero-cont strided 3D Q/K/V views (their 849507a), an alternative optimisation for the same code; composing them is non-trivial and needs Vulkan flash_attn_ext stride-tolerance verification.
* HiFT decoder graph caching. multilingual_merged's run_hift_decode rebuilds gallocr_t + ctx fresh on every call (no g_hift_cache equivalent); same persistent-cache pattern would save another ~5–10 ms / chunk on the multilingual variant.
* Multilingual GGUF cross-validation. May 4 measurement was on Turbo because the multilingual GGUF was not available locally then. After QVAC-18422 §3.34's converter shipped chatterbox-s3gen-mtl-q4_0.gguf, this is a follow-up cross-check; by construction every cache should hit ≥ as often as on Turbo (multilingual has more distinct t-values per inference and a larger input_embedding).

Performance: RTX 5090, regress-tight aggregate, n=75 chunks, Turbo
-------------------------------------------------------------------

metric | upstream/multilingual_merged | + this PR | Δ
S3GEN_INFER | 76.6 ms | 65.4 ms | -11.2 ms (-14.6 %)
cfm_total | 40.3 ms | 28.7 ms | -11.6 ms (-28.8 %)
encoder | 19.9 ms | 20.7 ms | noise
hift_decode | 10.9 ms | 11.6 ms | noise

cfm_total ranges fully separated on n=120 samples (base [38.3, 42.8] vs final [27.1, 30.1]).

Smaller absolute saving than the original upstream/main base measurement (~-45 ms / -41 % S3GEN_INFER) because multilingual_merged already contains the zero-cont strided Q/K/V views, the reduced 256 MB → 64 MB CFM buf, the thread_local time_mlp_cache, and the dropped redundant gallocr_reserve in HiFT/time_mlp, all of which originally contributed to the larger headline number on the main base.

Bit-exactness
-------------

* RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325: 3/3 F32 invariants PASS (round-1 single-shot WAV; round-2 multi-synth identical; round-3 multi-synth varied).
* AMD iGPU (RADV RAPHAEL_MENDOCINO, Mesa 25.2.8): 3/3 F32 invariants PASS.
* F16 invariants are not in this commit (C1 deferred).
* Tensor-level Python ↔ C++ stage compare runs end-to-end through G2/G3/G4/H1/H3/H4/H5; max relative error 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected; ISTFT roundtrip recovers to bit-exact); max ≤ 4.7e-5 elsewhere; final waveform max_abs = 8.20e-08.

Files
-----

PROGRESS.md                                 +297  (§3.32 entry)
src/chatterbox_tts.cpp                      +212 / -19
patches/ggml-vulkan-pipeline-cache.patch    +199  (NEW)
patches/ggml-vulkan-eager-cache-save.patch  +104  (NEW)
scripts/dump-s3gen-reference.py             +65
scripts/setup-ggml.sh                       +20 / -8
patches/README.md                           +13 / -8
src/test_s3gen.cpp                          +6
Total +890 / -22, 8 files

How to validate
---------------

cd <chatterbox.cpp>
bash scripts/setup-ggml.sh   # applies Metal + OpenCL + 2 Vulkan patches
cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vk -j --target tts-cli test-s3gen

# Cold start (ggml-vulkan-pipeline-cache.patch)
rm -rf ~/.cache/ggml/vulkan
./build-vk/tts-cli ...   # first run: ~2.7 s cold
./build-vk/tts-cli ...   # second run: ~250 ms (ggml cache warm)

# Bit-exactness (3 F32 invariants from the qvac monorepo harness)
bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-c1/regress-c1.sh build-vk 1
VK_LOADER_DRIVERS_SELECT='radeon_icd*' \
    bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-amd/regress-amd.sh build-vk 1

# Aggregate perf
bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-round3/regress-tight.sh build-vk mtl-final 5
# Expected: S3GEN_INFER ~65 ms, cfm_total ~29 ms, n=75
# vs upstream/multilingual_merged baseline: S3GEN_INFER ~77 ms, cfm_total ~40 ms

Co-authored-by: Cursor <cursoragent@cursor.com>

… Vulkan

Closes the multilingual-applicability gap that the May 4 squashed port (commit ac4748a) left open. The May 4 measurement was on Turbo only because the multilingual GGUF was not available locally then; after QVAC-18422 §3.34's converter shipped chatterbox-s3gen-mtl-q4_0.gguf (788 MB) and chatterbox-t3-mtl-q4_0.gguf (345 MB), the actual multilingual verification is now feasible.

Test methodology
----------------

Six-segment auto-split via --max-sentence-chars 32 (the multilingual T3 GGUF doesn't embed the tokenizer needed for the --input-file streaming pattern; --max-sentence-chars triggers multiple within-process synth calls which is what the persistent host caches actually need to fire). Three iterations × five warm-state segments = n=15 samples per build.

Comparison build: a fresh upstream/multilingual_merged HEAD (b074399) worktree at /tmp/cb-base-mtl-merged with only the Metal + OpenCL patches applied (NOT the two new Vulkan patches in this PR). Both builds use the same vendored ggml commit 58c38058 and the same Vulkan 1.3.275 / RTX 5090 + NVIDIA 590.48 host.

Bit-exactness: first locked multilingual F32 invariants
--------------------------------------------------------

Both single-shot and 6-segment multi-synth produce byte-identical multilingual WAV vs the upstream/multilingual_merged baseline:

Single-shot (seed 42, --temp 0):   c65d98f15a59b8fe9cad98e46eb3fb30
Multi-synth 6 segments (seed 42):  0b374c7474895a3387b9f1df10b3c1b8

These are the FIRST locked multilingual F32 invariants for the Vulkan path on the multilingual_merged base (the previously locked RTX 5090 invariants in regress-c1.sh were captured against the older main-base branch and don't apply to this base).

Performance: RTX 5090, n=15 warm-state samples per build
---------------------------------------------------------

metric | upstream/mtl_merged | this PR | Δ
S3GEN_INFER | 169.9 ms | 153.7 ms | -16.2 ms (-9.5 %)
cfm_total | 132.5 ms | 114.7 ms | -17.8 ms (-13.4 %)
cfm_step0 | 24.1 ms | 12.6 ms | -11.5 ms (-47.7 %)

cfm_step0 is the strongest multilingual signal: the persistent CFM estimator graph cache eliminates ~half of the per-segment graph-rebuild cost on warm-state synth.

The -9.5 % S3GEN_INFER win is below the Turbo wins because:

1. Multilingual CFM is ~6× larger in absolute terms (more layers, larger hidden dims, default 10-step cosine schedule vs Turbo's 2-step meanflow), so the cached host overhead is a smaller fraction of the wall.
2. The multilingual baseline absorbs more per-synth fixed cost than Turbo does: multilingual hits compute_time_mlp 10 times per inference but each time only touches a tiny graph; the cached CFM estimator graph matters more.

First-segment cold cost
-----------------------

Within a single process, the first segment pays a one-time cache-warm-up overhead: PR 210-236 ms vs baseline 195-241 ms (no statistically significant first-segment penalty given run-to-run variance). Subsequent segments are where the caches actually pay off and the win is consistently visible.

Across processes, the persistent VkPipelineCache patch (round-1) collapses the cold-process startup: cfm_step0 on a fresh process drops from ~133 ms (no cache, full shader compile) to ~30 ms (cache hit), which is the headline mobile / Mesa win.

Files: PROGRESS.md +125 / -6 lines. No source-code changes; this commit is purely the verification write-up that confirms the May 4 port's optimisations work correctly and meaningfully on the multilingual model on Vulkan, exactly as predicted by the "model-agnostic by construction" analysis in PROGRESS.md §3.32.

Co-authored-by: Cursor <cursoragent@cursor.com>

… + scaffolding caches (multilingual Vulkan)

Targets the per-synth host-CPU overhead that round 1 / round-HIFT didn't address, on top of upstream/multilingual_merged (now in main via PR GustavoA1604#7).

Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo locks the pre-change MD5 baseline, then re-verifies after every cache. All 3 invariants (multilingual single-shot, multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact.

Seven new caches
----------------

All host-side, model-agnostic, no GGUF-format change, no public-API change. Same teardown discipline as the existing g_cfm_estimator_cache (destroy() before ggml_backend_free). Sit alongside the existing round-1 caches.

- g_encoder_graph_cache (keyed on T): full run_encoder graph + gallocator. Streaming chunks of varying length still produce correct output (rebuilds on key change).
- g_hift_graph_cache (keyed on pack(T_mel, T_stft)) + g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator. Parallel (graph-input-name, source-tensor-ptr) metadata lets cache hits re-feed each alpha-input slot from g_inv_alpha_results without rebuilding the graph.
- g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph + gallocator.
- cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)): compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired twice per encoder run (T and 2T). Multilingual T~350+ at D=512 is a real wedge of per-synth host time.
- cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*): HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6 alpha tensors); each is a tensor_get + per-element reciprocal. Alpha tensors are constant for the model lifetime.
- cached_hann_window / cached_istft_kernel (g_hann_window_cache / g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft (constant 16 in the chatterbox HiFT path).
- cached_window_sum (g_window_sum_cache, keyed on pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across same-shape synth calls.

A new graph_cache struct (used by encoder / HiFT / F0) and a pack_hift_key helper centralise the explicit destroy()-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition. The destroy path is unified into a renamed s3gen_release_synth_caches() (replaces the old g_cfm_estimator_cache_destroy()), called from s3gen_model_cache_release, the cache-miss backend-swap path, and the explicit s3gen_unload().

Negative result documented (bug caught and fixed during dev)
------------------------------------------------------------

First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held g_synth_caches_mu while calling cached_inv_alpha, which itself takes the same mutex internally (a classic re-entrant deadlock). Fix: snapshot g_hift_inv_alpha_entries under the mutex into a local vector, then iterate without the lock (cached_inv_alpha re-acquires the mutex per call but with no nesting). General rule kept as an inline comment: never hold a cache-state mutex while calling any other cached_* helper.

Performance: RTX 5090, multilingual auto-split, warm-state seg 2..6
-------------------------------------------------------------------

Within-process win on top of round 1 + round-HIFT:

metric | pre-round-2 | post-round-2 | Δ
S3GEN_INFER | 159.8 ms | 140.8 ms | -19.0 ms (-11.9 %)
cfm_total | 122.2 ms | 118.7 ms | -3.5 ms (-2.9 %)
cfm_step0 | 13.24 ms | 13.18 ms | noise (already cached round 1)
hift_total | 17.96 ms | 16.30 ms | -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline (round 1 + round-HIFT + round 2):

metric | upstream/mtl_merged | this PR (full) | Δ
S3GEN_INFER | 169.9 ms | 140.8 ms | -29.1 ms (-17.1 %)
cfm_total | 132.5 ms | 118.7 ms | -13.8 ms (-10.4 %)
cfm_step0 | 24.1 ms | 13.2 ms | -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is the actual GPU CFM compute, which is not host-cacheable; it would need shader-side optimisation (e.g. tensor-core engagement via cooperative_matrix2; deferred, see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------

Locked invariants pass byte-for-byte vs the pre-change baseline:

Multilingual single-shot      c65d98f15a59b8fe9cad98e46eb3fb30 ✓
Multilingual 6-segment multi  0b374c7474895a3387b9f1df10b3c1b8 ✓
Turbo single-shot             6219f4338b1b4fb9dc60481216153b49 ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo is the test-first harness.

Files
-----

src/chatterbox_tts.cpp  +373 / -79 (net diff vs round-1 head)
PROGRESS.md             §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache infrastructure: graph_cache struct, seven new globals, the s3gen_release_synth_caches lifecycle hook, the five cached_* scaffolding helpers, and the build_graph / cache-hit branches in run_encoder / run_hift_decode / run_f0_predictor.

Co-authored-by: Cursor <cursoragent@cursor.com>

PR #8 (QVAC-17872 round-HIFT) dropped the trailing ggml_cont on `xp` before ggml_mul_mat in run_f0_predictor on the rationale that "Vulkan / Metal / CUDA mul_mat shaders all iterate by stride and accept strided src1 for f32 matmul". That holds for those GPU backends but ggml-cpu's mul_mat enforces nb10 == ggml_type_size(src1->type), so the bare permute aborts the process on CPU during HiFT decode (visible as 4x repeated GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed when running tts-cli on CPU with --threads 4 against any Chatterbox GGUF; the parity test test-cpu-caches reproduces the same crash on its warm-cache lifecycle pass).

The other two cont removals from PR #8 (conv_transpose_1d_f32 exit into ggml_add, ISTFT y_trim into ggml_clamp) consume into element-wise ops that DO accept strided sources on every backend and stay removed. Only the f0_predictor site ever feeds a permuted tensor into mul_mat src1, so the unconditional cont is the minimal fix.

Validated locally on Windows / MSVC / qvac-ext-ggml/speech: chatterbox-t3-turbo + chatterbox-s3gen run end-to-end on CPU, T3 1817 ms, S3Gen 2061 ms, no asserts.

Co-authored-by: Cursor <cursoragent@cursor.com>
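Why a bare permute trips the CPU assert can be illustrated with a toy model of extents and byte strides. The sketch below does not use ggml: `View2D`, `make_contiguous`, `permute`, `cont`, and `innermost_is_dense` are hypothetical stand-ins that only mimic the `ne` / `nb` bookkeeping referenced by the assert.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Toy stand-in for a 2D f32 tensor view (NOT ggml's struct):
// ne = extents per dimension, nb = byte stride per dimension.
struct View2D {
    std::array<std::int64_t, 2> ne;
    std::array<std::size_t, 2>  nb;
};

// Dense row-major layout: innermost stride equals the element size.
View2D make_contiguous(std::int64_t ne0, std::int64_t ne1) {
    assert(ne0 > 0 && ne1 > 0);
    View2D v;
    v.ne = {ne0, ne1};
    v.nb = {sizeof(float), sizeof(float) * static_cast<std::size_t>(ne0)};
    return v;
}

// A permute swaps extents and strides without moving any data.
View2D permute(View2D v) {
    std::swap(v.ne[0], v.ne[1]);
    std::swap(v.nb[0], v.nb[1]);
    return v;
}

// A cont-style copy: same logical shape, freshly dense strides.
View2D cont(const View2D& v) {
    return make_contiguous(v.ne[0], v.ne[1]);
}

// Mirrors the shape of the quoted condition: nb10 == size of the element type.
bool innermost_is_dense(const View2D& v) {
    return v.nb[0] == sizeof(float);
}
```

For an 8 × 3 tensor the permuted view has an innermost stride of 32 bytes instead of 4, so a kernel that requires a dense innermost dimension rejects it until a cont-style copy restores the dense layout.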

…INE_CACHE_DIR

Adds an opt-in persistent shader cache to ggml-vulkan. Enabled only when the caller sets GGML_VK_PIPELINE_CACHE_DIR to a non-empty path; when unset or empty behaviour is byte-identical to upstream ggml-vulkan.

No auto-discovery of $XDG_CACHE_HOME or $HOME. ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface.

When enabled, createPipelineCache is seeded from the path at init and getPipelineCacheData is written back from ggml_vk_cleanup() (not ~vk_device_struct which is unreliable at process exit due to shared_ptr ref cycles). File keyed on vendorID/deviceID/driverVersion; Vulkan validates the blob header and silently ignores stale data if the shader bundle or driver changed. Atomic save via tmp+rename.

Recovers ~91% of the cold->warm shader-compile gap on the first warm run on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers).

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-1).

Co-authored-by: Cursor <cursoragent@cursor.com>
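The "keyed file + atomic save via tmp+rename" idea can be sketched with the C++17 filesystem library. This is an assumption-laden illustration, not the patch itself: `cache_file_name`, `atomic_save`, the hex formatting, and the `.bin` / `.tmp` suffixes are all hypothetical, and no Vulkan calls are involved.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical naming scheme: one file per (vendor, device, driver) triple,
// so a driver update naturally maps to a different file.
std::string cache_file_name(std::uint32_t vendor, std::uint32_t device, std::uint32_t driver) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%08x-%08x-%08x.bin",
                  static_cast<unsigned>(vendor),
                  static_cast<unsigned>(device),
                  static_cast<unsigned>(driver));
    return std::string(buf);
}

// Write the blob to "<dest>.tmp", then rename it over dest. On POSIX the rename
// replaces the target atomically, so readers see the old or the new file, never
// a half-written one.
bool atomic_save(const fs::path& dest, const std::vector<unsigned char>& blob) {
    fs::path tmp = dest;
    tmp += ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(blob.data()),
                  static_cast<std::streamsize>(blob.size()));
        out.flush();
        if (!out) return false;
    }  // stream closed before the rename
    std::error_code ec;
    fs::rename(tmp, dest, ec);
    if (ec) {
        fs::remove(tmp, ec);  // best-effort cleanup of the temp file
        return false;
    }
    return true;
}
```

A process killed during the write leaves at most a stray `.tmp` file behind; the previously saved cache file is untouched.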

Stacks on the previous patch. Writes back the on-disk pipeline-cache blob after every ggml_vk_load_shaders compile batch instead of only at ggml_vk_cleanup() time, so a process killed mid-graph (SIGKILL, abort, OS shutdown) doesn't lose the freshly compiled pipelines.

Adds pipeline_cache_last_size book-keeping so warm runs short-circuit the disk write: the eager path only flushes when the cache actually grew (blob.size() > last_size), and the cleanup path skips when size matches last_size. This avoided a +90 ms WALL regression measured during dev when the flush was unconditional.

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-2).

Co-authored-by: Cursor <cursoragent@cursor.com>
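The size book-keeping described above amounts to two small guards, sketched here under stated assumptions: `CacheFlusher` and its members are hypothetical names, and the actual disk write is replaced by a counter so the logic can be checked in isolation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the pipeline-cache flush bookkeeping.
struct CacheFlusher {
    std::size_t last_size = 0;  // size of the blob at the last successful flush
    int writes = 0;             // counts simulated disk writes

    // Eager path: called after each compile batch; flush only if the blob grew.
    bool flush_if_grown(const std::vector<unsigned char>& blob) {
        if (blob.size() <= last_size) return false;  // warm run: nothing new to persist
        ++writes;                                    // the disk write would go here
        last_size = blob.size();
        return true;
    }

    // Cleanup path: skip when the size matches what is already on disk.
    bool flush_at_cleanup(const std::vector<unsigned char>& blob) {
        if (blob.size() == last_size) return false;
        ++writes;
        last_size = blob.size();
        return true;
    }
};
```

Comparing sizes rather than contents keeps the warm path to a single integer comparison, which is consistent with the goal of avoiding an unconditional flush.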

Removes references to internal QVAC ticket numbers, fork-PR numbers, and 'round-HIFT / round 2 / round 4 / round 5 / (this PR)' phase markers that document the development history rather than the code's behaviour. Tightens the surrounding prose so each comment reads as 'what this code does and why' instead of 'how we got here'.

Specific edits:

- src/chatterbox_tts.cpp: 18 comment blocks rewritten. The big CPU-side persistent-cache header at line 446 now describes the caches (a..j) as one homogeneous set instead of 'Round 1' + 'Round 2'. The PR #8 vs CPU-correctness explanation around the f0_predictor ggml_cont keeps the technical rationale (CPU mul_mat asserts on strided src1, GPU shaders accept it) but drops the 'PR #8 / QVAC-17872 round-HIFT optimised for' prefix.
- src/t3_mtl.cpp: 5 comments around the T3 step-graph cache.
- src/chatterbox_engine.cpp + src/chatterbox_cli.cpp: 3 'drop the T3 step-graph cache before backend free' comments.
- src/chatterbox_tts_test_hooks.h: 2 references rewritten as a more generic 'persistent-cache work for the CPU-side multilingual TTS path; see PROGRESS.md for design notes' framing.
- CMakeLists.txt: 2 test-registration comment annotations.

Vendored upstream content (src/dr_wav.h's own changelog dates) is untouched. Pure comment-only change; rebuilt tts-cpp under MSVC Release with no new warnings or errors. No source code, public API, or build behaviour changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

Zbig9000 and others added 2 commits May 6, 2026 14:55

Zbig9000 force-pushed the chatterbox-QVAC-17872-TTS-GGML-Optimize-cpp-backend-multilingual-for-Vulkan branch from ac4748a to 5084ee4 Compare May 6, 2026 12:55

GustavoA1604 merged commit 1cc7dae into GustavoA1604:main May 6, 2026

Zbig9000 deleted the chatterbox-QVAC-17872-TTS-GGML-Optimize-cpp-backend-multilingual-for-Vulkan branch May 7, 2026 07:57

GustavoA1604 merged 3 commits into GustavoA1604:main from Zbig9000:chatterbox-QVAC-17872-TTS-GGML-Optimize-cpp-backend-multilingual-for-Vulkan
