QVAC-17872 [TTS GGML] Optimize cpp backend multilingual model for Vulkan#8

Merged
GustavoA1604 merged 3 commits into GustavoA1604:main from Zbig9000:chatterbox-QVAC-17872-TTS-GGML-Optimize-cpp-backend-multilingual-for-Vulkan
May 6, 2026

@Zbig9000 Zbig9000 commented May 6, 2026

multilingual on Vulkan, RTX 5090, warm-state seg 2..6

| Metric | upstream/main baseline | this PR (3 commits) | Δ |
| --- | --- | --- | --- |
| S3GEN_INFER | 169.9 ms | 140.8 ms | −29.1 ms (−17.1 %) |
| cfm_total | 132.5 ms | 118.7 ms | −13.8 ms (−10.4 %) |
| cfm_step0 | 24.1 ms | 13.2 ms | −10.9 ms (−45.2 %) |
| Cold-start | ~2.7 s | ~250 ms | −2.4 s (−91 %) |

Bit-exactness is preserved on multilingual: the locked MD5 invariants (single-shot c65d98f1…, 6-segment multi-synth 0b374c74…, Turbo single-shot 6219f433…) match byte-for-byte across 4 successive iterations of the test-first regression harness.

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm_total) is the actual GPU CFM compute — not host-cacheable. Closing that requires shader-side work (e.g. tensor-core engagement via cooperative_matrix2); listed as a deferred follow-up below.


What this PR does

Round 1 — f6893b2 (squashed source port from the closed PR #1)

| Component | What it is | Multilingual benefit |
| --- | --- | --- |
| patches/ggml-vulkan-pipeline-cache.patch (199 lines, NEW) | Persistent VkPipelineCache keyed by <vendorID>-<deviceID>-<driverVersion>, opt-in via GGML_VK_PIPELINE_CACHE_DIR. | Cold-start ~2.7 s → ~250 ms. Mesa / Adreno / Mali biggest target (no driver-side binary cache). |
| patches/ggml-vulkan-eager-cache-save.patch (104 lines, NEW) | Crash-safe pipeline-cache flush. Skips disk write when cache size unchanged. | Stacks on the first patch; perf-neutral on warm runs. |
| g_cfm_estimator_cache (process-wide global) | cfm_estimator_cache was the last graph-builder still local-scope in s3gen_synthesize_to_wav, so every synth call paid the full ~50 ms graph rebuild cost. | cache.b2 flag handles Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode flip transparently. cfm_step0 24 → 13 ms. |
| g_time_mlp_results + g_time_emb_results | Two-layer cache by t-value (Turbo + multilingual) and (t, r) pair (Turbo only). | Multilingual fires more: 10 distinct t-values per inference (cosine schedule) vs Turbo's 2-3. 9–19 graph submissions / inf → 0. |
| g_weight_cpu_mirror (cached_cpu_weights_f32) | CPU mirror of flow/input_embedding + flow/spk_embed_affine/{w,b}. | Multilingual is bigger: ~28 MB embedding vs Turbo's ~13 MB. Saves the device→host transfer on every synth. |
| 3 HiFT ggml_cont removals | conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp permute. | Code-quality + future-proofing; CONT total in HiFT is only 0.13 % of HiFT runtime per the perf logger. |
| G2 dump-script gap closure (scripts/dump-s3gen-reference.py + 1-line test_s3gen.cpp fix) | regress-tensor-compare.sh was aborting at stage G2 with cannot open cfm_concat.npy. | Full Python ↔ C++ stage-compare pipeline now runs end-to-end through G2 / G3 / G4 / H1 / H3 / H4 / H5; max rel err 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected), max ≤ 4.7e-5 elsewhere; final waveform max_abs = 8.20e-08. |
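
The t-value memoisation row above can be pictured with a small host-side sketch. This is not the code in src/chatterbox_tts.cpp: `TimeEmbMemo` and its `compute` callback are hypothetical names, and keying on the float's bit pattern is an assumption, chosen so that a cache hit returns exactly what a recompute would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical memo of time-embedding outputs, keyed by the exact bits of t.
class TimeEmbMemo {
public:
    using Compute = std::function<std::vector<float>(float)>;

    // Returns the stored result for t; calls compute(t) only on a miss.
    std::vector<float> get(float t, const Compute& compute) {
        std::uint32_t key = 0;
        std::memcpy(&key, &t, sizeof key);  // bit pattern, not a rounded value
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto it = results_.find(key);
            if (it != results_.end()) return it->second;
        }
        // Compute with the lock released so a slow submission blocks nobody.
        std::vector<float> value = compute(t);
        std::lock_guard<std::mutex> lock(mu_);
        return results_.emplace(key, std::move(value)).first->second;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return results_.size();
    }

private:
    mutable std::mutex mu_;
    std::map<std::uint32_t, std::vector<float>> results_;
};
```

With a 10-step schedule the callback would run 10 times on the first synth and not at all on later ones, which is the "9–19 graph submissions / inf → 0" effect in miniature.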

Verification — 5084ee4 (PROGRESS.md only)

After QVAC-18422 §3.34's converter shipped chatterbox-s3gen-mtl-q4_0.gguf (788 MB) + chatterbox-t3-mtl-q4_0.gguf (345 MB), I built a fresh upstream/multilingual_merged HEAD baseline (no Vulkan patches) and ran identical multilingual synthesis on both:

  • Bit-exact — single-shot MD5 c65d98f15a59b8fe9cad98e46eb3fb30 and 6-segment multi-synth MD5 0b374c7474895a3387b9f1df10b3c1b8 match between this PR and the upstream baseline byte-for-byte. These are the first locked multilingual F32 invariants on the multilingual_merged/main Vulkan base.
  • Multilingual perf (n=15 warm-state samples per build): −9.5 % S3GEN_INFER, −13.4 % cfm_total, −47.7 % cfm_step0 vs upstream baseline (round-1-only measurement).

Round 2 — d5c261c (multilingual-targeted host-side caches)

Targets the per-synth host-CPU overhead that round 1 didn't address. All seven caches sit alongside the round-1 caches; same destroy()-before-ggml_backend_free discipline.

| Cache | Keyed on | Purpose / multilingual benefit |
| --- | --- | --- |
| g_encoder_graph_cache | T (encoder input length) | Full run_encoder graph + gallocator. Multilingual T~350+ → bigger encoder graph rebuild was being repeated every synth. |
| g_hift_graph_cache (+ g_hift_inv_alpha_entries) | pack(T_mel, T_stft) | Full run_hift_decode graph + gallocator. Parallel (graph-input-name, source-tensor-ptr) metadata lets cache hits re-feed every alpha-input slot from g_inv_alpha_results without rebuilding. HiFT audio length scales with prompt length → multilingual is the biggest beneficiary. |
| g_f0_graph_cache | T_mel | Full run_f0_predictor graph + gallocator. |
| g_pos_emb_results (cached_pos_emb) | pack(T, D) | compute_pos_emb output. Pure CPU compute (~T × D × 5 trig ops); fired twice per encoder run (T and 2T). Multilingual T=350+ at D=512 was the dominant scaffolding cost. |
| g_inv_alpha_results (cached_inv_alpha) | ggml_tensor * | HiFT calls invert_alpha_cpu ~72× per synth (12 ResBlocks × 6 alpha tensors); each is a tensor_get + per-element reciprocal. Alpha tensors are constant for the model lifetime. |
| g_hann_window_cache / g_istft_kernel_cache | n_fft | Pure functions of n_fft (constant 16). |
| g_window_sum_cache (cached_window_sum) | pack(n_fft, hop, T_stft) | Stable across same-shape synth calls. |
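
As an illustration of the scaffolding caches, here is a minimal sketch of a shape-keyed window-sum cache. It assumes the envelope is the usual ISTFT overlap-add sum of squared window samples and that the window depends only on n_fft; the function names are hypothetical and the real helper may normalise differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <tuple>
#include <vector>

// Overlap-add envelope: frame t adds window[k]^2 at output index t*hop + k.
// Assumption: squared-window normalisation, as in a standard ISTFT.
static std::vector<float> compute_window_sum(const std::vector<float>& window,
                                             int hop, int T_stft) {
    assert(hop >= 1 && T_stft >= 1);
    const int n_fft = static_cast<int>(window.size());
    std::vector<float> sum(static_cast<std::size_t>((T_stft - 1) * hop + n_fft), 0.0f);
    for (int t = 0; t < T_stft; ++t)
        for (int k = 0; k < n_fft; ++k)
            sum[static_cast<std::size_t>(t * hop + k)] +=
                window[static_cast<std::size_t>(k)] * window[static_cast<std::size_t>(k)];
    return sum;
}

// Hypothetical cache: the same (n_fft, hop, T_stft) returns the stored envelope.
const std::vector<float>& cached_window_sum_sketch(const std::vector<float>& window,
                                                   int hop, int T_stft) {
    static std::mutex mu;
    static std::map<std::tuple<int, int, int>, std::vector<float>> cache;
    const auto key = std::make_tuple(static_cast<int>(window.size()), hop, T_stft);
    std::lock_guard<std::mutex> lock(mu);
    auto it = cache.find(key);
    if (it == cache.end())
        it = cache.emplace(key, compute_window_sum(window, hop, T_stft)).first;
    return it->second;  // std::map references survive later insertions
}
```

The inner loops are the "T_stft × n_fft adds" that a cache hit skips on same-shape synth calls.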

A new graph_cache struct (used by encoder / HiFT / F0) and a pack_hift_key helper centralise the explicit destroy()-on-teardown pattern so future per-stage caches plug in with one struct + one mutex acquisition. Destroy path is unified into s3gen_release_synth_caches() (renames the old g_cfm_estimator_cache_destroy()), called from s3gen_model_cache_release / cache-miss / s3gen_unload.
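
A rough sketch of the shape-keyed pattern described above, under stated assumptions: `GraphCacheSketch` and `pack_key` are hypothetical stand-ins for graph_cache and pack_hift_key, `Resource` stands in for the graph + gallocator pair, and locking is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Pack two 32-bit shape values (e.g. T_mel, T_stft) into one 64-bit key.
inline std::uint64_t pack_key(std::uint32_t hi, std::uint32_t lo) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical single-entry cache for a built graph.
template <typename Resource>
struct GraphCacheSketch {
    bool valid = false;
    std::uint64_t key = 0;
    Resource resource{};
    std::function<void(Resource&)> release;  // frees the graph and allocator

    // Reuse the resource when the key matches; otherwise release and rebuild.
    template <typename Build>
    Resource& get_or_build(std::uint64_t wanted, Build&& build) {
        if (!valid || key != wanted) {
            destroy();
            resource = build();
            key = wanted;
            valid = true;
        }
        return resource;
    }

    // Explicit teardown; idempotent so it can run from several release paths.
    void destroy() {
        if (valid && release) release(resource);
        valid = false;
    }
};
```

A key change (a streaming chunk of a different length, say) triggers one release and one rebuild; repeated same-shape calls reuse the entry.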

Within-process win on top of round 1 (warm-state seg 2..6, n=5 per build):

| Metric | round 1 alone | + round 2 | Δ |
| --- | --- | --- | --- |
| S3GEN_INFER | 159.8 ms | 140.8 ms | −19.0 ms (−11.9 %) |
| hift_total | 17.96 ms | 16.30 ms | −1.7 ms (−9.4 %) |

Negative result documented (bug caught and fixed during dev)

First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held g_synth_caches_mu while calling cached_inv_alpha, which itself takes the same mutex internally — classic re-entrant deadlock. Fix: snapshot g_hift_inv_alpha_entries under the mutex into a local vector, then iterate without the lock. Inline comment in run_hift_decode documents the rule for future investigators: never hold a cache-state mutex while calling any other cached_* helper.
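
The snapshot-then-iterate fix can be sketched as follows. All names here are illustrative stand-ins (g_mu for g_synth_caches_mu, cached_inv_sketch for cached_inv_alpha), and integer ids replace tensor pointers. Locking a non-recursive std::mutex twice from the same thread is undefined behaviour and typically hangs, which matches the symptom described.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

static std::mutex g_mu;                                     // shared cache-state mutex
static std::vector<std::pair<std::string, int>> g_entries;  // (graph-input name, source id)
static std::map<int, float> g_inv;                          // memoised reciprocals

// Helper that takes g_mu internally, like the cached_* helpers.
float cached_inv_sketch(int id, float alpha) {
    std::lock_guard<std::mutex> lock(g_mu);
    auto it = g_inv.find(id);
    if (it == g_inv.end()) it = g_inv.emplace(id, 1.0f / alpha).first;
    return it->second;
}

// Copy the entry list under the lock, then call the helper with it released.
std::vector<float> refresh_inputs(const std::map<int, float>& alphas) {
    std::vector<std::pair<std::string, int>> snapshot;
    {
        std::lock_guard<std::mutex> lock(g_mu);
        snapshot = g_entries;
    }  // lock released here; holding it across the loop would self-deadlock
    std::vector<float> out;
    for (const auto& entry : snapshot)
        out.push_back(cached_inv_sketch(entry.second, alphas.at(entry.second)));
    return out;
}
```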


Bit-exactness — locked invariants

Locked F32 MD5s on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275:

| Test | Locked MD5 | Verification |
| --- | --- | --- |
| Multilingual single-shot (seed 42) | c65d98f15a59b8fe9cad98e46eb3fb30 | ✓ across 4 iters of regress-mtl-vk.sh verify |
| Multilingual 6-segment multi-synth | 0b374c7474895a3387b9f1df10b3c1b8 | ✓ across 4 iters |
| Turbo single-shot (seed 42) | 6219f4338b1b4fb9dc60481216153b49 | ✓ across 4 iters |

The test-first regression harness bench-logs-vk-mtl/regress-mtl-vk.sh (in the qvac monorepo, out-of-tree) locks a snapshot before any change, then runs verify after every cache addition. Every round-2 cache passed verify immediately after it was added.

Tensor-level Python ↔ C++ stage compare (added via the round-1 G2 fix) runs end-to-end through G2 / G3 / G4 / H1 / H3 / H4 / H5 with max rel err 7.92e-3 on STFT (expected — PyTorch torch.stft vs hand-built DFT-via-conv1d), max ≤ 4.7e-5 everywhere else, final waveform max_abs = 8.20e-08.
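
For readers unfamiliar with the two figures quoted, a minimal sketch of how such stage-compare metrics can be computed. The definition of relative error used here (largest absolute difference divided by the largest reference magnitude) is an assumption; the harness script may define it per element instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct CompareResult {
    double max_abs = 0.0;  // largest |ref - got|
    double max_rel = 0.0;  // max_abs / max |ref| (assumed definition)
};

// Compare a reference tensor dump with the tensor produced by the port.
CompareResult compare_tensors(const std::vector<float>& ref,
                              const std::vector<float>& got,
                              double eps = 1e-12) {
    assert(ref.size() == got.size());
    CompareResult r;
    double max_ref = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double a = static_cast<double>(ref[i]);
        const double b = static_cast<double>(got[i]);
        r.max_abs = std::max(r.max_abs, std::fabs(a - b));
        max_ref = std::max(max_ref, std::fabs(a));
    }
    // eps guards against an all-zero reference tensor.
    r.max_rel = r.max_abs / std::max(max_ref, eps);
    return r;
}
```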


How to validate

cd <chatterbox.cpp>

# 1. Apply the Metal + OpenCL + 2 new Vulkan patches
bash scripts/setup-ggml.sh

# 2. Build (Vulkan)
cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vk -j --target tts-cli test-s3gen

# 3. Bit-exactness (locked MD5 invariants — multilingual + Turbo)
#    Snapshot lives in inputFilesForAI/qvac-17872-findings/bench-logs-vk-mtl/
#    in the qvac monorepo; the harness verifies all 3 invariants in one shot.
bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-mtl/regress-mtl-vk.sh \
     build-vk verify-pr8 verify
# Expected:
#   PASS  multilingual single-shot  md5=c65d98f15a59b8fe9cad98e46eb3fb30
#   PASS  multilingual auto-split    md5=0b374c7474895a3387b9f1df10b3c1b8
#   PASS  turbo single-shot          md5=6219f4338b1b4fb9dc60481216153b49

# 4. Cold-start (round-1 VkPipelineCache patch)
rm -rf ~/.cache/ggml/vulkan
./build-vk/tts-cli ...   # first run: ~2.7 s cold
./build-vk/tts-cli ...   # second run: ~250 ms (ggml cache warm)

# 5. Multilingual perf — 6-segment auto-split (within-process warm caches)
./build-vk/tts-cli \
    --model models/chatterbox-t3-mtl-q4_0.gguf \
    --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0.gguf \
    --language en \
    --text "Hello from ggml first synthesis. Second synthesis run here now. \
            Third sentence here. Fourth sentence runs too. Fifth sentence wraps." \
    --max-sentence-chars 32 --out /tmp/mtl.wav \
    --n-gpu-layers 99 --threads 4 --seed 42 --temp 0 --top-k 1 --verbose
# Expected segments 2..6 average:
#   S3GEN_INFER ~141 ms, cfm_total ~119 ms, cfm_step0 ~13 ms, hift_total ~16 ms
# Compare a fresh upstream/main baseline build via the same recipe; multilingual
# WAV must produce IDENTICAL md5sum (PR makes synthesis faster, not different).

The full reproduction recipe (including how to build the upstream baseline in a separate git worktree for comparison) is in PROGRESS.md §3.32 ► "Round 2 ► Reproduction".


Risk assessment

  • Bit-exact preserving — verified across multilingual + Turbo invariants; verified across 4 successive iterations on the test-first harness.
  • Default output is unchanged; the caches are gated as follows:
    • Round-1 cache reads/writes from $XDG_CACHE_HOME/ggml/vulkan / $HOME/.cache/ggml/vulkan — opt-out via empty GGML_VK_PIPELINE_CACHE_DIR="".
    • All seven round-2 caches are always-on (no env var gate). They never change output, only avoid recomputing it.
  • No GGUF format change — existing chatterbox-s3gen-{turbo,mtl-q4_0}.gguf work as-is.
  • No public-API change: include/tts-cpp/chatterbox/*.h untouched.
  • Two ggml-vulkan patches (round-1 + round-2) shipped under patches/, applied in scripts/setup-ggml.sh (same vendoring model as the existing Metal + OpenCL patches).
  • Same cmake -DGGML_VULKAN=ON invocation as before — no new dependencies.
  • Memory cap: every cache is bounded by the number of distinct shape keys it sees across the process lifetime (typically 1-2 entries each). Steady-state per-process overhead: ~280 MB total (the bulk is the three 64 MB graph arenas). Streaming sessions with many distinct chunk sizes can grow these caches; a future LRU bound is documented in PROGRESS.md §3.32 as a deferred follow-up.
  • Teardown ordering: s3gen_release_synth_caches() runs before ggml_backend_free (same constraint as the pre-existing thread_local time_mlp_cache); registered via atexit() on first cache insertion + called explicitly from the cache-miss / s3gen_unload paths.
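
The teardown bullet can be illustrated with a small sketch, assuming an idempotent release function registered once with std::atexit. The names are hypothetical and the body only counts calls where real code would free graphs and allocators.

```cpp
#include <cassert>
#include <cstdlib>

static int g_release_calls = 0;     // counts teardown runs (illustration only)
static bool g_caches_live = false;  // true while any cache holds resources

// Idempotent release: safe to call explicitly and again from the atexit hook.
void release_caches_sketch() {
    if (!g_caches_live) return;
    g_caches_live = false;
    ++g_release_calls;  // real code would free graphs and allocators here
}

// Called on every cache insertion; registers the exit hook only the first time.
void note_cache_insertion() {
    // Function-local static initialisation runs once and is thread-safe.
    static const bool registered = (std::atexit(release_caches_sketch) == 0);
    (void)registered;
    g_caches_live = true;
}
```

Because std::atexit handlers run in reverse order of registration, the explicit calls from the unload paths remain the dependable way to release the caches before the backend is freed; the exit hook is only a fallback.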

Files

PROGRESS.md                                ~+540 / -10  (§3.32 entry: round-1 + verification + round-2)
src/chatterbox_tts.cpp                     ~+625 / -98  (round-1 + round-HIFT + round-2 graph + scaffolding caches)
patches/ggml-vulkan-pipeline-cache.patch    +199         (NEW)
patches/ggml-vulkan-eager-cache-save.patch  +104         (NEW)
scripts/dump-s3gen-reference.py             +65
scripts/setup-ggml.sh                       +20 / -8     (applies the two new Vulkan patches)
patches/README.md                           +13 / -8     (documents the new patches)
src/test_s3gen.cpp                          +6           (G2 set_output(xc) fix)
                                            -----------
Total: 8 files, ~+1500 / −115, 3 commits on upstream/main.

CHANGELOG.md deliberately not added (per PR #1 review feedback) — the investigation entry lives in PROGRESS.md §3.32 instead.

The inputFilesForAI/qvac-17872-findings/{FINDINGS,PR_DESCRIPTION,bench-logs-vk-mtl/}* companion docs stay in the qvac monorepo (out-of-tree) — same arrangement as the QVAC-18422 sister PR.


Deferred follow-ups (separate PRs)

| Candidate | Estimated multilingual win | Why deferred |
| --- | --- | --- |
| C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM) | ~125 MB device-memory + bandwidth-bound mobile win | Multilingual CFM is ~6× larger than Turbo, so this is the bandwidth-bound mobile lever. multilingual_merged's load_s3gen_gguf uses ggml_dup_tensor + ggml_backend_alloc_ctx_tensors, which differs from the main-base path our F16 conversion was written against. ~100 lines of adaptation + new locked MD5 baselines (NVIDIA + AMD, F32 + F16). |
| cooperative_matrix2 (CM2) tensor-core engagement for narrow CFM matmuls | −8.6 % cfm_total measured in the prior main-base CM2 Tier-3 close-out | Requires LunarG SDK 1.4.341.1 + glslc 2026.1; politically blocked behind a cmake flag pending project-wide baseline-set sign-off. See inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND_CM2.md. |
| Round-4 / 6 Q/K/V batched matmul fusion composition with multilingual_merged's zero-cont strided 3D Q/K/V views (849507a) | ~1.3 ms RTX 5090 + larger on bandwidth-starved targets | Pick-one-approach decision deferred; needs Vulkan flash_attn_ext stride-tolerance verification. |
| T3 step-graph cache (multilingual fires build_step_graph_mtl 2× per token via CFG) | ~12 % T3 wall reduction on synth #2+ in long-running processes | Covered by QVAC-18422 §3.35 on the CPU branch; same pattern would port. |
| Mobile validation (Adreno / Mali / Apple) | unknown (biggest remaining evidence gap) | Hardware-bound. AMD/RADV proxy refuted the original main-base mobile-bandwidth projections on rounds 2 / 3 / 5 / 6 / C1, so real mobile runs would either confirm or force revision. |
| CI integration of regress-mtl-vk.sh + the existing regress-c1.sh / regress-amd.sh / regress-tensor-compare.sh | n/a (test-infra) | Now unblocked since round-1 closed the G2 gap. Catches future regressions like the deadlock that surfaced (and was fixed) during round-2 dev. |

Zbig9000 and others added 2 commits May 6, 2026 14:55
Re-bases the closed PR GustavoA1604#1 work onto upstream/multilingual_merged
(was previously on upstream/main).  Addresses the PR GustavoA1604#1 review:

  1. Base is now multilingual_merged (was main).
  2. CHANGELOG.md dropped — investigation entry lives in
     PROGRESS.md §3.32 instead.
  3. Optimisations are model-agnostic by construction, so they
     benefit BOTH the Turbo (meanflow) and the multilingual
     (standard CFM with CFG) variants — see PROGRESS.md §3.32
     "Why this is model-agnostic by construction".

Two ggml-vulkan patches + four host-side optimisations in
src/chatterbox_tts.cpp.  All bit-exact on F32 across NVIDIA + AMD/
RADV.  No public-API change, no GGUF format change, no new
build-system requirement.

Round-coverage on top of multilingual_merged
---------------------------------------------

This squashed port carries only the optimisations that remain
measurable on the multilingual_merged base.  The full per-round
investigation (8 rounds + AMD validation + LunarG SDK / coopmat2
Tier-3 close-out) is preserved in the qvac monorepo at
inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND*.md and
PR_DESCRIPTION_FULL.md.

Carried forward (in this commit):

  * patches/ggml-vulkan-pipeline-cache.patch     (199 lines NEW)
    Persistent VkPipelineCache, opt-in via
    GGML_VK_PIPELINE_CACHE_DIR.  Recovers ~91 % of the cold→warm
    gap on the first warm run.

  * patches/ggml-vulkan-eager-cache-save.patch   (104 lines NEW)
    Crash-safe pipeline-cache flush, stacks on the first patch.

  * Persistent CFM estimator graph cache (g_cfm_estimator_cache)
    Was the last graph-builder still local-scope in
    s3gen_synthesize_to_wav.  cache.b2 flag handles the
    Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode switch.
    Per-step verbose: chunk 1 cfm_total=80 ms; chunks 2..16
    cfm_total=30 ms.  Also eliminates a latent process-exit crash
    risk (Vulkan dylib static-destructor ordering).

  * Time-embedding result memoisation (g_time_mlp_results,
    g_time_emb_results)
    Two-layer cache by t-value (Turbo + multilingual) and (t, r)
    pair (Turbo only).  6 graph submissions/inf → 0 for Turbo;
    9–19 → 0 for the multilingual 10-step cosine schedule.

  * CPU mirror cache for large per-synth weight downloads
    (g_weight_cpu_mirror)
    flow/input_embedding (~13.4 MB Turbo / ~28 MB multilingual)
    + spk_embed_affine/{w,b} were re-downloaded GPU→CPU on every
    synth.  Cleared on backend-swap and model-cache release.

  * 3 HiFT cont sites removed (perf-neutral, code quality)
    conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor
    xp permute.  All consumers tolerate strided sources.

  * G2 dump-script gap closure (regress-tensor-compare.sh now
    runs end-to-end through G2/G3/G4/H1/H3/H4/H5)
    cfm_concat / cfm_h_conv / cfm_h_ln / hift_s_stft .npy files
    now produced; ggml_set_output(xc) added to stage_G2 so the
    gallocator preserves the diagnostic intermediate.

Deferred (separate follow-ups):

  * C1 — F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM).
    multilingual_merged's load_s3gen_gguf uses
    ggml_dup_tensor + ggml_backend_alloc_ctx_tensors; needs
    ~100 lines adapting our F32→F16 conversion path + new MD5
    baselines (NVIDIA + AMD, F32 + F16).
  * Round-4 / 6 Q/K/V batched matmul fusion.
    multilingual_merged uses zero-cont strided 3D Q/K/V views
    (their 849507a) — alternative optimisation for the same code;
    composing them is non-trivial and needs Vulkan
    flash_attn_ext stride-tolerance verification.
  * HiFT decoder graph caching.
    multilingual_merged's run_hift_decode rebuilds gallocr_t +
    ctx fresh on every call (no g_hift_cache equivalent); same
    persistent-cache pattern would save another ~5–10 ms / chunk
    on the multilingual variant.
  * Multilingual GGUF cross-validation.
    May 4 measurement was on Turbo because the multilingual GGUF
    was not available locally then.  After QVAC-18422 §3.34's
    converter shipped chatterbox-s3gen-mtl-q4_0.gguf, this is a
    follow-up cross-check; by construction every cache should hit
    ≥ as often as on Turbo (multilingual has more distinct
    t-values per inference and a larger input_embedding).

Performance — RTX 5090, regress-tight aggregate, n=75 chunks, Turbo
-------------------------------------------------------------------

  metric        | upstream/multilingual_merged |  + this PR  |          Δ
  S3GEN_INFER   |                      76.6 ms |   65.4 ms   |  -11.2 ms (-14.6 %)
  cfm_total     |                      40.3 ms |   28.7 ms   |  -11.6 ms (-28.8 %)
  encoder       |                      19.9 ms |   20.7 ms   |  noise
  hift_decode   |                      10.9 ms |   11.6 ms   |  noise

cfm_total ranges fully separated on n=120 samples (base [38.3, 42.8]
vs final [27.1, 30.1]).  Smaller absolute saving than the original
upstream/main base measurement (~-45 ms / -41 % S3GEN_INFER) because
multilingual_merged already contains the zero-cont strided Q/K/V
views, the reduced 256 MB → 64 MB CFM buf, the thread_local
time_mlp_cache, and the dropped redundant gallocr_reserve in
HiFT/time_mlp — all of which originally contributed to the larger
headline number on the main base.

Bit-exactness
-------------

  * RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325: 3/3 F32 invariants
    PASS (round-1 single-shot WAV; round-2 multi-synth identical;
    round-3 multi-synth varied).
  * AMD iGPU (RADV RAPHAEL_MENDOCINO, Mesa 25.2.8): 3/3 F32
    invariants PASS.
  * F16 invariants are not in this commit (C1 deferred).
  * Tensor-level Python ↔ C++ stage compare runs end-to-end
    through G2/G3/G4/H1/H3/H4/H5; max relative error 7.92e-3 on
    STFT (PyTorch FFT vs hand-built DFT, expected; ISTFT
    roundtrip recovers to bit-exact); max ≤ 4.7e-5 elsewhere;
    final waveform max_abs = 8.20e-08.

Files
-----

  PROGRESS.md                                +297     (§3.32 entry)
  src/chatterbox_tts.cpp                     +212 / -19
  patches/ggml-vulkan-pipeline-cache.patch   +199     (NEW)
  patches/ggml-vulkan-eager-cache-save.patch +104     (NEW)
  scripts/dump-s3gen-reference.py            +65
  scripts/setup-ggml.sh                      +20 / -8
  patches/README.md                          +13 / -8
  src/test_s3gen.cpp                         +6
  Total                                      +890 / -22, 8 files

How to validate
---------------

  cd <chatterbox.cpp>
  bash scripts/setup-ggml.sh   # applies Metal + OpenCL + 2 Vulkan patches
  cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
  cmake --build build-vk -j --target tts-cli test-s3gen

  # Cold start (ggml-vulkan-pipeline-cache.patch)
  rm -rf ~/.cache/ggml/vulkan
  ./build-vk/tts-cli ...   # first run: ~2.7 s cold
  ./build-vk/tts-cli ...   # second run: ~250 ms (ggml cache warm)

  # Bit-exactness (3 F32 invariants from the qvac monorepo harness)
  bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-c1/regress-c1.sh build-vk 1
  VK_LOADER_DRIVERS_SELECT='radeon_icd*' \
      bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-amd/regress-amd.sh build-vk 1

  # Aggregate perf
  bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-round3/regress-tight.sh build-vk mtl-final 5
  # Expected: S3GEN_INFER ~65 ms, cfm_total ~29 ms, n=75
  # vs upstream/multilingual_merged baseline: S3GEN_INFER ~77 ms, cfm_total ~40 ms

Co-authored-by: Cursor <cursoragent@cursor.com>
… Vulkan

Closes the multilingual-applicability gap that the May 4 squashed
port (commit ac4748a) left open.  The May 4 measurement was on
Turbo only because the multilingual GGUF was not available
locally then; after QVAC-18422 §3.34's converter shipped
chatterbox-s3gen-mtl-q4_0.gguf (788 MB) and
chatterbox-t3-mtl-q4_0.gguf (345 MB), the actual multilingual
verification is now feasible.

Test methodology
----------------

Six-segment auto-split via --max-sentence-chars 32 (the
multilingual T3 GGUF doesn't embed the tokenizer needed for the
--input-file streaming pattern; --max-sentence-chars triggers
multiple within-process synth calls which is what the persistent
host caches actually need to fire).  Three iterations × five
warm-state segments = n=15 samples per build.

Comparison build: a fresh upstream/multilingual_merged HEAD
(b074399) worktree at /tmp/cb-base-mtl-merged with only the
Metal + OpenCL patches applied (NOT the two new Vulkan patches
in this PR).  Both builds use the same vendored ggml commit
58c38058 and the same Vulkan 1.3.275 / RTX 5090 + NVIDIA 590.48
host.

Bit-exactness — first locked multilingual F32 invariants
--------------------------------------------------------

Both single-shot and 6-segment multi-synth produce byte-identical
multilingual WAV vs the upstream/multilingual_merged baseline:

  Single-shot (seed 42, --temp 0):      c65d98f15a59b8fe9cad98e46eb3fb30
  Multi-synth 6 segments (seed 42):     0b374c7474895a3387b9f1df10b3c1b8

These are the FIRST locked multilingual F32 invariants for the
Vulkan path on the multilingual_merged base (the previously
locked RTX 5090 invariants in regress-c1.sh were captured against
the older main-base branch and don't apply to this base).

Performance — RTX 5090, n=15 warm-state samples per build
---------------------------------------------------------

  metric        | upstream/mtl_merged | this PR  |          Δ
  S3GEN_INFER   |           169.9 ms  | 153.7 ms |  -16.2 ms (-9.5 %)
  cfm_total     |           132.5 ms  | 114.7 ms |  -17.8 ms (-13.4 %)
  cfm_step0     |            24.1 ms  |  12.6 ms |  -11.5 ms (-47.7 %)

cfm_step0 is the strongest multilingual signal: the persistent
CFM estimator graph cache eliminates ~half of the per-segment
graph-rebuild cost on warm-state synth.  The -9.5 % S3GEN_INFER
win is below the Turbo wins because:

  1. Multilingual CFM is ~6× larger in absolute terms (more
     layers, larger hidden dims, default 10-step cosine schedule
     vs Turbo's 2-step meanflow), so the cached host overhead
     is a smaller fraction of the wall.
  2. The multilingual baseline absorbs more per-synth fixed cost
     than Turbo does — multilingual hits compute_time_mlp 10
     times per inference but each time only touches a tiny
     graph; the cached CFM estimator graph matters more.

First-segment cold cost
-----------------------

Within a single process, the first segment pays a one-time
cache-warm-up overhead: PR 210-236 ms vs baseline 195-241 ms (no
statistically significant first-segment penalty given run-to-run
variance).  Subsequent segments are where the caches actually
pay off and the win is consistently visible.

Across processes, the persistent VkPipelineCache patch (round-1)
collapses the cold-process startup: cfm_step0 on a fresh process
drops from ~133 ms (no cache, full shader compile) to ~30 ms
(cache hit) — the headline mobile / Mesa win.

Files: PROGRESS.md +125 / -6 lines.

No source-code changes — this commit is purely the verification
write-up that confirms the May 4 port's optimisations work
correctly and meaningfully on the multilingual model on Vulkan,
exactly as predicted by the "model-agnostic by construction"
analysis in PROGRESS.md §3.32.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000 Zbig9000 force-pushed the chatterbox-QVAC-17872-TTS-GGML-Optimize-cpp-backend-multilingual-for-Vulkan branch from ac4748a to 5084ee4 Compare May 6, 2026 12:55
… + scaffolding caches (multilingual Vulkan)

Targets the per-synth host-CPU overhead that round 1 / round-HIFT
didn't address, on top of upstream/multilingual_merged (now in main
via PR GustavoA1604#7).  Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the
qvac monorepo locks the pre-change MD5 baseline, then re-verifies
after every cache.  All 3 invariants (multilingual single-shot,
multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact.

Seven new caches
----------------

All host-side, model-agnostic, no GGUF-format change, no public-API
change.  Same teardown discipline as the existing g_cfm_estimator_cache
(destroy() before ggml_backend_free).  Sit alongside the existing
round-1 caches.

  - g_encoder_graph_cache (keyed on T): full run_encoder graph +
    gallocator.  Streaming chunks of varying length still produce
    correct output (rebuilds on key change).

  - g_hift_graph_cache (keyed on pack(T_mel, T_stft)) +
    g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator.
    Parallel (graph-input-name, source-tensor-ptr) metadata lets
    cache hits re-feed each alpha-input slot from g_inv_alpha_results
    without rebuilding the graph.

  - g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph +
    gallocator.

  - cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)):
    compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired
    twice per encoder run (T and 2T).  Multilingual T~350+ at D=512
    is a real wedge of per-synth host time.

  - cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*):
    HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6
    alpha tensors); each is a tensor_get + per-element reciprocal.
    Alpha tensors are constant for the model lifetime.

  - cached_hann_window / cached_istft_kernel (g_hann_window_cache /
    g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft
    (constant 16 in the chatterbox HiFT path).

  - cached_window_sum (g_window_sum_cache, keyed on
    pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across
    same-shape synth calls.

A new graph_cache struct (used by encoder / HiFT / F0) and a
pack_hift_key helper centralise the explicit destroy()-on-teardown
pattern so future per-stage caches can plug in with one struct + one
mutex acquisition.  The destroy path is unified into a renamed
s3gen_release_synth_caches() (replaces the old
g_cfm_estimator_cache_destroy()), called from
s3gen_model_cache_release, the cache-miss backend-swap path, and the
explicit s3gen_unload().

Negative result documented (bug caught and fixed during dev)
------------------------------------------------------------

First implementation of the HiFT cache hung indefinitely on the very
first synth call.  Root cause: the alpha-input refresh loop held
g_synth_caches_mu while calling cached_inv_alpha, which itself takes
the same mutex internally — classic re-entrant deadlock.  Fix:
snapshot g_hift_inv_alpha_entries under the mutex into a local vector,
then iterate without the lock (cached_inv_alpha re-acquires the mutex
per call but with no nesting).  General rule kept as an inline comment:
never hold a cache-state mutex while calling any other cached_* helper.

Performance — RTX 5090, multilingual auto-split, warm-state seg 2..6
-------------------------------------------------------------------

Within-process win on top of round 1 + round-HIFT:

  metric        | pre-round-2 |  post-round-2  |          Δ
  S3GEN_INFER   |    159.8 ms |    140.8 ms    |  -19.0 ms (-11.9 %)
  cfm_total     |    122.2 ms |    118.7 ms    |   -3.5 ms (-2.9 %)
  cfm_step0     |     13.24 ms|     13.18 ms   |   noise (already cached round 1)
  hift_total    |     17.96 ms|     16.30 ms   |   -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline
(round 1 + round-HIFT + round 2):

  metric        | upstream/mtl_merged |  this PR (full) |          Δ
  S3GEN_INFER   |          169.9 ms   |     140.8 ms    |  -29.1 ms (-17.1 %)
  cfm_total     |          132.5 ms   |     118.7 ms    |  -13.8 ms (-10.4 %)
  cfm_step0     |           24.1 ms   |      13.2 ms    |  -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is
the actual GPU CFM compute — not host-cacheable; would need
shader-side optimisation (e.g. tensor-core engagement via
cooperative_matrix2; deferred — see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------

Locked invariants pass byte-for-byte vs the pre-change baseline:

  Multilingual single-shot      c65d98f15a59b8fe9cad98e46eb3fb30  ✓
  Multilingual 6-segment multi  0b374c7474895a3387b9f1df10b3c1b8  ✓
  Turbo single-shot             6219f4338b1b4fb9dc60481216153b49  ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48
+ Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac
monorepo is the test-first harness.

Files
-----

  src/chatterbox_tts.cpp         +373 / -79 (net diff vs round-1 head)
  PROGRESS.md                    §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache
infrastructure: graph_cache struct, seven new globals, the
s3gen_release_synth_caches lifecycle hook, the five cached_*
scaffolding helpers, and the build_graph / cache-hit branches in
run_encoder / run_hift_decode / run_f0_predictor.

Co-authored-by: Cursor <cursoragent@cursor.com>
@GustavoA1604 GustavoA1604 merged commit 1cc7dae into GustavoA1604:main May 6, 2026
GustavoA1604 added a commit that referenced this pull request May 6, 2026
PR #8 (QVAC-17872 round-HIFT) dropped the trailing ggml_cont on `xp`
before ggml_mul_mat in run_f0_predictor on the rationale that
"Vulkan / Metal / CUDA mul_mat shaders all iterate by stride and
accept strided src1 for f32 matmul".  That holds for those GPU
backends but ggml-cpu's mul_mat enforces
nb10 == ggml_type_size(src1->type), so the bare permute aborts the
process on CPU during HiFT decode (visible as 4x repeated
GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed when running
tts-cli on CPU with --threads 4 against any Chatterbox GGUF; the
parity test test-cpu-caches reproduces the same crash on its
warm-cache lifecycle pass).

The other two cont removals from PR #8 (conv_transpose_1d_f32 exit
into ggml_add, ISTFT y_trim into ggml_clamp) feed element-wise ops
that DO accept strided sources on every backend, so they stay
removed.  Only the f0_predictor site ever feeds a permuted tensor
into mul_mat src1, so restoring an unconditional ggml_cont there is
the minimal fix.
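
The stride condition behind the assert can be illustrated without ggml. The sketch below is an assumption-laden model: `view2d`, `permute01`, `cont`, and `cpu_mul_mat_accepts_src1` are hypothetical names that only mimic how a permute swaps byte strides and how a cont restores a dense layout; they are not ggml API.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical 2-D f32 tensor view: ne = element counts, nb = byte strides.
struct view2d {
    std::array<std::size_t, 2> ne;
    std::array<std::size_t, 2> nb;
};

// Dense row-major layout: dim 0 is packed, so nb[0] == sizeof(float).
inline view2d make_contiguous_f32(std::size_t ne0, std::size_t ne1) {
    return {{ne0, ne1}, {sizeof(float), ne0 * sizeof(float)}};
}

// Permute swaps the two dims without moving data, so the strides swap too.
inline view2d permute01(view2d v) {
    std::swap(v.ne[0], v.ne[1]);
    std::swap(v.nb[0], v.nb[1]);
    return v;
}

// Cont materialises the view: same shape, fresh dense strides.
inline view2d cont(const view2d & v) {
    return make_contiguous_f32(v.ne[0], v.ne[1]);
}

// Models the CPU-side check nb10 == ggml_type_size(src1->type) for f32.
inline bool cpu_mul_mat_accepts_src1(const view2d & src1) {
    return src1.nb[0] == sizeof(float);
}
```

In this model a bare `permute01` leaves `nb[0]` equal to the old row stride, which is what trips the check; wrapping it in `cont` makes dim 0 packed again.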

Validated locally on Windows / MSVC / qvac-ext-ggml/speech:
chatterbox-t3-turbo + chatterbox-s3gen run end-to-end on CPU,
T3 1817 ms, S3Gen 2061 ms, no asserts.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 pushed a commit to GustavoA1604/qvac-ext-ggml that referenced this pull request May 6, 2026
…INE_CACHE_DIR

Adds an opt-in persistent shader cache to ggml-vulkan.  Enabled only
when the caller sets GGML_VK_PIPELINE_CACHE_DIR to a non-empty path;
when unset or empty, behaviour is byte-identical to upstream ggml-vulkan.

No auto-discovery of $XDG_CACHE_HOME or $HOME.  ggml is a library
distributed through package managers (vcpkg) and consumed by
applications that should decide whether and where to persist Vulkan
artefacts.  Writing to the user's home directory without being asked
is a side effect library consumers cannot see from the API surface.

When enabled, createPipelineCache is seeded from the cached file at
init, and the blob returned by getPipelineCacheData is written back
from ggml_vk_cleanup() (not ~vk_device_struct, which is unreliable at
process exit due to shared_ptr ref cycles).  The file is keyed on
vendorID/deviceID/driverVersion; Vulkan validates the blob header and
silently ignores stale data if the shader bundle or driver changed.
The save is atomic via tmp+rename.
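
A sketch of the keyed file name and the tmp+rename save, assuming C++17 std::filesystem. The helper names `cache_file_name` and `atomic_save`, the zero-padded hex layout, and the `.bin` / `.tmp` suffixes are illustrative assumptions, not the patch's actual code; the Vulkan calls themselves are left out.

```cpp
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical naming scheme: one file per vendorID/deviceID/driverVersion triple.
inline std::string cache_file_name(std::uint32_t vendor_id, std::uint32_t device_id,
                                   std::uint32_t driver_version) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%08x-%08x-%08x.bin",
                  static_cast<unsigned>(vendor_id),
                  static_cast<unsigned>(device_id),
                  static_cast<unsigned>(driver_version));
    return buf;
}

// Write the blob to "<dst>.tmp", then rename over dst, so a reader never
// observes a partially written cache file.
inline bool atomic_save(const fs::path & dst, const std::vector<char> & blob) {
    fs::path tmp = dst;
    tmp += ".tmp";
    std::error_code ec;
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(blob.data(), static_cast<std::streamsize>(blob.size()));
        out.close();
        if (!out) { fs::remove(tmp, ec); return false; }
    }
    fs::rename(tmp, dst, ec);  // replaces an existing dst
    if (ec) { fs::remove(tmp, ec); return false; }
    return true;
}
```

A caller would join the directory taken from GGML_VK_PIPELINE_CACHE_DIR with `cache_file_name(...)` and pass the bytes obtained from the pipeline cache to `atomic_save`.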

Recovers ~91% of the cold->warm shader-compile gap on the first warm
run on drivers without an aggressive per-app system cache (Mesa/RADV,
Android Adreno/Mali, fresh NVIDIA installs, containers).

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8
(QVAC-17872, round-1).

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 pushed a commit to GustavoA1604/qvac-ext-ggml that referenced this pull request May 6, 2026
Stacks on the previous patch.  Writes back the on-disk pipeline-cache
blob after every ggml_vk_load_shaders compile batch instead of only at
ggml_vk_cleanup() time, so a process killed mid-graph (SIGKILL,
abort, OS shutdown) doesn't lose the freshly compiled pipelines.

Adds pipeline_cache_last_size book-keeping so warm runs short-circuit
the disk write: the eager path only flushes when the cache actually
grew (blob.size() > last_size), and the cleanup path skips when size
matches last_size.  This avoids the +90 ms wall-clock regression
measured during development when the flush was unconditional.

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8
(QVAC-17872, round-2).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000 Zbig9000 deleted the chatterbox-QVAC-17872-TTS-GGML-Optimize-cpp-backend-multilingual-for-Vulkan branch May 7, 2026 07:57
GustavoA1604 added a commit that referenced this pull request May 7, 2026
Removes references to internal QVAC ticket numbers, fork-PR numbers,
and 'round-HIFT / round 2 / round 4 / round 5 / (this PR)' phase
markers that document the development history rather than the code's
behaviour.  Tightens the surrounding prose so each comment reads as
'what this code does and why' instead of 'how we got here'.

Specific edits:

- src/chatterbox_tts.cpp: 18 comment blocks rewritten.  The big
  CPU-side persistent-cache header at line 446 now describes the
  caches (a..j) as one homogeneous set instead of 'Round 1' +
  'Round 2'.  The PR #8 vs CPU-correctness explanation around the
  f0_predictor ggml_cont keeps the technical rationale (CPU mul_mat
  asserts on strided src1, GPU shaders accept it) but drops the
  'PR #8 / QVAC-17872 round-HIFT optimised for' prefix.
- src/t3_mtl.cpp: 5 comments around the T3 step-graph cache.
- src/chatterbox_engine.cpp + src/chatterbox_cli.cpp: 3 'drop the
  T3 step-graph cache before backend free' comments.
- src/chatterbox_tts_test_hooks.h: 2 references rewritten as a more
  generic 'persistent-cache work for the CPU-side multilingual TTS
  path; see PROGRESS.md for design notes' framing.
- CMakeLists.txt: 2 test-registration comment annotations.

Vendored upstream content (src/dr_wav.h's own changelog dates) is
untouched.

Pure comment-only change; rebuilt tts-cpp under MSVC Release with no
new warnings or errors.  No source code, public API, or build
behaviour changes.

Co-authored-by: Cursor <cursoragent@cursor.com>