QVAC-17872 [TTS GGML] Optimize cpp backend multilingual model for Vulkan #8
Merged
Re-bases the closed PR GustavoA1604#1 work onto upstream/multilingual_merged (was previously on upstream/main). Addresses the PR GustavoA1604#1 review: 1. Base is now multilingual_merged (was main). 2. CHANGELOG.md dropped — investigation entry lives in PROGRESS.md §3.32 instead. 3. Optimisations are model-agnostic by construction, so they benefit BOTH the Turbo (meanflow) and the multilingual (standard CFM with CFG) variants — see PROGRESS.md §3.32 "Why this is model-agnostic by construction". Two ggml-vulkan patches + four host-side optimisations in src/chatterbox_tts.cpp. All bit-exact on F32 across NVIDIA + AMD/ RADV. No public-API change, no GGUF format change, no new build-system requirement. Round-coverage on top of multilingual_merged --------------------------------------------- This squashed port carries only the optimisations that remain measurable on the multilingual_merged base. The full per-round investigation (8 rounds + AMD validation + LunarG SDK / coopmat2 Tier-3 close-out) is preserved in the qvac monorepo at inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND*.md and PR_DESCRIPTION_FULL.md. Carried forward (in this commit): * patches/ggml-vulkan-pipeline-cache.patch (199 lines NEW) Persistent VkPipelineCache, opt-in via GGML_VK_PIPELINE_CACHE_DIR. Recovers ~91 % of the cold→warm gap on the first warm run. * patches/ggml-vulkan-eager-cache-save.patch (104 lines NEW) Crash-safe pipeline-cache flush, stacks on the first patch. * Persistent CFM estimator graph cache (g_cfm_estimator_cache) Was the last graph-builder still local-scope in s3gen_synthesize_to_wav. cache.b2 flag handles the Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode switch. Per-step verbose: chunk 1 cfm_total=80 ms; chunks 2..16 cfm_total=30 ms. Also eliminates a latent process-exit crash risk (Vulkan dylib static-destructor ordering). 
* Time-embedding result memoisation (g_time_mlp_results, g_time_emb_results) Two-layer cache by t-value (Turbo + multilingual) and (t, r) pair (Turbo only). 6 graph submissions/inf → 0 for Turbo; 9–19 → 0 for the multilingual 10-step cosine schedule. * CPU mirror cache for large per-synth weight downloads (g_weight_cpu_mirror) flow/input_embedding (~13.4 MB Turbo / ~28 MB multilingual) + spk_embed_affine/{w,b} were re-downloaded GPU→CPU on every synth. Cleared on backend-swap and model-cache release. * 3 HiFT cont sites removed (perf-neutral, code quality) conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp permute. All consumers tolerate strided sources. * G2 dump-script gap closure (regress-tensor-compare.sh now runs end-to-end through G2/G3/G4/H1/H3/H4/H5) cfm_concat / cfm_h_conv / cfm_h_ln / hift_s_stft .npy files now produced; ggml_set_output(xc) added to stage_G2 so the gallocator preserves the diagnostic intermediate. Deferred (separate follow-ups): * C1 — F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM). multilingual_merged's load_s3gen_gguf uses ggml_dup_tensor + ggml_backend_alloc_ctx_tensors; needs ~100 lines adapting our F32→F16 conversion path + new MD5 baselines (NVIDIA + AMD, F32 + F16). * Round-4 / 6 Q/K/V batched matmul fusion. multilingual_merged uses zero-cont strided 3D Q/K/V views (their 849507a) — alternative optimisation for the same code; composing them is non-trivial and needs Vulkan flash_attn_ext stride-tolerance verification. * HiFT decoder graph caching. multilingual_merged's run_hift_decode rebuilds gallocr_t + ctx fresh on every call (no g_hift_cache equivalent); same persistent-cache pattern would save another ~5–10 ms / chunk on the multilingual variant. * Multilingual GGUF cross-validation. May 4 measurement was on Turbo because the multilingual GGUF was not available locally then. 
After QVAC-18422 §3.34's converter shipped chatterbox-s3gen-mtl-q4_0.gguf, this is a follow-up cross-check; by construction every cache should hit ≥ as often as on Turbo (multilingual has more distinct t-values per inference and a larger input_embedding). Performance — RTX 5090, regress-tight aggregate, n=75 chunks, Turbo ------------------------------------------------------------------- metric | upstream/multilingual_merged | + this PR | Δ S3GEN_INFER | 76.6 ms | 65.4 ms | -11.2 ms (-14.6 %) cfm_total | 40.3 ms | 28.7 ms | -11.6 ms (-28.8 %) encoder | 19.9 ms | 20.7 ms | noise hift_decode | 10.9 ms | 11.6 ms | noise cfm_total ranges fully separated on n=120 samples (base [38.3, 42.8] vs final [27.1, 30.1]). Smaller absolute saving than the original upstream/main base measurement (~-45 ms / -41 % S3GEN_INFER) because multilingual_merged already contains the zero-cont strided Q/K/V views, the reduced 256 MB → 64 MB CFM buf, the thread_local time_mlp_cache, and the dropped redundant gallocr_reserve in HiFT/time_mlp — all of which originally contributed to the larger headline number on the main base. Bit-exactness ------------- * RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325: 3/3 F32 invariants PASS (round-1 single-shot WAV; round-2 multi-synth identical; round-3 multi-synth varied). * AMD iGPU (RADV RAPHAEL_MENDOCINO, Mesa 25.2.8): 3/3 F32 invariants PASS. * F16 invariants are not in this commit (C1 deferred). * Tensor-level Python ↔ C++ stage compare runs end-to-end through G2/G3/G4/H1/H3/H4/H5; max relative error 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected; ISTFT roundtrip recovers to bit-exact); max ≤ 4.7e-5 elsewhere; final waveform max_abs = 8.20e-08. 
Files
-----
PROGRESS.md                                  +297        (§3.32 entry)
src/chatterbox_tts.cpp                       +212 / -19
patches/ggml-vulkan-pipeline-cache.patch     +199        (NEW)
patches/ggml-vulkan-eager-cache-save.patch   +104        (NEW)
scripts/dump-s3gen-reference.py              +65
scripts/setup-ggml.sh                        +20 / -8
patches/README.md                            +13 / -8
src/test_s3gen.cpp                           +6
Total                                        +890 / -22, 8 files

How to validate
---------------
    cd <chatterbox.cpp>
    bash scripts/setup-ggml.sh   # applies Metal + OpenCL + 2 Vulkan patches
    cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
    cmake --build build-vk -j --target tts-cli test-s3gen

    # Cold start (ggml-vulkan-pipeline-cache.patch)
    rm -rf ~/.cache/ggml/vulkan
    ./build-vk/tts-cli ...   # first run: ~2.7 s cold
    ./build-vk/tts-cli ...   # second run: ~250 ms (ggml cache warm)

    # Bit-exactness (3 F32 invariants from the qvac monorepo harness)
    bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-c1/regress-c1.sh build-vk 1
    VK_LOADER_DRIVERS_SELECT='radeon_icd*' \
      bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-amd/regress-amd.sh build-vk 1

    # Aggregate perf
    bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-round3/regress-tight.sh build-vk mtl-final 5
    # Expected: S3GEN_INFER ~65 ms, cfm_total ~29 ms, n=75
    # vs upstream/multilingual_merged baseline: S3GEN_INFER ~77 ms, cfm_total ~40 ms

Co-authored-by: Cursor <cursoragent@cursor.com>
… Vulkan Closes the multilingual-applicability gap that the May 4 squashed port (commit ac4748a) left open. The May 4 measurement was on Turbo only because the multilingual GGUF was not available locally then; after QVAC-18422 §3.34's converter shipped chatterbox-s3gen-mtl-q4_0.gguf (788 MB) and chatterbox-t3-mtl-q4_0.gguf (345 MB), the actual multilingual verification is now feasible. Test methodology ---------------- Six-segment auto-split via --max-sentence-chars 32 (the multilingual T3 GGUF doesn't embed the tokenizer needed for the --input-file streaming pattern; --max-sentence-chars triggers multiple within-process synth calls which is what the persistent host caches actually need to fire). Three iterations × five warm-state segments = n=15 samples per build. Comparison build: a fresh upstream/multilingual_merged HEAD (b074399) worktree at /tmp/cb-base-mtl-merged with only the Metal + OpenCL patches applied (NOT the two new Vulkan patches in this PR). Both builds use the same vendored ggml commit 58c38058 and the same Vulkan 1.3.275 / RTX 5090 + NVIDIA 590.48 host. Bit-exactness — first locked multilingual F32 invariants -------------------------------------------------------- Both single-shot and 6-segment multi-synth produce byte-identical multilingual WAV vs the upstream/multilingual_merged baseline: Single-shot (seed 42, --temp 0): c65d98f15a59b8fe9cad98e46eb3fb30 Multi-synth 6 segments (seed 42): 0b374c7474895a3387b9f1df10b3c1b8 These are the FIRST locked multilingual F32 invariants for the Vulkan path on the multilingual_merged base (the previously locked RTX 5090 invariants in regress-c1.sh were captured against the older main-base branch and don't apply to this base). 
Performance: RTX 5090, n=15 warm-state samples per build
--------------------------------------------------------
metric      | upstream/mtl_merged | this PR  | Δ
S3GEN_INFER | 169.9 ms            | 153.7 ms | -16.2 ms (-9.5 %)
cfm_total   | 132.5 ms            | 114.7 ms | -17.8 ms (-13.4 %)
cfm_step0   | 24.1 ms             | 12.6 ms  | -11.5 ms (-47.7 %)

cfm_step0 is the strongest multilingual signal: the persistent CFM estimator graph cache eliminates ~half of the per-segment graph-rebuild cost on warm-state synth.

The -9.5 % S3GEN_INFER win is below the Turbo wins because:

1. Multilingual CFM is ~6× larger in absolute terms (more layers, larger hidden dims, default 10-step cosine schedule vs Turbo's 2-step meanflow), so the cached host overhead is a smaller fraction of the wall.
2. The multilingual baseline absorbs more per-synth fixed cost than Turbo does: multilingual hits compute_time_mlp 10 times per inference but each time only touches a tiny graph; the cached CFM estimator graph matters more.

First-segment cold cost
-----------------------
Within a single process, the first segment pays a one-time cache-warm-up overhead: PR 210-236 ms vs baseline 195-241 ms (no statistically significant first-segment penalty given run-to-run variance). Subsequent segments are where the caches actually pay off and the win is consistently visible.

Across processes, the persistent VkPipelineCache patch (round-1) collapses the cold-process startup: cfm_step0 on a fresh process drops from ~133 ms (no cache, full shader compile) to ~30 ms (cache hit), the headline mobile / Mesa win.

Files: PROGRESS.md +125 / -6 lines. No source-code changes: this commit is purely the verification write-up that confirms the May 4 port's optimisations work correctly and meaningfully on the multilingual model on Vulkan, exactly as predicted by the "model-agnostic by construction" analysis in PROGRESS.md §3.32.

Co-authored-by: Cursor <cursoragent@cursor.com>
ac4748a to
5084ee4
… + scaffolding caches (multilingual Vulkan) Targets the per-synth host-CPU overhead that round 1 / round-HIFT didn't address, on top of upstream/multilingual_merged (now in main via PR GustavoA1604#7). Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo locks the pre-change MD5 baseline, then re-verifies after every cache. All 3 invariants (multilingual single-shot, multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact. Seven new caches ---------------- All host-side, model-agnostic, no GGUF-format change, no public-API change. Same teardown discipline as the existing g_cfm_estimator_cache (destroy() before ggml_backend_free). Sit alongside the existing round-1 caches. - g_encoder_graph_cache (keyed on T): full run_encoder graph + gallocator. Streaming chunks of varying length still produce correct output (rebuilds on key change). - g_hift_graph_cache (keyed on pack(T_mel, T_stft)) + g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator. Parallel (graph-input-name, source-tensor-ptr) metadata lets cache hits re-feed each alpha-input slot from g_inv_alpha_results without rebuilding the graph. - g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph + gallocator. - cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)): compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired twice per encoder run (T and 2T). Multilingual T~350+ at D=512 is a real wedge of per-synth host time. - cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*): HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6 alpha tensors); each is a tensor_get + per-element reciprocal. Alpha tensors are constant for the model lifetime. - cached_hann_window / cached_istft_kernel (g_hann_window_cache / g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft (constant 16 in the chatterbox HiFT path). 
- cached_window_sum (g_window_sum_cache, keyed on pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across same-shape synth calls. A new graph_cache struct (used by encoder / HiFT / F0) and a pack_hift_key helper centralise the explicit destroy()-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition. The destroy path is unified into a renamed s3gen_release_synth_caches() (replaces the old g_cfm_estimator_cache_destroy()), called from s3gen_model_cache_release, the cache-miss backend-swap path, and the explicit s3gen_unload(). Negative result documented (bug caught and fixed during dev) ------------------------------------------------------------ First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held g_synth_caches_mu while calling cached_inv_alpha, which itself takes the same mutex internally — classic re-entrant deadlock. Fix: snapshot g_hift_inv_alpha_entries under the mutex into a local vector, then iterate without the lock (cached_inv_alpha re-acquires the mutex per call but with no nesting). General rule kept as an inline comment: never hold a cache-state mutex while calling any other cached_* helper. 
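The deadlock fix described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in src/chatterbox_tts.cpp: every name with a `_sketch` suffix is hypothetical, the "alpha tensor" is reduced to an integer id, and the reciprocal stands in for the real `tensor_get` + per-element work. The point it shows is the ordering: copy the entry list while holding the mutex, release it, and only then call the memoised helper that locks the same mutex itself.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the cache-state mutex and the two caches.
static std::mutex g_caches_mu_sketch;
static std::vector<std::pair<std::string, int>> g_alpha_entries_sketch;  // (graph input name, source id)
static std::unordered_map<int, float> g_inv_alpha_results_sketch;

// Memoised helper that takes the cache mutex internally.
float cached_inv_alpha_sketch(int source_id) {
    std::lock_guard<std::mutex> lock(g_caches_mu_sketch);
    auto it = g_inv_alpha_results_sketch.find(source_id);
    if (it == g_inv_alpha_results_sketch.end()) {
        // Stand-in for the real download + reciprocal computation.
        it = g_inv_alpha_results_sketch
                 .emplace(source_id, 1.0f / static_cast<float>(source_id))
                 .first;
    }
    return it->second;
}

// Refresh loop: snapshot under the lock, then iterate WITHOUT the lock.
// Calling cached_inv_alpha_sketch while still holding g_caches_mu_sketch
// would try to lock a non-recursive std::mutex twice on the same thread.
std::vector<float> refresh_alpha_inputs_sketch() {
    std::vector<std::pair<std::string, int>> snapshot;
    {
        std::lock_guard<std::mutex> lock(g_caches_mu_sketch);
        snapshot = g_alpha_entries_sketch;
    }  // mutex released here

    std::vector<float> fed;
    fed.reserve(snapshot.size());
    for (const auto & entry : snapshot) {
        fed.push_back(cached_inv_alpha_sketch(entry.second));  // locks per call, no nesting
    }
    return fed;
}
```

The snapshot costs one vector copy per cache hit, which is small next to a graph rebuild; a `std::recursive_mutex` would also avoid the hang but would hide the nesting rather than remove it.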
Performance: RTX 5090, multilingual auto-split, warm-state seg 2..6
-------------------------------------------------------------------
Within-process win on top of round 1 + round-HIFT:

metric      | pre-round-2 | post-round-2 | Δ
S3GEN_INFER | 159.8 ms    | 140.8 ms     | -19.0 ms (-11.9 %)
cfm_total   | 122.2 ms    | 118.7 ms     | -3.5 ms (-2.9 %)
cfm_step0   | 13.24 ms    | 13.18 ms     | noise (already cached round 1)
hift_total  | 17.96 ms    | 16.30 ms     | -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline (round 1 + round-HIFT + round 2):

metric      | upstream/mtl_merged | this PR (full) | Δ
S3GEN_INFER | 169.9 ms            | 140.8 ms       | -29.1 ms (-17.1 %)
cfm_total   | 132.5 ms            | 118.7 ms       | -13.8 ms (-10.4 %)
cfm_step0   | 24.1 ms             | 13.2 ms        | -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is the actual GPU CFM compute, which is not host-cacheable; it would need shader-side optimisation (e.g. tensor-core engagement via cooperative_matrix2; deferred, see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------
Locked invariants pass byte-for-byte vs the pre-change baseline:

Multilingual single-shot      c65d98f15a59b8fe9cad98e46eb3fb30 ✓
Multilingual 6-segment multi  0b374c7474895a3387b9f1df10b3c1b8 ✓
Turbo single-shot             6219f4338b1b4fb9dc60481216153b49 ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo is the test-first harness.

Files
-----
src/chatterbox_tts.cpp   +373 / -79 (net diff vs round-1 head)
PROGRESS.md              §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache infrastructure: graph_cache struct, seven new globals, the s3gen_release_synth_caches lifecycle hook, the five cached_* scaffolding helpers, and the build_graph / cache-hit branches in run_encoder / run_hift_decode / run_f0_predictor.

Co-authored-by: Cursor <cursoragent@cursor.com>
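As a rough picture of the "one struct per stage" pattern the commit message names, here is a minimal single-slot cache that rebuilds only when its shape key changes and exposes an explicit `destroy()`. It is a sketch under assumptions: the struct layout, `pack_key_sketch`, and `build_graph_sketch` are hypothetical, and the "graph" is an empty placeholder rather than a ggml graph plus gallocator.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical helper: fold two 32-bit shape values into one 64-bit key,
// in the spirit of pack(T_mel, T_stft).
uint64_t pack_key_sketch(uint32_t a, uint32_t b) {
    return (static_cast<uint64_t>(a) << 32) | static_cast<uint64_t>(b);
}

// Placeholder for a built graph and its allocator.
struct built_graph_sketch {
    uint64_t key;
};

struct graph_cache_sketch {
    std::unique_ptr<built_graph_sketch> graph;
    int builds = 0;  // counts rebuilds, for illustration only

    // Reuse the cached graph on a key match; rebuild on first use or key change.
    built_graph_sketch & get_or_build(
            uint64_t key,
            const std::function<std::unique_ptr<built_graph_sketch>(uint64_t)> & build) {
        if (!graph || graph->key != key) {
            graph = build(key);
            ++builds;
        }
        return *graph;
    }

    // Explicit teardown, meant to run before the backend itself is freed.
    void destroy() { graph.reset(); }
};

// Hypothetical builder; a real one would construct the stage's graph.
std::unique_ptr<built_graph_sketch> build_graph_sketch(uint64_t key) {
    return std::unique_ptr<built_graph_sketch>(new built_graph_sketch{key});
}
```

A single slot suits same-shape warm segments; streaming chunks of varying length simply trigger a rebuild on each key change, which matches the behaviour described for the encoder cache.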
GustavoA1604 added a commit that referenced this pull request on May 6, 2026
PR #8 (QVAC-17872 round-HIFT) dropped the trailing ggml_cont on `xp` before ggml_mul_mat in run_f0_predictor on the rationale that "Vulkan / Metal / CUDA mul_mat shaders all iterate by stride and accept strided src1 for f32 matmul". That holds for those GPU backends but ggml-cpu's mul_mat enforces nb10 == ggml_type_size(src1->type), so the bare permute aborts the process on CPU during HiFT decode (visible as 4x repeated GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed when running tts-cli on CPU with --threads 4 against any Chatterbox GGUF; the parity test test-cpu-caches reproduces the same crash on its warm-cache lifecycle pass). The other two cont removals from PR #8 (conv_transpose_1d_f32 exit into ggml_add, ISTFT y_trim into ggml_clamp) consume into element- wise ops that DO accept strided sources on every backend and stay removed. Only the f0_predictor site ever feeds a permuted tensor into mul_mat src1, so the unconditional cont is the minimal fix. Validated locally on Windows / MSVC / qvac-ext-ggml/speech: chatterbox-t3-turbo + chatterbox-s3gen run end-to-end on CPU, T3 1817 ms, S3Gen 2061 ms, no asserts. Co-authored-by: Cursor <cursoragent@cursor.com>
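The stride condition behind that assert can be modelled without ggml. The sketch below is an illustration only: `view2d_sketch` and its helpers are hypothetical and merely mimic the `ne` (element counts) / `nb` (byte strides) convention; the real check lives in ggml-cpu. It shows why a bare permute fails `nb10 == ggml_type_size(src1->type)` for F32 and why a copy to contiguous layout restores it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical 2-D F32 tensor view: element counts and byte strides per dim.
struct view2d_sketch {
    int64_t ne[2];
    size_t  nb[2];
};

// Contiguous layout: the innermost stride equals the element size.
view2d_sketch make_contiguous_f32(int64_t ne0, int64_t ne1) {
    return { { ne0, ne1 },
             { sizeof(float), static_cast<size_t>(ne0) * sizeof(float) } };
}

// A permute swaps shape and strides but moves no data.
view2d_sketch permute_sketch(view2d_sketch v) {
    std::swap(v.ne[0], v.ne[1]);
    std::swap(v.nb[0], v.nb[1]);
    return v;
}

// Stand-in for a cont: same shape, data rewritten into contiguous order.
view2d_sketch cont_sketch(const view2d_sketch & v) {
    return make_contiguous_f32(v.ne[0], v.ne[1]);
}

// Models the CPU-side requirement that src1's innermost stride is one element.
bool cpu_mul_mat_accepts_src1_sketch(const view2d_sketch & v) {
    return v.nb[0] == sizeof(float);
}
```

For an 8 × 3 view, the permuted innermost stride becomes 32 bytes instead of 4, so the modelled check rejects it until the cont is applied.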
GustavoA1604 pushed a commit to GustavoA1604/qvac-ext-ggml that referenced this pull request on May 6, 2026
…INE_CACHE_DIR Adds an opt-in persistent shader cache to ggml-vulkan. Enabled only when the caller sets GGML_VK_PIPELINE_CACHE_DIR to a non-empty path; when unset or empty behaviour is byte-identical to upstream ggml-vulkan. No auto-discovery of $XDG_CACHE_HOME or $HOME. ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface. When enabled, createPipelineCache is seeded from the path at init and getPipelineCacheData is written back from ggml_vk_cleanup() (not ~vk_device_struct which is unreliable at process exit due to shared_ptr ref cycles). File keyed on vendorID/deviceID/driverVersion; Vulkan validates the blob header and silently ignores stale data if the shader bundle or driver changed. Atomic save via tmp+rename. Recovers ~91% of the cold->warm shader-compile gap on the first warm run on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers). Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-1). Co-authored-by: Cursor <cursoragent@cursor.com>
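The opt-in gate and the tmp+rename save can be sketched with the standard library alone. This is not the patch itself: the function names, the hex file-name layout, and the `.bin` / `.tmp` suffixes are assumptions made for illustration, and the blob is any byte vector rather than real `getPipelineCacheData` output. Only the environment variable name and the vendorID/deviceID/driverVersion keying come from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Empty path means the feature stays off (variable unset or empty).
fs::path cache_dir_from_env_sketch() {
    const char * dir = std::getenv("GGML_VK_PIPELINE_CACHE_DIR");
    return (dir && *dir) ? fs::path(dir) : fs::path();
}

// Hypothetical file-name layout keyed on the three device identifiers.
std::string cache_file_name_sketch(uint32_t vendor_id, uint32_t device_id, uint32_t driver_version) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%08x-%08x-%08x.bin",
                  static_cast<unsigned>(vendor_id),
                  static_cast<unsigned>(device_id),
                  static_cast<unsigned>(driver_version));
    return buf;
}

// Write to a temporary sibling, then rename over the final name so a reader
// never sees a half-written blob.
bool save_blob_atomically_sketch(const fs::path & dir, const std::string & name,
                                 const std::vector<uint8_t> & blob) {
    std::error_code ec;
    fs::create_directories(dir, ec);
    if (ec) return false;

    const fs::path final_path = dir / name;
    const fs::path tmp_path   = dir / (name + ".tmp");

    bool written = false;
    {
        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (out) {
            out.write(reinterpret_cast<const char *>(blob.data()),
                      static_cast<std::streamsize>(blob.size()));
            out.flush();
            written = out.good();
        }
    }  // stream closed before the rename
    if (!written) { fs::remove(tmp_path, ec); return false; }

    fs::rename(tmp_path, final_path, ec);
    if (ec) { fs::remove(tmp_path, ec); return false; }
    return true;
}

// Returns an empty vector when the file is missing or unreadable.
std::vector<uint8_t> load_blob_sketch(const fs::path & path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(in),
                                std::istreambuf_iterator<char>());
}
```

A caller would check `cache_dir_from_env_sketch().empty()` first and skip both load and save when it is true, which keeps the default behaviour free of filesystem side effects.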
GustavoA1604 pushed a commit to GustavoA1604/qvac-ext-ggml that referenced this pull request on May 6, 2026
Stacks on the previous patch. Writes back the on-disk pipeline-cache blob after every ggml_vk_load_shaders compile batch instead of only at ggml_vk_cleanup() time, so a process killed mid-graph (SIGKILL, abort, OS shutdown) doesn't lose the freshly compiled pipelines. Adds pipeline_cache_last_size book-keeping so warm runs short-circuit the disk write: the eager path only flushes when the cache actually grew (blob.size() > last_size), and the cleanup path skips when size matches last_size. This avoided a +90 ms WALL regression measured during dev when the flush was unconditional. Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-2). Co-authored-by: Cursor <cursoragent@cursor.com>
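The size book-keeping reads naturally as two small predicates. The sketch below assumes a hypothetical `pipeline_cache_flusher_sketch` struct and replaces the real disk write with a counter; only the two comparisons (`blob.size() > last_size` on the eager path, skip on equal size at cleanup) are taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper around the last-written blob size.
struct pipeline_cache_flusher_sketch {
    size_t last_size   = 0;  // size of the blob most recently written
    int    disk_writes = 0;  // stand-in for the actual file write

    // Eager path: flush only when the cache actually grew.
    bool flush_if_grown(const std::vector<uint8_t> & blob) {
        if (blob.size() <= last_size) return false;
        write(blob);
        return true;
    }

    // Cleanup path: skip when the size matches what is already on disk.
    bool flush_at_cleanup(const std::vector<uint8_t> & blob) {
        if (blob.size() == last_size) return false;
        write(blob);
        return true;
    }

private:
    void write(const std::vector<uint8_t> & blob) {
        ++disk_writes;            // a real implementation would save the blob here
        last_size = blob.size();
    }
};
```

On a warm run no new pipelines compile, the blob size stays put, and both paths return without touching the disk, which is how the unconditional-flush regression is avoided.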
GustavoA1604 added a commit that referenced this pull request on May 7, 2026
Removes references to internal QVAC ticket numbers, fork-PR numbers, and 'round-HIFT / round 2 / round 4 / round 5 / (this PR)' phase markers that document the development history rather than the code's behaviour. Tightens the surrounding prose so each comment reads as 'what this code does and why' instead of 'how we got here'. Specific edits: - src/chatterbox_tts.cpp: 18 comment blocks rewritten. The big CPU-side persistent-cache header at line 446 now describes the caches (a..j) as one homogeneous set instead of 'Round 1' + 'Round 2'. The PR #8 vs CPU-correctness explanation around the f0_predictor ggml_cont keeps the technical rationale (CPU mul_mat asserts on strided src1, GPU shaders accept it) but drops the 'PR #8 / QVAC-17872 round-HIFT optimised for' prefix. - src/t3_mtl.cpp: 5 comments around the T3 step-graph cache. - src/chatterbox_engine.cpp + src/chatterbox_cli.cpp: 3 'drop the T3 step-graph cache before backend free' comments. - src/chatterbox_tts_test_hooks.h: 2 references rewritten as a more generic 'persistent-cache work for the CPU-side multilingual TTS path; see PROGRESS.md for design notes' framing. - CMakeLists.txt: 2 test-registration comment annotations. Vendored upstream content (src/dr_wav.h's own changelog dates) is untouched. Pure comment-only change; rebuilt tts-cpp under MSVC Release with no new warnings or errors. No source code, public API, or build behaviour changes. Co-authored-by: Cursor <cursoragent@cursor.com>
multilingual on Vulkan, RTX 5090, warm-state seg 2..6
Bit-exactness is preserved on multilingual: the locked MD5 invariants (single-shot `c65d98f1…`, 6-segment multi-synth `0b374c74…`, Turbo single-shot `6219f433…`) match byte-for-byte across 4 successive iterations of the test-first regression harness.

The biggest remaining single piece of `S3GEN_INFER` (~120 ms `cfm_total`) is the actual GPU CFM compute, which is not host-cacheable. Closing that requires shader-side work (e.g. tensor-core engagement via `cooperative_matrix2`); listed as a deferred follow-up below.

What this PR does
Round 1: `f6893b2` (squashed source port from the closed PR #1)

- `patches/ggml-vulkan-pipeline-cache.patch` (199 lines, NEW): `VkPipelineCache` keyed by `<vendorID>-<deviceID>-<driverVersion>`, opt-in via `GGML_VK_PIPELINE_CACHE_DIR`.
- `patches/ggml-vulkan-eager-cache-save.patch` (104 lines, NEW)
- `g_cfm_estimator_cache` (process-wide global): `cfm_estimator_cache` was the last graph-builder still local-scope in `s3gen_synthesize_to_wav`; every synth call paid the full ~50 ms graph rebuild cost. The `cache.b2` flag handles the Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode flip transparently. `cfm_step0` 24 → 13 ms.
- `g_time_mlp_results` + `g_time_emb_results`
- `g_weight_cpu_mirror` (`cached_cpu_weights_f32`): `flow/input_embedding` + `flow/spk_embed_affine/{w,b}`.
- `ggml_cont` removals: `conv_transpose_1d_f32` exit, ISTFT `y_trim` exit, `f0_predictor` `xp` permute.
- (`scripts/dump-s3gen-reference.py` + 1-line `test_s3gen.cpp` fix): `regress-tensor-compare.sh` was aborting at stage G2 with `cannot open cfm_concat.npy`. `max_abs = 8.20e-08`.

Verification: `5084ee4` (PROGRESS.md only)

After QVAC-18422 §3.34's converter shipped `chatterbox-s3gen-mtl-q4_0.gguf` (788 MB) + `chatterbox-t3-mtl-q4_0.gguf` (345 MB), I built a fresh `upstream/multilingual_merged` HEAD baseline (no Vulkan patches) and ran identical multilingual synthesis on both.

Single-shot MD5 `c65d98f15a59b8fe9cad98e46eb3fb30` and 6-segment multi-synth MD5 `0b374c7474895a3387b9f1df10b3c1b8` match between this PR and the upstream baseline byte-for-byte. These are the first locked multilingual F32 invariants on the `multilingual_merged` / `main` Vulkan base.

Round 2: `d5c261c` (multilingual-targeted host-side caches)

Targets the per-synth host-CPU overhead that round 1 didn't address. All seven caches sit alongside the round-1 caches; same `destroy()`-before-`ggml_backend_free` discipline.

- `g_encoder_graph_cache`, keyed on `T` (encoder input length): `run_encoder` graph + gallocator. Multilingual T~350+ → bigger encoder graph rebuild was being repeated every synth.
- `g_hift_graph_cache` (+ `g_hift_inv_alpha_entries`), keyed on `pack(T_mel, T_stft)`: `run_hift_decode` graph + gallocator. Parallel `(graph-input-name, source-tensor-ptr)` metadata lets cache hits re-feed every alpha-input slot from `g_inv_alpha_results` without rebuilding. HiFT audio length scales with prompt length → multilingual is the biggest beneficiary.
- `g_f0_graph_cache`, keyed on `T_mel`: `run_f0_predictor` graph + gallocator.
- `g_pos_emb_results` (`cached_pos_emb`), keyed on `pack(T, D)`: `compute_pos_emb` output. Pure CPU compute (~`T × D × 5` trig ops); fired twice per encoder run (`T` and `2T`). Multilingual T=350+ at D=512 was the dominant scaffolding cost.
- `g_inv_alpha_results` (`cached_inv_alpha`), keyed on `ggml_tensor *`: HiFT calls `invert_alpha_cpu` ~72× per synth (12 ResBlocks × 6 alpha tensors); each is a `tensor_get` + per-element reciprocal. Alpha tensors are constant for the model lifetime.
- `g_hann_window_cache` / `g_istft_kernel_cache`, keyed on `n_fft`: pure functions of `n_fft` (constant 16).
- `g_window_sum_cache` (`cached_window_sum`), keyed on `pack(n_fft, hop, T_stft)`

A new `graph_cache` struct (used by encoder / HiFT / F0) and a `pack_hift_key` helper centralise the explicit `destroy()`-on-teardown pattern so future per-stage caches plug in with one struct + one mutex acquisition. The destroy path is unified into `s3gen_release_synth_caches()` (renames the old `g_cfm_estimator_cache_destroy()`), called from `s3gen_model_cache_release` / cache-miss / `s3gen_unload`.

Within-process win on top of round 1 (warm-state seg 2..6, n=5 per build):
Negative result documented (bug caught and fixed during dev)

First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held `g_synth_caches_mu` while calling `cached_inv_alpha`, which itself takes the same mutex internally, a classic re-entrant deadlock. Fix: snapshot `g_hift_inv_alpha_entries` under the mutex into a local vector, then iterate without the lock. An inline comment in `run_hift_decode` documents the rule for future investigators: never hold a cache-state mutex while calling any other `cached_*` helper.

Bit-exactness: locked invariants

Locked F32 MD5s on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275, checked with `regress-mtl-vk.sh verify`:

- Multilingual single-shot: `c65d98f15a59b8fe9cad98e46eb3fb30`
- Multilingual 6-segment multi-synth: `0b374c7474895a3387b9f1df10b3c1b8`
- Turbo single-shot: `6219f4338b1b4fb9dc60481216153b49`

The test-first regression harness `bench-logs-vk-mtl/regress-mtl-vk.sh` (in the qvac monorepo, out-of-tree) locks a snapshot before any change, then runs `verify` after every cache addition. Every round-2 cache passed `verify` immediately after addition.

Tensor-level Python ↔ C++ stage compare (added via the round-1 G2 fix) runs end-to-end through G2 / G3 / G4 / H1 / H3 / H4 / H5 with max rel err 7.92e-3 on STFT (expected: PyTorch `torch.stft` vs hand-built DFT-via-conv1d), max ≤ 4.7e-5 everywhere else, final waveform `max_abs = 8.20e-08`.

How to validate
The full reproduction recipe (including how to build the upstream baseline in a separate `git worktree` for comparison) is in `PROGRESS.md` §3.32 ► "Round 2 ► Reproduction".

Risk assessment

- `$XDG_CACHE_HOME/ggml/vulkan` / `$HOME/.cache/ggml/vulkan`; opt-out via empty `GGML_VK_PIPELINE_CACHE_DIR=""`.
- `chatterbox-s3gen-{turbo,mtl-q4_0}.gguf` work as-is.
- `include/tts-cpp/chatterbox/*.h` untouched.
- `patches/`, applied in `scripts/setup-ggml.sh` (same vendoring model as the existing Metal + OpenCL patches).
- `cmake -DGGML_VULKAN=ON` invocation as before; no new dependencies.
- `PROGRESS.md` §3.32 as a deferred follow-up.
- `s3gen_release_synth_caches()` runs before `ggml_backend_free` (same constraint as the pre-existing `thread_local time_mlp_cache`); registered via `atexit()` on first cache insertion + called explicitly from the cache-miss / `s3gen_unload` paths.

Files

`CHANGELOG.md` deliberately not added (per PR #1 review feedback); the investigation entry lives in `PROGRESS.md` §3.32 instead.

The `inputFilesForAI/qvac-17872-findings/{FINDINGS,PR_DESCRIPTION,bench-logs-vk-mtl/}*` companion docs stay in the qvac monorepo (out-of-tree), the same arrangement as the QVAC-18422 sister PR.

Deferred follow-ups (separate PRs)
- C1, F16 CFM matmul weights (opt-in `CHATTERBOX_F16_CFM`): `multilingual_merged`'s `load_s3gen_gguf` uses `ggml_dup_tensor + ggml_backend_alloc_ctx_tensors`, different from the `main`-base path our F16 conversion was written against. ~100 lines of adaptation + new locked MD5 baselines (NVIDIA + AMD, F32 + F16).
- `cooperative_matrix2` (CM2) tensor-core engagement for narrow CFM matmuls: `main`-base CM2 Tier-3 close-out; `glslc` 2026.1; politically blocked behind a cmake flag pending project-wide baseline-set sign-off. See `inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND_CM2.md`.
- `multilingual_merged`'s zero-cont strided 3D Q/K/V views (`849507a`); `flash_attn_ext` stride-tolerance verification.
- `build_step_graph_mtl` 2× per token via CFG)
- `main`-base mobile-bandwidth projections on rounds 2 / 3 / 5 / 6 / C1, so real mobile runs would either confirm or force revision.
- `regress-mtl-vk.sh` + the existing `regress-c1.sh` / `regress-amd.sh` / `regress-tensor-compare.sh`