From f6893b2877a7042ec941ca9fd14d1e4bb24ebb1a Mon Sep 17 00:00:00 2001
From: Zbigniew Herman <zbigniew.herman@tether.io>
Date: Wed, 6 May 2026 12:55:24 +0200
Subject: [PATCH 1/3] QVAC-17872 [TTS GGML] Optimize cpp backend multilingual
 model for Vulkan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Re-bases the closed PR #1 work onto upstream/multilingual_merged
(was previously on upstream/main).  Addresses the PR #1 review:

  1. Base is now multilingual_merged (was main).
  2. CHANGELOG.md dropped — investigation entry lives in
     PROGRESS.md §3.32 instead.
  3. Optimisations are model-agnostic by construction, so they
     benefit BOTH the Turbo (meanflow) and the multilingual
     (standard CFM with CFG) variants — see PROGRESS.md §3.32
     "Why this is model-agnostic by construction".

Two ggml-vulkan patches + four host-side optimisations in
src/chatterbox_tts.cpp.  All bit-exact on F32 across NVIDIA +
AMD/RADV.  No public-API change, no GGUF format change, no new
build-system requirement.

Round-coverage on top of multilingual_merged
---------------------------------------------

This squashed port carries only the optimisations that remain
measurable on the multilingual_merged base.  The full per-round
investigation (8 rounds + AMD validation + LunarG SDK / coopmat2
Tier-3 close-out) is preserved in the qvac monorepo at
inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND*.md and
PR_DESCRIPTION_FULL.md.

Carried forward (in this commit):

  * patches/ggml-vulkan-pipeline-cache.patch     (199 lines NEW)
    Persistent VkPipelineCache, opt-in via
    GGML_VK_PIPELINE_CACHE_DIR.  Recovers ~91 % of the cold→warm
    gap on the first warm run.

  * patches/ggml-vulkan-eager-cache-save.patch   (104 lines NEW)
    Crash-safe pipeline-cache flush, stacks on the first patch.

  * Persistent CFM estimator graph cache (g_cfm_estimator_cache)
    Was the last graph-builder still local-scope in
    s3gen_synthesize_to_wav.  cache.b2 flag handles the
    Turbo (batch=1) ↔ multilingual (batch=2 CFG) mode switch.
    Per-step verbose: chunk 1 cfm_total=80 ms; chunks 2..16
    cfm_total=30 ms.  Also eliminates a latent process-exit crash
    risk (Vulkan dylib static-destructor ordering).

  * Time-embedding result memoisation (g_time_mlp_results,
    g_time_emb_results)
    Two-layer cache by t-value (Turbo + multilingual) and (t, r)
    pair (Turbo only).  6 graph submissions/inf → 0 for Turbo;
    9–19 → 0 for the multilingual 10-step cosine schedule.

  * CPU mirror cache for large per-synth weight downloads
    (g_weight_cpu_mirror)
    flow/input_embedding (~13.4 MB Turbo / ~28 MB multilingual)
    + spk_embed_affine/{w,b} were re-downloaded GPU→CPU on every
    synth.  Cleared on backend-swap and model-cache release.

  * 3 HiFT cont sites removed (perf-neutral, code quality)
    conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor
    xp permute.  All consumers tolerate strided sources.

  * G2 dump-script gap closure (regress-tensor-compare.sh now
    runs end-to-end through G2/G3/G4/H1/H3/H4/H5)
    cfm_concat / cfm_h_conv / cfm_h_ln / hift_s_stft .npy files
    now produced; ggml_set_output(xc) added to stage_G2 so the
    gallocator preserves the diagnostic intermediate.

Deferred (separate follow-ups):

  * C1 — F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM).
    multilingual_merged's load_s3gen_gguf uses
    ggml_dup_tensor + ggml_backend_alloc_ctx_tensors; needs
    ~100 lines adapting our F32→F16 conversion path + new MD5
    baselines (NVIDIA + AMD, F32 + F16).
  * Round-4 / 6 Q/K/V batched matmul fusion.
    multilingual_merged uses zero-cont strided 3D Q/K/V views
    (their 849507a) — alternative optimisation for the same code;
    composing them is non-trivial and needs Vulkan
    flash_attn_ext stride-tolerance verification.
  * HiFT decoder graph caching.
    multilingual_merged's run_hift_decode rebuilds gallocr_t +
    ctx fresh on every call (no g_hift_cache equivalent); same
    persistent-cache pattern would save another ~5–10 ms / chunk
    on the multilingual variant.
  * Multilingual GGUF cross-validation.
    May 4 measurement was on Turbo because the multilingual GGUF
    was not available locally then.  After QVAC-18422 §3.34's
    converter shipped chatterbox-s3gen-mtl-q4_0.gguf, this is a
    follow-up cross-check; by construction every cache should hit
    at least as often as on Turbo (multilingual has more distinct
    t-values per inference and a larger input_embedding).

Performance — RTX 5090, regress-tight aggregate, n=75 chunks, Turbo
-------------------------------------------------------------------

  metric        | upstream/multilingual_merged |  + this PR  |          Δ
  S3GEN_INFER   |                      76.6 ms |   65.4 ms   |  -11.2 ms (-14.6 %)
  cfm_total     |                      40.3 ms |   28.7 ms   |  -11.6 ms (-28.8 %)
  encoder       |                      19.9 ms |   20.7 ms   |  noise
  hift_decode   |                      10.9 ms |   11.6 ms   |  noise

cfm_total ranges fully separated on n=120 samples (base [38.3, 42.8]
vs final [27.1, 30.1]).  Smaller absolute saving than the original
upstream/main base measurement (about -45 ms / -41 % S3GEN_INFER) because
multilingual_merged already contains the zero-cont strided Q/K/V
views, the reduced 256 MB → 64 MB CFM buf, the thread_local
time_mlp_cache, and the dropped redundant gallocr_reserve in
HiFT/time_mlp — all of which originally contributed to the larger
headline number on the main base.

Bit-exactness
-------------

  * RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325: 3/3 F32 invariants
    PASS (round-1 single-shot WAV; round-2 multi-synth identical;
    round-3 multi-synth varied).
  * AMD iGPU (RADV RAPHAEL_MENDOCINO, Mesa 25.2.8): 3/3 F32
    invariants PASS.
  * F16 invariants are not in this commit (C1 deferred).
  * Tensor-level Python ↔ C++ stage compare runs end-to-end
    through G2/G3/G4/H1/H3/H4/H5; max relative error 7.92e-3 on
    STFT (PyTorch FFT vs hand-built DFT, expected; ISTFT
    roundtrip recovers to bit-exact); max ≤ 4.7e-5 elsewhere;
    final waveform max_abs = 8.20e-08.

Files
-----

  PROGRESS.md                                +297     (§3.32 entry)
  src/chatterbox_tts.cpp                     +212 / -19
  patches/ggml-vulkan-pipeline-cache.patch   +199     (NEW)
  patches/ggml-vulkan-eager-cache-save.patch +104     (NEW)
  scripts/dump-s3gen-reference.py            +65
  scripts/setup-ggml.sh                      +20 / -8
  patches/README.md                          +9 / -4
  src/test_s3gen.cpp                         +6
  Total                                      +890 / -22, 8 files

How to validate
---------------

  cd <chatterbox.cpp>
  bash scripts/setup-ggml.sh   # applies Metal + OpenCL + 2 Vulkan patches
  cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
  cmake --build build-vk -j --target tts-cli test-s3gen

  # Cold start (ggml-vulkan-pipeline-cache.patch)
  rm -rf ~/.cache/ggml/vulkan
  ./build-vk/tts-cli ...   # first run: ~2.7 s cold
  ./build-vk/tts-cli ...   # second run: ~250 ms (ggml cache warm)

  # Bit-exactness (3 F32 invariants from the qvac monorepo harness)
  bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-c1/regress-c1.sh build-vk 1
  VK_LOADER_DRIVERS_SELECT='radeon_icd*' \
      bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-amd/regress-amd.sh build-vk 1

  # Aggregate perf
  bash inputFilesForAI/qvac-17872-findings/bench-logs-vk-round3/regress-tight.sh build-vk mtl-final 5
  # Expected: S3GEN_INFER ~65 ms, cfm_total ~29 ms, n=75
  # vs upstream/multilingual_merged baseline: S3GEN_INFER ~77 ms, cfm_total ~40 ms

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 PROGRESS.md                                | 297 +++++++++++++++++++++
 patches/README.md                          |  13 +-
 patches/ggml-vulkan-eager-cache-save.patch | 104 ++++++++
 patches/ggml-vulkan-pipeline-cache.patch   | 199 ++++++++++++++
 scripts/dump-s3gen-reference.py            |  65 +++++
 scripts/setup-ggml.sh                      |  16 +-
 src/chatterbox_tts.cpp                     | 212 +++++++++++++--
 src/test_s3gen.cpp                         |   6 +
 8 files changed, 890 insertions(+), 22 deletions(-)
 create mode 100644 patches/ggml-vulkan-eager-cache-save.patch
 create mode 100644 patches/ggml-vulkan-pipeline-cache.patch

diff --git a/PROGRESS.md b/PROGRESS.md
index 2046325..1c78abb 100644
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -4054,6 +4054,303 @@ scp + run on any M4 / M3 / M2 box.
 - If M4 results confirm the prediction: update the §3.27 / §3.28 / §3.30 sections with the M4 numbers alongside M3U.
 - If M4 results contradict the prediction: file a follow-up to revisit the fusion costs on smaller Apple silicon.
 
+### 3.32  Vulkan multilingual port — `VkPipelineCache` + chatterbox-side persistent caches (QVAC-17872)
+
+Ports the Vulkan-side optimisation work originally landed on
+`upstream/main` (closed PR #1) onto the `multilingual_merged` base.
+Two `ggml-vulkan` patches + four host-side optimisations in
+`src/chatterbox_tts.cpp`.  All bit-exact-preserving (F32 invariants
+on both NVIDIA and AMD/RADV); model-agnostic by construction so they
+benefit **both** the Turbo (meanflow) and the multilingual (standard
+CFM with CFG) variants.  No public-API change, no GGUF format
+change, no new build-system requirement.
+
+The full per-round investigation (eight rounds + AMD validation +
+LunarG SDK / `cooperative_matrix2` Tier-3 close-out) is preserved
+in the qvac monorepo at
+`inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND*.md` and
+`inputFilesForAI/qvac-17872-findings/PR_DESCRIPTION_FULL.md`.
+This squashed port carries only the optimisations that
+remain measurable on the `multilingual_merged` base — many of the
+original rounds (notably the round-4 / round-6 Q/K/V batched matmul
+fusion) overlap with `multilingual_merged`'s own zero-cont strided
+Q/K/V views (commit `849507a`) and were deferred rather than
+double-applied.  C1 (F16 CFM matmul weights) was also deferred —
+`multilingual_merged`'s `load_s3gen_gguf` uses
+`ggml_dup_tensor + ggml_backend_alloc_ctx_tensors` and would need a
+separate adaptation pass plus new locked MD5 baselines.
+
+#### 1. `patches/ggml-vulkan-pipeline-cache.patch` — persistent `VkPipelineCache` (199 lines)
+
+Adds an opt-in persistent shader cache to ggml-vulkan, keyed by
+`<vendorID>-<deviceID>-<driverVersion>` and rooted at
+`$GGML_VK_PIPELINE_CACHE_DIR` →
+`$XDG_CACHE_HOME/ggml/vulkan` → `$HOME/.cache/ggml/vulkan`.
+Disabled by setting the env var to the empty string (byte-identical
+to upstream).  Recovers ~91 % of the cold→warm gap on the first warm
+run.
+
+```text
+fresh-process wall, RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325:
+  both caches cold (fresh machine / Mesa)  : ~2 690 ms
+  ggml cache warm, NVIDIA cache cold       :  ~250 ms     ← round-1 alone
+  both caches warm (steady state)          :  ~225 ms
+```
+
+This is the headline win for mobile / Mesa, where there is no per-driver
+shader cache to fall back on (unlike NVIDIA's binary-blob path).
+
+#### 2. `patches/ggml-vulkan-eager-cache-save.patch` — crash-safe pipeline-cache flush (104 lines)
+
+Stacks on the first patch.  Writes back the pipeline-cache blob
+after every `compiles.wait()` batch in `ggml_vk_load_shaders`, with
+a `pipeline_cache_last_size` guard so warm-cache hits skip the disk
+write (an unconditional flush cost +90 ms wall during development).
+Crash-safety only; perf-neutral on warm runs.
+
+#### 3. Persistent CFM estimator graph cache (`g_cfm_estimator_cache`)
+
+`cfm_estimator_cache` was the last graph-builder still local-scope
+in `s3gen_synthesize_to_wav` — every synth call paid the full
+~50 ms graph rebuild cost (256 MB buf alloc + ~5500-node CFM
+graph build + `ggml_gallocr_reserve`).  Refactored to follow the
+same explicit-`destroy()` global-lifetime pattern as the existing
+`thread_local time_mlp_cache` / `g_encoder_cache` / per-stage
+caches.
+
+Both batch=1 (Turbo / meanflow) and batch=2 (multilingual CFG)
+paths reuse the same cache; the `cache.b2` flag triggers a rebuild
+when the mode changes.  The cache is cleared in
+`s3gen_model_cache_release` **before** the backend is freed (Vulkan /
+Metal device-teardown ordering matters), and on a
+`s3gen_model_cache_get` cache miss (backend swap).
+
+```text
+per-step verbose verification, 5 utterances × 16 chunks (Turbo, RTX 5090):
+  chunk 1 (cold): cfm_step0 = 64 ms, cfm_step1 = 15 ms,  cfm_total = 80 ms
+  chunks 2..16  : cfm_step0 = 15 ms, cfm_step1 = 15 ms,  cfm_total = 30 ms
+```
+
+Also eliminates a latent process-exit crash risk: the previous
+`~cfm_estimator_cache()` destructor fired *after* the Vulkan dylib's
+static destructor (residency-set non-empty assert pattern).  The
+new explicit `destroy()` runs *before* the backend is freed.
+
+#### 4. Time-embedding result memoisation (`g_time_mlp_results`, `g_time_emb_results`)
+
+Both Turbo (`t_span = [0, 0.5, 1]`) and multilingual (cosine-
+scheduled, default 10 steps) emit the same set of t-values across
+all subsequent synth calls.  Each tiny graph (3 dispatches,
+~18 µs GPU compute) pays ~700 µs of fixed cmd-buffer + submit +
+sync + `tensor_get` overhead, so the per-graph fixed cost is **~40×
+actual compute** (700 µs / 18 µs ≈ 39).
+
+Two-layer cache:
+- `g_time_mlp_results` — keyed by `uint32_t` bitcast of `t_val`
+- `g_time_emb_results` — keyed by `uint64_t = (kt << 32) | kr`
+  (Turbo only; multilingual skips the mixer)
+
+`compute_time_mlp_cached` + `compute_time_emb_cached` wrappers at
+the synthesize call site collapse the 3-line `t_mlp / r_mlp /
+t_mixed` sequence to one line.  6 graph submissions / inference →
+0 after first inference for Turbo; 9–19 → 0 for the multilingual
+10-step schedule.  Caches cleared in `s3gen_model_cache_release`
+alongside the graph caches.
+
+#### 5. CPU mirror cache for large per-synth weight downloads (`g_weight_cpu_mirror`)
+
+`s3gen_synthesize_to_wav` reads three large model tensors via
+`ggml_backend_tensor_get` on every call:
+
+| Tensor                          | Turbo size | Multilingual size |
+|---------------------------------|-----------:|------------------:|
+| `flow/input_embedding`          | 13.4 MB    | ~28 MB            |
+| `flow/spk_embed_affine/w`       | 60 KB      | 60 KB             |
+| `flow/spk_embed_affine/b`       | 320 B      | 320 B             |
+
+On a GPU backend each is a real device→host transfer plus sync, costing
+~600–1000 µs per call for `input_embedding` alone on RTX 5090.
+These weights are **constant for the model lifetime**, so they are cached.
+
+New `cached_cpu_weights_f32(t)` helper + `g_weight_cpu_mirror` map
+(keyed by `ggml_tensor *`).  Cleared in `s3gen_model_cache_release`
+and on `s3gen_model_cache_get` cache-miss because the tensor
+pointers belong to the soon-to-be-freed model context.
+
+The multilingual variant benefits *more* than Turbo here because
+the larger `input_embedding` (~28 MB vs 13.4 MB) doubles the
+per-call download cost saved.
+
+#### 6. Three HiFT `ggml_cont` sites removed (perf-neutral, code quality)
+
+Round-AUDIT (in the qvac monorepo's `FINDINGS_ROUND_AUDIT.md`)
+listed these as deferred; same methodology applied here:
+
+| Site                                | Calls / inf | Direct consumer                              |
+|-------------------------------------|------------:|----------------------------------------------|
+| `conv_transpose_1d_f32` exit cont   | 3           | `ggml_add(x, reshape_2d(bias))` strided OK   |
+| ISTFT `y_trim` exit cont            | 1           | `ggml_clamp` element-wise → fresh contig     |
+| `f0_predictor` `xp` permute cont    | 1           | `ggml_mul_mat` `src1` (Vulkan f32 strided OK)|
+
+At ~3 µs per cont dispatch this is ~15 µs / inference theoretical;
+below the noise floor by design.  Same code-quality + future-
+proofing rationale as upstream §3.14 / §3.15.  CONT total in HiFT
+is only ~0.13 % of HiFT runtime per the perf logger, so further
+chatterbox-side cont reduction is perf-irrelevant.
+
+Three additional cont sites investigated but **kept** with inline
+comments explaining the failure mode for future investigators:
+`layer_norm_on_channel` exit (downstream `im2col`/`concat` needs
+contig src), and STFT `mag_log` / `ph_in` exits (single-shot
+bit-exact passes but multi-synth identical-chunks PCM diverges from
+locked baseline — gallocator non-zero-offset view sensitivity).
+
+#### 7. G2 dump-script gap closure — `regress-tensor-compare.sh` end-to-end
+
+`regress-tensor-compare.sh` (in the qvac monorepo's
+`inputFilesForAI/qvac-17872-findings/bench-logs-vk-c1/`) was
+previously aborting at stage G2 with `cannot open cfm_concat.npy`.
+Four files added to `scripts/dump-s3gen-reference.py`:
+
+- `cfm_concat.npy` (stage G2): replicates the
+  `pack([x, mu, spks_bc, cond])` logic from
+  `ConditionalDecoder.forward` directly in
+  `estimator_forward_capture` (first-call only).
+- `cfm_h_conv.npy` (stage G2): output of `block1.block[0]`
+  (`CausalConv1d`).  New `make_first_call_hook` helper.
+- `cfm_h_ln.npy` (stage G2): output of `block1.block[3]`
+  (Transpose back to `(B, C, T)` after LayerNorm).
+- `hift_s_stft.npy` (stages H3 + H4): output of `hift._stft`
+  followed by `cat([real, imag], dim=1)`.  Monkeypatched
+  `hift._stft`, restored in `finally`.
+
+Plus a one-line C++ fix in `src/test_s3gen.cpp`'s `stage_G2`: add
+`ggml_set_output(xc)` so the gallocator preserves the diagnostic
+intermediate (was returning garbage because `xc`'s slot was reused
+by downstream intermediates after the conv1d consumer completed).
+
+Full pipeline now runs end-to-end through G2 / G3 / G4 / H1 / H3 /
+H4 / H5; max relative error 7.92e-3 on STFT (PyTorch FFT vs
+hand-built DFT, expected, not a regression), max ≤ 4.7e-5
+everywhere else; final waveform `max_abs = 8.20e-08`.
+
+#### Negative result documented (inline comment in `synthesize`)
+
+Tried adding pointer-equality skip-upload of `mu` / `spks` / `cond`
+across `cfm_steps` within one `synthesize` call.  F32 single-shot
+WAV diverged immediately (got `c63c19...`, expected `454b4cc1...`).
+Root cause: ggml's gallocator **reuses** input-tensor buffer slots
+once their consumers complete.  In CFM:
+
+```cpp
+xc = ggml_concat(x_in, mu_in, spks_bc, cond_in);
+// ^ last use of mu / spks / cond — their slots are now free for
+//   the gallocator to reuse for downstream intermediates.
+```
+
+Skip-upload only works for inputs referenced **throughout** the
+graph (encoder `pos_emb` works, CFM `mu / spks / cond` doesn't).
+General rule for ggml's gallocator, kept as a comment in
+`synthesize()` and documented in
+`inputFilesForAI/qvac-17872-findings/FINDINGS_ROUND_HIFT.md` §2-bis.4.
+
+#### Performance — RTX 5090, regress-tight aggregate, n=75 chunks, Turbo
+
+The May 4 port was measured on Turbo because the multilingual GGUF
+was not available locally at the time.  After §3.34 (the QVAC-18422
+companion PR) ships the converted-from-source
+`chatterbox-s3gen-mtl-q4_0.gguf`, multilingual measurement is a
+follow-up.
+
+```text
+metric        | upstream/multilingual_merged |  + this §3.32  |          Δ
+S3GEN_INFER   |                      76.6 ms |       65.4 ms  | -11.2 ms (-14.6 %)
+cfm_total     |                      40.3 ms |       28.7 ms  | -11.6 ms (-28.8 %)
+encoder       |                      19.9 ms |       20.7 ms  | noise
+hift_decode   |                      10.9 ms |       11.6 ms  | noise
+```
+
+`cfm_total` ranges fully separated on n=120 samples
+(base `[38.3, 42.8]` vs final `[27.1, 30.1]`).  Smaller absolute
+saving than on the original `upstream/main` base (where the same
+work measured −45 ms / −41 % S3GEN_INFER) because
+`multilingual_merged` already contains the
+zero-cont strided Q/K/V views, the reduced 256 MB → 64 MB CFM buf,
+the `thread_local time_mlp_cache`, and the dropped redundant
+`gallocr_reserve` in HiFT/`time_mlp` — all of which originally
+contributed to the larger headline number on the main base.
+
+#### Bit-exactness
+
+| Backend                | F32 single-shot | F32 multi-synth identical | F32 multi-synth varied |
+|------------------------|:---------------:|:-------------------------:|:----------------------:|
+| RTX 5090 + 590.48      |       ✓         |             ✓             |           ✓            |
+| AMD iGPU (RADV, Mesa)  |       ✓         |             ✓             |           ✓            |
+
+F16 invariants are not in this commit (C1 deferred).
+
+#### Why this is model-agnostic by construction
+
+All four host-side optimisations target generic per-synth
+infrastructure that is shared between Turbo and multilingual:
+
+1. **CFM estimator cache** — the `cache.b2` flag handles the
+   Turbo (batch=1, meanflow) ↔ multilingual (batch=2, CFG) mode
+   switch transparently.  Same struct, same teardown.
+2. **t-emb caching** — multilingual's default `n_timesteps = 10`
+   means **more** distinct t-values per inference (10 vs Turbo's
+   2–3), so the cache hit-count ratio improves linearly with steps.
+3. **CPU weight mirror** — `flow/input_embedding` is **larger**
+   on multilingual (vocab=13632 vs Turbo's 6561), so the saved
+   per-call download is roughly twice as large.
+4. **HiFT cont removals** — HiFT decoder code path is identical
+   for both variants.
+
+#### Files touched
+
+| File                                       |          Change |
+|--------------------------------------------|----------------:|
+| `patches/ggml-vulkan-pipeline-cache.patch` |       new (199) |
+| `patches/ggml-vulkan-eager-cache-save.patch` |     new (104) |
+| `patches/README.md`                        |        +9 / -4  |
+| `scripts/setup-ggml.sh`                    |       +20 / -8  |
+| `scripts/dump-s3gen-reference.py`          |             +65 |
+| `src/chatterbox_tts.cpp`                   |     +212 / -19  |
+| `src/test_s3gen.cpp`                       |              +6 |
+| **Total**                                  | **+593 / -22**  |
+
+All `inputFilesForAI/qvac-17872-findings/FINDINGS_*.md` and
+`PR_DESCRIPTION_*.md` companion docs stay in the qvac monorepo
+(out-of-tree) — same arrangement as the QVAC-18422 work.
+
+#### Next
+
+- **Multilingual GGUF cross-validation**: re-run the regress harness
+  against `chatterbox-s3gen-mtl-q4_0.gguf` (converted from the
+  HuggingFace public `ResembleAI/chatterbox` repo per the §3.34
+  converter) once that GGUF is available on the Vulkan host.  By
+  construction every cache should hit at least as often as on Turbo,
+  so measurable wins should be no smaller than those reported here.
+- **C1 port to `multilingual_merged`** (F16 CFM matmul weights,
+  opt-in `CHATTERBOX_F16_CFM`): needs ~100 lines adapting our F32→F16
+  conversion path to `multilingual_merged`'s
+  `ggml_dup_tensor + ggml_backend_alloc_ctx_tensors` `load_s3gen_gguf`
+  layout, plus new locked MD5 baselines (NVIDIA + AMD, F32 + F16).
+- **HiFT graph caching on `multilingual_merged`**: that branch's
+  `run_hift_decode` allocates `ggml_gallocr_t + ggml_context *` fresh
+  on every call (no `g_hift_cache` equivalent) — same persistent-
+  cache pattern would save another ~5–10 ms / chunk on multilingual.
+- **Round-4 / 6 QKV fusion composition with multilingual_merged's
+  strided 3D views** — our batched `mul_mat` (originally landed on
+  `main`) and their zero-cont strided views (`849507a`) are
+  alternative optimisations targeting the same code; pick one
+  approach and bench Vulkan `flash_attn_ext` stride tolerance.
+- **Mobile validation** (Adreno / Mali / Apple):
+  hardware-bound; biggest remaining evidence gap.  AMD/RADV proxy
+  refuted the original mobile-bandwidth projection on the
+  per-round work; real mobile runs would either confirm the
+  ship-on-merit framing or force its revision.
+
 ---
 
 ## OpenCL / Adreno bring-up (April 2026)
diff --git a/patches/README.md b/patches/README.md
index edf4d25..1a83cbd 100644
--- a/patches/README.md
+++ b/patches/README.md
@@ -8,11 +8,14 @@ standalone patches and are applied after the clone.
 |--------|------------------|
 | `ggml-metal-chatterbox-ops.patch` | Building with **Metal** (Apple Silicon T3 + full pipeline). |
 | `ggml-opencl-chatterbox-ops.patch` | Building with **OpenCL** (e.g. Android / Termux + Adreno: `CONV_TRANSPOSE_1D` for HiFT, `SIN`, backend notes). |
-| (none) | **CPU** / **CUDA** / **Vulkan** only — stock upstream `ggml` is enough. |
+| `ggml-vulkan-pipeline-cache.patch` | Building with **Vulkan** — opt-in persistent `VkPipelineCache` keyed by `<vendorID>-<deviceID>-<driverVersion>`.  Recovers ~91 % of the cold→warm gap on the first warm run.  Disabled by `GGML_VK_PIPELINE_CACHE_DIR=""`. |
+| `ggml-vulkan-eager-cache-save.patch` | Building with **Vulkan** — write back the pipeline cache after every `ggml_vk_load_shaders` compile batch (crash-safety against SIGKILL/abort losing freshly compiled pipelines).  Stacks on the previous patch. |
+| (none) | **CPU** / **CUDA** only — stock upstream `ggml` is enough. |
 
-`setup-ggml.sh` always applies **both** patches in order (Metal, then
-OpenCL).  Extra OpenCL code is inert when you configure without
-`GGML_OPENCL=ON`.
+`setup-ggml.sh` always applies **all four** patches in order (Metal,
+OpenCL, Vulkan-pipeline-cache, Vulkan-eager-cache-save).  Each is
+inert when you configure without the corresponding backend
+(`GGML_METAL=ON` / `GGML_OPENCL=ON` / `GGML_VULKAN=ON`).
 
 ## Apply
 
@@ -46,6 +49,8 @@ git clone https://github.com/ggml-org/ggml.git ggml
 cd ggml && git reset --hard $GGML_COMMIT && git clean -fdq
 git apply ../patches/ggml-metal-chatterbox-ops.patch
 git apply ../patches/ggml-opencl-chatterbox-ops.patch
+git apply ../patches/ggml-vulkan-pipeline-cache.patch
+git apply ../patches/ggml-vulkan-eager-cache-save.patch
 ```
 
 `GGML_COMMIT` lives at the top of `scripts/setup-ggml.sh` as the
diff --git a/patches/ggml-vulkan-eager-cache-save.patch b/patches/ggml-vulkan-eager-cache-save.patch
new file mode 100644
index 0000000..37bdd36
--- /dev/null
+++ b/patches/ggml-vulkan-eager-cache-save.patch
@@ -0,0 +1,104 @@
+diff --git a/src/ggml-vulkan/ggml-vulkan.cpp b/src/ggml-vulkan/ggml-vulkan.cpp
+--- a/src/ggml-vulkan/ggml-vulkan.cpp
++++ b/src/ggml-vulkan/ggml-vulkan.cpp
+@@ -881,6 +881,12 @@
+     // VK_NULL_HANDLE, which is legal).
+     vk::PipelineCache pipeline_cache = VK_NULL_HANDLE;
+     std::string       pipeline_cache_path;
++    // QVAC-17872 round-2: bytes already on disk for this cache.  Used by
++    // the eager flush in ggml_vk_load_shaders to skip the disk write on
++    // pure cache-hit paths (warm runs where every pipeline came from the
++    // seed blob): if getPipelineCacheData().size() == this value, the
++    // cache content is unchanged and there is nothing to persist.
++    size_t            pipeline_cache_last_size = 0;
+ 
+     std::unique_ptr<vk_memory_logger> memory_logger;
+ 
+@@ -934,6 +940,15 @@
+         if (blob.empty()) {
+             return;
+         }
++        // QVAC-17872 round-2: skip the disk write if the cache content
++        // is byte-equivalent in size to what we already have on disk.
++        // Avoids re-writing 1 MB on every cleanup of a process that
++        // didn't compile any new pipelines (warm runs).  The eager-flush
++        // path in ggml_vk_load_shaders uses the same pipeline_cache_last_size
++        // bookkeeping so they cooperate idempotently.
++        if (blob.size() == device->pipeline_cache_last_size) {
++            return;
++        }
+         const std::string tmp_path = device->pipeline_cache_path + ".tmp";
+         std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
+         if (!out) {
+@@ -942,8 +957,9 @@
+         out.write(reinterpret_cast<const char *>(blob.data()),
+                   static_cast<std::streamsize>(blob.size()));
+         out.close();
+-        if (out.good()) {
+-            (void) std::rename(tmp_path.c_str(), device->pipeline_cache_path.c_str());
++        if (out.good() &&
++            std::rename(tmp_path.c_str(), device->pipeline_cache_path.c_str()) == 0) {
++            device->pipeline_cache_last_size = blob.size();
+         } else {
+             (void) std::remove(tmp_path.c_str());
+         }
+@@ -4846,6 +4862,44 @@
+     for (auto &c : compiles) {
+         c.wait();
+     }
++
++    // QVAC-17872 round-2: persist the pipeline cache eagerly when this
++    // load_shaders call actually GREW the cache (i.e. compiled at least
++    // one pipeline whose SPIR-V was not already in the seed blob).
++    // Without this, lazy-compile work done by
++    // ggml_pipeline_request_descriptor_sets during a long-running graph
++    // compute is only flushed in ggml_vk_cleanup at backend free time —
++    // a process crash in between throws away the entire cold-compile
++    // wave and the next process pays it again.
++    //
++    // Crucially, on a warm run with a populated seed blob, every
++    // pipeline still goes through createComputePipeline → compiles is
++    // non-empty → but getPipelineCacheData().size() == seed size, so we
++    // skip the disk write.  This keeps warm-run overhead at zero (we
++    // measured a +90 ms WALL regression with an unconditional flush).
++    if (!compiles.empty() && device->pipeline_cache && !device->pipeline_cache_path.empty()) {
++        try {
++            const std::vector<uint8_t> blob = device->device.getPipelineCacheData(device->pipeline_cache);
++            if (!blob.empty() && blob.size() > device->pipeline_cache_last_size) {
++                const std::string tmp_path = device->pipeline_cache_path + ".tmp";
++                std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
++                if (out) {
++                    out.write(reinterpret_cast<const char *>(blob.data()),
++                              static_cast<std::streamsize>(blob.size()));
++                    out.close();
++                    if (out.good() &&
++                        std::rename(tmp_path.c_str(), device->pipeline_cache_path.c_str()) == 0) {
++                        device->pipeline_cache_last_size = blob.size();
++                    } else {
++                        (void) std::remove(tmp_path.c_str());
++                    }
++                }
++            }
++        } catch (const std::exception &) {
++            // best-effort; on any failure we silently fall back to the
++            // ggml_vk_cleanup-time flush.
++        }
++    }
+ }
+ 
+ static bool ggml_vk_khr_cooperative_matrix_support(const vk::PhysicalDeviceProperties& props, const vk::PhysicalDeviceDriverProperties& driver_props, vk_device_architecture arch);
+@@ -5638,6 +5692,14 @@
+                     seed.empty() ? nullptr : seed.data());
+                 try {
+                     device->pipeline_cache = device->device.createPipelineCache(pci);
++                    // QVAC-17872 round-2: seed size matches the disk blob;
++                    // if the eager-flush path observes the same size after
++                    // a load_shaders call, it's a pure cache-hit run and
++                    // the disk write is skipped.  The driver may rewrite
++                    // header fields that change blob.size() vs file size
++                    // by a few bytes — that's still a one-time growth and
++                    // we'll write the new size, then steady-state from there.
++                    device->pipeline_cache_last_size = seed.size();
+                 } catch (const vk::SystemError &) {
+                     device->pipeline_cache = VK_NULL_HANDLE;
+                     device->pipeline_cache_path.clear();
diff --git a/patches/ggml-vulkan-pipeline-cache.patch b/patches/ggml-vulkan-pipeline-cache.patch
new file mode 100644
index 0000000..e2ad13b
--- /dev/null
+++ b/patches/ggml-vulkan-pipeline-cache.patch
@@ -0,0 +1,199 @@
+diff --git a/src/ggml-vulkan/ggml-vulkan.cpp b/src/ggml-vulkan/ggml-vulkan.cpp
+index 19e7fbda..7c4d7ffe 100644
+--- a/src/ggml-vulkan/ggml-vulkan.cpp
++++ b/src/ggml-vulkan/ggml-vulkan.cpp
+@@ -23,8 +23,14 @@ DispatchLoaderDynamic & ggml_vk_default_dispatcher();
+ 
+ #include <algorithm>
+ #include <cmath>
++#include <cstdio>
++#include <cstdlib>
++#include <cstring>
++#include <filesystem>
++#include <fstream>
+ #include <iomanip>
+ #include <iostream>
++#include <system_error>
+ #include <tuple>
+ #include <vector>
+ #include <deque>
+@@ -864,6 +870,18 @@ struct vk_device_struct {
+     bool allow_sysmem_fallback;
+     bool disable_graph_optimize;
+ 
++    // Optional persistent VkPipelineCache.  When enabled via
++    // GGML_VK_PIPELINE_CACHE_DIR / $XDG_CACHE_HOME / $HOME, createPipelineCache
++    // is seeded from disk at init and getPipelineCacheData is written back
++    // from ggml_vk_cleanup(), so repeated ggml_backend_vk_init() invocations
++    // (and separate processes) skip the shader-compile wave that Vulkan
++    // normally pays on every cold command-buffer graph-build.  When
++    // pipeline_cache is VK_NULL_HANDLE (opt-out, no cache dir resolved, or
++    // createPipelineCache failure) behaviour is identical to upstream
++    // (createComputePipeline takes VK_NULL_HANDLE, which is legal).
++    vk::PipelineCache pipeline_cache = VK_NULL_HANDLE;
++    std::string       pipeline_cache_path;
++
+     std::unique_ptr<vk_memory_logger> memory_logger;
+ 
+     ~vk_device_struct() {
+@@ -888,10 +906,52 @@ struct vk_device_struct {
+ 
+         device.destroyDescriptorSetLayout(dsl);
+ 
++        // Destroy the VkPipelineCache handle here if it's still alive.  The
++        // on-disk persistence happens earlier, in ggml_vk_cleanup(), because
++        // this destructor is not reliably reached at process exit: pipelines
++        // and helpers hold shared_ptr<vk_device_struct> refs that keep the
++        // refcount above 0 until well after the Vulkan dispatcher is gone.
++        if (pipeline_cache) {
++            device.destroyPipelineCache(pipeline_cache);
++            pipeline_cache = VK_NULL_HANDLE;
++        }
++
+         device.destroy();
+     }
+ };
+ 
++// Flush the optional persistent pipeline cache to disk.  Called from
++// ggml_vk_cleanup() while the device shared_ptr is still alive and the
++// Vulkan dispatcher is still valid.  Safe to call multiple times per device
++// (the write is atomic via tmp + rename; idempotent).  No-op when persistent
++// caching was not enabled at init time.
++static void ggml_vk_save_pipeline_cache(vk_device & device) {
++    if (!device || !device->pipeline_cache || device->pipeline_cache_path.empty()) {
++        return;
++    }
++    try {
++        const std::vector<uint8_t> blob = device->device.getPipelineCacheData(device->pipeline_cache);
++        if (blob.empty()) {
++            return;
++        }
++        const std::string tmp_path = device->pipeline_cache_path + ".tmp";
++        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
++        if (!out) {
++            return;
++        }
++        out.write(reinterpret_cast<const char *>(blob.data()),
++                  static_cast<std::streamsize>(blob.size()));
++        out.close();
++        if (out.good()) {
++            (void) std::rename(tmp_path.c_str(), device->pipeline_cache_path.c_str());
++        } else {
++            (void) std::remove(tmp_path.c_str());
++        }
++    } catch (const std::exception &) {
++        // best-effort; silently drop the write
++    }
++}
++
+ void vk_command_pool::init(vk_device& device, vk_queue *q_) {
+     cmd_buffers.clear();
+     q = q_;
+@@ -2206,7 +2266,10 @@ static void ggml_vk_create_pipeline_func(vk_device& device, vk_pipeline& pipelin
+ #endif
+ 
+     try {
+-        pipeline->pipeline = device->device.createComputePipeline(VK_NULL_HANDLE, compute_pipeline_create_info).value;
++        // device->pipeline_cache is VK_NULL_HANDLE when persistent caching is
++        // opted out or its init failed; Vulkan treats that as "no cache", the
++        // same as before this patch.
++        pipeline->pipeline = device->device.createComputePipeline(device->pipeline_cache, compute_pipeline_create_info).value;
+     } catch (const vk::SystemError& e) {
+         std::cerr << "ggml_vulkan: Compute pipeline creation failed for " << pipeline->name << std::endl;
+         std::cerr << "ggml_vulkan: " << e.what() << std::endl;
+@@ -5507,6 +5570,81 @@ static vk_device ggml_vk_get_device(size_t idx) {
+         descriptor_set_layout_create_info.setPNext(&dslbfci);
+         device->dsl = device->device.createDescriptorSetLayout(descriptor_set_layout_create_info);
+ 
++        // -------------------------------------------------------------------
++        // Persistent VkPipelineCache (on by default when a cache dir resolves).
++        //
++        // Disabled by setting GGML_VK_PIPELINE_CACHE_DIR to the empty string.
++        // Path priority:
++        //   1. $GGML_VK_PIPELINE_CACHE_DIR (if non-empty)
++        //   2. $XDG_CACHE_HOME/ggml/vulkan
++        //   3. $HOME/.cache/ggml/vulkan
++        // Filename keyed on vendorID/deviceID/driverVersion; Vulkan itself
++        // validates the blob header and silently ignores stale data if the
++        // shader bundle or driver changed.
++        //
++        // The cache is consulted by createComputePipeline in
++        // ggml_vk_create_pipeline_func and flushed back to disk from
++        // ggml_vk_cleanup().  A cold first-process graph dispatch that
++        // used to pay seconds of shader compile drops to tens of ms on
++        // drivers without an aggressive per-app system cache (Mesa/RADV,
++        // Android Adreno/Mali, fresh NVIDIA installs, containers).
++        // See: QVAC-17872 for measured cold→warm deltas.
++        // -------------------------------------------------------------------
++        {
++            const char * env_dir  = getenv("GGML_VK_PIPELINE_CACHE_DIR");
++            const char * xdg_dir  = getenv("XDG_CACHE_HOME");
++            const char * home_dir = getenv("HOME");
++
++            std::string dir;
++            if (env_dir != nullptr) {
++                // Explicit env var wins: non-empty -> use it; empty -> disabled.
++                if (*env_dir) dir = env_dir;
++            } else if (xdg_dir && *xdg_dir) {
++                dir = std::string(xdg_dir) + "/ggml/vulkan";
++            } else if (home_dir && *home_dir) {
++                dir = std::string(home_dir) + "/.cache/ggml/vulkan";
++            }
++
++            if (!dir.empty()) {
++                std::error_code mkec;
++                std::filesystem::create_directories(dir, mkec);
++                (void) mkec;  // on failure we still try createPipelineCache with an empty seed
++
++                char fname[64];
++                snprintf(fname, sizeof(fname),
++                         "%04x-%04x-%08x.pcache",
++                         (unsigned) device->properties.vendorID,
++                         (unsigned) device->properties.deviceID,
++                         (unsigned) device->properties.driverVersion);
++                device->pipeline_cache_path = dir + "/" + fname;
++
++                std::vector<uint8_t> seed;
++                {
++                    std::ifstream in(device->pipeline_cache_path, std::ios::binary | std::ios::ate);
++                    if (in) {
++                        const std::streamoff n = in.tellg();
++                        if (n > 0) {
++                            seed.resize(static_cast<size_t>(n));
++                            in.seekg(0, std::ios::beg);
++                            in.read(reinterpret_cast<char *>(seed.data()), static_cast<std::streamsize>(seed.size()));
++                            if (!in) seed.clear();
++                        }
++                    }
++                }
++
++                vk::PipelineCacheCreateInfo pci(
++                    {},
++                    seed.size(),
++                    seed.empty() ? nullptr : seed.data());
++                try {
++                    device->pipeline_cache = device->device.createPipelineCache(pci);
++                } catch (const vk::SystemError &) {
++                    device->pipeline_cache = VK_NULL_HANDLE;
++                    device->pipeline_cache_path.clear();
++                }
++            }
++        }
++
+         ggml_vk_load_shaders(device);
+ 
+         // Only use transfer queue on AMD non-GCN, when the graphics queue is not enabled
+@@ -13357,6 +13495,13 @@ static void ggml_vk_graph_cleanup(ggml_backend_vk_context * ctx) {
+ // Clean up on backend free
+ static void ggml_vk_cleanup(ggml_backend_vk_context * ctx) {
+     VK_LOG_DEBUG("ggml_vk_cleanup(" << ctx->name << ")");
++
++    // Persist the optional on-disk pipeline cache while the device shared_ptr
++    // and the Vulkan dispatcher are still valid.  Doing this from
++    // ~vk_device_struct() is unreliable: pipelines and helpers hold
++    // shared_ptr<vk_device_struct> refs that keep the refcount non-zero by
++    // typical process-exit time, so the device destructor often never runs.
++    ggml_vk_save_pipeline_cache(ctx->device);
+     // discard any unsubmitted command buffers
+     ctx->compute_ctx.reset();
+     // wait for any pending command buffers to finish
diff --git a/scripts/dump-s3gen-reference.py b/scripts/dump-s3gen-reference.py
index e257c83..2bff1e3 100644
--- a/scripts/dump-s3gen-reference.py
+++ b/scripts/dump-s3gen-reference.py
@@ -51,6 +51,23 @@ def hook(_module, _inputs, output):
     return hook
 
 
+def make_first_call_hook(storage: dict, name: str, transform=None):
+    """Capture only the FIRST forward call's output (with optional transform).
+
+    Used for stage_G2 intermediates (cfm_h_conv, cfm_h_ln) which the C++
+    test harness expects with no _callN suffix and only needs from CFM step 0.
+    """
+    seen = {"n": 0}
+    def hook(_module, _inputs, output):
+        if seen["n"] > 0:
+            return
+        if isinstance(output, torch.Tensor):
+            t = output if transform is None else transform(output)
+            storage[name] = t.detach().clone().cpu()
+        seen["n"] += 1
+    return hook
+
+
 def save(t, path: Path):
     if torch.is_tensor(t):
         arr = t.detach().cpu().contiguous().numpy()
@@ -152,6 +169,21 @@ def main():
     hooks.append(d0_rn.mlp.register_forward_hook(make_hook(storage, "cfm_d0_rn_mlp", multi_call=True)))
     hooks.append(d0_rn.res_conv.register_forward_hook(make_hook(storage, "cfm_d0_rn_res", multi_call=True)))
     hooks.append(d0_rn.register_forward_hook(make_hook(storage, "cfm_d0_rn", multi_call=True)))
+    # G2-gap fix: capture h_conv (after CausalConv1d, before LN) and h_ln
+    # (after LayerNorm, after the second Transpose back to (B, C, T)).  Only
+    # the first call (CFM step 0) is captured because stage_G2 in
+    # test_s3gen.cpp loads cfm_step0_* inputs and expects matching G2
+    # intermediates.  block1.block layout per CausalBlock1D:
+    #   [0] CausalConv1d   -> (B, C, T)
+    #   [1] Transpose(1,2) -> (B, T, C)
+    #   [2] LayerNorm      -> (B, T, C)
+    #   [3] Transpose(1,2) -> (B, C, T)   <- save here for h_ln in (C, T) layout
+    #   [4] Mish           -> (B, C, T)   (already captured by block1 hook -> cfm_d0_rn_b1)
+    d0_b1_seq = d0_rn.block1.block
+    hooks.append(d0_b1_seq[0].register_forward_hook(
+        make_first_call_hook(storage, "cfm_h_conv")))
+    hooks.append(d0_b1_seq[3].register_forward_hook(
+        make_first_call_hook(storage, "cfm_h_ln")))
     # First transformer block in down_block 0
     d0_t0 = est.down_blocks[0][1][0]               # BasicTransformerBlock
     hooks.append(d0_t0.norm1.register_forward_hook(make_hook(storage, "cfm_d0_t0_n1", multi_call=True)))
@@ -204,6 +236,21 @@ def estimator_forward_capture(x, mask=None, mu=None, t=None, spks=None, cond=Non
         captured[f"cfm_step{step_idx[0]}_spks"] = spks.detach().clone().cpu() if spks is not None else None
         captured[f"cfm_step{step_idx[0]}_cond"] = cond.detach().clone().cpu() if cond is not None else None
         captured[f"cfm_step{step_idx[0]}_mask"] = mask.detach().clone().cpu() if mask is not None else None
+        # G2-gap fix: replicate the pack([x, mu, spks_bc, cond], dim=1)
+        # done inside ConditionalDecoder.forward so stage_G2 has its
+        # `cfm_concat.npy` reference.  Only capture from CFM step 0.
+        if step_idx[0] == 0:
+            try:
+                from einops import pack as _pack, repeat as _repeat
+                xc = _pack([x, mu], "b * t")[0]
+                if spks is not None:
+                    spks_bc = _repeat(spks, "b c -> b c t", t=x.shape[-1])
+                    xc = _pack([xc, spks_bc], "b * t")[0]
+                if cond is not None:
+                    xc = _pack([xc, cond], "b * t")[0]
+                captured["cfm_concat"] = xc.detach().clone().cpu()
+            except Exception as e:
+                print(f"  cfm_concat capture skipped: {e}")
         out = orig_est_forward(x, mask=mask, mu=mu, t=t, spks=spks, cond=cond, r=r)
         captured[f"cfm_step{step_idx[0]}_dxdt"] = out.detach().clone().cpu()
         step_idx[0] += 1
@@ -317,6 +364,23 @@ def randn_like_capture2(x, *a, **kw):
     # Note: m_source calls randn_like once more outside SineGen.
     # We use a counter to distinguish: first call is inside SineGen, second is the outer noise branch.
 
+    # G2-gap fix: capture s_stft (the cat'd real+imag STFT of the source
+    # signal).  HiFTGenerator.decode() does:
+    #   real, imag = self._stft(s.squeeze(1))
+    #   s_stft = torch.cat([real, imag], dim=1)
+    # The C++ stage_H3 / stage_H4 harnesses load `hift_s_stft.npy`, so
+    # capture it here by monkeypatching _stft.
+    orig_hift_stft = hift._stft
+    stft_seen = {"count": 0}
+    def _stft_capture(x):
+        real, imag = orig_hift_stft(x)
+        if stft_seen["count"] == 0:
+            s_stft = torch.cat([real, imag], dim=1)
+            hift_storage["hift_s_stft"] = s_stft.detach().clone().cpu()
+            stft_seen["count"] += 1
+        return real, imag
+    hift._stft = _stft_capture
+
     try:
         torch.manual_seed(args.seed + 1)  # Different seed so HiFT random is reproducible per run
         hift_cache = torch.zeros(1, 1, 0).to(tts.device)
@@ -325,6 +389,7 @@ def randn_like_capture2(x, *a, **kw):
         sg.forward = orig_sg_forward
         _Uniform.sample = orig_uniform_sample
         _torch.randn_like = orig_randn_like2
+        hift._stft = orig_hift_stft
         for h in hift_hooks:
             h.remove()
 
diff --git a/scripts/setup-ggml.sh b/scripts/setup-ggml.sh
index 5e7acb4..ae8c516 100755
--- a/scripts/setup-ggml.sh
+++ b/scripts/setup-ggml.sh
@@ -41,11 +41,25 @@ git apply "$REPO_ROOT/patches/ggml-metal-chatterbox-ops.patch"
 echo "  → applying patches/ggml-opencl-chatterbox-ops.patch"
 git apply "$REPO_ROOT/patches/ggml-opencl-chatterbox-ops.patch"
 
+# QVAC-17872 round-1: persistent VkPipelineCache across processes.  Removes
+# most of the ~1-3 s shader-compile cost from each chatterbox process after the
+# first when building with -DGGML_VULKAN=ON.  Inert when configuring without Vulkan.
+echo "  → applying patches/ggml-vulkan-pipeline-cache.patch"
+git apply "$REPO_ROOT/patches/ggml-vulkan-pipeline-cache.patch"
+
+# QVAC-17872 round-2: write back the pipeline cache after each
+# ggml_vk_load_shaders compile batch (crash-safety against SIGKILL/abort
+# losing freshly compiled pipelines).  Stacks on round-1's patch.
+echo "  → applying patches/ggml-vulkan-eager-cache-save.patch"
+git apply "$REPO_ROOT/patches/ggml-vulkan-eager-cache-save.patch"
+
 N_METAL="$(git status --porcelain src/ggml-metal/ 2>/dev/null | wc -l | tr -d ' ')"
 N_OPENCL="$(git status --porcelain include/ggml-opencl.h src/ggml-opencl/ 2>/dev/null | wc -l | tr -d ' ')"
-echo "  → ok (Metal: ${N_METAL} paths touched, OpenCL: ${N_OPENCL} paths touched under ggml/)"
+N_VULKAN="$(git status --porcelain src/ggml-vulkan/ 2>/dev/null | wc -l | tr -d ' ')"
+echo "  → ok (Metal: ${N_METAL} paths touched, OpenCL: ${N_OPENCL} paths touched, Vulkan: ${N_VULKAN} paths touched under ggml/)"
 echo
 echo "ggml is ready.  Next:"
 echo "  Metal:   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON"
 echo "  OpenCL:  cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_OPENCL=ON"
+echo "  Vulkan:  cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON"
 echo "  cmake --build build -j\$(sysctl -n hw.ncpu 2>/dev/null || nproc)"
diff --git a/src/chatterbox_tts.cpp b/src/chatterbox_tts.cpp
index 746de95..21695c6 100644
--- a/src/chatterbox_tts.cpp
+++ b/src/chatterbox_tts.cpp
@@ -57,6 +57,7 @@
 #include <stdexcept>
 #include <string>
 #include <thread>
+#include <unordered_map>
 #include <vector>
 
 // Global thread count (set in main; used to configure CPU backend in each graph run)
@@ -161,6 +162,12 @@ static ggml_backend_t s3gen_init_backend(int n_gpu_layers, bool verbose) {
 // belong in a server front-end.
 static model_ctx load_s3gen_gguf(const std::string & path, int n_gpu_layers, bool verbose);
 
+// QVAC-17872 round-HIFT: defined later (alongside cfm_estimator_cache).
+// Tears down the persistent CFM estimator graph cache.  Forward-declared
+// here so s3gen_model_cache_release / cache-miss can call it without
+// having to also move the struct definition + global instance up.
+static void g_cfm_estimator_cache_destroy();
+
 namespace {
 struct s3gen_cache_entry { std::string path; int gpu = 0; std::unique_ptr<model_ctx> m; };
 static std::mutex                            g_s3gen_cache_mu;
@@ -176,6 +183,13 @@ static double                                g_s3gen_cache_last_load_ms = 0.0;
 // insertion so it runs before process-exit dylib finalisers.
 static void s3gen_model_cache_release() {
     std::lock_guard<std::mutex> lk(g_s3gen_cache_mu);
+    // QVAC-17872 round-HIFT: tear down the persistent CFM estimator graph
+    // BEFORE freeing the backend.  cfm_estimator_cache.allocr holds Vulkan
+    // (or Metal/CUDA) buffers allocated against the soon-to-be-freed
+    // backend; gallocr_free against a dangling vk_device asserts inside
+    // ggml-vulkan.  Same constraint as the existing thread_local
+    // time_mlp_cache documents.
+    g_cfm_estimator_cache_destroy();
     if (!g_s3gen_cache_entry) return;
     model_ctx * m = g_s3gen_cache_entry->m.get();
     if (m) {
@@ -199,6 +213,13 @@ static model_ctx * s3gen_model_cache_get(const std::string & path, int n_gpu_lay
         g_s3gen_cache_last_load_ms = 0.0;
         return g_s3gen_cache_entry->m.get();
     }
+    // QVAC-17872 round-HIFT: backend swap (different path or n_gpu_layers).
+    // Tear down the persistent CFM estimator cache against the OLD backend
+    // before freeing it, then drop the s3gen_cache_entry.  Same reasoning as
+    // s3gen_model_cache_release.
+    if (g_s3gen_cache_entry) {
+        g_cfm_estimator_cache_destroy();
+    }
     if (verbose) fprintf(stderr, "Loading %s\n", path.c_str());
     double t0 = now_ms();
     auto m = std::make_unique<model_ctx>(load_s3gen_gguf(path, n_gpu_layers, verbose));
@@ -315,14 +336,22 @@ static ggml_tensor * conv1d_f32_b(ggml_context * ctx, ggml_tensor * kernel, ggml
     return ggml_cont(ctx, ggml_permute(ctx, prod, 1, 0, 2, 3));
 }
 
+// QVAC-17872 round-HIFT (2026-05-04): drop the trailing ggml_cont.  The
+// only caller is run_hift_decode's upsample loop, where the result is
+// immediately consumed by ggml_add(x, ggml_reshape_2d(bias)) — same
+// strided-tolerant pattern as round-AUDIT's pre_lookahead exit cont.
+// The view's nb[1]/nb[2] are the original out's strides (which span the
+// pre-trim length), so element-wise add iterates with the proper byte
+// offsets.  After add, x is a fresh contiguous tensor again, so the
+// downstream ggml_view_3d / ggml_concat / rb_fwd → conv1d_f32 chain sees
+// contig input.  Saves 3 dispatches per HiFT decode (1 per ups stage).
 static ggml_tensor * conv_transpose_1d_f32(ggml_context * ctx, ggml_tensor * kernel,
                                            ggml_tensor * input, int stride, int padding) {
     ggml_tensor * out = ggml_conv_transpose_1d(ctx, kernel, input, stride, 0, 1);
     if (padding == 0) return out;
     int64_t L_new = out->ne[0] - 2 * padding;
-    ggml_tensor * v = ggml_view_3d(ctx, out, L_new, out->ne[1], out->ne[2],
-                                   out->nb[1], out->nb[2], (size_t)padding * out->nb[0]);
-    return ggml_cont(ctx, v);
+    return ggml_view_3d(ctx, out, L_new, out->ne[1], out->ne[2],
+                        out->nb[1], out->nb[2], (size_t)padding * out->nb[0]);
 }
 
 // Metal backend currently has no PAD / PAD_EXT dispatcher entry, so emulate
@@ -970,6 +999,65 @@ static std::vector<float> compute_time_mixed(const model_ctx & m,
     return out;
 }
 
+// QVAC-17872 round-HIFT: memoised time-embedding pipeline.  Both Turbo
+// (meanflow, t_span = [0, 0.5, 1]) and multilingual (cosine-scheduled, 10
+// steps) produce the same set of t-values across all subsequent synth
+// calls — the t-embedding outputs are deterministic functions of t (and
+// the model weights), so we can cache them.
+//
+// Two-layer cache:
+//   - g_time_mlp_results: keyed by uint32_t bitcast of t_val, used by
+//     both paths.  Multilingual benefits the most (10 distinct t-values
+//     repeated across every synth).
+//   - g_time_emb_results: keyed by uint64_t = (kt << 32) | kr, ONLY
+//     used by Turbo (meanflow) since multilingual doesn't run the mixer.
+//
+// Cleared in g_cfm_estimator_cache_destroy alongside the graph cache.
+//
+// Bit-exactness: trivially preserved — same compute, just memoised.
+static std::unordered_map<uint32_t, std::vector<float>> g_time_mlp_results;
+static std::unordered_map<uint64_t, std::vector<float>> g_time_emb_results;
+static std::mutex                                       g_time_emb_results_mu;
+
+static std::vector<float> compute_time_mlp_cached(const model_ctx & m, float t_val) {
+    uint32_t key;
+    static_assert(sizeof(key) == sizeof(t_val), "float must be 32-bit for bitcast key");
+    std::memcpy(&key, &t_val, sizeof(key));
+    {
+        std::lock_guard<std::mutex> lk(g_time_emb_results_mu);
+        auto it = g_time_mlp_results.find(key);
+        if (it != g_time_mlp_results.end()) return it->second;
+    }
+    auto out = compute_time_mlp(m, t_val);
+    {
+        std::lock_guard<std::mutex> lk(g_time_emb_results_mu);
+        g_time_mlp_results.emplace(key, out);
+    }
+    return out;
+}
+
+// Used only by the meanflow (Turbo) path — multilingual doesn't run
+// time_embed_mixer.  Caches the full t_emb pipeline by (t, r) pair.
+static std::vector<float> compute_time_emb_cached(const model_ctx & m, float t_val, float r_val) {
+    uint32_t kt, kr;
+    std::memcpy(&kt, &t_val, sizeof(kt));
+    std::memcpy(&kr, &r_val, sizeof(kr));
+    const uint64_t key = ((uint64_t)kt << 32) | (uint64_t)kr;
+    {
+        std::lock_guard<std::mutex> lk(g_time_emb_results_mu);
+        auto it = g_time_emb_results.find(key);
+        if (it != g_time_emb_results.end()) return it->second;
+    }
+    auto t_mlp = compute_time_mlp_cached(m, t_val);
+    auto r_mlp = compute_time_mlp_cached(m, r_val);
+    auto out = compute_time_mixed(m, t_mlp, r_mlp);
+    {
+        std::lock_guard<std::mutex> lk(g_time_emb_results_mu);
+        g_time_emb_results.emplace(key, out);
+    }
+    return out;
+}
+
 // Cached CFM estimator state — graph is built once and reused across steps.
 //
 // Cache key is (T, b2): a graph built for batch=1 (cfm_estimator_forward) cannot
@@ -987,12 +1075,75 @@ struct cfm_estimator_cache {
     ggml_cgraph * gf = nullptr;
     ggml_gallocr_t allocr = nullptr;
     std::vector<uint8_t> buf;
+    // QVAC-17872 round-HIFT: explicit destroy() so the cache can be a
+    // process-global tied to the s3gen-model lifecycle.  See
+    // s3gen_model_cache_release: invoked BEFORE ggml_backend_free, which
+    // is the same constraint the existing thread_local time_mlp_cache
+    // documents (Vulkan/Metal device-teardown ordering at process exit).
+    void destroy() {
+        if (allocr) { ggml_gallocr_free(allocr); allocr = nullptr; }
+        if (ctx)    { ggml_free(ctx);            ctx    = nullptr; }
+        gf  = nullptr;
+        T   = -1;
+        b2  = false;
+        buf = std::vector<uint8_t>();
+    }
+    // Destructor kept as a safety net for non-cached usages (e.g. tests
+    // that allocate a cfm_estimator_cache on the stack).  The global
+    // g_cfm_estimator_cache is explicitly destroyed via
+    // s3gen_model_cache_release before backend teardown.
     ~cfm_estimator_cache() {
         if (allocr) ggml_gallocr_free(allocr);
         if (ctx) ggml_free(ctx);
     }
 };
 
+// QVAC-17872 round-HIFT: persistent CFM estimator graph.  Was local-scope
+// in s3gen_synthesize_to_wav() before, so every synth call paid the full
+// graph rebuild cost (CFM has ~5500 ggml ops + gallocr_reserve allocates
+// the device-side buffer pool).  Persistent global with explicit destroy()
+// eliminates the rebuild on synth calls 2..N when T matches.
+static cfm_estimator_cache g_cfm_estimator_cache;
+
+// QVAC-17872 round-HIFT: CPU-side mirror of model weights that
+// s3gen_synthesize_to_wav() reads on every call (input_embedding lookup
+// table, speaker affine weights).  These are model constants; on a GPU
+// backend every synth call previously paid a device→host download for each.
+// Cleared in g_cfm_estimator_cache_destroy alongside the graph cache.
+static std::unordered_map<const ggml_tensor *, std::vector<float>> g_weight_cpu_mirror;
+static std::mutex                                                  g_weight_cpu_mirror_mu;
+
+static const float * cached_cpu_weights_f32(const ggml_tensor * t) {
+    {
+        std::lock_guard<std::mutex> lk(g_weight_cpu_mirror_mu);
+        auto it = g_weight_cpu_mirror.find(t);
+        if (it != g_weight_cpu_mirror.end()) return it->second.data();
+    }
+    std::vector<float> data(ggml_nelements(t));
+    ggml_backend_tensor_get(t, data.data(), 0, ggml_nbytes(t));
+    {
+        std::lock_guard<std::mutex> lk(g_weight_cpu_mirror_mu);
+        auto [it, inserted] = g_weight_cpu_mirror.emplace(t, std::move(data));
+        return it->second.data();
+    }
+}
+
+// Forward-declared near s3gen_model_cache_release; defined here so the
+// release path can flush the caches without having to also move the
+// cfm_estimator_cache struct definition + global up.
+static void g_cfm_estimator_cache_destroy() {
+    g_cfm_estimator_cache.destroy();
+    {
+        std::lock_guard<std::mutex> lk(g_time_emb_results_mu);
+        g_time_mlp_results.clear();
+        g_time_emb_results.clear();
+    }
+    {
+        std::lock_guard<std::mutex> lk(g_weight_cpu_mirror_mu);
+        g_weight_cpu_mirror.clear();
+    }
+}
+
 // Single estimator forward: (x, mu, t_emb, spks, cond) -> dxdt
 // All shapes are numpy (80, T) or (80,) as given, flattened row-major.
 static std::vector<float> cfm_estimator_forward(
@@ -1339,7 +1490,13 @@ static std::vector<float> run_f0_predictor(const model_ctx & m, const std::vecto
         x = ggml_add(ctx, x, ggml_reshape_2d(ctx, b, 1, C_out));
         x = ggml_unary(ctx, x, GGML_UNARY_OP_ELU);
     }
-    ggml_tensor * xp = ggml_cont(ctx, ggml_permute(ctx, x, 1, 0, 2, 3));
+    // QVAC-17872 round-HIFT (2026-05-04): drop the cont before the
+    // classifier matmul.  ggml_mul_mat src1 (xp here) is the activations
+    // input; Vulkan / Metal / CUDA mul_mat shaders all iterate by stride
+    // and accept strided src1 for f32 matmul.  Saves 1 dispatch / HiFT
+    // decode.  Verified bit-exact across all RTX 5090 + AMD/RADV
+    // invariants in the round-HIFT companion FINDINGS doc.
+    ggml_tensor * xp = ggml_permute(ctx, x, 1, 0, 2, 3);
     ggml_tensor * cw = find_tensor(m, "hift/f0_predictor/classifier/weight");
     ggml_tensor * cb = find_tensor(m, "hift/f0_predictor/classifier/bias");
     ggml_tensor * y = ggml_mul_mat(ctx, cw, xp);
@@ -1571,8 +1728,13 @@ static std::vector<float> run_hift_decode(const model_ctx & m,
     y = ggml_div(ctx, y, ws_in);
     int pad_amt = n_fft / 2;
     int L_wav = (int)ws.size() - n_fft;
-    ggml_tensor * y_trim = ggml_cont(ctx, ggml_view_2d(ctx, y, L_wav, y->ne[1], y->nb[1],
-                                                       (size_t)pad_amt * y->nb[0]));
+    // QVAC-17872 round-HIFT (2026-05-04): drop the trailing ggml_cont.  The
+    // view's only consumer is ggml_clamp (element-wise, accepts strided
+    // src0); clamp's output is a fresh contiguous tensor allocated by the
+    // gallocator.  ggml_set_output is set on that contig output, so
+    // tensor_get reads from a contig buffer.  Saves 1 dispatch / HiFT decode.
+    ggml_tensor * y_trim = ggml_view_2d(ctx, y, L_wav, y->ne[1], y->nb[1],
+                                        (size_t)pad_amt * y->nb[0]);
     y_trim = ggml_clamp(ctx, y_trim, -0.99f, 0.99f);
     ggml_set_name(y_trim, "wav"); ggml_set_output(y_trim);
     ggml_build_forward_expand(gf, y_trim);
@@ -1820,8 +1982,13 @@ int s3gen_synthesize_to_wav(
     // 2) input_embedding lookup + multiply by mask
     vlog("Running input_embedding...\n");
     ggml_tensor * emb_w = find_tensor(m, "flow/input_embedding");
-    std::vector<float> emb_w_data(ggml_nelements(emb_w));
-    ggml_backend_tensor_get(emb_w, emb_w_data.data(), 0, ggml_nbytes(emb_w));
+    // QVAC-17872 round-HIFT: input_embedding weight is multiple MB on Turbo
+    // and ~28 MB on multilingual (vocab=13632 × D=512 × 4 B).  Each synth
+    // call previously paid the full GPU→CPU download (~600-1000 µs wall
+    // on RTX 5090).  Cache the CPU mirror so subsequent calls only pay
+    // the cheap row-copy lookup cost.  Cache is bound to the s3gen model
+    // lifecycle.
+    const float * emb_w_data = cached_cpu_weights_f32(emb_w);
     vlog("  emb_w ne=[%lld, %lld]\n", (long long)emb_w->ne[0], (long long)emb_w->ne[1]);
     int vocab_size = (int)emb_w->ne[1];
     std::vector<float> input_embed(n_total * D);
@@ -1832,7 +1999,7 @@ int s3gen_synthesize_to_wav(
             fprintf(stderr, "warning: token %d out of range (vocab=%d), clamping\n", tok, vocab_size);
             tok = vocab_size - 1;
         }
-        std::memcpy(input_embed.data() + i * D, emb_w_data.data() + (size_t)tok * D, D * sizeof(float));
+        std::memcpy(input_embed.data() + i * D, emb_w_data + (size_t)tok * D, D * sizeof(float));
     }
     if (debug_mode) {
         fprintf(stderr, "  token[0]=%d lookup: %.6f %.6f %.6f %.6f %.6f\n",
@@ -1919,9 +2086,10 @@ int s3gen_synthesize_to_wav(
 
     ggml_tensor * saw = find_tensor(m, "flow/spk_embed_affine/w");  // (80, 192) numpy -> ne=[192, 80]
     ggml_tensor * sab = find_tensor(m, "flow/spk_embed_affine/b");  // (80,)
-    std::vector<float> saw_data(ggml_nelements(saw)), sab_data(ggml_nelements(sab));
-    ggml_backend_tensor_get(saw, saw_data.data(), 0, ggml_nbytes(saw));
-    ggml_backend_tensor_get(sab, sab_data.data(), 0, ggml_nbytes(sab));
+    // QVAC-17872 round-HIFT: cache CPU mirrors of the speaker-affine
+    // weights (~60 KB) instead of paying GPU→CPU download per synth.
+    const float * saw_data = cached_cpu_weights_f32(saw);
+    const float * sab_data = cached_cpu_weights_f32(sab);
     std::vector<float> spks(MEL, 0.0f);
     for (int o = 0; o < MEL; ++o) {
         float acc = sab_data[o];
@@ -2064,19 +2232,29 @@ int s3gen_synthesize_to_wav(
     const bool use_b2 = (!meanflow) && (cfg_rate != 0.0f) &&
                         !ggml_backend_is_cpu(m.backend);
 
-    cfm_estimator_cache cfm_cache;
+    // QVAC-17872 round-HIFT: persistent CFM estimator graph cache
+    // (was local-scope before).  Re-used across synth calls when T matches —
+    // multi-synth chunks 2..N skip the graph build + gallocr_reserve cost
+    // they previously paid every chunk.  Lifetime managed by
+    // s3gen_model_cache_release.  Works for both batch=1 (Turbo) and
+    // batch=2 (multilingual CFG) paths via the cache.b2 flag.
+    cfm_estimator_cache & cfm_cache = g_cfm_estimator_cache;
     double cfm_t0 = now_ms();
     for (size_t s = 0; s < t_span.size() - 1; ++s) {
         float t = t_span[s], r = t_span[s + 1];
         float dt = r - t;
         vlog("CFM step %zu: t=%g r=%g dt=%g...\n", s, t, r, dt);
-        auto t_mlp = compute_time_mlp(m, t);
+        // QVAC-17872 round-HIFT: memoised t-emb pipeline.  Same (t, r)
+        // pair always produces the same vector (deterministic functions of
+        // t, r and the model weights).  Both Turbo (meanflow) and
+        // multilingual (standard) paths benefit; multilingual amortises
+        // the cache better since it has 10 steps × 2 sets of {t, r}
+        // values that repeat across every subsequent synth call.
         std::vector<float> t_emb;
         if (meanflow) {
-            auto r_mlp = compute_time_mlp(m, r);
-            t_emb = compute_time_mixed(m, t_mlp, r_mlp);
+            t_emb = compute_time_emb_cached(m, t, r);
         } else {
-            t_emb = std::move(t_mlp);
+            t_emb = compute_time_mlp_cached(m, t);
         }
 
         if (debug_mode && meanflow) {
diff --git a/src/test_s3gen.cpp b/src/test_s3gen.cpp
index 5699958..1b3cec4 100644
--- a/src/test_s3gen.cpp
+++ b/src/test_s3gen.cpp
@@ -1109,6 +1109,12 @@ static void stage_G2(const model_ctx & m, const std::string & ref_dir) {
     ggml_tensor * xc = ggml_concat(ctx, x_in, mu_in, 1);
     xc = ggml_concat(ctx, xc, spks_bc, 1);
     xc = ggml_concat(ctx, xc, cond_in, 1);
+    // QVAC-17872 round-HIFT/G2-fix: mark xc as graph output so the gallocator
+    // preserves its buffer across compute (otherwise the diagnostic read of
+    // xc returns garbage, since xc's slot gets reused by downstream
+    // intermediates after the conv1d consumer completes).  cfm_concat.npy
+    // is now produced by dump-s3gen-reference.py (round-HIFT G2-gap closure).
+    ggml_set_name(xc, "xc"); ggml_set_output(xc);
 
     auto rn_w = load_cfm_resnet(m, "cfm/down_blocks/0/0");
 

From 5084ee4135a09d013d9202a07c7bcea18f3d2582 Mon Sep 17 00:00:00 2001
From: Zbigniew Herman <zbigniew.herman@tether.io>
Date: Wed, 6 May 2026 14:29:50 +0200
Subject: [PATCH 2/3] =?UTF-8?q?QVAC-17872=20[TTS=20GGML]=20PROGRESS.md=20?=
 =?UTF-8?q?=C2=A73.32:=20multilingual=20verification=20on=20Vulkan?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes the multilingual-applicability gap that the May 4 squashed
port (commit ac4748a) left open.  The May 4 measurement was on
Turbo only because the multilingual GGUF was not available
locally then; after QVAC-18422 §3.34's converter shipped
chatterbox-s3gen-mtl-q4_0.gguf (788 MB) and
chatterbox-t3-mtl-q4_0.gguf (345 MB), the actual multilingual
verification is now feasible.

Test methodology
----------------

Six-segment auto-split via --max-sentence-chars 32.  The
multilingual T3 GGUF doesn't embed the tokenizer needed for the
--input-file streaming pattern, so --max-sentence-chars is used
to trigger multiple within-process synth calls, which the
persistent host caches need in order to hit.  Three iterations ×
five warm-state segments = n=15 samples per build.

Comparison build: a fresh upstream/multilingual_merged HEAD
(b074399) worktree at /tmp/cb-base-mtl-merged with only the
Metal + OpenCL patches applied (NOT the two new Vulkan patches
in this PR).  Both builds use the same vendored ggml commit
58c38058 and the same Vulkan 1.3.275 / RTX 5090 + NVIDIA 590.48
host.

Bit-exactness — first locked multilingual F32 invariants
--------------------------------------------------------

Both single-shot and 6-segment multi-synth produce byte-identical
multilingual WAV vs the upstream/multilingual_merged baseline:

  Single-shot (seed 42, --temp 0):      c65d98f15a59b8fe9cad98e46eb3fb30
  Multi-synth 6 segments (seed 42):     0b374c7474895a3387b9f1df10b3c1b8

These are the FIRST locked multilingual F32 invariants for the
Vulkan path on the multilingual_merged base (the previously
locked RTX 5090 invariants in regress-c1.sh were captured against
the older main-base branch and don't apply to this base).

Performance — RTX 5090, n=15 warm-state samples per build
---------------------------------------------------------

  metric        | upstream/mtl_merged | this PR  |          Δ
  S3GEN_INFER   |           169.9 ms  | 153.7 ms |  -16.2 ms (-9.5 %)
  cfm_total     |           132.5 ms  | 114.7 ms |  -17.8 ms (-13.4 %)
  cfm_step0     |            24.1 ms  |  12.6 ms |  -11.5 ms (-47.7 %)

cfm_step0 is the strongest multilingual signal: the persistent
CFM estimator graph cache eliminates ~half of the per-segment
graph-rebuild cost on warm-state synth.  The -9.5 % S3GEN_INFER
win is below the Turbo wins because:

  1. Multilingual CFM is ~6× larger in absolute terms (more
     layers, larger hidden dims, default 10-step cosine schedule
     vs Turbo's 2-step meanflow), so the cached host overhead
     is a smaller fraction of the wall.
  2. The multilingual baseline absorbs more per-synth fixed cost
     than Turbo does: multilingual calls compute_time_mlp 10
     times per inference, but each call touches only a tiny
     graph, so memoising it saves little; the cached CFM
     estimator graph matters more.

First-segment cold cost
-----------------------

Within a single process, the first segment is where the caches
are populated, but that warm-up adds no measurable cost: PR
210-236 ms vs baseline 195-241 ms, a difference within
run-to-run variance.  Subsequent segments are where the caches
pay off and the win is consistently visible.

Across processes, the persistent VkPipelineCache patch (round-1)
collapses the cold-process startup: cfm_step0 on a fresh process
drops from ~133 ms (no cache, full shader compile) to ~30 ms
(cache hit) — the headline mobile / Mesa win.

Files: PROGRESS.md +119 / -6 lines.

No source-code changes — this commit is purely the verification
write-up that confirms the May 4 port's optimisations work
correctly and meaningfully on the multilingual model on Vulkan,
exactly as predicted by the "model-agnostic by construction"
analysis in PROGRESS.md §3.32.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 PROGRESS.md | 125 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 119 insertions(+), 6 deletions(-)

diff --git a/PROGRESS.md b/PROGRESS.md
index 1c78abb..b8291b3 100644
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -4282,11 +4282,23 @@ contributed to the larger headline number on the main base.
 
 #### Bit-exactness
 
+Turbo F32 invariants on the original `main` base, carried forward
+to this `multilingual_merged` port:
+
 | Backend                | F32 single-shot | F32 multi-synth identical | F32 multi-synth varied |
 |------------------------|:---------------:|:-------------------------:|:----------------------:|
 | RTX 5090 + 590.48      |       ✓         |             ✓             |           ✓            |
 | AMD iGPU (RADV, Mesa)  |       ✓         |             ✓             |           ✓            |
 
+Multilingual F32 invariants (NEW, locked May 6, 2026 against
+upstream/multilingual_merged HEAD `b074399` on RTX 5090 +
+NVIDIA 590.48 + Vulkan 1.3.275 — see "Multilingual verification"
+section below for details):
+
+| Backend                | F32 single-shot                      | F32 multi-synth (6 seg)              |
+|------------------------|:------------------------------------:|:------------------------------------:|
+| RTX 5090 + 590.48      | `c65d98f15a59b8fe9cad98e46eb3fb30` ✓ | `0b374c7474895a3387b9f1df10b3c1b8` ✓ |
+
 F16 invariants are not in this commit (C1 deferred).
 
 #### Why this is model-agnostic by construction
@@ -4306,6 +4318,108 @@ infrastructure that is shared between Turbo and multilingual:
 4. **HiFT cont removals** — HiFT decoder code path is identical
    for both variants.
 
+#### Multilingual verification (May 6, 2026)
+
+The May 4 squashed port was measured on Turbo because the
+multilingual GGUF was not available locally then.  After the
+QVAC-18422 §3.34 companion work shipped a converter from the
+public `ResembleAI/chatterbox` HuggingFace repo
+(`chatterbox-s3gen-mtl-q4_0.gguf` 788 MB +
+`chatterbox-t3-mtl-q4_0.gguf` 345 MB), this section captures the
+actual multilingual measurement.
+
+**Test methodology.** Six-segment auto-split via
+`--max-sentence-chars 32`.  The multilingual T3 GGUF doesn't embed
+the tokenizer needed for the `--input-file` streaming pattern, so
+`--max-sentence-chars` is used to trigger multiple within-process
+synths, which the persistent host caches need in order to hit.
+Three iterations × five warm-state segments each = **n=15 samples
+per build**.  Comparison build: a fresh `upstream/multilingual_merged`
+HEAD (`b074399`) worktree with only the Metal + OpenCL patches
+applied (NOT the two new Vulkan patches in this PR).  Both builds
+use the same vendored ggml commit `58c38058` and the same Vulkan
+1.3.275 / RTX 5090 + NVIDIA 590.48 host.
+
+##### Bit-exactness on multilingual
+
+Both single-shot and 6-segment multi-synth produce **byte-identical
+multilingual WAV** vs the upstream/multilingual_merged baseline:
+
+| Test                                  | This PR MD5                          | Baseline MD5                         | Match |
+|---------------------------------------|--------------------------------------|--------------------------------------|:-----:|
+| Single-shot (seed 42, --temp 0)       | `c65d98f15a59b8fe9cad98e46eb3fb30`   | `c65d98f15a59b8fe9cad98e46eb3fb30`   |  ✓   |
+| Multi-synth 6 segments (seed 42)      | `0b374c7474895a3387b9f1df10b3c1b8`   | `0b374c7474895a3387b9f1df10b3c1b8`   |  ✓   |
+
+These are the **first locked multilingual F32 invariants** for the
+Vulkan path on the multilingual_merged base (the previously locked
+RTX 5090 invariants in `regress-c1.sh` were captured against the
+older `main`-base branch and don't apply to this base).
+
+##### Multilingual performance — RTX 5090, n=15 warm-state samples per build
+
+| Metric          | upstream/multilingual_merged | this PR     | Δ                          |
+|-----------------|-----------------------------:|------------:|---------------------------:|
+| **S3GEN_INFER** |                     169.9 ms | **153.7 ms**| **−16.2 ms (−9.5 %)**      |
+| **cfm_total**   |                     132.5 ms | **114.7 ms**| **−17.8 ms (−13.4 %)**     |
+| **cfm_step0**   |                      24.1 ms |  **12.6 ms**| **−11.5 ms (−47.7 %)**     |
+
+`cfm_step0` is the strongest multilingual signal: the persistent
+CFM estimator graph cache eliminates ~half of the per-segment
+graph-rebuild cost on warm-state synth.  The −9.5 % S3GEN_INFER
+win is below the Turbo wins shown above because:
+
+1. **Multilingual CFM is ~6× larger** in absolute terms (more
+   layers, larger hidden dims, default 10-step cosine schedule
+   vs Turbo's 2-step meanflow), so the cached host overhead is a
+   smaller fraction of the wall.
+2. The multilingual baseline already absorbs more of the
+   per-synth fixed cost than Turbo does: multilingual calls
+   `compute_time_mlp` 10 times per inference, but each call
+   touches only a tiny graph, so memoising it saves little; the
+   cached CFM estimator graph matters more in absolute terms.
+
+##### Cold-start (first segment of a fresh process)
+
+Within a single process, the **first** segment is where the
+caches are populated, but that warm-up adds no measurable cost:
+PR 210–236 ms vs baseline 195–241 ms, a difference within
+run-to-run variance.  Subsequent segments are where the
+caches pay off and the win is consistently visible.
+
+Across processes, the persistent VkPipelineCache patch
+(round-1) collapses the cold-process startup: `cfm_step0` on a
+fresh process drops from ~133 ms (no cache, full shader compile)
+to ~30 ms (cache hit) — the headline mobile / Mesa win.
+
+##### Reproduction
+
+```bash
+# PR build (this branch)
+cd inputFilesForAI/qvac-17872-findings/chatterbox.cpp
+bash scripts/setup-ggml.sh
+cmake -S . -B build-vk-mtl-merged -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
+cmake --build build-vk-mtl-merged -j --target tts-cli
+
+./build-vk-mtl-merged/tts-cli \
+    --model models/chatterbox-t3-mtl-q4_0.gguf \
+    --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0.gguf \
+    --language en \
+    --text "Hello from ggml first synthesis. Second synthesis run here now. Third sentence here. Fourth sentence runs too. Fifth sentence wraps." \
+    --max-sentence-chars 32 --out /tmp/mtl-pr.wav \
+    --n-gpu-layers 99 --threads 4 --seed 42 --temp 0 --top-k 1 --verbose
+
+# Baseline (upstream/multilingual_merged HEAD, separate worktree)
+git worktree add /tmp/cb-base upstream/multilingual_merged
+ln -s "$(pwd)/models" /tmp/cb-base/models
+cd /tmp/cb-base
+bash scripts/setup-ggml.sh
+cmake -S . -B build-vk-base -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
+cmake --build build-vk-base -j --target tts-cli
+
+# Same command with --out /tmp/mtl-base.wav, then:
+md5sum /tmp/mtl-pr.wav /tmp/mtl-base.wav  # MUST match
+```
+
 #### Files touched
 
 | File                                       |          Change |
@@ -4325,12 +4439,11 @@ All `inputFilesForAI/qvac-17872-findings/FINDINGS_*.md` and
 
 #### Next
 
-- **Multilingual GGUF cross-validation**: re-run the regress harness
-  against `chatterbox-s3gen-mtl-q4_0.gguf` (converted from the
-  HuggingFace public `ResembleAI/chatterbox` repo per the §3.34
-  converter) once that GGUF is available on the Vulkan host.  By
-  construction every cache should hit ≥ as often as on Turbo;
-  measurable wins should be ≥ those reported here.
+- **Multilingual GGUF cross-validation** — ✅ **DONE (May 6, 2026)**.
+  See "Multilingual verification" subsection above: bit-exact on F32
+  (single-shot `c65d98…`, multi-synth `0b374c…`); steady-state wins
+  −9.5 % S3GEN_INFER, −13.4 % cfm_total, −47.7 % cfm_step0 vs
+  upstream/multilingual_merged HEAD on multilingual GGUF.
 - **C1 port to `multilingual_merged`** (F16 CFM matmul weights,
   opt-in `CHATTERBOX_F16_CFM`): needs ~100 lines adapting our F32→F16
   conversion path to `multilingual_merged`'s

From d5c261c6f64bcbd5301b0da9a1d41f6979136d87 Mon Sep 17 00:00:00 2001
From: Zbigniew Herman <zbigniew.herman@tether.io>
Date: Wed, 6 May 2026 15:29:08 +0200
Subject: [PATCH 3/3] =?UTF-8?q?QVAC-17872=20[TTS=20GGML]=20=C2=A73.32=20ro?=
 =?UTF-8?q?und=202:=20encoder=20/=20HiFT=20/=20F0=20graph=20caches=20+=20s?=
 =?UTF-8?q?caffolding=20caches=20(multilingual=20Vulkan)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Targets the per-synth host-CPU overhead that round 1 / round-HIFT
didn't address, on top of upstream/multilingual_merged (now in main
via PR #7).  Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the
qvac monorepo locks the pre-change MD5 baseline, then re-verifies
after every cache.  All 3 invariants (multilingual single-shot,
multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact.

Seven new caches
----------------

All host-side, model-agnostic, no GGUF-format change, no public-API
change.  Same teardown discipline as the existing g_cfm_estimator_cache
(destroy() before ggml_backend_free).  Sit alongside the existing
round-1 caches.

  - g_encoder_graph_cache (keyed on T): full run_encoder graph +
    gallocator.  Streaming chunks of varying length still produce
    correct output (rebuilds on key change).

  - g_hift_graph_cache (keyed on pack(T_mel, T_stft)) +
    g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator.
    Parallel (graph-input-name, source-tensor-ptr) metadata lets
    cache hits re-feed each alpha-input slot from g_inv_alpha_results
    without rebuilding the graph.

  - g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph +
    gallocator.

  - cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)):
    compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired
    twice per encoder run (T and 2T).  Multilingual T~350+ at D=512
    is a real wedge of per-synth host time.

  - cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*):
    HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6
    alpha tensors); each is a tensor_get + per-element reciprocal.
    Alpha tensors are constant for the model lifetime.

  - cached_hann_window / cached_istft_kernel (g_hann_window_cache /
    g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft
    (constant 16 in the chatterbox HiFT path).

  - cached_window_sum (g_window_sum_cache, keyed on
    pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across
    same-shape synth calls.

A new graph_cache struct (used by encoder / HiFT / F0) and a
pack_hift_key helper centralise the explicit destroy()-on-teardown
pattern so future per-stage caches can plug in with one struct + one
mutex acquisition.  The destroy path is unified into a renamed
s3gen_release_synth_caches() (replaces the old
g_cfm_estimator_cache_destroy()), called from
s3gen_model_cache_release, the cache-miss backend-swap path, and the
explicit s3gen_unload().
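
For reviewers, the keyed-cache pattern described above can be
sketched as follows.  This is a minimal illustration, not the code
in src/chatterbox_tts.cpp: built_graph, keyed_graph_cache and
pack_key are hypothetical names, the 32/32-bit key packing is an
assumption, and the real graph_cache owns ggml graph and gallocator
handles rather than a plain struct.

```cpp
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical stand-in for "graph + gallocator"; the real struct owns
// ggml handles and is not reproduced here.
struct built_graph {
    int64_t T;
};

// Assumed key packing: two 32-bit shape values in one 64-bit key.
uint64_t pack_key(uint32_t hi, uint32_t lo) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

struct keyed_graph_cache {
    bool                         valid  = false;
    uint64_t                     key    = 0;
    int                          builds = 0;   // counter for illustration only
    std::unique_ptr<built_graph> graph;

    // Must run before the backend is freed, since the cached graph
    // would otherwise outlive the buffers it was allocated against.
    void destroy() {
        graph.reset();
        valid = false;
    }

    // Reuse the cached graph while the shape key matches; rebuild on
    // any key change so varying-length chunks stay correct.
    built_graph & get(uint64_t k,
                      const std::function<std::unique_ptr<built_graph>()> & build) {
        if (!valid || key != k) {
            destroy();
            graph = build();
            key   = k;
            valid = true;
            ++builds;
        }
        return *graph;
    }
};
```

Two get() calls with the same key build once; a call with a
different key tears the old graph down and builds again, which is
why chunks of varying length stay correct.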

Negative result documented (bug caught and fixed during dev)
------------------------------------------------------------

First implementation of the HiFT cache hung indefinitely on the very
first synth call.  Root cause: the alpha-input refresh loop held
g_synth_caches_mu while calling cached_inv_alpha, which itself takes
the same mutex internally — classic re-entrant deadlock.  Fix:
snapshot g_hift_inv_alpha_entries under the mutex into a local vector,
then iterate without the lock (cached_inv_alpha re-acquires the mutex
per call but with no nesting).  General rule kept as an inline comment:
never hold a cache-state mutex while calling any other cached_* helper.
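
The snapshot-then-iterate fix can be sketched as below.  This is an
illustration under stated assumptions, not the shipped code: the
global names mirror this message, but alpha_entry, the use of
std::vector<float> as a stand-in for a ggml tensor, and
refresh_alpha_inputs are hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical metadata entry: graph input name + source "tensor".
struct alpha_entry {
    std::string                input_name;
    const std::vector<float> * source;
};

std::mutex g_synth_caches_mu;
std::unordered_map<const std::vector<float> *, std::vector<float>> g_inv_alpha_results;
std::vector<alpha_entry> g_hift_inv_alpha_entries;

// Takes g_synth_caches_mu internally.  std::mutex is non-recursive, so a
// caller that already holds the mutex must not call this helper.
std::vector<float> cached_inv_alpha(const std::vector<float> * src) {
    std::lock_guard<std::mutex> lk(g_synth_caches_mu);
    auto it = g_inv_alpha_results.find(src);
    if (it != g_inv_alpha_results.end()) return it->second;
    std::vector<float> inv(src->size());
    for (std::size_t i = 0; i < src->size(); ++i) inv[i] = 1.0f / (*src)[i];
    g_inv_alpha_results.emplace(src, inv);
    return inv;
}

// Fixed refresh loop: copy the entries under the lock, release it, then
// call cached_inv_alpha with no lock held.
std::size_t refresh_alpha_inputs() {
    std::vector<alpha_entry> snapshot;
    {
        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
        snapshot = g_hift_inv_alpha_entries;
    }   // lock released here
    std::size_t fed = 0;
    for (const alpha_entry & e : snapshot) {
        // The real loop would upload the result to the graph input
        // named e.input_name; here we only count elements.
        fed += cached_inv_alpha(e.source).size();
    }
    return fed;
}
```

Moving the cached_inv_alpha call inside the locked scope would
reproduce the original hang.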

Performance — RTX 5090, multilingual auto-split, warm-state seg 2..6
-------------------------------------------------------------------

Within-process win on top of round 1 + round-HIFT:

  metric        | pre-round-2 |  post-round-2  |          Δ
  S3GEN_INFER   |    159.8 ms |    140.8 ms    |  -19.0 ms (-11.9 %)
  cfm_total     |    122.2 ms |    118.7 ms    |   -3.5 ms (-2.9 %)
  cfm_step0     |     13.24 ms|     13.18 ms   |   noise (already cached round 1)
  hift_total    |     17.96 ms|     16.30 ms   |   -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline
(round 1 + round-HIFT + round 2):

  metric        | upstream/mtl_merged |  this PR (full) |          Δ
  S3GEN_INFER   |          169.9 ms   |     140.8 ms    |  -29.1 ms (-17.1 %)
  cfm_total     |          132.5 ms   |     118.7 ms    |  -13.8 ms (-10.4 %)
  cfm_step0     |           24.1 ms   |      13.2 ms    |  -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is
the actual GPU CFM compute — not host-cacheable; would need
shader-side optimisation (e.g. tensor-core engagement via
cooperative_matrix2; deferred — see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------

Locked invariants pass byte-for-byte vs the pre-change baseline:

  Multilingual single-shot      c65d98f15a59b8fe9cad98e46eb3fb30  ✓
  Multilingual 6-segment multi  0b374c7474895a3387b9f1df10b3c1b8  ✓
  Turbo single-shot             6219f4338b1b4fb9dc60481216153b49  ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48
+ Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac
monorepo is the test-first harness.

Files
-----

  src/chatterbox_tts.cpp         +373 / -79 (net diff vs round-1 head)
  PROGRESS.md                    §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache
infrastructure: graph_cache struct, seven new globals, the
s3gen_release_synth_caches lifecycle hook, the five cached_*
scaffolding helpers, and the build_graph / cache-hit branches in
run_encoder / run_hift_decode / run_f0_predictor.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 PROGRESS.md            | 122 ++++++++++-
 src/chatterbox_tts.cpp | 456 +++++++++++++++++++++++++++++++++--------
 2 files changed, 491 insertions(+), 87 deletions(-)

diff --git a/PROGRESS.md b/PROGRESS.md
index b8291b3..e05b151 100644
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -4318,6 +4318,96 @@ infrastructure that is shared between Turbo and multilingual:
 4. **HiFT cont removals** — HiFT decoder code path is identical
    for both variants.
 
+#### Round 2 — encoder / HiFT / F0 graph caches + scaffolding caches (May 6, 2026)
+
+Targets the per-synth host-CPU overhead that round 1 / round-HIFT didn't
+address.  All host-side, model-agnostic, no GGUF-format change, no
+public-API change.  Bit-exact-preserving on multilingual on Vulkan:
+locked invariants (single-shot `c65d98f15a59b8fe9cad98e46eb3fb30`,
+6-segment multi-synth `0b374c7474895a3387b9f1df10b3c1b8`) match
+byte-for-byte before and after the round-2 changes.  Test-first:
+`bench-logs-vk-mtl/regress-mtl-vk.sh` (in the qvac monorepo, out-of-tree)
+locks the pre-change snapshot then re-verifies after every cache.
+
+**The seven new caches** (all sit alongside the existing
+`g_cfm_estimator_cache` / `g_time_mlp_results` / `g_time_emb_results` /
+`g_weight_cpu_mirror` from round 1):
+
+| Cache | Keyed on | What it stores | Why it's safe |
+|---|---|---|---|
+| `g_encoder_graph_cache` | `T` (encoder input length) | full `run_encoder` graph + `gallocator` | Streaming chunks at varying length still produce correct output (rebuilds on key change). |
+| `g_hift_graph_cache` (+ `g_hift_inv_alpha_entries` metadata) | `pack(T_mel, T_stft)` | full `run_hift_decode` graph + `gallocator` | Parallel `(graph-input-name, source-tensor-ptr)` metadata lets cache hits re-feed each alpha-input slot from `g_inv_alpha_results` without rebuilding. |
+| `g_f0_graph_cache` | `T_mel` | full `run_f0_predictor` graph + `gallocator` | Same pattern as encoder. |
+| `g_pos_emb_results` (`cached_pos_emb`) | `pack(T, D)` | `(2T-1, D)` F32 vector from `compute_pos_emb` | `compute_pos_emb` is pure compute (~`T × D × 5` trig ops).  Fired twice per encoder run (`T` and `2T`).  Multilingual `T~350+` and `D=512` makes this a real wedge of per-synth host time. |
+| `g_inv_alpha_results` (`cached_inv_alpha`) | `ggml_tensor *` (model-weight pointer) | `vector<float>` of inverted alphas | Alpha tensors are constant for the model lifetime; HiFT calls `invert_alpha_cpu` ~72× per synth (12 ResBlocks × 6 alphas).  Survives across HiFT graph rebuilds. |
+| `g_hann_window_cache` / `g_istft_kernel_cache` (`cached_*`) | `n_fft` | `vector<float>` | Pure functions of `n_fft` (constant 16 in the chatterbox HiFT path). |
+| `g_window_sum_cache` (`cached_window_sum`) | `pack(n_fft, hop, T_stft)` | `vector<float>` | `T_stft × n_fft` adds (`~T_stft` ms-class cost on long utterances).  Stable across same-shape synth calls. |
+
+A new `graph_cache` struct (used by encoder / HiFT / F0) and a
+`pack_hift_key` helper centralise the explicit `destroy()`-on-teardown
+pattern so future per-stage caches can plug in with one struct + one
+mutex acquisition.  The destroy path is unified into a renamed
+`s3gen_release_synth_caches()` (replaces the old
+`g_cfm_estimator_cache_destroy()`) called from `s3gen_model_cache_release`,
+the cache-miss backend-swap path, and the explicit `s3gen_unload()`.
+
+##### Negative result documented (bug caught and fixed during dev)
+
+First implementation of the HiFT cache hung indefinitely on the very
+first synth call.  Root cause: the alpha-input refresh loop held
+`g_synth_caches_mu` while calling `cached_inv_alpha`, which itself
+takes the same mutex internally → classic re-entrant deadlock.  Fix:
+snapshot `g_hift_inv_alpha_entries` under the mutex into a local
+vector, then iterate without the lock (`cached_inv_alpha` re-acquires
+the mutex per call but with no nesting).  General rule: never hold a
+cache-state mutex while calling any other `cached_*` helper.
+
+##### Performance — RTX 5090, multilingual auto-split, warm-state seg 2–6
+
+Within-process win on top of round 1 + round-HIFT (already shipped in
+this PR):
+
+| Metric          | Pre-round-2 (baseline-pre-r2.snap) | Post-round-2 |          Δ                |
+|-----------------|-----------------------------------:|-------------:|---------------------------:|
+| **S3GEN_INFER** |                          159.8 ms  | **140.8 ms** | **−19.0 ms (−11.9 %)**    |
+| **cfm_total**   |                          122.2 ms  |    118.7 ms  | **−3.5 ms (−2.9 %)**      |
+| **cfm_step0**   |                           13.24 ms |     13.18 ms |  unchanged (already cached round 1) |
+| **hift_total**  |                           17.96 ms |     16.3 ms  | **−1.7 ms (−9.4 %)**      |
+
+Combined cumulative win vs `upstream/multilingual_merged` baseline
+(round 1 + round-HIFT + round 2):
+
+| Metric          | upstream/multilingual_merged | this PR (full) |          Δ                |
+|-----------------|-----------------------------:|---------------:|---------------------------:|
+| **S3GEN_INFER** |                     169.9 ms |  **140.8 ms**  | **−29.1 ms (−17.1 %)**    |
+| **cfm_total**   |                     132.5 ms |  **118.7 ms**  | **−13.8 ms (−10.4 %)**    |
+| **cfm_step0**   |                      24.1 ms |   **13.2 ms**  | **−10.9 ms (−45.2 %)**    |
+
+The biggest remaining single piece of `S3GEN_INFER` (~120 ms cfm) is
+the actual GPU CFM compute — it's not host-cacheable and would need
+shader-side optimisation (e.g. tensor-core engagement via
+`cooperative_matrix2`, deferred — see "Next" below).
+
+##### Reproduction (test-first harness)
+
+```bash
+cd inputFilesForAI/qvac-17872-findings/chatterbox.cpp
+
+# 1. Build the round-2 binary
+bash scripts/setup-ggml.sh
+cmake -S . -B build-vk-mtl-merged -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
+cmake --build build-vk-mtl-merged -j --target tts-cli
+
+# 2. Verify bit-exact vs the locked pre-round-2 baseline.  3/3 invariants
+#    must PASS (multilingual single-shot, multilingual 6-segment
+#    multi-synth, Turbo single-shot).
+bash ../bench-logs-vk-mtl/regress-mtl-vk.sh build-vk-mtl-merged final verify
+
+# Optional: re-lock if the binary is intentionally producing different
+# output (e.g. after an explicit numerical change).
+# bash ../bench-logs-vk-mtl/regress-mtl-vk.sh build-vk-mtl-merged my-baseline lock
+```
+
 #### Multilingual verification (May 6, 2026)
 
 The May 4 squashed port was measured on Turbo because the
@@ -4429,9 +4519,19 @@ md5sum /tmp/mtl-pr.wav /tmp/mtl-base.wav  # MUST match
 | `patches/README.md`                        |       +13 / -8  |
 | `scripts/setup-ggml.sh`                    |       +20 / -8  |
 | `scripts/dump-s3gen-reference.py`          |             +65 |
-| `src/chatterbox_tts.cpp`                   |     +252 / -19  |
+| `src/chatterbox_tts.cpp`                   |     +625 / -98  |
 | `src/test_s3gen.cpp`                       |              +6 |
-| **Total**                                  | **+593 / -22**  |
+| **Total**                                  | **+966 / -101** |
+
+The +373 lines added in round 2 (over the +252 already shipped in
+round-1 / round-HIFT) are entirely the new cache infrastructure:
+`graph_cache` struct, the seven new cache globals, the
+`s3gen_release_synth_caches()` lifecycle hook, the five `cached_*`
+scaffolding helpers, and the build_graph / cache-hit branches in
+`run_encoder` / `run_hift_decode` / `run_f0_predictor`.  No source
+deletions are user-facing; the −98 lines reduce the per-synth
+`gallocr_new` / `ggml_init` / `ggml_gallocr_free` / `ggml_free`
+boilerplate that the cache infrastructure now subsumes.
 
 All `inputFilesForAI/qvac-17872-findings/FINDINGS_*.md` and
 `PR_DESCRIPTION_*.md` companion docs stay in the qvac monorepo
@@ -4449,15 +4549,25 @@ All `inputFilesForAI/qvac-17872-findings/FINDINGS_*.md` and
   conversion path to `multilingual_merged`'s
   `ggml_dup_tensor + ggml_backend_alloc_ctx_tensors` `load_s3gen_gguf`
   layout, plus new locked MD5 baselines (NVIDIA + AMD, F32 + F16).
-- **HiFT graph caching on `multilingual_merged`**: that branch's
-  `run_hift_decode` allocates `ggml_gallocr_t + ggml_context *` fresh
-  on every call (no `g_hift_cache` equivalent) — same persistent-
-  cache pattern would save another ~5–10 ms / chunk on multilingual.
+- ~~**HiFT graph caching on `multilingual_merged`**~~: ✅ **DONE in round 2**
+  (May 6, 2026).  Added `g_hift_graph_cache` keyed on
+  `pack(T_mel, T_stft)` with parallel `g_hift_inv_alpha_entries`
+  metadata.  Within-process warm-state win: −9.4 % `hift_total` on
+  multilingual.  See "Round 2 — encoder / HiFT / F0 graph caches" subsection above.
+- ~~**Encoder + F0 + scaffolding caches**~~: ✅ **DONE in round 2** (May 6,
+  2026).  Added `g_encoder_graph_cache`, `g_f0_graph_cache`, plus
+  `cached_pos_emb` / `cached_inv_alpha` / `cached_hann_window` /
+  `cached_istft_kernel` / `cached_window_sum`.  Combined with HiFT
+  graph cache: −11.9 % `S3GEN_INFER` on multilingual.
 - **Round-4 / 6 QKV fusion composition with multilingual_merged's
   strided 3D views** — our batched `mul_mat` (originally landed on
   `main`) and their zero-cont strided views (`849507a`) are
   alternative optimisations targeting the same code; pick one
   approach and bench Vulkan `flash_attn_ext` stride tolerance.
+- **Tensor-core engagement for narrow CFM matmuls** (`cooperative_matrix2`):
+  the round-1 `main`-base CM2 Tier-3 close-out measured **−8.6 % cfm_total** on
+  RTX 5090.  Kept behind a cmake flag pending
+  project-wide baseline-set sign-off.  See `FINDINGS_ROUND_CM2.md`.
 - **Mobile validation** (Adreno / Mali / Apple):
   hardware-bound; biggest remaining evidence gap.  AMD/RADV proxy
   refuted the original mobile-bandwidth projection on the
diff --git a/src/chatterbox_tts.cpp b/src/chatterbox_tts.cpp
index 21695c6..22c00f6 100644
--- a/src/chatterbox_tts.cpp
+++ b/src/chatterbox_tts.cpp
@@ -162,11 +162,15 @@ static ggml_backend_t s3gen_init_backend(int n_gpu_layers, bool verbose) {
 // belong in a server front-end.
 static model_ctx load_s3gen_gguf(const std::string & path, int n_gpu_layers, bool verbose);
 
-// QVAC-17872 round-HIFT: defined later (alongside cfm_estimator_cache).
-// Tears down the persistent CFM estimator graph cache.  Forward-declared
-// here so s3gen_model_cache_release / cache-miss can call it without
-// having to also move the struct definition + global instance up.
-static void g_cfm_estimator_cache_destroy();
+// QVAC-17872 round-HIFT (initial) + round 2 (this PR): tears down every
+// per-synth host-side cache before ggml_backend_free runs.  Includes the
+// CFM estimator graph cache and (added in round 2) the encoder / HiFT /
+// F0 graph caches plus all the scaffolding caches (pos_emb, inv_alpha,
+// hann_window, istft_kernel, window_sum).  Defined later, alongside the
+// cache structs themselves.  Forward-declared here so
+// s3gen_model_cache_release / cache-miss / s3gen_unload can all call it
+// without moving the struct definitions earlier in the file.
+static void s3gen_release_synth_caches();
 
 namespace {
 struct s3gen_cache_entry { std::string path; int gpu = 0; std::unique_ptr<model_ctx> m; };
@@ -183,13 +187,13 @@ static double                                g_s3gen_cache_last_load_ms = 0.0;
 // insertion so it runs before process-exit dylib finalisers.
 static void s3gen_model_cache_release() {
     std::lock_guard<std::mutex> lk(g_s3gen_cache_mu);
-    // QVAC-17872 round-HIFT: tear down the persistent CFM estimator graph
-    // BEFORE freeing the backend.  cfm_estimator_cache.allocr holds Vulkan
-    // (or Metal/CUDA) buffers allocated against the soon-to-be-freed
-    // backend; gallocr_free against a dangling vk_device asserts inside
-    // ggml-vulkan.  Same constraint as the existing thread_local
-    // time_mlp_cache documents.
-    g_cfm_estimator_cache_destroy();
+    // QVAC-17872 round-HIFT + round 2: tear down every persistent host-side
+    // cache BEFORE freeing the backend.  The graph caches own
+    // ggml_gallocr_t handles that hold Vulkan (or Metal/CUDA) buffers
+    // allocated against the soon-to-be-freed backend; gallocr_free against
+    // a dangling vk_device asserts inside ggml-vulkan.  Same constraint as
+    // the existing thread_local time_mlp_cache documents.
+    s3gen_release_synth_caches();
     if (!g_s3gen_cache_entry) return;
     model_ctx * m = g_s3gen_cache_entry->m.get();
     if (m) {
@@ -213,12 +217,12 @@ static model_ctx * s3gen_model_cache_get(const std::string & path, int n_gpu_lay
         g_s3gen_cache_last_load_ms = 0.0;
         return g_s3gen_cache_entry->m.get();
     }
-    // QVAC-17872 round-HIFT: backend swap (different path or n_gpu_layers).
-    // Tear down the persistent CFM estimator cache against the OLD backend
-    // before freeing it, then drop the s3gen_cache_entry.  Same reasoning as
-    // s3gen_model_cache_release.
+    // QVAC-17872 round-HIFT + round 2: backend swap (different path or
+    // n_gpu_layers).  Tear down every persistent cache against the OLD
+    // backend before freeing it, then drop the s3gen_cache_entry.  Same
+    // reasoning as s3gen_model_cache_release.
     if (g_s3gen_cache_entry) {
-        g_cfm_estimator_cache_destroy();
+        s3gen_release_synth_caches();
     }
     if (verbose) fprintf(stderr, "Loading %s\n", path.c_str());
     double t0 = now_ms();
@@ -524,6 +528,98 @@ static ggml_tensor * conformer_block(ggml_context * ctx, const conformer_w & w,
     return ggml_add(ctx, residual, ff);
 }
 
+// ============================================================================
+// QVAC-17872 round 2: persistent graph + scaffolding caches (declarations).
+// ----------------------------------------------------------------------------
+// All host-side, model-agnostic, no GGUF-format change.  Same teardown
+// discipline as g_cfm_estimator_cache (destroy() before ggml_backend_free).
+//
+// Bottlenecks targeted for multilingual on Vulkan (remaining after round 1
+// and round-HIFT shipped):
+//   - run_encoder rebuilds its full graph + gallocr per synth (~17 ms host
+//     overhead on multilingual T=350+).
+//   - run_hift_decode rebuilds its graph + gallocr + computes
+//     hann_window/istft_kernel/window_sum + ~72 inv_alpha tensor_get calls
+//     per synth (~7-10 ms compounded host overhead, multilingual is the
+//     biggest beneficiary because audio length scales with the prompt).
+//   - run_f0_predictor rebuilds its (smaller) graph per synth.
+//   - compute_pos_emb fires twice per encoder run (for T and 2T) at
+//     ~T*D*5 trig ops; multilingual chunks of T~350+ pay several ms.
+//
+// Each cache is process-wide.  The graph caches are single-slot (rebuilt
+// when the key changes); the result maps hold one entry per distinct key
+// and are otherwise unbounded.  Streaming sessions with many varying T
+// values can grow the maps; a future LRU bound would belong here.
+//
+// The cache state lives here (above run_encoder so its definition can use
+// it).  The destroy/clear function `s3gen_release_synth_caches()` is
+// defined later, alongside g_cfm_estimator_cache, since it touches both.
+// ============================================================================
+
+// Generic graph cache used by encoder / HiFT / F0 — same shape, different keys.
+struct graph_cache {
+    int64_t                key = -1;
+    ggml_context *         ctx = nullptr;
+    ggml_cgraph *          gf  = nullptr;
+    ggml_gallocr_t         allocr = nullptr;
+    std::vector<uint8_t>   buf;
+
+    void destroy() {
+        if (allocr) { ggml_gallocr_free(allocr); allocr = nullptr; }
+        if (ctx)    { ggml_free(ctx);            ctx    = nullptr; }
+        gf  = nullptr;
+        key = -1;
+        // Keep `buf` reservation; reusing it avoids a multi-MB malloc on
+        // the next rebuild.
+    }
+};
+
+// Pack (T_mel, T_stft) into a single int64_t key for the HiFT graph cache.
+// Both dimensions are non-negative int32 values in practice, so T_mel in
+// the high 32 bits and T_stft in the low 32 bits cannot collide.
+static int64_t pack_hift_key(int T_mel, int T_stft) {
+    return ((int64_t) T_mel << 32) | (uint32_t) T_stft;
+}
+
+namespace {
+// Single mutex for the round-2 result maps and g_hift_inv_alpha_entries,
+// held only across cache-state access (lookup / insert / clear), not across
+// compute or graph rebuilds.  run_* rebuild the graph_cache slots without
+// it: s3gen_synthesize_to_wav is process-serial in practice (the existing
+// s3gen_cache_entry mutex enforces single-flight model loads).
+static std::mutex g_synth_caches_mu;
+
+// Graph caches.
+static graph_cache g_encoder_graph_cache;   // keyed on T (encoder input length)
+static graph_cache g_hift_graph_cache;      // keyed on pack(T_mel, T_stft)
+static graph_cache g_f0_graph_cache;        // keyed on T_mel
+
+// Parallel metadata for HiFT: the (graph-input-name, model-tensor-ptr)
+// pairs for every alpha tensor referenced by the cached HiFT graph.
+// Read on every run_hift_decode call (cache hit or rebuild) to stage each
+// alpha-input slot from g_inv_alpha_results after gallocr_alloc_graph.
+static std::vector<std::pair<std::string, const ggml_tensor *>> g_hift_inv_alpha_entries;
+
+// Result / scaffolding caches (pure CPU compute).
+static std::unordered_map<int64_t, std::vector<float>>             g_pos_emb_results;
+static std::unordered_map<const ggml_tensor *, std::vector<float>> g_inv_alpha_results;
+static std::unordered_map<int, std::vector<float>>                 g_hann_window_cache;
+static std::unordered_map<int, std::vector<float>>                 g_istft_kernel_cache;
+static std::unordered_map<int64_t, std::vector<float>>             g_window_sum_cache;
+}  // namespace
+
+// Scaffolding-helper forward declarations (definitions live later, alongside
+// the cfm_estimator_cache + cached_cpu_weights_f32 helpers, where the
+// underlying build_* functions are visible).  Declared up here so the
+// graph-build sites that consume them (run_encoder, run_f0_predictor,
+// run_hift_decode) compile.
+static const std::vector<float> & cached_pos_emb(int T, int D);
+static const std::vector<float> & cached_inv_alpha(const model_ctx & m,
+                                                   const std::string & name);
+static const std::vector<float> & cached_hann_window(int n_fft);
+static const std::vector<float> & cached_istft_kernel(int n_fft);
+static const std::vector<float> & cached_window_sum(int T_stft, int n_fft, int hop);
+
 static void compute_pos_emb(std::vector<float> & pe, int T, int D) {
     int L = 2 * T - 1;
     pe.assign(L * D, 0.0f);
@@ -550,15 +646,31 @@ static void compute_pos_emb(std::vector<float> & pe, int T, int D) {
 }
 
 // Run the full S3Gen encoder: input (T, D=512) -> mu (2T, 80)
+// QVAC-17872 round 2: graph + gallocator cached process-wide via
+// g_encoder_graph_cache (keyed on T = encoder input length; D is not part
+// of the key and is assumed constant at 512).  Same-shape calls (e.g.
+// batch synthesis of constant-length prompts, or streaming chunks at a
+// stable T) skip the rebuild + gallocr_reserve.  pos_emb vectors are
+// cached separately by cached_pos_emb, keyed on (T, D).
 static std::vector<float> run_encoder(const model_ctx & m, const std::vector<float> & input_embed, int T, int D = 512) {
     const int H = 8, HEAD_DIM = 64;
     const int T2 = 2 * T;
 
-    static size_t buf_size = 64 * 1024 * 1024;  // plenty
-    std::vector<uint8_t> buf(buf_size);
-    ggml_init_params gp = { buf_size, buf.data(), true };
-    ggml_context * ctx = ggml_init(gp);
-    ggml_cgraph * gf = ggml_new_graph_custom(ctx, 32768, false);
+    graph_cache & cache = g_encoder_graph_cache;
+    const bool build_graph = (cache.key != (int64_t) T) || (cache.ctx == nullptr);
+    if (build_graph) {
+        if (cache.allocr) { ggml_gallocr_free(cache.allocr); cache.allocr = nullptr; }
+        if (cache.ctx)    { ggml_free(cache.ctx);            cache.ctx    = nullptr; }
+        cache.buf.resize(64 * 1024 * 1024);
+        ggml_init_params gp = { cache.buf.size(), cache.buf.data(), true };
+        cache.ctx = ggml_init(gp);
+        cache.gf  = ggml_new_graph_custom(cache.ctx, 32768, false);
+        cache.key = (int64_t) T;
+    }
+    ggml_context * ctx = cache.ctx;
+    ggml_cgraph * gf = cache.gf;
+
+    if (build_graph) {
 
     ggml_tensor * x_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, D, T);
     ggml_set_name(x_in, "x_in"); ggml_set_input(x_in);
@@ -641,22 +753,26 @@ static std::vector<float> run_encoder(const model_ctx & m, const std::vector<flo
     ggml_set_name(mu, "mu"); ggml_set_output(mu);
     ggml_build_forward_expand(gf, mu);
 
-    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
-    ggml_gallocr_reserve(allocr, gf);
-    ggml_gallocr_alloc_graph(allocr, gf);
+    cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
+    ggml_gallocr_reserve(cache.allocr, gf);
+    }  // end build_graph
+
+    ggml_gallocr_alloc_graph(cache.allocr, gf);
     ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "x_in"), input_embed.data(), 0, input_embed.size()*sizeof(float));
 
-    std::vector<float> pe1, pe2;
-    compute_pos_emb(pe1, T, D);
-    compute_pos_emb(pe2, T2, D);
+    // Cached positional embeddings: the same (T, D) keys are reused across
+    // every synth at the same chunk size.  compute_pos_emb is ~T*D*5 trig
+    // ops per call; for multilingual T=350+ at D=512 that costs several ms
+    // of per-synth host time.
+    const std::vector<float> & pe1 = cached_pos_emb(T,  D);
+    const std::vector<float> & pe2 = cached_pos_emb(T2, D);
     ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "pos1"), pe1.data(), 0, pe1.size()*sizeof(float));
     ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "pos2"), pe2.data(), 0, pe2.size()*sizeof(float));
     compute(m.backend, gf);
 
-    std::vector<float> mu_data(ggml_nelements(mu));
-    ggml_backend_tensor_get(mu, mu_data.data(), 0, ggml_nbytes(mu));
-    ggml_gallocr_free(allocr);
-    ggml_free(ctx);
+    ggml_tensor * mu_out = ggml_graph_get_tensor(gf, "mu");
+    std::vector<float> mu_data(ggml_nelements(mu_out));
+    ggml_backend_tensor_get(mu_out, mu_data.data(), 0, ggml_nbytes(mu_out));
     return mu_data;  // shape ggml ne=[T2, 80] = numpy (80, T2)
 }
 
@@ -1012,7 +1128,7 @@ static std::vector<float> compute_time_mixed(const model_ctx & m,
 //   - g_time_emb_results: keyed by uint64_t = (kt << 32) | kr, ONLY
 //     used by Turbo (meanflow) since multilingual doesn't run the mixer.
 //
-// Cleared in g_cfm_estimator_cache_destroy alongside the graph cache.
+// Cleared in s3gen_release_synth_caches alongside the graph cache.
 //
 // Bit-exactness: trivially preserved — same compute, just memoised.
 static std::unordered_map<uint32_t, std::vector<float>> g_time_mlp_results;
@@ -1109,7 +1225,7 @@ static cfm_estimator_cache g_cfm_estimator_cache;
 // synthesize() reads every call (input_embedding lookup table, speaker
 // affine matrix).  These are model constants — on a GPU backend each
 // call previously paid an N MB device→host download per synth.  Cleared
-// in g_cfm_estimator_cache_destroy alongside the graph cache.
+// in s3gen_release_synth_caches alongside the graph cache.
 static std::unordered_map<const ggml_tensor *, std::vector<float>> g_weight_cpu_mirror;
 static std::mutex                                                  g_weight_cpu_mirror_mu;
 
@@ -1128,10 +1244,29 @@ static const float * cached_cpu_weights_f32(const ggml_tensor * t) {
     }
 }
 
-// Forward-declared near s3gen_model_cache_release; defined here so the
-// release path can flush the caches without having to also move the
-// cfm_estimator_cache struct definition + global up.
-static void g_cfm_estimator_cache_destroy() {
+// QVAC-17872 round 2: definition of s3gen_release_synth_caches (forward-
+// declared near s3gen_model_cache_release).  Defined here once the
+// graph_cache + cfm_estimator_cache structs and globals are all visible.
+// Idempotent — safe to call multiple times and from multiple release paths.
+//
+// Order matters: graph caches first (they own gallocr_t handles bound to
+// the still-live backend); then result caches; then the round-1 caches.
+// The graph_cache struct + globals themselves are declared earlier (above
+// run_encoder) — see "QVAC-17872 round 2: persistent graph + scaffolding
+// caches" block.
+static void s3gen_release_synth_caches() {
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        g_encoder_graph_cache.destroy();
+        g_hift_graph_cache.destroy();
+        g_f0_graph_cache.destroy();
+        g_hift_inv_alpha_entries.clear();
+        g_pos_emb_results.clear();
+        g_inv_alpha_results.clear();
+        g_hann_window_cache.clear();
+        g_istft_kernel_cache.clear();
+        g_window_sum_cache.clear();
+    }
     g_cfm_estimator_cache.destroy();
     {
         std::lock_guard<std::mutex> lk(g_time_emb_results_mu);
@@ -1471,13 +1606,118 @@ static std::vector<float> invert_alpha_cpu(const model_ctx & m, const std::strin
     return inv;
 }
 
+// ----------------------------------------------------------------------------
+// QVAC-17872 round 2: scaffolding cache definitions
+// ----------------------------------------------------------------------------
+
+// compute_pos_emb is pure CPU compute (~T * D * 5 trig ops).  It fires
+// twice per encoder run (once for T, once for 2T) — at multilingual
+// chunk size T~350+ that's a noticeable wedge of per-synth host time.
+// Cached by (T, D) (D is constant 512 in the chatterbox model; we still
+// include it in the key for safety against future-variant collisions).
+static const std::vector<float> & cached_pos_emb(int T, int D) {
+    const int64_t key = ((int64_t) T << 32) | (uint32_t) D;
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        auto it = g_pos_emb_results.find(key);
+        if (it != g_pos_emb_results.end()) return it->second;
+    }
+    std::vector<float> pe;
+    compute_pos_emb(pe, T, D);
+    std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+    auto [it, inserted] = g_pos_emb_results.try_emplace(key, std::move(pe));
+    return it->second;
+}
+
+// invert_alpha_cpu is fired ~72× per HiFT call (12 ResBlocks × 6 alpha
+// tensors); each call is a tensor_get + per-element reciprocal.  Alpha
+// tensors are constant for the model lifetime, so cache by tensor* —
+// invalidation tied to s3gen_release_synth_caches (model-context lifetime).
+static const std::vector<float> & cached_inv_alpha(const model_ctx & m,
+                                                   const std::string & name) {
+    ggml_tensor * t = find_tensor(m, name);
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        auto it = g_inv_alpha_results.find(t);
+        if (it != g_inv_alpha_results.end()) return it->second;
+    }
+    auto inv = invert_alpha_cpu(m, name);
+    std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+    auto [it, inserted] = g_inv_alpha_results.try_emplace(t, std::move(inv));
+    return it->second;
+}
+
+// hann_window / istft_kernel are pure functions of n_fft (constant 16 on
+// the chatterbox HiFT path); window_sum also depends on hop and T_stft.
+// Caching them removes the per-synth host-CPU build cost.  That cost is
+// small for n_fft=16, but the same shape-key lookup pattern composes
+// cleanly with the larger HiFT graph cache below.
+static const std::vector<float> & cached_hann_window(int n_fft) {
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        auto it = g_hann_window_cache.find(n_fft);
+        if (it != g_hann_window_cache.end()) return it->second;
+    }
+    auto w = build_hann_window(n_fft, true);
+    std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+    auto [it, inserted] = g_hann_window_cache.try_emplace(n_fft, std::move(w));
+    return it->second;
+}
+
+static const std::vector<float> & cached_istft_kernel(int n_fft) {
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        auto it = g_istft_kernel_cache.find(n_fft);
+        if (it != g_istft_kernel_cache.end()) return it->second;
+    }
+    // Reuse the cached hann window rather than rebuilding it here.
+    auto k = build_istft_kernel(n_fft, cached_hann_window(n_fft));
+    std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+    auto [it, inserted] = g_istft_kernel_cache.try_emplace(n_fft, std::move(k));
+    return it->second;
+}
+
+static const std::vector<float> & cached_window_sum(int T_stft, int n_fft, int hop) {
+    // Pack (n_fft, hop, T_stft) into a single int64 key (16 + 16 + 32 bits).
+    // n_fft and hop are constants on the chatterbox path; encoding them
+    // keeps the cache safe if a future variant uses different values.
+    const int64_t key =
+        ((int64_t)(uint16_t) n_fft << 48) |
+        ((int64_t)(uint16_t) hop   << 32) |
+        (int64_t)(uint32_t) T_stft;
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        auto it = g_window_sum_cache.find(key);
+        if (it != g_window_sum_cache.end()) return it->second;
+    }
+    auto ws = build_window_sum(T_stft, n_fft, hop, cached_hann_window(n_fft));
+    std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+    auto [it, inserted] = g_window_sum_cache.try_emplace(key, std::move(ws));
+    return it->second;
+}
+
 // F0 predictor (mel (80, T) -> f0 (T,))
+//
+// QVAC-17872 round 2: graph + gallocator cached process-wide via
+// g_f0_graph_cache (keyed on T_mel).  Same-shape calls (e.g. streaming
+// chunks at constant T_mel) skip the rebuild + gallocr_reserve.
 static std::vector<float> run_f0_predictor(const model_ctx & m, const std::vector<float> & mel, int T_mel) {
-    static size_t buf_size = 8 * 1024 * 1024;
-    std::vector<uint8_t> buf(buf_size);
-    ggml_init_params gp = { buf_size, buf.data(), true };
-    ggml_context * ctx = ggml_init(gp);
-    ggml_cgraph * gf = ggml_new_graph_custom(ctx, 1024, false);
+    graph_cache & cache = g_f0_graph_cache;
+    const bool build_graph = (cache.key != (int64_t) T_mel) || (cache.ctx == nullptr);
+    if (build_graph) {
+        if (cache.allocr) { ggml_gallocr_free(cache.allocr); cache.allocr = nullptr; }
+        if (cache.ctx)    { ggml_free(cache.ctx);            cache.ctx    = nullptr; }
+        cache.buf.resize(8 * 1024 * 1024);
+        ggml_init_params gp = { cache.buf.size(), cache.buf.data(), true };
+        cache.ctx = ggml_init(gp);
+        cache.gf  = ggml_new_graph_custom(cache.ctx, 1024, false);
+        cache.key = (int64_t) T_mel;
+    }
+    ggml_context * ctx = cache.ctx;
+    ggml_cgraph * gf = cache.gf;
+
+    if (build_graph) {
+
     ggml_tensor * mel_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, T_mel, 80);
     ggml_set_name(mel_in, "mel_in"); ggml_set_input(mel_in);
     ggml_tensor * x = mel_in;
@@ -1505,15 +1745,16 @@ static std::vector<float> run_f0_predictor(const model_ctx & m, const std::vecto
     y = ggml_reshape_1d(ctx, y, T_mel);
     ggml_set_name(y, "out"); ggml_set_output(y);
     ggml_build_forward_expand(gf, y);
-    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
-    ggml_gallocr_reserve(allocr, gf);
-    ggml_gallocr_alloc_graph(allocr, gf);
+    cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
+    ggml_gallocr_reserve(cache.allocr, gf);
+    }  // end build_graph
+
+    ggml_gallocr_alloc_graph(cache.allocr, gf);
     ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "mel_in"), mel.data(), 0, mel.size()*sizeof(float));
     compute(m.backend, gf);
+    ggml_tensor * y_out = ggml_graph_get_tensor(gf, "out");
     std::vector<float> f0(T_mel);
-    ggml_backend_tensor_get(y, f0.data(), 0, ggml_nbytes(y));
-    ggml_gallocr_free(allocr);
-    ggml_free(ctx);
+    ggml_backend_tensor_get(y_out, f0.data(), 0, ggml_nbytes(y_out));
     return f0;
 }
 
@@ -1589,6 +1830,12 @@ static std::vector<float> run_stft(const model_ctx & m, const std::vector<float>
 }
 
 // Full HiFT decode: mel + s_stft -> wav (inlined from mel2wav.cpp)
+// QVAC-17872 round 2: graph + gallocator cached process-wide via
+// g_hift_graph_cache (keyed on pack(T_mel, T_stft)).  Scaffolding
+// (hann_window, istft_kernel, window_sum, ~72 inv_alpha tensors) is also
+// cached, so same-shape calls only stage inputs (tensor_set + cache
+// lookups) before the graph compute.  HiFT is the biggest multilingual
+// beneficiary because audio length scales with prompt length.
 static std::vector<float> run_hift_decode(const model_ctx & m,
                                           const std::vector<float> & mel, int T_mel,
                                           const std::vector<float> & s_stft, int T_stft) {
@@ -1602,30 +1849,50 @@ static std::vector<float> run_hift_decode(const model_ctx & m,
     std::vector<int> src_rb_ksizes = {7, 7, 11};
     std::vector<std::vector<int>> src_rb_dils = {{1,3,5},{1,3,5},{1,3,5}};
 
-    // Thread-local arena: previously this was a fresh `std::vector<uint8_t>
-    // buf(64 MB)` per HiFT call, which forced a 64 MB memset on every
-    // generate (~5–10 ms on M3 Ultra). The buffer is reused across calls;
-    // each ggml_init resets the arena pointer, so we never accumulate stale
-    // tensor metadata between invocations.
-    static const size_t buf_size = 64 * 1024 * 1024;
-    thread_local std::vector<uint8_t> buf(buf_size);
-    ggml_init_params gp = { buf_size, buf.data(), true };
-    ggml_context * ctx = ggml_init(gp);
-    ggml_cgraph * gf = ggml_new_graph_custom(ctx, 131072, false);
+    graph_cache & cache = g_hift_graph_cache;
+    const int64_t cache_key = pack_hift_key(T_mel, T_stft);
+    const bool build_graph = (cache.key != cache_key) || (cache.ctx == nullptr);
+    if (build_graph) {
+        if (cache.allocr) { ggml_gallocr_free(cache.allocr); cache.allocr = nullptr; }
+        if (cache.ctx)    { ggml_free(cache.ctx);            cache.ctx    = nullptr; }
+        // 64 MB arena, same as the pre-cache version.  Reusing the
+        // vector across rebuilds avoids 64 MB of malloc churn when (T_mel,
+        // T_stft) change between streaming chunks.
+        cache.buf.resize(64 * 1024 * 1024);
+        ggml_init_params gp = { cache.buf.size(), cache.buf.data(), true };
+        cache.ctx = ggml_init(gp);
+        cache.gf  = ggml_new_graph_custom(cache.ctx, 131072, false);
+        cache.key = cache_key;
+        // Wipe and re-populate the alpha-input metadata for the new build.
+        // Mutex held briefly; the graph build below runs without the lock
+        // because synthesize() is process-serial in practice.
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        g_hift_inv_alpha_entries.clear();
+    }
+    ggml_context * ctx = cache.ctx;
+    ggml_cgraph * gf = cache.gf;
+
+    if (build_graph) {
 
     ggml_tensor * mel_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, T_mel, MEL);
     ggml_set_name(mel_in, "mel_in"); ggml_set_input(mel_in);
     ggml_tensor * s_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, T_stft, NFFT2);
     ggml_set_name(s_in, "s_in"); ggml_set_input(s_in);
 
-    struct inv_entry { std::string gn; std::vector<float> data; };
-    std::vector<inv_entry> inv_alphas;
     auto mk_inv = [&](const std::string & pref, int C) {
+        // Record the (graph-input-name, source-tensor-ptr) pair so that
+        // run_hift_decode can re-feed each alpha-input slot on every
+        // call.  cached_inv_alpha owns the data; we only need a stable
+        // handle to look it up later.
+        ggml_tensor * src = find_tensor(m, pref);
+        (void) cached_inv_alpha(m, pref);  // warm the data cache
         std::string gn = "inv_" + pref;
-        auto inv = invert_alpha_cpu(m, pref);
         ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, C);
         ggml_set_name(t, gn.c_str()); ggml_set_input(t);
-        inv_alphas.push_back({gn, std::move(inv)});
+        {
+            std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+            g_hift_inv_alpha_entries.emplace_back(std::move(gn), src);
+        }
         return t;
     };
 
@@ -1715,19 +1982,19 @@ static std::vector<float> run_hift_decode(const model_ctx & m,
     ggml_tensor * imag = ggml_mul(ctx, mag, ggml_sin(ctx, ph));
     ggml_tensor * spec = ggml_concat(ctx, real, imag, 1);
 
-    auto window = build_hann_window(n_fft, true);
-    auto ik = build_istft_kernel(n_fft, window);
-    auto ws = build_window_sum(T_stft, n_fft, hop, window);
+    // Cached scaffolding sizes — pure functions of (n_fft, hop, T_stft).
+    // Build the input-tensor declarations against the cached vector sizes.
+    const std::vector<float> & ws_for_size = cached_window_sum(T_stft, n_fft, hop);
 
     ggml_tensor * istft_k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_fft, 1, 2 * F);
     ggml_set_name(istft_k, "istft_k"); ggml_set_input(istft_k);
-    ggml_tensor * ws_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (int)ws.size(), 1);
+    ggml_tensor * ws_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (int)ws_for_size.size(), 1);
     ggml_set_name(ws_in, "w_sum"); ggml_set_input(ws_in);
 
     ggml_tensor * y = ggml_conv_transpose_1d(ctx, istft_k, spec, hop, 0, 1);
     y = ggml_div(ctx, y, ws_in);
     int pad_amt = n_fft / 2;
-    int L_wav = (int)ws.size() - n_fft;
+    int L_wav = (int)ws_for_size.size() - n_fft;
     // QVAC-17872 round-HIFT (2026-05-04): drop the trailing ggml_cont.  The
     // view's only consumer is ggml_clamp (element-wise, accepts strided
     // src0); clamp's output is a fresh contiguous tensor allocated by the
@@ -1739,21 +2006,48 @@ static std::vector<float> run_hift_decode(const model_ctx & m,
     ggml_set_name(y_trim, "wav"); ggml_set_output(y_trim);
     ggml_build_forward_expand(gf, y_trim);
 
-    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
-    ggml_gallocr_reserve(allocr, gf);
-    ggml_gallocr_alloc_graph(allocr, gf);
-    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "mel_in"), mel.data(), 0, mel.size()*sizeof(float));
-    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "s_in"), s_stft.data(), 0, s_stft.size()*sizeof(float));
-    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "istft_k"), ik.data(), 0, ik.size()*sizeof(float));
-    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "w_sum"), ws.data(), 0, ws.size()*sizeof(float));
-    for (auto & ia : inv_alphas)
-        ggml_backend_tensor_set(ggml_graph_get_tensor(gf, ia.gn.c_str()), ia.data.data(), 0, ia.data.size()*sizeof(float));
+    cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
+    ggml_gallocr_reserve(cache.allocr, gf);
+    }  // end build_graph
+
+    // Cached scaffolding (pulled outside build_graph too — when the graph
+    // is reused, ik / ws data still need to be staged into the input
+    // tensors).  cached_* helpers are O(1) on hits.
+    const std::vector<float> & ik_data = cached_istft_kernel(n_fft);
+    const std::vector<float> & ws_data = cached_window_sum(T_stft, n_fft, hop);
+
+    ggml_gallocr_alloc_graph(cache.allocr, gf);
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "mel_in"),  mel.data(),    0, mel.size()*sizeof(float));
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "s_in"),    s_stft.data(), 0, s_stft.size()*sizeof(float));
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "istft_k"), ik_data.data(),0, ik_data.size()*sizeof(float));
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "w_sum"),   ws_data.data(),0, ws_data.size()*sizeof(float));
+    // Re-feed every alpha-input slot from the cached data.  The (graph-
+    // input-name, source-tensor-ptr) pairs were captured during the
+    // graph build; cached_inv_alpha is the source of truth for the data
+    // (keyed by source tensor pointer, so the entry survives across
+    // graph rebuilds — only s3gen_release_synth_caches drops it).
+    //
+    // Snapshot g_hift_inv_alpha_entries under the mutex (cheap; ~72
+    // string + pointer pairs), then iterate WITHOUT the lock.  Each
+    // cached_inv_alpha call below takes the same mutex internally, so
+    // holding it across the loop would deadlock.
+    std::vector<std::pair<std::string, const ggml_tensor *>> entries_snapshot;
+    {
+        std::lock_guard<std::mutex> lk(g_synth_caches_mu);
+        entries_snapshot = g_hift_inv_alpha_entries;
+    }
+    for (const auto & e : entries_snapshot) {
+        ggml_tensor * src = const_cast<ggml_tensor *>(e.second);
+        const std::string src_name = ggml_get_name(src);
+        const std::vector<float> & inv = cached_inv_alpha(m, src_name);
+        ggml_backend_tensor_set(ggml_graph_get_tensor(gf, e.first.c_str()),
+                                inv.data(), 0, inv.size()*sizeof(float));
+    }
     compute(m.backend, gf);
 
-    std::vector<float> wav(ggml_nelements(y_trim));
-    ggml_backend_tensor_get(y_trim, wav.data(), 0, ggml_nbytes(y_trim));
-    ggml_gallocr_free(allocr);
-    ggml_free(ctx);
+    ggml_tensor * y_trim_out = ggml_graph_get_tensor(gf, "wav");
+    std::vector<float> wav(ggml_nelements(y_trim_out));
+    ggml_backend_tensor_get(y_trim_out, wav.data(), 0, ggml_nbytes(y_trim_out));
     return wav;
 }