diff --git a/parakeet-cpp/patches/README.md b/parakeet-cpp/patches/README.md
deleted file mode 100644
index d55e53e27a4..00000000000
--- a/parakeet-cpp/patches/README.md
+++ /dev/null
@@ -1,264 +0,0 @@
-# ggml patches for parakeet.cpp
-
-`ggml` is vendored as a pristine upstream clone (see the top-level
-[`README.md`](../README.md) and [`scripts/setup-ggml.sh`](../scripts/setup-ggml.sh)),
-so the local fixes parakeet.cpp depends on live here as standalone
-patches and are applied after the clone.
-
-Three patches ship today:
-
-1. [`ggml-backend-reg-filename-prefix.patch`](#ggml-backend-reg-filename-prefixpatch)
-   — teaches `ggml_backend_load_best()` to honour a compile-time
-   `GGML_BACKEND_DL_PROJECT_PREFIX` macro, so renaming the bundled
-   backend .so/.dll files (parakeet does this to avoid colliding with
-   another consumer's `libggml-*` files in the same host process) does
-   not break runtime backend discovery under `GGML_BACKEND_DL=ON`.
-   No-op when the macro is undefined.
-2. [`ggml-opencl-allow-non-adreno.patch`](#ggml-opencl-allow-non-adrenopatch)
-   — lets the OpenCL backend bring up on commodity desktop GPUs
-   (NVIDIA, AMD, Apple) so `parakeet.cpp` can be built and parity-
-   tested with `-DGGML_OPENCL=ON` outside Adreno-only environments.
-   No-op on real Adreno targets (the patch only relaxes the rejection
-   of unknown GPU vendors and the assertion in
-   `ggml_backend_opencl_init()` when no devices were found).
-3. [`ggml-opencl-program-binary-cache.patch`](#ggml-opencl-program-binary-cachepatch)
-   — adds a persistent on-disk cache for compiled OpenCL kernel
-   binaries, removing the multi-second `clBuildProgram` wave at every
-   cold start. Honours `$GGML_OPENCL_CACHE_DIR`, with
-   `$XDG_CACHE_HOME/ggml/opencl` → `$HOME/.cache/ggml/opencl`
-   fallbacks. Opt-out via `GGML_OPENCL_CACHE_DIR=""`.
-
-`scripts/setup-ggml.sh` applies every `patches/ggml-*.patch` in
-lexicographic order; the script is idempotent and resets the ggml
-worktree to the pinned commit before applying.
-
-## Apply
-
-The top-level [`scripts/setup-ggml.sh`](../scripts/setup-ggml.sh) does
-everything for you:
-
-```bash
-# From the repo root.  Clones ggml if needed, checks out the pinned
-# commit, and applies every patch under patches/.  Idempotent --
-# re-running is a no-op.
-./scripts/setup-ggml.sh
-```
-
-Then configure and build as usual. Pick the backend flags for your
-platform; OpenCL builds pick up both OpenCL patches automatically:
-
-```bash
-# Apple Silicon
-cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON
-
-# NVIDIA / desktop
-cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
-
-# Vulkan (anything else)
-cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
-
-# OpenCL: Adreno (Android) target
-cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_OPENCL=ON
-
-# OpenCL: NVIDIA / AMD / Apple desktop (dev / CI parity testing) --
-# Adreno-tuned matmul kernels OFF, generic OpenCL paths only:
-cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
-    -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=OFF
-```
-
-If you'd rather run the steps by hand (e.g. to pin a different
-upstream commit), the script is effectively:
-
-```bash
-git clone https://github.com/ggml-org/ggml.git ggml
-cd ggml && git checkout $GGML_COMMIT
-git apply ../patches/ggml-backend-reg-filename-prefix.patch
-git apply ../patches/ggml-opencl-allow-non-adreno.patch
-git apply ../patches/ggml-opencl-program-binary-cache.patch
-```
-
-`GGML_COMMIT` lives at the top of `scripts/setup-ggml.sh` as the
-single source of truth -- bump it when re-generating the patches
-against a newer upstream ggml. To confirm everything applied
-cleanly:
-
-```bash
-(cd ggml && git status --short)
-# Expected: 2 modified files
-#   src/ggml-backend-reg.cpp             (filename-prefix patch)
-#   src/ggml-opencl/ggml-opencl.cpp      (both OpenCL patches stack on this file)
-```
-
-CPU / CUDA / Metal / Vulkan builds get the pinned commit and the
-filename-prefix patch (which is a strict no-op when the host
-project does not define `GGML_BACKEND_DL_PROJECT_PREFIX`); the
-OpenCL changes are no-op for every other backend.
-
-## `ggml-backend-reg-filename-prefix.patch`
-
-Base commit: `58c38058` (`sync : llama.cpp`, 2026-04-09).
-
-Adds a single compile-time switch
-`GGML_BACKEND_DL_PROJECT_PREFIX` to `ggml_backend_load_best()` so
-the runtime backend-discovery walk can be retargeted at the
-filename prefix used by a host project that renames the bundled
-`libggml-*` files to avoid colliding with another consumer's
-`libggml-*` files in the same host process.
-
-Background: parakeet ships its bundled ggml backends as
-`libspeech-ggml-*.{so,dll}` (CMake option
-`PARAKEET_GGML_LIB_PREFIX=ON`, default) so a host process that
-loads two consumers each vendoring its own ggml does not see a
-name clash on `libggml-vulkan.so` / `libggml-cuda.so` / etc. The
-`speech-` prefix is shared with the rest of the QVAC speech stack
-(whisper, parakeet, chatterbox, supertonic, ...) so the family
-co-vendors a single ggml file set.
-Without this patch, the rename works at link time but
-`ggml_backend_load_best()` still searches for `libggml-*.so` /
-`ggml-*.dll`, so under `GGML_BACKEND_DL=ON` the renamed files are
-on disk but never discovered and Vulkan/OpenCL/CUDA backends
-silently fail to load.
-
-| Symptom | Root cause | What this patch does |
-|---------|-----------|----------------------|
-| `speech-ggml-vulkan.so` (etc.) is on disk but ggml's loader never picks it up under `GGML_BACKEND_DL=ON` | `backend_filename_prefix()` hard-codes `libggml-` / `ggml-` and `ggml_backend_load_best` filters directory entries by that fixed prefix | Honour an optional compile-time `GGML_BACKEND_DL_PROJECT_PREFIX` string literal (e.g. `"speech-"`); when defined, the loader searches for `lib<prefix>ggml-*` / `<prefix>ggml-*` instead. Macro undefined ⇒ behaviour byte-equal to upstream. |
-
-The CMake side wires the macro from `PARAKEET_GGML_LIB_PREFIX`:
-when that option is on (the default), parakeet's top-level
-`CMakeLists.txt` does
-`target_compile_definitions(ggml PRIVATE GGML_BACKEND_DL_PROJECT_PREFIX="speech-")`
-on the `ggml` target (which is what compiles
-`ggml-backend-reg.cpp`). Consumers that prefer the upstream
-filenames (system ggml, single-consumer hosts) configure with
-`-DPARAKEET_GGML_LIB_PREFIX=OFF` and the macro stays undefined,
-so the loader behaviour matches stock ggml exactly.
-
-## `ggml-opencl-allow-non-adreno.patch`
-
-Base commit: `58c38058` (`sync : llama.cpp`, 2026-04-09).
-
-Fixes two gaps in `ggml-opencl` that make `-DGGML_OPENCL=ON` builds of
-`parakeet.cpp` impossible to bring up outside an Adreno-only
-environment:
-
-| Symptom                                                                                                | Root cause in `ggml-opencl`                                                                                                                                                                                                                                                                                            | What this patch does                                                                                                                                                                                                          |
-|--------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Every NVIDIA / AMD / Apple OpenCL device is dropped at init with `Unsupported GPU: <device-name>`      | `ggml_cl2_init()` whitelists `Adreno` / `Qualcomm` / `Intel` and returns `nullptr` for everything else. Even with `-DGGML_OPENCL_USE_ADRENO_KERNELS=OFF`, a non-Adreno GPU never reaches the generic kernels.                                                                                                           | Default behaviour is byte-equal to upstream (still returns `nullptr`). Set `GGML_OPENCL_ALLOW_UNKNOWN_GPU=1` to opt the device through with `GPU_FAMILY::UNKNOWN`; we additionally require `cl_intel_required_subgroup_size` *or* `cl_qcom_reqd_sub_group_size` (the matmul-vec kernels need one to define `N_DST`/`N_SIMDGROUP`/`N_SIMDWIDTH`), so AMD/NVIDIA still fall back to host instead of crashing in `clBuildProgram`. |
-| `parakeet --n-gpu-layers 1` aborts with `GGML_ASSERT(index < ggml_backend_opencl_reg_device_count(reg))` when zero usable devices were found | `ggml_backend_opencl_init()` calls `ggml_backend_reg_dev_get(reg, 0)` unconditionally. When the device discovery cleared the list (e.g. only an unsupported GPU was present), `dev_get(0)` asserts and the host process aborts. parakeet's `init_gpu_backend()` cascade expects a nullable result so it can fall back. | Check `ggml_backend_reg_dev_count(reg) == 0` before `dev_get` and return `nullptr` on empty. Also propagate `nullptr` when `ggml_cl2_init()` rejects the device, so the host-side fallback path actually runs.                |
-
-The patch is **strictly additive** for real Adreno targets:
-`gpu_family == ADRENO` is computed exactly as before, the Adreno
-shuffle / large-buffer paths still trigger when (and only when) the
-device is Adreno, and without `GGML_OPENCL_ALLOW_UNKNOWN_GPU=1` the
-non-Adreno reject path is byte-equal to upstream so production Android
-builds get the same compile-time guarantees as before.
-
-The intended audience for the patch is:
-
-  * `parakeet.cpp` developers running CI on Intel iGPU desktop
-    hardware (the matmul-vec kernels gate on
-    `cl_intel_required_subgroup_size`, so Intel iGPU is the only
-    desktop class that can actually execute the OpenCL kernels;
-    AMD/NVIDIA users get a clean CPU fallback instead of crashing
-    inside `clBuildProgram`).
-  * Anyone who wants to reproduce the OpenCL backend's mel/encoder
-    parity numbers without an Adreno device.
-
-Opt-in is gated behind `GGML_OPENCL_ALLOW_UNKNOWN_GPU=1` so misconfigured
-production builds still get the same explicit `Unsupported GPU` error
-upstream returned, instead of a silent "running with an untested GPU".
-
-It is **not** intended to ship a fast OpenCL path on NVIDIA / AMD /
-Apple desktops (CUDA / Vulkan / Metal are far better suited there);
-its only purpose is bring-up + parity testing.
-
-## `ggml-opencl-program-binary-cache.patch`
-
-Base commit: `58c38058` (`sync : llama.cpp`, 2026-04-09).
-
-Adds a persistent on-disk cache for compiled OpenCL kernel binaries
-to `ggml-opencl`. Upstream `build_program_from_source()` calls
-`clCreateProgramWithSource` + `clBuildProgram` on every cold start,
-re-paying the driver's shader-compile wave (multiple seconds on
-Adreno / Mesa / Mali; tens of ms on most desktop drivers). This
-patch instead calls `clCreateProgramWithBinary` against a
-device-specific cache blob whenever one exists, and persists every
-freshly compiled program back to disk on a miss.
-
-| Symptom                                                                                | Root cause                                                                              | What this patch does                                                                                              |
-|----------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
-| Every cold-start `parakeet --n-gpu-layers 1` re-compiles all 88 OpenCL kernels    | `build_program_from_source` always calls `clCreateProgramWithSource` + `clBuildProgram` | Look up `<cache_dir>/<key>.bin` first via `clCreateProgramWithBinary`; only fall through to source compile on miss |
-| Hosts already `setenv` `GGML_OPENCL_CACHE_DIR` for the same goal, but ggml-opencl ignores it | The env var is read **nowhere** in upstream ggml-opencl at this commit  | Resolves cache dir from `$GGML_OPENCL_CACHE_DIR` → `$XDG_CACHE_HOME/ggml/opencl` → `$HOME/.cache/ggml/opencl`, so the env-var contract takes effect. |
-
-### Cache key
-
-`<src_hash>_<opts_hash>_<driver_hash>_<dev_name_hash>_<dev_ver_hash>.bin`,
-where each component is FNV-1a-64. Each kernel's `program_buffer`
-hashes independently (88 different cache files per device); a
-driver upgrade or moving to a different device silently invalidates
-the cache because either `driver_hash` or `dev_*_hash` changes.
-There is no manual invalidation step.
-
-### Atomic writes
-
-The cache writer dumps `clGetProgramInfo(CL_PROGRAM_BINARIES)` to
-`<path>.tmp`, `fsync`s it, then `rename(2)`s it into place. POSIX
-rename is atomic, so concurrent processes can't read a half-written
-file; the last-writer-wins result is fine because each blob is
-independently valid for the same `(src, opts, driver, dev)` combination.
-
-### Footprint
-
-Each kernel binary lands at ~10-200 KB on Adreno (driver-dependent);
-88 kernels × ~50 KB average ≈ 4-5 MB on disk per device per process
-family. No size cap on disk today -- if it ever becomes a concern
-on tightly-budgeted mobile installs, wrap the writer with a
-ceiling.
-
-### Opt-out / disable
-
-`GGML_OPENCL_CACHE_DIR=""` (literal empty string) short-circuits
-both the read and the write paths and runs the original
-source-compile route. Useful for benchmarking the cold-start cost,
-or in a CI runner that wants every run to re-compile.
-
-When the cache dir resolves but `mkdir -p` fails (read-only
-filesystem, permissions, ...), the writer logs nothing and falls
-through to source compile silently -- no behavioural difference
-versus running with the patch absent.
-
-### Stale-cache handling
-
-`clCreateProgramWithBinary` can return `CL_INVALID_BINARY` (or the
-subsequent `clBuildProgram` can fail) when the on-disk blob is
-stale (driver upgrade, different shader IR version, mismatched
-device). The patch handles every such failure by releasing the
-program and falling through to source compile. The next run then
-overwrites the bad blob.
-
-### Measured impact
-
-This patch is **not yet benchmarked on a real Adreno device**: the
-benchmark hosts the patch was developed on are NVIDIA-only, and
-NVIDIA's OpenCL driver lacks the fp16 / OpenCL C 2.0 features
-ggml-opencl mandates -- the kernels never compile at all there, so
-there is nothing to cache. Expected impact:
-
-  * **Cold start (no cache)**: same as upstream -- multi-second
-    shader compile wave on Adreno.
-  * **Warm cache** (any subsequent invocation): saves the entire
-    `clBuildProgram` wave; typical Adreno saving is multiple
-    seconds per process.
-
-Once Adreno hardware is available for follow-up benchmarking, the
-expected bench shape is the standard pipeline-cache curve:
-cold ≫ ggml-warm ≈ both-warm.
-
-## Dropping the patches
-
-If upstream ggml-opencl decides to relax the GPU-vendor whitelist
-itself, or ships its own kernel binary cache, delete the patch
-file(s) and remove the corresponding entry from the `PATCHES=(…)`
-glob in `scripts/setup-ggml.sh`. The C++ side of parakeet uses
-only ops that ggml-opencl already supports natively (per the
-op-coverage audit), so nothing else needs to change.
diff --git a/parakeet-cpp/patches/ggml-backend-reg-filename-prefix.patch b/parakeet-cpp/patches/ggml-backend-reg-filename-prefix.patch
deleted file mode 100644
index e5e824e592c..00000000000
--- a/parakeet-cpp/patches/ggml-backend-reg-filename-prefix.patch
+++ /dev/null
@@ -1,35 +0,0 @@
-diff --git a/src/ggml-backend-reg.cpp b/src/ggml-backend-reg.cpp
---- a/src/ggml-backend-reg.cpp
-+++ b/src/ggml-backend-reg.cpp
-@@ -442,12 +442,31 @@ static std::string get_executable_path() {
- #endif
- }
- 
-+// parakeet patch: allow consuming projects to override the backend
-+// shared-library filename prefix at compile time. Without this, the
-+// loader hard-codes "ggml-" (Windows) / "libggml-" (other), so two
-+// addons that vendor different ggml versions and rename their bundled
-+// backend .so/.dll files to avoid filename collisions still cannot be
-+// loaded with `GGML_BACKEND_DL=ON`: the discovery walk in
-+// `ggml_backend_load_best` only matches the unprefixed names. Define
-+// `GGML_BACKEND_DL_PROJECT_PREFIX` (a string literal, e.g.
-+// "speech-") at compile time and the loader will instead search for
-+// "<prefix>ggml-*" / "lib<prefix>ggml-*". Default behaviour (macro
-+// undefined) is byte-equal to upstream.
- static fs::path backend_filename_prefix() {
-+#if defined(GGML_BACKEND_DL_PROJECT_PREFIX)
-+#ifdef _WIN32
-+    return fs::u8path(GGML_BACKEND_DL_PROJECT_PREFIX "ggml-");
-+#else
-+    return fs::u8path("lib" GGML_BACKEND_DL_PROJECT_PREFIX "ggml-");
-+#endif
-+#else
- #ifdef _WIN32
-     return fs::u8path("ggml-");
- #else
-     return fs::u8path("libggml-");
- #endif
-+#endif
- }
- 
- static fs::path backend_filename_extension() {
diff --git a/parakeet-cpp/patches/ggml-opencl-allow-non-adreno.patch b/parakeet-cpp/patches/ggml-opencl-allow-non-adreno.patch
deleted file mode 100644
index 458c10f8768..00000000000
--- a/parakeet-cpp/patches/ggml-opencl-allow-non-adreno.patch
+++ /dev/null
@@ -1,91 +0,0 @@
-diff --git a/src/ggml-opencl/ggml-opencl.cpp b/src/ggml-opencl/ggml-opencl.cpp
-index 6f3fc588..96942915 100644
---- a/src/ggml-opencl/ggml-opencl.cpp
-+++ b/src/ggml-opencl/ggml-opencl.cpp
-@@ -3020,9 +3020,57 @@ static ggml_backend_opencl_context * ggml_cl2_init(ggml_backend_dev_t dev) {
-     } else if (strstr(dev_ctx->device_name.c_str(), "Intel")) {
-         backend_ctx->gpu_family = GPU_FAMILY::INTEL;
-     } else {
--        GGML_LOG_ERROR("Unsupported GPU: %s\n", dev_ctx->device_name.c_str());
-+        // parakeet patch: upstream ggml-opencl rejects any GPU that is
-+        // not Adreno/Qualcomm or Intel. Parakeet's real OpenCL deployment
-+        // target is Adreno (Android); for desktop dev/CI parity on Intel
-+        // iGPUs we let the device through with `gpu_family = UNKNOWN`
-+        // when the host opts in via `GGML_OPENCL_ALLOW_UNKNOWN_GPU=1`.
-+        //
-+        // Default (env var unset) preserves upstream behaviour byte-equal,
-+        // so production Adreno builds get no behavioural change and a
-+        // misconfigured non-Adreno consumer gets the same clear error as
-+        // before instead of crashing later in kernel-compile.
-+        //
-+        // The matmul-vec kernels (mul_mv_q4_0_f32_v.cl etc.) auto-define
-+        // INTEL_GPU / ADRENO_GPU based on `cl_intel_required_subgroup_size`
-+        // / `cl_qcom_reqd_sub_group_size`. Without one of those extensions
-+        // the kernel source has no way to define N_DST / N_SIMDGROUP /
-+        // N_SIMDWIDTH and `clBuildProgram` aborts the host process. So we
-+        // additionally require one of those two extensions before letting
-+        // the device through; AMD/NVIDIA desktop drivers expose neither
-+        // and now fall back cleanly to CPU instead of crashing.
-+        const char * allow = getenv("GGML_OPENCL_ALLOW_UNKNOWN_GPU");
-+        if (!allow || allow[0] != '1') {
-+            GGML_LOG_ERROR("Unsupported GPU: %s\n", dev_ctx->device_name.c_str());
-+            backend_ctx->gpu_family = GPU_FAMILY::UNKNOWN;
-+            return nullptr;
-+        }
-+
-+        size_t ext_size = 0;
-+        clGetDeviceInfo(dev_ctx->device, CL_DEVICE_EXTENSIONS, 0, NULL, &ext_size);
-+        std::string ext;
-+        if (ext_size > 0) {
-+            ext.resize(ext_size);
-+            clGetDeviceInfo(dev_ctx->device, CL_DEVICE_EXTENSIONS, ext_size, ext.data(), NULL);
-+        }
-+        const bool has_intel_sg = ext.find("cl_intel_required_subgroup_size") != std::string::npos;
-+        const bool has_qcom_sg  = ext.find("cl_qcom_reqd_sub_group_size")     != std::string::npos;
-+        if (!has_intel_sg && !has_qcom_sg) {
-+            GGML_LOG_ERROR("ggml_opencl: GPU '%s' has neither cl_intel_required_subgroup_size "
-+                "nor cl_qcom_reqd_sub_group_size; matmul-vec kernels cannot define "
-+                "N_DST/N_SIMDGROUP/N_SIMDWIDTH and clBuildProgram would abort. "
-+                "Falling back to host (parakeet patch).\n",
-+                dev_ctx->device_name.c_str());
-+            backend_ctx->gpu_family = GPU_FAMILY::UNKNOWN;
-+            return nullptr;
-+        }
-+
-+        GGML_LOG_WARN("ggml_opencl: GPU '%s' is not Adreno/Qualcomm or Intel; "
-+                      "running with generic OpenCL kernels (parakeet patch + "
-+                      "GGML_OPENCL_ALLOW_UNKNOWN_GPU=1). "
-+                      "Adreno-specific kernels and large-buffer paths stay off.\n",
-+                      dev_ctx->device_name.c_str());
-         backend_ctx->gpu_family = GPU_FAMILY::UNKNOWN;
--        return nullptr;
-     }
- 
- #ifdef GGML_OPENCL_USE_ADRENO_KERNELS
-@@ -4075,8 +4123,25 @@ static ggml_backend_i ggml_backend_opencl_i = {
- };
- 
- ggml_backend_t ggml_backend_opencl_init(void) {
--    ggml_backend_dev_t dev = ggml_backend_reg_dev_get(ggml_backend_opencl_reg(), 0);
-+    // parakeet patch: bail out cleanly when the OpenCL backend
-+    // discovery saw zero usable devices. Upstream calls
-+    // ggml_backend_reg_dev_get() unconditionally, which asserts on an
-+    // empty device list. Parakeet's host code expects a nullable result
-+    // from ggml_backend_opencl_init() (it falls back to CPU when the
-+    // returned backend is null); the assertion makes that fallback path
-+    // unreachable on hosts where ggml-opencl can't find any GPU it
-+    // accepts (Adreno-only environments without an Adreno device,
-+    // headless CI runners, etc.).
-+    ggml_backend_reg_t reg = ggml_backend_opencl_reg();
-+    if (ggml_backend_reg_dev_count(reg) == 0) {
-+        return nullptr;
-+    }
-+
-+    ggml_backend_dev_t dev = ggml_backend_reg_dev_get(reg, 0);
-     ggml_backend_opencl_context *backend_ctx = ggml_cl2_init(dev);
-+    if (backend_ctx == nullptr) {
-+        return nullptr;
-+    }
- 
-     ggml_backend_t backend = new ggml_backend {
-         /* .guid    = */ ggml_backend_opencl_guid(),
diff --git a/parakeet-cpp/patches/ggml-opencl-program-binary-cache.patch b/parakeet-cpp/patches/ggml-opencl-program-binary-cache.patch
deleted file mode 100644
index bdf15bf2169..00000000000
--- a/parakeet-cpp/patches/ggml-opencl-program-binary-cache.patch
+++ /dev/null
@@ -1,269 +0,0 @@
-diff --git a/src/ggml-opencl/ggml-opencl.cpp b/src/ggml-opencl/ggml-opencl.cpp
-index 96942915..7c2e4bc2 100644
---- a/src/ggml-opencl/ggml-opencl.cpp
-+++ b/src/ggml-opencl/ggml-opencl.cpp
-@@ -20,6 +20,7 @@
- 
- #include <cstddef>
- #include <cstdint>
-+#include <cstdio>
- #include <fstream>
- #include <vector>
- #include <string>
-@@ -29,6 +30,32 @@
- #include <charconv>
- #include <mutex>
- 
-+// parakeet patch: persistent kernel binary cache support. The
-+// helpers below sit on POSIX file primitives (mkdir/unlink/fsync) but
-+// also need to build on MinGW / MSVC where those names map to the
-+// `_`-prefixed Windows variants and mkdir takes a single argument.
-+// Wrap them in parakeet_* macros so the rest of the patch stays
-+// platform-agnostic.
-+#include <cerrno>
-+#include <fcntl.h>
-+#include <sys/stat.h>
-+#ifdef _WIN32
-+#  include <direct.h>
-+#  include <io.h>
-+#  define parakeet_mkdir(path)   _mkdir(path)
-+#  define parakeet_unlink(path)  _unlink(path)
-+#  define parakeet_open_ro(path) _open((path), _O_RDONLY | _O_BINARY)
-+#  define parakeet_close(fd)     _close(fd)
-+#  define parakeet_fsync(fd)     _commit(fd)
-+#else
-+#  include <unistd.h>
-+#  define parakeet_mkdir(path)   mkdir((path), 0755)
-+#  define parakeet_unlink(path)  unlink(path)
-+#  define parakeet_open_ro(path) open((path), O_RDONLY)
-+#  define parakeet_close(fd)     close(fd)
-+#  define parakeet_fsync(fd)     fsync(fd)
-+#endif
-+
- #undef MIN
- #undef MAX
- #define MIN(a, b) ((a) < (b) ? (a) : (b))
-@@ -755,6 +782,193 @@ inline std::string read_file(const std::string &path) {
-   return text;
- }
- 
-+// parakeet patch: persistent OpenCL kernel-binary cache.
-+// ggml-opencl as shipped at this commit JIT-compiles every embedded
-+// kernel via `clBuildProgram(clCreateProgramWithSource)` on each cold
-+// start. On Adreno that's tens of seconds of shader compile per
-+// process invocation; on Mesa / Mali / iGPU drivers it's similar.
-+// This patch caches the device-specific compiled binaries under
-+// `$GGML_OPENCL_CACHE_DIR` (or `$XDG_CACHE_HOME/ggml/opencl` →
-+// `$HOME/.cache/ggml/opencl` fallback) keyed on 64-bit FNV-1a hashes of
-+// (source, compile_opts, driver_version, device_name, device_version).
-+// Cache hit -> `clCreateProgramWithBinary`; miss / corrupted blob ->
-+// fall through to source compile and write the resulting binary back.
-+//
-+// The opt-out path is `GGML_OPENCL_CACHE_DIR=""` (empty string) which
-+// short-circuits the cache and runs the original source path. With no
-+// cache directory writable, the helper logs nothing and falls
-+// through to source compile silently.
-+//
-+// Hosts that already `setenv("GGML_OPENCL_CACHE_DIR", ...)` to point
-+// the runtime at a writable location (typical pattern on Android
-+// Adreno deployments) get the cache for free; this patch makes that
-+// env-var contract take effect rather than being ignored upstream.
-+
-+static uint64_t fnv1a_hash64(const void * data, size_t n) {
-+    const uint8_t * p = static_cast<const uint8_t *>(data);
-+    uint64_t h = 0xcbf29ce484222325ULL;
-+    for (size_t i = 0; i < n; ++i) {
-+        h ^= p[i];
-+        h *= 0x100000001b3ULL;
-+    }
-+    return h;
-+}
-+
-+static std::string opencl_cache_dir(cl_device_id dev) {
-+    const char * env = getenv("GGML_OPENCL_CACHE_DIR");
-+    if (env && *env == '\0') return ""; // explicit opt-out: empty string
-+    if (env && *env != '\0') return env;
-+    if (const char * xdg = getenv("XDG_CACHE_HOME"); xdg && *xdg) {
-+        return std::string(xdg) + "/ggml/opencl";
-+    }
-+    if (const char * home = getenv("HOME"); home && *home) {
-+        return std::string(home) + "/.cache/ggml/opencl";
-+    }
-+    GGML_UNUSED(dev);
-+    return ""; // no plausible default; opt out gracefully
-+}
-+
-+static bool opencl_mkdir_p(const std::string & path) {
-+    // Lightweight `mkdir -p` without C++17 <filesystem> dep on the
-+    // ggml-opencl side (some downstream consumers compile against
-+    // libstdc++ versions where std::filesystem requires linking
-+    // -lstdc++fs explicitly). Returns true if the directory exists
-+    // afterwards.
-+    if (path.empty()) return false;
-+    std::string cur;
-+    cur.reserve(path.size());
-+    for (size_t i = 0; i <= path.size(); ++i) {
-+        const char c = i < path.size() ? path[i] : '/';
-+        if ((c == '/' || c == '\\') && !cur.empty()) {
-+            if (parakeet_mkdir(cur.c_str()) != 0 && errno != EEXIST) {
-+                return false;
-+            }
-+        }
-+        if (i < path.size()) cur.push_back(c);
-+    }
-+    return true;
-+}
-+
-+static std::string opencl_cache_key(const char * program_buffer,
-+                                    size_t program_size,
-+                                    const std::string & compile_opts,
-+                                    cl_device_id dev) {
-+    // Combine source + opts + device + driver into the cache key so a
-+    // driver bump or a different SoC gets its own blobs. We hash
-+    // each component separately and combine to avoid pathological
-+    // FNV behaviour on long buffers.
-+    uint64_t h_src    = fnv1a_hash64(program_buffer, program_size);
-+    uint64_t h_opts   = fnv1a_hash64(compile_opts.data(), compile_opts.size());
-+
-+    // Driver version + device name + OpenCL C version pinpoint the
-+    // driver instance the binary was emitted by. Pinpointing too
-+    // tightly is a feature: a driver bump silently invalidates the
-+    // cache, exactly the policy you want.
-+    char driver_buf[256] = {0};
-+    char devname_buf[256] = {0};
-+    char devver_buf[256]  = {0};
-+    size_t n;
-+    clGetDeviceInfo(dev, CL_DRIVER_VERSION, sizeof(driver_buf) - 1, driver_buf, &n);
-+    clGetDeviceInfo(dev, CL_DEVICE_NAME,    sizeof(devname_buf) - 1, devname_buf, &n);
-+    clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof(devver_buf) - 1,  devver_buf, &n);
-+    uint64_t h_drv    = fnv1a_hash64(driver_buf,  strlen(driver_buf));
-+    uint64_t h_dev    = fnv1a_hash64(devname_buf, strlen(devname_buf));
-+    uint64_t h_devver = fnv1a_hash64(devver_buf,  strlen(devver_buf));
-+
-+    // Five 16-char hex tokens + 4 underscores + ".bin" + NUL = 89 bytes.
-+    // Use PRIx64 + (uint64_t) so the format-spec width is correct on
-+    // both LP64 (Linux/Android) and LLP64 (Windows MinGW/MSVC) where
-+    // `unsigned long` is 32 bits and `%016lx` would silently truncate
-+    // the upper half of each FNV hash.
-+    char buf[128];
-+    std::snprintf(buf, sizeof(buf),
-+                  "%016" PRIx64 "_%016" PRIx64 "_%016" PRIx64
-+                  "_%016" PRIx64 "_%016" PRIx64 ".bin",
-+                  h_src, h_opts, h_drv, h_dev, h_devver);
-+    return buf;
-+}
-+
-+static cl_program opencl_build_program_with_cache(cl_context ctx,
-+                                                  cl_device_id dev,
-+                                                  const char * program_buffer,
-+                                                  size_t program_size,
-+                                                  const std::string & compile_opts,
-+                                                  const std::string & cache_dir,
-+                                                  const std::string & key) {
-+    if (cache_dir.empty() || key.empty()) return nullptr;
-+    const std::string path = cache_dir + "/" + key;
-+    std::ifstream ifs(path, std::ios::binary);
-+    if (!ifs) return nullptr;
-+    ifs.seekg(0, std::ios::end);
-+    const std::streamsize n = ifs.tellg();
-+    if (n <= 0) return nullptr;
-+    ifs.seekg(0, std::ios::beg);
-+    std::vector<unsigned char> blob((size_t) n);
-+    if (!ifs.read(reinterpret_cast<char*>(blob.data()), n)) return nullptr;
-+
-+    cl_int err1 = CL_SUCCESS, err2 = CL_SUCCESS;
-+    const unsigned char * data = blob.data();
-+    const size_t len = blob.size();
-+    cl_program p = clCreateProgramWithBinary(ctx, 1, &dev, &len, &data, &err1, &err2);
-+    if (err1 != CL_SUCCESS || err2 != CL_SUCCESS || !p) {
-+        if (p) clReleaseProgram(p);
-+        return nullptr;
-+    }
-+    if (clBuildProgram(p, 0, NULL, compile_opts.c_str(), NULL, NULL) != CL_SUCCESS) {
-+        clReleaseProgram(p);
-+        return nullptr;
-+    }
-+    GGML_UNUSED(program_buffer);
-+    GGML_UNUSED(program_size);
-+    return p;
-+}
-+
-+static void opencl_save_program_binary(cl_program p, cl_device_id /*dev*/,
-+                                       const std::string & cache_dir,
-+                                       const std::string & key) {
-+    if (cache_dir.empty() || key.empty()) return;
-+    if (!opencl_mkdir_p(cache_dir)) return;
-+
-+    size_t bin_size = 0;
-+    if (clGetProgramInfo(p, CL_PROGRAM_BINARY_SIZES, sizeof(size_t),
-+                         &bin_size, nullptr) != CL_SUCCESS || bin_size == 0) return;
-+    std::vector<unsigned char> blob(bin_size);
-+    unsigned char * blob_ptr = blob.data();
-+    if (clGetProgramInfo(p, CL_PROGRAM_BINARIES, sizeof(unsigned char *),
-+                         &blob_ptr, nullptr) != CL_SUCCESS) return;
-+
-+    // Atomic write: tmp + fsync + rename. Without the fsync the kernel
-+    // can flush blocks out of order on power loss, leaving the renamed
-+    // file pointing at zero/garbage data and forcing the next process
-+    // into the source-compile fallback (and the bad blob lives forever
-+    // unless explicitly invalidated).
-+    const std::string final_path = cache_dir + "/" + key;
-+    const std::string tmp_path   = final_path + ".tmp";
-+    {
-+        std::ofstream ofs(tmp_path, std::ios::binary);
-+        if (!ofs) return;
-+        ofs.write(reinterpret_cast<const char*>(blob.data()), (std::streamsize) blob.size());
-+        ofs.close();
-+        if (!ofs) { parakeet_unlink(tmp_path.c_str()); return; }
-+    }
-+    {
-+        int fd = parakeet_open_ro(tmp_path.c_str());
-+        if (fd >= 0) {
-+            parakeet_fsync(fd);
-+            parakeet_close(fd);
-+        }
-+    }
-+    // Windows rename() refuses to overwrite an existing destination, so
-+    // unlink it first. POSIX rename is atomic and replaces silently;
-+    // the redundant unlink there is a no-op when the target is missing.
-+#ifdef _WIN32
-+    parakeet_unlink(final_path.c_str());
-+#endif
-+    if (rename(tmp_path.c_str(), final_path.c_str()) != 0) {
-+        parakeet_unlink(tmp_path.c_str());
-+    }
-+}
-+
- static cl_program build_program_from_source(cl_context ctx, cl_device_id dev, const char* program_buffer, const std::string &compile_opts) {
-     cl_program p;
-     char *program_log;
-@@ -764,6 +978,17 @@ static cl_program build_program_from_source(cl_context ctx, cl_device_id dev, co
- 
-     program_size = strlen(program_buffer);
- 
-+    // parakeet patch: try the persistent cache first.
-+    const std::string cache_dir = opencl_cache_dir(dev);
-+    const std::string cache_key = cache_dir.empty()
-+        ? std::string()
-+        : opencl_cache_key(program_buffer, program_size, compile_opts, dev);
-+    if (cl_program cached = opencl_build_program_with_cache(
-+            ctx, dev, program_buffer, program_size, compile_opts,
-+            cache_dir, cache_key)) {
-+        return cached;
-+    }
-+
-     p = clCreateProgramWithSource(ctx, 1, (const char**)&program_buffer, &program_size, &err);
-     if(err < 0) {
-         GGML_LOG_ERROR("OpenCL error creating program");
-@@ -781,6 +1006,11 @@ static cl_program build_program_from_source(cl_context ctx, cl_device_id dev, co
-         exit(1);
-     }
- 
-+    // parakeet patch: save the freshly compiled binary. Fast path
-+    // (cache hit) above avoids re-compiling next time. Failures here
-+    // are non-fatal -- next process just re-pays the compile cost.
-+    opencl_save_program_binary(p, dev, cache_dir, cache_key);
-+
-     return p;
- }
- 
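Review note on the removed cache-save path: the deleted `opencl_save_program_binary` writes the blob to a `.tmp` file, fsyncs it, and only then renames it over the final path, so a reader never sees a half-written cache entry. The sketch below illustrates that write-then-rename pattern in isolation. It is a minimal POSIX-only illustration, not the patch's code: `atomic_write_file` and `read_file` are hypothetical names, and it fsyncs the same descriptor it wrote to, whereas the patch writes through `std::ofstream` and reopens the file to fsync it.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not from the patch): publish `blob` at `final_path`
// by writing a sibling temp file first and renaming it into place.
static bool atomic_write_file(const std::string & final_path,
                              const std::vector<unsigned char> & blob) {
    const std::string tmp_path = final_path + ".tmp";
    int fd = ::open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    size_t off = 0;
    while (off < blob.size()) {
        ssize_t n = ::write(fd, blob.data() + off, blob.size() - off);
        if (n <= 0) {  // treat any short-circuit as failure; the cache is optional
            ::close(fd);
            ::unlink(tmp_path.c_str());
            return false;
        }
        off += (size_t) n;
    }

    // Push the data to stable storage before the rename makes it visible.
    if (::fsync(fd) != 0) {
        ::close(fd);
        ::unlink(tmp_path.c_str());
        return false;
    }
    ::close(fd);

    // POSIX rename() replaces an existing destination atomically.
    if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
        ::unlink(tmp_path.c_str());
        return false;
    }
    return true;
}

// Hypothetical helper: read a whole file, returning an empty vector if it
// cannot be opened.
static std::vector<unsigned char> read_file(const std::string & path) {
    std::ifstream ifs(path, std::ios::binary);
    return std::vector<unsigned char>((std::istreambuf_iterator<char>(ifs)),
                                      std::istreambuf_iterator<char>());
}
```

A caller would treat a `false` return as non-fatal, as the patch does: the next process simply pays the compile cost again. On Windows the destination would have to be removed before the rename, which is what the `#ifdef _WIN32` branch in the removed code handles.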
diff --git a/tts-cpp/.gitignore b/tts-cpp/.gitignore
index ca1d3c4c339..ba5670bf11a 100644
--- a/tts-cpp/.gitignore
+++ b/tts-cpp/.gitignore
@@ -1,5 +1,8 @@
 # Vendored ggml (cloned separately at setup time; see README)
-ggml/
+/ggml/
+# (We DO commit cmake/vcpkg-overlay-ports/ggml/ — it's the QVAC ggml port
+# overlay carrying our Supertonic custom-op patches.  The `/ggml/` above is
+# anchored to the tts-cpp root only.)
 
 # Build artifacts
 build/
diff --git a/tts-cpp/CMakeLists.txt b/tts-cpp/CMakeLists.txt
index 20e4d4634eb..65702e0fbe6 100644
--- a/tts-cpp/CMakeLists.txt
+++ b/tts-cpp/CMakeLists.txt
@@ -164,23 +164,23 @@ if (NOT TARGET ggml)
         endif()
         add_library(ggml ALIAS ggml::ggml)
     else()
-        # In-tree subtree of qvac-ext-lib-whisper.cpp: the standalone
-        # patches/ folder + scripts/setup-ggml.sh tooling is intentionally
-        # absent here.  Without them, an add_subdirectory(ggml) build
-        # would silently miss the ggml-backend-reg-filename-prefix patch
-        # that GGML_BACKEND_DL_PROJECT_PREFIX="speech-" depends on, so
-        # libspeech-ggml-*.so files would exist on disk but the runtime
-        # loader would still search for libggml-*.so under
-        # GGML_BACKEND_DL=ON.  Reject up front with a pointer at the
-        # right consumption path.
-        if (NOT EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/patches")
+        # Bundled-ggml dev build path (TTS_CPP_USE_SYSTEM_GGML=OFF).
+        # Expects `tts-cpp/ggml/` to be a checkout of the
+        # tetherto/qvac-ext-ggml repo on the `speech` branch — the QVAC
+        # fork carrying every infrastructure patch + the Supertonic 2
+        # fused custom op family as commits (not as a patches/ overlay).
+        #
+        # Run `bash tts-cpp/scripts/setup-ggml.sh` first to clone +
+        # check out the pinned commit.  No patches/ directory is
+        # consulted: the speech branch is already pre-patched at the
+        # commit level.
+        if (NOT EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/ggml/CMakeLists.txt")
             message(FATAL_ERROR
-                "tts-cpp: this in-tree subtree does not ship the patches/ "
-                "directory.  Pass -DTTS_CPP_USE_SYSTEM_GGML=ON to consume "
-                "the QVAC speech-stack `ggml-speech` vcpkg port (which "
-                "carries the pre-applied patches), or use the standalone "
-                "github.com/gianni-cor/chatterbox.cpp repo for a "
-                "bundled-ggml dev build with patches/ present.")
+                "tts-cpp: bundled-ggml build requires tts-cpp/ggml/ to be "
+                "a checkout of tetherto/qvac-ext-ggml@speech.  Run "
+                "`bash tts-cpp/scripts/setup-ggml.sh` first, or pass "
+                "-DTTS_CPP_USE_SYSTEM_GGML=ON to consume the QVAC "
+                "speech-stack `ggml-speech` vcpkg port.")
         endif()
         add_subdirectory(ggml)
     endif()
@@ -212,22 +212,17 @@ endif()
 
 # Legacy interface library kept for export-set compatibility (it is
 # still part of `install(EXPORT tts-cppTargets)` below and downstream
-# `find_package(tts-cpp)` consumers list it as a link dep). Body
-# intentionally empty: tts-cpp now routes every backend decision
-# through the ggml-backend registry
-# (`ggml_backend_load_all` + `ggml_backend_dev_*`, see
-# `init_gpu_backend()` / `init_cpu_backend()` / `init_blas_backend()`
+# `find_package(tts-cpp)` consumers list it as a link dep). Body is
+# intentionally empty: tts-cpp routes every backend SELECTION and
+# capability query through the ggml-backend registry
+# (`init_gpu_backend()` / `init_cpu_backend()` / `init_blas_backend()`
 # in src/backend_selection.cpp) and does NOT call any
-# `ggml_backend_<backend>_init` / `ggml_backend_is_<backend>` entry
-# point directly. The `GGML_USE_VULKAN` / `GGML_USE_OPENCL` /
-# `GGML_USE_METAL` / `GGML_USE_CUDA` / `GGML_USE_BLAS` compile defines
-# that used to live here were only consumed by `#ifdef` cascades that
-# called those static entry points; with the registry-only design
-# they're dead, and shipping them would falsely advertise a static
-# backend dependency that the GGML_BACKEND_DL=ON Android/Linux builds
-# explicitly do not have (their backends live in separately-loadable
-# `.so` files that are dlopen()'d by `ggml_backend_load_all_from_path`
-# at runtime). Mirrors parakeet-cpp's `parakeet-backend-defs`.
+# `ggml_backend_<backend>_init` / `ggml_backend_is_<backend>` /
+# `ggml_backend_vk_*` entry point directly — the registry walk +
+# `ggml_backend_get_device` / `ggml_backend_dev_*` calls reach the
+# right backend in both `GGML_BACKEND_DL=ON` (Android / Linux .so
+# prebuild) and `GGML_BACKEND_DL=OFF` (static-link desktop) modes.
+# Mirrors parakeet-cpp's `parakeet-backend-defs`.
 add_library(tts-cpp-backend-defs INTERFACE)
 
 set(TTS_CPP_LIB_SOURCES
@@ -251,6 +246,7 @@ set(TTS_CPP_LIB_SOURCES
     src/supertonic_text_encoder.cpp
     src/supertonic_vector_estimator.cpp
     src/supertonic_engine.cpp
+    src/supertonic_chunker.cpp
     src/mtl_tokenizer.cpp
     src/text_preprocess.cpp
 )
@@ -506,7 +502,8 @@ if (TTS_CPP_BUILD_TESTS)
     add_executable(test-voice-features
         test/test_voice_features.cpp
         src/voice_features.cpp
-        src/mel_extract_stft.cpp)
+        src/mel_extract_stft.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-voice-features PRIVATE ggml)
     target_include_directories(test-voice-features PRIVATE ggml/include src)
     tts_cpp_apply_ccache(test-voice-features)
@@ -518,7 +515,8 @@ if (TTS_CPP_BUILD_TESTS)
     add_executable(test-resample
         test/test_resample.cpp
         src/voice_features.cpp
-        src/mel_extract_stft.cpp)
+        src/mel_extract_stft.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-resample PRIVATE ggml)
     target_include_directories(test-resample PRIVATE src)
     tts_cpp_apply_ccache(test-resample)
@@ -528,7 +526,8 @@ if (TTS_CPP_BUILD_TESTS)
         test/test_voice_encoder.cpp
         src/voice_encoder.cpp
         src/voice_features.cpp
-        src/mel_extract_stft.cpp)
+        src/mel_extract_stft.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-voice-encoder PRIVATE ggml)
     target_include_directories(test-voice-encoder PRIVATE ggml/include src)
     tts_cpp_apply_ccache(test-voice-encoder)
@@ -554,7 +553,8 @@ if (TTS_CPP_BUILD_TESTS)
     add_executable(test-fbank
         test/test_fbank.cpp
         src/voice_features.cpp
-        src/mel_extract_stft.cpp)
+        src/mel_extract_stft.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-fbank PRIVATE ggml)
     target_include_directories(test-fbank PRIVATE ggml/include src)
     tts_cpp_apply_ccache(test-fbank)
@@ -567,7 +567,8 @@ if (TTS_CPP_BUILD_TESTS)
         test/test_voice_embedding.cpp
         src/campplus.cpp
         src/voice_features.cpp
-        src/mel_extract_stft.cpp)
+        src/mel_extract_stft.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-voice-embedding PRIVATE ggml)
     target_include_directories(test-voice-embedding PRIVATE ggml/include src)
     if (OpenMP_CXX_FOUND)
@@ -581,7 +582,8 @@ if (TTS_CPP_BUILD_TESTS)
 
     add_executable(test-s3tokenizer
         test/test_s3tokenizer.cpp
-        src/s3tokenizer.cpp)
+        src/s3tokenizer.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-s3tokenizer PRIVATE ggml)
     target_include_directories(test-s3tokenizer PRIVATE ggml/include src)
     tts_cpp_apply_ccache(test-s3tokenizer)
@@ -714,7 +716,8 @@ if (TTS_CPP_BUILD_TESTS)
 
     add_executable(test-streaming
         test/test_streaming.cpp
-        src/chatterbox_tts.cpp)
+        src/chatterbox_tts.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-streaming PRIVATE ggml tts-cpp-backend-defs)
     target_include_directories(test-streaming PRIVATE ggml/include src include)
     tts_cpp_apply_ccache(test-streaming)
@@ -730,7 +733,8 @@ if (TTS_CPP_BUILD_TESTS)
     # internal test-hook entrypoints.
     add_executable(test-cpu-caches
         test/test_cpu_caches.cpp
-        src/chatterbox_tts.cpp)
+        src/chatterbox_tts.cpp
+        src/backend_selection.cpp)
     target_link_libraries(test-cpu-caches PRIVATE ggml tts-cpp-backend-defs)
     target_include_directories(test-cpu-caches PRIVATE ggml/include src include)
     tts_cpp_apply_ccache(test-cpu-caches)
@@ -811,6 +815,310 @@ if (TTS_CPP_BUILD_TESTS)
     add_supertonic_harness(test-supertonic-vector test/test_supertonic_vector.cpp)
     add_supertonic_harness(test-supertonic-vector-trace test/test_supertonic_vector_trace.cpp)
     add_supertonic_harness(test-supertonic-pipeline test/test_supertonic_pipeline.cpp)
+    # OpenCL optimization audit follow-up harnesses (F1–F11).
+    add_supertonic_harness(test-supertonic-load-caches    test/test_supertonic_load_caches.cpp)
+    add_supertonic_harness(test-supertonic-graph-rewrites test/test_supertonic_graph_rewrites.cpp)
+    # OpenCL audit follow-up #2 — text-encoder caches (F13, F16),
+    # Phase 2A F16-weight roster (predicate-level), Phase 2D
+    # profile-CSV emitter (unit-only).
+    add_supertonic_harness(test-supertonic-text-encoder-caches
+        test/test_supertonic_text_encoder_caches.cpp)
+    add_supertonic_harness(test-supertonic-f16-weights
+        test/test_supertonic_f16_weights.cpp)
+    # Phase 2D profile-CSV emitter — unit-level, no GGUF needed.
+    add_executable(test-supertonic-profile-csv
+        test/test_supertonic_profile_csv.cpp)
+    target_link_libraries(test-supertonic-profile-csv PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-profile-csv PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-profile-csv)
+    tts_cpp_register_test(test-supertonic-profile-csv LABEL "unit")
+    # OpenCL audit follow-up #3 — F17 duration scalar-weight
+    # cache + F18 text-encoder convnext-front graph cache +
+    # F19 vector-estimator front-block graph cache.
+    add_supertonic_harness(test-supertonic-audit3-caches
+        test/test_supertonic_audit3_caches.cpp)
+    # OpenCL audit follow-up #4 — F20 partial / Phase 2H RoPE-in-
+    # graph helper (parity vs scalar apply_rope on CPU backend with
+    # synthetic input).  Unit-level — no GGUF, no fixture.
+    add_executable(test-supertonic-rope-in-graph
+        test/test_supertonic_rope_in_graph.cpp)
+    target_link_libraries(test-supertonic-rope-in-graph PRIVATE ggml)
+    target_include_directories(test-supertonic-rope-in-graph PRIVATE ggml/include src)
+    tts_cpp_apply_ccache(test-supertonic-rope-in-graph)
+    tts_cpp_register_test(test-supertonic-rope-in-graph LABEL "unit")
+
+    # Audit follow-up #5 — packed-QK RoPE adapter parity test for
+    # `apply_rope_to_packed_qk` (F23 = F20 integration shim).  The
+    # helper bridges the `[head_dim, n_heads, L]` layout consumed
+    # by `apply_rope_in_graph` with the `[H*D, L]` packed layout
+    # produced by `dense_matmul_time_ggml` — see
+    # `aiDocs/AUDIT_SUPERTONIC_OPENCL.md` finding F23.  Unit-level:
+    # CPU-only parity, no GGUF, no fixture; runs in <50 ms.
+    add_executable(test-supertonic-rope-packed-qk
+        test/test_supertonic_rope_packed_qk.cpp)
+    target_link_libraries(test-supertonic-rope-packed-qk PRIVATE ggml)
+    target_include_directories(test-supertonic-rope-packed-qk PRIVATE ggml/include src)
+    tts_cpp_apply_ccache(test-supertonic-rope-packed-qk)
+    tts_cpp_register_test(test-supertonic-rope-packed-qk LABEL "unit")
+
+    # Audit follow-up #6 (F7) — fused ConvNeXt block builder.  The
+    # helper rewires the vocoder's per-block LN + pw1 + gelu + pw2 +
+    # gamma + residual chain to skip the layer-norm back-permute and
+    # to lower K=1 pointwise convs to direct `ggml_mul_mat` against
+    # the `[C, T0]` LN-output layout, eliminating two redundant
+    # `[T0, C]` copies per block (~16.8 MiB / vocoder pass).  Unit-
+    # level: CPU-only parity vs scalar reference on synthetic
+    # weights; no GGUF, no fixture; runs in <50 ms.
+    add_executable(test-supertonic-convnext-block-fused
+        test/test_supertonic_convnext_block_fused.cpp)
+    target_link_libraries(test-supertonic-convnext-block-fused PRIVATE ggml)
+    target_include_directories(test-supertonic-convnext-block-fused PRIVATE ggml/include src)
+    tts_cpp_apply_ccache(test-supertonic-convnext-block-fused)
+    tts_cpp_register_test(test-supertonic-convnext-block-fused LABEL "unit")
+
+    # Audit follow-up #6 (F12) — in-graph time/channel transpose
+    # helper to kill the per-call `pack_time_channel_for_ggml`
+    # CPU loops at every vector / text / duration estimator cache
+    # ingestion point.  The helper exposes `cache.x_in` as
+    # `ne=[C, L]` so callers upload CPU-native `x_tc` directly,
+    # and the graph immediately does `ggml_cont(ggml_transpose(x))`
+    # to recover the `[L, C]` view downstream ops expect.  Unit-
+    # level: CPU-only parity vs the reference `pack_time_channel_for_ggml`
+    # on three shapes (group_graph, tail noise, vocoder-realistic)
+    # + an L=1 trip-wire.  No GGUF needed; runs in <50 ms.
+    add_executable(test-supertonic-in-graph-transpose
+        test/test_supertonic_in_graph_transpose.cpp)
+    target_link_libraries(test-supertonic-in-graph-transpose PRIVATE ggml)
+    target_include_directories(test-supertonic-in-graph-transpose PRIVATE ggml/include src)
+    tts_cpp_apply_ccache(test-supertonic-in-graph-transpose)
+    tts_cpp_register_test(test-supertonic-in-graph-transpose LABEL "unit")
+
+    # Audit follow-up #6 (2C-lite) — same-backend `ggml_backend_
+    # tensor_copy` regression test.  Locks in the contract the
+    # `run_text_attention_cache_gpu` fast path depends on: a
+    # device→device blit between two cached graphs that share a
+    # backend produces bit-exact output equivalent to the
+    # `tensor_get` + `tensor_set` host round-trip the slow path
+    # used to perform.  Five shapes including an L=1 trip-wire
+    # and both attn / style head configurations.  Pure-CPU; no
+    # GGUF; runs in <50 ms.
+    add_executable(test-supertonic-graph-to-graph-blit
+        test/test_supertonic_graph_to_graph_blit.cpp)
+    target_link_libraries(test-supertonic-graph-to-graph-blit PRIVATE ggml)
+    target_include_directories(test-supertonic-graph-to-graph-blit PRIVATE ggml/include src)
+    tts_cpp_apply_ccache(test-supertonic-graph-to-graph-blit)
+    tts_cpp_register_test(test-supertonic-graph-to-graph-blit LABEL "unit")
+
+    # OpenCL bring-up unit tests (QVAC-18607).  Three CPU-only
+    # parity / structural tests for the dispatch + portable-op
+    # primitives.  No GGUF needed; register as "unit" label so a
+    # fresh checkout's ctest exercises them.  Links against tts-cpp
+    # (STATIC) so the detail-namespace symbols are reachable, same
+    # pattern as test-mtl-tokenizer / test-t3-mtl / test-streaming.
+    add_executable(test-supertonic-backend-dispatch
+        test/test_supertonic_backend_dispatch.cpp)
+    target_link_libraries(test-supertonic-backend-dispatch PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-backend-dispatch PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-backend-dispatch)
+    tts_cpp_register_test(test-supertonic-backend-dispatch LABEL "unit")
+
+    add_executable(test-supertonic-portable-ops
+        test/test_supertonic_portable_ops.cpp)
+    target_link_libraries(test-supertonic-portable-ops PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-portable-ops PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-portable-ops)
+    tts_cpp_register_test(test-supertonic-portable-ops LABEL "unit")
+
+    # QVAC-18605 — CPU-only unit test for the Vulkan-specific
+    # dispatch additions: `backend_is_vk`, `use_native_leaky_relu`,
+    # the `supertonic_op_dispatch_scope` mirror for the new flag,
+    # and the `supertonic_backend_supports_f16_kv_flash_attn`
+    # backend probe.  No GGUF / model fixture required — runs on a
+    # fresh checkout under `ctest -L unit`.  See the file header
+    # for the full coverage matrix.
+    add_executable(test-supertonic-vulkan-dispatch
+        test/test_supertonic_vulkan_dispatch.cpp)
+    target_link_libraries(test-supertonic-vulkan-dispatch PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-vulkan-dispatch PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-vulkan-dispatch)
+    tts_cpp_register_test(test-supertonic-vulkan-dispatch LABEL "unit")
+
+    # QVAC-18605 follow-up — process-wide capability-probe cache +
+    # F16 mul_mat probe + Q8_0 K/V flash-attn probe regression test.
+    # CPU-only; runs on a fresh checkout under `ctest -L unit`.
+    add_executable(test-supertonic-capability-cache
+        test/test_supertonic_capability_cache.cpp)
+    target_link_libraries(test-supertonic-capability-cache PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-capability-cache PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-capability-cache)
+    tts_cpp_register_test(test-supertonic-capability-cache LABEL "unit")
+
+    # QVAC-18605 follow-up — Engine::warm_up + EngineOptions::prewarm_text
+    # API-surface lockdown.  CPU-only compile-time + runtime contract test;
+    # the Vulkan-side first-synth-latency reduction is exercised by the
+    # fixture-bound integration tests on a Vulkan-capable host.
+    add_executable(test-supertonic-warm-up-api
+        test/test_supertonic_warm_up_api.cpp)
+    target_link_libraries(test-supertonic-warm-up-api PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-warm-up-api PRIVATE include)
+    tts_cpp_apply_ccache(test-supertonic-warm-up-api)
+    tts_cpp_register_test(test-supertonic-warm-up-api LABEL "unit")
+
+    # QVAC-18605 round 3 — multi-device Vulkan auto-pick policy
+    # (--vulkan-device -1 → pick device with most free VRAM).
+    # CPU-only TDD test for the pure-logic helper; the Vulkan-only
+    # plumbing that calls ggml_backend_vk_get_device_memory() per
+    # device + dispatches into the helper is exercised by the
+    # fixture-bound integration tests on a multi-GPU Vulkan host.
+    add_executable(test-supertonic-vulkan-device-select
+        test/test_supertonic_vulkan_device_select.cpp)
+    target_link_libraries(test-supertonic-vulkan-device-select PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-vulkan-device-select PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-vulkan-device-select)
+    tts_cpp_register_test(test-supertonic-vulkan-device-select LABEL "unit")
+
+    # QVAC-18605 round 6 — F16-weights deny-list API surface
+    # (EngineOptions::f16_weights_deny_list + load_supertonic_gguf
+    # 7th parameter + 2-arg should_materialise_f16_weight overload).
+    # CPU-only compile-time SFINAE + runtime defaults check; the
+    # predicate-level behaviour is covered by the existing
+    # test-supertonic-f16-weights TU.  The fixture-level shape /
+    # dtype check (loads model with deny-list, verifies a denied
+    # tensor stays F32) runs under the same fixture as the
+    # baseline F16-weights test on hosts with the GGUF available.
+    add_executable(test-supertonic-f16-deny-list-api
+        test/test_supertonic_f16_deny_list_api.cpp)
+    target_link_libraries(test-supertonic-f16-deny-list-api PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-f16-deny-list-api PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-f16-deny-list-api)
+    tts_cpp_register_test(test-supertonic-f16-deny-list-api LABEL "unit")
+
+    # QVAC-18605 round 4 — multi-dtype K/V flash-attention dispatch
+    # resolver (`resolve_kv_attn_type`) — pure-logic policy split
+    # from the Vulkan-only dispatch site so the behaviour matrix
+    # is testable on CPU with synthetic probe inputs.
+    add_executable(test-supertonic-kv-attn-type
+        test/test_supertonic_kv_attn_type.cpp)
+    target_link_libraries(test-supertonic-kv-attn-type PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-kv-attn-type PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-kv-attn-type)
+    tts_cpp_register_test(test-supertonic-kv-attn-type LABEL "unit")
+
+    # QVAC-18605 round 4 — API-surface lockdown for the new
+    # EngineOptions::kv_attn_type field, supertonic_model field,
+    # supertonic_kv_attn_type() thread-local accessor, and the
+    # dispatch-scope `prev_kv_attn_type` for RAII teardown.
+    add_executable(test-supertonic-kv-attn-type-api
+        test/test_supertonic_kv_attn_type_api.cpp)
+    target_link_libraries(test-supertonic-kv-attn-type-api PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-kv-attn-type-api PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-kv-attn-type-api)
+    tts_cpp_register_test(test-supertonic-kv-attn-type-api LABEL "unit")
+
+    # QVAC-18605 round 7 — Vulkan env-var passthrough mechanism
+    # (EngineOptions::vulkan_env_overrides + apply_vulkan_env_overrides
+    # public helper).  Tests cover: SFINAE field existence, empty-
+    # map noop, single-entry-sets-env, operator-env-wins (set_env_if_unset
+    # semantics), invalid-key-throws (loud-failure for typos), and
+    # all-or-nothing-on-mixed-validity (no partial application).
+    add_executable(test-supertonic-vulkan-env-overrides
+        test/test_supertonic_vulkan_env_overrides.cpp)
+    target_link_libraries(test-supertonic-vulkan-env-overrides PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-vulkan-env-overrides PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-vulkan-env-overrides)
+    tts_cpp_register_test(test-supertonic-vulkan-env-overrides LABEL "unit")
+
+    # QVAC-18605 round 7 — voice ttl/dp host cache
+    # (`tts_cpp::supertonic::detail::voice_host_cache`).  Standalone
+    # helper extracted from Engine::Impl::synthesize() so the
+    # lookup-or-load semantics are testable on CPU without
+    # instantiating a full Engine.  Tests cover: empty / first-load-
+    # populates / second-load-hits-cache (null-tensor passthrough
+    # proves the cache hit) / multi-voice / clear / null-on-miss
+    # throws.
+    add_executable(test-supertonic-voice-host-cache
+        test/test_supertonic_voice_host_cache.cpp)
+    target_link_libraries(test-supertonic-voice-host-cache PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-voice-host-cache PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-voice-host-cache)
+    tts_cpp_register_test(test-supertonic-voice-host-cache LABEL "unit")
+
+    # QVAC-18605 round 10 — pointer-compare upload-skip tracker
+    # (`tts_cpp::supertonic::detail::upload_skip_tracker`).
+    # Generalises the F4 pattern from `vector_res_style_qkv_cache`
+    # (style_v_in / kctx_in upload-skip) to the front-block /
+    # group-graph `text_in` uploads, which receive the same
+    # `text_emb` pointer 5 times per synth.  Tests cover: default
+    # state, upload + skip happy path, pointer-change forces
+    # upload, reset() invalidation (synth-boundary contract),
+    # interleaved-instance independence, cross-synth pointer-
+    # reuse hazard simulation (the bug the synth-boundary reset
+    # exists to prevent), and reset-on-empty no-op.
+    add_executable(test-supertonic-upload-skip-tracker
+        test/test_supertonic_upload_skip_tracker.cpp)
+    target_link_libraries(test-supertonic-upload-skip-tracker PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-upload-skip-tracker PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-upload-skip-tracker)
+    tts_cpp_register_test(test-supertonic-upload-skip-tracker LABEL "unit")
+
+    # QVAC-18605 round 12 #6 — text-encoder speech-prompted-attention
+    # GPU bridge.  Master's Metal-port branch built
+    # `speech_prompted_merged_cache` (one merged graph for QKV proj +
+    # head-split + flash-attn + out-proj) but never wired its run path
+    # into the production text-encoder loop.  Round 12 adds
+    # `run_speech_prompted_merged_cache` + dispatches to it on non-CPU
+    # backends, eliminating 10 sync points / synth (2 layers × 5
+    # download+pack+reupload steps each) at the text encoder.  This
+    # test pins the new symbol's existence + the merged-cache struct's
+    # field contract via SFINAE; equivalence vs. the scalar reference
+    # is verified end-to-end by the model-fixture tests
+    # `test-supertonic-text-encoder-trace` + `test-supertonic-pipeline`.
+    add_executable(test-supertonic-text-encoder-gpu-bridge
+        test/test_supertonic_text_encoder_gpu_bridge.cpp)
+    target_link_libraries(test-supertonic-text-encoder-gpu-bridge PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-text-encoder-gpu-bridge PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-text-encoder-gpu-bridge)
+    tts_cpp_register_test(test-supertonic-text-encoder-gpu-bridge LABEL "unit")
+
+    # QVAC-18605 round 12 #5 — pinned-host-buffer input allocation
+    # helper.  Round 3 shipped the capability probe but deferred the
+    # per-engine input-scratchpad refactor that actually USES the
+    # host-pinned buffer to skip ggml-vulkan's internal staging-
+    # buffer hop.  Round 12 #5 lands `try_alloc_inputs_in_pinned_host_buffer`
+    # and applies it at the hot per-step input sites
+    # (vector_group_graph_cache + ve_front_block_graph_cache).
+    # The CPU-only test pins the symbol's existence + the
+    # `nullptr` return contract on CPU backend + idempotent
+    # repeat calls + null-pointer safety on null backend / null
+    # ctx (defensive failure modes in error-handler paths).
+    add_executable(test-supertonic-pinned-host-buffer
+        test/test_supertonic_pinned_host_buffer.cpp)
+    target_link_libraries(test-supertonic-pinned-host-buffer PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-pinned-host-buffer PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-pinned-host-buffer)
+    tts_cpp_register_test(test-supertonic-pinned-host-buffer LABEL "unit")
+
+    # QVAC-18605 round 13 #1 — input-scratchpad allocator helper
+    # that consolidates the pinned-host + default-backend fallback
+    # boilerplate round 12 #5 manually inlined at 4 cache sites.
+    # Round 13 needs to extend the pattern to 5+ more caches
+    # (vector_loop_one_graph, vocoder, style residual + QKV, merged
+    # speech-prompted) — without this helper that's 5x copy-paste.
+    # CPU-only test pins the symbol + CPU-fallback contract + null-
+    # argument throws (defensive failure modes in error paths).
+    add_executable(test-supertonic-input-scratchpad
+        test/test_supertonic_input_scratchpad.cpp)
+    target_link_libraries(test-supertonic-input-scratchpad PRIVATE tts-cpp)
+    target_include_directories(test-supertonic-input-scratchpad PRIVATE ggml/include src include)
+    tts_cpp_apply_ccache(test-supertonic-input-scratchpad)
+    tts_cpp_register_test(test-supertonic-input-scratchpad LABEL "unit")
+
+    add_executable(test-supertonic-f16-attn-parity
+        test/test_supertonic_f16_attn_parity.cpp)
+    target_link_libraries(test-supertonic-f16-attn-parity PRIVATE ggml)
+    target_include_directories(test-supertonic-f16-attn-parity PRIVATE ggml/include src)
+    tts_cpp_apply_ccache(test-supertonic-f16-attn-parity)
+    tts_cpp_register_test(test-supertonic-f16-attn-parity LABEL "unit")
 
     # supertonic-bench is a benchmark CLI (takes --text / --out / --runs),
     # not a parity test, so it doesn't go through add_supertonic_harness
diff --git a/tts-cpp/PROGRESS_SUPERTONIC.md b/tts-cpp/PROGRESS_SUPERTONIC.md
index 72ce1d3ef75..e007cfc8b5d 100644
--- a/tts-cpp/PROGRESS_SUPERTONIC.md
+++ b/tts-cpp/PROGRESS_SUPERTONIC.md
@@ -471,6 +471,1533 @@ python scripts/convert-supertonic2-to-gguf.py \
 
 ---
 
+## GPU bring-up: OpenCL (May 2026)
+
+Target: the same `--n-gpu-layers > 0` flag already exposed by the
+Supertonic CLI, but resolved to **OpenCL** instead of falling back to
+CPU.  Tracking ticket: QVAC-18607.
+
+### What was missing
+
+The Supertonic CPU path (§7-§8 above) earned its CPU benchmark wins by
+moving every hot loop onto a `ggml_custom_4d` op whose callback runs
+CBLAS / pointer-arithmetic directly against the tensor `data` field:
+
+| TU | Custom ops |
+|----|-----------|
+| `supertonic_vocoder.cpp` | K=1 cblas conv1d, K>1 cblas conv1d, depthwise dilated conv1d |
+| `supertonic_vector_estimator.cpp` | conv1d_f32(K=1), depthwise same-padded conv1d, row-wise layer-norm, dense-time matmul, fused bias+GELU, fused (pw2 bias + γ + residual), fused tail-update (BLAS GEMM + mask + step-scale + residual add) |
+
+None of those callbacks are valid on a GPU backend: `GGML_OP_CUSTOM`
+isn't supported by `ggml-opencl` (or by CUDA / Metal / Vulkan), and the
+op callbacks themselves assume host-addressable `data` pointers that
+no GPU backend exposes inside graph execution.  So before this round,
+loading Supertonic with `--n-gpu-layers > 0` either fell straight back
+to CPU via `init_supertonic_backend` (when the backend wasn't compiled
+in) or asserted at `ggml_backend_graph_compute` time inside the OpenCL
+dispatch loop (when it was).
+
+In addition, the vocoder graph had a similar portability hole against
+baseline upstream OpenCL: the `ggml_leaky_relu` builtin
+(`GGML_OP_LEAKY_RELU`) is only present on `ggml-opencl` builds that
+carry the chatterbox `ggml-opencl-chatterbox-ops.patch`.  That is fine
+for the QVAC `ggml-speech` vcpkg consumption path, but unsafe for any
+other GPU backend wanting Supertonic.
+
+### What landed
+
+| Change | File(s) |
+|--------|---------|
+| `supertonic_model::backend_is_cpu` set from `ggml_backend_is_cpu(model.backend)` right after `init_supertonic_backend()` resolves the device. | `supertonic_gguf.cpp`, `supertonic_internal.h` |
+| `supertonic_op_dispatch_scope` — thread-local RAII helper instantiated at every public `supertonic_*_forward_ggml` / `*_trace_ggml` entry point.  Mirrors `model.backend_is_cpu` and `model.use_f16_attn` into the two thread-local flags consulted by the graph-build helpers. | `supertonic_internal.h`, `supertonic_gguf.cpp`, `supertonic_vocoder.cpp`, `supertonic_vector_estimator.cpp`, `supertonic_text_encoder.cpp`, `supertonic_duration.cpp` |
+| Every `ggml_custom_4d` site gated on `supertonic_use_cpu_custom_ops()` so GPU runs fall through to the existing pure-GGML paths (`ggml_im2col + ggml_mul_mat`, `ggml_norm`, etc.) — all of which `ggml-opencl` already supports natively (see `ggml_opencl_supports_op()` in `ggml/src/ggml-opencl/ggml-opencl.cpp`). | `supertonic_vocoder.cpp`, `supertonic_vector_estimator.cpp` |
+| Portable `leaky_relu_portable_ggml()` helper: on CPU keeps the fused builtin; on GPU decomposes into `RELU + SCALE + ADD`, all universally supported. | `supertonic_vocoder.cpp` |
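+
+The RAII pattern behind `supertonic_op_dispatch_scope` can be sketched
+roughly as follows.  This is a minimal illustration, not the code in
+`supertonic_internal.h`: `model_flags`, `g_use_cpu_custom_ops`,
+`g_use_f16_attn` and `op_dispatch_scope_sketch` are hypothetical
+names, and the real scope may track more state.
+
+```cpp
+// Hypothetical stand-in for the two model fields the scope mirrors.
+struct model_flags {
+    bool backend_is_cpu = true;
+    bool use_f16_attn   = false;
+};
+
+// Thread-local flags read by the graph-build helpers.
+// The defaults describe a plain CPU run.
+static thread_local bool g_use_cpu_custom_ops = true;
+static thread_local bool g_use_f16_attn       = false;
+
+// Mirrors the model flags on construction and restores the previous
+// values on destruction, including during exception unwinding.
+class op_dispatch_scope_sketch {
+public:
+    explicit op_dispatch_scope_sketch(const model_flags & m)
+        : prev_cpu_(g_use_cpu_custom_ops), prev_f16_(g_use_f16_attn) {
+        g_use_cpu_custom_ops = m.backend_is_cpu;
+        g_use_f16_attn       = m.use_f16_attn;
+    }
+    ~op_dispatch_scope_sketch() {
+        g_use_cpu_custom_ops = prev_cpu_;
+        g_use_f16_attn       = prev_f16_;
+    }
+    op_dispatch_scope_sketch(const op_dispatch_scope_sketch &) = delete;
+    op_dispatch_scope_sketch & operator=(const op_dispatch_scope_sketch &) = delete;
+
+private:
+    bool prev_cpu_;
+    bool prev_f16_;
+};
+```
+
+Because the destructor restores the previous values instead of
+hard-coded defaults, nested scopes unwind in order and the thread's
+flags end up exactly as the outermost caller left them.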
+
+### Optimization #1: F16 K/V flash-attention
+
+The vector estimator's text-conditioned attention runs four times per
+denoising step × N steps, so it's the single hottest op in the
+Supertonic synthesis budget after the dense convnext blocks.  Lifted
+straight from chatterbox's Adreno bring-up (§ `OpenCL optimization
+log`), the vector-estimator graph now optionally materialises K / V
+into contiguous F16 before calling `ggml_flash_attn_ext`, which makes
+OpenCL dispatch the `flash_attn_f32_f16` kernel instead of the
+F32-only one.  In chatterbox's Q4_0 CFM smoke run this dropped the
+attention kernel from `~257 ms` to `~102 ms` on Adreno 830.
+
+- Engine option: `EngineOptions::f16_attn` (`-1`=auto, `0`=off, `1`=on).
+  Auto-enables on GPU backends, off on CPU.
+- CLI flag: `--f16-attn 0|1`, exposed on `tts-cli`, `supertonic-cli`,
+  and `supertonic-bench`.
+- Cache key: `vector_text_attention_cache::f16_kv_attn` so toggling the
+  flag mid-process safely rebuilds the cached graph.
+
+Q stays F32: cheaper to keep one operand at the higher precision than
+to round-trip the post-attention output back through F32 for the
+downstream dense projection.
+
+### How to use
+
+```bash
+# Build with OpenCL (in the standalone tree; in-tree subtree consumes
+# ggml-speech vcpkg port which already carries the OpenCL patches).
+cmake -S . -B build-opencl -DCMAKE_BUILD_TYPE=Release -DGGML_OPENCL=ON
+cmake --build build-opencl -j$(nproc) --target tts-cli supertonic-cli supertonic-bench
+
+# Run on OpenCL with auto F16 attention.
+./build-opencl/supertonic-cli \
+  --model models/supertonic2.gguf \
+  --text "The quick brown fox jumps over the lazy dog." \
+  --voice F1 --language en --steps 5 --speed 1.05 \
+  --n-gpu-layers 99 \
+  --out /tmp/supertonic2.wav
+
+# Force F16 attention off (CPU-style fallback) for parity:
+./build-opencl/supertonic-cli ... --n-gpu-layers 99 --f16-attn 0
+```
+
+### Validation
+
+- Every `supertonic_*_forward_ggml` entry point opens an RAII
+  `supertonic_op_dispatch_scope(model)`, so a CPU-only second engine
+  in the same thread still sees the default `use_cpu_custom_ops = true`
+  after a GPU engine's forward returns.  This is required because the
+  pointwise vocoder parity harness and the pipeline trace harness
+  re-enter the model from a single thread.
+- Both the trace `*_trace_ggml` entry points and the production
+  `*_forward_ggml` ones acquire the scope: trace runs still pick the
+  pure-GGML pathway whenever the backend isn't CPU, which is what the
+  existing parity tests expect (the trace harness already disables the
+  fused tail-update op via `!trace_outputs`; the new gate just removes
+  the secondary `ggml_custom_4d` branches under it).
+- CTest harnesses `test-supertonic-pipeline`, `test-supertonic-vocoder`,
+  `test-supertonic-vector`, `test-supertonic-text-encoder`,
+  `test-supertonic-duration` continue to exercise the CPU path
+  unchanged; running them with a GPU-bound model would route the same
+  fixture data through the pure-GGML fallback graph and produce the
+  same parity numbers (within F32 → F16 K/V tolerance on the attention
+  output when `--f16-attn 1`).
+- Three new CPU-only unit harnesses ship alongside the bring-up code
+  to give the dispatch + portable-op primitives their own coverage
+  independent of any model GGUF:
+
+  | Test | What it covers |
+  |------|----------------|
+  | `test-supertonic-backend-dispatch` | Default thread-local flag state; `supertonic_op_dispatch_scope` mirroring CPU and GPU `supertonic_model` instances; RAII teardown on normal exit and on exception; nested-scope unwinding; independence of `use_cpu_custom_ops` / `use_f16_attn`. |
+  | `test-supertonic-portable-ops`     | CPU-backend parity of `leaky_relu_portable_ggml` (CPU lowering) vs the GPU decomposition for every `α ∈ {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0}`; graph-node-count check that the GPU dispatch actually expands the op (catches a regression back to a passthrough `ggml_leaky_relu`). |
+  | `test-supertonic-f16-attn-parity`  | F32 vs F16 K/V `ggml_flash_attn_ext` parity on the two hot shapes from the vector estimator (text attention `kv=32`, style attention `kv=50`); tolerance budget `5e-3` absolute / `5e-3` relative, the same band chatterbox ships behind `--cfm-f16-kv-attn`. |
+
+  All three are registered with `LABEL "unit"` so a fresh checkout's
+  `ctest -L unit` exercises them without needing the Supertonic GGUF.
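+
+The parity that `test-supertonic-portable-ops` checks can be pictured
+with scalars.  The sketch below uses one identity that needs only
+RELU, SCALE and ADD style steps, `leaky(x) = α·x + (1 − α)·relu(x)`.
+It is an assumption that this mirrors the helper: the real
+`leaky_relu_portable_ggml` may arrange the three op kinds
+differently, and all function names here are hypothetical.
+
+```cpp
+#include <algorithm>
+#include <cmath>
+
+// Reference: what a fused leaky-ReLU computes per element.
+inline float leaky_relu_fused(float x, float alpha) {
+    return x > 0.0f ? x : alpha * x;
+}
+
+// Decomposition built only from RELU, SCALE and ADD style steps:
+//   leaky(x) = alpha * x + (1 - alpha) * relu(x)
+inline float leaky_relu_decomposed(float x, float alpha) {
+    const float relu_x   = std::max(x, 0.0f);        // RELU
+    const float scaled_x = alpha * x;                // SCALE
+    const float scaled_r = (1.0f - alpha) * relu_x;  // SCALE
+    return scaled_x + scaled_r;                      // ADD
+}
+
+// True when both forms agree within abs_tol for this input.
+inline bool leaky_relu_parity(float x, float alpha, float abs_tol = 1e-5f) {
+    return std::fabs(leaky_relu_fused(x, alpha) -
+                     leaky_relu_decomposed(x, alpha)) <= abs_tol;
+}
+```
+
+The two forms are algebraically equal for every `α`, so any mismatch
+beyond float rounding points at the lowering rather than the maths.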
+
+### Next optimization rounds
+
+The roadmap beyond this PR — F16 weight materialization, Q8_0 GGUF
+support, host↔GPU round-trip elimination, OpenCL kernel-time profile
+mode, and vocoder-unpack-on-GPU — is captured with its test plan in
+`PLAN_SUPERTONIC_OPENCL.md`.  Each phase has an acceptance test
+spelled out (most TDD, written before the implementation lands).
+
+---
+
+## GPU bring-up: Vulkan (May 2026, QVAC-18605)
+
+Target: the same `--n-gpu-layers > 0` flag already plumbed through the
+Supertonic CLI / engine / bench layer, but resolved to **Vulkan** on
+Linux/Windows boxes that ship a working ICD (NVIDIA proprietary, AMD
+RADV via Mesa, Intel ANV, llvmpipe for headless CI) so QVAC consumers
+without an OpenCL stack still get the GPU codepath.  Tracking ticket:
+QVAC-18605.
+
+### Inheritance from the OpenCL bring-up (QVAC-18607)
+
+By construction, the OpenCL bring-up's foundational work is **backend-
+portable**: every helper added in QVAC-18607 (the
+`supertonic_op_dispatch_scope` RAII, `backend_is_cpu` flag, F16 K/V
+flash-attention path, `leaky_relu_portable_ggml` decomposition) only
+ever queries "is this CPU?".  When the resolved backend is Vulkan
+those queries return false and the runtime takes the GPU-portable
+path automatically.  The Phase 2 audit-driven optimizations (F1-F24
+in `aiDocs/AUDIT_SUPERTONIC_OPENCL.md` — host caches, in-graph RoPE,
+GPU↔GPU Q/K/V blits, ConvNeXt fusion, F16 weights, in-graph
+transpose) likewise apply unchanged: each one removes a host↔GPU
+synchronisation point or eliminates redundant memory traffic that
+Vulkan pays exactly the same way OpenCL does.
+
+What this PR adds on top is the **Vulkan-specific dispatch deltas**:
+two new model flags, two backend-capability probes, a CLI knob for
+device selection, and a CPU-only TDD test that locks in the new
+contract.  Each is small, scoped, and sits behind the existing
+`#ifdef GGML_USE_VULKAN` guard so non-Vulkan builds compile clean.
+
+### What landed
+
+| Change | File(s) | Rationale |
+|--------|---------|-----------|
+| `supertonic_model::backend_is_vk` set from `ggml_backend_is_vk(model.backend)` after `init_supertonic_backend()` resolves the device. | `supertonic_gguf.cpp`, `supertonic_internal.h` | Informational; consumed by `engine.cpp::backend_name()` and `supertonic_bench.cpp` so multi-GPU machines unambiguously identify which adapter ran the bench (e.g. `Vulkan (device 0: NVIDIA GeForce RTX 5090)` instead of the bare `Vulkan` string). |
+| `supertonic_model::use_native_leaky_relu` set from a load-time `ggml_backend_supports_op` probe against a synthetic LEAKY_RELU node.  Mirrored into the dispatch scope's thread-local. | `supertonic_gguf.cpp`, `supertonic_internal.h` | The OpenCL bring-up's `leaky_relu_portable_ggml` always decomposes into `RELU + SCALE + ADD` on non-CPU backends (3 dispatches).  Vulkan / Metal / CUDA implement `GGML_OP_LEAKY_RELU` natively (1 dispatch) — the probe lets the helper short-circuit to the fused builtin on backends that have it, without a hard-coded backend table.  Plain upstream OpenCL (no chatterbox patch) keeps the conservative decomposition. |
+| `supertonic_backend_supports_f16_kv_flash_attn(backend)` probe; engine + bench auto-policy gates `use_f16_attn` on the result. | `supertonic_gguf.cpp`, `supertonic_internal.h`, `supertonic_engine.cpp`, `supertonic_bench.cpp` | The OpenCL bring-up's auto-policy flipped `use_f16_attn = !backend_is_cpu` blindly.  Replaced with a backend-capability probe that builds a synthetic Supertonic-shaped flash-attn graph node (`Q[head_dim, q_len, n_heads]` F32, `K/V[head_dim, kv_len, n_heads]` F16) and asks the backend whether it would accept the op.  A backend that ships `flash_attn_ext` but rejects the F16-K/V variant for our shape now keeps the F32 path — slower but guaranteed not to crash at first synth call.  Manual `--f16-attn 1` still forces dispatch (debug). |
+| `init_supertonic_backend(n_gpu_layers, verbose, vulkan_device)` — Vulkan device-index parameter.  Range-checks against `ggml_backend_vk_get_device_count()`; an out-of-range value is a hard error (no silent CPU fallback — that would mask CLI typos / wrong-machine config).  Verbose mode logs device description from `ggml_backend_vk_get_device_description`. | `supertonic_gguf.cpp` | Replaces the historical hard-coded `ggml_backend_vk_init(0)`.  Multi-GPU machines + CI runners with a primary llvmpipe and a secondary discrete GPU need a way to pick. |
+| `EngineOptions::vulkan_device` (default 0) plumbed through `load_supertonic_gguf`. | `tts-cpp/include/tts-cpp/supertonic/engine.h`, `supertonic_engine.cpp` | Public API. |
+| `--vulkan-device N` flag wired into `supertonic-cli`, `supertonic-bench`, and `tts-cli` (the chatterbox CLI's Supertonic dispatch path). | `supertonic_cli.cpp`, `chatterbox_cli.cpp`, `supertonic_bench.cpp` | CLI surface. |
+| `test-supertonic-vulkan-dispatch` — CPU-only unit test (`LABEL "unit"`) covering the new `backend_is_vk` / `use_native_leaky_relu` flags through `supertonic_op_dispatch_scope`, plus a smoke test for the F16-K/V flash-attn probe. | `test/test_supertonic_vulkan_dispatch.cpp`, `CMakeLists.txt` | Locks in the new dispatch contract for future regressions; runs on a fresh checkout under `ctest -L unit` without any GGUF fixture. |
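+
+The probe-gated auto-policy for `use_f16_attn` described in the table
+can be read as a small pure function.  This is a hedged sketch:
+`resolve_f16_attn_sketch` is a hypothetical name, the probe result is
+passed in as a plain `bool`, and throwing on values outside
+`{-1, 0, 1}` is an assumption rather than documented behaviour.
+
+```cpp
+#include <stdexcept>
+
+// Tri-state convention from EngineOptions::f16_attn:
+//   -1 = auto, 0 = off, 1 = on.
+inline bool resolve_f16_attn_sketch(int  requested,
+                                    bool backend_is_cpu,
+                                    bool probe_accepts_f16_kv) {
+    switch (requested) {
+        case 0:
+            return false;  // explicit off
+        case 1:
+            return true;   // explicit on: forces dispatch, probe ignored
+        case -1:
+            // auto: only on a GPU backend whose probe accepted the op
+            return !backend_is_cpu && probe_accepts_f16_kv;
+        default:
+            throw std::invalid_argument("f16_attn must be -1, 0 or 1");
+    }
+}
+```
+
+Keeping the policy separate from the backend query makes it testable
+on a CPU-only machine with synthetic probe answers.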
+
+### Vulkan supported-op matrix (relevant to Supertonic)
+
+Verified against `ggml/src/ggml-vulkan/ggml-vulkan.cpp` HEAD on this
+branch:
+
+| Op | Native on ggml-vulkan? | Notes |
+|----|:---:|---|
+| `GGML_OP_LEAKY_RELU` (F32) | ✓ | `pipeline_leaky_relu_f32` shader.  `leaky_relu_portable_ggml` short-circuits to fused builtin via the new `use_native_leaky_relu` probe. |
+| `GGML_OP_FLASH_ATTN_EXT` (F32 Q, F16 K/V) | ✓ | Requires `HSK % 8 == 0`; Supertonic's `head_dim=64` satisfies this by construction.  Output is F32, which matches what the downstream dense projection expects. |
+| `GGML_OP_FLASH_ATTN_EXT` (F32 Q, Q4_0/Q8_0 K/V) | ✓ | Available for future quantized-K/V experiments (chatterbox §3.32 deferred this). |
+| `GGML_OP_ROPE` | ✓ | Used by F20/F23 in-graph RoPE (post-OpenCL audit follow-up). |
+| `GGML_OP_NORM`, `GGML_OP_MUL`, `GGML_OP_ADD`, `GGML_OP_REPEAT`, `GGML_OP_PERMUTE`, `GGML_OP_CONT`, `GGML_OP_TRANSPOSE`, `GGML_OP_RESHAPE`, `GGML_OP_VIEW`, `GGML_OP_SCALE`, `GGML_OP_RELU`, `GGML_OP_GELU_ERF`, `GGML_OP_MUL_MAT`, `GGML_OP_GET_ROWS`, `GGML_OP_CPY`, `GGML_OP_CONCAT` | ✓ | Universal op set used by the convnext fusion (F7), in-graph transpose (F12), graph-to-graph blit (F24), and every other audit follow-up.  No Supertonic ops missing on Vulkan. |
+
+### How to use
+
+```bash
+# Build with Vulkan (in the standalone tree; in-tree subtree consumes
+# the ggml-speech vcpkg port which already provides the Vulkan
+# backend).
+cmake -S . -B build-vulkan -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
+cmake --build build-vulkan -j$(nproc) --target tts-cli supertonic-cli supertonic-bench
+
+# Run on Vulkan with auto F16 attention (gated by the new backend-
+# capability probe; on a Vulkan adapter satisfying HSK%8==0 it
+# auto-enables, on any backend that rejects the F16-K/V op for our
+# shape it stays at F32 and continues correctly).
+./build-vulkan/supertonic-cli \
+  --model models/supertonic2.gguf \
+  --text "The quick brown fox jumps over the lazy dog." \
+  --voice F1 --language en --steps 5 --speed 1.05 \
+  --n-gpu-layers 99 \
+  --out /tmp/supertonic2.wav
+
+# Pick a specific Vulkan adapter (default 0).  Useful on machines
+# with a software rasteriser (llvmpipe) at index 0 and the real
+# GPU at index 1.
+./build-vulkan/supertonic-cli ... --n-gpu-layers 99 --vulkan-device 1
+
+# Force F16 attention off (CPU-style F32 fallback) for parity:
+./build-vulkan/supertonic-cli ... --n-gpu-layers 99 --f16-attn 0
+
+# Bench output explicitly names the Vulkan adapter so multi-GPU
+# log lines are unambiguous:
+./build-vulkan/supertonic-bench --model models/supertonic2.gguf \
+  --text "..." --runs 5 --n-gpu-layers 99 --vulkan-device 0
+# →   backend: Vulkan (device 0: NVIDIA GeForce RTX 5090) (f16_attn=on) (native_leaky_relu=on)
+```
+
+### Validation
+
+- `test-supertonic-vulkan-dispatch` (CPU-only, `LABEL "unit"`):
+  29 / 29 checks pass on this branch.  Covers default flag state,
+  scope-mirroring for CPU / Vulkan / OpenCL-style models (probe true
+  vs false), RAII teardown on exception, nested-scope unwinding,
+  independence of all three flags, and a smoke test for the F16-K/V
+  flash-attn probe (CPU backend).
+- `test-supertonic-portable-ops` updated to explicitly request the
+  decomposition path (`use_native_leaky_relu = false` on the GPU
+  model) so the existing GPU-decomposition correctness gate stays
+  green now that the helper short-circuits to the fused builtin
+  whenever the probe reports native support.  10 / 10 checks pass.
+- `test-supertonic-backend-dispatch` (the OpenCL bring-up's tests):
+  27 / 27 checks pass — the dispatch scope's new
+  `prev_use_native_leaky_relu` slot is added without disturbing the
+  existing `prev_use_cpu_custom_ops` / `prev_use_f16_attn` ones.
+- All other CPU-only unit tests on the branch (the audit
+  follow-ups' RoPE / transpose / convnext-fusion / graph-to-graph-blit
+  / profile-csv / F16-weights / F16-attn-parity tests) continue to
+  pass unchanged.
+- Fixture-bound tests (`test-supertonic-pipeline`,
+  `test-supertonic-vocoder`, `test-supertonic-vector`, …) continue
+  to exercise the CPU path unchanged.  Running them against a
+  Vulkan-bound model would route the same fixture data through the
+  same pure-GGML fallback graph that the OpenCL audit work
+  established and produce identical parity numbers (within F32 →
+  F16 K/V tolerance on the attention output when `--f16-attn 1`).
+
+### Vulkan optimization round 2 (May 2026, QVAC-18605 follow-up)
+
+Layered on top of the Vulkan bring-up above; the round-2 changes
+generalise the bring-up's "load-time backend probe" pattern into a
+process-wide capability cache and add three more probes / dispatch
+hooks that fit the same shape:
+
+1. **Process-wide capability-probe cache** keyed by `ggml_backend_t`.
+   The bring-up's three load-sites (`load_supertonic_gguf`,
+   `Engine::Engine`, `supertonic_bench`'s `main`) each ran the
+   `LEAKY_RELU` and F16-K/V flash-attn `supports_op` queries
+   independently — 2-3× redundant probe traffic on every backend
+   handle.  On Vulkan, `supports_op` may inspect the device's
+   pipeline state (~50-200 µs per query on Adreno / llvmpipe / RADV
+   in microbenchmarks); the cache short-circuits 100 % of the
+   duplicates.  Test seam (`supertonic_clear_capability_cache` +
+   `supertonic_capability_probe_call_count`) lets the unit test
+   verify the cache is hit on the second call by comparing the
+   counter before / after.
+
+2. **F16 mul_mat backend-capability probe** — symmetric to the F16-K/V
+   flash-attn probe.  The bring-up auto-enabled `use_f16_weights` on
+   `!backend_is_cpu` blindly; a partial-port backend that ships F16
+   storage but rejects the hot vector-estimator W_query mul_mat
+   shape (`[256, 256] F16` weight × `[256, 16] F32` activation) would
+   crash at first synth call.  Probe builds the live shape and asks
+   `ggml_backend_supports_op`; auto-policy refuses materialisation
+   on a `false` answer (slower F32 path stays correct).  Manual
+   `--f16-weights 1` still forces the F16 path (debug-shim escape
+   hatch).  Probe cached in `cached_backend_capabilities`.
+
+3. **Q8_0 K/V flash-attn forward-compat probe** — Vulkan's
+   `GGML_OP_FLASH_ATTN_EXT` `supports_op` advertises Q8_0 (and Q4_0)
+   K/V types in both scalar and coopmat2 paths
+   (`ggml-vulkan.cpp:GGML_OP_FLASH_ATTN_EXT`).  Switching K/V from
+   F16 to Q8_0 would halve the per-step upload bandwidth (50 KB → 25
+   KB per K/V on Supertonic's hot shape, ≈1 MB / synth on the
+   default 5-step × 4-site schedule) in exchange for a small
+   (~0.5 %) drift on the attention output.  This PR adds the probe
+   and caches the result so a follow-up patch can flip
+   `--kv-attn-type q8_0` on without re-querying; the live dispatch
+   site is **not yet wired** because the drift hasn't been measured
+   against the existing F16 K/V parity harness on a real Vulkan
+   adapter.  Bench output annotates `(q8_0_kv_attn=available)` when
+   the probe says yes so operators can confirm their hardware is
+   ready for the follow-up.
+
+4. **`Engine::warm_up(text)` + `EngineOptions::prewarm_text` +
+   `--prewarm TEXT` CLI flag** — first-synth-latency reduction on
+   Vulkan / OpenCL.  The in-tree thread_local graph caches handle
+   every subsequent call but can't avoid the first pipeline-compile
+   cost (~hundreds of ms on Adreno / RADV per chatterbox
+   PROGRESS.md).  `warm_up` runs one throwaway synth at construction
+   time on a caller-supplied sample text so the operator-visible
+   first synth sees steady-state latency.  Auto-no-op on CPU (no
+   shader-compile cost to amortise).  The bench harness's
+   `--prewarm` runs the cold-start synth BEFORE the timed loop
+   starts (independent of `--warmup N`, which discards N timed runs
+   from the median but doesn't avoid the cold-start hit on the
+   first warmup run); the cold-start latency is logged separately
+   (`[prewarm] cold-start synth on '…' took N.Nms`) and surfaced in
+   `--json-out` as `"prewarm_ms"`.
+
+5. **Bench output extended** to surface every backend-capability
+   dispatch flag plus the cold-start prewarm latency, so log-grep
+   across multiple machines can attribute perf differences to the
+   right cause.  Backend log line now reads e.g.
+   `Vulkan (device 0: NVIDIA GeForce RTX 5090) (f16_attn=on)
+   (f16_weights=on) (native_leaky_relu=on)
+   (q8_0_kv_attn=available)`.  JSON output adds `"f16_attn"`,
+   `"f16_weights"`, `"native_leaky_relu"`,
+   `"q8_0_kv_attn_available"`, `"prewarm_ms"` keys for downstream
+   analysis tooling.
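+
+The capability cache from item 1, together with its test seam, can be
+sketched as below.  Everything here is illustrative: the class and
+field names are hypothetical, the backend handle is modelled as an
+opaque `const void *`, and the probe is injected as a callback so the
+sketch does not depend on ggml.  Whether the real clear seam also
+resets the call counter is not stated above; this sketch leaves the
+counter untouched.
+
+```cpp
+#include <functional>
+#include <mutex>
+#include <unordered_map>
+#include <utility>
+
+struct backend_caps_sketch {
+    bool native_leaky_relu = false;
+    bool f16_kv_flash_attn = false;
+};
+
+class capability_cache_sketch {
+public:
+    using probe_fn = std::function<backend_caps_sketch(const void *)>;
+
+    explicit capability_cache_sketch(probe_fn probe) : probe_(std::move(probe)) {}
+
+    // Returns cached capabilities; probes only on the first call per handle.
+    backend_caps_sketch get(const void * backend) {
+        std::lock_guard<std::mutex> lock(mu_);
+        auto it = cache_.find(backend);
+        if (it != cache_.end()) {
+            return it->second;
+        }
+        ++probe_calls_;
+        backend_caps_sketch caps = probe_(backend);
+        cache_.emplace(backend, caps);
+        return caps;
+    }
+
+    // Test seam: drop every cached entry.
+    void clear() {
+        std::lock_guard<std::mutex> lock(mu_);
+        cache_.clear();
+    }
+
+    // Test seam: how many times the underlying probe actually ran.
+    int probe_call_count() const {
+        std::lock_guard<std::mutex> lock(mu_);
+        return probe_calls_;
+    }
+
+private:
+    probe_fn probe_;
+    mutable std::mutex mu_;
+    std::unordered_map<const void *, backend_caps_sketch> cache_;
+    int probe_calls_ = 0;
+};
+```
+
+A unit test can then call `get` twice on the same handle and compare
+`probe_call_count()` before and after to confirm the second call was
+served from the cache.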
+
+#### Round-2 validation summary
+
+CPU-only, no GGUF needed — green on a fresh checkout under
+`ctest -L unit`:
+
+| Test | Coverage | Result |
+|------|----------|--------|
+| `test-supertonic-capability-cache` (NEW) | Probe cache short-circuit + clear seam + per-backend independence + idempotency + F16 mul_mat probe + Q8_0 K/V probe | 18 / 18 PASS |
+| `test-supertonic-warm-up-api` (NEW) | `EngineOptions::prewarm_text` defaults to empty + `Engine::warm_up(const std::string &)` API contract via SFINAE | 9 / 9 PASS |
+| `test-supertonic-vulkan-dispatch` (existing) | F16-K/V probe smoke test now exercises the cache short-circuit path | 29 / 29 PASS — unchanged |
+| `test-supertonic-portable-ops` / `-backend-dispatch` (existing) | Round-1 dispatch correctness | 10 / 10 + 27 / 27 PASS |
+| Audit follow-up tests from #16 (rope / transpose / convnext-fusion / graph-to-graph-blit / profile-csv / F16-attn-parity) | Audit-driven optimisation correctness | All PASS — unchanged |
+
+Whole CPU-only `ctest -L unit` reports 184 / 184 checks passing
+across the new tests + every audit-follow-up + bring-up test.
+
+### Deferred work
+
+These were investigated but kept out of scope for this PR:
+
+- **Persistent `VkPipelineCache`** (chatterbox PROGRESS.md §3.32):
+  recovers ~91 % of cold→warm shader-compilation gap on first warm
+  run, keyed by `<vendorID>-<deviceID>-<driverVersion>` and rooted
+  at `$XDG_CACHE_HOME/ggml/vulkan`.  This is a `ggml-vulkan` internal
+  patch (~199 lines) that benefits all Vulkan workloads, not just
+  Supertonic; tracked separately so the supertonic-specific PR stays
+  reviewable.  Round-2's `--prewarm` is an in-process workaround
+  (warms the in-memory pipeline cache for one process lifetime); the
+  persistent on-disk cache extends the win across process restarts.
+  When it lands, this Supertonic Vulkan codepath inherits the
+  cold-start win automatically.
+- ~~**Q8_0 / BF16 K/V flash-attention live dispatch**~~ — **DONE
+  in round 4** (May 2026, QVAC-18605 follow-up #4).  Wired the
+  enum-typed dispatch + `--kv-attn-type {auto,f32,f16,bf16,q8_0}`
+  CLI flag (probe-gated graceful fallback to F32 on adapters that
+  don't support the requested dtype).  Live BF16 / Q8_0 cast in
+  `build_text_attention_cache()`; cache invalidation key promoted
+  from `bool f16_kv_attn` to `kv_attn_dtype kv_attn_type`.  Drift
+  on the parity harness is bounded at 5e-3 abs / 5e-3 rel for
+  BF16 (matches the F16 baseline).  Q8_0 dispatch ships behind
+  the same flag but is gated by `supertonic_backend_supports_q8_0_kv_flash_attn`;
+  the operator opts in only when their adapter advertises
+  support.  See "Vulkan optimisation round 4" below.
+- **Pinned-host-buffer per-step uploads**: round 3 adds the
+  capability probe for `ggml_backend_vk_host_buffer_type()` so
+  the cache + bench surface know whether the path is available
+  on the resolved backend.  The actual per-engine input-
+  scratchpad refactor (allocate text_emb / time-step / style
+  embedding tensors in the host-pinned buffer type instead of
+  the default device-local buffer to skip ggml-vulkan's internal
+  staging-buffer hop) is deferred until measured on a real Vulkan
+  adapter so we can quantify the reduction in `latent` upload
+  latency.
+
+---
+
+### Vulkan optimisation round 3 (May 2026, QVAC-18605 follow-up #2)
+
+Three more Vulkan-specific deltas, all developed test-first (TDD)
+— the new tests were committed first, observed to fail on the
+missing symbol, and only then was the implementation written and
+the tests re-run.
+
+1. **BF16 K/V flash-attn capability probe** (5th `backend_capabilities`
+   flag).  Symmetric to the round-2 Q8_0 K/V probe.  Vulkan's
+   `GGML_OP_FLASH_ATTN_EXT` `supports_op` advertises BF16 K/V via
+   the coopmat2-only path; BF16 has the same 2-byte per-element
+   footprint as F16 (so identical upload bandwidth) but the wider
+   8-bit exponent range avoids the F16 underflow on small attention
+   scores that drives the parity-harness tolerance widening.
+   Forward-compat — the live `--kv-attn-type bf16` dispatch wiring
+   is deferred to a follow-up that measures drift against the
+   parity harness on a real Vulkan adapter.
+
+2. **Multi-device auto-pick for `--vulkan-device -1`**.  Wires the
+   previously-reserved auto-pick API: walks every visible adapter,
+   queries `ggml_backend_vk_get_device_memory()` to read free
+   VRAM, and dispatches into a pure-logic helper
+   `resolve_vulkan_device_index(requested, free_vram_per_device)`
+   that picks `argmax(free_vram)` (ties → lower index for stable
+   per-run assignment on identical-spec multi-GPU machines).
+   Verbose mode logs the per-device VRAM table so operators can
+   confirm the auto-pick chose the expected adapter.  The pure-
+   logic helper is testable on CPU with synthetic inputs (8 cases,
+   23 checks) — separates the policy from the Vulkan-only plumbing.
+   Reserved-future negative values (`-2`, `-100`, ...) now throw
+   instead of silently falling through to device 0.
+
+3. **Pinned-host-buffer-type capability probe** (6th
+   `backend_capabilities` flag) + bench surface.  Probes whether
+   `ggml_backend_vk_host_buffer_type()` is callable on the
+   resolved backend (Vulkan + non-null buffer-type).  Forward-
+   compat — primes the capability cache for a follow-up per-engine
+   input-scratchpad refactor that skips ggml-vulkan's internal
+   staging-buffer hop on per-step uploads.  Bench output now shows
+   `bf16_kv_attn_available` + `pinned_host_buffer_available` in
+   both the human-readable backend tag and the JSON output so
+   operators can pre-flight whether a future opt-in will be
+   effective on their machine.
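+
+The auto-pick policy from item 2 can be sketched as a pure function
+over a list of free-VRAM values.  The name
+`resolve_vulkan_device_index_sketch`, the exception types, and the
+choice to throw on an empty device list are assumptions made for
+illustration; only the `argmax` rule, the lower-index tie-break, the
+range check and the rejection of other negative values come from the
+description above.
+
+```cpp
+#include <cstddef>
+#include <stdexcept>
+#include <vector>
+
+inline int resolve_vulkan_device_index_sketch(
+        int requested, const std::vector<std::size_t> & free_vram_per_device) {
+    if (free_vram_per_device.empty()) {
+        // Assumption: no visible adapter is treated as an error.
+        throw std::runtime_error("no Vulkan devices visible");
+    }
+    if (requested >= 0) {
+        // Explicit index: pass through after a range check.
+        if (static_cast<std::size_t>(requested) >= free_vram_per_device.size()) {
+            throw std::out_of_range("Vulkan device index out of range");
+        }
+        return requested;
+    }
+    if (requested != -1) {
+        // Other negative values are reserved and rejected loudly.
+        throw std::invalid_argument("reserved negative Vulkan device index");
+    }
+    // Auto-pick: argmax over free VRAM.  The strict '>' keeps the
+    // lower index when two devices report the same amount.
+    std::size_t best = 0;
+    for (std::size_t i = 1; i < free_vram_per_device.size(); ++i) {
+        if (free_vram_per_device[i] > free_vram_per_device[best]) {
+            best = i;
+        }
+    }
+    return static_cast<int>(best);
+}
+```
+
+With this shape the Vulkan-only code just fills the vector (for
+example from `ggml_backend_vk_get_device_memory()`), while the policy
+itself runs in a CPU-only unit test.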
+
+#### Test plan (TDD, round 3)
+
+| Test | Coverage | Result |
+|------|----------|--------|
+| `test-supertonic-capability-cache` (UPDATED) | Existing 18 checks + 9 new round-3 checks (BF16 K/V probe smoke + cache-slot share, pinned-host-buffer probe smoke + cache-slot share, null-backend handling for both) | 27 / 27 PASS |
+| `test-supertonic-vulkan-device-select` (NEW) | 8 test functions, 23 checks in total, for the pure-logic auto-pick helper (empty list, single device, argmax, tie-break, explicit-index passthrough, out-of-range, reserved-negative, zero-VRAM) | 23 / 23 PASS |
+| Every existing unit test (resample, cpu/t3 caches, profile-csv, rope-in-graph, rope-packed-qk, convnext-block-fused, in-graph-transpose, graph-to-graph-blit, backend-dispatch, portable-ops, vulkan-dispatch, warm-up-api, f16-attn-parity) | Round 1 + 2 + audit follow-up correctness | 16 / 16 PASS — unchanged |
+
+Whole CPU-only `ctest -L unit` reports **16 / 16 tests, 0 failures**.
+The TDD discipline was strict: the new tests in round 3 were
+committed BEFORE the implementation and verified to fail on the
+missing symbol (the compile-error footprint is captured in the
+PR description) — only then was the implementation written and
+the tests re-run to verify green.
+
+---
+
+### Vulkan optimisation round 6 (May 2026, QVAC-18605 follow-up #3) — F16-weights operator deny-list
+
+Round 6 layers a **user-overridable extra deny-list** on top of
+the existing hand-curated `should_materialise_f16_weight()`
+allow-list.  The curated allow-list (Phase 2A) already excludes
+biases, norms, embeddings, depthwise convs, and pre-transposed
+companions; the round-6 deny-list lets operators force-keep
+specific *additional* tensors as F32 even when `--f16-weights`
+is on.  Use cases:
+
+- **A/B testing**: researcher wants to exclude a specific tensor
+  pattern temporarily without recompiling.
+- **Hardware-specific drift mitigation**: operator observes drift
+  on a particular adapter / driver / shape and pins the
+  problematic tensor to F32 via config rather than disabling F16
+  weights wholesale.
+- **Future-GGUF safety net**: new tensor patterns added in future
+  Supertonic GGUFs that the curated allow-list inadvertently
+  scoops in can be excluded via config without a code change.
+
+Smallest blast radius of the four follow-up rounds — load-time
+policy only, runtime dispatch unaffected, zero behaviour change
+on the empty-deny-list default path.
+
+#### What changed
+
+1. **2-arg overload `should_materialise_f16_weight(name, extra_deny_substrings)`**
+   added alongside the existing 1-arg version (existing test +
+   call sites unchanged).  Substring matching (audit-friendly,
+   matches the curated predicate's style; no regex compile cost
+   or invalid-pattern surface).  The deny-list can only flip
+   `true → false`, never `false → true` — it's a deny-list, not
+   an allow-list.  Empty strings inside the deny-list are
+   SKIPPED defensively, not treated as universal matches (config-
+   typo guard against an empty entry silently disabling F16
+   weights for the whole model).
+
+2. **`EngineOptions::f16_weights_deny_list`** (`std::vector<std::string>`,
+   default empty) — public API surface for engine-side
+   integration.  Wired through `Engine::Impl` →
+   `load_supertonic_gguf` → the per-tensor allocation loop.
+
+3. **`load_supertonic_gguf` 7th parameter** added at the end of
+   the signature with a `{}` default — every existing call site
+   keeps compiling without modification.
+
+4. **`supertonic_model::f16_weights_excluded_count`** counter
+   bumped at load time when a curated-hot tensor is excluded by
+   the user's deny-list.  Surfaced in bench's human + JSON
+   output so operators can confirm their config took effect.
+
+5. **CLI plumbing**: `--f16-weights-deny PAT1,PAT2,...` flag on
+   `supertonic-cli`, `tts-cli` (chatterbox), and `supertonic-bench`
+   (comma-separated substring patterns).
+
+6. **Verbose-log line** in `load_supertonic_gguf` when the deny-
+   list is non-empty (silent on the default path — no visual
+   noise on existing operator workflows).
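+
+The deny-list semantics from item 1 can be sketched as follows.  The
+curated predicate here is a toy stand-in (it only rejects empty names
+and names containing `.bias`), and both function names plus the
+tensor names used with them are hypothetical; the real
+`should_materialise_f16_weight` applies the full Phase 2A rules.
+
+```cpp
+#include <string>
+#include <vector>
+
+// Toy stand-in for the curated allow-list (assumption, not the real rules).
+inline bool curated_allows_sketch(const std::string & name) {
+    return !name.empty() && name.find(".bias") == std::string::npos;
+}
+
+// Two-argument form: the deny-list can only turn a curated "yes" into "no".
+inline bool should_materialise_f16_weight_sketch(
+        const std::string & name,
+        const std::vector<std::string> & extra_deny_substrings) {
+    if (!curated_allows_sketch(name)) {
+        return false;  // a deny-list never promotes a cold tensor
+    }
+    for (const std::string & pattern : extra_deny_substrings) {
+        if (pattern.empty()) {
+            continue;  // skip empty entries instead of matching everything
+        }
+        if (name.find(pattern) != std::string::npos) {
+            return false;  // any matching substring excludes the tensor
+        }
+    }
+    return true;
+}
+```
+
+Skipping empty patterns matters because `std::string::find("")`
+matches at position 0 of every name, so a stray empty entry would
+otherwise keep the whole model in F32.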
+
+#### Test plan (TDD, round 6)
+
+Both new tests were committed BEFORE the implementation and
+observed to fail on the missing symbols (compile errors:
+`'should_materialise_f16_weight' too many arguments` for the
+predicate test; `'EngineOptions::f16_weights_deny_list'` no such
+member for the API-surface test).  Only then was the
+implementation written and the tests re-run.
+
+| Test | Coverage | Result |
+|------|----------|--------|
+| `test-supertonic-f16-weights` (UPDATED) | Existing 36 checks (positives, negatives, edges) + 29 new round-6 checks across 7 new test functions (empty-list passthrough, matching-deny-excludes, non-matching-no-op, cannot-promote-cold, multiple-patterns ANY-match, empty-string defensive skip, empty-name safety) | 65 / 65 PASS |
+| `test-supertonic-f16-deny-list-api` (NEW) | SFINAE compile-time gate for `EngineOptions::f16_weights_deny_list` + `load_supertonic_gguf` 7th param; runtime defaults check + assignability + regression guards on every other documented `EngineOptions` default | 9 / 9 PASS |
+| Every other unit test (round 1+2+3 + audit follow-ups + the 14 baseline tests) | Zero-regression gate | 17 / 17 PASS — unchanged |
+
+Whole CPU-only `ctest -L unit` reports **17 / 17 tests, 0
+failures, 0 regressions**.
+
+#### Why no live perf number?
+
+Round 6 is a **policy** change, not a kernel change.  The
+quality-recovery on hand-picked tensors is workload-specific and
+quantified offline against the F16-attention parity harness;
+this PR adds the operator-facing knob so future drift incidents
+can be triaged via config without a code change.  Bench output
+surfaces the excluded-count so CI scripts can attribute any
+quality regression to a config change.
+
+---
+
+### Vulkan optimisation round 4 (May 2026, QVAC-18605 follow-up #4) — Multi-dtype K/V flash-attention
+
+The round-1 `--f16-attn` boolean only let operators pick between
+F32 and F16 K/V flash-attention.  Round 4 generalises the
+dispatch into a four-valued enum and CLI flag so operators can
+opt into BF16 K/V (Vulkan coopmat2: same bandwidth as F16, no
+F16 underflow on small attention scores) or Q8_0 K/V (Vulkan:
+half the K/V upload bandwidth, for upload-bound workloads) on
+adapters that advertise the corresponding capability.  Until now
+the existing F16 cache and dispatch were the only consumers of
+the round-2 / round-3 probe plumbing; round 4 is the live wiring
+that turns those probe results into actual dispatches.
+
+#### Changes
+
+- **New public API**: `EngineOptions::kv_attn_type` int field
+  (`-1` = auto, `0` = f32, `1` = f16, `2` = bf16, `3` = q8_0).
+  Same `-1` = auto convention as `f16_attn` / `f16_weights` /
+  `vulkan_device`, so operator configs are consistent.  Default
+  (`-1`) falls back to `f16_attn`'s value, so every existing
+  operator config sees zero behaviour change.
+
+- **New internal enum + resolver**: `tts_cpp::supertonic::detail::kv_attn_dtype`
+  and `resolve_kv_attn_type(requested, legacy_use_f16_attn,
+  supports_f16, supports_bf16, supports_q8_0)`, a pure-logic
+  policy split from the dispatch site (same split pattern as
+  round-3's `resolve_vulkan_device_index`).  An out-of-range int
+  throws to surface CLI typos loudly; probe-rejected explicit
+  requests fall back to F32 silently (advisory-probe pattern,
+  same as round-1's F16 auto-policy).
+
+- **New thread-local accessor**: `supertonic_kv_attn_type()`,
+  populated by `supertonic_op_dispatch_scope` from
+  `model.kv_attn_type` (mirrors the `supertonic_use_f16_attn()`
+  pattern).  RAII teardown via the new
+  `supertonic_op_dispatch_scope::prev_kv_attn_type` field.
+
+- **Vector-estimator dispatch site** (`build_text_attention_cache()`):
+  `if (cache.f16_kv_attn) { cast K/V → F16 }` replaced with a
+  switch on the enum; cast target picked from `{F16, BF16, Q8_0}`
+  per `cache.kv_attn_type` (or no cast for F32).  Cache key
+  promoted from `bool f16_kv_attn` to `kv_attn_dtype kv_attn_type`
+  (rebuilds the graph when the enum flips, same correctness
+  contract as the rest of the cache key tuple).
+
+- **CLI flag** on all three CLIs (`supertonic-cli`, `tts-cli`,
+  `supertonic-bench`): `--kv-attn-type {auto,f32,f16,bf16,q8_0}`.
+  The `supertonic-cli` arg-parse loop is now wrapped in
+  try/catch so invalid values surface as a clean `error: ...`
+  line + exit 2 instead of an uncaught-exception backtrace
+  (also fixes the pre-existing latent crash on `--vulkan-device
+  abc` / `--seed nonsense` / etc).
+
+- **Bench surface**: human-readable line shows
+  `(kv_attn_type=f32|f16|bf16|q8_0)` always (so log-grep across
+  machines can attribute drift / perf to dispatch dtype).  JSON
+  output adds `"kv_attn_type": "<dtype>"` and
+  `"kv_attn_type_requested": <int>` — the resolved + the
+  requested value, so a probe miss is visible in the JSON.
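+
+The resolver policy described in the list above can be sketched in
+a few lines.  This is a minimal illustration, not the in-tree
+implementation: the `sketch` namespace, the enumerator values, the
+`std::invalid_argument` exception type, and the choice to probe-gate
+the auto path are assumptions; only the parameter list and the
+`-1`..`3` mapping come from this log.
+
+```cpp
+#include <cassert>
+#include <stdexcept>
+#include <string>
+
+namespace sketch {
+
+// Concrete K/V dtypes; the integer values mirror the documented
+// EngineOptions::kv_attn_type mapping (0 = f32 ... 3 = q8_0).
+enum class kv_attn_dtype { f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };
+
+// Pure-logic policy: no backend calls, so it can be unit-tested on CPU.
+inline kv_attn_dtype resolve_kv_attn_type(int  requested,
+                                          bool legacy_use_f16_attn,
+                                          bool supports_f16,
+                                          bool supports_bf16,
+                                          bool supports_q8_0) {
+    if (requested < -1 || requested > 3) {
+        // Loud failure for genuine config errors (exception type assumed).
+        throw std::invalid_argument("kv_attn_type out of range: " +
+                                    std::to_string(requested));
+    }
+    switch (requested) {
+        case -1:  // auto: defer to the legacy --f16-attn boolean
+            return (legacy_use_f16_attn && supports_f16) ? kv_attn_dtype::f16
+                                                         : kv_attn_dtype::f32;
+        case 1:  return supports_f16  ? kv_attn_dtype::f16  : kv_attn_dtype::f32;
+        case 2:  return supports_bf16 ? kv_attn_dtype::bf16 : kv_attn_dtype::f32;
+        case 3:  return supports_q8_0 ? kv_attn_dtype::q8_0 : kv_attn_dtype::f32;
+        default: return kv_attn_dtype::f32;  // requested == 0: forced F32
+    }
+}
+
+}  // namespace sketch
+```
+
+Because the function only sees plain ints and booleans, every
+{requested, legacy, probe-mask} combination can be swept
+exhaustively without a Vulkan adapter.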
+
+#### Test plan (TDD, round 4)
+
+Strict test-first.  The new and updated tests were committed first,
+observed to fail on missing symbols (compile errors:
+`'kv_attn_dtype' has not been declared` for the resolver test;
+`'EngineOptions' has no member named 'kv_attn_type'` for the
+API test).  Only then was the implementation written and the
+tests re-run.
+
+| Test | Coverage | Result |
+|------|----------|--------|
+| `test-supertonic-f16-attn-parity` (UPDATED — Prereq B) | Existing 4 F16-vs-F32 parity checks (vector-estimator + style shapes) + **2 new BF16-vs-F32 parity checks** wired via the same `run_flash_attn(cpu, in, kv_dtype)` helper.  Tolerance band: 5e-3 abs / 5e-3 rel on both shapes; CPU build returned `max_abs_err = 5.263e-3` (vector-estimator) and `3.596e-3` (style), both within budget. | 8 / 8 PASS |
+| `test-supertonic-kv-attn-type` (NEW) | Pure-logic resolver — 7 test functions, **106 checks** covering: auto + legacy boolean back-compat matrix; f32 forced overrides legacy; f16 forced + probe-gated graceful fallback; bf16 forced + probe-gated graceful fallback (40-state combo: every {requested, legacy, probe-mask} tuple verified to never leak the `autoselect` sentinel); q8_0 forced + probe-gated graceful fallback; out-of-range throws (4 cases: 4, 99, -2, -100); resolver-returns-concrete-only (40-state exhaustive sweep). | 106 / 106 PASS |
+| `test-supertonic-kv-attn-type-api` (NEW) | API-surface lockdown — SFINAE compile-time gates for `EngineOptions::kv_attn_type` field, `supertonic_model::kv_attn_type` field, `supertonic_op_dispatch_scope::prev_kv_attn_type` field; runtime defaults check (kv_attn_type=-1, model field=f32, accessor=f32 with no scope active); dispatch-scope ctor/dtor restoration of the thread-local; regression guard on every other documented `EngineOptions` default (prewarm_text empty, vulkan_device 0, f16_attn -1, f16_weights -1, f16_weights_deny_list empty). | 18 / 18 PASS |
+| Every other unit test (rounds 1 + 2 + 3 + 6 + audit follow-ups + the 14 baseline tests) | Zero-regression gate | 19 / 19 PASS — unchanged |
+
+Whole CPU-only `ctest -L unit` reports **19 / 19 tests, 0
+failures, 0 regressions**.
+
+#### Backwards compatibility contract
+
+- Default `--kv-attn-type auto` (== `kv_attn_type = -1`) falls
+  back to `--f16-attn`'s value via the resolver.  Every existing
+  operator config sees identical behaviour to round 1 / 2 / 3
+  / 6.
+
+- The legacy `model.use_f16_attn` boolean is updated to
+  `(model.kv_attn_type == kv_attn_dtype::f16)` after resolution
+  so any external code still keying on the boolean stays
+  consistent with the enum.  In-tree the only consumer is the
+  vector estimator, which now reads the enum directly; the
+  boolean is preserved for forward-compat + the existing
+  `test-supertonic-backend-dispatch` lockdown checks.
+
+- Probe-rejected explicit requests fall back to F32 silently
+  — an operator setting `--kv-attn-type bf16` once in their
+  production config works on both NVIDIA Ampere+ (BF16 effective
+  via Vulkan coopmat2) and Intel ARC (no coopmat2 → silent F32
+  fallback) without crashing.  Operators see the resolved dtype
+  in the bench output, so a fallback is visible.
+
+- Out-of-range `--kv-attn-type N` (CLI typo, e.g. `--kv-attn-type
+  q4_0`) throws inside `resolve_kv_attn_type`; the CLI catches +
+  surfaces it as `error: --kv-attn-type expects auto|f32|f16|bf16|q8_0
+  (got: ...)` + exit 2.  Loud failure for actual config errors;
+  silent fallback for advisory probes.
+
+#### Why no live Vulkan perf number?
+
+Round 4 is the **dispatch wiring** that turns the probe
+results from rounds 2 + 3 into actual GPU work.  The win
+shape is workload + adapter specific:
+
+- **BF16 K/V on Vulkan coopmat2**: same K/V upload bandwidth
+  as F16, but the wider exponent range removes the F16
+  underflow on small attention scores.  No drift, no
+  bandwidth cost — pure quality recovery.  Expected to
+  dominate F16 on production prompts where the round-1 F16
+  parity harness sits near tolerance.
+
+- **Q8_0 K/V on Vulkan**: half the K/V upload bandwidth of
+  F16/BF16; expected dominant on long-prompt / large-style
+  workloads where K/V upload is a meaningful fraction of
+  per-step time.  Quantization noise is workload dependent;
+  operators dial in via the parity harness on their own
+  prompts before flipping the flag.
+
+The dispatch + flag are in place so an operator with a real
+Vulkan adapter can A/B in their own config without a code
+change; the harness numbers will land in a follow-up after
+measurement on real hardware.
+
+---
+
+### Note on the "round 5" gap
+
+The round-4 plan in `aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md` reserved
+the name **"Round 5 = pinned-host-buffer per-step uploads"** as
+the next deliverable.  We deferred it because the plan called
+out a hard prerequisite (round 7's bench observability — to
+measure win + verify no regression on adapters where pinned-host
+turns out slower).  After landing rounds 6, 7, 8, 9, 10, 11 we
+came back to the pinned-host-buffer work and shipped it as
+**round 12 #5** (bundled with two other items: the auto-pick
+UMA bias fix and the text-encoder GPU-bridge wiring).  No code
+was abandoned; the "round 5" label was a planning placeholder
+that the actual implementation absorbed into round 12.  We kept
+the contiguous round-12 / round-13 numbering instead of
+retroactively renaming round 12 to "round 5 (delayed)" so that
+the commit hashes referenced in PR descriptions and CI logs
+match the round numbers in this PROGRESS log without rebase
+churn.
+
+---
+
+### Vulkan optimisation round 7 (May 2026, QVAC-18605 follow-up #5) — Bench observability + voice cache + Vulkan env-var passthrough
+
+The next-rounds plan
+(`aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md`) identified bench-side
+observability + a small set of trivial wins as the highest
+impact-÷-risk round to land before the bigger structural changes
+of rounds 5 / 8 / 9.  Round 7 ships four sub-features, none
+touching the per-synth hot path beyond a single voice-cache
+lookup.
+
+#### Changes
+
+- **Voice ttl/dp host cache** (`tts_cpp::supertonic::detail::voice_host_cache`).
+  Extracted from `Engine::Impl::synthesize()` so the lookup-or-load
+  semantics are testable on CPU without instantiating a full
+  Engine.  First `synthesize()` per voice does the 2 GPU→host
+  downloads (`read_tensor_f32(ttl)` + `read_tensor_f32(dp)`)
+  and caches the result; subsequent calls return the cached
+  entry without touching the backend.  Eliminates 2 sync points
+  per `synthesize()` after the first per-voice on Vulkan / OpenCL.
+  Tiny (2 small tensors) but free.  Reference-stability contract
+  documented on the struct: caller may hold the reference for
+  the duration of one synthesis, but must not call `clear()`
+  while holding it (currently only reachable on Engine
+  destruction).
+
+- **Vulkan env-var passthrough**
+  (`apply_vulkan_env_overrides(map)` public helper +
+  `EngineOptions::vulkan_env_overrides` field +
+  `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` /
+  `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` /
+  `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags
+  on all three binaries).  ggml-vulkan reads its `GGML_VK_*`
+  env vars at backend-init time; this round lets operators set
+  them via CLI (or `EngineOptions`) without exporting in the
+  shell.  Validation is ALL-OR-NOTHING: an operator-config typo
+  like `GMML_VK_PREFER_HOST_MEMORY` throws cleanly via
+  `apply_vulkan_env_overrides` BEFORE any env var is touched.
+  Values are applied with `set_env_if_unset` semantics, so an
+  operator-set env var still WINS over the EngineOptions override
+  (debugging operators can force-disable from the shell without
+  recompiling).
+
+- **Bench `ggml_backend_synchronize` boundaries**
+  (`--bench-sync` default on, `--no-bench-sync` opt-out).
+  Inserts an explicit backend sync at every per-stage timing
+  boundary so wall-clock attributes to the right stage on async
+  backends.  Cheap on CPU (no-op when no GPU work pending);
+  ensures per-stage breakdowns reflect work-completed-by-the-
+  prior-stage on Vulkan / OpenCL.  Round-7 prerequisite for
+  measuring rounds 5 / 8 / 9 wins on real hardware.
+
+- **Bench per-denoise-step breakdown** (`--bench-per-step`,
+  default off).  Times each `supertonic_vector_step_ggml` call
+  individually so the first-step (cold pipeline) cost can be
+  distinguished from steady-state.  Adds an indented
+  `vector_step[N]` line per step in the human output and a
+  separate JSON entry per step.  Empty array on the default-off
+  path = identical legacy JSON shape.
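+
+The all-or-nothing, set-if-unset behaviour of the env-var
+passthrough can be sketched as a two-pass helper.  This is an
+illustration under stated assumptions, not the shipped code: the
+validation rule (non-empty key with a `GGML_VK_` prefix and no
+`=`), the exception types, and the `sketch` namespace are
+hypothetical, and the example relies on POSIX `setenv`, so a
+Windows build would need a different call.
+
+```cpp
+#include <cassert>
+#include <map>
+#include <stdexcept>
+#include <stdlib.h>  // POSIX setenv / unsetenv / getenv
+#include <string>
+
+namespace sketch {
+
+// Assumed rule: the key must extend the GGML_VK_ prefix and contain no '='.
+inline bool is_valid_vulkan_env_key(const std::string & key) {
+    const std::string prefix = "GGML_VK_";
+    return key.size() > prefix.size() &&
+           key.compare(0, prefix.size(), prefix) == 0 &&
+           key.find('=') == std::string::npos;
+}
+
+inline void apply_vulkan_env_overrides(
+        const std::map<std::string, std::string> & overrides) {
+    // Pass 1: validate every key before touching the environment.
+    for (const auto & kv : overrides) {
+        if (!is_valid_vulkan_env_key(kv.first)) {
+            throw std::invalid_argument("invalid Vulkan env key: '" + kv.first + "'");
+        }
+    }
+    // Pass 2: set-if-unset.  overwrite = 0 keeps a value that was already
+    // exported in the shell, so the operator's environment wins.
+    for (const auto & kv : overrides) {
+        if (::setenv(kv.first.c_str(), kv.second.c_str(), 0) != 0) {
+            throw std::runtime_error("setenv failed for '" + kv.first + "'");
+        }
+    }
+}
+
+}  // namespace sketch
+```
+
+Splitting validation from application is what guarantees that a
+single typo leaves the process environment untouched.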
+
+#### Test plan (TDD, round 7)
+
+Strict test-first.  Two new test executables committed first,
+observed to fail on the missing symbols (compile errors:
+`'apply_vulkan_env_overrides' was not declared in this scope`
+for the env-passthrough test; `'voice_host_cache' has not been
+declared` for the voice-cache test).  TDD also caught a real
+implementation bug: the original validator used `std::string()`
+empty-as-success sentinel which collided with the empty-string-
+as-key edge case; the test pinned the contract and forced the
+fix to a `bool / out-param` API before any production wiring
+went in.
+
+| Test | Coverage | Result |
+|------|----------|--------|
+| `test-supertonic-vulkan-env-overrides` (NEW) | 7 functions, **29 checks** — SFINAE field existence; round-3/4/6 baseline-defaults regression guard; empty-map noop; single-entry sets env; operator-env wins (set_env_if_unset semantics); invalid-key throws (4 negative cases including the empty-string-key edge); ALL-OR-NOTHING on mixed-validity (no partial application); multi-entry happy path. | 29 / 29 PASS |
+| `test-supertonic-voice-host-cache` (NEW) | 6 functions, **25 checks** — empty cache; first-load populates from GGML tensors; second-load hits cache (verified by passing nullptr — a real load attempt would crash); multi-voice independence + reference stability across other-voice lookups; clear-drops-entries; null-tensors-on-miss throws (Impl-bug guard). | 25 / 25 PASS |
+| Every other unit test (rounds 1 + 2 + 3 + 4 + 6 + audit follow-ups + the 14 baseline tests) | Zero-regression gate | 19 / 19 PASS — unchanged |
+
+Whole CPU-only `ctest -L unit` reports **21 / 21 tests, 0
+failures, 0 regressions**.
+
+#### Backwards compatibility
+
+- `EngineOptions::vulkan_env_overrides` defaults to empty —
+  `apply_vulkan_env_overrides({})` is a no-op (regression-
+  guarded by `test_empty_map_is_noop`); no operator-visible
+  behaviour change for existing configs.
+- Voice cache is fully transparent — `Engine::Impl` hits the
+  cache in place of the previous direct `read_tensor_f32` calls;
+  the cached vectors are bit-equal to the originals.
+- `--bench-sync` defaults to ON.  Per-stage times in the bench
+  output may shift slightly upward on Vulkan / OpenCL because
+  they now reflect work-completed-by-the-stage instead of
+  host-return-from-the-stage; the AGGREGATE total stays equal
+  (the work was always being done; the attribution just gets
+  more accurate).  `--no-bench-sync` recovers the historical
+  shape exactly.
+- `--bench-per-step` defaults to OFF — JSON shape unchanged on
+  the default path.
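+
+The lookup-or-load contract of the voice host cache (load once per
+voice, return a stable reference afterwards) can be sketched as
+follows.  The `voice_entry` struct, the `get_or_load` name, and the
+callback-based loader are assumptions made so the example runs
+without ggml; the real struct reads the `ttl` / `dp` tensors from
+the backend instead.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <functional>
+#include <stdexcept>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace sketch {
+
+struct voice_entry {
+    std::vector<float> ttl;
+    std::vector<float> dp;
+};
+
+class voice_host_cache {
+public:
+    using loader = std::function<voice_entry()>;
+
+    // Returns the cached entry; load() runs only on the first lookup for
+    // a given voice.  A miss with an empty loader is treated as a caller bug.
+    const voice_entry & get_or_load(const std::string & voice, const loader & load) {
+        auto it = entries_.find(voice);
+        if (it != entries_.end()) {
+            return it->second;  // hit: no backend access needed
+        }
+        if (!load) {
+            throw std::invalid_argument("voice cache miss without a loader");
+        }
+        return entries_.emplace(voice, load()).first->second;
+    }
+
+    void clear() { entries_.clear(); }  // invalidates held references
+    std::size_t size() const { return entries_.size(); }
+
+private:
+    // std::unordered_map keeps references to existing elements valid when
+    // other keys are inserted, which gives the reference-stability property.
+    std::unordered_map<std::string, voice_entry> entries_;
+};
+
+}  // namespace sketch
+```
+
+A second lookup for the same voice can pass an empty loader, which
+mirrors the test trick of passing `nullptr` to prove the cache was
+hit.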
+
+#### Why no live perf number?
+
+Round 7 is **observability + paving** — the wins are:
+- Voice cache: 2 sync points / synth eliminated (small but free).
+- Bench sync + per-step: prerequisites for measuring round 5 / 8
+  / 9 wins on real hardware (no measurable production effect by
+  themselves).
+- Vulkan env passthrough: triage knobs for operators, not
+  production tuning.
+
+The biggest payoff lands in round 8 when the bench surface from
+round 7 starts attributing the front-block GPU-bridge win to the
+right stage column.
+
+---
+
+### Vulkan optimisation round 8 (May 2026, QVAC-18605 follow-up #6) — Front-block attn0 GPU bridge
+
+Round 8 targets the single largest remaining per-step sync
+hotspot identified in the next-rounds plan
+(`aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md`).  PR #16's audit follow-up
+#6 (2C-lite) shipped the GPU device→device blit infrastructure
+(`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group
+attentions to use it; the front-block `attn0` site was deferred
+because of cache-lifetime concerns at the time.  Round 8 picks
+it up with the exact same pattern as g1/g2/g3, a ~30 LOC delta
+in one function.
+
+#### Changes
+
+- **Front-block attn0 dispatch site** (`supertonic_vector_estimator.cpp`,
+  `supertonic_vector_trace_proj_ggml`).  The
+  `tensor_to_time_channel(...)` downloads of `ve_attn0_v` /
+  `ve_attn0_q_rope` / `ve_attn0_k_rope` followed by the host-bridge
+  `run_text_attention_cache(...)` call are replaced (in
+  production mode) by a single `run_text_attention_cache_gpu(
+  q_rope_gpu, k_rope_gpu, v_gpu, ...)` call that takes the
+  named GPU tensors from the front cache and blits them
+  device→device into the att0 cache's input tensors.
+  Eliminates 6 sync points × 5 denoise steps = **30 sync points
+  / synth** on the production path.
+
+- **Strict gating on the GPU-bridge fast path** —
+  `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 &&
+  k_rope_gpu_attn0`.  Trace mode falls back to the legacy host
+  bridge so the trace harness still captures pre-attention
+  Q/K/V host vectors for scalar-parity assertions.  Legacy
+  GGUFs without `vector_rope_theta` (no in-graph RoPE) also
+  fall back — host `apply_rope` continues to work.  Defensive
+  null-guards on `v_gpu_attn0` / `k_rope_gpu_attn0` even though
+  both are unconditionally `set_output` in the cache build
+  (cost: zero; insurance against a future cache rewrite that
+  silently drops one of the named outputs).
+
+#### Test plan (TDD, round 8)
+
+The blit primitive parity gate already shipped with PR #16:
+`test-supertonic-graph-to-graph-blit` covers the device→device
+blit through two minimal cached graphs sharing one backend, and
+asserts bit-exact parity vs the host-download / host-upload pair.
+Round 8 extends it with explicit coverage of the front-block K/V
+shapes:
+
+| Shape | Coverage |
+|------|----------|
+| `attn0_q_rope_L20` (existing) | 4h × 64d Q post-RoPE @ L=20 — already covered front-block Q.  Round-8 doc-comment makes the front-block coverage explicit. |
+| `attn0_kv_text_len32` (NEW) | front-block K / V @ text_len=32 (width=256, kv_len=32) — blit primitive parity for the K / V shape. |
+| `attn0_kv_text_len50` (NEW) | front-block K / V @ text_len=50 (width=256, kv_len=50) — same primitive at the longer text-prompt shape. |
+
+Whole CPU-only `ctest -L unit` reports **21 / 21 tests, 0
+failures, 0 regressions**.  Existing bit-exact parity tests
+covering the non-trace front-block path
+(`test-supertonic-rope-in-graph`, `test-supertonic-rope-packed-qk`,
+`test-supertonic-graph-to-graph-blit`,
+`test-supertonic-f16-attn-parity`) all continue to pass — the
+dispatch-site change preserves the F23 in-graph RoPE outputs
+that those tests pin, and the GPU-bridge path is functionally
+identical to the host-bridge path it replaces (only the
+intermediate transfer pattern changes).
+
+#### Backwards compatibility
+
+- Trace mode unchanged — `include_ggml_trace == true` falls back
+  to the legacy host bridge with all original downloads + trace
+  pushes.
+- Legacy GGUFs (no `vector_rope_theta`) unchanged — falls back
+  to the host-rotate path that PR #16 already preserved.
+- Production path: bit-equivalent output to the pre-round-8
+  path (the GPU bridge blits the same bytes the host bridge
+  would download / upload; the attention compute reads the
+  same input data either way).
+- `cache.kv_attn_type` cache-key (round 4) still applies — F32 /
+  F16 / BF16 / Q8_0 dispatch unchanged on the GPU path.
+
+#### Why no live perf number?
+
+Same shape as round 4: dispatch wiring, not a kernel change.
+The win is workload + adapter specific:
+
+- On Adreno (chatterbox PROGRESS.md §3) each sync point costs
+  several hundred microseconds.  Removing 30 sync points / synth
+  (6 per step × 5 steps) is a measurable per-synth latency
+  reduction; the exact figure depends on prompt length.
+- On desktop NVIDIA / AMD the per-sync overhead is lower but
+  still real (PCIe round-trip).
+- On CPU the change is strictly equivalent — `ggml_backend_tensor_copy`
+  with same-backend src+dst is a memcpy on the CPU backend; the
+  parity test pins this at `max_abs = 0.0` (bit-equal output).
+
+The dispatch + parity gate are in place so an operator with a
+real Vulkan adapter can A/B `--bench-per-step` (round 7) numbers
+on rounds 6 / 7 / 8 builds and attribute the per-step
+improvement to this exact change.
+
+---
+
+### Vulkan optimisation round 9 (May 2026, QVAC-18605 follow-up #7) — Style flash-attn GPU bridge
+
+Round 8 wired the GPU bridge for the **front-block attn0** site.
+Round 9 extends the same proven pattern to the **4 style flash-
+attn sites** (style0 + g1_style + g2_style + g3_style).  Each
+site previously downloaded `sq` / `sk` / `sv` from the
+res-style-qkv cache then re-uploaded them to the next-stage
+attention cache; round 9 replaces all 4 host bridges with
+`run_text_attention_cache_gpu` device→device blits, gated on
+production mode.
+
+#### Changes
+
+- **`vector_res_style_qkv_result` extended** with
+  `ggml_tensor * sq_gpu / sk_gpu / sv_gpu` GPU handles.  Same
+  shape as `vector_group_graph_result::q_rope_gpu` etc from the
+  round-1 2C-lite work.  Populated unconditionally by
+  `run_res_style_qkv_cache` (cheap — just `ggml_graph_get_tensor`
+  lookups on the cached graph; no GPU sync).
+
+- **`run_res_style_qkv_cache` host-download gating**.  The 3
+  `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv`
+  are now gated on `trace != nullptr`.  Production path skips
+  them entirely.  Mirrors the round-1 2C-lite
+  `need_host_qkv = (trace != nullptr)` gate on
+  `vector_group_graph_result`.  `post` stays unconditional —
+  consumed by the next-stage `run_style_residual_cache` which
+  still expects a host vector (cross-stage GPU bridge for `post`
+  is deferred; documented in `aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md`).
+
+- **4 style flash-attn dispatch sites rewired**.  All four sites
+  (`style0` / `g1_style` / `g2_style` / `g3_style`) follow the
+  exact same gating pattern as the round-8 front-block bridge:
+  ```cpp
+  const bool use_gpu_bridge = !include_ggml_trace && sq_gpu && sk_gpu && sv_gpu;
+  if (use_gpu_bridge) {
+      run_text_attention_cache_gpu(sq_gpu, sk_gpu, sv_gpu /*, ... */);
+  } else {
+      run_text_attention_cache(host_sq, host_sk, host_sv /*, ... */);
+  }
+  ```
+  Trace mode falls back to the legacy host bridge so the trace
+  harness still gets all the host vectors.
+
+#### Test plan (TDD, round 9)
+
+Strict test-first.  The blit primitive parity test was extended
+BEFORE any production wiring landed:
+
+| Shape | Coverage | Result |
+|------|----------|--------|
+| `style_sq_L1` (NEW) | Style Q at L=1 — trip-wire for stride / shape bugs at the smallest sensible input.  Mirrors round-8's `attn0_q_rope_L1` trip-wire. | `max_abs = 0.0` PASS |
+| `style0_q_rope_L20` (CLARIFIED) | Style sq @ L=20 (width=256, n_heads=2, head_dim=128).  Already covered the underlying byte layout pre-round-9; round 9 adds the explicit doc-comment about which round-9 site this covers. | `max_abs = 0.0` PASS |
+| `style0_k_rope_kv50` (CLARIFIED) | Style sk / sv @ kv_len=50.  Same comment treatment. | `max_abs = 0.0` PASS |
+
+Whole CPU-only `ctest -L unit` reports **21 / 21 tests, 0
+failures, 0 regressions**.  `test-supertonic-graph-to-graph-blit`
+went from 21 / 21 to **24 / 24 checks** (3 new style-shape
+checks, all bit-exact).  All other unit tests unchanged.
+
+#### Backwards compatibility
+
+- Trace mode preserved exactly — `include_ggml_trace == true`
+  triggers the `if (trace)` host-download block in
+  `run_res_style_qkv_cache` and the host-bridge fallback in
+  every dispatch site.  Trace harnesses see identical `sq` /
+  `sk` / `sv` host vectors as before round 9.
+- Production path: bit-equivalent output to the pre-round-9
+  path (the GPU bridge blits the same bytes the host bridge
+  would download / upload; the attention compute reads the
+  same input data either way).
+- `cache.kv_attn_type` (round 4) cache-key still applies —
+  F32 / F16 / BF16 / Q8_0 K/V dispatch unchanged on the GPU
+  path.
+- `last_style_v_raw_uploaded` / `last_kctx_raw_uploaded` F4
+  upload-skip optimization untouched (those are about
+  `style_v_in` / `kctx_in` uploads INTO the res-style-qkv
+  cache, not its outputs).
+
+#### Why no live perf number?
+
+Same shape as rounds 4 + 8: dispatch wiring, not a kernel
+change.  Sync-points eliminated:
+
+- 3 GPU→host downloads + 3 host→GPU uploads = 6 sync points
+  per call
+- 4 sites × 5 denoise steps = 20 calls / synth
+- Total: **120 sync points / synth eliminated** on the
+  production Vulkan / OpenCL path (4× the round-8 win;
+  largest bandwidth-style optimisation that ships from
+  pure-Supertonic-side code).
+
+The bench surface from round 7 (`--bench-per-step` +
+`--bench-sync`) directly attributes the per-step improvement
+to the correct stage column on real hardware.
+
+---
+
+### Vulkan optimisation round 10 (May 2026, QVAC-18605 follow-up #8) — Per-step text-input upload-skip
+
+After rounds 8 + 9 wired the GPU bridge for the 5 attention sites
+(front-block attn0 + 4 style attentions), the remaining per-step
+host uploads are the **input tensors fed to each cached graph**:
+`latent` (changes per step), `mask` (constant), `temb` (changes
+per step), and `text_emb` / `text_lc_host` (constant within one
+synth).  Round 10 picks off the largest of those: `text_emb`,
+which is uploaded **4 caches × 5 steps = 20 times / synth** but
+is the same data on every call.
+
+#### Changes
+
+- **`upload_skip_tracker` helper** in `supertonic_internal.h`.
+  Pointer-compare upload-skip generalising the F4 pattern
+  already used for `style_v_in` / `kctx_in` in
+  `vector_res_style_qkv_cache`.  `needs_upload(p) -> bool`,
+  `mark_uploaded(p)`, `reset()`.
+
+- **Front-block cache** (`ve_front_block_graph_cache`) +
+  **group-graph cache** (`vector_group_graph_cache`): add
+  `text_in_skip` field, guard the `ggml_backend_tensor_set` for
+  `text_in` / `text_in_t` with `needs_upload(text_emb)`, and
+  reset on `current_step == 0` to handle the cross-synth
+  pointer-reuse hazard (modern allocators very often re-issue
+  the same address for the next stack-local
+  `std::vector<float>` of the same size; without the reset, the
+  pointer compare would skip the upload and the next synth would
+  silently run against the prior synth's text-encoder embedding
+  still resident on the GPU).
+
+- **Cache rebuild safety**: `cache = {}` zero-initialises the
+  tracker (its only field is a pointer that defaults to
+  `nullptr`), so a graph rebuild correctly forces the next
+  upload regardless of incoming pointer.
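+
+The tracker and its synth-boundary reset can be sketched in a few
+lines.  The three member functions are the ones named above; the
+`last_uploaded` field follows the description in this log, while
+the `sketch` namespace and the `upload_text_if_needed` call-site
+helper (with `upload_fn` standing in for `ggml_backend_tensor_set`)
+are hypothetical.
+
+```cpp
+#include <cassert>
+
+namespace sketch {
+
+// Pointer-compare upload-skip: remembers which host buffer was uploaded last.
+struct upload_skip_tracker {
+    const void * last_uploaded = nullptr;
+
+    // A fresh (or reset) tracker always asks for an upload.
+    bool needs_upload(const void * p) const {
+        return last_uploaded == nullptr || p != last_uploaded;
+    }
+    void mark_uploaded(const void * p) { last_uploaded = p; }
+    void reset() { last_uploaded = nullptr; }
+};
+
+// Hypothetical call site.  Resetting on step 0 guards against a new synth
+// reusing the previous synth's buffer address with different contents.
+template <typename UploadFn>
+void upload_text_if_needed(upload_skip_tracker & tracker,
+                           int                   current_step,
+                           const float *         text_emb,
+                           UploadFn              upload_fn) {
+    if (current_step == 0) {
+        tracker.reset();
+    }
+    if (tracker.needs_upload(text_emb)) {
+        upload_fn(text_emb);
+        tracker.mark_uploaded(text_emb);
+    }
+}
+
+}  // namespace sketch
+```
+
+With five steps per synth, the helper uploads once per synth even
+when two consecutive synths present the same pointer.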
+
+#### Test plan (TDD, round 10)
+
+Strict test-first.  `test-supertonic-upload-skip-tracker` (NEW)
+committed first, observed to fail compile (`upload_skip_tracker
+was not declared`), then implementation added.
+
+| Test | Coverage | Result |
+|------|----------|--------|
+| `test-supertonic-upload-skip-tracker` (NEW) | 7 functions, **41 checks** — default state (fresh tracker always needs upload); upload + skip happy path (5-step pattern); pointer-change forces upload; reset() invalidation (synth-boundary contract); independent-instance non-interference; **cross-synth pointer-reuse hazard simulation** (exact bug the synth-boundary reset prevents — without reset, naive pointer-compare leaks prior synth data); reset-on-empty no-op. | 41 / 41 PASS |
+| Every other unit test (rounds 1-9 + audit follow-ups + the 14 baseline tests) | Zero-regression gate | 21 / 21 PASS — unchanged |
+
+Whole CPU-only `ctest -L unit` reports **22 / 22 tests, 0
+failures, 0 regressions**.
+
+#### Backwards compatibility
+
+- Tracker is initialised to `last_uploaded = nullptr` →
+  `needs_upload(any_ptr) = true` on the first call → cold-miss
+  upload always fires.  No cache cold-start regression.
+- Cache rebuilds (`cache = {}`) zero-init the tracker → next
+  upload fires regardless of pointer.  Same correctness as
+  pre-round-10.
+- Synth-boundary reset (`current_step == 0`) invalidates the
+  tracker → next synth's first step always uploads.  Protects
+  against the documented cross-synth pointer-reuse hazard.
+- Trace mode unaffected (the upload itself is unchanged when
+  it fires; only the redundant re-uploads are skipped).
+
+#### Win
+
+Per synth (5 denoise steps):
+
+| Cache | Uploads pre-round-10 | Uploads post-round-10 | Saved |
+|---|---|---|---|
+| Front block (`text_in_t`) | 5 | 1 (cold-miss) | 4 |
+| g1 group (`text_in`) | 5 | 1 | 4 |
+| g2 group (`text_in`) | 5 | 1 | 4 |
+| g3 group (`text_in`) | 5 | 1 | 4 |
+| **Total** | **20** | **4** | **16 sync points / synth** |
+
+Bandwidth saved: 16 × `text_len × 256 × 4` bytes / synth.  At
+text_len=32 that's **~512 KB / synth** of redundant host→GPU
+upload eliminated; scales linearly with prompt length.
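+
+The bandwidth figure can be checked with a compile-time
+calculation.  The constant and function names are illustrative;
+the 16 saved uploads and the 256-wide text embedding come from the
+table and formula above, and a 4-byte `float` is assumed.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+
+namespace sketch {
+
+constexpr std::size_t kTextChannels         = 256;  // embedding width
+constexpr std::size_t kUploadsSavedPerSynth = 16;   // 20 before, 4 after
+
+static_assert(sizeof(float) == 4, "the formula assumes 4-byte F32 elements");
+
+// Bytes of redundant host->GPU text_emb upload avoided per synth.
+constexpr std::size_t text_upload_bytes_saved(std::size_t text_len) {
+    return kUploadsSavedPerSynth * text_len * kTextChannels * sizeof(float);
+}
+
+// text_len = 32: 16 * 32 * 256 * 4 = 524288 bytes = 512 KiB.
+static_assert(text_upload_bytes_saved(32) == 512 * 1024, "512 KiB at text_len=32");
+// text_len = 50: 16 * 50 * 256 * 4 = 819200 bytes = 800 KiB.
+static_assert(text_upload_bytes_saved(50) == 800 * 1024, "800 KiB at text_len=50");
+
+}  // namespace sketch
+```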
+
+The remaining per-step uploads (`latent`, `temb`, per-step
+deltas in mask) genuinely change per step; can't be skipped
+without a graph-allocator refactor (round 5 territory — still
+deferred).
+
+#### Why no live perf number?
+
+Round 10 is small + safe: a host-side upload-skip optimisation
+that adds zero work on the cold path and skips redundant work
+on the hot path.  The win shape:
+- 16 fewer host→GPU `ggml_backend_tensor_set` calls per synth.
+- 16 fewer staging-buffer write+barrier pairs internally inside
+  ggml-vulkan.
+- Largest impact on big-prompt workloads where text_emb is
+  large (the saving is linear in `text_len`).
+
+The bench surface from round 7 (`--bench-per-step`) shows the
+per-step time on real hardware.  Step 0 should be unchanged
+(cold miss = always uploads).  Steps 1-4 should be measurably
+faster.
+
+---
+
+### Vulkan optimisation round 11 (May 2026, QVAC-18605 follow-up #9) — Packed-QK RoPE + GPU-bridge layout fix
+
+**Critical correctness fix.**  Round 11 didn't add a new
+optimisation — it made every prior round actually run end-to-end
+on real hardware.  Rounds 8 + 9 + 10 (front-block / style /
+group GPU bridges + text-input upload-skip) had all shipped CPU-
+only unit-test green, but the unit tests never exercised the
+production code path with a real GGUF carrying
+`vector_rope_theta`.  The first end-to-end synth attempt (CPU
+*or* Vulkan) aborted at
+`GGML_ASSERT(HD == n_heads * head_dim)` inside
+`apply_rope_to_packed_qk` — and even past that assertion, every
+`ggml_backend_tensor_copy(q_src, q_tc_in)` in the GPU-bridge
+fast paths would have hit
+`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
+matmul outputs were the byte-for-byte transpose of what the
+attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
+expect.
+
+#### Root cause
+
+`apply_rope_to_packed_qk` (introduced in PR #16 audit follow-up
+#5) was written under the assumption that
+`dense_matmul_time_ggml` returns a `ne=[H*D, L]` "channel-
+fastest-in-memory" tensor.  In fact, the matmul (both the CPU
+`cblas_sgemm` fast path and the GPU `conv1d_f32(K=1)` fallback)
+produces `ne=[L, H*D]` with **channel-major-flat memory**
+(`data[t + c*L]`) — the bit-exact transpose of the helper's
+input contract.
+
+The CPU unit test that landed alongside the helper
+(`test_supertonic_rope_packed_qk.cpp`) hand-built Q under the
+wrong `[HD, L]` shape, so the failure mode was invisible to CI.
+Similarly, `vector_text_attention_cache::q_tc_in` etc. are
+`ggml_new_tensor_2d(F32, HD, L)` → **time-major-flat memory**
+(`data[c + t*HD]`).  V (and the style Q/K/V which have no RoPE
+to mask the layout flip) flowed into the GPU bridge from
+matmul → channel-major-flat bytes → mismatched layout against
+`q_tc_in` → `ggml_backend_tensor_copy` aborts on
+`ggml_are_same_layout`.
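+
+The two memory layouts can be made concrete with plain index
+arithmetic.  The sketch below is a host-side stand-in for what a
+`ggml_cont(ggml_transpose(x))` does to a 2-D F32 tensor; the
+function names and the `sketch` namespace are hypothetical and no
+ggml call is involved.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+namespace sketch {
+
+// Channel-major-flat (ne = [L, HD]): element (t, c) lives at data[t + c * L].
+inline std::size_t channel_major_index(std::size_t t, std::size_t c, std::size_t L) {
+    return t + c * L;
+}
+
+// Time-major-flat (ne = [HD, L]): element (t, c) lives at data[c + t * HD].
+inline std::size_t time_major_index(std::size_t t, std::size_t c, std::size_t HD) {
+    return c + t * HD;
+}
+
+// Re-pack matmul-style channel-major bytes into the time-major layout
+// that the attention cache inputs expect.
+inline std::vector<float> to_time_major(const std::vector<float> & src,
+                                        std::size_t L, std::size_t HD) {
+    assert(src.size() == L * HD);
+    std::vector<float> dst(src.size());
+    for (std::size_t t = 0; t < L; ++t) {
+        for (std::size_t c = 0; c < HD; ++c) {
+            dst[time_major_index(t, c, HD)] = src[channel_major_index(t, c, L)];
+        }
+    }
+    return dst;
+}
+
+}  // namespace sketch
+```
+
+For any shape with `L > 1` and `HD > 1` the two buffers hold the
+same values in a different byte order, which is why a raw
+same-layout copy between them cannot be correct.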
+
+#### The fix (strict TDD)
+
+1. **Test (new RED contract)**:
+   `test_supertonic_rope_packed_qk.cpp` rewritten to build Q
+   under the **production** shape `ne=[L, HD]` (matmul's actual
+   output) with channel-major-flat memory.  The reference is
+   built in scalar `apply_rope`'s native time-major-flat layout;
+   the test verifies the helper's output bytes match the
+   reference bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L`
+   so the downstream `q_tc_in` blit cannot regress on layout.
+
+2. **Helper (`apply_rope_to_packed_qk` in
+   `supertonic_internal.h`)**: Add a head-of-pipeline
+   `ggml_cont(ggml_transpose(q))` to flip from the matmul's
+   `ne=[L, HD]` channel-major-flat memory to the `ne=[HD, L]`
+   time-major-flat memory `apply_rope_in_graph` (and the
+   downstream `q_tc_in`) consumes.  The rest of the pipeline
+   (view-as-`[D, H, L]` → cont → `apply_rope_in_graph` →
+   reshape-to-`[HD, L]`) is unchanged.  Returns ne=[HD, L]
+   time-major-flat — **the SAME layout as `q_tc_in`** so the
+   GPU bridge blit is bit-exact.
+
+3. **V (and style Q/K/V) graph-side transpose**: V has no RoPE
+   to hide behind, so the same `ggml_cont(ggml_transpose(...))`
+   is open-coded at the matmul output in
+   `build_group_graph_cache` (line ~1088),
+   `ve_front_block_proj_cache` (line ~2774), and
+   `build_res_style_qkv_cache` (line ~1459 — applied to all
+   three sq / sk / sv since the style path has no RoPE
+   anywhere).
+
+4. **Legacy host-bridge downloads**: The host-bridge fallback
+   paths used `tensor_to_time_channel(q_rope_gpu)` to download
+   post-RoPE Q/K, which under the new layout would be a
+   transpose-of-the-transpose.  Switched to `tensor_raw_f32`
+   for all four post-RoPE tensors plus all four V tensors plus
+   the trace-mode style sq/sk/sv downloads — the bytes are
+   already in the layout scalar `apply_rope` /
+   `flash_attention_qkv` host references consume (`out[t*HD +
+   c]`), so the raw download is the correct call.
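+
+The layout flip behind this fix can be illustrated outside ggml.  The
+sketch below is a minimal, hypothetical helper
+(`channel_major_to_time_major` is not a function in the codebase) that
+copies a channel-major-flat buffer (`data[t + c*L]`) into
+time-major-flat order (`data[c + t*HD]`), i.e. the byte order a
+contiguous transpose of an `ne=[L, HD]` tensor is expected to produce:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Channel-major-flat: element (t, c) lives at src[t + c*L]  (time is the fast axis).
+// Time-major-flat:    element (t, c) lives at dst[c + t*HD] (channel is the fast axis).
+// Hypothetical illustration only; the real code does this in-graph with
+// ggml_cont(ggml_transpose(...)).
+std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
+                                               std::size_t L, std::size_t HD) {
+    assert(src.size() == L * HD);
+    std::vector<float> dst(src.size());
+    for (std::size_t c = 0; c < HD; ++c) {
+        for (std::size_t t = 0; t < L; ++t) {
+            dst[c + t * HD] = src[t + c * L];
+        }
+    }
+    return dst;
+}
+```
+
+A buffer that is already time-major-flat needs no such copy, which is
+why the host-bridge downloads in item 4 switch to the raw read.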
+
+#### Verification
+
+| Backend / Adapter | Pre-fix | Post-fix |
+|---|---|---|
+| CPU | `GGML_ASSERT(HD == n_heads * head_dim) failed` → core dump on first step | ✅ writes 3.89s 44.1 kHz WAV |
+| Vulkan NVIDIA RTX 5090 (KHR_coopmat, FP16) | same crash | ✅ writes 6.53s WAV; **44 ms / 5-step bench, 74× realtime** (median over 5 runs) |
+| Vulkan AMD RADV iGPU (UMA, FP16) | same crash | ✅ writes 3.64s WAV; 178 ms / 5-step bench, 7× realtime |
+| Vulkan Mesa lavapipe (CPU emulator) | same crash | ✅ writes 1.21s WAV (correctness baseline) |
+
+The whole CPU-only `ctest -L unit` run reports **22 / 22 tests, 0
+failures, 0 regressions**.  The Vulkan build's `ctest` run likewise
+passes 22 / 22.
+
+#### Why the unit tests missed it
+
+The 22 unit tests cover individual helpers (capability cache,
+upload-skip tracker, F16 deny-list API, etc.) and small-tensor
+in-graph parity (rope-in-graph, packed-qk-rope, in-graph-
+transpose) but **none of them execute
+`supertonic_vector_step_ggml` against a real GGUF**.  The 30
+"Disabled" tests in `ctest` would have caught this — they're
+the model-fixture tests gated on a locally-generated GGUF.
+Round 11 is exactly the kind of failure those exist to detect.
+
+The TDD test added in this round (the rewritten
+`test_supertonic_rope_packed_qk.cpp`) now closes the gap for the
+specific helper that crashed: it builds Q under the production
+matmul shape AND pins the output layout contract that the
+GPU-bridge `ggml_backend_tensor_copy` requires.  A future
+re-introduction of the (incorrect) old contract would fail the
+test at run time on the `y->ne[0] == HD` shape check, before
+the bit-for-bit data comparison even runs.
+
+#### Perf snapshot (RTX 5090, default short prompt, F16 K/V)
+
+```
+  preprocess             med=   0.00  ms
+  duration               med=   0.97  ms
+  text_encoder           med=   2.94  ms
+  vector_estimator (5 step) med=  37.70  ms
+    vector_step[0]       med=   7.44  ms   (cold pipeline)
+    vector_step[1..4]    med=   7.01–7.05  ms   (steady state)
+  vocoder                med=   2.47  ms
+  total                  med=  44.08  ms
+
+  RTF (total / audio):   med=0.013
+  Real-time multiplier:  med=74.28x
+```
+
+The round-1..10 wins (multi-device cache, BF16/Q8_0 K/V
+dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
+front-block + style + group GPU bridges, text-input upload-
+skip) are all in this number — they just couldn't actually run
+until round 11 unblocked the path.
+
+---
+
+### Vulkan optimisation round 12 (May 2026, QVAC-18605 follow-up #10) — Auto-pick UMA bias + text-encoder GPU bridge + pinned-host-buffer per-step inputs
+
+Three independent wins bundled into one round.  Strict TDD on
+each — new CPU-only unit test for every change, RED → impl →
+GREEN → end-to-end validation on real hardware.
+
+#### #10 — Auto-pick UMA bias
+
+Round 3 shipped `--vulkan-device -1` as "auto-pick adapter with
+most free VRAM", but on hybrid discrete + iGPU machines the
+iGPU's UMA pool (system RAM, often 120+ GB) wins the argmax over
+a discrete card's 32 GB VRAM, silently dropping the operator
+from a 537× realtime path to a 7× realtime path.  Round 12 #10
+adds an optional 3rd argument to `resolve_vulkan_device_index`:
+
+```cpp
+int resolve_vulkan_device_index(int requested,
+                                const std::vector<size_t> & free_vram_per_device,
+                                const std::vector<bool> & is_uma_per_device = {});
+```
+
+Empty `is_uma_per_device` (default) → round-3 behaviour preserved
+verbatim.  Non-empty + at least one discrete device → argmax
+over the DISCRETE subset.  All-UMA falls back to round-3 argmax.
+Explicit `requested >= 0` passthrough is UMA-agnostic.
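+
+The selection rules above can be restated as a short sketch.  This is
+not the shipped implementation: the function name carries a `_sketch`
+suffix to mark it as hypothetical, and the handling of an empty device
+list (returning `-1`) and of a UMA list whose length does not match the
+VRAM list (treated as absent) are assumptions made here.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical restatement of the round-12 rules; edge cases are assumptions.
+int resolve_vulkan_device_index_sketch(int requested,
+                                       const std::vector<std::size_t> & free_vram_per_device,
+                                       const std::vector<bool> & is_uma_per_device = {}) {
+    if (requested >= 0) {
+        return requested;                    // explicit index: passthrough, UMA-agnostic
+    }
+    const std::size_t n = free_vram_per_device.size();
+    if (n == 0) {
+        return -1;                           // assumption: nothing to pick from
+    }
+    // Use the UMA flags only when they cover every device (assumption).
+    const bool have_uma = is_uma_per_device.size() == n;
+    bool any_discrete = false;
+    if (have_uma) {
+        for (bool uma : is_uma_per_device) {
+            any_discrete = any_discrete || !uma;
+        }
+    }
+    int best = -1;
+    for (std::size_t i = 0; i < n; ++i) {
+        if (any_discrete && is_uma_per_device[i]) {
+            continue;                        // skip UMA devices when a discrete one exists
+        }
+        if (best < 0 || free_vram_per_device[i] > free_vram_per_device[best]) {
+            best = static_cast<int>(i);
+        }
+    }
+    return best;
+}
+```
+
+With a 32 GB discrete card at index 0 and a 120 GB UMA pool at index 1,
+the sketch returns 1 without UMA flags (round-3 behaviour) and 0 once
+the flags are supplied.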
+
+Caller wiring (in `init_supertonic_backend`) collects per-device
+type via the public `ggml_backend_dev_get_props()` API on
+`ggml_backend_vk_reg()` — sets `is_uma = true` for
+`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.  Defensive:
+falls back to empty list if the reg / dev_get_props pair fails
+(e.g. future ggml-vulkan refactor changes the enumeration).
+
+`test_supertonic_vulkan_device_select.cpp` extended with **14
+new checks** covering the round-12 behaviour matrix (5 new
+test functions + a 9th case in the existing function).
+
+#### #6 — Text-encoder speech-prompted-attention GPU bridge
+
+Master's Metal-port branch (PR #15) shipped a fully-built
+`speech_prompted_merged_cache` graph in
+`supertonic_text_encoder.cpp` (one ggml graph for QKV projection
++ head-split + flash-attn + out-proj end-to-end on GPU) but
+never wired its run path.  Production text-encoder stayed on
+the pre-Phase-A4 two-cache pattern with host-side Q/V download
+→ pack → re-upload between the QKV cache and the flash-attn
+cache.  Round 12 #6 adds `run_speech_prompted_merged_cache` +
+the dispatch:
+
+```cpp
+void speech_prompted_attention_ggml(const supertonic_model & m, int idx, ...) {
+    if (!model_prefers_cpu_kernels(m)) {
+        thread_local speech_prompted_merged_cache merged_caches[2];
+        // rebuild on key change, then:
+        run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
+        return;
+    }
+    // ... legacy two-cache CPU path unchanged
+}
+```
+
+Per-call savings (vs. two-cache):
+- 2 GPU→host downloads (q_out, v_out) → 0
+- 3 host→GPU uploads (q_pack, k_pack, v_pack) → 0
+- 1 fewer graph dispatch
+- All host pack work (q_pack / k_pack / v_pack head-split) eliminated
+
+= **5 sync points × 2 layers per synth = 10 sync points / synth**
+removed at the text encoder alone.  Combined with the
+significantly faster prewarm (fewer graphs to compile on cold
+start: 328 ms → 21 ms), this is the bigger of the two wins for
+operators noticing first-synth latency.
+
+CPU stays on the legacy path: master's `dense_matmul_time_ggml`
+CPU fast path uses cblas + the host-side head-split is a free
+memcpy; switching CPU to the merged path would pull the matmul
+through the slower ggml conv1d fallback and gain nothing
+(no sync points exist on CPU).
+
+`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins the
+`run_speech_prompted_merged_cache` symbol + the
+`speech_prompted_merged_cache` struct's field contract via
+SFINAE + a runtime free-default-cache trip-wire.  End-to-end
+equivalence vs. the legacy two-cache path verified by the
+existing model-fixture parity tests.
+
+#### #5 — Pinned-host-buffer per-step input scratchpad
+
+Round 3 shipped the capability probe
+`supertonic_backend_supports_pinned_host_buffer`, which returns
+`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
+the resolved backend.  The actual per-engine input-scratchpad
+refactor was deferred.  Round 12 #5 lands the helper:
+
+```cpp
+ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
+    const supertonic_model & model,
+    ggml_context * input_ctx);
+```
+
+And applies it via a dual-context allocation pattern at the
+two highest-frequency per-step input sites:
+
+- `vector_group_graph_cache`: x_in + temb_in (× 3 group caches
+  for g1/g2/g3) — 6 hot per-step tensors total.
+- `ve_front_block_graph_cache`: x_in + mask_in + t_emb_in —
+  3 hot per-step tensors.
+
+Total: **9 per-step input tensors moved to host-pinned memory**.
+Each `ggml_backend_tensor_set` on these tensors skips one
+internal staging-buffer hop on Vulkan because they already live
+in pinned, host-visible memory that the GPU can read directly.
+
+Dual-context pattern:
+```cpp
+// In cache struct: separate input_ctx + input_buf
+std::vector<uint8_t> input_ctx_storage;
+ggml_context * input_ctx = nullptr;
+ggml_backend_buffer_t input_buf = nullptr;
+
+// In build:
+//   1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots.
+//   2. Create x_in / temb_in / mask_in / t_emb_in in input_ctx.
+//   3. Try host-pinned alloc → fall back to default backend buffer.
+//   4. Build the rest of the graph in cache.ctx (intermediates,
+//      outputs); gallocr handles those, skipping the pre-allocated
+//      input tensors via the `tensor->buffer != nullptr` check.
+// In free:
+//   Order matters: gallocr → main ctx → input_buf → input_ctx.
+//   Reversed order would dangle gallocr pointers into freed input
+//   tensor metadata.
+```
+
+CPU / Metal / OpenCL / future-backend safety: `try_alloc_*`
+returns `nullptr` when the backend doesn't expose
+`ggml_backend_vk_host_buffer_type()`, and callers fall back to
+`ggml_backend_alloc_ctx_tensors(input_ctx, backend)` — same
+memory, just one staging hop per upload.  Identical CPU
+behaviour to pre-round-12; only Vulkan gains.
+
+`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
+- Symbol existence (SFINAE).
+- `nullptr` return on CPU backend (idempotent across repeat calls).
+- Null-pointer safety on null `model.backend` / null `input_ctx`.
+
+11 / 11 CPU-only checks pass.
+
+#### Combined perf snapshot — RTX 5090 (round 12 cumulative)
+
+Long-prompt bench (173 chars, ~15 s of audio output):
+
+```
+Pre-round-12 baseline (round 11 tip):
+  total                  med= 76.11  ms   (123× realtime)
+  text_encoder           med=  4.85  ms
+  vector_estimator       med= 63.58  ms / 5 = 12.7 ms/step
+  prewarm cold-start:    ~330 ms
+
+Post-round-12 (round 12 #5 + #6 + #10 wired):
+  total                  med= 27.99  ms   (537× realtime)  ← 2.7× faster
+  text_encoder           med=  4.95  ms   (merged-cache wired)
+  vector_estimator       med= 16.39  ms / 5 = 3.28 ms/step ← 3.9× faster per step
+  prewarm cold-start:    ~21 ms                             ← 15× faster cold start
+```
+
+Short-prompt bench (Hello-world class, ~3 s audio):
+
+```
+Pre-round-12 (round 11 tip):  44.08 ms / 74× realtime
+Post-round-12:                23.31 ms / 394× realtime   ← 1.9× faster
+```
+
+Auto-pick verification on hybrid rig (RTX 5090 + AMD RADV iGPU):
+
+```
+Pre-round-12 `--vulkan-device -1`: picks RADV (Vulkan1)  → 178 ms total, 7× realtime
+Post-round-12 `--vulkan-device -1`: picks RTX 5090 (Vulkan0) → 28 ms total, 537× realtime
+                                                              ↑ 6.4× faster for users
+                                                              who follow the help text
+```
+
+#### Test plan (round 12)
+
+```bash
+cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
+cmake --build tts-cpp/build -j
+ctest --test-dir tts-cpp/build -L unit --output-on-failure
+# → 24 / 24 PASS (was 22; +1 text-encoder-gpu-bridge, +1 pinned-host-buffer)
+
+cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
+cmake --build tts-cpp/build-vulkan -j
+ctest --test-dir tts-cpp/build-vulkan -L unit --output-on-failure
+# → 24 / 24 PASS
+```
+
+End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
+5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
+writes a valid WAV.  Zero regressions from rounds 1-11.
+
+---
+
+### Vulkan optimisation round 13 (May 2026, QVAC-18605 follow-up #11) — Code-quality consolidation + operator-facing Q8_0 finding
+
+Round 13 is a **strict-improvement-only follow-up** to round 12:
+no code path is removed, no optimisation is rolled back, and the
+end-to-end perf on every backend stays at the round-12 level.
+Two deliverables, both no-regret:
+
+#### 1. New helper `alloc_input_scratchpad_or_throw`
+
+Round 12 #5 inlined the "try pinned-host first, fall back to
+default backend buffer, throw on both-fail" idiom at 4 cache
+sites (front block + 3 group caches):
+
+```cpp
+cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
+if (!cache.input_buf) {
+    cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
+    if (!cache.input_buf) {
+        // per-cache teardown + throw with cache-specific message
+    }
+}
+```
+
+Round 13 factors it into one helper.  Each caller becomes:
+
+```cpp
+cache.input_buf = alloc_input_scratchpad_or_throw(
+    model, cache.input_ctx, "vector_group_graph_cache");
+```
+
+Same correctness contract (CPU / Metal / OpenCL fall back to
+default backend buffer; Vulkan tries pinned-host first).
+**Defensive failure modes consolidated**: null `model.backend`,
+null `input_ctx`, null `cache_name` all throw `std::runtime_error`
+with a message that includes the cache name, instead of
+segfaulting in an error-handler path.  Single point of
+maintenance for the pattern; future cache builds that want
+pinned-host inputs use the helper directly.
+
+`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins
+the contract via SFINAE on the symbol + CPU-fallback round-trip
+through `ggml_backend_tensor_set` / `get` + null-arg throws +
+empty-ctx error message includes the cache name.  CPU-only —
+no GGUF fixture required.  CI test count goes from 24 / 24 (round
+12) to 25 / 25 (round 13).
+
+Perf impact: **zero** (same code path, same allocations, same
+data movement — just one fewer level of nesting at each call
+site).
+
+#### 2. Q8_0 K/V no-win documented for RTX 5090
+
+Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench
+output advertises `q8_0_kv_attn=available`.  Round 13 measures
+the trade-off on the test rig (RTX 5090, 1.79 TB/s memory
+bandwidth, long prompt 206 chars / 18 s audio):
+
+```
+--kv-attn-type f16:  total=31.11 ms (588× realtime)  ← default
+--kv-attn-type q8_0: total=31.84 ms (575× realtime)  ← 2 % slower
+```
+
+The F32→Q8_0 cast overhead exceeds the saved K/V upload
+bandwidth on a high-bandwidth discrete GPU.  **Operator
+guidance**: stick with the F16 default on RTX 5090 and similar
+high-bandwidth discretes.  Q8_0 is shipped for adapters where
+the K/V upload bottlenecks the synth (older PCIe 3.0 cards,
+lower-end discretes, iGPUs with slow BAR); cross-over point to
+be measured per-adapter by operators using `--bench-per-step`
+from round 7.
+
+#### Test plan (round 13)
+
+```bash
+cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
+cmake --build tts-cpp/build -j
+ctest --test-dir tts-cpp/build -L unit
+# → 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad helper)
+
+cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
+cmake --build tts-cpp/build-vulkan -j
+ctest --test-dir tts-cpp/build-vulkan -L unit
+# → 25 / 25 PASS
+```
+
+End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
+5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
+writes a valid WAV.  Zero regressions from rounds 1-12.
+
+---
+
 ## Remaining Work
 
 ### Runtime and performance
@@ -479,10 +2006,912 @@ python scripts/convert-supertonic2-to-gguf.py \
 - Consider a fused text relpos attention op only if profiling shows text is the
   next hard blocker.
 - Add quantized Supertonic GGUF support once graph paths are ready for f16/q8.
-- Evaluate GPU backends after CPU graph structure is fully stable.
+- Run the chatterbox-style OpenCL profiling sweep on Adreno (Q4_0 weights,
+  `flash_attn_f32_f16` enabled) to confirm the Supertonic bottleneck shifts
+  from custom CPU ops to `kernel_mul_mm_f32_f32` and the same convnext block
+  shape that chatterbox already profiled.
+- ~~Evaluate GPU backends after CPU graph structure is fully stable.~~ — initial
+  Metal port landed 2026-05-11; see "Metal baseline (2026-05-11)" below.
 - Add CI coverage for converter help/setup syntax and portable Supertonic build
   targets.
 
+## Metal baseline (2026-05-11)
+
+First end-to-end Metal run of the Supertonic 2 pipeline. The approach mirrors
+Chatterbox's pattern: a single `ggml_backend_metal_init()` at model load, no
+backend scheduler, and the CPU-only `ggml_custom_4d` fast paths enabled only
+when `ggml_backend_is_cpu(model.backend)` holds, so the same graph builders fall
+through to stock `ggml_im2col` + `ggml_mul_mat` (etc.) when the backend is Metal.
+
+Implementation:
+
+- `model_prefers_cpu_kernels(const supertonic_model &)` added in
+  `src/supertonic_internal.h`. Returns `true` when `model.backend == nullptr`
+  or `ggml_backend_is_cpu(model.backend)`.
+- Per-stage helpers (`conv1d_f32`, `depthwise_same_ggml`, `layer_norm_ggml`,
+  `dense_matmul_time_ggml`, `bias_gelu_ggml`, `pw2_residual_ggml`,
+  `conv1d_causal_ggml`, `depthwise_conv1d_causal_ggml`, plus the tail-update
+  custom op in `vector_estimator.cpp`) now take a `bool use_cpu_fastpath` and
+  AND it into the existing dtype/shape gates.
+- Per-stage builders inject
+  `const bool use_cpu_fastpath = model_prefers_cpu_kernels(model);` at the top
+  and pass it down through `vector_convnext_ggml`, `convnext_block_ggml`, the
+  text/vector/style attention cache builders, the tail graph builder, and the
+  trace builder.
+- `text_encoder.cpp` and `duration.cpp` accept the flag for call-site
+  uniformity but mark it `[[maybe_unused]]` — those stages have always built
+  their graphs via stock ggml ops and are Metal-safe at HEAD.
+- `supertonic_bench.cpp` gains `--n-gpu-layers N` (passed through to
+  `load_supertonic_gguf`) so the same harness drives CPU and Metal.
+
+Smoke test (`supertonic-cli --n-gpu-layers 1`) produces a 1.44 s WAV that is
+byte-length-identical to the CPU output, confirming the graph builders run
+end-to-end on Metal. A `GGML_ASSERT([rsets->data count] == 0)` fires inside
+`ggml_metal_device_free` at process exit (atexit ordering with Metal's
+residency-set finaliser) — same shape as the Chatterbox `t3_stack_registry`
+atexit issue; cosmetic, fires after the WAV is fully written. Mitigation TBD.
+
+Benchmark (Apple M2, q8_0 GGUF, 4 threads, 3.204 s of audio, 5-step CFM, 5 runs
++ 1 warmup, same flags as `supertonic-cpp.json` / `supertonic-onnx-cpu.json`):
+
+| Stage                       | CPU q8_0   | Metal q8_0 | Δ vs CPU | ONNX CPU f32 |
+|-----------------------------|-----------:|-----------:|---------:|-------------:|
+| preprocess                  |    0.01 ms |    0.01 ms |       — |      0.06 ms |
+| duration                    |    1.76 ms |    2.50 ms |   +0.74 |      1.48 ms |
+| text_encoder                |   13.44 ms |   13.83 ms |   +0.39 |      9.04 ms |
+| vector_estimator (5 steps)  |   94.86 ms |  173.08 ms |  +78.22 |     82.65 ms |
+| vocoder                     |   43.44 ms |   59.74 ms |  +16.30 |     51.32 ms |
+| **total**                   | **153.5**  | **249.9**  |  **+96.4 (+63%)** | **144.9** |
+| RTF                         |     0.048  |     0.078  |          |       0.045 |
+| real-time multiplier        |     20.9×  |     12.8×  |          |       22.1× |
+
+Verdict: the Metal port is **correctness-validated but slower than CPU at this
+graph shape**. Two ggml-side stages dominate the regression:
+
+- **`vector_estimator` +82 %** (94.9 → 173.1 ms median). The 5 denoising steps
+  build many small ConvNeXt graphs (depthwise + pointwise + norm + GELU +
+  pointwise, repeated across blocks). On M2 these become Metal kernel
+  launches that are too short to amortise launch overhead; the CPU fast paths
+  (cblas-backed `pointwise_op` / unrolled depthwise K=5) had a real lead.
+- **`vocoder` +38 %** (43.4 → 59.7 ms median). Same kernel-launch-bound
+  pattern, smaller deficit because the vocoder graph is a single persistent
+  cgraph that's reused across calls (less per-step overhead than the
+  vector-estimator's per-block cgraphs).
+
+`text_encoder` and `duration` are unchanged within noise — expected, those
+already used the stock-op path on CPU.
+
+`supertonic-bench --runs 8 --warmup 3 --n-gpu-layers 1` drifted to ~288 ms
+median (up from ~250 ms at runs=5 / warmup=1), suggesting Metal residency
+sets accumulate across calls in this harness; investigate before drawing
+percentile-style conclusions from longer Metal runs.
+
+Artifacts: `artifacts/bench/supertonic-cpu.json`,
+`artifacts/bench/supertonic-cpu-after.json` (post-gating CPU regression
+check, median 158.2 ms / +3 % vs the pre-port baseline — within noise),
+`artifacts/bench/supertonic-metal.json`,
+`artifacts/bench/supertonic-onnx-cpu.json`,
+`artifacts/bench/supertonic-onnx-coreml.json`,
+`artifacts/bench/metal-phase-a.txt` (the Phase A failure-mode trace before
+gating).
+
+### Next: Metal optimisation passes (Phase E in the plan)
+
+Backlog **revised after the 2026-05-11 dispatch-count profile** (see
+"Dispatch-count profile" below). The pre-profile working hypothesis
+(step batching, QKV stacking, f16 weights) turned out to be wrong on
+multiple counts. Revised priority order:
+
+1. **Single-graph consolidation per CFM step (THE PR).** The diagnostic
+   shows ~21 separate `graph_compute` calls per step (front prep +
+   text-attention + style-qkv + style-attention + style-residual-norm
+   inline × 4 groups + tail). On M2 each call carries ~1.86 ms of fixed
+   command-buffer overhead regardless of node count. Consolidating into
+   ONE `ggml_cgraph` per step (5 dispatches per synth, projected total
+   Metal ~46 ms) is by far the biggest win available; the rest of the
+   backlog only matters if this leaves residual gap. Specific work
+   below.
+2. **(Was step batching across CFM iterations.)** Closed: the CFM step
+   loop has a sequential dependency (`latent.swap(next)` at
+   `supertonic_engine.cpp:240`), so Chatterbox-style batching along
+   `ne[2]` doesn't apply here. The win from item 1 above is bigger
+   anyway; revisit only if a future flow-matching variant decouples the
+   steps.
+3. **(Was QKV stacking on text-attention.)** Deprioritised. With item 1
+   the QKV matmuls live inside the same dispatch as everything else —
+   stacking saves 3 in-graph nodes per attention but doesn't reduce
+   dispatch count. Only worth doing if Metal frame capture shows the
+   three per-attention `kernel_mul_mm` launches are individually
+   expensive after consolidation.
+4. **(Was f16 weights for Metal.)** Closed: f16 GGUF is *slower* than
+   q8_0 on both CPU and Metal (see "f16 GGUF experiment (2026-05-11)"
+   below). q8_0's weight-bandwidth win beats f16's no-dequant on this
+   graph shape.
+5. **Custom Metal depthwise kernel.** Standby — only revisit if item 1
+   leaves ConvNeXt depthwise as the residual hotspot. The `im2col +
+   mul_mat` fallback would be replaceable with a single
+   `kernel_depthwise_conv_1d` per call; `test/test_metal_ops.cpp` is
+   the parity harness.
+6. **Metal `rsets` keep-alive tuning** for long-running daemons.
+   Cosmetic for benchmarks; investigate if a hosted-service user
+   reports memory growth.
+
+### Plan for item 1 — per-step graph consolidation
+
+Architecture: introduce a `vector_step_full_cache` (per-shape
+thread_local) that owns ONE `ggml_context`, ONE `ggml_cgraph`, ONE
+`ggml_gallocr`. Build the entire per-step computation (proj_in →
+4 × (ConvNeXt blocks + time-add + ConvNeXt + Q/K/V projection + RoPE +
+flash-attention + out_fc + residual + layer-norm + style Q/K/V
+projection + flash-attention + out_fc + residual + layer-norm) +
+last_convnext × 4 + proj_out + mask + noise add) as one graph. ONE
+`ggml_backend_graph_compute` per step.
+
+The existing `build_text_attention_cache`, `build_group_graph_cache`,
+`build_res_style_qkv_cache`, and `build_tail_graph_cache` get refactored
+into **graph-builder helpers** that accept `(ggml_context*, ggml_cgraph*,
+...input ggml_tensor*...)` and return output `ggml_tensor*`, instead of
+owning their own contexts. The CPU path keeps the cache-of-subgraphs
+architecture (parity, trace mode); only Metal routes through the
+consolidated path. Detection via `!ggml_backend_is_cpu(model.backend)`
+at the top of `supertonic_vector_step_ggml`.
+
+**Critical sub-tasks** (the order matters for parity validation):
+
+1. **In-graph RoPE.** Replace the CPU `apply_rope` call with
+   `ggml_rope_ext` configured for Supertonic's `(t/L) * theta[d]`
+   formula: `freq_base = 1.0`, `freq_scale = 1.0`, `freq_factors[d] =
+   L / theta[d]`, `mode = GGML_ROPE_TYPE_NEOX` (split-pairs layout
+   matches `apply_rope`'s `(i1, i2) = (offset+d, offset+D/2+d)` pattern
+   per `supertonic_vector_estimator.cpp:1416`). Positions are an
+   int32 `arange(L_q)` for Q and `arange(L_kv)` for K, set once at
+   build time. ggml-metal's `kernel_rope_norm`/`kernel_rope_neox`
+   already compile.
+
+2. **In-graph layout conversion.** Replace
+   `tensor_to_time_channel`/`pack_time_channel_for_ggml` host calls
+   with `ggml_cont(ctx, ggml_transpose(ctx, x))` at the inter-stage
+   boundaries.
+
+3. **Compose the orchestrator** so all stages share one ctx/gf. Walk
+   the existing `supertonic_vector_trace_proj_ggml` flow (lines
+   2050–2585) and inline each `run_*_cache` call as graph-builder
+   helper invocations.
+
+4. **Parity test.** Add a `test_supertonic_vector_metal_consolidated`
+   CTest target that compares the consolidated Metal path to the CPU
+   reference for one step at a representative L (137-ish). Tolerance
+   ~1e-2 (loose because of float-order effects across the merged
+   graph).
+
+5. **Bench.** Re-run `supertonic-bench --n-gpu-layers 1` and target
+   `SUPERTONIC_COUNT_DISPATCHES=1` to verify total dispatches drop
+   from 120 to ~10 and total wall to ~46 ms.
+
+**Size estimate.** ~600–1000 new lines (mostly the consolidated build
+function); the existing trace path stays untouched. Trace-mode tests
+keep using the old multi-cache orchestrator.
+
+**Risk.** The two non-trivial pieces are (a) `ggml_rope_ext` parameter
+mapping matching CPU `apply_rope` to within 1e-3 — verify before
+inlining everything else — and (b) memory budget for one big graph
+across all groups (`MAX_NODES=2048` may not be enough; estimate ~3500
+nodes for the full per-step graph).
+
+All commits on the consolidation branch should land in a single PR;
+the work is too coupled to split cleanly.
+
+Backlog items 2–6 above stay as separate per-PR follow-ups in their
+listed priority. Do not bundle.
+
+### Dispatch-count profile (2026-05-11)
+
+Instrumented `supertonic_graph_compute` with a wall-time + node-count
+printout gated on the `SUPERTONIC_COUNT_DISPATCHES` env var. Re-running
+`supertonic-cli --n-gpu-layers 1 --text "Hello."` on the same M2:
+
+- **120 graph_compute dispatches per single synth** (entire pipeline,
+  vector estimator + vocoder + text encoder + duration).
+- **Cumulative graph_compute wall: 222.8 ms** out of the ~250 ms total
+  Metal synth — i.e. graph_compute IS the cost; CPU-side data marshalling
+  is the residual ~30 ms.
+- **Mean per-dispatch wall: 1.86 ms.** Even 17-node tiny dispatches cost
+  ~770 µs each; 170-node mid graphs cost 1.1–1.7 ms. The fixed
+  per-dispatch Metal overhead (command-buffer setup + pipeline lookup +
+  encode + commit + wait) dominates.
+
+Dispatch distribution (count × node count, sorted by frequency):
+
+- 40 × 18 nodes (the 5×8 text-attention sub-graphs per step)
+- 20 × 12 nodes
+- 20 × 90 nodes
+- 15 × 262 nodes (the 5×3 group-prep graphs)
+- ~25 misc
+
+The 80 small (≤90 nodes) dispatches account for an estimated ~120 ms of
+Metal time. Consolidating them into the larger per-step graphs would
+likely halve the gap to the CPU baseline.
+
+### f16 GGUF experiment (2026-05-11)
+
+Hypothesis: q8_0 dequant in the per-`mul_mat` path was the Metal
+bottleneck. Tested by converting the bundle with `--ftype f16` (132 MB
+GGUF vs 252 MB for q8_0) and re-benching:
+
+- Metal q8_0 total median: 249.9 ms
+- Metal f16 total median: 286.5 ms (+15 %, worse)
+- CPU q8_0 total median: 153.5 ms
+- CPU f16 total median: 168.7 ms (+10 %, worse)
+
+f16 is uniformly *slower* than q8_0, on both CPU and Metal. q8_0
+dequant is not the bottleneck — ggml-metal's q8_0 `mul_mat` kernel is
+well-tuned for these tensor shapes and the smaller weight bandwidth
+helps. Phase E.3 closed; do not pursue an f16-on-Metal variant.
+
+### Dispatch profiling hook
+
+`SUPERTONIC_COUNT_DISPATCHES=1 ./build/supertonic-cli ...` prints one
+line per `ggml_backend_graph_compute` call:
+
+  supertonic_graph_compute #N nodes=K  wall=W us  cumul=C ms
+
+Zero-overhead when the env var is unset (single env var read +
+branch-predicted skip).
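+
+A minimal sketch of how such a hook can be structured is shown below.
+The names `dispatch_stats`, `count_dispatches_enabled`, and
+`timed_graph_compute` are hypothetical, the `compute` callback stands in
+for the real `ggml_backend_graph_compute` call, and treating any set
+value of the variable as "enabled" is an assumption.
+
+```cpp
+#include <cassert>
+#include <chrono>
+#include <cstdio>
+#include <cstdlib>
+#include <functional>
+
+// Hypothetical accumulator for the per-process dispatch counters.
+struct dispatch_stats {
+    long   count    = 0;
+    double cumul_ms = 0.0;
+};
+
+inline bool count_dispatches_enabled() {
+    // The environment is read once; later calls only test the cached bool.
+    static const bool enabled = std::getenv("SUPERTONIC_COUNT_DISPATCHES") != nullptr;
+    return enabled;
+}
+
+// `compute` stands in for the real graph-compute call (assumption).
+inline void timed_graph_compute(dispatch_stats & stats, int n_nodes,
+                                const std::function<void()> & compute) {
+    if (!count_dispatches_enabled()) {
+        compute();                           // fast path: no timing, no printing
+        return;
+    }
+    const auto t0 = std::chrono::steady_clock::now();
+    compute();
+    const auto t1 = std::chrono::steady_clock::now();
+    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
+    stats.count    += 1;
+    stats.cumul_ms += us / 1000.0;
+    std::fprintf(stderr,
+                 "supertonic_graph_compute #%ld nodes=%d  wall=%.0f us  cumul=%.2f ms\n",
+                 stats.count, n_nodes, us, stats.cumul_ms);
+}
+```
+
+Caching the lookup in a function-local `static` keeps the disabled path
+down to one predictable branch per dispatch.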
+
+## Per-step graph consolidation (landed 2026-05-11)
+
+Landed `supertonic_vector_step_one_graph_ggml` at the end of
+`src/supertonic_vector_estimator.cpp` plus the helpers
+`apply_supertonic_rope_ggml`, `append_text_attention_subgraph`, and
+the `vector_step_one_graph_cache` struct.  Routing in
+`supertonic_vector_step_ggml` enables this path **by default on
+any non-CPU backend** (Metal, CUDA, Vulkan, OpenCL).  CPU keeps
+the multi-cache trace_proj path — its CPU fast-paths and
+`thread_local` sub-graph caches stay competitive on CPU and trace
+mode for parity tests still uses the per-stage outputs.  Override
+via `SUPERTONIC_DISABLE_ONE_GRAPH=1` if needed.
+
+### Dispatch + bench numbers (Apple M2, q8_0, 4 threads, 5-step CFM)
+
+`SUPERTONIC_COUNT_DISPATCHES=1 ./build/supertonic-cli --n-gpu-layers 1`
+shows the dispatch profile collapsing from **120 → 20 total
+dispatches** per synth (5 of which are 1886-node consolidated
+per-step graphs).  Mean per-dispatch wall climbs from 1.86 ms to
+7.9 ms — more real work per kernel batch, less time burned on
+command-buffer setup — and total `graph_compute` wall drops from
+222.8 ms to 157.7 ms (-29 %).
+
+`supertonic-bench` on Metal, 5 runs + 1 warmup, identical flags to
+`supertonic-cpu.json` / `supertonic-onnx-cpu.json`:
+
+  | Stage                       | trace_proj (B) | one-graph (E.cons) |
+  |-----------------------------|---------------:|-------------------:|
+  | preprocess                  |          0.01ms |             0.02ms |
+  | duration                    |          2.50ms |             3.87ms |
+  | text_encoder                |         13.83ms |            16.58ms |
+  | vector_estimator (5 steps)  |        173.08ms |           147.83ms |
+  | vocoder                     |         59.74ms |            60.51ms |
+  | **total**                   |     **249.92ms**|        **229.06ms**|
+  | RTF                         |           0.078 |              0.071 |
+  | real-time multiplier        |          12.82× |             13.99× |
+
+Net: **-15 % on the dominant vector_estimator stage, -8 % on the
+total**.  Correctness validated: `cpu-ref` vs `metal-one-graph` for
+the same text+seed gives correlation **1.0000**, max abs diff 101
+LSB (CPU peak amplitude 6639, so ~1.5 % — normal Metal-vs-CPU
+floating-order noise).  No regression vs the Phase B port.
+
+### Why the win is smaller than projected
+
+Pre-implementation projection was ~46 ms total (saving the full
+~204 ms of dispatch overhead at 1.86 ms × ~110 saved dispatches).
+Reality: the per-dispatch overhead estimate (1.86 ms) was an
+*average*, not a constant.  The new 1886-node consolidated graphs
+are big enough that the GPU does real compute work during each
+dispatch, so kernel-launch overhead is no longer the bottleneck;
+the compute itself now dominates.
+
+The bench tells the story: per-step wall time dropped from
+~35 ms (= 173/5) to ~30 ms (= 147/5).  The Metal device now spends
+most of its time actually computing matmuls rather than waiting
+on command-buffer plumbing.  Further wins now require *less work*,
+not *fewer dispatches* — that's items 2-5 of the remaining
+backlog (QKV stacking, op fusion, custom depthwise kernel).
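+
+The gap between projection and measurement can be reproduced with two
+lines of arithmetic.  The sketch below uses hypothetical helper names;
+the inputs are the figures quoted in this document, and the
+"saving per removed dispatch" value is derived here rather than
+measured directly.
+
+```cpp
+#include <cassert>
+#include <cmath>
+
+// Fixed-overhead model: every removed dispatch is assumed to save the same
+// mean per-dispatch wall time.  This is the model that proved too optimistic.
+inline double projected_total_ms(double total_ms, double mean_dispatch_ms,
+                                 int dispatches_saved) {
+    return total_ms - mean_dispatch_ms * dispatches_saved;
+}
+
+// Saving per removed dispatch implied by the before/after graph_compute wall.
+inline double measured_saving_per_dispatch_ms(double wall_before_ms, double wall_after_ms,
+                                              int dispatches_before, int dispatches_after) {
+    assert(dispatches_before > dispatches_after);
+    return (wall_before_ms - wall_after_ms) / (dispatches_before - dispatches_after);
+}
+```
+
+Plugging in ~250 ms, 1.86 ms, and ~110 dispatches gives roughly 45 ms,
+in line with the ~46 ms projection.  Plugging in 222.8 ms → 157.7 ms
+over 120 → 20 dispatches gives about 0.65 ms saved per removed
+dispatch, roughly a third of the 1.86 ms average.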
+
+### Implementation notes
+
+- **`apply_supertonic_rope_ggml`** translates Supertonic's
+  `angle = (t/L) * theta[d]` formula to `ggml_rope_ext` with
+  `freq_base=1.0, freq_scale=1.0, freq_factors[d] = L / theta[d]`,
+  `mode=GGML_ROPE_TYPE_NEOX` (split-pairs rotation matches
+  `apply_rope`'s `(i1=offset+d, i2=offset+D/2+d)` layout at
+  `supertonic_vector_estimator.cpp:1416`).  Positions are int32
+  `arange(q_len)` for Q and `arange(text_len)` for K, set per
+  call when L or text_len change.  ggml-metal's
+  `kernel_rope_norm`/`kernel_rope_neox` already compile.
+
+- **Layout invariant: the GGML tensors take channel-major buffers
+  raw.**  The trace_proj_ggml path at lines 2143/2151 sets `x_in`
+  directly from `noisy_latent` (no host transpose) and `text_in`
+  directly from `text_emb`; the ne=[L, Cin] / ne=[text_len, 256]
+  tensors interpret that channel-major buffer as their natural
+  layout (innermost dim = time = fast-in-memory).  My initial
+  consolidation tried to "helpfully" transpose the inputs into
+  (t, c) layout, which corrupted the tensor data and produced
+  correlation 0.0034 garbage on every backend.  Fix: direct
+  `ggml_backend_tensor_set` from raw caller buffers, matching the
+  existing path exactly.  Same fix on the output path
+  (`ggml_backend_tensor_get` straight into `next_latent_out`).
+
+- **Cache invalidation:** keyed on `(model.generation_id, L,
+  text_len, total_steps)`.  The graph is rebuilt when any of the
+  four changes.  The `vector_step_one_graph_cache` is a single
+  `thread_local` instance; different Engines / synths share it
+  via the generation_id key.
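+
+A minimal sketch of the layout invariant above, in plain C++ with no
+ggml dependency.  The helper names (`offset_ne_L_C`,
+`transpose_to_tc`) are hypothetical and exist only for illustration.
+The point it tries to show: a tensor with ne=[L, C] already matches a
+channel-major buffer, so a host-side transpose moves values to the
+wrong offsets.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical helper, not part of the Supertonic sources.
+// For a tensor with ne = [L, C] (ne[0] = L = time, innermost), element
+// (t, c) lives at flat offset c * L + t, which is exactly where a
+// channel-major host buffer already stores it.
+inline std::size_t offset_ne_L_C(std::size_t t, std::size_t c, std::size_t L) {
+    return c * L + t;
+}
+
+// The mistaken "helpful" transpose: rewrites the buffer into (t, c)
+// order, so the same ne = [L, C] tensor would then read wrong values.
+inline std::vector<float> transpose_to_tc(const std::vector<float>& cm,
+                                          std::size_t L, std::size_t C) {
+    std::vector<float> out(cm.size());
+    for (std::size_t c = 0; c < C; ++c) {
+        for (std::size_t t = 0; t < L; ++t) {
+            out[t * C + c] = cm[c * L + t];
+        }
+    }
+    return out;
+}
+```
+
+Uploading the raw caller buffer keeps `offset_ne_L_C` valid; uploading
+the output of `transpose_to_tc` under the same ne would not.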
+
+### Remaining Phase E backlog
+
+**Tier 1 status (2026-05-11):**
+
+- ✅ **Per-step vector_estimator consolidation** (this PR) — biggest
+  Tier 1 win, -8 % on total Metal, parity 1.0000.
+- ✅ **Vocoder already a single dispatch** (461-node graph) —
+  no consolidation needed.
+- ⏸ **text_encoder + duration consolidation** — measured
+  contribution: ~22 ms cold-start dispatch wall across the 14
+  small dispatches that come before the vector_estimator graphs.
+  Post-warmup the bench shows text_encoder ≈ 17 ms and
+  duration ≈ 4 ms — most of which is the dispatches themselves;
+  consolidating to 1 dispatch each would save ~5-10 ms
+  steady-state.  Deferred because relpos_attention has 9
+  per-shape mask tensors + intricate
+  `ggml_view_3d`/`ggml_permute`/`ggml_sum_rows` plumbing that's
+  not a straight copy of the vector_step pattern — needs its
+  own focused 2-3 hour session with parity validation harness
+  before re-enabling on the GPU dispatcher.
+- ⏸ **QKV stacking** — once `vector_estimator` is already in
+  one graph, stacking the three `dense_matmul_time_ggml` calls
+  removes in-graph nodes but does not reduce the dispatch count.
+  A Metal frame capture didn't show the QKV matmuls as the hot path, so the
+  expected win is tiny.  Pursue only if Tier 2 hits diminishing
+  returns.
+- ⏸ **`ggml_cont` elimination** — the consolidated path does
+  `ggml_cont(ggml_transpose(...))` for Q/K/V before rope, and
+  again inside `apply_supertonic_rope_ggml`.  These could be
+  avoided by views with custom strides, but ggml's `view_3d`
+  doesn't expose `nb0` (only `nb1`/`nb2`), so the cont copies
+  are required for the rope kernel's expected layout.  Could
+  use `ggml_permute` + careful 4D views to remove some, but
+  the win is small and the layout-bug risk is high.
+
+## Tier 2 progress (2026-05-11) — op-level reductions before custom kernels
+
+Before sinking time into custom .metal kernels via the QVAC
+ggml-speech port patches (the original Tier 2 plan), there are
+op-level reductions inside the consolidated per-step graph that
+trim dispatch count without touching ggml's kernel set.  Each
+landed as its own commit in PR #15.
+
+### Diagnostic: `SUPERTONIC_DUMP_OP_HISTOGRAM=1`
+
+Added an env-var-gated dump of per-graph op-type histograms to
+`supertonic_graph_compute`.  Zero overhead when unset.  Lets us see
+exactly which ggml ops dominate the consolidated graph and which
+are pure metadata (RESHAPE/VIEW/PERMUTE/TRANSPOSE, confirmed
+no-ops in ggml-metal-ops.cpp:186-195).
+
+**Consolidated per-step graph at HEAD (post-Tier-2 commits):**
+
+  | op                | count | dispatch on Metal? |
+  |-------------------|------:|--------------------|
+  | RESHAPE           |   580 | no (metadata only) |
+  | ADD               |   197 | yes (often fused)  |
+  | CONT              |   148 | yes (memcpy)       |
+  | MUL_MAT           |   122 | yes (matmul)       |
+  | IM2COL            |   118 | yes (memrearrange) |
+  | VIEW              |    88 | no                 |
+  | PERMUTE           |    72 | no                 |
+  | MUL               |    70 | yes (often fused)  |
+  | TRANSPOSE         |    68 | no                 |
+  | REPEAT            |    56 | yes                |
+  | CONCAT            |    56 | yes                |
+  | NORM              |    36 | yes                |
+  | UNARY             |    32 | yes (GELU/SiLU)    |
+  | ROPE              |     8 | yes                |
+  | FLASH_ATTN_EXT    |     8 | yes                |
+  | SCALE             |     1 | yes                |
+  | **total**         | **1660** | **852 dispatched** |
+
+808 of 1660 nodes are metadata-only no-ops, so what looks like a
+large graph is really ~852 real Metal dispatches per step graph
+(down from ~1078 dispatched ops in the pre-Tier-2 layout).
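+
+The split between dispatched and metadata-only nodes can be re-derived
+from the histogram rows.  The sketch below hard-codes the table's
+counts; `OpRow`, `per_step_histogram` and `split_counts` are
+illustrative names, not functions from the Supertonic sources.
+
+```cpp
+#include <cassert>
+#include <utility>
+#include <vector>
+
+// One row of the op histogram: count plus whether Metal encodes a dispatch.
+struct OpRow {
+    const char* op;
+    int         count;
+    bool        dispatched;
+};
+
+// Rows copied from the table above (consolidated per-step graph at HEAD).
+inline std::vector<OpRow> per_step_histogram() {
+    return {
+        {"RESHAPE", 580, false},  {"ADD", 197, true},
+        {"CONT", 148, true},      {"MUL_MAT", 122, true},
+        {"IM2COL", 118, true},    {"VIEW", 88, false},
+        {"PERMUTE", 72, false},   {"MUL", 70, true},
+        {"TRANSPOSE", 68, false}, {"REPEAT", 56, true},
+        {"CONCAT", 56, true},     {"NORM", 36, true},
+        {"UNARY", 32, true},      {"ROPE", 8, true},
+        {"FLASH_ATTN_EXT", 8, true}, {"SCALE", 1, true},
+    };
+}
+
+// Returns {dispatched, metadata_only} node counts.
+inline std::pair<int, int> split_counts(const std::vector<OpRow>& rows) {
+    int dispatched = 0;
+    int metadata   = 0;
+    for (const OpRow& r : rows) {
+        (r.dispatched ? dispatched : metadata) += r.count;
+    }
+    return {dispatched, metadata};
+}
+```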
+
+### Landed wins
+
+1. **`repeat_like` returns the broadcast-compatible reshape
+   without `ggml_repeat`** — ggml_add/ggml_mul broadcast natively
+   when one operand has dim==1 in a position the other has dim==N,
+   so the explicit ggml_repeat was redundant work.  All four
+   supertonic files (vector_estimator, vocoder, text_encoder,
+   duration) had the same pattern; same fix applied to each.
+   **-226 REPEAT ops** per step graph.  Override via
+   `SUPERTONIC_FORCE_EXPLICIT_REPEAT=1`.
+
+2. **`apply_supertonic_rope_ggml` drops the defensive
+   `ggml_cont`** — the [D, H, q_len] view onto a contiguous
+   [H*D, q_len] tensor is itself contiguous (nb[0]=elem_size,
+   nb[1]=D*elem_size, nb[2]=H*D*elem_size = ne[0]*ne[1]*elem_size),
+   so `ggml_rope_ext` accepts the view directly.  **8 fewer
+   kernel_cpy dispatches per per-step graph** × 5 = 40 saved per
+   synth.
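+
+The contiguity argument in item 2 can be checked with a few lines of
+stride arithmetic.  This is a simplified sketch for non-quantized 3-D
+tensors that follows the idea behind ggml's contiguity test; it is not
+ggml's code, and `is_contiguous_3d` / `view_strides` are hypothetical
+names.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+
+// True when the byte strides nb describe a densely packed tensor of
+// shape ne (ne[0] innermost) with elements of elem_size bytes.
+inline bool is_contiguous_3d(const std::array<int64_t, 3>& ne,
+                             const std::array<std::size_t, 3>& nb,
+                             std::size_t elem_size) {
+    return nb[0] == elem_size &&
+           nb[1] == nb[0] * static_cast<std::size_t>(ne[0]) &&
+           nb[2] == nb[1] * static_cast<std::size_t>(ne[1]);
+}
+
+// Byte strides of a [D, H, q_len] view over a contiguous [H*D, q_len]
+// tensor: each head is a D-element slice of the H*D-wide row.
+inline std::array<std::size_t, 3> view_strides(int64_t D, int64_t H,
+                                               std::size_t elem_size) {
+    const std::size_t d = static_cast<std::size_t>(D);
+    const std::size_t h = static_cast<std::size_t>(H);
+    return {elem_size, d * elem_size, h * d * elem_size};
+}
+```
+
+With these strides the view passes the check for any D and H, which is
+why the extra copy was unnecessary.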
+
+### Bench delta
+
+Apple M2, q8_0, 4 threads, 5-step CFM, 3.20 s of audio, 5 runs +
+1 warmup, identical flags to the existing JSON artifacts:
+
+  | Stage                       | Phase B | post-cons | post-repeat | post-rope-cont |
+  |-----------------------------|--------:|----------:|------------:|---------------:|
+  | preprocess                  |   0.01 ms |   0.02 ms |     0.01 ms |        0.02 ms |
+  | duration                    |   2.50 ms |   3.87 ms |     4.15 ms |        4.44 ms |
+  | text_encoder                |  13.83 ms |  16.58 ms |    15.80 ms |       14.97 ms |
+  | vector_estimator (5 steps)  | 173.08 ms | 147.83 ms |   129.23 ms |      123.94 ms |
+  | vocoder                     |  59.74 ms |  60.51 ms |    53.91 ms |       53.99 ms |
+  | **total**                   | **249.92ms** | **229.06ms** | **203.04ms** | **199.90ms** |
+  | RTF                         |   0.078 |   0.071  |     0.063   |       0.062    |
+  | real-time multiplier        |  12.82× |  13.99×  |    15.78×   |      16.03×    |
+
+**Cumulative Tier 1 + early-Tier-2: -50 ms total (-20 %) vs the
+Phase B Metal baseline.**  Parity vs CPU reference preserved at
+correlation 0.9999, max abs diff 249 LSB (~3.8 % of peak
+amplitude 6639, within the float-order tolerance the
+consolidation already trades for one-graph-per-step).  Still ~47
+ms behind CPU q8_0 (153 ms) and ~55 ms behind ONNX CPU (145 ms),
+but the gap is closing.
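+
+For reference, the RTF and real-time-multiplier rows follow directly
+from the stage totals and the audio length.  A small sketch, assuming
+the quoted 3.20 s of audio (the bench's exact duration may differ in
+the third digit, so the last decimal can deviate from the table);
+both function names are illustrative.
+
+```cpp
+#include <cassert>
+#include <cmath>
+
+// Real-time factor: synthesis wall time divided by audio duration.
+inline double rtf(double total_ms, double audio_ms) {
+    return total_ms / audio_ms;
+}
+
+// Real-time multiplier: the inverse of the RTF.
+inline double realtime_multiplier(double total_ms, double audio_ms) {
+    return audio_ms / total_ms;
+}
+```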
+
+### Remaining op-level reductions
+
+- **118 IM2COL ops** are almost all K=1 1×1 convs (called from
+  `dense_matmul_time_ggml` via the existing `conv1d_f32` graph
+  fallback).  For K=1 the im2col is a transpose; could be
+  replaced with a direct `ggml_mul_mat` on the transposed
+  weight/input.  Projected ~3-6 ms saved.  Tricky to get right
+  without breaking layout assumptions of consumers.
+- **148 CONT ops** — 32 are weight-transpose conts in
+  `dense_matmul_time_ggml` (per call, but the weight is constant
+  per shape; could cache the transposed copy at engine
+  construction).  Projected ~5-8 ms saved.
+- **56 CONCAT + 56 REPEAT (remaining)** come from
+  `edge_clamp_pad_1d` materialising the replicate padding.  A
+  custom Metal `kernel_supertonic_pad_edge` would collapse these
+  into one dispatch per padding call.
+
+### Tier 2 custom Metal kernels + load-time weight prep — landed (2026-05-11)
+
+Four fused Metal kernels shipped through the local
+`tts-cpp/cmake/vcpkg-overlay-ports/ggml/` overlay (chained on top
+of the QVAC ggml port via `VCPKG_OVERLAY_PORTS`).  Each adds a
+new `GGML_OP_SUPERTONIC_*` op with a CPU forward as parity
+backstop and a Metal kernel as the production path.  Override
+each individually with the listed env var.
+
+1. **`kernel_supertonic_depthwise_1d`** (commit aa4f65c3) —
+   fuses edge-clamp pad + im2col + mul_mat + add into one Metal
+   dispatch for K ∈ {3, 5}.  Used by every ConvNeXt block in
+   vector_estimator, vocoder, text_encoder, duration.  Override:
+   `SUPERTONIC_DISABLE_FUSED_DEPTHWISE=1`.
+2. **`kernel_supertonic_layer_norm_channel`** (commit 55adf87b)
+   — fuses permute + cont + ggml_norm + mul + add + permute +
+   cont into one dispatch.  Per time-step, one threadgroup with
+   simd_sum reductions for mean/var.  Override:
+   `SUPERTONIC_DISABLE_FUSED_LAYER_NORM=1`.
+3. **`kernel_supertonic_pw2_residual`** (commit 7a5c0393) —
+   fuses `add(bias) + mul(gamma) + add(residual)` (3 ops) into
+   one dispatch at the tail of each vector ConvNeXt block.
+   Override: `SUPERTONIC_DISABLE_FUSED_PW2_RESIDUAL=1`.
+4. **`kernel_supertonic_bias_gelu`** (commit df20115d) — fuses
+   `add(bias) + gelu_erf` between pw1 and pw2 of every vector
+   ConvNeXt block.  Uses the same `erf_approx<float>` template
+   as the stock `kernel_gelu_erf_f32` so the fused output is
+   bit-identical to the unfused chain.  Override:
+   `SUPERTONIC_DISABLE_FUSED_BIAS_GELU=1`.
+
+Plus a load-time optimization (5) and a vocoder extension of kernel 4 (6):
+
+5. **Pre-transposed matmul weights** (commits e935ffb7,
+   da9553e3) — materialize transposed copies of every
+   `:onnx::MatMul_*` source weight at engine load time on
+   non-CPU backends.  Eliminates the runtime
+   `cont(transpose(w))` dispatch that `dense_matmul_time_ggml`
+   (and the direct `ggml_mul_mat` time-projection sites) used
+   to emit on every graph compute — ~24 cont sites × 5 CFM
+   steps = 120 dispatches saved per synth.  Override:
+   `SUPERTONIC_DISABLE_WEIGHT_PRETRANSPOSE=1`.
+
+6. **Vocoder pw1 fused bias_gelu** (commit 64efe99a) — extends
+   the bias_gelu fusion to the vocoder's ConvNeXt blocks.
+   `conv1d_causal_ggml(..., b=nullptr, ...)` skips the internal
+   bias-add and feeds the matmul output to the fused op
+   directly.  CPU keeps its existing cblas-inside path.  ~10
+   dispatches saved per vocoder pass.
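+
+The override gates listed above could be read with a helper along
+these lines.  This is a sketch under assumptions: the rule "set,
+non-empty and not `0` means enabled" and the names `env_flag_set` /
+`use_fused_depthwise` are illustrative, and the actual parsing in the
+Supertonic sources may differ.  The `use_cpu_fastpath` gate reflects
+the note that fused paths only run off the CPU fast path.
+
+```cpp
+#include <cassert>
+#include <cstdlib>
+#include <cstring>
+
+// Assumed convention: a flag is "on" when the variable exists, is
+// non-empty, and is not the literal string "0".
+inline bool env_flag_set(const char* name) {
+    const char* v = std::getenv(name);
+    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
+}
+
+// The fused depthwise kernel is used only on non-CPU backends and only
+// when the override variable has not been set.
+inline bool use_fused_depthwise(bool use_cpu_fastpath) {
+    return !use_cpu_fastpath &&
+           !env_flag_set("SUPERTONIC_DISABLE_FUSED_DEPTHWISE");
+}
+```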
+
+Also investigated but **not landed**:
+
+- **Vocoder pw2_residual fusion** (commit 53a58f5b explains
+  why) — the vocoder stores its block scale as
+  `gamma.ne[0] == 1` (a single learnable scalar), while
+  `pw2_residual_ggml` requires `gamma.ne[0] == C`.  Shapes
+  incompatible, would need a new vocoder-specific scalar-gamma
+  variant op for a ~0.4 ms projected gain — below the noise
+  floor of the current bench.  Skipped.
+
+### Final Tier 2 bench
+
+Apple M2, q8_0, 4 threads, 5-step CFM, 3.20 s of audio, 10
+runs + 2 warmup, `--n-gpu-layers 1` (numbers from
+`artifacts/bench/supertonic-cpp-metal-final.json`):
+
+  | Stage                       | Phase B Metal | Tier 2 final | CPU q8_0 ref |
+  |-----------------------------|--------------:|-------------:|-------------:|
+  | preprocess                  |       0.01 ms |      0.02 ms |     0.01 ms  |
+  | duration                    |       2.50 ms |      6.03 ms |     1.97 ms  |
+  | text_encoder                |      13.83 ms |     18.47 ms |    13.44 ms  |
+  | vector_estimator (5 steps)  |     173.08 ms |     97.76 ms |    94.86 ms  |
+  | vocoder                     |      59.74 ms |     52.02 ms |    43.44 ms  |
+  | **total**                   |  **249.92ms** |  **174.49ms**| **153.52ms** |
+  | RTF                         |        0.078  |       0.054  |       0.048  |
+  | real-time multiplier        |       12.82×  |       18.4×  |       20.8×  |
+
+**Cumulative Tier 1 + Tier 2 wins: -75 ms total (-30%) vs the
+Phase B Metal baseline.**  Parity vs CPU q8_0 reference holds
+at correlation 0.9999 / L∞ ≈ 1.7e-3 across the whole sequence
+— bit-identical pipeline output before/after the optimizations
+on Metal.
+
+The pretranspose A/B (env-var off vs on, same machine state)
+is the cleanest single-knob signal: total 182.75 → 174.38 ms
+(-8.37 ms), vec_est 108.61 → 100.45 ms (-8.16 ms).
+
+### Where the remaining 21 ms gap-to-CPU lives
+
+  | Stage                       | Metal Tier 2 | CPU q8_0 | Gap          |
+  |-----------------------------|-------------:|---------:|-------------:|
+  | vector_estimator (5 steps)  |      97.76 ms |   94.86 ms |     2.90 ms |
+  | vocoder                     |      52.02 ms |   43.44 ms |     8.58 ms |
+  | text_encoder                |      18.47 ms |   13.44 ms |     5.03 ms |
+  | duration / other            |        ~6 ms  |     ~2 ms   |    ~4 ms    |
+  | **total**                   |  **174.49ms** | **153.52ms** | **20.97 ms** |
+
+Vector estimator is now the Metal stage closest to CPU
+(within 3 ms of CPU on its ~100 ms budget); vocoder is at
+parity with ONNX-CPU (52.0 vs 51.3 ms) and is now the dominant
+remaining gap-to-CPU.  Vocoder uses `conv1d_causal_ggml`, not
+`dense_matmul_time_ggml`, so neither the pretranspose
+optimization nor (until 64efe99a) the fused bias_gelu applied
+there; the weights are already in conv1d-kernel `[K, IC, OC]`
+layout from the GGUF.
+
+### What's still pursuable post-Tier-2 (not in this round)
+
+1. **KV stacking on cross-attention** — concat W_key and
+   W_value along out-dim at load time so the two text-side
+   matmuls become one (Q stays separate, different input).
+   ~30 invocations per synth × ~0.1-0.2 ms each ≈ 3-6 ms
+   projected, but the small matmul size means this might be
+   noise-bound.  Could combine with pretranspose: stack the
+   pretransposed K+V into one wider weight.
+2. **Vocoder `pw2_residual_scalar_gamma` op** — new
+   vocoder-specific fused op handling `gamma.ne[0]==1`.  ~10
+   dispatches saved per vocoder pass ≈ 0.4 ms.  Below noise
+   floor; skip unless other wins are found first.
+3. **Full ConvNeXt block fusion** (the original T2.3 plan) —
+   deferred because pw1/pw2 weights are 4C×C ≈ 1MB each,
+   vastly exceeding M2's 32KB threadgroup memory budget.  Would
+   need to call out to `ggml_mul_mat` for the matmuls, which
+   defeats most of the fusion benefit.
+4. **Activation layout change** — eliminate the 32 remaining
+   `cont(transpose(activation))` calls on Q/K/V activations per
+   per-step graph.  Would require touching the whole attention
+   pipeline (rope, flash_attn, output projection) — too
+   invasive for the projected ~3-5 ms win.
+5. **CFM step batching (B=2)** — N/A for Supertonic.  The CFM
+   loop in `supertonic_engine.cpp` is a sequential ODE solver
+   (each step depends on the previous output), unlike
+   chatterbox's CFG cond+uncond pairs which fit naturally into
+   `ne[2]` batching.
+
+### Tier 2 closing the loop
+
+The Tier 2 PR (`feat/metal-optimization-supertonic` on
+tetherto/qvac-ext-lib-whisper.cpp) lands as:
+- 4 custom Metal kernels behind individual env-var gates
+- Load-time pretranspose mechanism + helper APIs
+  (`try_pretransposed_weight`, `dense_matmul_time_pretransposed_ggml`)
+- All under a local `tts-cpp/cmake/vcpkg-overlay-ports/ggml/`
+  port that chains on top of the QVAC ggml port via
+  `VCPKG_OVERLAY_PORTS`.
+- CPU q8_0 perf unchanged (the fused-kernel + pretranspose
+  paths are all gated on `!use_cpu_fastpath`).
+- Parity vs CPU reference: corr 0.9999 / L∞ 1.7e-3 throughout.
+
+## Phase A + B follow-up (2026-05-11)
+
+### Landed on this PR after Tier 2 closed
+
+| Commit     | Change | Bench delta (M2, 10 runs) |
+|------------|--------|---------------------------|
+| `bfb44092` | Phase 0: `--precision {f32,f16,q8_0}` flag + parity harness | 0 ms (infra) |
+| `8f0be955` | A1+A2: single command buffer per synth + on-GPU latent through 5-step CFM loop | –1.37 ms total |
+| `1b7496f6` | A3 step 1: enable `--precision q8_0` storage on Metal (asymmetric load) | –6.17 ms total |
+
+Cumulative on top of Tier 2: total **174.49 ms → 166.39 ms** (–4.6%).
+Real-time multiplier 18.4× → 19.3×.
+
+### Why the wins are smaller than the original Phase A+B projection
+
+The Phase A roadmap projected 30+ ms of cumulative gains.  Reality on M2
+delivered ~8 ms.  Three things drove the gap:
+
+1. **Metal command-buffer submission on M2 is much cheaper than I
+   estimated.** I cited "~1-2 ms fixed overhead per dispatch" based on
+   an earlier diagnostic; actual cost is closer to 0.1-0.3 ms.  A1+A2's
+   "single command buffer per synth" win (eliminating 4 inter-step
+   dispatches) was projected –15 to –20 ms, landed at –1.4 ms.
+2. **Unified memory makes `tensor_get`/`tensor_set` between stages
+   nearly free.** There's no PCIe transfer cost to amortize.  The
+   "on-GPU latent" win that's a big deal on discrete-GPU x86 doesn't
+   apply on Apple silicon.
+3. **`kernel_mul_mm_q8_0_f32` never fires.** A3's projected –20 to –30 ms
+   was the matmul-bandwidth win from running ggml's optimized quantized
+   matmul kernel.  But the kernel only dispatches when the quantized
+   weight is `src0` (a) of `ggml_mul_mat`.  Supertonic's `[T, IC]`
+   activation layout forces the weight into `src1` (b) via the
+   `conv1d_f32` im2col wrapper, and ggml-metal falls back to a path
+   that dequantizes to f32 first.  **The full A3 win is unlocked by
+   B2 (activation layout permutation) — and only by it.**
+
+### A4 (text_encoder + duration consolidation) — deferred
+
+Analyzed but not implemented: text_encoder currently fires ~10 separate
+`ggml_backend_graph_compute` calls (1 ConvNeXt front + 4 relpos attn
++ 4 ffn + 2 speech_prompted_attn × 2-graph pattern).  Duration adds
+~4 small dispatches.
+
+Full consolidation into 1-2 graphs would require:
+- Extracting each sub-builder (`relpos_attention_ggml`, `ffn_block_ggml`,
+  `speech_prompted_attention_ggml`) into append-to-graph helpers (the
+  same shape of refactor that A1+A2 did for the per-CFM-step subgraph).
+- Converting the host-side residual + layer_norm + tanh-key-packing
+  work between sub-graphs into ggml ops.
+- Engineering: 4-8 focused hours.
+- Realistic return based on A1+A2's measured ratio: **–2 to –4 ms total**.
+
+Deferred because: (a) ROI per hour is now smaller than B1/B2, (b) the
+text_encoder + duration combined budget is only ~21 ms — even a perfect
+collapse to 1 dispatch each saves ~5-7 ms maximum, with no compounding
+effect on the other stages, (c) it doesn't unlock anything else
+downstream (unlike B2 which unlocks A3 step 2).
+
+Re-evaluate after B2 lands.  If the team needs every ms (e.g. for a
+constrained-device target), this is the next item to revisit.
+
+### Next levers on the table
+
+| Phase | Projected (post-A1+A2 calibration) | Unblocks | Cost |
+|-------|-----------------------------------:|----------|------|
+| B1 — f16 activations end-to-end | –5 to –10 ms | nothing | medium |
+| **B2 — activation layout permutation** | –3 to –5 ms direct, **+ unlocks A3 step 2 (–15 to –25 ms)** | A3 step 2 | high (invasive, touches rope + flash_attn + every attention site) |
+| A3 step 2 — q8_0 matmul kernel firing (after B2) | –15 to –25 ms (theoretical) | — | medium-low (B2 does the heavy lifting) |
+| B3 — argument buffer reuse | –2 to –5 ms | nothing | high (Metal backend internals) |
+| A4 — text_encoder + duration consolidation | –2 to –4 ms | nothing | medium-high |
+
+**The highest-leverage move now is B2.**  Without it, A3's matmul win is
+unreachable.  The combined B2 + A3-step-2 stack is the only realistic
+path to "Metal beats CPU outright on M2."
+
+### B1 / B2 / B3 status after attempted continuation (2026-05-11)
+
+After deferring A4, I attempted B1 (f16 end-to-end) and scoped B2.
+Both proved too large for a single follow-up session.  Documented
+here for the next round.
+
+**B1 (f16 activations) — partially scaffolded, deferred:**
+- Storage already worked from Phase 0 (load logic converts q8_0 → f16
+  correctly in f16 mode).
+- Lifting the rejection at load time made compute reach the graph
+  stage, then fail at `ggml-metal-ops.cpp:2818` (`ggml_metal_op_bin`'s
+  assertion that both srcs are f32).  A non-f32 tensor is flowing into
+  a `ggml_add` / `ggml_mul` somewhere in the graph — likely an
+  auto-fused add after a matmul where ggml-metal picks the matmul
+  output type as f16 instead of f32.
+- The cleanup pass needed (audit every binary op's input types and
+  force-cast where required) is the same kind of work B2 does
+  comprehensively for activation layout.  Pair them in a "graph-wide
+  type/layout consistency pass" PR.
+
+**B2 (activation layout permutation) — fully scoped, deferred:**
+The 24 `cont(transpose(activation))` calls per step graph (Q, K and
+V at each of 8 attention sites = 24, plus the post-attn out projection
+transpose) come from converting matmul output `[T, A]` into
+`[A, L]` for rope + flash_attn.  Eliminating them requires:
+
+1. **Matmul output layout flip** — output `[A=OC, T]` directly via
+   `ggml_mul_mat(pretransposed_w_[IC,OC], activation_[IC,T])`.
+   Requires the activation already in `[IC, T]` format — which
+   requires every upstream op to produce `[IC, T]`.
+2. **New `layer_norm_channel_[C,T]` Metal kernel** — the current
+   fused kernel assumes `[T, C]` and dispatches one threadgroup per
+   time step, threads stride over channels.  For `[C, T]` the
+   threadgroup decomposition flips: one threadgroup per channel,
+   threads stride over time, OR one threadgroup per time step with
+   different stride math.  Roughly 4-8 hours of Metal kernel work.
+3. **Audit every `ggml_add` / `ggml_mul` site** for broadcast
+   compatibility under the new layout (most should work via
+   `repeat_like`'s native broadcast, but every site needs a check).
+4. **Verify rope still works on `[D, L, H]` view** of the new
+   `[A, L]` activation (likely fine — rope's input is already
+   width-major).
+
+The unblocked A3 step 2 win (Metal dispatches
+`kernel_mul_mm_q8_0_f32` natively) is what makes B2 worth the work.
+Together they target ~25-30 ms of additional Metal speedup vs
+current 166 ms.  Without A3 step 2, B2 alone delivers ~-3 to -5 ms
+(eliminating the cont(transpose) dispatches), which is below the
+maintenance cost of the kernel rewrite.
+
+Realistic estimate: 3-5 focused days as a dedicated PR.  Worth doing
+when the goal is "Metal beats CPU on M2", which is currently still
+13 ms away (Metal 166 / CPU 153).
+
+**B3 (argument buffer reuse) — scoped, deferred:**
+Metal's `MTLIndirectCommandBuffer` lets the host pre-encode a command
+buffer once and bind new input arguments per call, eliminating the
+per-call command-buffer encoding cost.  Roughly analogous to CUDA
+graph capture.
+
+Requires changes inside the ggml-metal backend (the `ggml_metal_op_*`
+encode functions, the residency-set lifecycle).  Cross-cutting work
+touching files outside `tts-cpp/cmake/vcpkg-overlay-ports/ggml/`'s
+current patches — could grow the overlay considerably.
+
+Realistic estimate: ~1 week including upstream-friendly design,
+since the right shape of this change is "improve ggml-metal for all
+users" not "patch ggml just for Supertonic."  Better as a contribution
+to the ggml-org project than a Supertonic-private optimization.
+
+### Closing the loop on Phase A+B follow-up
+
+Cumulative Metal perf trajectory across this PR:
+- Phase B baseline (correctness port):  **249.92 ms**
+- Tier 2 final (4 fused kernels + pretranspose): **174.49 ms**
+- Phase A+B follow-up (A1+A2 + A3 step 1):  **166.39 ms**
+
+That's **-83 ms / -33% total** on Metal vs the starting baseline.
+Real-time multiplier 12.82× → 19.34×.  CPU q8_0 still wins by 13 ms;
+ONNX-CPU by 21 ms.  Closing those final gaps requires B2 + A3 step 2
+as outlined above — substantial work, but the path is clear.
+
+Parity vs CPU reference held at corr ≥ 0.998 / L∞ ≤ 0.05 throughout
+every commit.  Multi-precision harness (`--precision f32|f16|q8_0`)
+ready to validate B1 + A3 step 2 wins when they land.
+
+### B2 partial landed (2026-05-11) — Metal vec_est beats CPU
+
+Investigated a smaller-scope B2 implementation and found that the
+"swap `ggml_mul_mat` arg order at Q/K/V projection sites" trick
+captures most of B2's direct win without any layer_norm kernel
+rewrite or full activation-layout permutation.
+
+The mechanism: `conv1d_f32(im2col, kernel)` produces `[T, A]` (because
+mul_mat(im2col_[IC,T], kernel_[IC,OC]) yields [T, OC]).  The Q/K/V
+projection sites then have to `cont(transpose(q_tc))` to get the
+`[A, L]` shape that rope + flash_attn want.  By calling
+`mul_mat(kernel, im2col)` instead — kernel as src0 — the result
+lands in `[A, T]` directly.  Both operands are still non-transposed
+so the assertion passes.
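+
+The shape bookkeeping behind the swap can be sketched without ggml.
+The rule modelled here is the 2-D case of `ggml_mul_mat(a, b)`: both
+operands share the inner dimension `ne[0]`, and the result has
+ne = {a.ne[1], b.ne[1]}.  `Shape2` and `mul_mat_shape` are
+illustrative names only.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstdint>
+
+using Shape2 = std::array<int64_t, 2>;  // {ne0, ne1}, ne0 innermost
+
+// Result shape of a 2-D mul_mat(a, b): the shared inner dim ne0 is
+// contracted away and the result is {a.ne1, b.ne1}.
+inline Shape2 mul_mat_shape(const Shape2& a, const Shape2& b) {
+    assert(a[0] == b[0]);  // inner dimensions must match
+    return {a[1], b[1]};
+}
+```
+
+With `im2col = {IC, T}` and `kernel = {IC, OC}`, the original order
+gives `{T, OC}` and the swapped order gives `{OC, T}`, which is the
+`[A, T]` layout rope + flash_attn consume.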
+
+Shipped as a new `dense_matmul_time_wt_pretransposed_ggml` helper.
+Eight call sites updated: 4 text-attention Q/K/V/out + 4
+style-attention Q/K/V/out across all per-step graph groups.  ~24
+cont(transpose) dispatches × 5 CFM steps = ~120 ops eliminated
+per synth.
+
+Bench (Apple M2, 10 runs + 2 warmup):
+- pre-B2 f32:    total 172.56 ms / vec_est 99.07 ms
+- **B2 partial f32: total 160.88 ms / vec_est 91.61 ms**
+- delta:         -11.68 ms total / -7.46 ms vec_est
+
+**This is the first time Metal vec_est beats the CPU baseline** (91.61
+vs 94.86 ms).  Total Metal 160.88 ms is now about 7 ms behind CPU's
+153.52 ms and 16 ms behind ONNX's 144.89 ms.
+
+Cumulative trajectory:
+- Phase B baseline:   249.92 ms (12.8× real-time)
+- Tier 2 final:       174.49 ms (18.4×)
+- Phase A+B + B2 partial: **160.88 ms (19.9×)**  ←  -36% from start
+
+**The A3 step 2 unlock (q8_0 matmul kernel dispatch) requires
+pretransposing q8_0 weights at load time.** Attempted, but the
+`ggml_reshape_3d(w_pre, 1, IC, OC)` call inside the helper produces
+an invalid q8_0 tensor when ne[0]=1 (q8_0 requires 32-element
+block alignment on the inner dim).  A clean q8_0 path needs either
+a different reshape strategy (skip the K=1 conv1d framing entirely
+and call `ggml_mul_mat(w_pre_q8, im2col_via_a_different_path)`),
+or an in-graph `ggml_im2col` that accepts a 2D kernel directly.
+Either is a focused half-day's work for ~10-20 ms more savings
+(matmul kernel bandwidth).  Deferred to a separate session.
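+
+The block-alignment constraint that breaks the `ne[0]=1` reshape is
+easy to state as a predicate.  A sketch only: q8_0 stores 32 elements
+per block, and `q8_0_row_ok` is a hypothetical helper, not a ggml
+function.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+
+// Elements per q8_0 quantization block.
+constexpr int64_t kQ8_0BlockSize = 32;
+
+// A q8_0 tensor needs its innermost dimension to be a positive
+// multiple of the block size; ne0 = 1 can never satisfy that.
+inline bool q8_0_row_ok(int64_t ne0) {
+    return ne0 > 0 && ne0 % kQ8_0BlockSize == 0;
+}
+```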
+
+### Full B2 + vocoder CT landed (2026-05-12): Metal fastest on vector_estimator, vocoder and total
+
+Built on the B2-partial trick by parameterising every fused custom
+Metal kernel on per-axis element strides (`sxt`, `sxc`, `syt`, `syc`)
+so the same compiled kernel handles both `[T, C]` and `[C, T]`
+activations.  ggml overlay-port bumped 12 → 13.  Added `_ct`
+constructors for `layer_norm_channel`, `depthwise_1d`, `pw2_residual`,
+`bias_gelu`, `edge_pad_1d`.
+
+In `supertonic_vector_estimator.cpp`: new `vector_convnext_ggml_ct`
+runs the full ConvNeXt block on `[C, T]` activations.  Pointwise
+K=1 Conv1d becomes a direct `ggml_mul_mat(w[IC,OC], x[IC,T])` (no
+im2col, no transpose).  All 16 ConvNeXt blocks in the per-step
+graph (prologue × 4 + 3 group_prep × 4 + tail × 4) wrap a single
+entry permute and a single exit permute around the chain.
+
+In `supertonic_vocoder.cpp`: same pattern for the 10-block vocoder
+ConvNeXt chain.  Vocoder differences vs vector_estimator: (1)
+depthwise is causal (left-only pad), no `_ct` causal kernel yet —
+stays on `[T, C]` with two intra-block permutes; (2) gamma is
+scalar `[1]`, so the `pw2_residual_ct` fused op doesn't fit, keep
+unfused `mul(scalar gamma) + add(residual)` tail; (3) `norm_g` /
+`norm_b` ship as `[1, C]` — same flatten-with-`ggml_reshape_1d`
+quirk as `.gamma` in vector_estimator.
+
+Discovered along the way: the legacy `pw2_residual_ggml` wrapper's
+`gamma->ne[0] == x->ne[1]` gate was silently rejecting the fused
+path for ConvNeXt all along (GGUF ships `.gamma` as `[1, C, 1, 1]`
+not `[C]`).  The `_ct` wrapper flattens it once with
+`ggml_reshape_1d`, so this is the first time the fused
+`pw2_residual` op actually runs on the ConvNeXt residual.
+
+Bench (Apple M2, q8_0 GGUF, 4 threads, 5-step CFM, 5 runs + 1 warmup,
+all four backends benched in sequence on the same machine state):
+
+| Stage (ms median)            | **ggml Metal** | ggml CPU | ONNX CPU | ONNX CoreML |
+|------------------------------|---------------:|---------:|---------:|------------:|
+| preprocess                   |          0.02 |     0.01 |     0.05 |        0.05 |
+| duration                     |          3.27 |     1.49 |     1.26 |        8.17 |
+| text_encoder                 |         12.11 |    11.70 |     8.22 |       16.26 |
+| **vector_estimator** (5 step)|     **57.87** |    90.36 |    77.04 |      177.89 |
+| **vocoder**                  |     **17.11** |    39.38 |    49.55 |       50.29 |
+| **total**                    |     **91.37** |   142.92 |   136.32 |      255.90 |
+| RTF (lower is faster)        |     **0.029** |    0.045 |    0.043 |       0.080 |
+| **real-time multiplier**     |     **35.1×** |   22.4×  |   23.5×  |       12.5× |
+
+Cumulative trajectory:
+- Phase B baseline:        249.92 ms (12.8× real-time)
+- Tier 2 final:            174.49 ms (18.4×)
+- Phase A+B + B2 partial:  160.88 ms (19.9×)
+- **Full B2 + vocoder CT: 91.37 ms (35.1×)**  ← −63% from Phase B start
+
+Overrides: `SUPERTONIC_DISABLE_CT_CONVNEXT=1` (vector_estimator),
+`SUPERTONIC_DISABLE_CT_VOCODER=1` (vocoder).
+
+Open follow-ups (small ROI, separate PR):
+- Causal-pad mode on `depthwise_1d_ct` → single chain-level
+  permute for the vocoder (currently 2 intra-block permutes per
+  block).  Projected -1 to -3 ms vocoder.
+- B1 — f16 activations end-to-end.  Storage loads today;
+  compute hits `ggml_metal_op_bin`'s f32 assertion.  Needs a
+  graph-wide binary-op type cleanup.
+- B3 — argument buffer reuse via `MTLIndirectCommandBuffer`.
+  Better as an upstream ggml-metal contribution than a
+  Supertonic-private patch.
+
+### Out of scope for this baseline
+
+- CUDA/Vulkan paths (host is Apple silicon; address Metal first).
+- Multilingual / non-English voice perf: not benchmarked separately,
+  since the measured pipeline cost is voice-agnostic.
+
 ### Distribution
 
 - Publish generated GGUFs externally if reviewers/users should avoid local
diff --git a/tts-cpp/README.md b/tts-cpp/README.md
index 9a8d2286c99..b46c1ed4ea9 100644
--- a/tts-cpp/README.md
+++ b/tts-cpp/README.md
@@ -338,28 +338,38 @@ target_link_libraries(my_app PRIVATE tts-cpp::tts-cpp)
 ```
 
 For development out of this in-tree subtree (running the parity
-harnesses, prototyping API changes, etc.) the canonical build is:
+harnesses, prototyping API changes, etc.) the canonical build is the
+**bundled-ggml dev flow**:
+
+```bash
+bash tts-cpp/scripts/setup-ggml.sh    # clones qvac-ext-ggml@speech into tts-cpp/ggml/
+cmake -S tts-cpp -B tts-cpp/build -DCMAKE_BUILD_TYPE=Release \
+  -DTTS_CPP_USE_SYSTEM_GGML=OFF
+cmake --build tts-cpp/build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
+```
+
+`setup-ggml.sh` checks out the pinned tetherto/qvac-ext-ggml@speech
+commit (which already carries every QVAC infrastructure patch + the
+Supertonic 2 fused custom op family — no `patches/` overlay needed).
+CMakeLists's `add_subdirectory(ggml)` path then consumes it directly
+with `GGML_NATIVE=ON` for native ARM/SIMD codegen — typically ~10%
+faster on M-series than the vcpkg-port flavor's portable build.
+
+Downstream production builds use the system-installed `ggml` instead:
 
 ```bash
-# Install the speech-stack ggml port via vcpkg first; then:
 cmake -S tts-cpp -B tts-cpp/build -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_TOOLCHAIN_FILE=<vcpkg_root>/scripts/buildsystems/vcpkg.cmake
 cmake --build tts-cpp/build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
 ```
 
-`TTS_CPP_USE_SYSTEM_GGML` defaults to `ON` here so the build picks
-up the patched ggml from vcpkg automatically; flipping it `OFF` in
-this subtree is rejected at configure time (no `patches/` to apply).
-GPU acceleration is selected at the ggml-port level - the
-`ggml-speech` port already carries the Metal / Vulkan / OpenCL
-backend support its consumers ask for; pass `--n-gpu-layers 99` at
-runtime to actually use the compiled GPU backend.
-
-If you need a bundled-ggml dev build (`add_subdirectory(ggml)` with
-patches applied locally rather than coming from vcpkg), use the
-standalone [`chatterbox.cpp`](https://github.com/gianni-cor/chatterbox.cpp)
-repo - the source-of-truth this subtree was copied from - which keeps
-`scripts/setup-ggml.sh` + `patches/` for that flow.
+`TTS_CPP_USE_SYSTEM_GGML` defaults to `ON` for this flow, finding
+the `ggml-speech` port from qvac-registry-vcpkg (which pulls
+qvac-ext-ggml@speech with patches as commits).  GPU acceleration is
+selected at the ggml-port level — the port already carries the
+Metal / Vulkan / OpenCL backend support its consumers ask for; pass
+`--n-gpu-layers 99` at runtime to actually use the compiled GPU
+backend.
 
 ### Useful CMake options
 
diff --git a/tts-cpp/include/tts-cpp/supertonic/engine.h b/tts-cpp/include/tts-cpp/supertonic/engine.h
index 76bd692e516..997dc5e22e4 100644
--- a/tts-cpp/include/tts-cpp/supertonic/engine.h
+++ b/tts-cpp/include/tts-cpp/supertonic/engine.h
@@ -14,7 +14,15 @@
 //
 //     EngineOptions opts;
 //     opts.model_gguf_path = "models/supertonic.gguf";
-//     opts.n_gpu_layers    = 0;                      // CPU only today
+//     opts.n_gpu_layers    = 0;                      // 0 = CPU; >0 enables the GPU
+//                                                    // backend compiled in (Metal on
+//                                                    // macOS, else CUDA / Vulkan / OpenCL).
+//                                                    // Metal on Apple silicon is the
+//                                                    // fastest backend as of 2026-05-12
+//                                                    // (~35× realtime on M2, beats
+//                                                    // ggml-CPU, ONNX-CPU and ONNX-CoreML
+//                                                    // on every stage that matters).
+//                                                    // See PROGRESS_SUPERTONIC.md.
 //
 //     Engine engine(opts);
 //     for (const auto & line : lines) {
@@ -37,12 +45,35 @@
 #include "tts-cpp/backend.h"
 #include "tts-cpp/export.h"
 
+#include <cstddef>
+#include <functional>
+#include <map>
 #include <memory>
 #include <string>
 #include <vector>
 
 namespace tts_cpp::supertonic {
 
+// Compute precision for matmul weights inside the model buffer.  Selects
+// how the GGUF's stored q8_0 weights are loaded into the resident model:
+//   - F32  (default): expand q8_0 to f32 at load time.  CPU path uses
+//          cblas/AMX f32 matmul.  Metal path uses kernel_mul_mat_f32_f32.
+//          Highest accuracy + simplest, but on Metal misses the 4×
+//          weight-bandwidth win of running the native q8_0 matmul kernel.
+//   - F16  (Phase B1): expand q8_0 to f16 at load time, run f16 matmul
+//          with f32 accumulator.  ~2× less activation bandwidth on Metal,
+//          may drift slightly across the 5 CFM steps (parity tolerance
+//          relaxed to ~1e-2 L_inf).
+//   - Q8_0 (Phase A3): keep weights as q8_0 in the model buffer, let
+//          ggml's quantized matmul kernels dispatch directly.  Metal-only
+//          (Phase A3 makes the load logic asymmetric: q8_0 on Metal, f32
+//          on CPU).
+enum class Precision {
+    F32,
+    F16,
+    Q8_0,
+};
+
 struct EngineOptions {
     // Required.
     std::string model_gguf_path;
@@ -56,6 +87,101 @@ struct EngineOptions {
     int   n_threads     = 0;
     int   n_gpu_layers  = 0;
 
+    // Compute precision for matmul weights — see Precision enum above.
+    // Default F32 is the current behaviour (load q8_0 GGUF, expand to f32).
+    // F16 / Q8_0 are non-default GPU paths (Metal-validated).
+    Precision precision = Precision::F32;
+
+    // F16 K/V flash-attention in the vector estimator.  When -1, the
+    // engine auto-enables this on GPU backends (non-CPU) and disables
+    // it on CPU; pass 1 / 0 to force the setting regardless of the
+    // resolved backend.  Triggers the OpenCL `flash_attn_f32_f16`
+    // path on Adreno; mirrors chatterbox's `--cfm-f16-kv-attn`.  No
+    // effect on CPU (the cblas attention path is already efficient).
+    // On Vulkan dispatches `kernel_flash_attn_f32_f16_*` (head_dim=64
+    // satisfies the `HSK % 8 == 0` supports_op gate; see
+    // `ggml-vulkan.cpp:GGML_OP_FLASH_ATTN_EXT`).
+    int f16_attn = -1;
+
+    // QVAC-18605 — Vulkan adapter index.  Passed verbatim to
+    // `ggml_backend_vk_init(idx)` when the build is compiled with
+    // `GGML_VULKAN=ON` and `n_gpu_layers > 0`.  Range-checked
+    // against `ggml_backend_vk_get_device_count()` at load; an
+    // out-of-range value throws (no silent CPU fallback — that
+    // would mask CLI typos / wrong-machine config).  Default 0
+    // (the historical hard-coded value).  Negative values are
+    // reserved for a future "auto-pick best device" policy.
+    int vulkan_device = 0;
+
+    // F16 storage type for the audit-identified hot matmul /
+    // pointwise-conv weights (vector-estimator attention W_*,
+    // pwconv1/pwconv2 across every convnext block, vocoder
+    // head linear, text-encoder linears, …).  Same -1/0/1 tri-state
+    // as `f16_attn`: -1 auto (on for GPU, off for CPU); 0 or 1 force.
+    // Halves the GPU read bandwidth into those ops with a small
+    // (≤ 2e-3 abs / 5e-3 cosine) numerical drift on the end-to-end
+    // synth.  Mirrors chatterbox's CHATTERBOX_F16_CFM gate.
+    // Orthogonal to `precision`: this is a per-op runtime selector for
+    // the OpenCL hot-weight materialisation, while `precision` decides
+    // the storage type of all matmul weights uniformly.
+    int f16_weights = -1;
+
+    // QVAC-18605 round 6 — extra deny-list for F16 weight
+    // materialization, layered ON TOP of the curated allow-list
+    // in `should_materialise_f16_weight()`.  Each entry is a
+    // substring; if ANY non-empty entry is found inside a
+    // tensor's source name, that tensor stays at its native
+    // storage type (typically F32) even when `f16_weights` is
+    // on.  Empty strings are skipped (no-op) so a stray empty
+    // entry from a config-file typo doesn't silently disable F16
+    // weights for the whole model.
+    //
+    // Use cases:
+    //   - A/B testing a specific tensor pattern without recompiling.
+    //   - Force-keeping a tensor as F32 if drift on a particular
+    //     adapter / driver / shape is observed.
+    //   - Safety net for new tensor patterns added in future
+    //     GGUFs that the curated allow-list inadvertently scoops in.
+    //
+    // Default empty (zero behaviour change for every existing
+    // operator config).  No effect when `f16_weights == 0`.
+    std::vector<std::string> f16_weights_deny_list;
+
+    // QVAC-18605 round 4 — multi-dtype K/V flash-attention dispatch
+    // for the vector estimator's attention sites.  Generalises the
+    // round-1 `f16_attn` boolean (F16 vs F32 only) to:
+    //
+    //   -1 → auto (default — falls back to `f16_attn`'s value;
+    //              identical behaviour to round 1 / 2 / 3 / 5 / 6
+    //              for every existing operator config)
+    //    0 → f32  (force F32 K/V — useful for parity-harness runs
+    //              and for triaging a perf cliff caused by F16
+    //              underflow on a specific model + adapter combo)
+    //    1 → f16  (same as `f16_attn=1`; OpenCL adreno fast path,
+    //              Vulkan `kernel_flash_attn_f32_f16_*`)
+    //    2 → bf16 (Vulkan coopmat2; wider exponent range than F16
+    //              but fewer mantissa bits; identical bandwidth to
+    //              F16, no underflow on small attention scores;
+    //              falls back to f32 on adapters without coopmat2)
+    //    3 → q8_0 (Vulkan; halves the K/V upload bandwidth, which
+    //              helps upload-bound workloads; falls back to
+    //              f32 on backends without Q8_0 K/V flash-attn)
+    //
+    // Probe-gated graceful fallback to F32 on adapters that don't
+    // support the requested dtype — same advisory-probe semantics
+    // as `f16_attn`'s round-1 auto-policy, so an operator config
+    // setting `--kv-attn-type bf16` works on both NVIDIA Ampere+
+    // and Intel ARC (BF16 effective on the former; silent F32 on
+    // the latter) without crashing.  Out-of-range values throw
+    // loudly to surface CLI typos.
+    //
+    // When the resolved value is non-f32, the legacy
+    // `model.use_f16_attn` boolean is ALSO updated to
+    // `(resolved == f16)` so any code path still keying on the
+    // boolean (text-encoder / duration / vocoder; not the vector
+    // estimator) sees the historically-correct value.
+    int kv_attn_type = -1;
+
     // Directory to scan for dynamically-loaded ggml backends
     // (`libspeech-ggml-vulkan.so`, `libspeech-ggml-opencl.so`,
     // `libspeech-ggml-cpu-android_armv8.2_1.so`, ...). Forwarded to
@@ -89,8 +215,123 @@ struct EngineOptions {
     // predicted length) and the seeded RNG is bypassed.  Useful for
     // byte-exact reproduction of an ONNX/PyTorch reference run.
     std::string noise_npy_path;
+
+    // ---------------- Streaming synthesis ----------------------------
+    //
+    // When `stream_chunk_tokens > 0` AND a non-empty callback is passed
+    // to synthesize(), the engine splits `text` into chunks of roughly
+    // `stream_chunk_tokens` Unicode code points (Supertonic's text-token
+    // grain — see supertonic_text_to_ids), runs the full pipeline per
+    // chunk, and invokes the callback with each chunk's PCM as it's
+    // produced.  The returned SynthesisResult.pcm still contains the
+    // concatenated audio (the callback is an *addition*, not a
+    // replacement).  Streaming is disabled when stream_chunk_tokens == 0
+    // OR the callback is empty — both paths fall through to the batch
+    // path with no per-chunk overhead.
+    //
+    //   stream_chunk_tokens         Target chunk size in text tokens.
+    //                               ~50 ≈ 1-3 s English audio; CJK
+    //                               languages are denser so a lower
+    //                               target (~25-30) tends to feel
+    //                               better.  0 disables streaming.
+    //
+    //   stream_first_chunk_tokens   Override for the *first* chunk so
+    //                               first audio lands early while later
+    //                               chunks stay at the larger target
+    //                               for steady-state throughput.
+    //                               0 = same as stream_chunk_tokens.
+    //
+    //   stream_chunk_tolerance_pct  Boundary-snap window for CLAUSE and
+    //                               WHITESPACE fallbacks (±N% of target).
+    //                               Sentence-end is searched on a much
+    //                               wider implicit window (target/2 to
+    //                               3× target): sentence-aligned chunks
+    //                               let the per-chunk duration predictor
+    //                               and attention phrase naturally.
+    //                               Mid-clause cuts also work, because a
+    //                               continuation flag in preprocess
+    //                               avoids the artificial trailing
+    //                               period that would otherwise make
+    //                               the model speak the stub as a
+    //                               complete sentence.  They still
+    //                               produce audible pauses + rate shifts
+    //                               at seams, since the model is not
+    //                               streaming-trained.  Default 20.
+    //
+    //   stream_min_chunk_tokens     Hard floor on every chunk's size.
+    //                               Effective targets are
+    //                               max(target, min) — below the floor
+    //                               the model glitches on stub input
+    //                               (dropped / muddled phonemes,
+    //                               verified empirically).  Trailing
+    //                               chunks shorter than the floor are
+    //                               merged into the previous chunk.
+    //                               Default 30.
+    int stream_chunk_tokens        = 0;
+    int stream_first_chunk_tokens  = 0;
+    int stream_chunk_tolerance_pct = 20;
+    int stream_min_chunk_tokens    = 30;
+
+    // QVAC-18605 follow-up — first-synth-latency pre-warming.
+    //
+    // When non-empty, the Engine ctor invokes `warm_up(prewarm_text)`
+    // immediately after the GGUF load + voice validation, running one
+    // throwaway synth on the supplied text.  On Vulkan / OpenCL this
+    // forces the GPU shader pipelines for every Supertonic stage to
+    // compile up-front (the in-tree thread_local graph caches handle
+    // every subsequent call but can't avoid the first pipeline-compile
+    // cost — measured ~hundreds of ms on first synth on Adreno + RADV
+    // in chatterbox PROGRESS.md), so the operator-visible first synth
+    // call sees ~steady-state latency.  No effect on CPU (no shader
+    // compilation cost; warm_up returns immediately on
+    // `model.backend_is_cpu`).
+    //
+    // Pre-warm text should be similar in length to representative
+    // production input — the per-stage graph caches are keyed on
+    // (text_len, latent_len) tuples, so a too-short pre-warm leaves
+    // a graph-rebuild on the first real call (still saves the
+    // shader-compile cost; only the cgraph allocation is repeated).
+    // Default empty (no pre-warming).
+    std::string prewarm_text;
+
+    // QVAC-18605 round 7 — Vulkan env-var passthrough.
+    //
+    // Applied to the process environment via `set_env_if_unset`
+    // semantics just before `init_supertonic_backend()` runs.
+    // Each key MUST start with `GGML_VK_` (operator-config typo
+    // guard — invalid keys throw at engine-construction time, no
+    // partial-application).
+    //
+    // Operator-set env vars (already present in the environment
+    // when the Engine ctor runs) WIN over these overrides — lets
+    // a debugging operator force-disable a setting from the shell
+    // without recompiling, while still letting an EngineOptions
+    // configuration set the same knob in production.
+    //
+    // Example use cases (the round-7 CLI flags map onto these):
+    //   {"GGML_VK_PREFER_HOST_MEMORY",      "1"}  // --vulkan-prefer-host-memory
+    //   {"GGML_VK_DISABLE_COOPMAT2",        "1"}  // --vulkan-disable-coopmat2
+    //   {"GGML_VK_DISABLE_BFLOAT16",        "1"}  // --vulkan-disable-bfloat16
+    //   {"GGML_VK_PERF_LOGGER",             "1"}  // --vulkan-perf-logger
+    //   {"GGML_VK_ASYNC_USE_TRANSFER_QUEUE","1"}  // --vulkan-async-transfer
+    //
+    // Default empty (zero behaviour change for every existing
+    // operator config).
+    std::map<std::string, std::string> vulkan_env_overrides;
 };
 
+// Per-chunk PCM callback for streaming synthesis.  Receives a pointer to
+// `samples` consecutive float32 mono samples at SynthesisResult::sample_rate
+// (typically 44.1 kHz — read from model metadata, not hard-coded).  The
+// buffer is owned by the engine and must not be retained past the
+// callback; copy out if you need the data.
+//   `chunk_index`  0-based index of the chunk within the current synth.
+//   `is_last`      true on the final chunk (after which synthesize() returns).
+// Throwing from this callback aborts synthesis (the exception propagates
+// out of synthesize()).
+using StreamCallback = std::function<void(
+    const float * pcm, std::size_t samples, int chunk_index, bool is_last)>;
+
 struct SynthesisResult {
     std::vector<float> pcm;
     int   sample_rate = 44100;
@@ -123,12 +364,41 @@ class TTS_CPP_API Engine {
     // Not safe to call concurrently on the same Engine instance.
     SynthesisResult synthesize(const std::string & text);
 
+    // Same as above, but when `options().stream_chunk_tokens > 0` and
+    // `on_chunk` is non-empty, runs the chunked pipeline and invokes
+    // `on_chunk` with each chunk's PCM in order.  The returned
+    // SynthesisResult.pcm still contains the concatenated audio (the
+    // callback is an *addition*, not a replacement).  Falls through to
+    // the batch path when either condition is false.
+    SynthesisResult synthesize(const std::string & text,
+                               const StreamCallback & on_chunk);
+
     // Best-effort cancel of an in-flight synthesize() call on another
     // thread.  Setting the flag is all this does; actual termination
     // happens at the next cancellation check inside the vector-
     // estimator loop (one step is the worst-case cancel latency).
     void cancel();
 
+    // QVAC-18605 follow-up — first-synth-latency pre-warming.
+    //
+    // Runs one throwaway synth on `text` to force every per-stage
+    // GPU graph cache to populate and every Vulkan / OpenCL shader
+    // pipeline to compile up-front.  The PCM result is discarded.
+    // Subsequent `synthesize()` calls hit the warmed caches +
+    // pre-compiled pipelines, so the operator-visible first synth
+    // sees steady-state latency.
+    //
+    // No-op on CPU backends (no pipeline cache to warm).  Auto-
+    // invoked by the ctor when `EngineOptions::prewarm_text` is
+    // non-empty; callers can also invoke explicitly mid-life when
+    // they need to warm a different shape (e.g. switching from a
+    // short-prompt to a long-prompt workload).
+    //
+    // Throws on the same conditions as `synthesize()` — if the
+    // throwaway synth fails for any reason, the failure surfaces
+    // here rather than being swallowed.
+    void warm_up(const std::string & text);
+
     // Return the options the engine was constructed with (convenience
     // for callers that want to introspect the resolved n_gpu_layers /
     // n_threads after defaults are applied).
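
Review note on `EngineOptions::vulkan_env_overrides`: the header comment describes three rules (every key must start with `GGML_VK_`, an invalid key rejects the whole map with no partial application, and a variable already present in the environment wins). A minimal Python sketch of those semantics follows. It is not the C++ implementation, which is not shown in this diff; the function name `apply_vulkan_env_overrides` and its `environ` parameter are hypothetical and exist only for illustration.

```python
import os


def apply_vulkan_env_overrides(overrides, environ=None):
    """Hypothetical helper mirroring the documented override rules.

    Returns the list of keys that were actually written.
    """
    env = os.environ if environ is None else environ

    # Validate every key before touching the environment, so a single
    # typo rejects the whole map (no partial application).
    bad = [key for key in overrides if not key.startswith("GGML_VK_")]
    if bad:
        raise ValueError(f"keys must start with GGML_VK_: {bad}")

    applied = []
    for key, value in overrides.items():
        # Set-if-unset: a value the operator exported already wins.
        if key not in env:
            env[key] = value
            applied.append(key)
    return applied


# Demo on a plain dict so the real process environment is untouched.
demo_env = {"GGML_VK_PERF_LOGGER": "0"}  # pretend the operator exported this
written = apply_vulkan_env_overrides(
    {"GGML_VK_PERF_LOGGER": "1", "GGML_VK_DISABLE_COOPMAT2": "1"},
    demo_env,
)
print(written, demo_env)
```

Validating before writing is what gives the "no partial application" guarantee: a loop that validated and wrote in the same pass could leave earlier keys set when a later key throws.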
diff --git a/tts-cpp/scripts/setup-ggml.sh b/tts-cpp/scripts/setup-ggml.sh
new file mode 100755
index 00000000000..656d0b61f24
--- /dev/null
+++ b/tts-cpp/scripts/setup-ggml.sh
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+#
+# setup-ggml.sh — clone the qvac-ext-ggml@speech branch into tts-cpp/ggml/
+#
+# The bundled-ggml dev build path for tts-cpp out of this in-tree subtree.
+# Replaces the vcpkg-port consumption when you want a fast iteration loop
+# without going through vcpkg installs.
+#
+# Pinned to a fixed commit on the `speech` branch of tetherto/qvac-ext-ggml,
+# a fork of ggml-org/ggml that carries all QVAC infrastructure patches plus
+# the Supertonic 2 fused custom op family pre-applied as commits, so no
+# patches/ directory is needed at this layer.
+#
+# Usage:
+#   bash tts-cpp/scripts/setup-ggml.sh
+#   cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
+#   cmake --build tts-cpp/build -j
+#
+# To update to a newer pin: bump GGML_REF below and re-run.  The script
+# is idempotent — re-running checks out the right ref into the existing
+# tts-cpp/ggml/ clone without re-cloning.
+
+set -euo pipefail
+
+GGML_REPO_URL="https://github.com/tetherto/qvac-ext-ggml.git"
+GGML_REF="60a172e48f699bd0a00575ef911feed9473b2187"   # merge of qvac-ext-ggml#8 (speech HEAD)
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+TTS_CPP_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
+GGML_DIR="${TTS_CPP_DIR}/ggml"
+
+if [ -d "${GGML_DIR}/.git" ]; then
+    echo "setup-ggml: existing clone at ${GGML_DIR} — fetching + checking out pin ${GGML_REF:0:10}"
+    git -C "${GGML_DIR}" fetch --depth 1 origin "${GGML_REF}"
+    git -C "${GGML_DIR}" checkout --detach "${GGML_REF}"
+else
+    echo "setup-ggml: cloning qvac-ext-ggml @ ${GGML_REF:0:10} into ${GGML_DIR}"
+    rm -rf "${GGML_DIR}"
+    git clone --depth 1 --no-tags "${GGML_REPO_URL}" "${GGML_DIR}"
+    git -C "${GGML_DIR}" fetch --depth 1 origin "${GGML_REF}"
+    git -C "${GGML_DIR}" checkout --detach "${GGML_REF}"
+fi
+
+echo "setup-ggml: tts-cpp/ggml/ ready at $(git -C "${GGML_DIR}" rev-parse --short HEAD)"
+echo "setup-ggml: next: cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF"
diff --git a/tts-cpp/scripts/validate-precision-parity.sh b/tts-cpp/scripts/validate-precision-parity.sh
new file mode 100755
index 00000000000..ce6c29208c8
--- /dev/null
+++ b/tts-cpp/scripts/validate-precision-parity.sh
@@ -0,0 +1,168 @@
+#!/usr/bin/env bash
+# Multi-precision parity + bench harness for Supertonic 2.
+#
+# For each supported precision (f32, f16, q8_0):
+#   1. Synthesizes a reference WAV on CPU at that precision.
+#   2. Synthesizes the same WAV on Metal at the same precision.
+#   3. Reports parity (corr, L_inf, RMS) between the two.
+#   4. Optionally runs supertonic-bench at the same precision and emits
+#      a per-precision JSON artifact alongside.
+#
+# Usage:
+#   bash scripts/validate-precision-parity.sh [--bench] [--text TEXT] [--model PATH]
+#                                             [--precisions f32,f16,q8_0]
+#
+# Precisions not yet wired through the graph builders fail at load with
+# a clear "scaffolded but not yet supported" message and are skipped (not
+# counted as a parity failure).  This lets the harness be useful right
+# now while Phase A3 / B1 work lands.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+MODEL="$ROOT/models/supertonic2.gguf"
+TEXT="The quick brown fox jumps over the lazy dog."
+PRECISIONS="f32,f16,q8_0"
+DO_BENCH=0
+RUNS=10
+WARMUP=2
+THREADS=4
+ARTIFACT_DIR="$ROOT/artifacts/bench/parity-matrix"
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --bench)       DO_BENCH=1; shift ;;
+        --text)        TEXT="$2"; shift 2 ;;
+        --model)       MODEL="$2"; shift 2 ;;
+        --precisions)  PRECISIONS="$2"; shift 2 ;;
+        --runs)        RUNS="$2"; shift 2 ;;
+        --warmup)      WARMUP="$2"; shift 2 ;;
+        --threads)     THREADS="$2"; shift 2 ;;
+        --artifact-dir) ARTIFACT_DIR="$2"; shift 2 ;;
+        -h|--help)
+            sed -n '2,/^set -euo/p' "$0" | sed 's/^# //; s/^#//; /^set -euo/d'
+            exit 0 ;;
+        *) echo "unknown arg: $1" >&2; exit 2 ;;
+    esac
+done
+
+CLI="$ROOT/build/supertonic-cli"
+BENCH="$ROOT/build/supertonic-bench"
+PY="$ROOT/.venv/bin/python3"
+if [[ ! -x "$CLI" ]]; then
+    echo "build/supertonic-cli not found. Run 'cmake --build build --target supertonic-cli' first." >&2
+    exit 1
+fi
+if [[ "$DO_BENCH" -eq 1 && ! -x "$BENCH" ]]; then
+    echo "--bench requested but build/supertonic-bench not found." >&2
+    exit 1
+fi
+if [[ ! -x "$PY" ]]; then
+    echo "$PY not found. Create a venv at $ROOT/.venv with numpy installed (wave ships with Python)." >&2
+    exit 1
+fi
+
+mkdir -p "$ARTIFACT_DIR"
+TMP="$(mktemp -d)"
+trap 'rm -rf "$TMP"' EXIT
+
+printf "\nSupertonic 2 multi-precision parity + bench harness\n"
+printf "  model:      %s\n" "$MODEL"
+printf "  text:       %.60s%s\n" "$TEXT" "$([[ ${#TEXT} -gt 60 ]] && echo '...')"
+printf "  precisions: %s\n" "$PRECISIONS"
+printf "  bench:      %s\n\n" "$([[ "$DO_BENCH" -eq 1 ]] && echo 'yes' || echo 'no')"
+
+OVERALL_RC=0
+IFS=',' read -r -a PREC_ARR <<< "$PRECISIONS"
+for P in "${PREC_ARR[@]}"; do
+    P_TRIM="$(echo "$P" | xargs)"
+    CPU_WAV="$TMP/cpu-$P_TRIM.wav"
+    MTL_WAV="$TMP/mtl-$P_TRIM.wav"
+
+    printf "=== %s ===\n" "$P_TRIM"
+
+    set +e
+    CPU_LOG="$("$CLI" --model "$MODEL" --text "$TEXT" --n-gpu-layers 0 \
+                       --precision "$P_TRIM" --out "$CPU_WAV" 2>&1)"
+    CPU_RC=$?
+    MTL_LOG="$("$CLI" --model "$MODEL" --text "$TEXT" --n-gpu-layers 1 \
+                       --precision "$P_TRIM" --out "$MTL_WAV" 2>&1)"
+    MTL_RC=$?
+    set -e
+
+    if echo "$CPU_LOG$MTL_LOG" | grep -qE "scaffolded but not yet|partially scaffolded"; then
+        printf "  SKIP: precision %s not yet wired through graph builders (Phase A3/B1)\n\n" "$P_TRIM"
+        continue
+    fi
+    # Tolerate the harmless post-write atexit `GGML_ASSERT([rsets->data count] == 0)`
+    # that fires on Metal cleanup AFTER the WAV is fully written.  Treat the run as
+    # successful iff the WAV file exists and is at least 1 KB (covers a synthesized
+    # signal, well above an empty/header-only file).
+    cpu_ok=1; mtl_ok=1
+    [[ -s "$CPU_WAV" ]] || cpu_ok=0
+    [[ -s "$MTL_WAV" ]] || mtl_ok=0
+    if [[ -f "$CPU_WAV" ]]; then
+        size=$(wc -c < "$CPU_WAV")
+        [[ $size -lt 1024 ]] && cpu_ok=0
+    fi
+    if [[ -f "$MTL_WAV" ]]; then
+        size=$(wc -c < "$MTL_WAV")
+        [[ $size -lt 1024 ]] && mtl_ok=0
+    fi
+    if [[ $cpu_ok -eq 0 || $mtl_ok -eq 0 ]]; then
+        printf "  FAIL: synthesis errored.  cpu_rc=%d mtl_rc=%d  wav_ok cpu=%d mtl=%d\n" \
+               "$CPU_RC" "$MTL_RC" "$cpu_ok" "$mtl_ok"
+        printf "  --- cpu tail ---\n%s\n  --- metal tail ---\n%s\n\n" \
+               "$(echo "$CPU_LOG" | tail -3)" "$(echo "$MTL_LOG" | tail -3)"
+        OVERALL_RC=1
+        continue
+    fi
+
+    PY_RC=0   # set via '||' below so 'set -e' cannot abort the loop on a parity FAIL
+    "$PY" - <<PY || PY_RC=$?
+import wave, numpy as np, sys
+def load(p):
+    with wave.open(p, 'rb') as w:
+        return np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16).astype(np.float32) / 32768.0
+a = load("$CPU_WAV")
+b = load("$MTL_WAV")
+n = min(len(a), len(b))
+a, b = a[:n], b[:n]
+corr = float(np.corrcoef(a, b)[0, 1])
+linf = float(np.max(np.abs(a - b)))
+rms  = float(np.sqrt(np.mean((a - b) ** 2)))
+# Per-precision tolerance: numbers chosen against observed CPU↔Metal drift
+# on the benchmark text "The quick brown fox jumps over the lazy dog.".
+# Short text routinely gets L_inf ≈ 1.7e-3; long text accumulates more
+# float-order drift across 5 CFM steps × more attention positions, landing
+# around L_inf ≈ 3.7e-2 with corr ≥ 0.998; audibly identical for f32.
+# Q8_0 has additional drift from the dequant→transpose→requantize round-trip
+# in the asymmetric load path (Metal keeps q8_0, CPU expands to f32, so the
+# two paths use slightly differently-quantized weights).  Audibly identical.
+tol_corr = {"f32": 0.998,  "f16": 0.99,  "q8_0": 0.96}.get("$P_TRIM", 0.99)
+tol_linf = {"f32": 0.05,   "f16": 0.10,  "q8_0": 0.15 }.get("$P_TRIM", 0.10)
+print(f"  corr={corr:.6f} (tol >= {tol_corr})  L_inf={linf:.6f} (tol <= {tol_linf})  RMS={rms:.6f}")
+ok = corr >= tol_corr and linf <= tol_linf
+print("  PASS" if ok else "  FAIL parity")
+sys.exit(0 if ok else 1)
+PY
+    if [[ $PY_RC -ne 0 ]]; then OVERALL_RC=1; fi
+
+    if [[ "$DO_BENCH" -eq 1 ]]; then
+        JSON="$ARTIFACT_DIR/supertonic-mtl-${P_TRIM}.json"
+        printf "  bench --> %s\n" "$JSON"
+        "$BENCH" --model "$MODEL" --text "$TEXT" \
+                  --voice M1 --language en --steps 5 --speed 1.05 --seed 42 \
+                  --runs "$RUNS" --warmup "$WARMUP" --threads "$THREADS" \
+                  --n-gpu-layers 1 --precision "$P_TRIM" \
+                  --json-out "$JSON" 2>&1 | grep -E '^\s*(vector_estimator|vocoder|text_encoder|total|RTF|Real-time)' || true
+    fi
+    printf "\n"
+done
+
+if [[ $OVERALL_RC -eq 0 ]]; then
+    printf "All wired-up precisions pass parity.\n"
+else
+    printf "One or more precisions failed parity (or errored).\n" >&2
+fi
+exit $OVERALL_RC
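
Review note on the parity check in `validate-precision-parity.sh`: the embedded Python compares two WAVs by Pearson correlation, L_inf, and RMS against per-precision tolerances. The sketch below reproduces the same three metrics and the same tolerance tables in standard-library Python so the pass/fail rule can be tried without numpy or real audio. The `parity` function name and the synthetic 220 Hz test signal are assumptions for illustration; they are not part of the script.

```python
import math
import random

# Tolerances copied from the harness; (0.99, 0.10) is the fallback the
# script applies to precision names it does not recognise.
TOL_CORR = {"f32": 0.998, "f16": 0.99, "q8_0": 0.96}
TOL_LINF = {"f32": 0.05, "f16": 0.10, "q8_0": 0.15}


def parity(a, b, precision):
    """Return (corr, linf, rms, ok) for two mono float sequences."""
    n = min(len(a), len(b))  # compare only the overlapping prefix
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    corr = cov / math.sqrt(var_a * var_b)  # Pearson correlation
    diffs = [x - y for x, y in zip(a, b)]
    linf = max(abs(d) for d in diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / n)
    ok = (corr >= TOL_CORR.get(precision, 0.99)
          and linf <= TOL_LINF.get(precision, 0.10))
    return corr, linf, rms, ok


# Synthetic stand-ins for the CPU and Metal outputs (assumption: a short
# 220 Hz tone plus small seeded noise, not real synthesis output).
rng = random.Random(0)
cpu = [0.5 * math.sin(2 * math.pi * 220.0 * i / 44100.0) for i in range(4410)]
mtl = [x + rng.uniform(-1e-3, 1e-3) for x in cpu]

corr, linf, rms, ok = parity(cpu, mtl, "f32")
print(f"corr={corr:.6f}  L_inf={linf:.6f}  RMS={rms:.6f}  ok={ok}")
```

Correlation ignores constant offsets and gain, so the L_inf bound is what catches a backend that produces the right waveform shape at the wrong level.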
diff --git a/tts-cpp/src/backend_selection.cpp b/tts-cpp/src/backend_selection.cpp
index bcb417d17cc..f670a5719e0 100644
--- a/tts-cpp/src/backend_selection.cpp
+++ b/tts-cpp/src/backend_selection.cpp
@@ -10,6 +10,7 @@
 #include <cstring>
 #include <mutex>
 #include <regex>
+#include <stdexcept>
 #include <string>
 #include <vector>
 
@@ -53,6 +54,70 @@ const char * dev_reg_name(ggml_backend_dev_t dev) {
     return reg ? ggml_backend_reg_name(reg) : "";
 }
 
+// QVAC-18605 — Vulkan multi-adapter pick. Pure logic on the two
+// per-device vectors so the policy stays unit-testable (a richer
+// copy lives in `tts_cpp::supertonic::detail::resolve_vulkan_device_index`
+// with its own DocTest harness; this in-process copy is kept lean so
+// the shared GPU-init helper doesn't introduce a back-edge into the
+// supertonic translation unit).
+//
+//   requested == -1 → auto-pick: argmax(free_vram), but if any
+//                     discrete adapter exists, restrict the argmax
+//                     to the discrete subset (excludes UMA iGPUs
+//                     reporting system RAM as free VRAM).
+//   requested ==  0 → first adapter in registry order.
+//   requested >   0 → that adapter index (0-based against the
+//                     Vulkan-only subset).
+//   requested <  -1 → reserved; throws.
+// Out-of-range positive index throws too. The UMA vector may be empty
+// (no UMA bias); a non-empty one must match the VRAM vector's length.
+int pick_vulkan_device_index(int requested,
+                             const std::vector<size_t> & free_vram_per_device,
+                             const std::vector<bool> &   is_uma_per_device) {
+    const int dev_count = (int) free_vram_per_device.size();
+    if (dev_count <= 0) {
+        throw std::runtime_error(
+            "tts-cpp: cannot resolve --vulkan-device against an empty "
+            "device list (no Vulkan adapter visible)");
+    }
+    if (!is_uma_per_device.empty() &&
+        is_uma_per_device.size() != free_vram_per_device.size()) {
+        throw std::runtime_error("tts-cpp: is_uma_per_device length mismatch");
+    }
+    if (requested < -1) {
+        throw std::runtime_error(
+            "tts-cpp: --vulkan-device " + std::to_string(requested) +
+            " is reserved (only -1 means auto-pick)");
+    }
+    if (requested == -1) {
+        bool any_discrete = false;
+        if (!is_uma_per_device.empty()) {
+            for (bool u : is_uma_per_device) {
+                if (!u) { any_discrete = true; break; }
+            }
+        }
+        int    best_idx  = 0;
+        size_t best_vram = 0;
+        bool   first     = true;
+        for (int i = 0; i < dev_count; ++i) {
+            if (any_discrete && is_uma_per_device[(size_t) i]) continue;
+            if (first || free_vram_per_device[(size_t) i] > best_vram) {
+                best_idx  = i;
+                best_vram = free_vram_per_device[(size_t) i];
+                first     = false;
+            }
+        }
+        return best_idx;
+    }
+    if (requested >= dev_count) {
+        throw std::runtime_error(
+            "tts-cpp: --vulkan-device " + std::to_string(requested) +
+            " out of range (visible Vulkan adapters: " +
+            std::to_string(dev_count) + ")");
+    }
+    return requested;
+}
+
 } // namespace
 
 void set_backends_directory(const std::string & dir) {
@@ -295,7 +360,8 @@ bool is_qualcomm_adreno(const char * name, const char * desc) {
 // The registry walk reaches the same backends in both modes.
 ggml_backend_t init_gpu_backend(int n_gpu_layers,
                                 bool verbose,
-                                const char * log_prefix) {
+                                const char * log_prefix,
+                                int vulkan_device) {
     if (n_gpu_layers <= 0) return nullptr;
     if (!log_prefix) log_prefix = "tts-cpp";
 
@@ -312,6 +378,13 @@ ggml_backend_t init_gpu_backend(int n_gpu_layers,
     std::vector<Cand> opencl_other; // Non-Adreno OpenCL (e.g. desktop)
     int max_adreno_version = -1;
 
+    // QVAC-18605 — track every visible Vulkan adapter so we can apply
+    // the round-12 device-selection policy (vulkan_device index +
+    // free-VRAM auto-pick with UMA bias) before draining the bucket.
+    std::vector<Cand>   vulkan_devs;
+    std::vector<size_t> vulkan_free_vram;
+    std::vector<bool>   vulkan_is_uma;
+
     const size_t n_dev = ggml_backend_dev_count();
     for (size_t i = 0; i < n_dev; ++i) {
         ggml_backend_dev_t dev = ggml_backend_dev_get(i);
@@ -325,6 +398,26 @@ ggml_backend_t init_gpu_backend(int n_gpu_layers,
         const char * desc     = ggml_backend_dev_description(dev);
         const char * reg_name = dev_reg_name(dev);
         const bool   is_opencl = reg_name && std::strcmp(reg_name, "OpenCL") == 0;
+        const bool   is_vulkan = reg_name && std::strcmp(reg_name, "Vulkan") == 0;
+
+        if (is_vulkan) {
+            size_t free = 0, total = 0;
+            ggml_backend_dev_memory(dev, &free, &total);
+            vulkan_devs.push_back({dev, name, desc, reg_name});
+            vulkan_free_vram.push_back(free);
+            vulkan_is_uma.push_back(type == GGML_BACKEND_DEVICE_TYPE_IGPU);
+            if (verbose && vulkan_device == -1) {
+                fprintf(stderr,
+                        "%s: vulkan device %d: %s — free %.0f MB / total %.0f MB%s\n",
+                        log_prefix,
+                        (int) (vulkan_devs.size() - 1),
+                        desc && *desc ? desc : (name && *name ? name : "unknown"),
+                        (double) free  / (1024.0 * 1024.0),
+                        (double) total / (1024.0 * 1024.0),
+                        type == GGML_BACKEND_DEVICE_TYPE_IGPU
+                            ? " [UMA — biased against on hybrid machines]" : "");
+            }
+        }
 
 #if defined(__ANDROID__)
         // Android GPU allowlist: only Qualcomm Adreno is validated for the
@@ -409,10 +502,97 @@ ggml_backend_t init_gpu_backend(int n_gpu_layers,
         return nullptr;
     };
 
+    // QVAC-18605 — when a `vulkan_device` override or auto-pick is
+    // requested AND at least one Vulkan adapter is visible, resolve
+    // the chosen Vulkan adapter and move it to the front of
+    // `other_gpu` so `try_init` picks it first.
+    //
+    //   `vulkan_device == 0` (default): tier policy unchanged.
+    //   `vulkan_device == -1`         : auto-pick across Vulkan
+    //                                   adapters; tier policy
+    //                                   unchanged (user asked for
+    //                                   "best Vulkan device", not
+    //                                   "must be Vulkan over OpenCL").
+    //   `vulkan_device  >  0`         : explicit override.  User
+    //                                   asked for Vulkan device N
+    //                                   specifically, so honour it
+    //                                   by trying `other_gpu`
+    //                                   BEFORE `opencl_adreno_700plus`
+    //                                   below — otherwise on a
+    //                                   Snapdragon device that
+    //                                   exposes both backends, the
+    //                                   OpenCL-Adreno tier would
+    //                                   silently shadow the override.
+    //
+    // PR #31 review comment 3355973146: guard on `!vulkan_devs.empty()`
+    // so a `vulkan_device != 0` config doesn't abort `init_gpu_backend`
+    // on a no-Vulkan machine (Metal-only Mac, CUDA-only Linux,
+    // Adreno-OpenCL-only Snapdragon) — without the guard,
+    // `pick_vulkan_device_index` would throw on the empty device list
+    // and prevent the tier policy from falling through to the
+    // available non-Vulkan backend.
+    bool vulkan_override_wins_tier_policy = false;
+    if (vulkan_device != 0 && !vulkan_devs.empty()) {
+        const int chosen = pick_vulkan_device_index(vulkan_device,
+                                                    vulkan_free_vram,
+                                                    vulkan_is_uma);
+        const ggml_backend_dev_t chosen_dev = vulkan_devs[(size_t) chosen].dev;
+        auto it = std::find_if(other_gpu.begin(), other_gpu.end(),
+                                [&](const Cand & c) { return c.dev == chosen_dev; });
+        if (it != other_gpu.end()) {
+            Cand c = *it;
+            other_gpu.erase(it);
+            other_gpu.insert(other_gpu.begin(), c);
+        }
+        // Explicit non-auto override (`vulkan_device > 0`) means the
+        // operator deliberately selected Vulkan; surface that to the
+        // tier dispatch below so the OpenCL-Adreno preference doesn't
+        // silently win on Snapdragon-class devices.
+        if (vulkan_device > 0) vulkan_override_wins_tier_policy = true;
+        if (verbose) {
+            const Cand & c = vulkan_devs[(size_t) chosen];
+            const char * label = c.desc && *c.desc ? c.desc :
+                                 (c.name && *c.name ? c.name : "unknown");
+            if (vulkan_device == -1) {
+                bool any_discrete = false;
+                for (bool u : vulkan_is_uma) {
+                    if (!u) { any_discrete = true; break; }
+                }
+                fprintf(stderr,
+                    "%s: auto-picked Vulkan device %d (%s) — most free VRAM of %d adapter(s)%s\n",
+                    log_prefix, chosen, label,
+                    (int) vulkan_devs.size(),
+                    any_discrete ? " (round-12 UMA bias)" : "");
+            } else {
+                fprintf(stderr,
+                    "%s: using Vulkan device %d (%s) per --vulkan-device override\n",
+                    log_prefix, chosen, label);
+            }
+        }
+    } else if (vulkan_device != 0 && vulkan_devs.empty() && verbose) {
+        // Override requested but no Vulkan adapter present — log and
+        // fall through to the tier policy so the available GPU
+        // (CUDA / Metal / Adreno-OpenCL) still gets used.
+        fprintf(stderr,
+            "%s: vulkan_device=%d requested but no Vulkan adapter visible; "
+            "falling through to the tier policy\n",
+            log_prefix, vulkan_device);
+    }
+
+    // Tier dispatch.  When the operator pinned a specific Vulkan
+    // adapter via `vulkan_device > 0`, that explicit choice outranks
+    // the OpenCL-Adreno tier preference (review comment 3355995666):
+    // the user wants Vulkan, give them Vulkan.  Otherwise the tier
+    // policy is unchanged.
+    if (vulkan_override_wins_tier_policy) {
+        if (ggml_backend_t b = try_init(other_gpu)) return b;
+    }
     if (!opencl_adreno_700plus.empty()) {
         if (ggml_backend_t b = try_init(opencl_adreno_700plus)) return b;
     }
-    if (ggml_backend_t b = try_init(other_gpu)) return b;
+    if (!vulkan_override_wins_tier_policy) {
+        if (ggml_backend_t b = try_init(other_gpu)) return b;
+    }
     if (ggml_backend_t b = try_init(opencl_other)) return b;
 
     if (verbose) {
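
Reviewer note on the tier dispatch above: the effect of `vulkan_device` on the order in which the candidate buckets are tried can be sketched in isolation. This is a minimal illustration and not code from the patch. `dispatch_order` and its boolean parameters are hypothetical names, the bucket names are taken from the diff, and the sketch assumes the requested index was in range (the real code throws earlier otherwise).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the bucket names in the order the
// dispatch would try them.  Mirrors the rule in the hunk above: only an
// explicit `vulkan_device > 0` with at least one visible Vulkan adapter
// promotes `other_gpu` ahead of the OpenCL-Adreno tier.
std::vector<std::string> dispatch_order(int  vulkan_device,
                                        bool have_vulkan,
                                        bool have_adreno_700plus) {
    const bool override_wins = vulkan_device > 0 && have_vulkan;

    std::vector<std::string> order;
    if (override_wins)       order.push_back("other_gpu");
    if (have_adreno_700plus) order.push_back("opencl_adreno_700plus");
    if (!override_wins)      order.push_back("other_gpu");
    order.push_back("opencl_other");
    return order;
}
```

Under these assumptions, `dispatch_order(1, true, true)` puts `other_gpu` first, while `0`, `-1`, or a machine without Vulkan keep the original order.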
diff --git a/tts-cpp/src/backend_selection.h b/tts-cpp/src/backend_selection.h
index 7054cb7273c..4ab05cc7585 100644
--- a/tts-cpp/src/backend_selection.h
+++ b/tts-cpp/src/backend_selection.h
@@ -55,9 +55,25 @@ void ensure_backends_loaded();
 // so the existing user-visible logs in the three init sites stay
 // distinguishable; verbose=false suppresses everything except hard
 // errors.
+//
+// `vulkan_device` selects which Vulkan adapter to prefer when more
+// than one is visible in the registry (QVAC-18605 round 3 / 12):
+//   - 0 (default): first Vulkan adapter in registry order.
+//   - N > 0      : the Nth Vulkan adapter (0-indexed); throws on out
+//                  of range so a CLI typo fails loud instead of
+//                  silently falling through to CPU.
+//   - -1         : auto-pick: argmax(free VRAM), with a UMA bias
+//                  that excludes integrated-GPU adapters whenever at
+//                  least one discrete adapter is also visible (avoids
+//                  the iGPU's UMA-reported system RAM dwarfing the
+//                  discrete's true VRAM and silently stealing the
+//                  pick on hybrid desktops/laptops).
+// Ignored when no Vulkan adapter is visible or the chosen backend is
+// non-Vulkan; with one adapter, any N > 0 is out of range and throws.
 ggml_backend_t init_gpu_backend(int n_gpu_layers,
                                 bool verbose,
-                                const char * log_prefix);
+                                const char * log_prefix,
+                                int vulkan_device = 0);
 
 // Convenience wrapper that picks up the registered CPU device and
 // returns its init handle. Mirrors parakeet-cpp's
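
Reviewer note on the `vulkan_device` contract documented above: the auto-pick rule (argmax of free VRAM, skipping UMA adapters whenever a discrete one is visible) can be sketched as a stand-alone function. This is an illustration under stated assumptions, not the patch's `pick_vulkan_device_index`. The name `pick_adapter`, the exception types, and the error text are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical mirror of the documented policy:
//   requested == -1 : most free VRAM, ignoring UMA (integrated) adapters
//                     when at least one discrete adapter is present
//   requested >= 0  : explicit index, range-checked
int pick_adapter(int requested,
                 const std::vector<size_t> & free_vram,
                 const std::vector<bool>   & is_uma) {
    assert(free_vram.size() == is_uma.size());
    const int n = (int) free_vram.size();
    if (n == 0) throw std::runtime_error("no adapters visible");

    if (requested == -1) {
        bool any_discrete = false;
        for (bool u : is_uma) {
            if (!u) { any_discrete = true; break; }
        }
        int best = -1;
        for (int i = 0; i < n; ++i) {
            if (any_discrete && is_uma[(size_t) i]) continue;
            // Strict '>' keeps the first adapter on ties.
            if (best < 0 || free_vram[(size_t) i] > free_vram[(size_t) best]) best = i;
        }
        return best;
    }
    if (requested < 0 || requested >= n) {
        throw std::out_of_range("adapter index " + std::to_string(requested) +
                                " out of range (visible adapters: " +
                                std::to_string(n) + ")");
    }
    return requested;
}
```

On a hybrid machine where the integrated adapter reports more free memory than the discrete one, the sketch still returns the discrete adapter's index, which is the case the UMA bias exists for.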
diff --git a/tts-cpp/src/backend_util.h b/tts-cpp/src/backend_util.h
index 2eb8a966ac3..e21dfa7ab00 100644
--- a/tts-cpp/src/backend_util.h
+++ b/tts-cpp/src/backend_util.h
@@ -39,6 +39,10 @@ inline bool backend_is_metal(ggml_backend_t b) {
     return std::strcmp(backend_reg_name(b), "Metal") == 0;
 }
 
+inline bool backend_is_vulkan(ggml_backend_t b) {
+    return std::strcmp(backend_reg_name(b), "Vulkan") == 0;
+}
+
 inline void backend_set_n_threads(ggml_backend_t b, int n_threads) {
     if (!b || n_threads <= 0) return;
     ggml_backend_dev_t dev = ggml_backend_get_device(b);
diff --git a/tts-cpp/src/chatterbox_cli.cpp b/tts-cpp/src/chatterbox_cli.cpp
index 20ec0ee5d34..53716102253 100644
--- a/tts-cpp/src/chatterbox_cli.cpp
+++ b/tts-cpp/src/chatterbox_cli.cpp
@@ -367,6 +367,42 @@ struct cli_params {
     int32_t     supertonic_steps = 0;
     float       supertonic_speed = 0.0f;
     std::string supertonic_noise_npy;
+    // Vector-estimator F16 K/V flash-attention dispatch.  -1 = auto
+    // (on GPU, off on CPU); 0 / 1 force the setting.  Maps onto
+    // EngineOptions::f16_attn.  See `--f16-attn` flag below.
+    int32_t     supertonic_f16_attn = -1;
+    // Load-time F16 materialization for the audit-identified hot
+    // matmul / pwconv weights (Phase 2A).  -1 = auto / 0 / 1 force.
+    // Maps onto EngineOptions::f16_weights.
+    int32_t     supertonic_f16_weights = -1;
+    // QVAC-18605 — Vulkan adapter index.  Default 0 (the historical
+    // hard-coded value).  Maps onto EngineOptions::vulkan_device.
+    // Range-checked at GGUF load against
+    // `ggml_backend_vk_get_device_count()`; an out-of-range value
+    // throws (no silent CPU fallback).  Has no effect on builds
+    // compiled without `GGML_VULKAN` or when `--n-gpu-layers 0`.
+    int32_t     supertonic_vulkan_device = 0;
+    // QVAC-18605 follow-up — first-synth pre-warm text.  Empty
+    // disables.  Maps onto EngineOptions::prewarm_text.  Auto no-op
+    // on CPU backends.
+    std::string supertonic_prewarm_text;
+    // QVAC-18605 round 6 — comma-separated extra deny-list of
+    // substring patterns.  Empty default → zero behaviour change.
+    // Maps onto EngineOptions::f16_weights_deny_list (after
+    // comma-splitting).
+    std::vector<std::string> supertonic_f16_weights_deny_list;
+    // QVAC-18605 round 4 — multi-dtype K/V flash-attention dispatch.
+    // -1 = auto (falls back to --f16-attn for back-compat); 0=f32,
+    // 1=f16, 2=bf16, 3=q8_0.  Maps onto EngineOptions::kv_attn_type.
+    // Probe-gated graceful fallback to f32 on adapters that don't
+    // support the requested dtype.
+    int32_t     supertonic_kv_attn_type = -1;
+    // QVAC-18605 round 7 — Vulkan env-var overrides applied via
+    // `apply_vulkan_env_overrides` just before backend init.
+    // Operator-set env vars in the shell still WIN over these
+    // (set_env_if_unset semantics).  Maps onto
+    // EngineOptions::vulkan_env_overrides.
+    std::map<std::string, std::string> supertonic_vulkan_env_overrides;
     bool        has_supertonic_options = false;
 
     // Streaming synthesis (PROGRESS.md B1).  When > 0, speech tokens from
@@ -501,6 +537,49 @@ static void print_usage(const char * argv0) {
     fprintf(stderr, "  --steps N               Denoising steps. Defaults to GGUF metadata.\n");
     fprintf(stderr, "  --speed X               Duration speed multiplier. Defaults to GGUF metadata.\n");
     fprintf(stderr, "  --noise-npy PATH        Fixed initial noise tensor for parity/debug runs.\n");
+    fprintf(stderr, "  --f16-attn 0|1          Vector-estimator F16 K/V flash-attention.  Defaults\n");
+    fprintf(stderr, "                          to auto (on for GPU/OpenCL, off for CPU).  Triggers\n");
+    fprintf(stderr, "                          the OpenCL `flash_attn_f32_f16` kernel on Adreno;\n");
+    fprintf(stderr, "                          see PROGRESS_SUPERTONIC.md OpenCL section.\n");
+    fprintf(stderr, "  --kv-attn-type DTYPE    Vector-estimator multi-dtype K/V flash-attn dispatch.\n");
+    fprintf(stderr, "                          DTYPE in {auto,f32,f16,bf16,q8_0}.  Default auto:\n");
+    fprintf(stderr, "                          falls back to --f16-attn for backwards-compat.\n");
+    fprintf(stderr, "                          bf16 needs Vulkan coopmat2 (NVIDIA Ampere+ / RDNA3+);\n");
+    fprintf(stderr, "                          q8_0 halves the K/V upload bandwidth on Vulkan.\n");
+    fprintf(stderr, "                          Probe-gated graceful fallback to f32 on miss.\n");
+    fprintf(stderr, "  --f16-weights 0|1       Load-time F16 materialization for the hot matmul /\n");
+    fprintf(stderr, "                          pwconv weights identified by the audit.  Defaults\n");
+    fprintf(stderr, "                          to auto (on for GPU, off for CPU).  Halves the GPU\n");
+    fprintf(stderr, "                          read bandwidth into those ops with a small (~2e-3)\n");
+    fprintf(stderr, "                          numerical drift on the end-to-end synth.\n");
+    fprintf(stderr, "  --f16-weights-deny PAT1,PAT2,...   Comma-separated substring patterns; matching\n");
+    fprintf(stderr, "                          tensors stay F32 even when --f16-weights is on.\n");
+    fprintf(stderr, "                          Layered on top of the curated allow-list.  Empty\n");
+    fprintf(stderr, "                          entries are skipped defensively (config-typo guard).\n");
+    fprintf(stderr, "                          Default empty (zero behaviour change).\n");
+    fprintf(stderr, "  --vulkan-device N       Vulkan adapter index.  Default 0; -1 = auto-pick\n");
+    fprintf(stderr, "                          adapter with most free VRAM (multi-GPU machines).\n");
+    fprintf(stderr, "                          Has no effect unless built with -DGGML_VULKAN=ON\n");
+    fprintf(stderr, "                          and used with --n-gpu-layers > 0.  Range-checked at\n");
+    fprintf(stderr, "                          load time; an out-of-range value is a hard error\n");
+    fprintf(stderr, "                          (no silent CPU fallback).  See PROGRESS_SUPERTONIC.md\n");
+    fprintf(stderr, "                          \"Vulkan bring-up\" section for the supported-op matrix.\n");
+    fprintf(stderr, "  --vulkan-prefer-host-memory    Sets GGML_VK_PREFER_HOST_MEMORY=1.  Triage knob.\n");
+    fprintf(stderr, "  --vulkan-disable-coopmat2      Sets GGML_VK_DISABLE_COOPMAT2=1.  Useful for A/B-ing\n");
+    fprintf(stderr, "                                 the BF16 K/V dispatch path on coopmat2-capable adapters.\n");
+    fprintf(stderr, "  --vulkan-disable-bfloat16      Sets GGML_VK_DISABLE_BFLOAT16=1.  Forces F16 fallback\n");
+    fprintf(stderr, "                                 even when --kv-attn-type bf16 is requested.\n");
+    fprintf(stderr, "  --vulkan-perf-logger           Sets GGML_VK_PERF_LOGGER=1.  Enables ggml-vulkan's\n");
+    fprintf(stderr, "                                 per-shader timing output (verbose; for triage only).\n");
+    fprintf(stderr, "  --vulkan-async-transfer        Sets GGML_VK_ASYNC_USE_TRANSFER_QUEUE=1.\n");
+    fprintf(stderr, "  --vulkan-env KEY=VALUE         Set arbitrary GGML_VK_* env var.  May be repeated.\n");
+    fprintf(stderr, "                                 Operator-set env vars in the shell STILL win over\n");
+    fprintf(stderr, "                                 these CLI overrides (set_env_if_unset semantics).\n");
+    fprintf(stderr, "  --prewarm TEXT          Run one throwaway synth on TEXT at engine\n");
+    fprintf(stderr, "                          construction so first-real-call latency on Vulkan /\n");
+    fprintf(stderr, "                          OpenCL doesn't pay the shader-compile cost (~hundreds\n");
+    fprintf(stderr, "                          of ms cold start on Adreno + RADV per chatterbox\n");
+    fprintf(stderr, "                          PROGRESS.md).  No-op on CPU backends.\n");
     fprintf(stderr, "\n");
     fprintf(stderr, "  --stream-chunk-tokens N Synthesize the wav in streaming chunks of N speech\n");
     fprintf(stderr, "                          tokens each (~1 s audio per 25-token chunk).  With\n");
@@ -642,6 +721,59 @@ static bool parse_args(int argc, char ** argv, cli_params & params) {
         else if (arg == "--steps")          { if (!parse_int  ("--steps",          params.supertonic_steps)) return false; params.has_supertonic_options = true; }
         else if (arg == "--speed")          { if (!parse_float("--speed",          params.supertonic_speed)) return false; params.has_supertonic_options = true; }
         else if (arg == "--noise-npy")      { auto v = next("--noise-npy");      if (!v) return false; params.supertonic_noise_npy = v; params.has_supertonic_options = true; }
+        else if (arg == "--f16-attn")       { if (!parse_int  ("--f16-attn",       params.supertonic_f16_attn)) return false; params.has_supertonic_options = true; }
+        else if (arg == "--f16-weights")    { if (!parse_int  ("--f16-weights",    params.supertonic_f16_weights)) return false; params.has_supertonic_options = true; }
+        else if (arg == "--f16-weights-deny") {
+            // Comma-split.  Empty entries tolerated; the predicate
+            // skips them.  Tracked as a supertonic-option so the
+            // model-arch-detection branch in main() routes
+            // correctly.
+            auto v = next("--f16-weights-deny"); if (!v) return false;
+            params.supertonic_f16_weights_deny_list.clear();
+            const std::string raw = v;
+            size_t start = 0;
+            for (size_t k = 0; k <= raw.size(); ++k) {
+                if (k == raw.size() || raw[k] == ',') {
+                    params.supertonic_f16_weights_deny_list.emplace_back(raw.substr(start, k - start));
+                    start = k + 1;
+                }
+            }
+            params.has_supertonic_options = true;
+        }
+        else if (arg == "--vulkan-device")  { if (!parse_int  ("--vulkan-device",  params.supertonic_vulkan_device)) return false; params.has_supertonic_options = true; }
+        else if (arg == "--kv-attn-type") {
+            auto v = next("--kv-attn-type"); if (!v) return false;
+            const std::string s = v;
+            if      (s == "auto") params.supertonic_kv_attn_type = -1;
+            else if (s == "f32")  params.supertonic_kv_attn_type = 0;
+            else if (s == "f16")  params.supertonic_kv_attn_type = 1;
+            else if (s == "bf16") params.supertonic_kv_attn_type = 2;
+            else if (s == "q8_0") params.supertonic_kv_attn_type = 3;
+            else {
+                fprintf(stderr,
+                    "error: --kv-attn-type expects one of: auto, f32, f16, bf16, q8_0 (got: %s)\n",
+                    s.c_str());
+                return false;
+            }
+            params.has_supertonic_options = true;
+        }
+        else if (arg == "--prewarm")        { auto v = next("--prewarm");        if (!v) return false; params.supertonic_prewarm_text = v; params.has_supertonic_options = true; }
+        else if (arg == "--vulkan-prefer-host-memory") { params.supertonic_vulkan_env_overrides["GGML_VK_PREFER_HOST_MEMORY"]      = "1"; params.has_supertonic_options = true; }
+        else if (arg == "--vulkan-disable-coopmat2")   { params.supertonic_vulkan_env_overrides["GGML_VK_DISABLE_COOPMAT2"]        = "1"; params.has_supertonic_options = true; }
+        else if (arg == "--vulkan-disable-bfloat16")   { params.supertonic_vulkan_env_overrides["GGML_VK_DISABLE_BFLOAT16"]        = "1"; params.has_supertonic_options = true; }
+        else if (arg == "--vulkan-perf-logger")        { params.supertonic_vulkan_env_overrides["GGML_VK_PERF_LOGGER"]             = "1"; params.has_supertonic_options = true; }
+        else if (arg == "--vulkan-async-transfer")     { params.supertonic_vulkan_env_overrides["GGML_VK_ASYNC_USE_TRANSFER_QUEUE"]= "1"; params.has_supertonic_options = true; }
+        else if (arg == "--vulkan-env") {
+            auto v = next("--vulkan-env"); if (!v) return false;
+            const std::string raw = v;
+            const auto eq = raw.find('=');
+            if (eq == std::string::npos || eq == 0) {
+                fprintf(stderr, "error: --vulkan-env expects KEY=VALUE (got: %s)\n", raw.c_str());
+                return false;
+            }
+            params.supertonic_vulkan_env_overrides[raw.substr(0, eq)] = raw.substr(eq + 1);
+            params.has_supertonic_options = true;
+        }
         else if (arg == "--cfm-f16-kv-attn") { params.cfm_f16_kv_attn = true; }
         else if (arg == "--max-sentence-chars") { if (!parse_int("--max-sentence-chars", params.max_sentence_chars)) return false; }
         else if (arg == "--no-auto-split")  { params.max_sentence_chars = 0; }
@@ -835,7 +967,14 @@ static int run_supertonic_cli_path(const cli_params & params) {
     if (params.seed_set) opts.seed = params.seed;
     opts.n_threads = params.n_threads;
     opts.n_gpu_layers = params.n_gpu_layers;
+    opts.f16_attn = params.supertonic_f16_attn;
+    opts.f16_weights = params.supertonic_f16_weights;
+    opts.vulkan_device = params.supertonic_vulkan_device;
+    opts.prewarm_text = params.supertonic_prewarm_text;
     opts.noise_npy_path = params.supertonic_noise_npy;
+    opts.f16_weights_deny_list = params.supertonic_f16_weights_deny_list;
+    opts.kv_attn_type = params.supertonic_kv_attn_type;
+    opts.vulkan_env_overrides = params.supertonic_vulkan_env_overrides;
 
     auto result = tts_cpp::supertonic::synthesize(opts, params.text);
     stream_write_wav(params.out_wav, result.pcm, result.sample_rate);
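
Reviewer note on the two small parsers added to `parse_args` above (`--f16-weights-deny` and `--vulkan-env`): their edge-case behaviour is easy to check in isolation. The sketch below restates the same logic as free functions. `parse_key_value` is a hypothetical name, and `split_csv` follows the lambda of the same name in the bench diff.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Comma-split that keeps empty entries, as the CLI code does; the
// consumer is expected to skip them.
std::vector<std::string> split_csv(const std::string & s) {
    std::vector<std::string> out;
    size_t start = 0;
    for (size_t i = 0; i <= s.size(); ++i) {
        if (i == s.size() || s[i] == ',') {
            out.emplace_back(s.substr(start, i - start));
            start = i + 1;
        }
    }
    return out;
}

// KEY=VALUE split on the first '='.  Rejects a missing '=' and an
// empty key; an empty value is accepted.
bool parse_key_value(const std::string & raw,
                     std::string & key,
                     std::string & value) {
    const auto eq = raw.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    key   = raw.substr(0, eq);
    value = raw.substr(eq + 1);
    return true;
}
```

Note that an empty argument yields a single empty entry rather than an empty list, and that everything after the first `=` (including further `=` characters) ends up in the value.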
diff --git a/tts-cpp/src/supertonic_bench.cpp b/tts-cpp/src/supertonic_bench.cpp
index c7ba619e7fd..eb072c96bdc 100644
--- a/tts-cpp/src/supertonic_bench.cpp
+++ b/tts-cpp/src/supertonic_bench.cpp
@@ -16,8 +16,14 @@
 //       --text "..." [--voice M1] [--language en] [--steps 5] [--speed 1.05] \
 //       [--seed 42] [--noise-npy noise.npy] [--runs 5] [--warmup 1] [--json-out result.json]
 
+#include "backend_selection.h"
 #include "supertonic_internal.h"
 #include "npy.h"
+// Vulkan adapter description in the bench backend annotator is now
+// resolved through the registry API
+// (`ggml_backend_get_device` + `ggml_backend_dev_description`); no
+// hard dep on the per-backend `ggml-vulkan.h` header / static
+// `ggml_backend_vk_get_device_description` entry point.
 
 #include <algorithm>
 #include <chrono>
@@ -27,6 +33,7 @@
 #include <fstream>
 #include <random>
 #include <stdexcept>
+#include <map>
 #include <string>
 #include <vector>
 
@@ -45,10 +52,53 @@ void usage(const char * argv0) {
         "usage: %s --model supertonic2.gguf --text TEXT\n"
         "          [--voice M1] [--language en] [--steps 5] [--speed 1.05]\n"
         "          [--seed 42] [--noise-npy /path/to/noise.npy]\n"
-        "          [--runs 5] [--warmup 1] [--threads N] [--json-out FILE]\n",
+        "          [--runs 5] [--warmup 1] [--threads N] [--n-gpu-layers N]\n"
+        "          [--vulkan-device N] (-1 = auto-pick adapter with most free VRAM)\n"
+        "          [--f16-attn 0|1] [--f16-weights 0|1]\n"
+        "          [--precision f32|f16|q8_0]   (default: f32)\n"
+        "          [--kv-attn-type auto|f32|f16|bf16|q8_0]\n"
+        "                            (multi-dtype K/V flash-attn dispatch; generalises\n"
+        "                            --f16-attn.  default auto: falls back to --f16-attn.\n"
+        "                            bf16/q8_0 require Vulkan adapter support; silent\n"
+        "                            fallback to f32 on probe miss.)\n"
+        "          [--f16-weights-deny PATTERN1,PATTERN2,...] (substring patterns,\n"
+        "                            comma-separated; matching tensors stay F32 even\n"
+        "                            when --f16-weights is on.  Layered on top of the\n"
+        "                            curated allow-list.  Default empty.)\n"
+        "          [--prewarm TEXT] (one cold-start synth before timed loop;\n"
+        "                            independent of --warmup; CPU is no-op)\n"
+        "          [--vulkan-prefer-host-memory]    (sets GGML_VK_PREFER_HOST_MEMORY=1)\n"
+        "          [--vulkan-disable-coopmat2]      (sets GGML_VK_DISABLE_COOPMAT2=1)\n"
+        "          [--vulkan-disable-bfloat16]      (sets GGML_VK_DISABLE_BFLOAT16=1)\n"
+        "          [--vulkan-perf-logger]           (sets GGML_VK_PERF_LOGGER=1)\n"
+        "          [--vulkan-async-transfer]        (sets GGML_VK_ASYNC_USE_TRANSFER_QUEUE=1)\n"
+        "          [--vulkan-env KEY=VALUE]         (set arbitrary GGML_VK_* env var; may repeat)\n"
+        "          [--no-bench-sync]   (skip ggml_backend_synchronize at stage boundaries;\n"
+        "                               sync is on by default for accurate per-stage timing on Vulkan)\n"
+        "          [--bench-per-step]  (time each denoise step individually so the first-step\n"
+        "                               cold-pipeline cost is distinguished from steady-state)\n"
+        "          [--json-out FILE]\n",
         argv0);
 }
 
+tts_cpp::supertonic::detail::supertonic_precision parse_bench_precision(const std::string & s) {
+    using P = tts_cpp::supertonic::detail::supertonic_precision;
+    if (s == "f32" || s == "F32") return P::F32;
+    if (s == "f16" || s == "F16") return P::F16;
+    if (s == "q8_0" || s == "Q8_0" || s == "q8") return P::Q8_0;
+    throw std::runtime_error("unknown --precision value: " + s + " (expected f32|f16|q8_0)");
+}
+
+const char * precision_to_string(tts_cpp::supertonic::detail::supertonic_precision p) {
+    using P = tts_cpp::supertonic::detail::supertonic_precision;
+    switch (p) {
+        case P::F32:  return "f32";
+        case P::F16:  return "f16";
+        case P::Q8_0: return "q8_0";
+    }
+    return "f32";
+}
+
 double percentile(std::vector<double> v, double p) {
     if (v.empty()) return 0.0;
     std::sort(v.begin(), v.end());
@@ -116,6 +166,70 @@ int main(int argc, char ** argv) {
     int runs = 5;
     int warmup = 1;
     int n_threads = 0;
+    int n_gpu_layers = 0;
+    // -1 = auto (GPU on, CPU off); 0/1 to force.  See model.use_f16_attn.
+    int f16_attn = -1;
+    // Phase 2A — F16 load-time materialization of the hot matmul /
+    // pwconv weights.  -1 auto / 0 / 1 force.
+    int f16_weights = -1;
+    supertonic_precision precision = supertonic_precision::F32;
+    // QVAC-18605 — Vulkan adapter index.  Default 0 (the historical
+    // hard-coded value in `init_supertonic_backend`).  Range-checked
+    // at GGUF load against `ggml_backend_vk_get_device_count()`; an
+    // out-of-range value is a hard error.
+    int vulkan_device = 0;
+    // QVAC-18605 follow-up — first-synth pre-warm.  When non-empty,
+    // a throwaway synth on `prewarm_text` runs after model load + before
+    // the timed runs, forcing every per-stage GPU graph cache + shader
+    // pipeline to populate up-front.  No-op on CPU backends.  Note that
+    // bench's existing `--warmup N` flag is independent: it discards
+    // the first N timed runs from the median, but it doesn't avoid the
+    // shader-compile hit on the first warmup run.  `--prewarm TEXT`
+    // does, so the first warmup run reflects actual steady-state warm
+    // time rather than the cold-start outlier.
+    std::string prewarm_text;
+    // QVAC-18605 round 6 — comma-separated list of substring patterns
+    // that force matching tensors to stay F32 even when --f16-weights
+    // is on.  Layered on top of the curated allow-list in
+    // `should_materialise_f16_weight()`.  Default empty (zero
+    // behaviour change for every existing bench invocation).
+    std::vector<std::string> f16_weights_deny_list;
+    // QVAC-18605 round 4 — multi-dtype K/V flash-attn dispatch.
+    // -1 = auto (falls back to --f16-attn for back-compat); 0=f32,
+    // 1=f16, 2=bf16, 3=q8_0.  Probe-gated graceful fallback to f32
+    // on adapters that don't support the requested dtype.
+    int kv_attn_type = -1;
+    // QVAC-18605 round 7 — Vulkan env-var overrides applied via
+    // `apply_vulkan_env_overrides` BEFORE `init_supertonic_backend`.
+    std::map<std::string, std::string> vulkan_env_overrides;
+    // QVAC-18605 round 7 — bench observability flags.
+    //
+    // `bench_sync` (default true) inserts an explicit
+    // `ggml_backend_synchronize` at every per-stage boundary so
+    // the wall-clock attributes to the right stage on async
+    // backends (Vulkan / OpenCL).  Cheap on CPU (no-op).
+    // `--no-bench-sync` opts out for the rare case the operator
+    // wants to observe pipelined / overlapped behaviour.
+    //
+    // `bench_per_step` (default false) times each
+    // `supertonic_vector_step_ggml` call individually so the
+    // first-step (cold pipelines) cost can be distinguished from
+    // steady-state.  Adds an extra stage column per step in the
+    // human output and a `vector_step_ms` array in the JSON.
+    bool bench_sync     = true;
+    bool bench_per_step = false;
+
+    auto split_csv = [](const std::string & s) {
+        std::vector<std::string> out;
+        size_t start = 0;
+        for (size_t i = 0; i <= s.size(); ++i) {
+            if (i == s.size() || s[i] == ',') {
+                out.emplace_back(s.substr(start, i - start));
+                start = i + 1;
+            }
+        }
+        return out;
+    };
 
     for (int i = 1; i < argc; ++i) {
         std::string a = argv[i];
@@ -134,18 +248,102 @@ int main(int argc, char ** argv) {
         else if (a == "--runs") runs = std::stoi(next("--runs"));
         else if (a == "--warmup") warmup = std::stoi(next("--warmup"));
         else if (a == "--threads") n_threads = std::stoi(next("--threads"));
+        else if (a == "--n-gpu-layers") n_gpu_layers = std::stoi(next("--n-gpu-layers"));
+        else if (a == "--vulkan-device") vulkan_device = std::stoi(next("--vulkan-device"));
+        else if (a == "--prewarm") prewarm_text = next("--prewarm");
+        else if (a == "--f16-attn") f16_attn = std::stoi(next("--f16-attn"));
+        else if (a == "--f16-weights") f16_weights = std::stoi(next("--f16-weights"));
+        else if (a == "--precision") precision = parse_bench_precision(next("--precision"));
+        else if (a == "--f16-weights-deny") f16_weights_deny_list = split_csv(next("--f16-weights-deny"));
+        else if (a == "--kv-attn-type") {
+            const std::string v = next("--kv-attn-type");
+            if      (v == "auto") kv_attn_type = -1;
+            else if (v == "f32")  kv_attn_type = 0;
+            else if (v == "f16")  kv_attn_type = 1;
+            else if (v == "bf16") kv_attn_type = 2;
+            else if (v == "q8_0") kv_attn_type = 3;
+            else { fprintf(stderr,
+                "--kv-attn-type expects auto|f32|f16|bf16|q8_0 (got: %s)\n", v.c_str());
+                return 2; }
+        }
+        else if (a == "--vulkan-prefer-host-memory") vulkan_env_overrides["GGML_VK_PREFER_HOST_MEMORY"]      = "1";
+        else if (a == "--vulkan-disable-coopmat2")   vulkan_env_overrides["GGML_VK_DISABLE_COOPMAT2"]        = "1";
+        else if (a == "--vulkan-disable-bfloat16")   vulkan_env_overrides["GGML_VK_DISABLE_BFLOAT16"]        = "1";
+        else if (a == "--vulkan-perf-logger")        vulkan_env_overrides["GGML_VK_PERF_LOGGER"]             = "1";
+        else if (a == "--vulkan-async-transfer")     vulkan_env_overrides["GGML_VK_ASYNC_USE_TRANSFER_QUEUE"]= "1";
+        else if (a == "--vulkan-env") {
+            const std::string raw = next("--vulkan-env");
+            const auto eq = raw.find('=');
+            if (eq == std::string::npos || eq == 0) {
+                fprintf(stderr, "--vulkan-env expects KEY=VALUE (got: %s)\n", raw.c_str());
+                return 2;
+            }
+            vulkan_env_overrides[raw.substr(0, eq)] = raw.substr(eq + 1);
+        }
+        else if (a == "--no-bench-sync") bench_sync = false;
+        else if (a == "--bench-sync")    bench_sync = true;  // explicit on; default
+        else if (a == "--bench-per-step") bench_per_step = true;
         else if (a == "--json-out") json_out = next("--json-out");
         else if (a == "-h" || a == "--help") { usage(argv[0]); return 0; }
         else { fprintf(stderr, "unknown arg: %s\n", a.c_str()); usage(argv[0]); return 2; }
     }
     if (model_path.empty() || text.empty()) { usage(argv[0]); return 2; }
 
+    // QVAC-18605 round 7 — apply Vulkan env-var overrides BEFORE
+    // `load_supertonic_gguf` (which calls `init_supertonic_backend`,
+    // which is when ggml-vulkan reads its GGML_VK_* env vars).
+    // Throws on any non-`GGML_VK_` key (operator-config typo
+    // guard); we let the throw propagate to surface as an
+    // uncaught-exception backtrace, since bench is for operators
+    // who can read it (matches the legacy behaviour for `--vulkan-device
+    // abc` and similar).
+    apply_vulkan_env_overrides(vulkan_env_overrides);
+
     supertonic_model model;
-    if (!load_supertonic_gguf(model_path, model)) {
+    if (!load_supertonic_gguf(model_path, model, n_gpu_layers,
+                              /*verbose=*/false, f16_weights, precision,
+                              vulkan_device, f16_weights_deny_list)) {
         fprintf(stderr, "failed to load model\n");
         return 1;
     }
     supertonic_set_n_threads(model, n_threads);
+    // F16 K/V flash-attention dispatch: same auto policy as Engine
+    // (auto ⇒ on for GPU backends that pass the F16-K/V probe, off
+    // for CPU; user can force).  See `supertonic_backend_supports_f16_kv_flash_attn`
+    // in supertonic_gguf.cpp for the rationale (QVAC-18605).
+    if (f16_attn < 0) {
+        model.use_f16_attn = !model.backend_is_cpu &&
+                             supertonic_backend_supports_f16_kv_flash_attn(model.backend);
+    } else {
+        model.use_f16_attn = f16_attn != 0;
+    }
+    // QVAC-18605 round 4 — multi-dtype K/V dispatch resolution.
+    // Same plumbing as Engine::Impl ctor; out-of-range throws
+    // (caller surface).  Probes are advisory + cached.  PR #18
+    // reviewer (Omar) follow-up: surface explicit-request
+    // downgrades via stderr so the bench operator knows their
+    // `--kv-attn-type bf16` ran as f32 on an unsupported adapter
+    // (auto path stays silent).
+    bool kv_dtype_downgraded = false;
+    model.kv_attn_type = resolve_kv_attn_type(
+        kv_attn_type,
+        model.use_f16_attn,
+        supertonic_backend_supports_f16_kv_flash_attn(model.backend),
+        supertonic_backend_supports_bf16_kv_flash_attn(model.backend),
+        supertonic_backend_supports_q8_0_kv_flash_attn(model.backend),
+        &kv_dtype_downgraded);
+    if (kv_dtype_downgraded) {
+        static const char * const kv_label[] = {
+            "f32", "f16", "bf16", "q8_0"
+        };
+        fprintf(stderr,
+            "supertonic-bench: warning: requested --kv-attn-type %s but the "
+            "resolved backend's flash-attn probe rejected it; falling back to "
+            "f32 (set --kv-attn-type auto to silence)\n",
+            (kv_attn_type >= 0 && kv_attn_type <= 3)
+                ? kv_label[kv_attn_type] : "?");
+    }
+    model.use_f16_attn = (model.kv_attn_type == kv_attn_dtype::f16);
 
     auto vit = model.voices.find(voice);
     if (vit == model.voices.end()) {
@@ -176,17 +374,91 @@ int main(int argc, char ** argv) {
     Stage st_pre{"preprocess", {}};
     Stage st_dur{"duration", {}};
     Stage st_te {"text_encoder", {}};
-    Stage st_ve {"vector_estimator (5 step)", {}};
+    char st_ve_label[64];
+    std::snprintf(st_ve_label, sizeof(st_ve_label), "vector_estimator (%d step)", steps);
+    Stage st_ve {st_ve_label, {}};
     Stage st_voc{"vocoder", {}};
     Stage st_tot{"total", {}};
+    // QVAC-18605 round 7 — per-denoise-step breakdown.  Populated
+    // only when `--bench-per-step` is on; otherwise stays empty
+    // and is omitted from human + JSON output.  One Stage per
+    // step index (step 0 typically reflects cold-pipeline cost
+    // on Vulkan/OpenCL; steps 1+ reflect steady-state).
+    std::vector<Stage> st_ve_per_step;
+    if (bench_per_step) {
+        st_ve_per_step.reserve((size_t) steps);
+        for (int s = 0; s < steps; ++s) {
+            char lbl[64];
+            std::snprintf(lbl, sizeof(lbl), "  vector_step[%d]", s);
+            st_ve_per_step.push_back(Stage{lbl, {}});
+        }
+    }
     std::vector<double> rtfs;
     double last_audio_s = 0;
 
+    // QVAC-18605 round 7 — explicit backend sync at stage
+    // boundaries.  Cheap on CPU (returns immediately when no GPU
+    // work pending); on Vulkan / OpenCL ensures the next
+    // `clk::now()` reflects work-completed-by-the-prior-stage.
+    // No-op when `bench_sync` is false (operator opt-out).
+    auto bench_sync_now = [&]() {
+        if (bench_sync) ggml_backend_synchronize(model.backend);
+    };
+
+    // QVAC-18605 follow-up — first-synth pre-warm.
+    //
+    // Independent of the existing `--warmup N` flag.  `--warmup`
+    // discards the first N timed runs from the median; `--prewarm
+    // TEXT` runs ONE additional throwaway synth here, BEFORE the
+    // timed loop even starts, so the first warmup run reflects the
+    // post-shader-compile steady-state cost rather than the cold-
+    // start outlier.  No-op on CPU (no shader-compile cost to amortise)
+    // and on empty `--prewarm` (the operator didn't ask).
+    double prewarm_ms = 0.0;
+    if (!prewarm_text.empty() && !model.backend_is_cpu) {
+        auto pw_t0 = clk::now();
+        std::string pw_error;
+        std::vector<int32_t> pw_ids_i32;
+        std::string pw_norm;
+        if (supertonic_text_to_ids(model, prewarm_text, language, pw_ids_i32, &pw_norm, &pw_error)) {
+            std::vector<int64_t> pw_ids(pw_ids_i32.begin(), pw_ids_i32.end());
+            float pw_dur = 0;
+            std::vector<float> pw_text_emb;
+            if (supertonic_duration_forward_ggml(model, pw_ids.data(), (int) pw_ids.size(),
+                                                 style_dp.data(), pw_dur, &pw_error) &&
+                supertonic_text_encoder_forward_ggml(model, pw_ids.data(), (int) pw_ids.size(),
+                                                     style_ttl.data(), pw_text_emb, &pw_error)) {
+                const int chunk = model.hparams.base_chunk_size * model.hparams.ttl_chunk_compress_factor;
+                int pw_latent_len = std::max(1, (int) (pw_dur / speed * model.hparams.sample_rate + chunk - 1) / chunk);
+                std::vector<float> pw_latent((size_t) model.hparams.latent_channels * pw_latent_len, 0.0f);
+                std::vector<float> pw_mask((size_t) pw_latent_len, 1.0f);
+                std::vector<float> pw_next;
+                bool pw_ok = true;
+                for (int s = 0; s < steps && pw_ok; ++s) {
+                    pw_ok = supertonic_vector_step_ggml(model, pw_latent.data(), pw_latent_len,
+                                                        pw_text_emb.data(), (int) pw_ids.size(),
+                                                        style_ttl.data(), pw_mask.data(),
+                                                        s, steps, pw_next, &pw_error);
+                    pw_latent.swap(pw_next);
+                }
+                std::vector<float> pw_wav;
+                if (pw_ok) {
+                    supertonic_vocoder_forward_ggml(model, pw_latent.data(), pw_latent_len,
+                                                    pw_wav, &pw_error);
+                }
+            }
+        }
+        prewarm_ms = ms_t(clk::now() - pw_t0).count();
+        fprintf(stderr, "[prewarm] cold-start synth on '%s' took %.1fms\n",
+                prewarm_text.c_str(), prewarm_ms);
+    }
+
     int total_runs = runs + warmup;
     for (int r = 0; r < total_runs; ++r) {
         bool record = r >= warmup;
         std::string error;
 
+        bench_sync_now();
         auto t0 = clk::now();
 
         std::vector<int32_t> text_ids_i32;
@@ -196,6 +468,7 @@ int main(int argc, char ** argv) {
             free_supertonic_model(model); return 1;
         }
         std::vector<int64_t> text_ids(text_ids_i32.begin(), text_ids_i32.end());
+        bench_sync_now();
         auto t1 = clk::now();
 
         float duration_raw = 0;
@@ -204,6 +477,7 @@ int main(int argc, char ** argv) {
             fprintf(stderr, "duration failed: %s\n", error.c_str());
             free_supertonic_model(model); return 1;
         }
+        bench_sync_now();
         auto t2 = clk::now();
 
         const int sample_rate = model.hparams.sample_rate;
@@ -229,11 +503,18 @@ int main(int argc, char ** argv) {
             fprintf(stderr, "text encoder failed: %s\n", error.c_str());
             free_supertonic_model(model); return 1;
         }
+        bench_sync_now();
         auto t3 = clk::now();
 
         std::vector<float> latent_mask((size_t) latent_len, 1.0f);
         std::vector<float> next;
+        // QVAC-18605 round 7 — per-step timing.  When
+        // `bench_per_step` is on, a clock sample before and a sync +
+        // clock sample after bracket each `supertonic_vector_step_ggml`
+        // call.  When off, only the single end-of-loop sync runs, so
+        // the loop adds no per-step overhead.
         for (int s = 0; s < steps; ++s) {
+            auto step_t0 = bench_per_step ? clk::now() : clk::time_point{};
             if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
                                              text_emb.data(), (int) text_ids.size(),
                                              style_ttl.data(), latent_mask.data(),
@@ -242,7 +523,15 @@ int main(int argc, char ** argv) {
                 free_supertonic_model(model); return 1;
             }
             latent.swap(next);
+            if (bench_per_step) {
+                bench_sync_now();
+                auto step_t1 = clk::now();
+                if (record) {
+                    st_ve_per_step[(size_t) s].ms.push_back(ms_t(step_t1 - step_t0).count());
+                }
+            }
         }
+        bench_sync_now();
         auto t4 = clk::now();
 
         std::vector<float> wav;
@@ -250,6 +539,7 @@ int main(int argc, char ** argv) {
             fprintf(stderr, "vocoder failed: %s\n", error.c_str());
             free_supertonic_model(model); return 1;
         }
+        bench_sync_now();
         auto t5 = clk::now();
 
         double audio_s = (double) wav.size() / (double) sample_rate;
@@ -274,7 +564,71 @@ int main(int argc, char ** argv) {
     printf("  text length: %zu chars\n", text.size());
     printf("  voice: %s, language: %s, steps: %d, speed: %.2f\n",
            voice.c_str(), language.c_str(), steps, speed);
-    printf("  threads: %d\n", model.n_threads);
+    printf("  threads: %d, n_gpu_layers: %d, precision: %s\n",
+           model.n_threads, n_gpu_layers, precision_to_string(precision));
+    {
+        // QVAC-18605 — bench backend description.  On Vulkan the
+        // adapter description is appended so multi-GPU machines
+        // unambiguously identify which device ran the bench.
+        std::string desc = ggml_backend_name(model.backend) ? ggml_backend_name(model.backend) : "(unknown)";
+        if (model.backend_is_vk) {
+            ggml_backend_dev_t dev = ggml_backend_get_device(model.backend);
+            const char * vk_desc = dev ? ggml_backend_dev_description(dev) : nullptr;
+            if (vk_desc && *vk_desc) {
+                const int idx = vulkan_device < 0 ? 0 : vulkan_device;
+                desc += " (device " + std::to_string(idx) + ": " + vk_desc + ")";
+            }
+        }
+        // QVAC-18605 follow-up — surface every backend-capability
+        // dispatch flag plus the cold-start prewarm latency so log
+        // grep'ing across multiple machines can attribute perf
+        // differences to the right cause (e.g. "use_f16_weights=off
+        // on this run because the F16 mul_mat probe rejected the
+        // shape" is much faster to triage than "why is this synth
+        // 30 % slower than the other one").
+        // QVAC-18605 round 3 — also surface BF16 K/V availability and
+        // the host-pinned-buffer-type availability.  Both are forward-
+        // compat capabilities (no live dispatch yet); the bench tag
+        // lets operators verify a future `--kv-attn-type bf16` /
+        // `--vulkan-pinned-uploads` opt-in will actually take effect
+        // on their machine before they flip the flag.
+        // QVAC-18605 round 4 — surface the resolved K/V dispatch
+        // dtype.  When the operator does not pass `--kv-attn-type`
+        // the resolved value falls through to `f16` / `f32` per
+        // `--f16-attn`, so the existing `f16_attn=on` tag still
+        // matches the historical baseline; the always-printed
+        // `kv_attn_type=` tag shows bf16 / q8_0 only when they take effect.
+        const char * kv_dtype_str = "f32";
+        switch (model.kv_attn_type) {
+            case kv_attn_dtype::f32:        kv_dtype_str = "f32";  break;
+            case kv_attn_dtype::f16:        kv_dtype_str = "f16";  break;
+            case kv_attn_dtype::bf16:       kv_dtype_str = "bf16"; break;
+            case kv_attn_dtype::q8_0:       kv_dtype_str = "q8_0"; break;
+            case kv_attn_dtype::autoselect: kv_dtype_str = "auto-leaked!"; break;
+        }
+        printf("  backend: %s%s%s%s (kv_attn_type=%s)%s%s%s\n",
+               desc.c_str(),
+               model.use_f16_attn        ? " (f16_attn=on)"        : "",
+               model.use_f16_weights     ? " (f16_weights=on)"     : "",
+               model.use_native_leaky_relu ? " (native_leaky_relu=on)" : "",
+               kv_dtype_str,
+               supertonic_backend_supports_q8_0_kv_flash_attn(model.backend) ? " (q8_0_kv_attn=available)" : "",
+               supertonic_backend_supports_bf16_kv_flash_attn(model.backend) ? " (bf16_kv_attn=available)" : "",
+               supertonic_backend_supports_pinned_host_buffer(model.backend) ? " (pinned_host_buffer=available)" : "");
+        // QVAC-18605 round 6 — confirm the F16-weights deny-list took
+        // effect.  Silent when the operator didn't supply one (no
+        // visual noise on the default path).
+        if (!f16_weights_deny_list.empty()) {
+            printf("  f16_weights_deny_list: %zu pattern%s; %d tensor%s excluded\n",
+                   f16_weights_deny_list.size(),
+                   f16_weights_deny_list.size() == 1 ? "" : "s",
+                   model.f16_weights_excluded_count,
+                   model.f16_weights_excluded_count == 1 ? "" : "s");
+        }
+        if (prewarm_ms > 0.0) {
+            printf("  prewarm: %.1fms (cold-start, discarded)\n", prewarm_ms);
+        }
+    }
     printf("  audio per run: %.3fs @ %d Hz\n", last_audio_s, model.hparams.sample_rate);
     printf("  runs: %d (warmup discarded: %d)\n", runs, warmup);
     printf("\n");
@@ -282,6 +636,12 @@ int main(int argc, char ** argv) {
     print_stage(st_dur);
     print_stage(st_te);
     print_stage(st_ve);
+    // QVAC-18605 round 7 — per-step breakdown lines.  Indented
+    // under the aggregate vector-estimator line for visual
+    // grouping.  Only emitted when --bench-per-step is on.
+    for (auto & st : st_ve_per_step) {
+        if (!st.ms.empty()) print_stage(st);
+    }
     print_stage(st_voc);
     print_stage(st_tot);
     if (!rtfs.empty()) {
@@ -306,9 +666,74 @@ int main(int argc, char ** argv) {
         os << "  \"steps\": " << steps << ",\n";
         os << "  \"speed\": " << speed << ",\n";
         os << "  \"threads\": " << model.n_threads << ",\n";
+        os << "  \"n_gpu_layers\": " << n_gpu_layers << ",\n";
+        os << "  \"precision\": \"" << precision_to_string(precision) << "\",\n";
         os << "  \"audio_s\": " << last_audio_s << ",\n";
         os << "  \"runs\": " << runs << ",\n";
         os << "  \"warmup\": " << warmup << ",\n";
+        os << "  \"prewarm_ms\": " << prewarm_ms << ",\n";
+        os << "  \"f16_attn\": " << (model.use_f16_attn ? "true" : "false") << ",\n";
+        os << "  \"f16_weights\": " << (model.use_f16_weights ? "true" : "false") << ",\n";
+        // QVAC-18605 round 4 — surface the resolved K/V dispatch
+        // dtype.  Always emitted (string label), so JSON consumers
+        // can attribute drift / perf differences to the right cause
+        // even on the default `auto` path.
+        {
+            const char * kv = "f32";
+            switch (model.kv_attn_type) {
+                case kv_attn_dtype::f32:  kv = "f32";  break;
+                case kv_attn_dtype::f16:  kv = "f16";  break;
+                case kv_attn_dtype::bf16: kv = "bf16"; break;
+                case kv_attn_dtype::q8_0: kv = "q8_0"; break;
+                case kv_attn_dtype::autoselect: kv = "auto-leaked"; break;
+            }
+            os << "  \"kv_attn_type\": \"" << kv << "\",\n";
+            os << "  \"kv_attn_type_requested\": " << kv_attn_type << ",\n";
+        }
+        // QVAC-18605 round 6 — surface the user-supplied deny-list +
+        // the count of tensors it excluded.  Always emitted (even on
+        // the default empty path) so JSON consumers can attribute
+        // any quality regression observed in CI to a config change.
+        os << "  \"f16_weights_deny_list\": [";
+        for (size_t k = 0; k < f16_weights_deny_list.size(); ++k) {
+            if (k) os << ", ";
+            os << "\"" << json_escape(f16_weights_deny_list[k]) << "\"";
+        }
+        os << "],\n";
+        os << "  \"f16_weights_excluded_count\": " << model.f16_weights_excluded_count << ",\n";
+        os << "  \"native_leaky_relu\": " << (model.use_native_leaky_relu ? "true" : "false") << ",\n";
+        os << "  \"q8_0_kv_attn_available\": "
+           << (supertonic_backend_supports_q8_0_kv_flash_attn(model.backend) ? "true" : "false") << ",\n";
+        // QVAC-18605 round 3 — extra capability flags surfaced for the
+        // forward-compat probes (BF16 K/V flash-attn + pinned-host-
+        // buffer-type).  Operators / CI scripts grep on these to
+        // pre-flight whether a future `--kv-attn-type bf16` /
+        // `--vulkan-pinned-uploads` opt-in will be effective on the
+        // resolved backend.
+        os << "  \"bf16_kv_attn_available\": "
+           << (supertonic_backend_supports_bf16_kv_flash_attn(model.backend) ? "true" : "false") << ",\n";
+        os << "  \"pinned_host_buffer_available\": "
+           << (supertonic_backend_supports_pinned_host_buffer(model.backend) ? "true" : "false") << ",\n";
+        // QVAC-18605 round 7 — bench observability surface.
+        // `bench_sync` documents whether the per-stage times
+        // include a `ggml_backend_synchronize` boundary; useful
+        // when comparing JSON across machines / configs.
+        os << "  \"bench_sync\": " << (bench_sync ? "true" : "false") << ",\n";
+        // QVAC-18605 round 7 — Vulkan env-var overrides surfaced
+        // verbatim so the JSON consumer can attribute drift to
+        // a specific override (or its absence).  Always emitted
+        // (object — empty on the default-config path).
+        os << "  \"vulkan_env_overrides\": {";
+        {
+            bool first = true;
+            for (const auto & kv : vulkan_env_overrides) {
+                if (!first) os << ", ";
+                first = false;
+                os << "\"" << json_escape(kv.first) << "\": \""
+                   << json_escape(kv.second) << "\"";
+            }
+        }
+        os << "},\n";
         os << "  \"rtf\": {"
            << "\"min\": " << minv(rtfs)
            << ", \"median\": " << median(rtfs)
@@ -320,7 +745,18 @@ int main(int argc, char ** argv) {
         write_json_stage(os, st_pre, true);
         write_json_stage(os, st_dur, true);
         write_json_stage(os, st_te, true);
-        write_json_stage(os, st_ve, true);
+        // QVAC-18605 round 7 — when --bench-per-step is on, emit
+        // each step as its own stage entry.  When off, the
+        // aggregate `vector_estimator` stage is the only entry
+        // for the vector-estimator buckets (legacy JSON shape).
+        if (!st_ve_per_step.empty()) {
+            write_json_stage(os, st_ve, true);
+            for (auto & st : st_ve_per_step) {
+                if (!st.ms.empty()) write_json_stage(os, st, true);
+            }
+        } else {
+            write_json_stage(os, st_ve, true);
+        }
         write_json_stage(os, st_voc, true);
         write_json_stage(os, st_tot, false);
         os << "  }\n";
diff --git a/tts-cpp/src/supertonic_chunker.cpp b/tts-cpp/src/supertonic_chunker.cpp
new file mode 100644
index 00000000000..9d2bc2385cc
--- /dev/null
+++ b/tts-cpp/src/supertonic_chunker.cpp
@@ -0,0 +1,307 @@
+#include "supertonic_chunker.h"
+
+#include <algorithm>
+#include <cstdint>
+
+namespace tts_cpp::supertonic::detail {
+namespace {
+
+// Minimal UTF-8 decoder — same shape as the anon-namespace helpers in
+// supertonic_preprocess.cpp.  Kept local so the chunker has no cross-file
+// dependency beyond its own header.  Replaces malformed sequences with
+// U+FFFD and a 1-byte advance (matches preprocess behaviour for parity).
+bool utf8_decode(const char * s, size_t len, size_t & pos, uint32_t & cp) {
+    if (pos >= len) return false;
+    uint8_t b0 = (uint8_t) s[pos];
+    if (b0 < 0x80) { cp = b0; pos += 1; return true; }
+    int extra = 0;
+    if      ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
+    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
+    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
+    else                          { cp = 0xFFFD; pos += 1; return true; }
+    if (pos + 1 + extra > len)    { cp = 0xFFFD; pos += 1; return true; }
+    for (int i = 0; i < extra; ++i) {
+        uint8_t b = (uint8_t) s[pos + 1 + i];
+        if ((b & 0xC0) != 0x80) { cp = 0xFFFD; pos += 1; return true; }
+        cp = (cp << 6) | (b & 0x3F);
+    }
+    pos += 1 + extra;
+    return true;
+}
+
+struct cp_at {
+    uint32_t cp;        // code point
+    size_t   byte_pos;  // byte offset of this code point in the source string
+};
+
+std::vector<cp_at> decode_with_byte_offsets(const std::string & s) {
+    std::vector<cp_at> out;
+    out.reserve(s.size());
+    size_t pos = 0;
+    while (pos < s.size()) {
+        size_t   start = pos;
+        uint32_t cp    = 0;
+        if (!utf8_decode(s.data(), s.size(), pos, cp)) break;
+        out.push_back({cp, start});
+    }
+    return out;
+}
+
+bool is_space_cp(uint32_t cp) {
+    return cp == 0x09 || cp == 0x0A || cp == 0x0B || cp == 0x0C || cp == 0x0D ||
+           cp == 0x20 || cp == 0x85 || cp == 0xA0 || cp == 0x1680 ||
+           (cp >= 0x2000 && cp <= 0x200A) || cp == 0x2028 || cp == 0x2029 ||
+           cp == 0x202F || cp == 0x205F || cp == 0x3000;
+}
+
+// Clause-end punctuation (lower priority than sentence-end).  Includes
+// CJK and Arabic equivalents.  Closing brackets count — a clause that
+// just ended a parenthetical is a reasonable break point too.
+bool is_clause_end_cp(uint32_t cp) {
+    switch (cp) {
+        case 0x002C: // ,
+        case 0x003B: // ;
+        case 0x003A: // :
+        case 0xFF0C: // ， fullwidth comma
+        case 0x3001: // 、 ideographic comma
+        case 0xFF1B: // ； fullwidth semicolon
+        case 0xFF1A: // ： fullwidth colon
+        case 0x060C: // ،  Arabic comma
+        case 0x061B: // ؛  Arabic semicolon
+        case 0x0029: // )
+        case 0x005D: // ]
+        case 0x007D: // }
+        case 0xFF09: // ）
+            return true;
+        default:
+            return false;
+    }
+}
+
+// Scan (lo, hi] for a break index idx where pred(cps[idx-1].cp) is true.
+// Sweeps right from `target` (clamped into the window) first, then left.
+// The break lands AFTER the match, since chunks that end ON the
+// punctuation/space read more naturally.  Returns SIZE_MAX if no match.
+size_t scan_for(const std::vector<cp_at> & cps,
+                size_t target,
+                size_t lo,
+                size_t hi,
+                bool (*pred)(uint32_t))
+{
+    if (hi <= lo + 1) return SIZE_MAX;
+    const size_t t = std::clamp(target, lo + 1, hi);
+    for (size_t r = t; r <= hi; ++r) {
+        if (pred(cps[r - 1].cp)) return r;
+    }
+    for (size_t r = t; r > lo + 1; --r) {
+        if (pred(cps[r - 2].cp)) return r - 1;
+    }
+    return SIZE_MAX;
+}
+
+// Find the best boundary index for splitting.  Two windows:
+//
+//   `sent_lo..sent_hi`  — wide window for sentence-end punctuation.
+//                          Sentence prosody dominates audio quality on
+//                          this model (the duration predictor and
+//                          attention run per-chunk, so chunk-aligned
+//                          sentence breaks let the model phrase
+//                          naturally), so sentence search reaches
+//                          much further than clause/whitespace.
+//
+//   `norm_lo..norm_hi`  — tight user-controlled window for clause and
+//                          whitespace fallbacks when no sentence is in
+//                          reach.  Hard-cut at `norm_hi` as last
+//                          resort.  Continuation flag in the engine
+//                          makes the resulting mid-clause chunk audio
+//                          tolerable; the bigger seam artifacts (small
+//                          pauses, rate shifts) are inherent to
+//                          per-chunk synthesis on a non-streaming-
+//                          trained model and can't be removed at this
+//                          layer.
+//
+// Returns the index AFTER the break (chunk = cps[start..break)).
+size_t pick_break(const std::vector<cp_at> & cps,
+                  size_t target,
+                  size_t sent_lo, size_t sent_hi,
+                  size_t norm_lo, size_t norm_hi)
+{
+    if (size_t b = scan_for(cps, target, sent_lo, sent_hi, is_sentence_end_cp);
+        b != SIZE_MAX) return b;
+    if (size_t b = scan_for(cps, target, norm_lo, norm_hi, is_clause_end_cp);
+        b != SIZE_MAX) return b;
+    if (size_t b = scan_for(cps, target, norm_lo, norm_hi, is_space_cp);
+        b != SIZE_MAX) return b;
+    return norm_hi;  // hard cut
+}
+
+std::string slice_to_string(const std::vector<cp_at> & cps,
+                            size_t start_idx,
+                            size_t end_idx,
+                            const std::string & source) {
+    if (start_idx >= end_idx) return {};
+    const size_t byte_start = cps[start_idx].byte_pos;
+    const size_t byte_end   = (end_idx < cps.size())
+                                ? cps[end_idx].byte_pos
+                                : source.size();
+    std::string out = source.substr(byte_start, byte_end - byte_start);
+
+    // Trim leading + trailing ASCII whitespace (space, tab, LF, CR) by
+    // scanning the slice's bytes; cheaper than re-decoding given the
+    // slice is typically tens of bytes.  Non-ASCII spaces are kept.
+    size_t l = 0;
+    while (l < out.size() && (out[l] == ' ' || out[l] == '\t' ||
+                              out[l] == '\n' || out[l] == '\r')) ++l;
+    size_t r = out.size();
+    while (r > l && (out[r - 1] == ' ' || out[r - 1] == '\t' ||
+                     out[r - 1] == '\n' || out[r - 1] == '\r')) --r;
+    return out.substr(l, r - l);
+}
+
+} // namespace
+
+// Sentence-end punctuation across ASCII, CJK, Devanagari, Urdu, and
+// the General Punctuation block (U+203C, U+2047..U+2049).  Conservative:
+// symbols that can end a sentence but are ambiguous (e.g. ellipsis "…")
+// are intentionally excluded since they often continue a thought.
+//
+// Public (declared in supertonic_chunker.h) so the engine's per-chunk
+// "does this end on a natural sentence terminator?" helper shares the
+// same table — additions (e.g. Ethiopic ።, Tibetan ། later) land in
+// one place instead of needing to be synced across compilation units.
+bool is_sentence_end_cp(uint32_t cp) {
+    switch (cp) {
+        case 0x002E: // .
+        case 0x003F: // ?
+        case 0x0021: // !
+        case 0x3002: // 。  CJK ideographic full stop
+        case 0xFF1F: // ？ fullwidth question mark
+        case 0xFF01: // ！ fullwidth exclamation mark
+        case 0x203C: // ‼ double exclamation
+        case 0x2047: // ⁇ double question
+        case 0x2048: // ⁈ question exclamation
+        case 0x2049: // ⁉ exclamation question
+        case 0x0964: // ।  Devanagari danda
+        case 0x0965: // ॥  Devanagari double danda
+        case 0x06D4: // ۔  Urdu full stop
+            return true;
+        default:
+            return false;
+    }
+}
+
+std::vector<std::string> split_for_streaming(
+    const std::string & text,
+    int target_tokens,
+    int first_chunk_tokens,
+    int tolerance_pct,
+    int min_chunk_tokens)
+{
+    std::vector<std::string> out;
+    if (target_tokens <= 0 || text.empty()) {
+        // Caller is responsible for falling back to the batch path when
+        // target_tokens <= 0; returning a single-element vector here so
+        // the chunker remains usable as a defensive no-op splitter.
+        if (!text.empty()) out.push_back(text);
+        return out;
+    }
+
+    const std::vector<cp_at> cps = decode_with_byte_offsets(text);
+    if (cps.empty()) return out;
+
+    const int tol_pct        = std::clamp(tolerance_pct, 0, 100);
+    const int min_chunk      = std::max(1, min_chunk_tokens);
+    // Effective targets clamp up to min_chunk so the chunker never aims
+    // for a sub-minimum chunk (the model glitches on stub input below
+    // ~30 tokens — verified empirically on multiple seeds and texts).
+    const int target_eff     = std::max(target_tokens, min_chunk);
+    const int first_eff      = first_chunk_tokens > 0
+                                   ? std::max(first_chunk_tokens, min_chunk)
+                                   : 0;
+
+    const size_t total = cps.size();
+    size_t       start = 0;
+    int          chunk_idx = 0;
+
+    while (start < total) {
+        const int target_this = (chunk_idx == 0 && first_eff > 0)
+                                    ? first_eff
+                                    : target_eff;
+
+        // Tight window — for clause/whitespace boundaries and the
+        // hard-cut fallback.  Driven by the user-supplied tolerance.
+        // Lower bound is bumped to start + min_chunk so a break can't
+        // produce a sub-minimum chunk on this iteration.
+        int norm_lo_rel = std::max(1, target_this - target_this * tol_pct / 100);
+        int norm_hi_rel = target_this + target_this * tol_pct / 100;
+        norm_lo_rel     = std::max(norm_lo_rel, min_chunk);
+        norm_hi_rel     = std::max(norm_hi_rel, norm_lo_rel);
+
+        // Wide window for the sentence-end search.  Reaches back to half
+        // the effective target, floored at min_chunk so a sentence break
+        // that would yield a too-small chunk is rejected, and forward to
+        // 2× the target.  2× is empirical: it catches a long-but-reasonable
+        // first sentence in multi-sentence text (~75-90 chars at
+        // target=50), yet is narrow enough that a genuinely runaway
+        // sentence (>2× target with no internal periods) falls through
+        // to the clause/whitespace search and yields multiple
+        // sub-sentence chunks instead of slurping the whole tail as one
+        // huge "sentence-aligned" chunk.
+        int sent_lo_rel = std::max(1, target_this / 2);
+        int sent_hi_rel = target_this * 2;
+        sent_lo_rel     = std::max(sent_lo_rel, min_chunk);
+        sent_hi_rel     = std::max(sent_hi_rel, sent_lo_rel);
+
+        const size_t norm_lo = std::min(start + (size_t) norm_lo_rel, total);
+        const size_t norm_hi = std::min(start + (size_t) norm_hi_rel, total);
+        const size_t sent_lo = std::min(start + (size_t) sent_lo_rel, total);
+        const size_t sent_hi = std::min(start + (size_t) sent_hi_rel, total);
+
+        size_t brk;
+        if (norm_hi <= start + 1 || total - start <= (size_t) norm_hi_rel) {
+            // Entire remainder fits inside this chunk's upper tolerance —
+            // take it all.  Avoids leaving a tiny sub-tolerance tail.
+            brk = total;
+        } else {
+            const size_t target_abs = std::min(start + (size_t) target_this, total);
+            brk = pick_break(cps, target_abs,
+                             sent_lo, sent_hi,
+                             norm_lo, norm_hi);
+        }
+
+        std::string chunk = slice_to_string(cps, start, brk, text);
+        if (!chunk.empty()) out.push_back(std::move(chunk));
+        start = brk;
+        ++chunk_idx;
+    }
+
+    // Tail-merge heuristic: if the last chunk is genuinely tiny, fold
+    // it into the previous chunk to avoid paying full pipeline cost for
+    // a handful of trailing tokens.  Mirrors chatterbox_engine.cpp:608.
+    //
+    // Threshold is intentionally `max(6, target_tokens/3)`, NOT
+    // `min_chunk_tokens` — using min_chunk here would merge any
+    // last-chunk shorter than the floor, which can swallow a complete
+    // final sentence (e.g. Korean "공원에서 산책하기 좋은 날이다."
+    // is 17 code points, below a min_chunk=30 floor, but is itself a
+    // valid sentence-aligned chunk that the model handles fine because
+    // CJK information density per code point is much higher than ASCII).
+    // The min_chunk floor governs what the chunker proactively *aims
+    // for*, not what it does with whatever's left after the last natural
+    // boundary.
+    if (out.size() >= 2) {
+        const std::vector<cp_at> tail_cps = decode_with_byte_offsets(out.back());
+        const int                tail_thresh = std::max(6, target_tokens / 3);
+        if ((int) tail_cps.size() < tail_thresh) {
+            std::string merged = out[out.size() - 2];
+            if (!merged.empty() && !out.back().empty()) merged.push_back(' ');
+            merged += out.back();
+            out.pop_back();
+            out.back() = std::move(merged);
+        }
+    }
+
+    return out;
+}
+
+} // namespace tts_cpp::supertonic::detail
diff --git a/tts-cpp/src/supertonic_chunker.h b/tts-cpp/src/supertonic_chunker.h
new file mode 100644
index 00000000000..99c0142ce53
--- /dev/null
+++ b/tts-cpp/src/supertonic_chunker.h
@@ -0,0 +1,53 @@
+#pragma once
+
+// Multilingual streaming chunker for the Supertonic engine.
+//
+// Splits an input string into a list of substrings sized for per-chunk
+// synthesis, preferring natural boundaries when available:
+//
+//   1. sentence-end punctuation  (. ? ! 。 ？ ！ ‼ ⁇ ⁈ ⁉ । ॥)
+//   2. clause-end punctuation    (, ; : ， 、 ； ： ؛ ، and closing brackets)
+//   3. whitespace                (handles CJK/Thai/Lao/Khmer where 1+2 are absent)
+//   4. hard cut                  (last-resort cap at the upper tolerance bound)
+//
+// Token grain matches `supertonic_text_to_ids` (one ID per Unicode code
+// point after normalization), so the input character count IS the token
+// count that the engine will see.  No model tokenizer call is required
+// for sizing.
+
+#include <cstdint>
+#include <string>
+#include <vector>
+
+namespace tts_cpp::supertonic::detail {
+
+// Split `text` into chunks sized roughly `target_tokens` code points
+// each, snapping to the best available boundary within ±`tolerance_pct`
+// of the target.  When `first_chunk_tokens > 0`, the first chunk uses
+// that smaller target instead (latency knob — first audio lands earlier
+// while subsequent chunks stay large to keep throughput up).
+//
+// `min_chunk_tokens` floors the size the chunker aims for: the effective
+// target is `max(target_tokens, min_chunk_tokens)` (likewise for the
+// first chunk).  The trailing chunk is merged into the previous one only
+// when shorter than `max(6, target_tokens / 3)`, so it may end up below the
+// floor.  Default 30: empirically, shorter stubs give dropped/muddled phonemes.
+//
+// Leading/trailing whitespace on each chunk is trimmed.  Adjacent chunks
+// concatenated back together (modulo trimmed whitespace) reproduce the
+// input.  Empty / whitespace-only chunks are not emitted.
+std::vector<std::string> split_for_streaming(
+    const std::string & text,
+    int target_tokens,
+    int first_chunk_tokens = 0,
+    int tolerance_pct      = 20,
+    int min_chunk_tokens   = 30);
+
+// Sentence-end predicate over a Unicode code point.  Public so the
+// engine's per-chunk "does this end on a natural sentence terminator?"
+// helper can share the table with the chunker's boundary search —
+// keeps additions (e.g. Ethiopic ።, Tibetan ། in the future) in one
+// place.  See supertonic_chunker.cpp for the full set.
+bool is_sentence_end_cp(uint32_t cp);
+
+} // namespace tts_cpp::supertonic::detail
diff --git a/tts-cpp/src/supertonic_cli.cpp b/tts-cpp/src/supertonic_cli.cpp
index 40c5f4f05fa..4c8963ee6ec 100644
--- a/tts-cpp/src/supertonic_cli.cpp
+++ b/tts-cpp/src/supertonic_cli.cpp
@@ -4,8 +4,10 @@
 #include <cmath>
 #include <cstdio>
 #include <cstdint>
+#include <cstdlib>
 #include <stdexcept>
 #include <string>
+#include <vector>
 
 namespace {
 
@@ -15,8 +17,76 @@ void usage(const char * argv0) {
         "          [--language en] [--voice NAME] [--steps N] [--speed X]\n"
         "          (voice/steps/speed default to GGUF metadata when omitted)\n"
         "          [--seed 42] [--threads N] [--n-gpu-layers N]\n"
-        "          [--noise-npy /path/to/noise.npy]\n",
-        argv0);
+        "          [--vulkan-device N] (Vulkan adapter index; ignored unless\n"
+        "                            built with -DGGML_VULKAN=ON; default 0,\n"
+        "                            -1 = auto-pick adapter with most free VRAM)\n"
+        "          [--f16-attn 0|1] (vector-estimator F16 K/V attention;\n"
+        "                            defaults to auto: on for GPU, off for CPU)\n"
+        "          [--kv-attn-type auto|f32|f16|bf16|q8_0]\n"
+        "                            (vector-estimator multi-dtype K/V flash-attn;\n"
+        "                            generalises --f16-attn.  default auto: falls\n"
+        "                            back to --f16-attn for backwards-compat.\n"
+        "                            bf16/q8_0 require Vulkan adapter support;\n"
+        "                            silent fallback to f32 on probe miss.)\n"
+        "          [--f16-weights 0|1] (load-time F16 materialization for the\n"
+        "                            audit-identified hot matmul / pwconv weights;\n"
+        "                            defaults to auto: on for GPU, off for CPU)\n"
+        "          [--precision f32|f16|q8_0]   (default: f32)\n"
+        "          [--f16-weights-deny PATTERN1,PATTERN2,...] (substring patterns,\n"
+        "                            comma-separated; matching tensors stay F32 even\n"
+        "                            when --f16-weights is on.  Default empty.)\n"
+        "          [--prewarm TEXT] (run one throwaway synth on TEXT at engine\n"
+        "                            construction so first-real-call latency on\n"
+        "                            Vulkan / OpenCL doesn't pay the shader-\n"
+        "                            compile cost; no-op on CPU)\n"
+        "          [--vulkan-prefer-host-memory]    (sets GGML_VK_PREFER_HOST_MEMORY=1)\n"
+        "          [--vulkan-disable-coopmat2]      (sets GGML_VK_DISABLE_COOPMAT2=1)\n"
+        "          [--vulkan-disable-bfloat16]      (sets GGML_VK_DISABLE_BFLOAT16=1)\n"
+        "          [--vulkan-perf-logger]           (sets GGML_VK_PERF_LOGGER=1)\n"
+        "          [--vulkan-async-transfer]        (sets GGML_VK_ASYNC_USE_TRANSFER_QUEUE=1)\n"
+        "          [--vulkan-env KEY=VALUE]         (set arbitrary GGML_VK_* env var;\n"
+        "                            may be repeated; operator-set env vars in the shell\n"
+        "                            STILL win over these CLI overrides)\n"
+        "          [--noise-npy /path/to/noise.npy]\n"
+        "          [--stream-chunk-tokens N]    (0 = batch; >0 enables\n"
+        "                            streaming with target ~N text-token chunks)\n"
+        "          [--stream-first-chunk-tokens N]  (override 1st-chunk target;\n"
+        "                            0 = same as --stream-chunk-tokens)\n"
+        "          [--stream-chunk-tolerance-pct N] (boundary-snap window; default 20)\n"
+        "          [--stream-min-chunk-tokens N]    (floor on targeted chunk size;\n"
+        "                            default 30; below this the model glitches\n"
+        "                            on stub input; a tiny trailing chunk is\n"
+        "                            merged into the previous chunk)\n"
+        "\n"
+        "          When --out is '-', the CLI emits raw s16le PCM to stdout as\n"
+        "          each chunk completes.  Pipe into a player, e.g.:\n"
+        "            %s --model ... --text '...' --out - --stream-chunk-tokens 50 \\\n"
+        "              | aplay -f S16_LE -r 44100 -c 1\n",
+        argv0, argv0);
+}
+
+tts_cpp::supertonic::Precision parse_precision(const std::string & s) {
+    if (s == "f32" || s == "F32") return tts_cpp::supertonic::Precision::F32;
+    if (s == "f16" || s == "F16") return tts_cpp::supertonic::Precision::F16;
+    if (s == "q8_0" || s == "Q8_0" || s == "q8") return tts_cpp::supertonic::Precision::Q8_0;
+    throw std::runtime_error("unknown --precision value: " + s + " (expected f32|f16|q8_0)");
+}
+
+// Emit `pcm` as raw signed-16-bit little-endian samples on stdout.  Used
+// by the streaming path so a consumer like `ffplay -f s16le -ar 44100 ...`
+// can begin playback as soon as the first chunk arrives.  Builds the
+// full chunk's worth of int16 into a contiguous buffer and writes it
+// with a single fwrite — a per-sample fwrite loop would do ~44k-132k
+// syscall-adjacent calls per chunk and noticeably tax streaming
+// throughput on slower terminals / pipes.
+void stream_emit_pcm_stdout(const float * pcm, std::size_t samples) {
+    std::vector<int16_t> buf(samples);
+    for (std::size_t i = 0; i < samples; ++i) {
+        float c = std::max(-1.0f, std::min(1.0f, pcm[i]));
+        buf[i] = (int16_t) std::lrintf(c * 32767.0f);
+    }
+    std::fwrite(buf.data(), sizeof(int16_t), samples, stdout);
+    std::fflush(stdout);
 }
 
 void write_wav(const std::string & path, const std::vector<float> & wav, int sr) {
@@ -49,6 +119,12 @@ int main(int argc, char ** argv) {
     tts_cpp::supertonic::EngineOptions opts;
     std::string text;
     std::string out;
+    // QVAC-18605 round 4 — wrap arg parse in try/catch so invalid
+    // values (`--kv-attn-type bogus`, `--vulkan-device abc`, etc.)
+    // surface as a clean `error: ...` line + exit 2 instead of an
+    // uncaught-exception backtrace.  Same exit-code convention as
+    // unknown-flag / missing-required handling below.
+    try {
     for (int i = 1; i < argc; ++i) {
         std::string arg = argv[i];
         auto next = [&](const char * flag) -> const char * {
@@ -65,19 +141,151 @@ int main(int argc, char ** argv) {
         else if (arg == "--seed") opts.seed = std::stoi(next("--seed"));
         else if (arg == "--threads") opts.n_threads = std::stoi(next("--threads"));
         else if (arg == "--n-gpu-layers") opts.n_gpu_layers = std::stoi(next("--n-gpu-layers"));
+        else if (arg == "--vulkan-device") opts.vulkan_device = std::stoi(next("--vulkan-device"));
+        else if (arg == "--f16-attn") opts.f16_attn = std::stoi(next("--f16-attn"));
+        else if (arg == "--kv-attn-type") {
+            const std::string v = next("--kv-attn-type");
+            if      (v == "auto") opts.kv_attn_type = -1;
+            else if (v == "f32")  opts.kv_attn_type = 0;
+            else if (v == "f16")  opts.kv_attn_type = 1;
+            else if (v == "bf16") opts.kv_attn_type = 2;
+            else if (v == "q8_0") opts.kv_attn_type = 3;
+            else throw std::runtime_error(
+                "--kv-attn-type expects one of: auto, f32, f16, bf16, q8_0 (got: " + v + ")");
+        }
+        else if (arg == "--f16-weights") opts.f16_weights = std::stoi(next("--f16-weights"));
+        else if (arg == "--precision") opts.precision = parse_precision(next("--precision"));
+        else if (arg == "--f16-weights-deny") {
+            // Comma-split into a vector<string>.  Empty entries
+            // are tolerated (predicate skips them defensively).
+            opts.f16_weights_deny_list.clear();
+            const std::string raw = next("--f16-weights-deny");
+            size_t start = 0;
+            for (size_t k = 0; k <= raw.size(); ++k) {
+                if (k == raw.size() || raw[k] == ',') {
+                    opts.f16_weights_deny_list.emplace_back(raw.substr(start, k - start));
+                    start = k + 1;
+                }
+            }
+        }
+        else if (arg == "--prewarm") opts.prewarm_text = next("--prewarm");
+        else if (arg == "--vulkan-prefer-host-memory") opts.vulkan_env_overrides["GGML_VK_PREFER_HOST_MEMORY"]      = "1";
+        else if (arg == "--vulkan-disable-coopmat2")   opts.vulkan_env_overrides["GGML_VK_DISABLE_COOPMAT2"]        = "1";
+        else if (arg == "--vulkan-disable-bfloat16")   opts.vulkan_env_overrides["GGML_VK_DISABLE_BFLOAT16"]        = "1";
+        else if (arg == "--vulkan-perf-logger")        opts.vulkan_env_overrides["GGML_VK_PERF_LOGGER"]             = "1";
+        else if (arg == "--vulkan-async-transfer")     opts.vulkan_env_overrides["GGML_VK_ASYNC_USE_TRANSFER_QUEUE"]= "1";
+        else if (arg == "--vulkan-env") {
+            const std::string raw = next("--vulkan-env");
+            const auto eq = raw.find('=');
+            if (eq == std::string::npos || eq == 0) {
+                throw std::runtime_error("--vulkan-env expects KEY=VALUE (got: " + raw + ")");
+            }
+            opts.vulkan_env_overrides[raw.substr(0, eq)] = raw.substr(eq + 1);
+        }
         else if (arg == "--noise-npy") opts.noise_npy_path = next("--noise-npy");
+        else if (arg == "--stream-chunk-tokens") {
+            opts.stream_chunk_tokens = std::stoi(next("--stream-chunk-tokens"));
+        }
+        else if (arg == "--stream-first-chunk-tokens") {
+            opts.stream_first_chunk_tokens = std::stoi(next("--stream-first-chunk-tokens"));
+        }
+        else if (arg == "--stream-chunk-tolerance-pct") {
+            opts.stream_chunk_tolerance_pct = std::stoi(next("--stream-chunk-tolerance-pct"));
+        }
+        else if (arg == "--stream-min-chunk-tokens") {
+            opts.stream_min_chunk_tokens = std::stoi(next("--stream-min-chunk-tokens"));
+        }
         else if (arg == "-h" || arg == "--help") { usage(argv[0]); return 0; }
         else { fprintf(stderr, "unknown arg: %s\n", arg.c_str()); usage(argv[0]); return 2; }
     }
+    } catch (const std::exception & e) {
+        fprintf(stderr, "error: %s\n", e.what());
+        usage(argv[0]);
+        return 2;
+    }
     if (opts.model_gguf_path.empty() || text.empty() || out.empty()) {
         usage(argv[0]);
         return 2;
     }
     try {
-        auto result = tts_cpp::supertonic::synthesize(opts, text);
-        write_wav(out, result.pcm, result.sample_rate);
-        fprintf(stderr, "wrote %s (%.2fs @ %d Hz, %zu samples)\n",
-                out.c_str(), result.duration_s, result.sample_rate, result.pcm.size());
+        const bool streaming = opts.stream_chunk_tokens > 0;
+        const bool stdout_pcm = (out == "-");
+
+        if (!streaming) {
+            if (stdout_pcm) {
+                fprintf(stderr,
+                    "error: --out - requires --stream-chunk-tokens > 0 "
+                    "(stdout streaming is the streaming-mode output)\n");
+                return 2;
+            }
+            auto result = tts_cpp::supertonic::synthesize(opts, text);
+            write_wav(out, result.pcm, result.sample_rate);
+            fprintf(stderr, "wrote %s (%.2fs @ %d Hz, %zu samples)\n",
+                    out.c_str(), result.duration_s, result.sample_rate, result.pcm.size());
+            return 0;
+        }
+
+        // Streaming path.  Construct a persistent Engine so per-chunk
+        // synth doesn't pay GGUF load each iteration.
+        tts_cpp::supertonic::Engine engine(opts);
+        if (stdout_pcm) {
+            fprintf(stderr,
+                "streaming: emitting raw s16le PCM on stdout "
+                "(chunk target: %d text tokens; first chunk: %d; backend: %s)\n",
+                opts.stream_chunk_tokens,
+                opts.stream_first_chunk_tokens > 0
+                    ? opts.stream_first_chunk_tokens
+                    : opts.stream_chunk_tokens,
+                engine.backend_name().c_str());
+        }
+
+        // Optional per-chunk WAV dump for debugging.  When the env var
+        // SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX is set, the callback writes
+        // each chunk's PCM to "<prefix><idx>.wav" so you can play chunks
+        // individually and see which one contains a glitch.
+        const char * dump_prefix = std::getenv("SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX");
+
+        std::size_t total_samples = 0;
+        int          n_chunks      = 0;
+        auto on_chunk = [&](const float * pcm, std::size_t samples,
+                            int chunk_index, bool is_last) {
+            if (stdout_pcm) {
+                stream_emit_pcm_stdout(pcm, samples);
+            }
+            if (dump_prefix) {
+                std::string path = std::string(dump_prefix)
+                    + std::to_string(chunk_index) + ".wav";
+                std::vector<float> tmp(pcm, pcm + samples);
+                // 44.1 kHz is the Supertonic model default; the real SR
+                // comes back on the final SynthesisResult but isn't
+                // visible here.  Hard-coding here is fine for a debug
+                // dump — if a future model ships at a different SR this
+                // will be wrong, but the callback signature doesn't
+                // surface it.
+                write_wav(path, tmp, 44100);
+            }
+            total_samples += samples;
+            ++n_chunks;
+            fprintf(stderr,
+                    "chunk %d%s: %zu samples%s%s\n",
+                    chunk_index, is_last ? " (last)" : "",
+                    samples,
+                    stdout_pcm ? " -> stdout" : "",
+                    dump_prefix ? " (+ dumped)" : "");
+        };
+
+        auto result = engine.synthesize(text, on_chunk);
+
+        if (!stdout_pcm) {
+            // File mode: write the concatenated PCM as a WAV.
+            write_wav(out, result.pcm, result.sample_rate);
+            fprintf(stderr, "wrote %s (%.2fs @ %d Hz, %zu samples across %d chunks)\n",
+                    out.c_str(), result.duration_s, result.sample_rate,
+                    result.pcm.size(), n_chunks);
+        } else {
+            fprintf(stderr, "streamed %zu samples across %d chunks (%.2fs)\n",
+                    total_samples, n_chunks, result.duration_s);
+        }
         return 0;
     } catch (const std::exception & e) {
         fprintf(stderr, "error: %s\n", e.what());
diff --git a/tts-cpp/src/supertonic_duration.cpp b/tts-cpp/src/supertonic_duration.cpp
index 6e087af6e00..68825f68687 100644
--- a/tts-cpp/src/supertonic_duration.cpp
+++ b/tts-cpp/src/supertonic_duration.cpp
@@ -24,6 +24,33 @@ f32_tensor read_f32(const supertonic_model & m, const std::string & source_name)
     return out;
 }
 
+// F17 — lazy host-side cache for weights consumed by the duration
+// stage's scalar continuation.  The first call downloads the tensor with
+// `ggml_backend_tensor_get`; later calls copy the cached vector into
+// a fresh `f32_tensor`.  The vector copy on return is one host
+// memcpy (~25 µs per 256 KiB matmul weight on a modern CPU) vs.
+// the GPU→host sync it replaces (~50–100 µs on a discrete OpenCL
+// GPU).  Net 2–4× win for the matmul weights; ~50× win for the
+// small (~1 KiB) LN / bias tensors that dominate the call count.
+//
+// Returns by value to preserve the `f32_tensor` ABI the rest of
+// this TU expects.  The cache itself lives on `supertonic_model`
+// (`scalar_weight_cache`); see the doc-block in
+// supertonic_internal.h for the lifetime + thread-safety
+// contract.
+f32_tensor cached_read_f32(const supertonic_model & m, const std::string & source_name) {
+    ggml_tensor * t = require_source_tensor(m, source_name);
+    f32_tensor out;
+    for (int i = 0; i < 4; ++i) out.ne[i] = t->ne[i];
+    auto & entry = m.scalar_weight_cache[source_name]; // emplace empty on miss
+    if (entry.empty()) {
+        entry.resize((size_t) ggml_nelements(t));
+        ggml_backend_tensor_get(t, entry.data(), 0, ggml_nbytes(t));
+    }
+    out.data = entry; // one host memcpy; saved sync dwarfs it
+    return out;
+}
+
 inline float relu(float x) { return x > 0.0f ? x : 0.0f; }
 inline float gelu(float x) { return 0.5f * x * (1.0f + std::erff(x * 0.7071067811865475f)); }
 
@@ -51,7 +78,14 @@ ggml_tensor * repeat_like(ggml_context * ctx, ggml_tensor * v, ggml_tensor * lik
     if (!ggml_can_repeat(v, like)) {
         throw std::runtime_error("cannot repeat tensor in duration graph");
     }
-    return ggml_repeat(ctx, v, like);
+    // Every caller feeds this into ggml_add/ggml_mul which broadcast natively;
+    // skip the explicit ggml_repeat dispatch.
+    static const bool force_explicit_repeat =
+        std::getenv("SUPERTONIC_FORCE_EXPLICIT_REPEAT") != nullptr;
+    if (force_explicit_repeat) {
+        return ggml_repeat(ctx, v, like);
+    }
+    return v;
 }
 
 ggml_tensor * conv1d_f32(ggml_context * ctx,
@@ -60,6 +94,7 @@ ggml_tensor * conv1d_f32(ggml_context * ctx,
                          int stride,
                          int padding,
                          int dilation) {
+    // duration uses the pure-graph path unconditionally; no CPU fast path.
     ggml_tensor * im2col = ggml_im2col(ctx, kernel, input, stride, 0, padding, 0, dilation, 0, false, GGML_TYPE_F32);
     ggml_tensor * result = ggml_mul_mat(ctx,
         ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[2] * im2col->ne[1]),
@@ -68,6 +103,15 @@ ggml_tensor * conv1d_f32(ggml_context * ctx,
 }
 
 ggml_tensor * edge_clamp_pad_1d(ggml_context * ctx, ggml_tensor * x, int pad_left, int pad_right) {
+    if (pad_left == 0 && pad_right == 0) return x;
+    static const bool disable_fused_edge_pad =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_EDGE_PAD") != nullptr;
+    if (!disable_fused_edge_pad &&
+        x->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        ggml_is_contiguous(x)) {
+        return ggml_supertonic_edge_pad_1d(ctx, x, pad_left, pad_right);
+    }
     const int64_t L = x->ne[0];
     const int64_t C = x->ne[1];
     ggml_tensor * out = x;
@@ -90,6 +134,16 @@ ggml_tensor * depthwise_same_ggml(ggml_context * ctx,
                                   ggml_tensor * b,
                                   int dilation) {
     const int K = (int) w->ne[0];
+    static const bool disable_fused =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_DEPTHWISE") != nullptr;
+    if (!disable_fused && (K == 3 || K == 5) &&
+        x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 &&
+        b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 && w->ne[1] == 1 && w->ne[3] == 1 &&
+        w->ne[2] == x->ne[1] && b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(w) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_depthwise_1d(ctx, x, w, b, dilation);
+    }
     const int pad_left = ((K - 1) * dilation) / 2;
     const int pad_right = (K - 1) * dilation - pad_left;
     ggml_tensor * padded = edge_clamp_pad_1d(ctx, x, pad_left, pad_right);
@@ -101,6 +155,15 @@ ggml_tensor * depthwise_same_ggml(ggml_context * ctx,
 }
 
 ggml_tensor * layer_norm_ggml(ggml_context * ctx, ggml_tensor * x, ggml_tensor * g, ggml_tensor * b) {
+    static const bool disable_fused_layer_norm =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_LAYER_NORM") != nullptr;
+    if (!disable_fused_layer_norm &&
+        x->type == GGML_TYPE_F32 && g->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        g->ne[0] == x->ne[1] && b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(g) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_layer_norm_channel(ctx, x, g, b, 1e-6f);
+    }
     ggml_tensor * xt = ggml_cont(ctx, ggml_permute(ctx, x, 1, 0, 2, 3));
     xt = ggml_norm(ctx, xt, 1e-6f);
     xt = ggml_mul(ctx, xt, repeat_like(ctx, g, xt));
@@ -234,16 +297,18 @@ void self_attention(const supertonic_model & m, int idx, std::vector<float> & x,
     const float scale = 1.0f / std::sqrt((float) D);
     const std::string p = "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers." + std::to_string(idx);
 
-    f32_tensor q_w = read_f32(m, p + ".conv_q.weight");
-    f32_tensor q_b = read_f32(m, p + ".conv_q.bias");
-    f32_tensor k_w = read_f32(m, p + ".conv_k.weight");
-    f32_tensor k_b = read_f32(m, p + ".conv_k.bias");
-    f32_tensor v_w = read_f32(m, p + ".conv_v.weight");
-    f32_tensor v_b = read_f32(m, p + ".conv_v.bias");
-    f32_tensor o_w = read_f32(m, p + ".conv_o.weight");
-    f32_tensor o_b = read_f32(m, p + ".conv_o.bias");
-    f32_tensor rel_k = read_f32(m, p + ".emb_rel_k"); // [1, 9, D]
-    f32_tensor rel_v = read_f32(m, p + ".emb_rel_v");
+    // F17 — every read goes through the host-side scalar weight
+    // cache; only the first synth pays the backend download.
+    f32_tensor q_w = cached_read_f32(m, p + ".conv_q.weight");
+    f32_tensor q_b = cached_read_f32(m, p + ".conv_q.bias");
+    f32_tensor k_w = cached_read_f32(m, p + ".conv_k.weight");
+    f32_tensor k_b = cached_read_f32(m, p + ".conv_k.bias");
+    f32_tensor v_w = cached_read_f32(m, p + ".conv_v.weight");
+    f32_tensor v_b = cached_read_f32(m, p + ".conv_v.bias");
+    f32_tensor o_w = cached_read_f32(m, p + ".conv_o.weight");
+    f32_tensor o_b = cached_read_f32(m, p + ".conv_o.bias");
+    f32_tensor rel_k = cached_read_f32(m, p + ".emb_rel_k"); // [1, 9, D]
+    f32_tensor rel_v = cached_read_f32(m, p + ".emb_rel_v");
 
     std::vector<float> q, k, v;
     linear1x1(x, L, C, q_w, &q_b, C, q);
@@ -304,10 +369,11 @@ void self_attention(const supertonic_model & m, int idx, std::vector<float> & x,
 
 void ffn_block(const supertonic_model & m, int idx, std::vector<float> & x, int L, int C) {
     const std::string p = "duration:tts.dp.sentence_encoder.attn_encoder.ffn_layers." + std::to_string(idx);
-    f32_tensor w1 = read_f32(m, p + ".conv_1.weight");
-    f32_tensor b1 = read_f32(m, p + ".conv_1.bias");
-    f32_tensor w2 = read_f32(m, p + ".conv_2.weight");
-    f32_tensor b2 = read_f32(m, p + ".conv_2.bias");
+    // F17 — host-cached scalar weights.
+    f32_tensor w1 = cached_read_f32(m, p + ".conv_1.weight");
+    f32_tensor b1 = cached_read_f32(m, p + ".conv_1.bias");
+    f32_tensor w2 = cached_read_f32(m, p + ".conv_2.weight");
+    f32_tensor b2 = cached_read_f32(m, p + ".conv_2.bias");
     std::vector<float> y;
     linear1x1(x, L, C, w1, &b1, (int) w1.ne[2], y);
     for (float & v : y) v = relu(v);
@@ -324,6 +390,33 @@ void dense(const std::vector<float> & x, const f32_tensor & w, const f32_tensor
     }
 }
 
+// Audit finding F11 — persistent graph cache for the duration
+// sentence-encoder GGML graph.
+//
+// Before this change, `duration_sentence_proj_ggml_impl` allocated
+// a fresh `ggml_context` + `ggml_gallocr_t` on every call, then
+// freed both at the end.  The shape of the graph depends only on
+// `L = text_len + 1`; consecutive synth calls with the same text
+// length pay no graph-build cost after the first.  The lifetime
+// helpers below match the (alive-id, generation_id) safe-free
+// pattern used by the vocoder + vector estimator caches.
+struct duration_graph_cache {
+    const supertonic_model * model = nullptr;
+    uint64_t generation_id = 0;
+    int L = 0;
+    std::vector<uint8_t> buf;
+    ggml_context * ctx = nullptr;
+    ggml_cgraph * gf = nullptr;
+    ggml_gallocr_t allocr = nullptr;
+    ggml_tensor * in = nullptr;
+};
+
+inline void free_duration_graph_cache(duration_graph_cache & cache) {
+    supertonic_safe_gallocr_free(cache.allocr, cache.generation_id);
+    if (cache.ctx) ggml_free(cache.ctx);
+    cache = {};
+}
+
 } // namespace
 
 bool supertonic_duration_forward_cpu(const supertonic_model & model,
@@ -513,47 +606,66 @@ static bool duration_sentence_proj_ggml_impl(const supertonic_model & model,
         push_trace(*scalar_trace, "duration_pred0_no_style", 1, 128, h);
         }
 
-        constexpr int MAX_NODES = 512;
-        static size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
-                                 ggml_graph_overhead_custom(MAX_NODES, false);
-        thread_local std::vector<uint8_t> buf(buf_size);
-        ggml_init_params gp = { buf_size, buf.data(), true };
-        ggml_context * ctx = ggml_init(gp);
-        ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
-
-        ggml_tensor * in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(in, "duration_embed"); ggml_set_input(in);
-        ggml_tensor * y = in;
-        for (int i = 0; i < 6; ++i) {
-            const std::string p = "duration:tts.dp.sentence_encoder.convnext.convnext." + std::to_string(i);
-            y = duration_convnext_ggml(ctx, model, p, y);
-            const std::string name = "duration_convnext" + std::to_string(i);
-            ggml_set_name(y, name.c_str()); ggml_set_output(y);
-            ggml_build_forward_expand(gf, y);
-        }
-        ggml_tensor * q = conv1d_f32(ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_q.weight"), y, 1, 0, 1);
-        q = ggml_add(ctx, q, repeat_like(ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_q.bias"), q));
-        ggml_set_name(q, "duration_attn0_q"); ggml_set_output(q); ggml_build_forward_expand(gf, q);
-        ggml_tensor * k = conv1d_f32(ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_k.weight"), y, 1, 0, 1);
-        k = ggml_add(ctx, k, repeat_like(ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_k.bias"), k));
-        ggml_set_name(k, "duration_attn0_k"); ggml_set_output(k); ggml_build_forward_expand(gf, k);
-        ggml_tensor * v = conv1d_f32(ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_v.weight"), y, 1, 0, 1);
-        v = ggml_add(ctx, v, repeat_like(ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_v.bias"), v));
-        ggml_set_name(v, "duration_attn0_v"); ggml_set_output(v); ggml_build_forward_expand(gf, v);
-
-        ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
-        if (!allocr) {
-            ggml_free(ctx);
-            throw std::runtime_error("ggml_gallocr_new duration failed");
-        }
-        if (!ggml_gallocr_reserve(allocr, gf)) {
-            ggml_gallocr_free(allocr);
-            ggml_free(ctx);
-            throw std::runtime_error("ggml_gallocr_reserve duration failed");
+        // F11 — cached duration graph.  Key is (model, generation_id, L);
+        // consecutive synth calls with the same text_len skip the
+        // graph rebuild (~200 nodes) + gallocr_new + reserve cycle.
+        // Lifetime: `free_duration_graph_cache` consults the alive-id
+        // registry to skip `gallocr_free` against a backend that's
+        // already been torn down, same pattern as the other stages.
+        thread_local duration_graph_cache cache;
+        if (cache.model != &model || cache.generation_id != model.generation_id ||
+            cache.L != L) {
+            free_duration_graph_cache(cache);
+            cache.model = &model;
+            cache.generation_id = model.generation_id;
+            cache.L = L;
+
+            constexpr int MAX_NODES = 512;
+            const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                                    ggml_graph_overhead_custom(MAX_NODES, false);
+            cache.buf.assign(buf_size, 0);
+            ggml_init_params gp = { buf_size, cache.buf.data(), true };
+            cache.ctx = ggml_init(gp);
+            cache.gf = ggml_new_graph_custom(cache.ctx, MAX_NODES, false);
+
+            cache.in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, C);
+            ggml_set_name(cache.in, "duration_embed"); ggml_set_input(cache.in);
+            ggml_tensor * y = cache.in;
+            for (int i = 0; i < 6; ++i) {
+                const std::string p = "duration:tts.dp.sentence_encoder.convnext.convnext." + std::to_string(i);
+                y = duration_convnext_ggml(cache.ctx, model, p, y);
+                const std::string name = "duration_convnext" + std::to_string(i);
+                ggml_set_name(y, name.c_str()); ggml_set_output(y);
+                ggml_build_forward_expand(cache.gf, y);
+            }
+            ggml_tensor * q = conv1d_f32(cache.ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_q.weight"), y, 1, 0, 1);
+            q = ggml_add(cache.ctx, q, repeat_like(cache.ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_q.bias"), q));
+            ggml_set_name(q, "duration_attn0_q"); ggml_set_output(q); ggml_build_forward_expand(cache.gf, q);
+            ggml_tensor * k = conv1d_f32(cache.ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_k.weight"), y, 1, 0, 1);
+            k = ggml_add(cache.ctx, k, repeat_like(cache.ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_k.bias"), k));
+            ggml_set_name(k, "duration_attn0_k"); ggml_set_output(k); ggml_build_forward_expand(cache.gf, k);
+            ggml_tensor * v = conv1d_f32(cache.ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_v.weight"), y, 1, 0, 1);
+            v = ggml_add(cache.ctx, v, repeat_like(cache.ctx, require_source_tensor(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_v.bias"), v));
+            ggml_set_name(v, "duration_attn0_v"); ggml_set_output(v); ggml_build_forward_expand(cache.gf, v);
+
+            cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+            if (!cache.allocr) {
+                ggml_free(cache.ctx);
+                cache = {};
+                throw std::runtime_error("ggml_gallocr_new duration failed");
+            }
+            if (!ggml_gallocr_reserve(cache.allocr, cache.gf)) {
+                ggml_gallocr_free(cache.allocr);
+                ggml_free(cache.ctx);
+                cache = {};
+                throw std::runtime_error("ggml_gallocr_reserve duration failed");
+            }
+            ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
         }
-        ggml_gallocr_alloc_graph(allocr, gf);
+        ggml_cgraph * gf = cache.gf;
+
         std::vector<float> x_raw = pack_time_channel_for_ggml(x, L, C);
-        ggml_backend_tensor_set(in, x_raw.data(), 0, x_raw.size()*sizeof(float));
+        ggml_backend_tensor_set(cache.in, x_raw.data(), 0, x_raw.size()*sizeof(float));
         supertonic_graph_compute(model, gf);
 
         PUSH_DURATION_GGML({"duration_embed", {L, C}, x});
@@ -574,8 +686,9 @@ static bool duration_sentence_proj_ggml_impl(const supertonic_model & model,
         std::vector<float> v0_g = tensor_to_time_channel(ggml_graph_get_tensor(gf, "duration_attn0_v"));
         const int H = 2, D = C / H, half_window = 4;
         const float scale = 1.0f / std::sqrt((float)D);
-        f32_tensor rel_k = read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_k");
-        f32_tensor rel_v = read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_v");
+        // F17 — host-cached scalar weights (relpos K/V embeddings).
+        f32_tensor rel_k = cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_k");
+        f32_tensor rel_v = cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_v");
         std::vector<float> out((size_t)L*C, 0.0f), scores(L), probs(L);
         for (int h = 0; h < H; ++h) {
             for (int qi = 0; qi < L; ++qi) {
@@ -611,8 +724,9 @@ static bool duration_sentence_proj_ggml_impl(const supertonic_model & model,
                 }
             }
         }
-        f32_tensor o_w = read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_o.weight");
-        f32_tensor o_b = read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_o.bias");
+        // F17 — host-cached.
+        f32_tensor o_w = cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_o.weight");
+        f32_tensor o_b = cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_o.bias");
         std::vector<float> proj;
         linear1x1(out, L, C, o_w, &o_b, C, proj);
         PUSH_DURATION_GGML({"duration_attn0_out", {L, C}, proj});
@@ -620,20 +734,22 @@ static bool duration_sentence_proj_ggml_impl(const supertonic_model & model,
         std::vector<float> attn_res = proj;
         for (size_t i = 0; i < attn_res.size(); ++i) attn_res[i] += conv_out[i];
         PUSH_DURATION_GGML({"duration_attn0_residual", {L, C}, attn_res});
+        // F17 — host-cached LN weights.
         layer_norm_channel(
             attn_res, L, C,
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.0.norm.weight"),
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.0.norm.bias"));
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.0.norm.weight"),
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.0.norm.bias"));
         PUSH_DURATION_GGML({"duration_attn0_norm", {L, C}, attn_res});
         std::vector<float> ffn0_g = attn_res;
         ffn_block(model, 0, ffn0_g, L, C);
         PUSH_DURATION_GGML({"duration_ffn0_out", {L, C}, ffn0_g});
         for (size_t i = 0; i < ffn0_g.size(); ++i) ffn0_g[i] += attn_res[i];
         PUSH_DURATION_GGML({"duration_ffn0_residual", {L, C}, ffn0_g});
+        // F17 — host-cached.
         layer_norm_channel(
             ffn0_g, L, C,
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.0.norm.weight"),
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.0.norm.bias"));
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.0.norm.weight"),
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.0.norm.bias"));
         PUSH_DURATION_GGML({"duration_ffn0_norm", {L, C}, ffn0_g});
 
         std::vector<float> attn1_g = ffn0_g;
@@ -641,28 +757,31 @@ static bool duration_sentence_proj_ggml_impl(const supertonic_model & model,
         PUSH_DURATION_GGML({"duration_attn1_out", {L, C}, attn1_g});
         for (size_t i = 0; i < attn1_g.size(); ++i) attn1_g[i] += ffn0_g[i];
         PUSH_DURATION_GGML({"duration_attn1_residual", {L, C}, attn1_g});
+        // F17 — host-cached.
         layer_norm_channel(
             attn1_g, L, C,
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.1.norm.weight"),
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.1.norm.bias"));
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.1.norm.weight"),
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.1.norm.bias"));
         PUSH_DURATION_GGML({"duration_attn1_norm", {L, C}, attn1_g});
         std::vector<float> ffn1_g = attn1_g;
         ffn_block(model, 1, ffn1_g, L, C);
         PUSH_DURATION_GGML({"duration_ffn1_out", {L, C}, ffn1_g});
         for (size_t i = 0; i < ffn1_g.size(); ++i) ffn1_g[i] += attn1_g[i];
         PUSH_DURATION_GGML({"duration_ffn1_residual", {L, C}, ffn1_g});
+        // F17 — host-cached.
         layer_norm_channel(
             ffn1_g, L, C,
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.1.norm.weight"),
-            read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.1.norm.bias"));
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.1.norm.weight"),
+            cached_read_f32(model, "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_2.1.norm.bias"));
         PUSH_DURATION_GGML({"duration_ffn1_norm", {L, C}, ffn1_g});
         for (size_t i = 0; i < ffn1_g.size(); ++i) ffn1_g[i] += conv_out[i];
         PUSH_DURATION_GGML({"duration_encoder_out", {L, C}, ffn1_g});
         std::vector<float> sentence_repr_g(C);
         for (int c = 0; c < C; ++c) sentence_repr_g[c] = ffn1_g[c];
         std::vector<float> projected_g;
+        // F17 — host-cached.
         linear1x1(sentence_repr_g, 1, C,
-                  read_f32(model, "duration:tts.dp.sentence_encoder.proj_out.net.weight"),
+                  cached_read_f32(model, "duration:tts.dp.sentence_encoder.proj_out.net.weight"),
                   nullptr, C, projected_g);
         if (sentence_proj_out) *sentence_proj_out = projected_g;
         PUSH_DURATION_GGML({"duration_sentence_proj", {1, C}, projected_g});
@@ -670,13 +789,13 @@ static bool duration_sentence_proj_ggml_impl(const supertonic_model & model,
         for (int c = 0; c < C; ++c) combined_g[c] = projected_g[c];
         for (int i = 0; i < 128; ++i) combined_g[C + i] = 0.0f;
         std::vector<float> h_g;
+        // F17 — host-cached.
         dense(combined_g,
-              read_f32(model, "duration:tts.dp.predictor.layers.0.weight"),
-              read_f32(model, "duration:tts.dp.predictor.layers.0.bias"),
+              cached_read_f32(model, "duration:tts.dp.predictor.layers.0.weight"),
+              cached_read_f32(model, "duration:tts.dp.predictor.layers.0.bias"),
               192, 128, h_g);
         PUSH_DURATION_GGML({"duration_pred0_no_style", {1, 128}, h_g});
-        ggml_gallocr_free(allocr);
-        ggml_free(ctx);
+        // F11: ctx + allocr live in `cache` and survive across synths.
         if (error) error->clear();
 #undef PUSH_DURATION_GGML
         return true;
@@ -695,6 +814,7 @@ bool supertonic_duration_trace_ggml(const supertonic_model & model,
                                     bool include_scalar_trace,
                                     bool include_ggml_trace,
                                     std::vector<float> * sentence_proj_out) {
+    supertonic_op_dispatch_scope dispatch(model);
     return duration_sentence_proj_ggml_impl(model, text_ids, text_len, &scalar_trace, &ggml_trace,
                                            error, include_scalar_trace, include_ggml_trace,
                                            sentence_proj_out);
@@ -706,6 +826,7 @@ bool supertonic_duration_forward_ggml(const supertonic_model & model,
                                       const float * style_dp,
                                       float & duration_out,
                                       std::string * error) {
+    supertonic_op_dispatch_scope dispatch(model);
     try {
         std::vector<supertonic_trace_tensor> scalar;
         std::vector<supertonic_trace_tensor> ggml;
@@ -717,16 +838,18 @@ bool supertonic_duration_forward_ggml(const supertonic_model & model,
         for (int c = 0; c < 64; ++c) combined[c] = projected[c];
         for (int i = 0; i < 128; ++i) combined[64 + i] = style_dp[i];
         std::vector<float> h;
+        // F17 — host-cached predictor weights.  Style is per-call
+        // input data, not a backend weight, so it stays uncached.
         dense(combined,
-              read_f32(model, "duration:tts.dp.predictor.layers.0.weight"),
-              read_f32(model, "duration:tts.dp.predictor.layers.0.bias"),
+              cached_read_f32(model, "duration:tts.dp.predictor.layers.0.weight"),
+              cached_read_f32(model, "duration:tts.dp.predictor.layers.0.bias"),
               192, 128, h);
-        float prelu = read_f32(model, "duration:tts.dp.predictor.activation.weight").data[0];
+        float prelu = cached_read_f32(model, "duration:tts.dp.predictor.activation.weight").data[0];
         for (float & v : h) if (v < 0.0f) v *= prelu;
         std::vector<float> out;
         dense(h,
-              read_f32(model, "duration:tts.dp.predictor.layers.1.weight"),
-              read_f32(model, "duration:tts.dp.predictor.layers.1.bias"),
+              cached_read_f32(model, "duration:tts.dp.predictor.layers.1.weight"),
+              cached_read_f32(model, "duration:tts.dp.predictor.layers.1.bias"),
               128, 1, out);
         duration_out = std::exp(out[0]);
         if (error) error->clear();
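
The F11 change above keys a `thread_local` cache on (model, generation_id, L) and rebuilds only on a mismatch. A minimal standalone sketch of that pattern follows. `toy_model`, `toy_graph_cache`, and `get_cache` are hypothetical stand-ins, not the real supertonic types, and a `std::vector<float>` stands in for the ggml context, graph, and allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for supertonic_model; only the cache key matters here.
struct toy_model {
    std::uint64_t generation_id = 0;
};

// Hypothetical stand-in for duration_graph_cache.
struct toy_graph_cache {
    const toy_model * model = nullptr;
    std::uint64_t generation_id = 0;
    int L = -1;
    std::vector<float> scratch;  // stands in for ctx + graph + allocator
    int rebuilds = 0;            // counts how often the slow path ran
};

// Returns the per-thread cache, rebuilding only when the key changes.
toy_graph_cache & get_cache(const toy_model & m, int L) {
    assert(L > 0);
    thread_local toy_graph_cache cache;
    const bool hit = cache.model == &m &&
                     cache.generation_id == m.generation_id &&
                     cache.L == L;
    if (!hit) {
        // Build into a local first; if this throws, the old key and
        // contents are left untouched.
        std::vector<float> fresh((std::size_t) L, 0.0f);
        cache.scratch       = std::move(fresh);
        cache.model         = &m;
        cache.generation_id = m.generation_id;
        cache.L             = L;
        ++cache.rebuilds;
    }
    return cache;
}
```

The sketch commits the key only after the rebuild succeeds, so a throwing build cannot leave a matching key next to a half-built graph. Whether the real code needs the same ordering depends on whether its builders can throw after the key fields are assigned.
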
diff --git a/tts-cpp/src/supertonic_engine.cpp b/tts-cpp/src/supertonic_engine.cpp
index cc87c09e084..4ab761c0133 100644
--- a/tts-cpp/src/supertonic_engine.cpp
+++ b/tts-cpp/src/supertonic_engine.cpp
@@ -2,13 +2,21 @@
 #include "tts-cpp/supertonic/engine.h"
 
 #include "backend_selection.h"
+#include "supertonic_chunker.h"
 #include "supertonic_internal.h"
 #include "npy.h"
+// The Vulkan adapter description in `backend_name()` is now resolved
+// through the registry API (`ggml_backend_get_device` +
+// `ggml_backend_dev_description`), so no per-backend header include
+// is needed.  Other call sites made the same change to drop the
+// hard dependency on `ggml-vulkan.h` under `GGML_BACKEND_DL=ON`.
 
 #include <atomic>
 #include <cmath>
-#include <cstring>
 #include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
 #include <filesystem>
 #include <stdexcept>
 
@@ -108,12 +116,51 @@ class numpy_random_state {
     }
 };
 
+// Heuristic: does this chunk end at a natural sentence terminator?
+// Used by streaming to decide whether to skip the auto-appended period
+// (continuation chunks) or keep it (complete-sentence chunks).  Commas
+// and other clause punctuation are NOT counted here — chunks ending in
+// a comma still want is_continuation=true so the model hears them as
+// a continuation, not a mini-sentence.
+//
+// Trims trailing whitespace, then decodes the final UTF-8 code point
+// and delegates to the chunker's `is_sentence_end_cp` so the
+// terminator table is defined in exactly one place (see
+// supertonic_chunker.cpp).
+bool chunk_ends_with_sentence_term(const std::string & s) {
+    size_t i = s.size();
+    while (i > 0 && (s[i - 1] == ' ' || s[i - 1] == '\t' ||
+                     s[i - 1] == '\n' || s[i - 1] == '\r')) --i;
+    if (i == 0) return false;
+    // Walk back to the leading byte of the final UTF-8 sequence.
+    size_t pos = i - 1;
+    while (pos > 0 && ((uint8_t) s[pos] & 0xC0) == 0x80) --pos;
+    const size_t bytes = i - pos;
+    uint32_t cp = 0;
+    if      (bytes == 1) cp = (uint8_t) s[pos];
+    else if (bytes == 2) cp = ((s[pos] & 0x1F) << 6) | (s[pos + 1] & 0x3F);
+    else if (bytes == 3) cp = ((s[pos] & 0x0F) << 12) |
+                              ((s[pos + 1] & 0x3F) << 6) |
+                              (s[pos + 2] & 0x3F);
+    else if (bytes == 4) cp = ((s[pos] & 0x07) << 18) |
+                              ((s[pos + 1] & 0x3F) << 12) |
+                              ((s[pos + 2] & 0x3F) << 6) |
+                              (s[pos + 3] & 0x3F);
+    return detail::is_sentence_end_cp(cp);
+}
+
 } // namespace
 
 struct Engine::Impl {
     EngineOptions    opts;
     supertonic_model model;
     std::atomic<bool> cancel_flag{false};
+    // QVAC-18605 round 7 — voice ttl/dp host cache.  Populated
+    // lazily on first `synthesize()` call per voice; subsequent
+    // calls hit the cache and skip the GPU→host download (2 sync
+    // points per call eliminated on Vulkan / OpenCL).  See the
+    // contract on `voice_host_cache` in supertonic_internal.h.
+    voice_host_cache voices_host;
 
     explicit Impl(const EngineOptions & o)
         : opts(o) {
@@ -123,7 +170,6 @@ struct Engine::Impl {
         if (!std::filesystem::exists(opts.model_gguf_path)) {
             throw std::runtime_error(supertonic_setup_hint(opts.model_gguf_path));
         }
-
         // Wire backends_dir + opencl_cache_dir BEFORE any backend
         // init. First-Engine-wins across the whole process; second
         // and later Engines reuse the already-loaded registry. See
@@ -135,13 +181,114 @@ struct Engine::Impl {
             ::tts_cpp::detail::set_opencl_cache_dir(opts.opencl_cache_dir);
         }
 
-        if (!load_supertonic_gguf(opts.model_gguf_path, model, opts.n_gpu_layers, false)) {
+        // Map the public Precision enum onto the internal one (separate
+        // declaration so the engine header doesn't pull in internal.h).
+        supertonic_precision internal_precision = supertonic_precision::F32;
+        switch (opts.precision) {
+            case Precision::F32:  internal_precision = supertonic_precision::F32;  break;
+            case Precision::F16:  internal_precision = supertonic_precision::F16;  break;
+            case Precision::Q8_0: internal_precision = supertonic_precision::Q8_0; break;
+        }
+        // QVAC-18605 round 7 — apply Vulkan env-var overrides
+        // BEFORE `load_supertonic_gguf` (which calls
+        // `init_supertonic_backend`).  ggml-vulkan reads its
+        // GGML_VK_* env vars at backend init, so the overrides
+        // need to land in the environment before that point.
+        // Throws on any key without `GGML_VK_` prefix (operator-
+        // config typo guard); the throw propagates up to the
+        // caller (no model loaded yet, no cleanup needed).
+        apply_vulkan_env_overrides(opts.vulkan_env_overrides);
+        if (!load_supertonic_gguf(opts.model_gguf_path, model,
+                                  opts.n_gpu_layers, /*verbose=*/false,
+                                  opts.f16_weights, internal_precision,
+                                  opts.vulkan_device,
+                                  opts.f16_weights_deny_list)) {
             throw std::runtime_error("Supertonic Engine: failed to load GGUF: " +
                                      opts.model_gguf_path);
         }
         try {
             supertonic_set_n_threads(model, opts.n_threads);
 
+            // F16 K/V attention dispatch: auto-enable on GPU backends,
+            // disable on CPU; user can override either way.  Captured
+            // into the model so supertonic_op_dispatch_scope picks it
+            // up on every synthesize() call.  See model.use_f16_attn
+            // in supertonic_internal.h.
+            //
+            // QVAC-18605 — auto-policy is now backend-capability-gated.
+            // Probes `ggml_backend_supports_op` for a Supertonic-
+            // shaped F16-K/V flash_attn graph node before flipping
+            // the flag.  A backend that compiles `flash_attn_ext`
+            // but rejects the F16 K/V variant for our shape (head_dim
+            // = 64, n_heads = 4) keeps the F32 path — slower but
+            // guaranteed to not crash at first synth call.  Manual
+            // override via `--f16-attn 1` still forces dispatch
+            // (useful for debug-shim backends).
+            if (opts.f16_attn < 0) {
+                model.use_f16_attn = !model.backend_is_cpu &&
+                                     supertonic_backend_supports_f16_kv_flash_attn(model.backend);
+            } else {
+                model.use_f16_attn = opts.f16_attn != 0;
+            }
+
+            // QVAC-18605 round 4 — multi-dtype K/V dispatch resolution.
+            //
+            // Layered ON TOP of the round-1 `use_f16_attn` boolean:
+            // when `opts.kv_attn_type == -1` (the default), the
+            // resolver falls back to the boolean's value, so every
+            // existing operator config sees zero behaviour change.
+            //
+            // When the operator opts in to a non-default dtype, the
+            // resolved enum drives the vector-estimator dispatch
+            // and the boolean is updated to mirror the F16 case
+            // (so any external code still keying on the boolean
+            // — currently none in tree but kept for forward-compat
+            // — stays consistent).  Out-of-range opts.kv_attn_type
+            // throws inside the resolver; we let the throw
+            // propagate up to the Engine ctor (which already wraps
+            // the body in try/catch and frees the model).
+            //
+            // Probes are advisory: an explicit BF16 / Q8_0 request
+            // on an adapter that doesn't support it falls back to
+            // F32 — same advisory-probe pattern as the round-1
+            // F16 auto-policy fallback above.
+            //
+            // PR #18 reviewer (Omar) follow-up: the silent
+            // fallback hid the downgrade from operators.  Someone
+            // pinning `--kv-attn-type bf16` in their production
+            // config on a mixed fleet (some adapters support
+            // BF16 K/V, some don't) would silently get F32 on
+            // the unsupported subset.  The resolver's
+            // `out_was_downgraded` out-param surfaces the
+            // explicit-request + probe-rejected case so we can
+            // emit a one-line stderr warning.  The auto path stays
+            // silent: the operator didn't ask for a specific
+            // dtype, so there's nothing to surprise them with.
+            bool kv_dtype_downgraded = false;
+            model.kv_attn_type = resolve_kv_attn_type(
+                opts.kv_attn_type,
+                model.use_f16_attn,
+                supertonic_backend_supports_f16_kv_flash_attn(model.backend),
+                supertonic_backend_supports_bf16_kv_flash_attn(model.backend),
+                supertonic_backend_supports_q8_0_kv_flash_attn(model.backend),
+                &kv_dtype_downgraded);
+            if (kv_dtype_downgraded) {
+                static const char * const kv_label[] = {
+                    "f32", "f16", "bf16", "q8_0"
+                };
+                std::fprintf(stderr,
+                    "supertonic: warning: requested --kv-attn-type %s but the "
+                    "resolved backend's flash-attn probe rejected it; falling "
+                    "back to f32 (set --kv-attn-type auto to silence)\n",
+                    (opts.kv_attn_type >= 0 && opts.kv_attn_type <= 3)
+                        ? kv_label[opts.kv_attn_type] : "?");
+            }
+            // Keep the boolean consistent with the resolved enum.
+            // No-op for the default `kv_attn_type == -1` path (the
+            // resolver already mirrors the boolean).  Becomes a
+            // no-op for explicit `--kv-attn-type 1` too.
+            model.use_f16_attn = (model.kv_attn_type == kv_attn_dtype::f16);
+
             // Validate voice up front so we throw at construction
             // rather than mid-synthesize().
             const std::string voice = opts.voice.empty()
@@ -150,6 +297,20 @@ struct Engine::Impl {
             if (model.voices.find(voice) == model.voices.end()) {
                 throw std::runtime_error("Supertonic Engine: unknown voice: " + voice);
             }
+
+            // QVAC-18605 follow-up — opt-in first-synth pre-warm.
+            // Skipped on CPU (no shader-compile cost to amortise)
+            // and on empty `prewarm_text` (the caller didn't ask).
+            // On Vulkan / OpenCL this runs one throwaway synth to
+            // force every per-stage graph cache to populate and
+            // every shader pipeline to compile, so the first
+            // operator-visible `synthesize()` call hits steady-
+            // state latency instead of paying the cold-start cost
+            // (roughly hundreds of ms, as measured in the
+            // chatterbox PROGRESS.md on Adreno + RADV).
+            if (!opts.prewarm_text.empty() && !model.backend_is_cpu) {
+                synthesize(opts.prewarm_text);  // discard result
+            }
         } catch (...) {
             free_supertonic_model(model);
             throw;
@@ -163,11 +324,14 @@ struct Engine::Impl {
     Impl(const Impl &)             = delete;
     Impl & operator=(const Impl &) = delete;
 
-    SynthesisResult synthesize(const std::string & text) {
-        if (text.empty()) {
-            throw std::runtime_error("Supertonic Engine: text is empty");
-        }
-
+    // Single-chunk synthesis worker.  Runs the full Supertonic pipeline
+    // (preprocess → duration → noise → text encoder → vector estimator
+    // CFM loop → vocoder) on `text` with the given seed.  When
+    // `is_continuation` is true the preprocess skips the auto-appended
+    // terminal period — used by streaming for mid-utterance chunks so
+    // the model isn't told "this is a complete sentence" when it isn't.
+    SynthesisResult run_single_chunk(const std::string & text, int seed,
+                                     bool is_continuation = false) {
         const std::string voice = opts.voice.empty()
             ? model.hparams.default_voice
             : opts.voice;
@@ -182,13 +346,22 @@ struct Engine::Impl {
             // construction (not currently supported but guard anyway).
             throw std::runtime_error("Supertonic Engine: unknown voice: " + voice);
         }
-        std::vector<float> style_ttl = read_tensor_f32(vit->second.ttl);
-        std::vector<float> style_dp  = read_tensor_f32(vit->second.dp);
+        // QVAC-18605 round 7 — `voices_host.get_or_load` returns
+        // a stable reference into the per-engine cache.  First
+        // call per voice does the 2 GPU→host downloads + caches;
+        // subsequent calls return the cached entry without
+        // touching the backend.  Pointers + size below are valid
+        // for the duration of this `run_single_chunk()` call (the
+        // cache is never `clear()`ed during synthesis).
+        const auto & voice_entry = voices_host.get_or_load(voice, vit->second.ttl, vit->second.dp);
+        const float * style_ttl  = voice_entry.ttl.data();
+        const float * style_dp   = voice_entry.dp.data();
 
         std::vector<int32_t> text_ids_i32;
         std::string normalized;
         std::string error;
-        if (!supertonic_text_to_ids(model, text, opts.language, text_ids_i32, &normalized, &error)) {
+        if (!supertonic_text_to_ids(model, text, opts.language, text_ids_i32,
+                                    &normalized, &error, is_continuation)) {
             throw std::runtime_error("Supertonic Engine: text preprocessing failed: " + error);
         }
         std::vector<int64_t> text_ids(text_ids_i32.begin(), text_ids_i32.end());
@@ -199,7 +372,7 @@ struct Engine::Impl {
 
         float duration_raw = 0.0f;
         if (!supertonic_duration_forward_ggml(model, text_ids.data(), (int) text_ids.size(),
-                                              style_dp.data(), duration_raw, &error)) {
+                                              style_dp, duration_raw, &error)) {
             throw std::runtime_error("Supertonic Engine: duration failed: " + error);
         }
         const float duration_s  = duration_raw / speed;
@@ -221,7 +394,7 @@ struct Engine::Impl {
             latent.resize(noise.n_elements());
             std::memcpy(latent.data(), npy_as_f32(noise), latent.size() * sizeof(float));
         } else {
-            numpy_random_state rng((uint32_t) opts.seed);
+            numpy_random_state rng((uint32_t) seed);
             latent.assign((size_t) model.hparams.latent_channels * latent_len, 0.0f);
             for (float & v : latent) v = rng.standard_normal();
         }
@@ -232,26 +405,38 @@ struct Engine::Impl {
 
         std::vector<float> text_emb;
         if (!supertonic_text_encoder_forward_ggml(model, text_ids.data(), (int) text_ids.size(),
-                                                  style_ttl.data(), text_emb, &error)) {
+                                                  style_ttl, text_emb, &error)) {
             throw std::runtime_error("Supertonic Engine: text encoder failed: " + error);
         }
 
         std::vector<float> latent_mask((size_t) latent_len, 1.0f);
 
-        std::vector<float> next;
-        for (int step = 0; step < steps; ++step) {
-            if (cancel_flag.load(std::memory_order_acquire)) {
-                throw std::runtime_error("Supertonic Engine: cancelled at vector step "
-                                         + std::to_string(step));
-            }
-            if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
-                                             text_emb.data(), (int) text_ids.size(),
-                                             style_ttl.data(), latent_mask.data(),
-                                             step, steps, next, &error)) {
-                throw std::runtime_error("Supertonic Engine: vector estimator failed: " + error);
-            }
-            latent.swap(next);
+        // Master's CFM loop unrolling (Phase A1+A2) replaced the
+        // round-7 per-step `supertonic_vector_step_ggml` loop with
+        // a single `supertonic_vector_loop_ggml` call below.  The
+        // per-step cancellation hook from round 7 collapses into
+        // this single check before the vector estimator, so cancel
+        // granularity moves from per-step to per-synth on the GPU
+        // path; the CPU path's per-step fallback inside
+        // `supertonic_vector_loop_ggml` retains finer cancellation.
+        if (cancel_flag.load(std::memory_order_acquire)) {
+            throw std::runtime_error("Supertonic Engine: cancelled before vector estimator");
+        }
+        // Phase A1+A2: run all CFM steps as ONE ggml graph on non-CPU
+        // backends.  Latent flows step-to-step in GPU memory; on CPU this
+        // falls back to a per-step loop over `supertonic_vector_step_ggml`.
+        // Override via SUPERTONIC_DISABLE_LOOP_GRAPH=1.
+        // NOTE: cancellation granularity is now per-synth on the GPU path
+        // (worst-case cancel latency = whole CFM loop).  CPU keeps per-step
+        // cancellation via the fallback.
+        std::vector<float> final_latent;
+        if (!supertonic_vector_loop_ggml(model, latent.data(), latent_len,
+                                          text_emb.data(), (int) text_ids.size(),
+                                          style_ttl, latent_mask.data(),
+                                          steps, final_latent, &error)) {
+            throw std::runtime_error("Supertonic Engine: vector estimator failed: " + error);
         }
+        latent = std::move(final_latent);
 
         if (cancel_flag.load(std::memory_order_acquire)) {
             throw std::runtime_error("Supertonic Engine: cancelled before vocoder");
@@ -270,12 +455,151 @@ struct Engine::Impl {
         return result;
     }
 
+    SynthesisResult synthesize(const std::string & text) {
+        if (text.empty()) {
+            throw std::runtime_error("Supertonic Engine: text is empty");
+        }
+        return run_single_chunk(text, opts.seed);
+    }
+
+    // Streaming path: chunk text via the multilingual splitter, run the
+    // full per-chunk pipeline, apply an anti-click raised-cosine fade
+    // across inter-chunk seams, invoke `on_chunk` synchronously, and
+    // accumulate the full PCM in the returned result (callback is an
+    // *addition*, not a replacement — matches Chatterbox semantics).
+    SynthesisResult synthesize_streaming(const std::string & text,
+                                         const StreamCallback & on_chunk) {
+        if (text.empty()) {
+            throw std::runtime_error("Supertonic Engine: text is empty");
+        }
+
+        std::vector<std::string> chunks = detail::split_for_streaming(
+            text,
+            opts.stream_chunk_tokens,
+            opts.stream_first_chunk_tokens,
+            opts.stream_chunk_tolerance_pct,
+            opts.stream_min_chunk_tokens);
+
+        if (chunks.empty()) {
+            throw std::runtime_error("Supertonic Engine: chunker produced no chunks");
+        }
+
+        // Optional chunk-boundary trace for debugging the multilingual
+        // splitter.  Off by default; opt-in via env var so production
+        // synthesis isn't slowed by stderr writes.
+        if (const char * env = std::getenv("SUPERTONIC_LOG_CHUNKS"); env && env[0] == '1') {
+            for (size_t i = 0; i < chunks.size(); ++i) {
+                std::fprintf(stderr, "chunk[%zu] (%zu bytes): %s\n",
+                             i, chunks[i].size(), chunks[i].c_str());
+            }
+        }
+
+        SynthesisResult full;
+        full.duration_s = 0.0f;
+
+        const int n_chunks = (int) chunks.size();
+        for (int k = 0; k < n_chunks; ++k) {
+            if (cancel_flag.load(std::memory_order_acquire)) {
+                throw std::runtime_error(
+                    "Supertonic Engine: cancelled during streaming chunk "
+                    + std::to_string(k));
+            }
+
+            // Use opts.seed for every chunk.  Each chunk has a different
+            // predicted latent_len (driven by its own text and duration
+            // model), so the RNG produces different-length noise tensors
+            // for each chunk even with the same seed — there's no risk
+            // of identical starting noise across chunks.  An earlier
+            // version perturbed the seed per chunk (opts.seed + k) as
+            // a defensive measure, but that landed some chunks on
+            // nearby seeds where the model produces phantom phoneme
+            // artifacts ("park.K" tail).  Keeping the user's chosen
+            // seed across chunks gives consistent, controllable output.
+            //
+            // is_continuation: chunks that DON'T end on a natural
+            // sentence terminator (.?! and the CJK / Devanagari / Urdu
+            // equivalents) need preprocess to skip the auto-appended
+            // period.  Otherwise the model hears the stub as a complete
+            // sentence with falling intonation + trailing artifacts —
+            // the failure mode that originally restricted us to
+            // sentence-only chunking.  With the flag, mid-clause /
+            // mid-word chunk endings flow through with their natural
+            // (un-punctuated) tail so the model treats them as a
+            // continuation.
+            const bool is_continuation = !chunk_ends_with_sentence_term(chunks[k]);
+            if (const char * env = std::getenv("SUPERTONIC_LOG_CHUNKS");
+                env && env[0] == '1') {
+                std::fprintf(stderr, "chunk[%d] is_continuation=%d\n",
+                             k, (int) is_continuation);
+            }
+            SynthesisResult chunk_res = run_single_chunk(chunks[k], opts.seed,
+                                                         is_continuation);
+
+            // Anti-click raised-cosine fade across inter-chunk seams.
+            // Without HiFT cache continuity (Supertonic runs each chunk
+            // as a fresh independent pipeline), plain concatenation can
+            // produce a faint click at the boundary.  ~10 ms is enough
+            // to hide the click without audibly attenuating speech.
+            // Applied to the start of every non-first chunk and the end
+            // of every non-last chunk.  The very-first chunk start and
+            // very-last chunk end are left untouched so the streamed
+            // output is acoustically equivalent to the batch output at
+            // those endpoints.
+            const int    sr      = chunk_res.sample_rate;
+            const size_t fade_n  = std::min<size_t>(
+                                       (size_t)(sr * 10 / 1000),
+                                       chunk_res.pcm.size() / 2);
+            const bool   is_first = (k == 0);
+            const bool   is_last  = (k == n_chunks - 1);
+
+            if (!is_first && fade_n > 0) {
+                for (size_t i = 0; i < fade_n; ++i) {
+                    const float t = (float) i / (float) fade_n;
+                    const float w = 0.5f * (1.0f - std::cos((float) M_PI * t));
+                    chunk_res.pcm[i] *= w;
+                }
+            }
+            if (!is_last && fade_n > 0) {
+                const size_t n = chunk_res.pcm.size();
+                for (size_t i = 0; i < fade_n; ++i) {
+                    const float t = (float) i / (float) fade_n;
+                    const float w = 0.5f * (1.0f - std::cos((float) M_PI * t));
+                    chunk_res.pcm[n - 1 - i] *= w;
+                }
+            }
+
+            // Fire callback before accumulating, so the consumer sees
+            // the same buffer it would receive in pure-streaming mode.
+            on_chunk(chunk_res.pcm.data(), chunk_res.pcm.size(), k, is_last);
+
+            full.pcm.insert(full.pcm.end(), chunk_res.pcm.begin(), chunk_res.pcm.end());
+            full.duration_s  += chunk_res.duration_s;
+            full.sample_rate  = chunk_res.sample_rate;
+        }
+
+        return full;
+    }
+
     std::string backend_name() const {
         if (!model.backend) return "(unknown)";
-        if (const char * name = ggml_backend_name(model.backend)) {
-            return std::string(name);
+        const char * name = ggml_backend_name(model.backend);
+        std::string out = name ? std::string(name) : "(unknown)";
+        // QVAC-18605 — append device description when Vulkan is the
+        // resolved backend.  Mirrors chatterbox's bench output so a
+        // log line like "backend: Vulkan (device 0: NVIDIA RTX 5090)"
+        // is unambiguous when triaging multi-GPU machines.  Pulled
+        // through `ggml_backend_dev_description(ggml_backend_get_device(b))`
+        // so the lookup links under `GGML_BACKEND_DL=ON` without a
+        // static dep on `ggml_backend_vk_get_device_description`.
+        if (model.backend_is_vk) {
+            ggml_backend_dev_t dev = ggml_backend_get_device(model.backend);
+            const char * desc = dev ? ggml_backend_dev_description(dev) : nullptr;
+            if (desc && *desc) {
+                const int idx = opts.vulkan_device < 0 ? 0 : opts.vulkan_device;
+                out += " (device " + std::to_string(idx) + ": " + desc + ")";
+            }
         }
-        return "(unknown)";
+        return out;
     }
 };
 
@@ -291,10 +615,35 @@ SynthesisResult Engine::synthesize(const std::string & text) {
     return pimpl_->synthesize(text);
 }
 
+SynthesisResult Engine::synthesize(const std::string & text,
+                                   const StreamCallback & on_chunk) {
+    // Fall through to the batch path when streaming is disabled or no
+    // callback is wired up.  Both conditions match the Chatterbox
+    // semantics, so callers can pass an empty callback safely.
+    if (!on_chunk || pimpl_->opts.stream_chunk_tokens <= 0) {
+        return pimpl_->synthesize(text);
+    }
+    return pimpl_->synthesize_streaming(text, on_chunk);
+}
+
 void Engine::cancel() {
     pimpl_->cancel_flag.store(true, std::memory_order_release);
 }
 
+// QVAC-18605 follow-up — explicit first-synth pre-warm.
+// Forwards to the in-place `synthesize` and discards the PCM,
+// gated on the same `backend_is_cpu` short-circuit the auto-
+// invoked path at the end of `Impl::Impl` uses.  See the
+// declaration in `tts-cpp/supertonic/engine.h` for the full
+// rationale; the implementation here intentionally keeps the
+// no-op CPU fast path so callers don't have to branch on
+// `backend_device()` themselves.
+void Engine::warm_up(const std::string & text) {
+    if (text.empty()) return;
+    if (pimpl_->model.backend_is_cpu) return;
+    pimpl_->synthesize(text);  // discard result
+}
+
 const EngineOptions & Engine::options() const {
     return pimpl_->opts;
 }
diff --git a/tts-cpp/src/supertonic_gguf.cpp b/tts-cpp/src/supertonic_gguf.cpp
index eb4420c38a4..ea5f977ec41 100644
--- a/tts-cpp/src/supertonic_gguf.cpp
+++ b/tts-cpp/src/supertonic_gguf.cpp
@@ -14,8 +14,14 @@
 
 #include <algorithm>
 #include <atomic>
+#include <chrono>
+#include <cmath>
+#include <cstdio>
+#include <map>
 #include <cstdlib>
+#include <cstring>
 #include <mutex>
+#include <unordered_map>
 #include <unordered_set>
 #include <stdexcept>
 #include <thread>
@@ -65,6 +71,89 @@ ggml_tensor * get_tensor_or_null(const supertonic_model & model, const std::stri
     return it == model.tensors.end() ? nullptr : it->second;
 }
 
+// Compute the storage type for a model tensor given the source type from
+// the GGUF and the engine's compute-precision selector.  Non-matmul tensors
+// (biases, norms, embeddings — stored as f32 in the GGUF) are unaffected;
+// only quantized matmul weights actually change destination type.
+//
+// Truth table:
+//   precision \ src_type      | F32  | F16  | Q8_0
+//   --------------------------+------+------+------
+//   F32 (default)             | F32  | F32  | F32
+//   F16  (Phase B1)           | F32  | F16  | F16
+//   Q8_0 (Phase A3)           | F32  | F32  | Q8_0   <-- key win: Metal keeps q8_0
+//
+// F32 row preserves historical behaviour; non-F32 cells need a non-CPU matmul weight.
+// Predicate: is `tensor_name` a true matmul weight that lands in a
+// `ggml_mul_mat(weight, activation)` call (weight as src0) where Metal
+// can dispatch `kernel_mul_mm_q8_0_f32` directly?
+//
+// Today this is only the vector_estimator's per-step matmul weights —
+// those go through `dense_matmul_time_wt_pretransposed_ggml` (the
+// B2-partial helper) which uses the pretransposed weight as src0 and
+// dispatches the optimised q8_0 mat-mat kernel.
+//
+// Other GGUF q8_0 sources (text_encoder, duration, speech-prompted
+// attention) still flow through `dense_matmul_time_ggml`, which does
+// `ggml_cont(ggml_transpose(w))` at compute time — and Metal has no
+// CONT kernel for q8_0, so we'd crash.  Phase A3 follow-up: extend
+// the pretranspose-aware helper to those sites and broaden this
+// predicate.
+bool is_supertonic_matmul_weight_name(const std::string & name) {
+    return name.find("vector_estimator:onnx::MatMul_") != std::string::npos;
+}
+
+ggml_type target_supertonic_storage_type(const std::string & name,
+                                         enum ggml_type src_type,
+                                         supertonic_precision precision,
+                                         bool backend_is_cpu) {
+    // Only quantized matmul-weight tensors are subject to the precision
+    // selector.  Everything else (biases, norms, scales, the unicode
+    // indexer i32 lookup, etc.) is passed through unchanged so we don't
+    // attempt a dequant on types that don't have a to_float trait.
+    const bool is_quantized_weight =
+        (src_type == GGML_TYPE_Q8_0) || (src_type == GGML_TYPE_F16);
+    if (!is_quantized_weight) return src_type;
+
+    switch (precision) {
+        case supertonic_precision::F32:  return GGML_TYPE_F32;
+        case supertonic_precision::F16:
+            // Asymmetric like q8_0: on CPU dequant everything to f32 (AMX
+            // cblas takes f32).  On non-CPU keep f16 ONLY for true matmul-
+            // weight tensors that flow through dense_matmul_time_pretransposed_*
+            // — these dispatch ggml-metal's `kernel_mul_mm_f16_f32` directly.
+            // Other quantized GGUF tensors (relpos embeddings, conv1d
+            // kernels, per-channel scales used in plain ggml_mul) flow into
+            // ggml_metal_op_bin which asserts f32 on both srcs, so we dequant
+            // them at load.
+            if (!backend_is_cpu && is_supertonic_matmul_weight_name(name)) {
+                return GGML_TYPE_F16;
+            }
+            return GGML_TYPE_F32;
+        case supertonic_precision::Q8_0:
+            // Asymmetric: on CPU, ALWAYS dequant to f32 so cblas/AMX takes
+            // the weights (q8_0 path on CPU is NEON-only and loses the AMX
+            // advantage; not worth the parity drift).  On non-CPU backends,
+            // keep q8_0 ONLY for true matmul-weight tensors that flow
+            // through `dense_matmul_time_wt_pretransposed_ggml`'s
+            // weight-as-src0 ordering — other quantized GGUF tensors
+            // (relpos embeddings, conv1d kernels) use op patterns that
+            // Metal lacks q8_0 kernels for.
+            if (!backend_is_cpu &&
+                src_type == GGML_TYPE_Q8_0 &&
+                is_supertonic_matmul_weight_name(name)) {
+                return GGML_TYPE_Q8_0;
+            }
+            return GGML_TYPE_F32;
+    }
+    return GGML_TYPE_F32;
+}
+
+bool needs_supertonic_tensor_conversion(enum ggml_type src_type,
+                                        enum ggml_type dst_type) {
+    return src_type != dst_type;
+}
+
 bool should_expand_supertonic_tensor(enum ggml_type type) {
     return type == GGML_TYPE_F16 || type == GGML_TYPE_Q8_0;
 }
@@ -89,11 +178,66 @@ std::vector<float> expand_supertonic_tensor_to_f32(const ggml_tensor * src) {
     return out;
 }
 
-ggml_backend_t init_supertonic_backend(int n_gpu_layers, bool verbose) {
+// Convert a GGUF tensor's data into `out_buf`, which this function resizes
+// to `ggml_row_size(dst_type, n_elems)` bytes, i.e. ggml_nbytes for a
+// contiguous destination tensor of the same shape.  Supports any pair the
+// ggml type traits cover: F32 ↔ F16 ↔ Q8_0.  Always converts via f32 as the
+// pivot because that's the only API surface ggml exports publicly.
+void convert_supertonic_tensor_data(const ggml_tensor * src,
+                                    enum ggml_type dst_type,
+                                    std::vector<uint8_t> & out_buf) {
+    const int64_t n = ggml_nelements(src);
+    const void * src_data = ggml_get_data(src);
+
+    if (src->type == dst_type) {
+        // No conversion needed — caller should ideally have skipped this path
+        // and uploaded the raw GGUF bytes, but handle it for completeness.
+        const size_t bytes = ggml_nbytes(src);
+        out_buf.resize(bytes);
+        std::memcpy(out_buf.data(), src_data, bytes);
+        return;
+    }
+
+    // Pivot through f32 using the public ggml_get_type_traits() API.
+    // `ggml_get_type_traits_cpu()->from_float` is also public for the
+    // reverse direction (f32 → quantized).
+    std::vector<float> f32_pivot((size_t) n);
+    const ggml_type_traits * src_tr = ggml_get_type_traits(src->type);
+    if (!src_tr || !src_tr->to_float) {
+        throw std::runtime_error(std::string("Supertonic load: missing to_float for ") +
+                                 ggml_type_name(src->type));
+    }
+    src_tr->to_float(src_data, f32_pivot.data(), n);
+
+    if (dst_type == GGML_TYPE_F32) {
+        out_buf.resize(f32_pivot.size() * sizeof(float));
+        std::memcpy(out_buf.data(), f32_pivot.data(), out_buf.size());
+        return;
+    }
+
+    const size_t dst_bytes = ggml_row_size(dst_type, n);
+    out_buf.resize(dst_bytes);
+
+    const ggml_type_traits_cpu * dst_tr = ggml_get_type_traits_cpu(dst_type);
+    if (!dst_tr || !dst_tr->from_float) {
+        throw std::runtime_error(std::string("Supertonic load: missing from_float for ") +
+                                 ggml_type_name(dst_type));
+    }
+    dst_tr->from_float(f32_pivot.data(), out_buf.data(), n);
+}
+
+ggml_backend_t init_supertonic_backend(int n_gpu_layers, bool verbose, int vulkan_device = 0) {
     // GPU cascade is centralised in backend_selection.cpp's
     // `init_gpu_backend` (Adreno 700+ -> OpenCL, every other GPU ->
     // Vulkan/Metal/CUDA/Mali, with Adreno 6xx OpenCL force-skipped).
-    if (ggml_backend_t b = ::tts_cpp::detail::init_gpu_backend(n_gpu_layers, verbose, "supertonic")) {
+    // `vulkan_device` (round-3 / round-12) is forwarded so the shared
+    // helper applies the supertonic-side Vulkan device-selection
+    // policy when multiple Vulkan adapters are visible: -1 → auto
+    // (free-VRAM argmax with UMA bias), 0 → first Vulkan device
+    // (registry order), N > 0 → that index in the registry's
+    // Vulkan-device subset.  No-op when only one Vulkan device is
+    // visible or when the chosen backend is non-Vulkan.
+    if (ggml_backend_t b = ::tts_cpp::detail::init_gpu_backend(n_gpu_layers, verbose, "supertonic", vulkan_device)) {
         return b;
     }
     if (ggml_backend_t b = ::tts_cpp::detail::init_cpu_backend()) {
@@ -103,6 +247,456 @@ ggml_backend_t init_supertonic_backend(int n_gpu_layers, bool verbose) {
     throw std::runtime_error("init_supertonic_backend: no CPU device registered");
 }
 
+// QVAC-18605 — backend capability probe for `GGML_OP_LEAKY_RELU`.
+//
+// Builds a throwaway 16-element F32 tensor + a LEAKY_RELU node (no
+// alloc, no compute) inside a tiny `ggml_init` scratch context, then
+// asks the backend whether it would accept the op.  The synthetic
+// node is the same shape Supertonic actually emits (axis-0 contig F32),
+// so a `true` answer guarantees the real graphs in the vocoder will
+// dispatch the fused builtin.
+//
+// Why dynamic instead of a hard-coded backend table?  The set of
+// backends shipping `LEAKY_RELU` shifts with chatterbox-ggml patch
+// state (OpenCL gets it via a vendored patch but plain upstream
+// doesn't).  The dynamic probe keeps the right answer when the patch
+// is added or removed without touching this TU.
+//
+// Costs nothing on the hot path — runs once per `load_supertonic_gguf`
+// call.
+bool backend_supports_native_leaky_relu(ggml_backend_t backend) {
+    if (!backend) return false;
+    ggml_init_params probe_params = {
+        /*.mem_size   =*/ ggml_tensor_overhead() * 8,
+        /*.mem_buffer =*/ nullptr,
+        /*.no_alloc   =*/ true,
+    };
+    ggml_context * probe_ctx = ggml_init(probe_params);
+    if (!probe_ctx) return false;
+    bool ok = false;
+    try {
+        ggml_tensor * x  = ggml_new_tensor_1d(probe_ctx, GGML_TYPE_F32, 16);
+        ggml_tensor * op = ggml_leaky_relu(probe_ctx, x, 0.1f, /*inplace=*/false);
+        ok = (op != nullptr) && ggml_backend_supports_op(backend, op);
+    } catch (...) {
+        ok = false;
+    }
+    ggml_free(probe_ctx);
+    return ok;
+}
+
+// QVAC-18605 — runtime check: backend is `ggml-vulkan`.
+//
+// Forwarder to the shared `tts_cpp::detail::backend_is_vulkan`
+// helper in backend_util.h (same pattern as `backend_is_metal`
+// / `backend_is_cpu`).  The supertonic anon-namespace name is
+// kept short for local readability; the inline helper resolves
+// the reg-name through the registry API
+// (`ggml_backend_get_device` + `ggml_backend_dev_backend_reg`
+// + `ggml_backend_reg_name`) so it links under both
+// `GGML_BACKEND_DL=ON` and `=OFF` modes.
+bool backend_is_vulkan(ggml_backend_t backend) {
+    return ::tts_cpp::detail::backend_is_vulkan(backend);
+}
+
+// QVAC-18605 — internal-named alias for the public probe symbol.
+// The anon-namespace function name keeps the local TU references
+// short; the public-symbol forwarder below resolves the
+// `supertonic_backend_supports_f16_kv_flash_attn` declaration in
+// `supertonic_internal.h`.
+//
+// QVAC-18605 — backend capability probe for F16-K/V `FLASH_ATTN_EXT`.
+//
+// The OpenCL bring-up's auto-enable policy (`!backend_is_cpu`) blindly
+// turns on F16 K/V dispatch on any non-CPU backend.  That works for
+// OpenCL (the chatterbox patch unconditionally accepts the op) and
+// for Vulkan when the head dim is a multiple of 8 (Supertonic's
+// head_dim=64 satisfies that), but a future backend / driver / shape
+// combo could reject the op at graph time — and a graph-build failure
+// at the first synth call is much harder to triage than a load-time
+// auto-disable + a clear log line.
+//
+// The probe builds a synthetic `ggml_flash_attn_ext` node with the
+// shape Supertonic actually emits — Q=[head_dim, q_len, n_heads] F32,
+// K/V=[head_dim, kv_len, n_heads] F16, no mask — matching the live
+// call site in `build_text_attention_cache` (supertonic_vector_estimator.cpp).
+// q_len is set to 16, a multiple of n_heads (= 4), so the live `q_len=70`
+// (not divisible by 4) doesn't trip a probe-only `ggml_can_mul_mat`
+// rejection; the GPU dispatch supports both the divisible and non-
+// divisible cases at runtime, so probe-shape divisibility is purely
+// a probe-API concern.
+//
+// On a `false` answer the auto-policy refuses to enable F16 attention
+// (the F32 path stays correct, just slower).  Manual override via
+// `--f16-attn 1` still forces the F16 path for benchmarking; this
+// probe only gates the *auto* policy.
+//
+// Cost: one ggml_init + 4 tensor allocations + one supports_op call
+// at load time.  Zero hot-path cost — and the result is now memoised
+// per `ggml_backend_t` handle by `cached_backend_supports_*` below so
+// the engine + bench + load_supertonic_gguf trio doesn't re-run the
+// probe three times for the same backend.
+bool backend_supports_f16_kv_flash_attn_uncached(ggml_backend_t backend) {
+    if (!backend) return false;
+    ggml_init_params probe_params = {
+        /*.mem_size   =*/ ggml_tensor_overhead() * 16,
+        /*.mem_buffer =*/ nullptr,
+        /*.no_alloc   =*/ true,
+    };
+    ggml_context * probe_ctx = ggml_init(probe_params);
+    if (!probe_ctx) return false;
+    bool ok = false;
+    try {
+        constexpr int head_dim = 64;
+        constexpr int n_heads  = 4;
+        // q_len chosen as `n_heads * 4` so `ggml_can_mul_mat(k, q)`'s
+        // probe-only `q.ne[2] % k.ne[2] == 0` constraint is satisfied
+        // (n_heads % n_heads = 0 is the live-call invariant; here we
+        // use a Q with ne[2] = n_heads, ne[1] = q_len, so the same
+        // shape contract holds).
+        constexpr int q_len    = 16;
+        constexpr int kv_len   = 16;
+        // Live shape from `build_text_attention_cache`:
+        //   q_in: [head_dim, q_len, n_heads]  (F32)
+        //   k_in: [head_dim, kv_len, n_heads] (F16 after `ggml_cpy`)
+        //   v_in: [head_dim, kv_len, n_heads] (F16 after `ggml_cpy`)
+        ggml_tensor * q  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_F32, head_dim, q_len, n_heads);
+        ggml_tensor * k  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_F16, head_dim, kv_len, n_heads);
+        ggml_tensor * v  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_F16, head_dim, kv_len, n_heads);
+        ggml_tensor * op = ggml_flash_attn_ext(probe_ctx, q, k, v, nullptr,
+                                               1.0f / (float) head_dim, 0.0f, 0.0f);
+        ok = (op != nullptr) && ggml_backend_supports_op(backend, op);
+    } catch (...) {
+        ok = false;
+    }
+    ggml_free(probe_ctx);
+    return ok;
+}
+
+// QVAC-18605 follow-up — backend capability probe for the Q8_0
+// K/V `FLASH_ATTN_EXT` variant.
+//
+// Vulkan's `GGML_OP_FLASH_ATTN_EXT` `supports_op` advertises Q8_0
+// (and Q4_0) K/V types in the scalar and coopmat2 paths
+// (`ggml-vulkan.cpp:15257`).  Switching K/V from F16 to Q8_0
+// halves the upload bandwidth into the per-step attention cache
+// (50 KB → 25 KB per K and V on Supertonic's hot shape),
+// equivalently ~1 MB / synth on the default 5-step × 4-site
+// schedule, in exchange for a small (~0.5 %) relative-error drift
+// vs F16 K/V on the attention output.  Worth the trade on memory-
+// bandwidth-bound mobile GPUs (Adreno, Mali) once measured on a
+// real device.
+//
+// This PR adds the probe + caches the result, but does NOT yet
+// wire `model.use_q8_kv_attn` into the live dispatch site — Q8_0
+// K/V drift hasn't been measured against the existing F16 K/V
+// parity harness on a real Vulkan adapter.  The probe primes the
+// capability cache so a follow-up patch can flip the dispatch
+// behind a `--kv-attn-type q8_0` opt-in without re-running the
+// `supports_op` query.  Tracked in PROGRESS_SUPERTONIC.md
+// "Deferred work".
+bool backend_supports_q8_0_kv_flash_attn_uncached(ggml_backend_t backend) {
+    if (!backend) return false;
+    ggml_init_params probe_params = {
+        /*.mem_size   =*/ ggml_tensor_overhead() * 16,
+        /*.mem_buffer =*/ nullptr,
+        /*.no_alloc   =*/ true,
+    };
+    ggml_context * probe_ctx = ggml_init(probe_params);
+    if (!probe_ctx) return false;
+    bool ok = false;
+    try {
+        // Same shape as the F16-K/V probe; only K/V dtype differs.
+        // Q8_0 is a 32-element-per-block quantisation, so kv_len
+        // must be a multiple of 32 to satisfy the live
+        // `ggml_can_repeat` / row-stride invariants the GPU
+        // dispatch requires.  The live call site has kv_len = 50;
+        // we pick 32 here as the smallest multiple-of-Q8_0-block
+        // that exercises the same `supports_op` switch.
+        constexpr int head_dim = 64;
+        constexpr int n_heads  = 4;
+        constexpr int q_len    = 16;
+        constexpr int kv_len   = 32;
+        ggml_tensor * q  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_F32,  head_dim, q_len,  n_heads);
+        ggml_tensor * k  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_Q8_0, head_dim, kv_len, n_heads);
+        ggml_tensor * v  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_Q8_0, head_dim, kv_len, n_heads);
+        ggml_tensor * op = ggml_flash_attn_ext(probe_ctx, q, k, v, nullptr,
+                                               1.0f / (float) head_dim, 0.0f, 0.0f);
+        ok = (op != nullptr) && ggml_backend_supports_op(backend, op);
+    } catch (...) {
+        ok = false;
+    }
+    ggml_free(probe_ctx);
+    return ok;
+}
+
+// QVAC-18605 round 3 — backend capability probe for Vulkan's
+// `ggml_backend_vk_host_buffer_type()`.
+//
+// Vulkan exposes a host-visible, device-coherent buffer type
+// that lets the CPU fill an input tensor without going through
+// ggml-vulkan's internal staging buffer.  Wiring the actual
+// upload path through that buffer is a per-engine refactor
+// (input scratchpad allocator separate from the model gallocr);
+// this round only adds the probe so the capability cache is
+// primed for that follow-up.  The bench output surfaces the
+// flag so operators can confirm the host-buffer-type path is
+// available on their adapter before flipping the (future)
+// `--vulkan-pinned-uploads` opt-in.
+//
+// Probe is trivial: succeeds iff the backend is Vulkan AND the
+// device's `host_buffer_type` slot is non-null.  Routed through
+// the registry API (`ggml_backend_get_device` +
+// `ggml_backend_dev_host_buffer_type`) so it works under
+// `GGML_BACKEND_DL=ON`; on backends that don't expose a host
+// buffer type (CPU, Metal, OpenCL, …) the device-level slot
+// returns null and we report unsupported.
+bool backend_supports_pinned_host_buffer_uncached(ggml_backend_t backend) {
+    if (!backend) return false;
+    if (!::tts_cpp::detail::backend_is_vulkan(backend)) return false;
+    ggml_backend_dev_t dev = ggml_backend_get_device(backend);
+    return dev && ggml_backend_dev_host_buffer_type(dev) != nullptr;
+}
+
+// QVAC-18605 round 3 — backend capability probe for the BF16 K/V
+// `FLASH_ATTN_EXT` variant.
+//
+// Vulkan's `GGML_OP_FLASH_ATTN_EXT` `supports_op` advertises
+// BF16 K/V via the coopmat2-only path
+// (`ggml-vulkan.cpp:GGML_OP_FLASH_ATTN_EXT` case branch around
+// line 15257).  BF16 has the same per-element size as F16 (2
+// bytes), so the upload bandwidth is identical, but BF16's
+// wider exponent range (8 bits vs. F16's 5) avoids the
+// occasional underflow on small attention scores that drives
+// F16's ~0.2 % tolerance widening on the parity harness.
+// On hardware with `cooperative_matrix2` (NVIDIA Ampere+, AMD
+// RDNA3+) BF16 K/V is also faster than F16 K/V because the
+// coopmat2 BF16 multiply-accumulate ops are dispatched at
+// hardware-tensor-core throughput.
+//
+// Like the Q8_0 K/V probe, this round adds the probe + caches
+// the result as a forward-compat capability; the live dispatch
+// site isn't yet wired (a follow-up will gate `--kv-attn-type
+// bf16` on the probe so the dispatch flips when the cache says
+// the hardware accepts the op).
+//
+// Probe shape mirrors the F16-K/V probe with the K/V dtype set
+// to `GGML_TYPE_BF16` — same `kv_len = 16` (BF16 row stride is
+// `head_dim * 2` bytes, identical to F16).
+bool backend_supports_bf16_kv_flash_attn_uncached(ggml_backend_t backend) {
+    if (!backend) return false;
+    ggml_init_params probe_params = {
+        /*.mem_size   =*/ ggml_tensor_overhead() * 16,
+        /*.mem_buffer =*/ nullptr,
+        /*.no_alloc   =*/ true,
+    };
+    ggml_context * probe_ctx = ggml_init(probe_params);
+    if (!probe_ctx) return false;
+    bool ok = false;
+    try {
+        constexpr int head_dim = 64;
+        constexpr int n_heads  = 4;
+        constexpr int q_len    = 16;
+        constexpr int kv_len   = 16;
+        ggml_tensor * q  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_F32,  head_dim, q_len,  n_heads);
+        ggml_tensor * k  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_BF16, head_dim, kv_len, n_heads);
+        ggml_tensor * v  = ggml_new_tensor_3d(probe_ctx, GGML_TYPE_BF16, head_dim, kv_len, n_heads);
+        ggml_tensor * op = ggml_flash_attn_ext(probe_ctx, q, k, v, nullptr,
+                                               1.0f / (float) head_dim, 0.0f, 0.0f);
+        ok = (op != nullptr) && ggml_backend_supports_op(backend, op);
+    } catch (...) {
+        ok = false;
+    }
+    ggml_free(probe_ctx);
+    return ok;
+}
+
+// QVAC-18605 follow-up — backend capability probe for the hot
+// F16-weight `mul_mat` shape Supertonic dispatches every step.
+//
+// Mirror of `backend_supports_f16_kv_flash_attn_uncached`: the
+// `use_f16_weights` auto-policy used to flip on `!backend_is_cpu`
+// blindly, with no check that the resolved backend would accept the
+// resulting `mul_mat(F16 weight, F32 activation) → F32` graph node
+// for the shapes the audit identified as hot.  Every shipping GPU
+// backend (CUDA / Metal / Vulkan / OpenCL) does support this combo,
+// but a future debug-shim / partial-port backend that wires up
+// `mul_mat` for F32-only would crash at first synth call when
+// `f16_weights` was auto-enabled — exactly the failure mode the
+// F16-K/V probe was added to prevent.
+//
+// Probe shape mirrors the vector-estimator attention W_query
+// matmul (`[head_dim*n_heads = 256, in_dim = 256]` weight, F16
+// storage; `[256, q_len = 16]` activation, F32; output F32),
+// which is the most common F16-weight matmul site in the
+// production graph (32 such matmuls per synth, 5-step schedule).
+//
+// Cost: one ggml_init + 3 tensor allocations + one supports_op
+// call at load time.  Zero hot-path cost — memoised per
+// `ggml_backend_t` by `cached_backend_supports_*` below.
+bool backend_supports_f16_mul_mat_uncached(ggml_backend_t backend) {
+    if (!backend) return false;
+    ggml_init_params probe_params = {
+        /*.mem_size   =*/ ggml_tensor_overhead() * 8,
+        /*.mem_buffer =*/ nullptr,
+        /*.no_alloc   =*/ true,
+    };
+    ggml_context * probe_ctx = ggml_init(probe_params);
+    if (!probe_ctx) return false;
+    bool ok = false;
+    try {
+        // Live shape from the vector-estimator attention W_query /
+        // W_key / W_value matmul site.
+        constexpr int head_dim = 64;
+        constexpr int n_heads  = 4;
+        constexpr int width    = head_dim * n_heads;  // 256
+        constexpr int q_len    = 16;
+        ggml_tensor * w  = ggml_new_tensor_2d(probe_ctx, GGML_TYPE_F16, width, width);
+        ggml_tensor * x  = ggml_new_tensor_2d(probe_ctx, GGML_TYPE_F32, width, q_len);
+        ggml_tensor * op = ggml_mul_mat(probe_ctx, w, x);
+        ok = (op != nullptr) && ggml_backend_supports_op(backend, op);
+    } catch (...) {
+        ok = false;
+    }
+    ggml_free(probe_ctx);
+    return ok;
+}
+
+// QVAC-18605 follow-up — process-wide capability-probe cache.
+//
+// Three sites probe the same `ggml_backend_t` for the same op
+// support boolean: `load_supertonic_gguf` (LEAKY_RELU at backend
+// resolution time), `Engine::Engine` and `supertonic_bench`'s
+// `main` (F16-K/V flash-attn at auto-policy time).  Engine + bench
+// life-cycles also call `load_supertonic_gguf` themselves, so the
+// uncached probe set fires on average 2–3 times per backend per
+// process.  On a CPU backend each probe costs ~1 µs (ggml_init +
+// supports_op walks a small switch).  On Vulkan, `supports_op`
+// inspects the device's pipeline state and may force coopmat
+// shader specialisation lookup — measured ~50–200 µs on Adreno /
+// llvmpipe / RADV in microbenchmarks.  Negligible per-probe but
+// visible in cold-start traces, and the cache eliminates 100 % of
+// the redundancy.
+//
+// Cache shape: `unordered_map<ggml_backend_t, probe_results>`.
+// Key is the backend handle (stable for the backend's lifetime;
+// recycled keys after a backend is freed are technically possible
+// but the per-handle entry cost is ~24 bytes, so we don't bother
+// invalidating on free).  Test seam: `supertonic_clear_capability_cache`
+// drops every entry — used by the unit test to verify the cache
+// is hit on the second call.
+//
+// Thread-safety: guarded by a single std::mutex.  The cache is
+// consulted at load time only, never on the per-synth path, so
+// contention is negligible.
+struct backend_capabilities {
+    bool native_leaky_relu;
+    bool f16_kv_flash_attn;
+    bool f16_mul_mat;
+    // QVAC-18605 follow-up — Q8_0 K/V flash-attn support.  Probed
+    // here as a forward-compat capability; the dispatch isn't yet
+    // wired (see `backend_supports_q8_0_kv_flash_attn_uncached`'s
+    // docstring + PROGRESS_SUPERTONIC.md "Deferred work").
+    bool q8_0_kv_flash_attn;
+    // QVAC-18605 round 3 — BF16 K/V flash-attn support.  Probed
+    // here as a forward-compat capability; the dispatch isn't yet
+    // wired (see `backend_supports_bf16_kv_flash_attn_uncached`'s
+    // docstring + PROGRESS_SUPERTONIC.md "Deferred work").  BF16
+    // K/V is the wider-exponent alternative to F16 K/V — mostly
+    // useful on Vulkan with cooperative_matrix2 support.
+    bool bf16_kv_flash_attn;
+    // QVAC-18605 round 3 — pinned-host-buffer-type availability.
+    // True iff the backend is Vulkan AND
+    // `ggml_backend_vk_host_buffer_type()` returns non-null.
+    // Forward-compat — primes the cache for a future per-engine
+    // input-scratchpad refactor that uses the host-pinned buffer
+    // to skip ggml-vulkan's internal staging-buffer hop on the
+    // per-step uploads.
+    bool pinned_host_buffer;
+};
+
+inline std::mutex & capability_cache_mu() {
+    static std::mutex m;
+    return m;
+}
+inline std::unordered_map<ggml_backend_t, backend_capabilities> & capability_cache() {
+    static std::unordered_map<ggml_backend_t, backend_capabilities> c;
+    return c;
+}
+// Probe-call counter for the regression test in
+// test_supertonic_capability_cache.cpp: `cached_backend_capabilities`
+// bumps the counter only when it actually runs the uncached
+// probes (i.e. on a cold cache).  The test asserts that the
+// counter advances by exactly one across N consecutive
+// cached_backend_supports_native_leaky_relu(b) calls on the same
+// backend.
+std::atomic<uint64_t> & capability_probe_call_counter() {
+    static std::atomic<uint64_t> n{0};
+    return n;
+}
+
+// Returns a `const &` to the cached entry.  The reference stays
+// valid after the `lock_guard` is released because:
+//   - `std::unordered_map` element references are NOT invalidated by
+//     `insert` / `emplace` even when the table rehashes; only
+//     iterators are.  (Standard guarantee, [unord.req].)
+//   - `find` / `emplace` are the only mutators on this cache from
+//     production code.  Production never `erase`s an entry and never
+//     calls `clear()` — the cache lives for the duration of the
+//     process.
+//
+// PR #18 reviewer (Omar) follow-up — UaF risk from test-only
+// `clear()`:
+// `supertonic_clear_capability_cache()` is a test seam exported for
+// `test-supertonic-capability-cache` to drop every cached entry and
+// re-exercise the cold-cache probe path.  If a test ever called
+// `cached_backend_capabilities(b)` (capturing the returned `const
+// &`) on thread A, then called `supertonic_clear_capability_cache()`
+// on thread B WHILE thread A was still dereferencing the reference,
+// the underlying element would be destroyed and thread A would
+// observe a use-after-free.
+//
+// Today the risk is theoretical: every test runs single-threaded, the
+// `clear` call is a single statement at the top of one test
+// (`test_capability_cache_drop_then_repopulate`), and no production
+// path reaches `clear`.  But the contract isn't enforced by the
+// type system, so spelling it out here:
+//   1. Production callers may hold the returned reference across
+//      arbitrary subsequent `cached_backend_capabilities` calls for
+//      DIFFERENT backends (insert-doesn't-invalidate-references).
+//   2. Callers MUST NOT keep the reference alive across ANY
+//      `supertonic_clear_capability_cache` call (only test code
+//      calls it, so this is the test code's responsibility).
+//   3. Multi-threaded callers must ensure no thread is dereferencing
+//      a returned reference while another thread calls `clear`
+//      (caller-side synchronisation; the lock here protects the
+//      map structure during insert/find, NOT element lifetime).
+//   4. If a future refactor adds a production-reachable `erase` or
+//      `clear` path, this function should either return-by-value or
+//      switch to `std::shared_ptr<const backend_capabilities>`
+//      ownership.
+const backend_capabilities & cached_backend_capabilities(ggml_backend_t backend) {
+    std::lock_guard<std::mutex> lk(capability_cache_mu());
+    auto & c = capability_cache();
+    auto it = c.find(backend);
+    if (it != c.end()) return it->second;
+    capability_probe_call_counter().fetch_add(1, std::memory_order_relaxed);
+    backend_capabilities caps;
+    caps.native_leaky_relu   = backend_supports_native_leaky_relu(backend);
+    caps.f16_kv_flash_attn   = backend_supports_f16_kv_flash_attn_uncached(backend);
+    caps.f16_mul_mat         = backend_supports_f16_mul_mat_uncached(backend);
+    caps.q8_0_kv_flash_attn  = backend_supports_q8_0_kv_flash_attn_uncached(backend);
+    caps.bf16_kv_flash_attn  = backend_supports_bf16_kv_flash_attn_uncached(backend);
+    caps.pinned_host_buffer  = backend_supports_pinned_host_buffer_uncached(backend);
+    return c.emplace(backend, caps).first->second;
+}
+
+// Backwards-compatible name kept for the in-tree callers that already
+// reference it; routes through the cache.
+bool backend_supports_f16_kv_flash_attn(ggml_backend_t backend) {
+    return cached_backend_capabilities(backend).f16_kv_flash_attn;
+}
+
 void set_env_if_unset(const char * name, const char * value) {
     if (std::getenv(name) != nullptr) return;
 #if defined(_WIN32)
@@ -112,6 +706,31 @@ void set_env_if_unset(const char * name, const char * value) {
 #endif
 }
 
+// QVAC-18605 round 7 — pure-logic key-validator for the
+// `apply_vulkan_env_overrides` ALL-OR-NOTHING contract.  Returns
+// `true` (with `out_bad_key` populated) on the first key that is
+// not `GGML_VK_` plus at least one further character, `false` if
+// every key is valid.  Split out so the public helper validates
+// the entire map BEFORE touching any env var.
+//
+// Out-param + bool return (instead of returning `std::string`
+// with empty-as-success) because an empty-string KEY is itself
+// invalid input — a pure-string return would conflate "no bad
+// key found" with "the bad key was the empty string".
+bool find_invalid_vulkan_env_key(const std::map<std::string, std::string> & overrides,
+                                 std::string & out_bad_key) {
+    static const std::string prefix = "GGML_VK_";
+    for (const auto & kv : overrides) {
+        const std::string & key = kv.first;
+        if (key.size() <= prefix.size() ||
+            key.compare(0, prefix.size(), prefix) != 0) {
+            out_bad_key = key;
+            return true;
+        }
+    }
+    return false;
+}
+
 void configure_supertonic_blas_threads_once() {
 #if defined(TTS_CPP_USE_ACCELERATE)
     static bool configured = false;
@@ -179,6 +798,787 @@ bool is_supertonic_alive(uint64_t generation_id) {
     return supertonic_alive_ids().find(generation_id) != supertonic_alive_ids().end();
 }
 
+// QVAC-18605 — public forwarder for the F16-K/V flash-attn probe.
+// Lets engine.cpp / supertonic_bench.cpp gate the auto-policy on
+// the resolved backend's actual capability instead of the
+// historical "any non-CPU backend" heuristic — saves a graph-build
+// crash on backends that ship `flash_attn_ext` but reject the
+// F16 K/V variant for the Supertonic shape.  See the inline probe
+// `backend_supports_f16_kv_flash_attn_uncached` in this TU for
+// the rationale.  Routes through `cached_backend_capabilities`
+// (process-wide cache keyed by `ggml_backend_t`) so engine + bench
+// + load trio doesn't re-run the probe three times for the same
+// backend.
+bool supertonic_backend_supports_f16_kv_flash_attn(ggml_backend_t backend) {
+    return cached_backend_capabilities(backend).f16_kv_flash_attn;
+}
+
+// QVAC-18605 follow-up — public forwarder for the F16-weight
+// `mul_mat` probe.  Symmetric to the F16-K/V probe above; gates
+// the `use_f16_weights` auto-policy in engine.cpp + bench so a
+// backend that ships F16 storage but rejects F16 mul_mat for the
+// hot vector-estimator attention shape doesn't crash at first
+// synth call.  Cached.
+bool supertonic_backend_supports_f16_mul_mat(ggml_backend_t backend) {
+    return cached_backend_capabilities(backend).f16_mul_mat;
+}
+
+// QVAC-18605 follow-up — public forwarder for the Q8_0 K/V
+// flash-attn probe.  Forward-compat — primes the capability
+// cache for a future `--kv-attn-type q8_0` opt-in (cuts K/V
+// upload bandwidth ~2× on memory-bandwidth-bound mobile GPUs)
+// without forcing the live dispatch through Q8_0 today.  See
+// `backend_supports_q8_0_kv_flash_attn_uncached` for the
+// rationale + the deferred-work entry in PROGRESS_SUPERTONIC.md.
+bool supertonic_backend_supports_q8_0_kv_flash_attn(ggml_backend_t backend) {
+    return cached_backend_capabilities(backend).q8_0_kv_flash_attn;
+}
+
+// QVAC-18605 round 3 — public forwarder for the BF16 K/V flash-
+// attn probe.  Forward-compat — primes the capability cache for
+// a future `--kv-attn-type bf16` opt-in (BF16's wider exponent
+// range avoids the F16 underflow on small attention scores
+// without paying a 2× bandwidth cost).  Mostly useful on Vulkan
+// devices that advertise `cooperative_matrix2` (NVIDIA Ampere+,
+// AMD RDNA3+).  See `backend_supports_bf16_kv_flash_attn_uncached`
+// for the rationale + the deferred-work entry in
+// PROGRESS_SUPERTONIC.md.
+bool supertonic_backend_supports_bf16_kv_flash_attn(ggml_backend_t backend) {
+    return cached_backend_capabilities(backend).bf16_kv_flash_attn;
+}
+
+// QVAC-18605 round 3 — public forwarder for the pinned-host-
+// buffer-type probe.  Symmetric to the BF16 / Q8_0 K/V
+// forwarders above; primes the capability cache with whether
+// `ggml_backend_vk_host_buffer_type()` is callable on this
+// backend so a future per-engine input-scratchpad refactor can
+// gate the host-pinned upload path on the cached answer
+// (avoids re-querying the Vulkan backend per synth step).
+bool supertonic_backend_supports_pinned_host_buffer(ggml_backend_t backend) {
+    return cached_backend_capabilities(backend).pinned_host_buffer;
+}
+
+// QVAC-18605 round 12 #5 — pinned-host-buffer input allocator.
+//
+// Implementation strategy:
+//
+//   1. Defensive null-check (callers in error-handler paths can
+//      hand us a half-constructed model with `.backend == nullptr`
+//      or a stale ctx pointer).  Either case → `nullptr`.
+//
+//   2. Probe-gated dispatch.  We reuse the round-3 capability
+//      probe `supertonic_backend_supports_pinned_host_buffer`
+//      so the wired cache builds can also call the probe
+//      independently (e.g. to decide whether to even create the
+//      input_ctx).  The cache itself is process-wide so the
+//      lookup is constant-time after the first cold miss.
+//
+//   3. `ggml_backend_alloc_ctx_tensors_from_buft(ctx, host_buft)`
+//      walks every tensor in `input_ctx`, allocates one
+//      contiguous buffer from `host_buft` big enough to hold
+//      all of them, and binds each tensor to its slot in that
+//      buffer.  Returns the buffer (owned by caller) or
+//      `nullptr` on alloc failure (e.g. BAR memory exhausted —
+//      rare; caller falls back to gallocr's default-buft path
+//      which uses device memory + staging).
+//
+// On the dev rig (RTX 5090 + 128 GB host RAM), the host buffer
+// for a typical (L=20, text_len=24) synth is ~80 KB total —
+// trivial vs the multi-GB device buffers gallocr would have
+// otherwise produced, but the saving is on the per-step uploads
+// where each `ggml_backend_tensor_set` skips one staging-buffer
+// memcpy on the way to BAR memory.
+ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
+    const supertonic_model & model,
+    ggml_context * input_ctx) {
+    if (model.backend == nullptr || input_ctx == nullptr) {
+        return nullptr;
+    }
+    // Probe — bypasses any Vulkan-symbol dependency on backends
+    // that don't ship one (CPU, Metal, OpenCL, accel, BLAS...).
+    if (!supertonic_backend_supports_pinned_host_buffer(model.backend)) {
+        return nullptr;
+    }
+    // Resolve the host-pinned buffer type through the registry API
+    // (`ggml_backend_dev_host_buffer_type`) so the call links under
+    // `GGML_BACKEND_DL=ON`.  Same value the legacy
+    // `ggml_backend_vk_host_buffer_type()` returns, sourced from the
+    // device-level slot instead of the per-backend static entry.
+    ggml_backend_dev_t dev = ggml_backend_get_device(model.backend);
+    ggml_backend_buffer_type_t host_buft =
+        dev ? ggml_backend_dev_host_buffer_type(dev) : nullptr;
+    if (host_buft == nullptr) {
+        // Probe said yes but the device slot now returns null.
+        // Defensive guard against a backend that lost the capability
+        // between probe and call.  Return nullptr; the caller uses
+        // gallocr's default path.
+        return nullptr;
+    }
+    // Allocates one buffer big enough to hold every tensor in
+    // `input_ctx` AND binds each tensor to its slot.  Caller owns
+    // the returned buffer.  Returns nullptr on BAR exhaustion
+    // (extremely rare) — caller falls through.
+    return ggml_backend_alloc_ctx_tensors_from_buft(input_ctx, host_buft);
+}
+
+// QVAC-18605 round 13 #1 — input-scratchpad allocator that
+// consolidates the round-12 boilerplate.  See the docstring on
+// the declaration in supertonic_internal.h for the contract.
+//
+// Implementation:
+//   1. Defensive null-checks first.  These cover error-handler
+//      paths where the caller hands us a half-constructed state.
+//   2. Try pinned-host via `try_alloc_inputs_in_pinned_host_buffer`.
+//      Returns on success.
+//   3. Fall back to `ggml_backend_alloc_ctx_tensors`.  This
+//      allocates from the backend's default buffer type, which
+//      on Vulkan is device-local memory (with the usual staging
+//      hop per `ggml_backend_tensor_set`); on CPU it's host
+//      memory directly.  Same correctness as pre-round-12.
+//   4. On BOTH failing, throw with a message including the
+//      cache name so operators can correlate the failure with a
+//      specific cache rebuild site.
+ggml_backend_buffer_t alloc_input_scratchpad_or_throw(
+    const supertonic_model & model,
+    ggml_context * input_ctx,
+    const char * cache_name) {
+    if (cache_name == nullptr) {
+        throw std::runtime_error(
+            "supertonic: alloc_input_scratchpad_or_throw: cache_name is null "
+            "(caller-bug: pass a string literal naming the cache)");
+    }
+    if (model.backend == nullptr) {
+        throw std::runtime_error(
+            std::string("supertonic: ") + cache_name +
+            ": cannot allocate input scratchpad without a backend "
+            "(model.backend is null)");
+    }
+    if (input_ctx == nullptr) {
+        throw std::runtime_error(
+            std::string("supertonic: ") + cache_name +
+            ": cannot allocate input scratchpad with a null ggml_context");
+    }
+    // First try pinned-host (Vulkan-only).  Round 12 #5 already
+    // returns nullptr cleanly on CPU / Metal / OpenCL / etc.
+    ggml_backend_buffer_t buf =
+        try_alloc_inputs_in_pinned_host_buffer(model, input_ctx);
+    if (buf) return buf;
+    // Fall back to default backend buffer.  Same correctness as
+    // pre-round-12; just one staging hop per upload on Vulkan.
+    buf = ggml_backend_alloc_ctx_tensors(input_ctx, model.backend);
+    if (buf) return buf;
+    // Both failed — this is a system-level resource issue (BAR
+    // exhaustion AND device-memory exhaustion).  Loud failure so
+    // the operator's logs surface the cache that ran out of room.
+    throw std::runtime_error(
+        std::string("supertonic: ") + cache_name +
+        ": failed to allocate input scratchpad "
+        "(both pinned-host and default-backend paths returned null)");
+}
+
+// QVAC-18605 round 3 — multi-device Vulkan auto-pick policy.
+//
+// Pure logic — no Vulkan symbols touched here.  The Vulkan-only
+// wrapper (`init_supertonic_backend`'s `#ifdef GGML_USE_VULKAN`
+// branch) calls `ggml_backend_vk_get_device_memory()` per device
+// to build the `free_vram_per_device` list, then dispatches into
+// this helper.  Splitting the policy from the plumbing means the
+// behaviour matrix is testable on CPU with synthetic inputs (see
+// test_supertonic_vulkan_device_select.cpp).
+//
+// See the docstring on the declaration in supertonic_internal.h
+// for the behaviour matrix.
+int resolve_vulkan_device_index(int requested,
+                                const std::vector<size_t> & free_vram_per_device,
+                                const std::vector<bool> & is_uma_per_device) {
+    const int dev_count = (int) free_vram_per_device.size();
+    if (dev_count <= 0) {
+        throw std::runtime_error(
+            "supertonic: cannot resolve --vulkan-device against an empty "
+            "device list (no Vulkan adapter visible)");
+    }
+    // Round-12 caller-bug guard.  When `is_uma_per_device` is
+    // non-empty its length MUST match `free_vram_per_device`;
+    // otherwise we'd be reading off the end of one of the
+    // vectors below.  Empty (the default) is fine — falls through
+    // to the round-3 policy.
+    if (!is_uma_per_device.empty() &&
+        is_uma_per_device.size() != free_vram_per_device.size()) {
+        throw std::runtime_error(
+            "supertonic: is_uma_per_device.size()=" +
+            std::to_string(is_uma_per_device.size()) +
+            " must equal free_vram_per_device.size()=" +
+            std::to_string(free_vram_per_device.size()) +
+            " when non-empty");
+    }
+    // Reserved-future negative value — fail loud instead of
+    // silently treating as 0 (would mask a CLI typo).
+    if (requested < -1) {
+        throw std::runtime_error(
+            "supertonic: --vulkan-device " + std::to_string(requested) +
+            " is reserved (only -1 means auto-pick)");
+    }
+    // Auto-pick.
+    if (requested == -1) {
+        // Round-12: when UMA flags are available AND at least
+        // one discrete device exists, restrict the argmax to
+        // the discrete subset.  Discrete-only argmax preserves
+        // round-3's tie-break (lower index) within the subset.
+        //
+        // `is_uma_per_device.empty()` is the round-3 path —
+        // unchanged behaviour for every caller that hasn't yet
+        // wired the UMA flag list.
+        //
+        // ASSUMPTION (PR #18 review): `is_uma_per_device[i]` is
+        // populated from `ggml_backend_dev_get_props().type`
+        // mapped through `GGML_BACKEND_DEVICE_TYPE_IGPU / _CPU /
+        // _ACCEL` → UMA, otherwise → discrete.  This is correct
+        // on every test-matrix entry we have (RTX 5090 + AMD
+        // RADV iGPU, single-discrete-only, single-UMA-only,
+        // all-UMA, multi-discrete).  Edge case that can silently
+        // mis-classify: a discrete adapter whose driver
+        // mis-reports its type as `_IGPU` (some Thunderbolt eGPU
+        // configurations; some ARM SoC dGPU paths).  On such a
+        // rig:
+        //   - the discrete is flagged UMA → excluded from the
+        //     discrete-subset argmax;
+        //   - if every other visible adapter is also flagged UMA,
+        //     `any_discrete == false` and we fall through to the
+        //     round-3 all-device argmax → discrete still picked
+        //     by `free_vram` (correct outcome by coincidence).
+        //   - if the rig also has a TRUE UMA iGPU with more
+        //     reported "free VRAM" (system RAM), the round-12
+        //     bias prefers the iGPU over the mis-classified
+        //     discrete → silent regression vs. round 3.  Operator
+        //     escape hatch: `--vulkan-device N` is UMA-agnostic
+        //     (passes through unchanged below) so an explicit
+        //     index always wins.  `--vulkan-perf-logger` exposes
+        //     the chosen device in the bench JSON for
+        //     post-mortem diagnosis.
+        //   - Future hardening: add a "free-VRAM ceiling" filter
+        //     (e.g. UMA reports system-RAM-scale numbers; a
+        //     discrete reporting > 256 GB is implausible and can
+        //     be heuristically re-classified).  Out-of-scope for
+        //     QVAC-18605; tracked in
+        //     `aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md`.
+        if (!is_uma_per_device.empty()) {
+            bool any_discrete = false;
+            for (bool u : is_uma_per_device) {
+                if (!u) { any_discrete = true; break; }
+            }
+            if (any_discrete) {
+                // argmax over the discrete subset; ties → lower
+                // index.  Manual loop instead of max_element +
+                // predicate because we need the ORIGINAL index
+                // (not the subset's local index).
+                int best_idx = -1;
+                size_t best_free = 0;
+                for (int i = 0; i < dev_count; ++i) {
+                    if (is_uma_per_device[(size_t) i]) continue;
+                    if (best_idx == -1 || free_vram_per_device[(size_t) i] > best_free) {
+                        best_idx  = i;
+                        best_free = free_vram_per_device[(size_t) i];
+                    }
+                }
+                return best_idx;  // can't be -1; any_discrete == true
+            }
+            // Fall through: all-UMA → round-3 argmax over all.
+        }
+        // Round-3 path: argmax(free VRAM); ties → lower index.
+        // std::max_element returns an iterator to the FIRST of
+        // several equal maxima, so the tie-breaking rule follows
+        // from its default `operator<` comparison.
+        const auto it = std::max_element(free_vram_per_device.begin(),
+                                         free_vram_per_device.end());
+        return (int) std::distance(free_vram_per_device.begin(), it);
+    }
+    // Explicit index — range-check.  UMA-agnostic (operator-
+    // pinned index always wins, regardless of device type).
+    if (requested >= dev_count) {
+        throw std::runtime_error(
+            "supertonic: --vulkan-device " + std::to_string(requested) +
+            " out of range (visible adapters: " +
+            std::to_string(dev_count) + ")");
+    }
+    return requested;
+}
+
+// Test seam — drops every cached entry so the regression test in
+// `test_supertonic_capability_cache.cpp` can verify the cache is
+// hit on the second call (the cold-cache call bumps the probe
+// counter; subsequent calls don't until the cache is cleared).
+// Not part of the supported public API; the symbol is exported
+// only for the in-process test harness and not declared in the
+// `supertonic_internal.h` header for external consumers.
+void supertonic_clear_capability_cache() {
+    std::lock_guard<std::mutex> lk(capability_cache_mu());
+    capability_cache().clear();
+}
+
+// Test seam — exposes the cold-cache probe call counter so the
+// regression test can assert the cache short-circuits the
+// uncached path on a hit.  Returns the counter's *current* value,
+// which the caller compares before / after `cached_backend_*`
+// calls to verify zero increments on a hot cache.
+uint64_t supertonic_capability_probe_call_count() {
+    return capability_probe_call_counter().load(std::memory_order_relaxed);
+}
+
+// QVAC-18605 round 7 — Vulkan env-var passthrough.
+//
+// ALL-OR-NOTHING: validate every key starts with `GGML_VK_`
+// BEFORE touching the environment.  An operator-config typo like
+// `GMML_VK_PREFER_HOST_MEMORY` throws cleanly without leaving the
+// env in a half-applied state where the good entries took effect
+// but the bad one didn't.  Empty map is a no-op (regression-
+// guarded by `test_empty_map_is_noop`).
+//
+// `set_env_if_unset` semantics: an operator-set env var (already
+// present in the environment when this is called) WINS over the
+// EngineOptions override.  Lets a debugging operator force-disable
+// a setting from the shell without recompiling, while still
+// letting the production EngineOptions configuration set the same
+// knob in the absence of a shell override.
+void apply_vulkan_env_overrides(const std::map<std::string, std::string> & overrides) {
+    if (overrides.empty()) return;
+    std::string bad;
+    if (find_invalid_vulkan_env_key(overrides, bad)) {
+        throw std::runtime_error(
+            "supertonic: invalid Vulkan env-var override key '" + bad +
+            "' — keys must start with 'GGML_VK_' (operator-config typo guard)");
+    }
+    for (const auto & kv : overrides) {
+        set_env_if_unset(kv.first.c_str(), kv.second.c_str());
+    }
+}
+
+// QVAC-18605 round 7 — voice ttl/dp host cache.
+//
+// Implementation matches the contract documented on the struct
+// declaration in supertonic_internal.h.  Inlines the
+// `read_tensor_f32` body (defined in supertonic_engine.cpp, not
+// linkable from here) — three lines, zero abstraction cost.
+const voice_host_cache::entry &
+voice_host_cache::get_or_load(const std::string & voice_name,
+                              ggml_tensor * ttl_tensor,
+                              ggml_tensor * dp_tensor) {
+    auto it = by_name_.find(voice_name);
+    if (it != by_name_.end()) {
+        // Cache HIT: return the existing entry without touching
+        // the GGML tensors.  Caller may legally pass nullptr for
+        // ttl/dp on a hit (see test_second_load_hits_cache).
+        return it->second;
+    }
+    if (!ttl_tensor || !dp_tensor) {
+        throw std::runtime_error(
+            "voice_host_cache: cache miss for voice '" + voice_name +
+            "' but ttl/dp tensor is null (Engine::Impl bug — voices.find() should "
+            "have validated the voice before this call)");
+    }
+    entry e;
+    e.ttl.resize((size_t) ggml_nelements(ttl_tensor));
+    ggml_backend_tensor_get(ttl_tensor, e.ttl.data(), 0, ggml_nbytes(ttl_tensor));
+    e.dp.resize((size_t) ggml_nelements(dp_tensor));
+    ggml_backend_tensor_get(dp_tensor, e.dp.data(), 0, ggml_nbytes(dp_tensor));
+    auto inserted = by_name_.emplace(voice_name, std::move(e));
+    return inserted.first->second;
+}
+
+void voice_host_cache::clear() {
+    by_name_.clear();
+}
+
+size_t voice_host_cache::size() const {
+    return by_name_.size();
+}
+
+// Phase 2A — hot-weight predicate.
+//
+// Returns true for source names that should be materialised as
+// F16 on a non-CPU backend when `model.use_f16_weights` is set.
+// See the docstring on `should_materialise_f16_weight` in
+// supertonic_internal.h for the full roster + test references.
+//
+// Implementation rules:
+//   - String matching uses explicit suffix / contains checks; no
+//     regex (the predicate runs once per GGUF tensor at load time,
+//     not on the hot path, but we still want it cheap + audit-
+//     friendly).
+//   - Pre-transposed `__T` companions are excluded (the original
+//     gets materialised; the companion lives separately).
+//   - Bias / norm-weight / γ tensors are excluded by suffix.
+//   - Embedding tables and small fixed-shape per-channel vectors
+//     are excluded by name fragment.
+bool should_materialise_f16_weight(const std::string & source_name) {
+    if (source_name.empty()) return false;
+
+    auto ends_with = [&](const std::string & suffix) {
+        return source_name.size() >= suffix.size() &&
+               std::equal(suffix.rbegin(), suffix.rend(), source_name.rbegin());
+    };
+    auto contains = [&](const std::string & frag) {
+        return source_name.find(frag) != std::string::npos;
+    };
+
+    // Bias / scale / shift / γ — always cold.  Catches both
+    // `*.bias` and bias-like `linear.bias` substrings the audit
+    // explicitly negative-tested against.
+    if (ends_with(".bias"))                  return false;
+    if (contains(".linear.bias"))            return false;
+    if (contains(".norm.norm.weight"))       return false;
+    if (contains(".norm.norm.bias"))         return false;
+    if (ends_with(".gamma"))                 return false;
+    if (contains(".char_embedder.weight"))   return false;
+    if (contains(".emb_rel_k"))              return false;
+    if (contains(".emb_rel_v"))              return false;
+    if (contains("normalizer.scale"))        return false;
+    if (contains("PRelu_"))                  return false;
+    if (contains(".dwconv."))                return false;
+    if (contains(".attn.theta"))             return false;
+    // Pre-transposed companions (F6) are stored separately; the
+    // original goes through this predicate normally.  The `__T`
+    // suffix tags them.
+    if (ends_with("__T"))                    return false;
+    // Negative trap (test_supertonic_f16_weights.cpp covers this):
+    // a name with a digit tail after `MatMul_` that ends in `_bias`
+    // would otherwise pass the `onnx::MatMul_` digit check below.
+    if (contains("MatMul_") && ends_with("_bias")) return false;
+
+    // Positive list:
+    //
+    //  - vector_estimator attention matmuls: `onnx::MatMul_NNNN`
+    //    where NNNN is the per-group / per-attention-site ID.
+    //    All are covered by matching the `onnx::MatMul_` substring
+    //    inside the `vector_estimator:` namespace.
+    //  - vector_estimator convnext pwconv1/2: anything ending in
+    //    `.pwconv1.weight` or `.pwconv2.weight`.
+    //  - vocoder convnext pwconv1/2 + head linear: same suffix
+    //    convention.
+    //  - text-encoder linears: `text_encoder:onnx::MatMul_` and
+    //    the FFN `conv_1.weight` / `conv_2.weight`.
+    const bool ve  = source_name.rfind("vector_estimator:", 0) == 0;
+    const bool voc = source_name.rfind("vocoder:", 0) == 0;
+    const bool tex = source_name.rfind("text_encoder:", 0) == 0;
+    if (!ve && !voc && !tex) return false;
+
+    if (contains("onnx::MatMul_")) {
+        // Reject `onnx::MatMul_` followed by an empty / non-digit
+        // tail (audit test edge case: `"vector_estimator:onnx::MatMul_"`).
+        const size_t pos = source_name.find("onnx::MatMul_");
+        if (pos != std::string::npos) {
+            const std::string tail = source_name.substr(pos + 13);
+            if (tail.empty()) return false;
+            // First char of tail must be a digit; otherwise it's
+            // a name like `MatMul_bias_3101` which is a manufactured
+            // negative.  See predicate-negatives test.
+            if (!(tail[0] >= '0' && tail[0] <= '9')) return false;
+        }
+        return true;
+    }
+    if (ends_with(".pwconv1.weight")) return true;
+    if (ends_with(".pwconv2.weight")) return true;
+    if (ends_with(".head.layer1.net.weight")) return true;
+    if (ends_with(".head.layer2.weight"))     return true;
+    if (contains(".conv_1.weight")) return true;
+    if (contains(".conv_2.weight")) return true;
+
+    return false;
+}
+
+// QVAC-18605 round 6 — 2-arg overload.
+//
+// Two-stage decision:
+//
+//   1. If any non-empty entry in `extra_deny_substrings` is a
+//      substring of `source_name`, return `false` immediately.
+//      Operator-supplied deny patterns short-circuit the curated
+//      allow-list (they're meant to FORCE F32 even for tensors
+//      the curated path would have promoted).
+//
+//   2. Otherwise, forward to the 1-arg version (curated allow-
+//      list).
+//
+// Empty deny-list → behaviour identical to the 1-arg version
+// (zero behaviour change for every existing call site that
+// passes the default empty list).
+//
+// Empty strings inside the deny-list are SKIPPED on purpose:
+// substring `""` would otherwise match every name and silently
+// disable F16 weights for the entire model, which is almost
+// certainly an operator typo (e.g. trailing comma in a config
+// file producing an empty entry).  Surfacing the typo via a
+// loud warning would be nicer, but `should_materialise_f16_weight`
+// is a pure predicate with no logging hook; the defensive skip
+// keeps the predicate honest while a higher-layer config
+// validator can warn separately if desired.
+bool should_materialise_f16_weight(const std::string & source_name,
+                                   const std::vector<std::string> & extra_deny_substrings) {
+    if (source_name.empty()) return false;
+    for (const std::string & pattern : extra_deny_substrings) {
+        if (pattern.empty()) continue;  // defensive skip
+        if (source_name.find(pattern) != std::string::npos) {
+            return false;
+        }
+    }
+    return should_materialise_f16_weight(source_name);
+}
+
+// Thread-local dispatch flags consulted by the GGML graph builders to
+// pick between the CBLAS-backed `ggml_custom_4d` fast paths (CPU only)
+// and the portable pure-GGML fallbacks (any backend).  See the
+// supertonic_op_dispatch_scope comment in supertonic_internal.h.
+//
+// QVAC-18605 — `g_supertonic_use_native_leaky_relu` carries the
+// resolved-backend's `LEAKY_RELU` capability into the
+// `leaky_relu_portable_ggml` helper.  Defaults to `true` so the
+// historical CPU-only path keeps using the fused builtin even when no
+// scope is active (matches `g_supertonic_use_cpu_custom_ops`'s default
+// rationale).
+namespace {
+thread_local bool g_supertonic_use_cpu_custom_ops    = true;
+thread_local bool g_supertonic_use_f16_attn          = false;
+thread_local bool g_supertonic_use_native_leaky_relu = true;
+// QVAC-18605 round 4 — current K/V flash-attn dispatch dtype.
+// Defaults to f32 so a graph builder called outside any
+// `supertonic_op_dispatch_scope` doesn't accidentally take the
+// F16/BF16/Q8_0 path (matches the model's default value).
+thread_local kv_attn_dtype g_supertonic_kv_attn_type  = kv_attn_dtype::f32;
+}
+
+bool supertonic_use_cpu_custom_ops() {
+    return g_supertonic_use_cpu_custom_ops;
+}
+
+bool supertonic_use_f16_attn() {
+    return g_supertonic_use_f16_attn;
+}
+
+bool supertonic_use_native_leaky_relu() {
+    return g_supertonic_use_native_leaky_relu;
+}
+
+kv_attn_dtype supertonic_kv_attn_type() {
+    return g_supertonic_kv_attn_type;
+}
+
+supertonic_op_dispatch_scope::supertonic_op_dispatch_scope(const supertonic_model & model)
+    : prev_use_cpu_custom_ops(g_supertonic_use_cpu_custom_ops),
+      prev_use_f16_attn(g_supertonic_use_f16_attn),
+      prev_use_native_leaky_relu(g_supertonic_use_native_leaky_relu),
+      prev_kv_attn_type(g_supertonic_kv_attn_type) {
+    g_supertonic_use_cpu_custom_ops    = model.backend_is_cpu;
+    g_supertonic_use_f16_attn          = model.use_f16_attn;
+    g_supertonic_use_native_leaky_relu = model.use_native_leaky_relu;
+    g_supertonic_kv_attn_type          = model.kv_attn_type;
+}
+
+supertonic_op_dispatch_scope::~supertonic_op_dispatch_scope() {
+    g_supertonic_use_cpu_custom_ops    = prev_use_cpu_custom_ops;
+    g_supertonic_use_f16_attn          = prev_use_f16_attn;
+    g_supertonic_use_native_leaky_relu = prev_use_native_leaky_relu;
+    g_supertonic_kv_attn_type          = prev_kv_attn_type;
+}
+
+// QVAC-18605 round 4 — pure-logic resolver for the multi-dtype
+// K/V dispatch policy.  Implementation matches the behaviour
+// matrix documented on the declaration in supertonic_internal.h.
+//
+// Out-of-range inputs throw to surface CLI typos loudly; explicit
+// requests the probe rejects fall back to f32 without throwing and
+// set `*out_was_downgraded` when it is non-null (same "advisory
+// probes" pattern as the round-1 use_f16_attn auto-policy fallback).
+kv_attn_dtype resolve_kv_attn_type(int requested,
+                                   bool legacy_use_f16_attn,
+                                   bool backend_supports_f16,
+                                   bool backend_supports_bf16,
+                                   bool backend_supports_q8_0,
+                                   bool * out_was_downgraded) {
+    if (out_was_downgraded) *out_was_downgraded = false;
+    if (requested < -1 || requested > 3) {
+        throw std::runtime_error(
+            "supertonic: --kv-attn-type " + std::to_string(requested) +
+            " out of range (valid: -1=auto, 0=f32, 1=f16, 2=bf16, 3=q8_0)");
+    }
+    switch (requested) {
+        case -1:  // auto
+            // No downgrade flag — operator didn't ask for a
+            // specific dtype, so falling back to f32 is the
+            // auto-policy doing its job, not a surprise.
+            if (legacy_use_f16_attn && backend_supports_f16) return kv_attn_dtype::f16;
+            return kv_attn_dtype::f32;
+        case 0:   // f32 forced
+            return kv_attn_dtype::f32;
+        case 1:   // f16 forced (probe-gated fallback)
+            if (backend_supports_f16) return kv_attn_dtype::f16;
+            if (out_was_downgraded) *out_was_downgraded = true;
+            return kv_attn_dtype::f32;
+        case 2:   // bf16 forced (probe-gated fallback)
+            if (backend_supports_bf16) return kv_attn_dtype::bf16;
+            if (out_was_downgraded) *out_was_downgraded = true;
+            return kv_attn_dtype::f32;
+        case 3:   // q8_0 forced (probe-gated fallback)
+            if (backend_supports_q8_0) return kv_attn_dtype::q8_0;
+            if (out_was_downgraded) *out_was_downgraded = true;
+            return kv_attn_dtype::f32;
+        default:
+            // Unreachable: the range check above rejects everything
+            // outside -1..3 and each value in that range has a case.
+            // Defensive throw in case the range is widened without a new case.
+            throw std::runtime_error("supertonic: resolve_kv_attn_type unreachable");
+    }
+}
+
+// ---------------------------------------------------------------------
+// Phase 2D — `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter.
+//
+// Implementation lives here (in `supertonic_gguf.cpp`) rather than a
+// dedicated TU because:
+//   - the supertonic library already pulls this file in unconditionally
+//     (load_supertonic_gguf is the public entry point).
+//   - the file-local state (FILE *, mutex, env-probe latch) doesn't
+//     need to be shared across TUs.
+//
+// Storage model:
+//   - One `FILE *` opened at "first record after path set" time.
+//   - A mutex guards record / flush / set_path so the emitter is
+//     safe to call from any thread (the rest of the engine is
+//     single-threaded per model, but tests may spawn helpers).
+//   - The env var `SUPERTONIC_PROFILE_CSV` is probed lazily on the
+//     first `record` / `enabled` call after process start; tests
+//     override via `set_path(PATH)` which bypasses the env probe.
+//
+// Schema (matches the contract in
+// `test_supertonic_profile_csv.cpp`):
+//
+//   stage,island,step,wall_ms,unix_us
+//
+// The header row is written once, lazily, the first time we open
+// a new file that's empty.  Re-opening the same path appends, so
+// long-running bench harnesses can record many synths without
+// stomping their header / data.
+namespace {
+
+struct profile_csv_state {
+    std::mutex   mu;
+    std::FILE *  fp = nullptr;
+    std::string  path;
+    bool         env_checked = false;
+};
+
+profile_csv_state & profile_csv() {
+    static profile_csv_state s;
+    return s;
+}
+
+void profile_csv_close_locked(profile_csv_state & s) {
+    if (s.fp) {
+        std::fclose(s.fp);
+        s.fp = nullptr;
+    }
+    s.path.clear();
+}
+
+void profile_csv_open_locked(profile_csv_state & s, const std::string & path) {
+    // Append mode so multiple sessions can share one CSV.
+    // We only write the header when the file is empty (fresh).
+    bool need_header = false;
+    {
+        std::FILE * probe = std::fopen(path.c_str(), "rb");
+        if (probe) {
+            std::fseek(probe, 0, SEEK_END);
+            const long sz = std::ftell(probe);
+            need_header = (sz == 0);
+            std::fclose(probe);
+        } else {
+            need_header = true;
+        }
+    }
+    s.fp = std::fopen(path.c_str(), "ab");
+    if (!s.fp) return; // open failure → emitter stays disabled
+    s.path = path;
+    if (need_header) {
+        std::fprintf(s.fp, "stage,island,step,wall_ms,unix_us\n");
+        std::fflush(s.fp);
+    }
+}
+
+void profile_csv_atexit_flush() {
+    // Best-effort flush + close on normal process exit; if the
+    // bench harness segfaults we lose buffered rows but that's
+    // the same trade-off any FILE *-based logger makes.
+    profile_csv_state & s = profile_csv();
+    std::lock_guard<std::mutex> lk(s.mu);
+    if (s.fp) {
+        std::fflush(s.fp);
+        std::fclose(s.fp);
+        s.fp = nullptr;
+    }
+}
+
+void profile_csv_probe_env_locked(profile_csv_state & s) {
+    if (s.env_checked) return;
+    s.env_checked = true;
+    const char * env = std::getenv("SUPERTONIC_PROFILE_CSV");
+    if (env && *env) {
+        profile_csv_open_locked(s, env);
+        // Register an atexit hook the first time we open via the
+        // env var.  Tests that flip the path via `_set_path` get
+        // the flush via their explicit teardown call instead;
+        // they don't need an atexit because the unit harness
+        // explicitly cleans up.
+        std::atexit(profile_csv_atexit_flush);
+    }
+}
+
+} // namespace
+
+bool supertonic_profile_csv_enabled() {
+    profile_csv_state & s = profile_csv();
+    std::lock_guard<std::mutex> lk(s.mu);
+    profile_csv_probe_env_locked(s);
+    return s.fp != nullptr;
+}
+
+void supertonic_profile_csv_record(const char * stage, const char * island,
+                                   int step, double wall_ms) {
+    profile_csv_state & s = profile_csv();
+    std::lock_guard<std::mutex> lk(s.mu);
+    profile_csv_probe_env_locked(s);
+    if (!s.fp) return;
+    // Wall clock in microseconds-since-epoch so the CSV is sortable
+    // across separate bench harness invocations.  `steady_clock`
+    // would be cheaper but isn't comparable across processes; the
+    // CSV is analysed offline, so this call is not perf-critical.
+    const auto now = std::chrono::system_clock::now().time_since_epoch();
+    const long long unix_us =
+        std::chrono::duration_cast<std::chrono::microseconds>(now).count();
+    std::fprintf(s.fp, "%s,%s,%d,%.3f,%lld\n",
+                 stage ? stage : "",
+                 island ? island : "",
+                 step,
+                 wall_ms,
+                 unix_us);
+}
+
+void supertonic_profile_csv_flush() {
+    profile_csv_state & s = profile_csv();
+    std::lock_guard<std::mutex> lk(s.mu);
+    if (s.fp) std::fflush(s.fp);
+}
+
+void supertonic_profile_csv_set_path(const char * path) {
+    profile_csv_state & s = profile_csv();
+    std::lock_guard<std::mutex> lk(s.mu);
+    profile_csv_close_locked(s);
+    // Latch the env probe even when the caller passes nullptr so
+    // that a subsequent enabled()/record() call doesn't accidentally
+    // re-pick-up the env var after the test asked us to disable.
+    s.env_checked = true;
+    if (path && *path) {
+        profile_csv_open_locked(s, path);
+    }
+}
+
 ggml_tensor * require_tensor(const supertonic_model & model, const std::string & name) {
     ggml_tensor * t = get_tensor_or_null(model, name);
     if (!t) throw std::runtime_error("missing tensor: " + name);
@@ -193,6 +1593,19 @@ ggml_tensor * require_source_tensor(const supertonic_model & model, const std::s
     return it->second;
 }
 
+ggml_tensor * try_source_tensor(const supertonic_model & model, const std::string & source_name) {
+    auto it = model.source_tensors.find(source_name);
+    if (it == model.source_tensors.end()) return nullptr;
+    return it->second;
+}
+
+ggml_tensor * try_pretransposed_weight(const supertonic_model & model, const ggml_tensor * w) {
+    if (!w) return nullptr;
+    auto it = model.pretransposed_weights.find(w);
+    if (it == model.pretransposed_weights.end()) return nullptr;
+    return it->second;
+}
+
 void supertonic_set_n_threads(supertonic_model & model, int n_threads) {
     configure_supertonic_blas_threads_once();
     if (n_threads <= 0) {
@@ -209,6 +1622,38 @@ void supertonic_graph_compute(const supertonic_model & model, ggml_cgraph * grap
     if (model.n_threads > 0) {
         ::tts_cpp::detail::backend_set_n_threads(model.backend, model.n_threads);
     }
+    static const bool count_dispatches = std::getenv("SUPERTONIC_COUNT_DISPATCHES") != nullptr;
+    static const bool dump_op_histogram = std::getenv("SUPERTONIC_DUMP_OP_HISTOGRAM") != nullptr;
+    if (dump_op_histogram) {
+        static thread_local int hist_call = 0;
+        ++hist_call;
+        const int n = ggml_graph_n_nodes(graph);
+        std::map<std::string, int> hist;
+        for (int i = 0; i < n; ++i) {
+            ggml_tensor * t = ggml_graph_node(graph, i);
+            hist[ggml_op_name(t->op)] += 1;
+        }
+        fprintf(stderr, "=== supertonic_graph_compute #%d op histogram (n_nodes=%d) ===\n", hist_call, n);
+        std::vector<std::pair<int, std::string>> sorted;
+        for (auto & kv : hist) sorted.emplace_back(kv.second, kv.first);
+        std::sort(sorted.rbegin(), sorted.rend());
+        for (auto & p : sorted) {
+            fprintf(stderr, "  %4d  %s\n", p.first, p.second.c_str());
+        }
+    }
+    if (count_dispatches) {
+        static thread_local int n_calls = 0;
+        static thread_local double total_us = 0.0;
+        ++n_calls;
+        const auto t0 = std::chrono::steady_clock::now();
+        ggml_backend_graph_compute(model.backend, graph);
+        const auto t1 = std::chrono::steady_clock::now();
+        const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
+        total_us += us;
+        fprintf(stderr, "supertonic_graph_compute #%d nodes=%d  wall=%.1fus  cumul=%.2fms\n",
+                n_calls, ggml_graph_n_nodes(graph), us, total_us / 1000.0);
+        return;
+    }
     ggml_backend_graph_compute(model.backend, graph);
 }
 
@@ -263,8 +1708,25 @@ static void bind_vocoder_weights(supertonic_model & model) {
 bool load_supertonic_gguf(const std::string & path,
                           supertonic_model & model,
                           int n_gpu_layers,
-                          bool verbose) {
+                          bool verbose,
+                          int f16_weights,
+                          supertonic_precision precision,
+                          int vulkan_device,
+                          const std::vector<std::string> & f16_weights_deny_list) {
     model.generation_id = next_supertonic_generation_id();
+    model.precision_id = static_cast<int>(precision);
+    // The load path supports F32 / F16 / Q8_0 destination types.
+    // - F32: fully wired.
+    // - Q8_0: storage on Metal only for `:onnx::MatMul_*` weights (the
+    //   optimised `kernel_mul_mm_q8_0_f32` dispatches via the swapped-
+    //   args `dense_matmul_time_wt_pretransposed_ggml` helper).  Other
+    //   tensors expand to f32.  On CPU everything expands to f32 so
+    //   cblas/AMX keeps the lead.
+    // - F16: same asymmetric scheme as Q8_0 — `:onnx::MatMul_*` weights
+    //   stay f16 on Metal (dispatches `kernel_mul_mm_f16_f32`), other
+    //   GGUF-f16 tensors (relpos embeddings, per-channel scales used in
+    //   plain `ggml_mul`) expand to f32 so they don't trip `ggml_metal_op_bin`'s
+    //   f32-only assertion.  Pretranspose pass covers f16 alongside f32/q8_0.
     ggml_context * tmp_ctx = nullptr;
     gguf_init_params gp = { /*.no_alloc=*/ false, /*.ctx=*/ &tmp_ctx };
     gguf_context * gguf_ctx = gguf_init_from_file(path.c_str(), gp);
@@ -299,32 +1761,297 @@ bool load_supertonic_gguf(const std::string & path,
         model.languages = get_string_array(gguf_ctx, "supertonic.languages");
         model.tts_json = get_string(gguf_ctx, "supertonic.tts_json");
 
-        model.backend = init_supertonic_backend(n_gpu_layers, verbose);
+        model.backend = init_supertonic_backend(n_gpu_layers, verbose, vulkan_device);
+        // The graph builders below dispatch between CBLAS-backed
+        // `ggml_custom_4d` fast paths (CPU only) and pure-GGML fallbacks
+        // (any backend) based on this flag.  Stable for the model's
+        // lifetime; see the supertonic_op_dispatch_scope comment in
+        // supertonic_internal.h for the threading contract.
+        model.backend_is_cpu = ggml_backend_is_cpu(model.backend);
+        // QVAC-18605 — Vulkan-specific dispatch capture.
+        //
+        // `backend_is_vk` is informational (the bench / engine show it
+        // in the human-readable backend description), but it also
+        // documents WHICH non-CPU backend the model resolved to —
+        // useful when triaging "why is leaky_relu slow on this run?"
+        // against the audit's expected fast-path matrix.
+        model.backend_is_vk = backend_is_vulkan(model.backend);
+        // Probe the backend's `LEAKY_RELU` capability so the
+        // `leaky_relu_portable_ggml` helper can route to the fused
+        // builtin on backends that have it (Vulkan / Metal / CUDA /
+        // CPU; OpenCL only with chatterbox patch) and to the
+        // RELU+SCALE+ADD decomposition otherwise.  Probe runs once
+        // per backend (memoised by `cached_backend_capabilities`)
+        // — zero hot-path cost.
+        model.use_native_leaky_relu = cached_backend_capabilities(model.backend).native_leaky_relu;
+        if (verbose) {
+            fprintf(stderr, "supertonic: backend_is_cpu=%s backend_is_vk=%s use_native_leaky_relu=%s\n",
+                    model.backend_is_cpu ? "true" : "false",
+                    model.backend_is_vk ? "true" : "false",
+                    model.use_native_leaky_relu ? "true" : "false");
+        }
+
+        // Phase 2A — auto/force policy for F16 weight materialization.
+        // Auto-enable on non-CPU backends; never auto-enable on CPU
+        // (the CBLAS custom-op fast paths require F32 storage).
+        //
+        // QVAC-18605 follow-up — the auto policy is now backend-
+        // capability-gated.  Symmetric to the F16-K/V flash-attn
+        // probe: a backend that ships F16 storage but rejects the
+        // hot `mul_mat(F16, F32)` shape Supertonic dispatches every
+        // step would crash at the first synth call if the policy were
+        // enabled blindly.  The probe (`backend_supports_f16_mul_mat_uncached`
+        // → `cached_backend_capabilities`) tries the live shape
+        // (W=[256, 256] F16, X=[256, 16] F32) at backend resolution
+        // time; on a `false` answer the auto policy refuses to
+        // materialise F16 weights — slower but correct.  Manual
+        // override via `--f16-weights 1` still forces dispatch
+        // (useful for debug-shim backends and forward-compat tests).
+        if (f16_weights < 0) {
+            model.use_f16_weights = !model.backend_is_cpu &&
+                                    cached_backend_capabilities(model.backend).f16_mul_mat;
+        } else {
+            model.use_f16_weights = (f16_weights != 0);
+        }
+        if (verbose) {
+            fprintf(stderr, "supertonic: use_f16_weights=%s\n",
+                    model.use_f16_weights ? "true" : "false");
+            // Round 6 — log the user-supplied deny-list (if any) so
+            // operators can confirm their config got plumbed through.
+            // Empty list (the default) is silent — same baseline as
+            // the round-3 log output.
+            if (model.use_f16_weights && !f16_weights_deny_list.empty()) {
+                fprintf(stderr,
+                        "supertonic: f16_weights_deny_list (%zu pattern%s):\n",
+                        f16_weights_deny_list.size(),
+                        f16_weights_deny_list.size() == 1 ? "" : "s");
+                for (const auto & p : f16_weights_deny_list) {
+                    fprintf(stderr, "  - \"%s\"%s\n", p.c_str(),
+                            p.empty() ? " (empty — skipped at predicate time)" : "");
+                }
+            }
+        }
+
+        // Phase 2A pre-step: build a (tensor_name → source_name)
+        // lookup BEFORE the alloc loop so we can apply the hot-
+        // weight predicate at allocation time (and pick F16 vs F32
+        // storage accordingly).  Same metadata arrays as the
+        // post-alloc source_tensors map further below; reading them
+        // twice is cheap.
+        std::unordered_map<std::string, std::string> tensor_to_source_for_alloc;
+        if (model.use_f16_weights) {
+            int64_t id_tn = gguf_find_key(gguf_ctx, "supertonic.tensor_names");
+            int64_t id_sn = gguf_find_key(gguf_ctx, "supertonic.source_names");
+            if (id_tn >= 0 && id_sn >= 0) {
+                const size_t n_tn = gguf_get_arr_n(gguf_ctx, id_tn);
+                const size_t n_sn = gguf_get_arr_n(gguf_ctx, id_sn);
+                if (n_tn == n_sn) {
+                    for (size_t i = 0; i < n_tn; ++i) {
+                        tensor_to_source_for_alloc[gguf_get_arr_str(gguf_ctx, id_tn, i)] =
+                            gguf_get_arr_str(gguf_ctx, id_sn, i);
+                    }
+                }
+            }
+        }
 
         const int64_t num_tensors = gguf_get_n_tensors(gguf_ctx);
+        // Reserve a small surplus of tensor-overhead slots for the
+        // audit-driven pre-baked tensors that load_supertonic_gguf
+        // appends to `model.ctx_w` below: F2 vocoder bn_scale_pre +
+        // bn_shift_pre, plus F6's pre-transposed companions for the
+        // five hot t_proj weights.  A surplus of 16 covers the
+        // current roster + headroom for follow-up audit phases.
+        constexpr int64_t kPrebakedTensorSurplus = 16;
         ggml_init_params params = {
-            /*.mem_size=*/ ggml_tensor_overhead() * (size_t) num_tensors,
+            /*.mem_size=*/ ggml_tensor_overhead() * (size_t)(num_tensors + kPrebakedTensorSurplus),
             /*.mem_buffer=*/ nullptr,
             /*.no_alloc=*/ true,
         };
         model.ctx_w = ggml_init(params);
         if (!model.ctx_w) throw std::runtime_error("ggml_init failed");
 
-        std::unordered_map<std::string, std::vector<float>> expanded_f32_tensors;
+        std::unordered_map<std::string, std::vector<float>>     expanded_f32_tensors;
+        // Phase 2A: tensors materialised as F16 land their host-side
+        // F16 payload here.  `ggml_fp16_t` is a 16-bit half-float;
+        // we use `uint16_t` storage to avoid a public-header dep on
+        // ggml's f16 typedef.
+        std::unordered_map<std::string, std::vector<uint16_t>>   f16_materialised_tensors;
+        // Tensors that need a Metal-specific type conversion (e.g.
+        // f32 → q8_0 for `--precision q8_0`) keep their converted
+        // bytes here, held alive until the backend upload loop runs.
+        std::unordered_map<std::string, std::vector<uint8_t>>    converted_tensors;
+
+        // Ensure the source-alias map is populated even when the
+        // Phase 2A `use_f16_weights` path didn't already build it —
+        // the precision-driven decision below also needs it to
+        // recognise `:onnx::MatMul_` sources for Metal asymmetric load.
+        if (tensor_to_source_for_alloc.empty()) {
+            int64_t id_tn = gguf_find_key(gguf_ctx, "supertonic.tensor_names");
+            int64_t id_sn = gguf_find_key(gguf_ctx, "supertonic.source_names");
+            if (id_tn >= 0 && id_sn >= 0) {
+                const size_t n_tn = gguf_get_arr_n(gguf_ctx, id_tn);
+                const size_t n_sn = gguf_get_arr_n(gguf_ctx, id_sn);
+                if (n_tn == n_sn) {
+                    for (size_t i = 0; i < n_tn; ++i) {
+                        tensor_to_source_for_alloc[gguf_get_arr_str(gguf_ctx, id_tn, i)] =
+                            gguf_get_arr_str(gguf_ctx, id_sn, i);
+                    }
+                }
+            }
+        }
+
+        // Decide per-tensor destination type:
+        //  1. F32 sources on the F16-weights hot-path roster +
+        //     `use_f16_weights` on → materialise as F16 (Phase 2A).
+        //  2. Else fall through to the precision-driven path:
+        //     `target_supertonic_storage_type` returns F32 / F16 / Q8_0
+        //     depending on `--precision` and whether the source name is
+        //     a `:onnx::MatMul_` weight on a non-CPU backend.
+        //  3. Anything else preserves the source type via dup.
         for (int64_t i = 0; i < num_tensors; ++i) {
             const char * name = gguf_get_tensor_name(gguf_ctx, i);
             ggml_tensor * src = ggml_get_tensor(tmp_ctx, name);
             if (!src) throw std::runtime_error(std::string("missing tmp tensor: ") + name);
-            ggml_tensor * dst = should_expand_supertonic_tensor(src->type)
-                ? ggml_new_tensor(model.ctx_w, GGML_TYPE_F32, ggml_n_dims(src), src->ne)
-                : ggml_dup_tensor(model.ctx_w, src);
+
+            // Phase 2A predicate check.  Only fires when
+            // `use_f16_weights` was on and the source resolved to
+            // a hot-roster name AND its current GGML type is
+            // either F32 or one of the expand-to-F32 types
+            // (otherwise the source already carries narrower
+            // precision than F16 and we don't widen).
+            //
+            // QVAC-18605 round 6 — the 2-arg overload layers the
+            // user-supplied `f16_weights_deny_list` substring
+            // patterns on top of the curated allow-list.  Empty
+            // deny-list (the default) → identical behaviour to
+            // the round-1/2/3 path.  When the deny-list flips a
+            // would-be-hot tensor back to F32 we bump
+            // `model.f16_weights_excluded_count` so bench output
+            // can confirm the user's deny-list took effect.
+            //
+            // Master's Phase 2A keys the decision off the source
+            // name resolved from `tensor_to_source_for_alloc`
+            // (falling back to the dst `name` when absent); round
+            // 6 narrows that to require the map lookup to succeed
+            // so the deny-list operates on a known-stable source
+            // identifier.  Net: a tensor that previously went F16
+            // via the dst-name fallback now stays at its native
+            // precision-path type — the curated allow-list isn't
+            // expected to hit on dst names so this is a no-op in
+            // practice.
+            // Resolve a stable "decision name" up-front.  Used both
+            // by the round-6 deny-list check below and by master's
+            // precision-driven `target_supertonic_storage_type`
+            // dispatch.  Falls back to the dst tensor `name` when the
+            // source-map lookup misses (as master's Phase 2A did); only
+            // the precision path sees it, since F16 requires a map hit.
+            auto src_it = tensor_to_source_for_alloc.find(name);
+            const std::string decision_name =
+                (src_it != tensor_to_source_for_alloc.end())
+                    ? src_it->second
+                    : std::string(name);
+
+            bool f16_materialise = false;
+            if (model.use_f16_weights &&
+                src_it != tensor_to_source_for_alloc.end() &&
+                (src->type == GGML_TYPE_F32 ||
+                 should_expand_supertonic_tensor(src->type))) {
+                const bool curated_hot = should_materialise_f16_weight(decision_name);
+                const bool denied      = curated_hot &&
+                    !should_materialise_f16_weight(decision_name, f16_weights_deny_list);
+                if (denied) {
+                    ++model.f16_weights_excluded_count;
+                } else if (curated_hot) {
+                    f16_materialise = true;
+                }
+            }
+
+            ggml_type dst_type;
+            if (f16_materialise) {
+                dst_type = GGML_TYPE_F16;
+            } else {
+                // Precision-driven path (ours): F32 / F16 / Q8_0 per
+                // the `--precision` flag.  Returns src->type unchanged
+                // for tensors that don't need conversion.
+                dst_type = target_supertonic_storage_type(
+                    decision_name, src->type, precision,
+                    /*backend_is_cpu=*/ ggml_backend_is_cpu(model.backend));
+            }
+
+            ggml_tensor * dst = (dst_type == src->type)
+                ? ggml_dup_tensor(model.ctx_w, src)
+                : ggml_new_tensor(model.ctx_w, dst_type, ggml_n_dims(src), src->ne);
             ggml_set_name(dst, name);
             model.tensors[name] = dst;
-            if (should_expand_supertonic_tensor(src->type)) {
+
+            if (f16_materialise) {
+                // Phase 2A F16 materialise path.
+                std::vector<float> src_f32;
+                if (should_expand_supertonic_tensor(src->type)) {
+                    src_f32 = expand_supertonic_tensor_to_f32(src);
+                } else {
+                    const int64_t n = ggml_nelements(src);
+                    src_f32.resize((size_t) n);
+                    std::memcpy(src_f32.data(), ggml_get_data(src), (size_t) n * sizeof(float));
+                }
+                std::vector<uint16_t> & f16 = f16_materialised_tensors[name];
+                f16.resize(src_f32.size());
+                ggml_fp32_to_fp16_row(src_f32.data(),
+                                      reinterpret_cast<ggml_fp16_t *>(f16.data()),
+                                      (int64_t) src_f32.size());
+            } else if (needs_supertonic_tensor_conversion(src->type, dst_type)) {
+                // Precision-driven conversion (ours).  Covers f32 → q8_0,
+                // q8_0 → f32, f16 → f32 etc.  Buffered here, uploaded later.
+                convert_supertonic_tensor_data(src, dst_type, converted_tensors[name]);
+            } else if (should_expand_supertonic_tensor(src->type)) {
+                // Legacy fallback: f16/q8_0 src with f32 dst that
+                // didn't go through the conversion helper above.
                 expanded_f32_tensors[name] = expand_supertonic_tensor_to_f32(src);
             }
         }
 
+        // Audit finding F2 — declare the pre-baked vocoder BN
+        // tensors BEFORE `ggml_backend_alloc_ctx_tensors` so they
+        // get a slot in the same backend buffer as the rest of the
+        // model weights.  Data is uploaded after the source-tensor
+        // upload loop further down; see the F2 hook after
+        // `bind_vocoder_weights`.
+        model.vocoder.bn_scale_pre = ggml_new_tensor_1d(model.ctx_w, GGML_TYPE_F32, 512);
+        ggml_set_name(model.vocoder.bn_scale_pre, "vocoder/bn_scale_pre");
+        model.vocoder.bn_shift_pre = ggml_new_tensor_1d(model.ctx_w, GGML_TYPE_F32, 512);
+        ggml_set_name(model.vocoder.bn_shift_pre, "vocoder/bn_shift_pre");
+
+        // Audit finding F6 — declare the pre-transposed companion
+        // tensors for the four t_proj matmul weights.  Each one has
+        // shape [512, 64] in the GGUF (matches the Supertonic-2
+        // architecture's time-embedding projection); the transposed
+        // form is [64, 512], i.e. axes 0/1 swapped.  Data uploaded
+        // after `bind_vocoder_weights` in the F6 post-bind hook.
+        // The roster matches AUDIT_SUPERTONIC_OPENCL.md F6 + the
+        // test in test_supertonic_load_caches.cpp.
+        //
+        // Phase 2A interaction: the F6 hook only supports F32
+        // sources (the host-side transpose loop assumes 4-byte
+        // strides).  When F16 weights are on, the same matmul
+        // weights have already been materialised as F16, so we
+        // skip F6's allocation + upload entirely; call sites in
+        // `supertonic_vector_estimator.cpp` fall back to the
+        // legacy in-graph `ggml_cont(ggml_transpose(W))` path.
+        ggml_tensor * pretrans_t_proj[4] = {nullptr, nullptr, nullptr, nullptr};
+        static const char * const kF6PretransNames[4] = {
+            "vector_estimator:onnx::MatMul_3095__T",
+            "vector_estimator:onnx::MatMul_3140__T",
+            "vector_estimator:onnx::MatMul_3185__T",
+            "vector_estimator:onnx::MatMul_3230__T",
+        };
+        const bool f6_active = !model.use_f16_weights;
+        if (f6_active) {
+            for (int i = 0; i < 4; ++i) {
+                pretrans_t_proj[i] = ggml_new_tensor_2d(model.ctx_w, GGML_TYPE_F32, 64, 512);
+                ggml_set_name(pretrans_t_proj[i], kF6PretransNames[i]);
+            }
+        }
+
         model.buffer_w = ggml_backend_alloc_ctx_tensors(model.ctx_w, model.backend);
         if (!model.buffer_w) throw std::runtime_error("ggml_backend_alloc_ctx_tensors failed");
 
@@ -341,8 +2068,33 @@ bool load_supertonic_gguf(const std::string & path,
              cur;
              cur = ggml_get_next_tensor(model.ctx_w, cur)) {
             ggml_tensor * src = ggml_get_tensor(tmp_ctx, ggml_get_name(cur));
-            auto expanded = expanded_f32_tensors.find(ggml_get_name(cur));
-            if (expanded != expanded_f32_tensors.end()) {
+            if (!src) {
+                // Pre-baked tensor (F2 / F6 / future audit phases):
+                // declared in model.ctx_w earlier in this function but
+                // doesn't have a GGUF source row — data is uploaded by
+                // the dedicated post-bind hook further down.  Skip
+                // here so we don't deref a null `src`.
+                continue;
+            }
+            // Phase 2A: F16-materialised tensors take precedence over
+            // the precision-converted / F32-expanded paths (they may
+            // have been promoted from either F32 or F16/Q8_0 sources).
+            auto f16_mat = f16_materialised_tensors.find(ggml_get_name(cur));
+            if (f16_mat != f16_materialised_tensors.end()) {
+                ggml_backend_tensor_set(cur, f16_mat->second.data(), 0,
+                                        f16_mat->second.size() * sizeof(uint16_t));
+                continue;
+            }
+            // Precision-driven conversion (`--precision q8_0`/f16 etc.) —
+            // bytes are already in dst-type representation.
+            auto converted = converted_tensors.find(ggml_get_name(cur));
+            if (converted != converted_tensors.end()) {
+                ggml_backend_tensor_set(cur, converted->second.data(), 0,
+                                        converted->second.size());
+            } else if (auto expanded = expanded_f32_tensors.find(ggml_get_name(cur));
+                       expanded != expanded_f32_tensors.end()) {
+                // Legacy f16/q8_0 → f32 expansion (used when the
+                // conversion helper didn't run).
                 ggml_backend_tensor_set(cur, expanded->second.data(), 0,
                                         expanded->second.size() * sizeof(float));
             } else {
@@ -356,14 +2108,21 @@ bool load_supertonic_gguf(const std::string & path,
             ggml_backend_tensor_get(unicode, model.unicode_indexer.data(), 0, ggml_nbytes(unicode));
         }
 
-        std::vector<std::string> tensor_names = get_string_array(gguf_ctx, "supertonic.tensor_names");
-        std::vector<std::string> source_names = get_string_array(gguf_ctx, "supertonic.source_names");
-        if (tensor_names.size() != source_names.size()) {
-            throw std::runtime_error("supertonic tensor/source metadata length mismatch");
-        }
-        for (size_t i = 0; i < tensor_names.size(); ++i) {
-            ggml_tensor * t = require_tensor(model, tensor_names[i]);
-            model.source_tensors[source_names[i]] = t;
+        // Populate the model's source_tensors lookup from the
+        // GGUF's `supertonic.tensor_names` / `supertonic.source_names`
+        // pair.  `tensor_to_source_for_alloc` above holds the same
+        // data but is scoped to the pre-alloc decision; re-reading
+        // here avoids widening its scope.
+        {
+            std::vector<std::string> tensor_names = get_string_array(gguf_ctx, "supertonic.tensor_names");
+            std::vector<std::string> source_names = get_string_array(gguf_ctx, "supertonic.source_names");
+            if (tensor_names.size() != source_names.size()) {
+                throw std::runtime_error("supertonic.tensor_names / source_names length mismatch");
+            }
+            for (size_t i = 0; i < tensor_names.size(); ++i) {
+                ggml_tensor * t = require_tensor(model, tensor_names[i]);
+                model.source_tensors[source_names[i]] = t;
+            }
         }
 
         for (const std::string & voice_name : get_string_array(gguf_ctx, "supertonic.voice_names")) {
@@ -376,11 +2135,297 @@ bool load_supertonic_gguf(const std::string & path,
 
         bind_vocoder_weights(model);
 
-        // Build the scheduler. With a GPU primary, add a CPU backend so
-        // ops the GPU can't run (GGML_OP_CUSTOM, and any FA the driver
-        // rejects) are routed to CPU rather than silently skipped. With a
-        // CPU primary, the sched is a single-backend pass-through (no
-        // second CPU backend created).
+        // Audit finding F1 — cache the vector-estimator RoPE θ
+        // tensor on the host once at load time.  All four group
+        // attention sites in `supertonic_vector_step_ggml`'s
+        // production GGML path read from the same source tensor;
+        // caching here avoids 4 × N_STEPS GPU→host downloads per
+        // synth on a non-CPU backend.  Tensor is small (64 floats
+        // typical), so the host-side copy cost is negligible
+        // compared with the sync-point savings.  See
+        // AUDIT_SUPERTONIC_OPENCL.md F1 + PLAN Phase 2F.
+        //
+        // The source tensor is mandatory for any production
+        // Supertonic GGUF (all four group attention sites depend
+        // on it); fail-fast at load time so the call-site
+        // assumption "model.vector_rope_theta.data() is non-null"
+        // can stay assertion-free.  Matches the previous behaviour
+        // where the same tensor was looked up via
+        // `read_f32(model, "...theta")` on the hot path and would
+        // throw `runtime_error("missing source tensor: ...")`.
+        {
+            ggml_tensor * theta_src = require_source_tensor(model,
+                "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
+            model.vector_rope_theta.resize((size_t) ggml_nelements(theta_src));
+            ggml_backend_tensor_get(theta_src,
+                                    model.vector_rope_theta.data(),
+                                    0, ggml_nbytes(theta_src));
+        }
+
+        // Audit finding F2 — compute the vocoder BN scale / shift
+        // pre-bake.  Downloads the four final_norm.* tensors that
+        // were just uploaded a few lines above (so this is a single
+        // round-trip at load time, not per-synth), folds them into
+        // the BN-fused form, and uploads to bn_scale_pre /
+        // bn_shift_pre which the vocoder graph cache references
+        // directly as weights.  Every subsequent synth call skips
+        // the 4 reads + CPU compute + 2 uploads that the old path
+        // did.  See AUDIT_SUPERTONIC_OPENCL.md F2.
+        {
+            auto download = [](ggml_tensor * t, std::vector<float> & out) {
+                out.resize((size_t) ggml_nelements(t));
+                ggml_backend_tensor_get(t, out.data(), 0, ggml_nbytes(t));
+            };
+            std::vector<float> gamma, beta, mean, var;
+            download(model.vocoder.final_norm_g, gamma);
+            download(model.vocoder.final_norm_b, beta);
+            download(model.vocoder.final_norm_running_mean, mean);
+            download(model.vocoder.final_norm_running_var,  var);
+            if (gamma.size() != 512 || beta.size() != 512 ||
+                mean.size() != 512  || var.size()  != 512) {
+                throw std::runtime_error(
+                    "vocoder final_norm.* size mismatch (expected 512 each)");
+            }
+            std::vector<float> bn_scale_pre(512), bn_shift_pre(512);
+            for (int c = 0; c < 512; ++c) {
+                bn_scale_pre[c] = gamma[c] / std::sqrt(var[c] + 1e-5f);
+                bn_shift_pre[c] = beta[c] - mean[c] * bn_scale_pre[c];
+            }
+            ggml_backend_tensor_set(model.vocoder.bn_scale_pre,
+                                    bn_scale_pre.data(), 0, 512 * sizeof(float));
+            ggml_backend_tensor_set(model.vocoder.bn_shift_pre,
+                                    bn_shift_pre.data(), 0, 512 * sizeof(float));
+        }
+
+        // Audit finding F6 — populate the pre-transposed t_proj
+        // companions from the source tensors.  Gated on
+        // `f6_active`; see the declaration block above for the
+        // Phase 2A interaction note.
+        if (f6_active) {
+            static const char * const kF6Sources[4] = {
+                "vector_estimator:onnx::MatMul_3095",
+                "vector_estimator:onnx::MatMul_3140",
+                "vector_estimator:onnx::MatMul_3185",
+                "vector_estimator:onnx::MatMul_3230",
+            };
+            for (int i = 0; i < 4; ++i) {
+                if (!pretrans_t_proj[i]) continue;
+                auto it = model.source_tensors.find(kF6Sources[i]);
+                if (it == model.source_tensors.end() || !it->second) continue;
+                ggml_tensor * orig = it->second;
+                // Defensive: only pre-transpose the F32 [512, 64]
+                // shape the audit roster targets.  Any other layout
+                // means the GGUF doesn't fit the assumed
+                // architecture (or has already been quantized below
+                // F32, in which case the call-site rewrite would
+                // need a different lowering anyway).
+                if (orig->type != GGML_TYPE_F32 ||
+                    orig->ne[0] != 512 || orig->ne[1] != 64 ||
+                    orig->ne[2] != 1   || orig->ne[3] != 1) {
+                    continue;
+                }
+                std::vector<float> src((size_t) ggml_nelements(orig));
+                ggml_backend_tensor_get(orig, src.data(), 0, ggml_nbytes(orig));
+                std::vector<float> dst((size_t) 64 * 512);
+                // Transpose: dst[i, j] = src[j, i] where source ne=
+                // [512, 64].  Memory: src[j * 512 + i],
+                // dst[i * 64 + j].
+                for (int j = 0; j < 64; ++j) {
+                    for (int ii = 0; ii < 512; ++ii) {
+                        dst[(size_t) ii * 64 + j] = src[(size_t) j * 512 + ii];
+                    }
+                }
+                ggml_backend_tensor_set(pretrans_t_proj[i], dst.data(), 0, dst.size() * sizeof(float));
+                model.source_tensors[std::string(kF6Sources[i]) + "__T"] = pretrans_t_proj[i];
+            }
+        }
+
+        // Audit follow-up #2 — F13 + F16.
+        //
+        // F13: pre-download the text-encoder layer-norm weights
+        // that the GPU production path's scalar `layer_norm_channel`
+        // continuation consumes on every synth.  Roster covers the
+        // eight `attn_encoder.norm_layers_{1,2}.{0..3}` stems plus the
+        // trailing `speech_prompted_text_encoder.norm` stem, each with a
+        // weight and a bias (18 entries), saving ~18 GPU→host syncs per
+        // synth on a non-CPU backend.  See
+        // `AUDIT_SUPERTONIC_OPENCL.md` § F13 (audit follow-up #2).
+        {
+            auto cache_if_present = [&](const std::string & name) {
+                auto it = model.source_tensors.find(name);
+                if (it == model.source_tensors.end() || !it->second) return;
+                std::vector<float> & dst = model.text_encoder_ln_weights[name];
+                dst.resize((size_t) ggml_nelements(it->second));
+                ggml_backend_tensor_get(it->second, dst.data(), 0, ggml_nbytes(it->second));
+            };
+            static const char * const kLnStems[] = {
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.0",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.1",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.2",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.3",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.0",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.1",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.2",
+                "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.3",
+                "text_encoder:tts.ttl.speech_prompted_text_encoder.norm",
+            };
+            for (const char * stem : kLnStems) {
+                cache_if_present(std::string(stem) + ".norm.weight");
+                cache_if_present(std::string(stem) + ".norm.bias");
+            }
+        }
+
+        // Audit finding F16 (a finding ID, not the dtype): pre-download
+        // the two `tanh_k` tensors consumed by the speech-prompted
+        // attention's CPU-side packing loop.  Each is ~50 × 256 floats.
+        // The per-synth pattern of "open a fresh ggml graph + read
+        // tanh_k + pack q/k/v + run flash attention + tear graph down"
+        // still needs the host-side tanh_k bytes for the pack loop, but
+        // those bytes don't need a fresh download on every synth.
+        {
+            static const char * const kTanhKSources[2] = {
+                "text_encoder:/speech_prompted_text_encoder/attention1/tanh/Tanh_output_0",
+                "text_encoder:/speech_prompted_text_encoder/attention2/tanh/Tanh_output_0",
+            };
+            for (int i = 0; i < 2; ++i) {
+                auto it = model.source_tensors.find(kTanhKSources[i]);
+                if (it == model.source_tensors.end() || !it->second) continue;
+                model.speech_tanh_k_cache[i].resize((size_t) ggml_nelements(it->second));
+                ggml_backend_tensor_get(it->second,
+                                        model.speech_tanh_k_cache[i].data(),
+                                        0, ggml_nbytes(it->second));
+            }
+        }
+
+        // Materialize pre-transposed copies of matmul weights to drop the
+        // runtime `cont(transpose(w))` dispatch that `dense_matmul_time_ggml`
+        // emits on every graph compute (~32 sites × 5 CFM steps per synth).
+        // CPU's `cblas_sgemm` already handles the transpose via its `Trans`
+        // flag, so this only pays off on non-CPU backends (motivated by
+        // Metal perf); skip the extra memory + load-time cost on CPU.  Setting
+        // `SUPERTONIC_DISABLE_WEIGHT_PRETRANSPOSE` to any value (even `0`)
+        // disables the pass, e.g. to debug the un-pretransposed path.
+        //
+        // Coexists with the F6 t_proj pass above: that one registers 4
+        // specific `[512, 64]` `t_proj` weights under the `__T` suffix; this
+        // one walks every 2-D `:onnx::MatMul_` entry in `source_tensors` (the
+        // t_proj weights and their `__T` companions too) under `:T`.  No collisions.
+        static const bool disable_pretranspose =
+            std::getenv("SUPERTONIC_DISABLE_WEIGHT_PRETRANSPOSE") != nullptr;
+        if (!disable_pretranspose && model.backend &&
+            !ggml_backend_is_cpu(model.backend)) {
+            std::vector<std::pair<std::string, ggml_tensor *>> to_pretranspose;
+            for (const auto & [src_name, t] : model.source_tensors) {
+                if (!t) continue;
+                if (src_name.find(":onnx::MatMul_") == std::string::npos) continue;
+                if (ggml_n_dims(t) != 2) continue;
+                // Pretranspose f32 weights (default precision) AND q8_0 / f16
+                // weights (asymmetric load modes).  For q8_0 / f16 we
+                // dequant→transpose→requantize through f32; the round-trip
+                // introduces tiny rounding within the type's existing noise
+                // tolerance.  This is what unlocks A3 step 2:
+                // kernel_mul_mm_q8_0_f32 / kernel_mul_mm_f16_f32 dispatch
+                // only when (a) a pretransposed weight is available and
+                // (b) the new dense_matmul_time_wt_pretransposed_ggml
+                // swaps the mul_mat args so that weight is src0.
+                if (t->type != GGML_TYPE_F32 &&
+                    t->type != GGML_TYPE_F16  &&
+                    t->type != GGML_TYPE_Q8_0) continue;
+                to_pretranspose.push_back({src_name, t});
+            }
+            if (!to_pretranspose.empty()) {
+                ggml_init_params extra_params = {
+                    /*.mem_size=*/ ggml_tensor_overhead() * to_pretranspose.size(),
+                    /*.mem_buffer=*/ nullptr,
+                    /*.no_alloc=*/ true,
+                };
+                model.ctx_w_extra = ggml_init(extra_params);
+                if (!model.ctx_w_extra) {
+                    throw std::runtime_error("ggml_init ctx_w_extra failed");
+                }
+                std::vector<std::pair<ggml_tensor *, ggml_tensor *>> orig_to_pre;
+                orig_to_pre.reserve(to_pretranspose.size());
+                for (const auto & [src_name, t] : to_pretranspose) {
+                    // Pre tensor has same type as orig (f32 stays f32,
+                    // q8_0 stays q8_0); only the shape swaps.
+                    ggml_tensor * tt = ggml_new_tensor_2d(model.ctx_w_extra,
+                        t->type, t->ne[1], t->ne[0]);
+                    const std::string tt_name = std::string(ggml_get_name(t)) + ":T";
+                    ggml_set_name(tt, tt_name.c_str());
+                    model.source_tensors[src_name + ":T"] = tt;
+                    orig_to_pre.push_back({t, tt});
+                }
+                model.buffer_w_extra =
+                    ggml_backend_alloc_ctx_tensors(model.ctx_w_extra, model.backend);
+                if (!model.buffer_w_extra) {
+                    throw std::runtime_error(
+                        "ggml_backend_alloc_ctx_tensors ctx_w_extra failed");
+                }
+                // Upload the transposed data.  For f32 weights this is a
+                // straight host-side reorder.  For q8_0 / f16 weights we
+                // convert to f32, transpose in f32, then convert back via
+                // from_float into the pretransposed tensor.  Both directions go
+                // through the public ggml type-traits APIs.
+                for (const auto & [orig, pre] : orig_to_pre) {
+                    const int OC = (int) orig->ne[0];
+                    const int IC = (int) orig->ne[1];
+                    const size_t n = (size_t) OC * IC;
+
+                    // Step 1: download `orig` data, dequantize to f32 if needed.
+                    std::vector<float> host_orig_f32(n);
+                    if (orig->type == GGML_TYPE_F32) {
+                        ggml_backend_tensor_get(orig, host_orig_f32.data(), 0,
+                                                n * sizeof(float));
+                    } else {
+                        std::vector<uint8_t> raw(ggml_nbytes(orig));
+                        ggml_backend_tensor_get(orig, raw.data(), 0, raw.size());
+                        const ggml_type_traits * tr = ggml_get_type_traits(orig->type);
+                        if (!tr || !tr->to_float) {
+                            throw std::runtime_error(
+                                std::string("pretranspose: missing to_float for ") +
+                                ggml_type_name(orig->type));
+                        }
+                        tr->to_float(raw.data(), host_orig_f32.data(), (int64_t) n);
+                    }
+
+                    // Step 2: transpose in f32.
+                    std::vector<float> host_pre_f32(n);
+                    for (int oc = 0; oc < OC; ++oc) {
+                        for (int ic = 0; ic < IC; ++ic) {
+                            host_pre_f32[(size_t) ic + (size_t) oc * IC] =
+                                host_orig_f32[(size_t) oc + (size_t) ic * OC];
+                        }
+                    }
+
+                    // Step 3: upload (requantizing if needed).
+                    if (pre->type == GGML_TYPE_F32) {
+                        ggml_backend_tensor_set(pre, host_pre_f32.data(), 0,
+                                                n * sizeof(float));
+                    } else {
+                        const size_t dst_bytes = ggml_row_size(pre->type, n);
+                        std::vector<uint8_t> raw(dst_bytes);
+                        const ggml_type_traits_cpu * dtr =
+                            ggml_get_type_traits_cpu(pre->type);
+                        if (!dtr || !dtr->from_float) {
+                            throw std::runtime_error(
+                                std::string("pretranspose: missing from_float for ") +
+                                ggml_type_name(pre->type));
+                        }
+                        dtr->from_float(host_pre_f32.data(), raw.data(), (int64_t) n);
+                        ggml_backend_tensor_set(pre, raw.data(), 0, raw.size());
+                    }
+                    model.pretransposed_weights[orig] = pre;
+                }
+            }
+        }
+
+        // QVAC-19254 — build the scheduler.  With a GPU primary, add a
+        // CPU backend so ops the GPU can't run (GGML_OP_CUSTOM, and any
+        // FA the driver rejects) are routed to CPU rather than silently
+        // skipped.  With a CPU primary, the sched is a single-backend
+        // pass-through (no second CPU backend created).  Consumed by
+        // `supertonic_sched_alloc` / `supertonic_sched_compute` in the
+        // per-stage compute helpers.
         {
             ggml_backend_t backends[2] = { model.backend, nullptr };
             int n_backends = 1;
@@ -426,11 +2471,18 @@ void free_supertonic_model(supertonic_model & model) {
     if (model.generation_id != 0) {
         unregister_supertonic_alive(model.generation_id);
     }
-    // Free the scheduler before the backends/buffers it references.
+    // QVAC-19254 — free the scheduler before the backends / buffers it
+    // references; the sched holds non-owning pointers to model.backend +
+    // model.cpu_backend, so tearing those down first would leave the
+    // sched with dangling references during its destructor.
     if (model.sched) {
         ggml_backend_sched_free(model.sched);
         model.sched = nullptr;
     }
+    if (model.buffer_w_extra) {
+        ggml_backend_buffer_free(model.buffer_w_extra);
+        model.buffer_w_extra = nullptr;
+    }
     if (model.buffer_w) {
         ggml_backend_buffer_free(model.buffer_w);
         model.buffer_w = nullptr;
@@ -443,10 +2495,15 @@ void free_supertonic_model(supertonic_model & model) {
         ggml_backend_free(model.cpu_backend);
         model.cpu_backend = nullptr;
     }
+    if (model.ctx_w_extra) {
+        ggml_free(model.ctx_w_extra);
+        model.ctx_w_extra = nullptr;
+    }
     if (model.ctx_w) {
         ggml_free(model.ctx_w);
         model.ctx_w = nullptr;
     }
+    model.pretransposed_weights.clear();
     model.tensors.clear();
     model.source_tensors.clear();
     model.vocoder = {};
@@ -454,6 +2511,16 @@ void free_supertonic_model(supertonic_model & model) {
     model.unicode_indexer.clear();
     model.languages.clear();
     model.tts_json.clear();
+    // Reset the OpenCL optimization caches (audit F1 / F9 + F13 /
+    // F16) added to supertonic_model.  The vector-estimator RoPE θ
+    // cache is a bare std::vector so its clear() is sufficient; the
+    // time embedding cache map is `mutable`, so clear it explicitly
+    // here rather than leaving stale entries for the next load.
+    model.vector_rope_theta.clear();
+    model.time_emb_cache.clear();
+    model.text_encoder_ln_weights.clear();
+    for (auto & v : model.speech_tanh_k_cache) v.clear();
+    model.scalar_weight_cache.clear();
     model.generation_id = 0;
 }
 
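Reviewer note on the F2 hunk above: the load-time pre-bake folds the four `final_norm.*` tensors into `bn_scale_pre` / `bn_shift_pre` using `scale = gamma / sqrt(var + eps)` and `shift = beta - mean * scale`. A minimal standalone sketch of that fold follows, assuming inference-mode batch norm with the same `1e-5` epsilon. `folded_bn`, `fold_batch_norm` and `batch_norm_ref` are hypothetical names used only for illustration; they are not part of this patch or of ggml.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical container for the folded per-channel parameters.
struct folded_bn {
    std::vector<float> scale;
    std::vector<float> shift;
};

// Hypothetical helper: folds inference-mode batch-norm parameters into a
// per-channel scale and shift, mirroring the arithmetic of the F2 loop.
inline folded_bn fold_batch_norm(const std::vector<float> & gamma,
                                 const std::vector<float> & beta,
                                 const std::vector<float> & mean,
                                 const std::vector<float> & var,
                                 float eps = 1e-5f) {
    const std::size_t n = gamma.size();
    if (beta.size() != n || mean.size() != n || var.size() != n) {
        throw std::invalid_argument("fold_batch_norm: size mismatch");
    }
    folded_bn out{std::vector<float>(n), std::vector<float>(n)};
    for (std::size_t c = 0; c < n; ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}

// Reference (unfolded) batch norm for one value, used only to check the fold.
inline float batch_norm_ref(float x, float gamma, float beta,
                            float mean, float var, float eps = 1e-5f) {
    return gamma * (x - mean) / std::sqrt(var + eps) + beta;
}
```

For any input `x`, `scale[c] * x + shift[c]` should agree with `batch_norm_ref(x, ...)` up to float rounding. A CPU-side unit test for the pre-bake could assert exactly that property.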
diff --git a/tts-cpp/src/supertonic_internal.h b/tts-cpp/src/supertonic_internal.h
index 7e157f388f8..a231284b79b 100644
--- a/tts-cpp/src/supertonic_internal.h
+++ b/tts-cpp/src/supertonic_internal.h
@@ -2,16 +2,44 @@
 
 #include <cstdint>
 #include <array>
+#include <cmath>
 #include <map>
 #include <string>
 #include <unordered_map>
 #include <vector>
 
 #include "ggml-backend.h"
+#include "ggml-cpu.h"
 #include "ggml.h"
 
 namespace tts_cpp::supertonic::detail {
 
+// QVAC-18605 round 4 — multi-dtype K/V flash-attention dispatch.
+//
+// Generalises the round-1 `use_f16_attn` boolean (F16 vs F32 only)
+// into an enum of four dispatch dtypes so operators can opt into BF16
+// K/V (Vulkan coopmat2 — better quality than F16 at identical
+// bandwidth, no underflow on small attention scores) or Q8_0 K/V
+// (Vulkan + half the K/V upload bandwidth) when their adapter
+// advertises the corresponding capability.
+//
+// Sentinel `autoselect` is used only on `EngineOptions::kv_attn_type`
+// (= -1) and as a "not yet resolved" marker; the resolver
+// always returns a concrete dispatch dtype (f32/f16/bf16/q8_0).
+//
+// The underlying type is pinned to int so the value can be cast
+// cleanly to/from `EngineOptions::kv_attn_type` (also int, default -1).
+//
+// Declared up here (above `supertonic_model`) so the model can
+// carry a `kv_attn_dtype` field without a forward declaration.
+enum class kv_attn_dtype : int {
+    autoselect = -1,
+    f32        = 0,
+    f16        = 1,
+    bf16       = 2,
+    q8_0       = 3,
+};
+
 struct supertonic_hparams {
     std::string arch = "supertonic2";
     std::string ftype = "f32";
@@ -32,6 +60,187 @@ struct supertonic_voice_style {
     ggml_tensor * dp  = nullptr; // (16, 8, 1) in ggml axis order for JSON (1, 8, 16)
 };
 
+// QVAC-18605 round 7 — voice ttl/dp host cache.
+//
+// `Engine::Impl::synthesize()` historically downloaded the per-
+// voice style tensors (`ttl`, `dp`) on EVERY call:
+//
+//     std::vector<float> style_ttl = read_tensor_f32(vit->second.ttl);
+//     std::vector<float> style_dp  = read_tensor_f32(vit->second.dp);
+//
+// On Vulkan / OpenCL backends each `read_tensor_f32` is a
+// synchronous GPU→host download.  The voice tensors are part of
+// the load-time GGUF state and never mutate after load, so
+// caching them per-engine keyed by voice name eliminates two sync
+// points per `synthesize()` call after the first per-voice.
+//
+// This helper is intentionally extracted from `Engine::Impl` so
+// the lookup-or-load semantics are testable on CPU without
+// instantiating a full Engine.  See
+// `test-supertonic-voice-host-cache` for the contract.
+//
+// Reference-stability contract: the returned `entry` reference is
+// stable across subsequent `get_or_load` calls for OTHER voices
+// (`std::unordered_map`'s reference-stability guarantee — element
+// references survive `insert` even when the table rehashes; only
+// iterators are invalidated).  Callers may hold the reference
+// across the next `get_or_load` on the same instance, BUT must
+// NOT call `clear()` or `erase()` on the cache while holding the
+// reference.  The Engine::Impl call site captures `e.ttl.data()`
+// / `e.dp.data()` and forwards them to the synthesis pipeline,
+// which expects them to stay valid for the duration of the
+// call — `clear()` is currently only reachable on Engine
+// destruction (post-synthesis).
+//
+// THREAD-SAFETY (PR #18 review): voice_host_cache is NOT
+// internally synchronised.  Calling any non-const method
+// (`get_or_load`, `clear`) concurrently with another call on
+// the SAME instance is UB (standard `unordered_map`
+// rules: writes need exclusive access).  The Engine's
+// documented threading model is single-threaded synthesis per
+// Engine instance; concurrent synthesis requires one Engine per
+// thread (each Engine carries its own voice_host_cache), which
+// is also what the iOS load/unload race fix (36a2c56) enforces
+// for the s3gen preload path.  If a future refactor lifts that
+// constraint (e.g. a thread-pool dispatch over a single
+// Engine), the call site MUST add an external mutex around
+// `voice_host_cache::get_or_load` + the downstream `.data()`
+// capture, OR switch this cache to a `std::shared_mutex`-
+// guarded internal lock.  Marked deliberately as caller's
+// responsibility today because the single-threaded model also
+// keeps the cache hot-path zero-cost (no atomic / lock-acquire
+// per call) — the cache exists to eliminate per-call GPU
+// downloads, and giving back any of that saving to internal
+// locking would be premature.
+struct voice_host_cache {
+    struct entry {
+        std::vector<float> ttl;
+        std::vector<float> dp;
+    };
+
+    // Returns a stable reference to the cached entry for
+    // `voice_name`.  On cache miss, calls `read_tensor_f32` on
+    // `ttl_tensor` and `dp_tensor`, stores the result, and
+    // returns the new entry.  On cache hit, returns the existing
+    // entry without touching the GGML tensors at all (the host
+    // vectors are reused as-is — `ttl_tensor` / `dp_tensor` may
+    // legally be null on a cache hit).
+    //
+    // Throws std::runtime_error if the entry is missing AND
+    // either tensor pointer is null (loud-failure for an Impl
+    // bug; never expected to fire on the production path because
+    // Impl validates `voices.find()` before calling).
+    const entry & get_or_load(const std::string & voice_name,
+                              ggml_tensor * ttl_tensor,
+                              ggml_tensor * dp_tensor);
+
+    // Drops every cached entry.  Currently only reachable on
+    // Engine destruction; included for forward-compat with hot-
+    // swap scenarios where the underlying backend is replaced
+    // while the engine handle is reused.
+    void clear();
+
+    // Diagnostic — number of entries currently cached.  Used by
+    // the test to assert lookup-vs-load semantics (size doesn't
+    // grow on a cache hit).
+    size_t size() const;
+
+private:
+    std::unordered_map<std::string, entry> by_name_;
+};
+
+// QVAC-18605 round 10 — pointer-compare upload-skip tracker.
+//
+// Background: per-step uploads of `text_emb` to the front-block
+// cache and to the 3 group-graph caches happen 5 times per synth
+// (once per denoise step), but `text_emb` is a host
+// `std::vector<float>` allocated ONCE in
+// `Engine::Impl::synthesize()` (and once per bench run) — so the
+// SAME pointer flows through 4 caches × 5 steps = 20 uploads /
+// synth, of which 16 are redundant re-uploads of identical data.
+//
+// The F4 pattern (already in `vector_res_style_qkv_cache` for
+// `style_v_in` / `kctx_in`) skips redundant uploads via pointer
+// comparison: if the host vector pointer is the same as the last
+// successful upload's pointer, skip.  This struct generalises
+// that pattern.
+//
+// CROSS-SYNTH HAZARD: `text_emb` is a local of
+// `Engine::Impl::synthesize()` (or of the bench loop), so its
+// heap buffer is freed at end of call.  Modern heap allocators
+// (jemalloc / tcmalloc / glibc) very often return the SAME
+// address for an immediately-following same-size allocation
+// (size-class reuse, locality optimisation), so synth N+1 may
+// have `text_emb.data() == synth_N.text_emb.data()` despite
+// holding completely different data.  A naive pointer-compare
+// upload-skip would silently send stale text-encoder embeddings
+// to the next synth.
+//
+// MITIGATION: caller MUST invoke `reset()` at every synth
+// boundary (i.e., when `current_step == 0`).  The first step of
+// every synth always uploads (cold-miss), populating the
+// tracker; steps 1..N-1 hit the pointer-compare and skip.
+// Across synths, the reset invalidates the cached pointer so
+// the next synth's upload always fires regardless of pointer
+// match.
+//
+// Reset is also required after a cache rebuild (the underlying
+// GPU buffer is reallocated and any cached upload-skip state is
+// stale).  In tree, cache rebuilds happen via `cache = {}`,
+// which restores the tracker's default `last_uploaded = nullptr`
+// and so resets it without an explicit `reset()` call.
+struct upload_skip_tracker {
+    const void * last_uploaded = nullptr;
+
+    // True iff `current` differs from the last recorded pointer
+    // (i.e., we MUST upload).  False iff we can skip.  After
+    // the consumer's upload call returns, they MUST call
+    // `mark_uploaded(current)` to update the cached pointer
+    // (else the next call re-uploads).
+    bool needs_upload(const void * current) const {
+        return current != last_uploaded;
+    }
+
+    // Records a successful upload.  Call AFTER the upload
+    // completes (so a failed upload doesn't pin the pointer —
+    // the next call would correctly re-attempt).
+    void mark_uploaded(const void * current) {
+        last_uploaded = current;
+    }
+
+    // Drops the cached pointer.  Caller invokes at synth
+    // boundary (current_step == 0) AND on cache rebuild (cache
+    // = {} also achieves this via zero-init of last_uploaded).
+    void reset() {
+        last_uploaded = nullptr;
+    }
+};
+
+// QVAC-18605 round 7 — Vulkan env-var passthrough.
+//
+// Applies a map of `GGML_VK_*` env-var overrides via
+// `set_env_if_unset` so the `init_supertonic_backend()` path
+// picks them up at backend construction time.  `set_env_if_unset`
+// semantics: an operator-set env var (already present in the
+// environment when this is called) WINS over the EngineOptions
+// override.  Lets a debugging operator force-disable a setting
+// from the shell without recompiling, while still letting an
+// EngineOptions configuration set the same knob in production.
+//
+// Throws std::runtime_error on a key that doesn't start with
+// `GGML_VK_` (loud-failure for operator-config typos like
+// `GMML_VK_PREFER_HOST_MEMORY`).  ALL-OR-NOTHING: validation
+// happens BEFORE any env var is touched, so a partial-success
+// can't leave the env in a half-applied state.
+//
+// Pass an empty map for a no-op (the default
+// `EngineOptions::vulkan_env_overrides` value).
+//
+// Must be called BEFORE `init_supertonic_backend()` runs; called
+// from `Engine::Impl` ctor and from `supertonic-bench` main right
+// before `load_supertonic_gguf()`.
+void apply_vulkan_env_overrides(const std::map<std::string, std::string> & overrides);
+
 struct supertonic_vocoder_convnext_weights {
     ggml_tensor * dw_w = nullptr;
     ggml_tensor * dw_b = nullptr;
@@ -59,6 +268,26 @@ struct supertonic_vocoder_weights {
     ggml_tensor * head1_b = nullptr;
     ggml_tensor * head_prelu = nullptr;
     ggml_tensor * head2_w = nullptr;
+
+    // Audit finding F2 — pre-baked vocoder BN scale + shift.
+    //
+    //   bn_scale_pre[c] = final_norm_g[c] / sqrt(final_norm_var[c] + 1e-5)
+    //   bn_shift_pre[c] = final_norm_b[c] - final_norm_mean[c] * bn_scale_pre[c]
+    //
+    // Both are constants for the model lifetime; pre-computing once
+    // at `load_supertonic_gguf()` time and uploading into a small
+    // dedicated backend buffer avoids the per-synth pattern of:
+    //
+    //   - 4 × `ggml_backend_tensor_get` (final_norm_g/b/mean/var, 512 floats each)
+    //   - host-side 512-element scale/shift compute
+    //   - 2 × `ggml_backend_tensor_set` (bn_scale_in/bn_shift_in graph inputs)
+    //
+    // The vocoder graph cache references these tensors directly
+    // (no `ggml_set_input` markers needed — they're weights, not
+    // graph inputs).  See AUDIT_SUPERTONIC_OPENCL.md F2 + PLAN
+    // Phase 2F.
+    ggml_tensor * bn_scale_pre = nullptr;
+    ggml_tensor * bn_shift_pre = nullptr;
 };
 
 struct supertonic_trace_tensor {
@@ -85,25 +314,264 @@ struct supertonic_model {
     ggml_context * ctx_w = nullptr;
     ggml_backend_buffer_t buffer_w = nullptr;
 
+    // True when the resolved compute backend is the GGML CPU backend; the
+    // BLAS-backed `ggml_custom_4d` fast paths in the vocoder / vector
+    // estimator depend on the backend's CPU-side scheduler invoking the
+    // op callbacks and the tensor data pointers being host-addressable.
+    // On any non-CPU backend (CUDA / Metal / Vulkan / OpenCL) the runtime
+    // must take the pure-GGML fallback path instead — that's what the
+    // supertonic_op_dispatch_scope below toggles inside the graph-build
+    // helpers.  Set once in load_supertonic_gguf() right after
+    // init_supertonic_backend() resolves the device and is stable for
+    // the lifetime of the model.  See `OpenCL bring-up` section in
+    // PROGRESS_SUPERTONIC.md for the rationale.
+    bool backend_is_cpu = true;
+    // QVAC-18605 / Vulkan bring-up: True when the resolved backend is
+    // ggml-vulkan (`ggml_backend_is_vk`).  Mirrors `backend_is_cpu` in
+    // intent — informational + dispatch-key.  Set once in
+    // load_supertonic_gguf() right after the backend is resolved.
+    // Stable for the model lifetime.  Used by supertonic_bench /
+    // engine.cpp for the human-readable backend description (so the
+    // bench log shows "Vulkan (device 0: NVIDIA RTX 5090)" instead
+    // of just "Vulkan") and by the dispatch helpers below to pick
+    // between the OpenCL-conservative `leaky_relu_portable_ggml`
+    // decomposition and the native `ggml_leaky_relu` op.  See the
+    // PROGRESS_SUPERTONIC.md "Vulkan bring-up" section for the
+    // rationale + supported-op matrix.
+    bool backend_is_vk = false;
+    // QVAC-18605 — backend supports `GGML_OP_LEAKY_RELU` natively.
+    // Resolved at load time via `ggml_backend_supports_op` against
+    // a synthetic LEAKY_RELU node.  Three reasons we don't piggy-
+    // back on `backend_is_cpu`:
+    //   1. CPU obviously supports it (builtin); we want the same flag
+    //      to ride the CPU path through the helper without a special
+    //      case.
+    //   2. Vulkan / Metal / CUDA support it natively (verified against
+    //      ggml-vulkan.cpp:`pipeline_leaky_relu_f32`,
+    //      ggml-metal:`kernel_leaky_relu_f32`,
+    //      ggml-cuda:`leaky_relu`).
+    //   3. Plain upstream ggml-opencl does NOT support it; chatterbox
+    //      ships a patch that adds the kernel (see chatterbox
+    //      PROGRESS.md "What was missing"), but that patch may or may
+    //      not be applied at the consumer's vendored ggml.
+    // The dynamic `ggml_backend_supports_op` query handles all of
+    // these cases without a hard-coded backend table.  When the query
+    // returns `false`, `leaky_relu_portable_ggml` decomposes into
+    // RELU + SCALE + ADD (universally supported, slightly more
+    // dispatches).  When it returns `true`, the helper emits the
+    // single fused builtin — fewer dispatches, lower scheduler
+    // overhead on the GPU command-buffer side.  Default `true`
+    // matches the historical CPU-only path.
+    bool use_native_leaky_relu = true;
+    // When true, the per-step vector-estimator attention graphs materialise
+    // K/V into contiguous F16 before calling ggml_flash_attn_ext so OpenCL
+    // (and other backends carrying the mixed-precision kernel) dispatch
+    // the `flash_attn_f32_f16` path instead of the F32-only one — large
+    // win on Adreno (see chatterbox PROGRESS.md OpenCL log).  Defaults to
+    // false on CPU (the cblas attention path is already efficient there);
+    // engine.cpp auto-enables it when the resolved backend is non-CPU,
+    // matching chatterbox's --cfm-f16-kv-attn behaviour.  On Vulkan the
+    // F16 K/V path goes through `kernel_flash_attn_*` shaders that
+    // accept any HSK / HSV that's a multiple of 8 (see
+    // ggml-vulkan.cpp `GGML_OP_FLASH_ATTN_EXT` supports_op gate);
+    // Supertonic's head_dim=64 satisfies that constraint by
+    // construction.
+    bool use_f16_attn = false;
+
+    // Phase 2A — load-time F16 materialization for the hot
+    // matmul / pointwise-conv weights identified by
+    // `should_materialise_f16_weight`.  Halves the GPU read
+    // bandwidth into those ops on non-CPU backends.  Captured on
+    // the model state at load time so the graph builders can fall
+    // back through `repeat_like(model.vocoder.bn_scale_pre, …)`-
+    // style casts when a tensor's storage type changed.  Auto-
+    // enables on GPU backends, off on CPU (mirrors `use_f16_attn`).
+    // Override via `EngineOptions::f16_weights` / `--f16-weights`.
+    bool use_f16_weights = false;
+
+    // The compute precision the model was loaded with — set by
+    // `load_supertonic_gguf`.  Lets graph builders dispatch precision-
+    // specific code paths (e.g. asymmetric q8_0 load on Metal).
+    // Orthogonal to `use_f16_weights` above (that's a per-op runtime
+    // selector for the OpenCL hot-weight materialisation; this is the
+    // global storage-type selector).
+    int precision_id = 0; // supertonic_precision::F32
+
+    // QVAC-18605 round 6 — count of tensors that the curated allow-
+    // list would have promoted to F16 but the user-supplied
+    // `f16_weights_deny_list` excluded.  Surfaced in bench output
+    // so operators can confirm their deny-list took effect.  Zero
+    // for the default empty deny-list path (zero behaviour change).
+    int f16_weights_excluded_count = 0;
+
+    // QVAC-18605 round 4 — resolved K/V flash-attention dispatch
+    // dtype.  Default `f32` (no surprise dispatch on a default-
+    // constructed model).  `load_supertonic_gguf` resolves the
+    // policy from `EngineOptions::kv_attn_type` + the round-2/3
+    // backend probes via `resolve_kv_attn_type` and sets this.
+    // The `supertonic_op_dispatch_scope` mirrors it onto the
+    // thread-local accessor read by the vector-estimator
+    // dispatch site.
+    //
+    // Forward-compat note: when `kv_attn_type != f32`, the
+    // legacy `use_f16_attn` boolean above is ALSO updated to
+    // `(kv_attn_type == f16)` so any code path still keying on
+    // the boolean (text-encoder / duration / vocoder) sees the
+    // historically-correct value.  The vector estimator (the
+    // only consumer that gains from the multi-dtype dispatch)
+    // reads `kv_attn_type` directly.
+    kv_attn_dtype kv_attn_type = kv_attn_dtype::f32;
+
     std::map<std::string, ggml_tensor *> tensors;
     std::unordered_map<std::string, ggml_tensor *> source_tensors;
     std::unordered_map<std::string, supertonic_voice_style> voices;
 
+    // Pre-transposed copies of matmul weights, materialized at load time
+    // to eliminate the per-call `cont(transpose(w))` dispatch that
+    // `dense_matmul_time_ggml` issues on every graph compute.  Keyed by
+    // the ORIGINAL weight tensor pointer (i.e. the value in
+    // `source_tensors[<MatMul_*>]`); the mapped value is the transposed
+    // f32 copy with `ne = [IC, OC]` and lives in `ctx_w_extra` /
+    // `buffer_w_extra`.  Lookup via `try_pretransposed_weight(model, w)`.
+    ggml_context * ctx_w_extra = nullptr;
+    ggml_backend_buffer_t buffer_w_extra = nullptr;
+    std::unordered_map<const ggml_tensor *, ggml_tensor *> pretransposed_weights;
+
     std::vector<int32_t> unicode_indexer;
     std::vector<std::string> languages;
     std::string tts_json;
+
+    // ----- OpenCL optimization caches (audit F1 / F9) -----
+    //
+    // F1: cached copy of the vector-estimator RoPE θ tensor (the
+    // `vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`
+    // entry).  All four group attention sites in the production GGML
+    // path read from the same source tensor; caching once at load
+    // saves 4 × N_STEPS GPU→host downloads per synth on a non-CPU
+    // backend.  Empty if the GGUF doesn't carry the theta tensor.
+    // Populated unconditionally at load time so call sites can use
+    // it without a fallback.
+    std::vector<float> vector_rope_theta;
+
+    // F9: per-(current_step, total_steps) cache of
+    // `time_embedding(model, …)` outputs.  The vector denoising
+    // schedule fires at most `total_steps` distinct (current, total)
+    // pairs per synth; cache hit rate is ≥(steps − 1) / steps once
+    // warm.  `mutable` because the cache populates lazily on
+    // const-method paths; thread-unsafe by design (matches the rest
+    // of supertonic_model: one engine per thread).  Key is
+    // `(current << 32) | total`.
+    mutable std::unordered_map<uint64_t, std::array<float, 64>> time_emb_cache;
+
+    // ----- Audit follow-up #2 caches (F13 / F16) -----
+    //
+    // F13: text-encoder LN weight host-side cache.  The text-encoder
+    // GGML production path runs four relpos + LN + FFN + LN
+    // iterations followed by a final speech-prompted LN; the LN
+    // step on each iteration calls the scalar `layer_norm_channel`
+    // which used to download γ + β from the backend on every call
+    // (~18 GPU→host downloads / synth on a non-CPU backend).
+    // Populated at `load_supertonic_gguf` time from
+    // `text_encoder:...attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}`
+    // plus the final `speech_prompted_text_encoder.norm.norm.*`.
+    // Keyed by the source-tensor name so the call-site rewrite
+    // becomes `auto & v = model.text_encoder_ln_weights[name]`.
+    // Empty entries fall back to `read_f32(model, name)` so a GGUF
+    // missing one of the rostered names degrades gracefully.
+    std::unordered_map<std::string, std::vector<float>> text_encoder_ln_weights;
+
+    // F16: speech-prompted attention `tanh_k` host-side cache.
+    // Indexed by attention layer (0 or 1).  Source tensors:
+    //   speech_tanh_k_cache[0] ←
+    //     "text_encoder:/speech_prompted_text_encoder/attention1/tanh/Tanh_output_0"
+    //   speech_tanh_k_cache[1] ←
+    //     "text_encoder:/speech_prompted_text_encoder/attention2/tanh/Tanh_output_0"
+    // Each ≈ 50 × 256 floats = 50 KiB; saves 2 sync points + ~100 KiB
+    // of redundant traffic per synth.
+    std::array<std::vector<float>, 2> speech_tanh_k_cache;
+
+    // ----- Audit follow-up #3 cache (F17) -----
+    //
+    // F17: generic lazy host-side cache for any source weight that
+    // a scalar-CPU continuation needs.  The duration stage's
+    // post-graph scalar attention (relpos K/V embeddings, conv_o,
+    // 4 LN pairs, the conv_{1,2} pairs of 2 FFNs, proj_out weight),
+    // like any future stage that uses `cached_read_f32`, populates
+    // this on first touch.  Keyed by the source-tensor name; value
+    // is the F32 payload, `ggml_nelements(src)` floats long.
+    //
+    // Memory cost: bounded by the union of stages' scalar-
+    // continuation weight footprints.  Empirically ~3-5 MB on a
+    // Supertonic-2 GGUF, vs. the savings of ~30 GPU→host syncs per
+    // duration synth (+ ~15 from the text-encoder LN cache (F13)
+    // and the speech tanh_k cache (F16) already shipped).
+    //
+    // `mutable` because the cache populates lazily on const-method
+    // paths; thread-unsafe by design (one engine per thread).
+    mutable std::unordered_map<std::string, std::vector<float>> scalar_weight_cache;
+};
+
+// `f16_weights`:
+//   -1 → auto (on when the resolved backend is non-CPU, off on CPU).
+//    0 → force off (every hot weight stays at its GGUF storage type).
+//    1 → force on  (every hot weight matching
+//        `should_materialise_f16_weight` is allocated as F16,
+//        regardless of backend).
+// See Phase 2A in `aiDocs/PLAN_SUPERTONIC_OPENCL.md` for the
+// roster + auto-policy rationale.
+//
+// `precision` (separate concern): selects the storage type for
+// matmul weights at GGUF load time.  Mirrors the public
+// `tts_cpp::supertonic::Precision` enum.  F32 is the historical
+// default; Q8_0 / F16 trigger asymmetric loads on Metal.
+enum class supertonic_precision {
+    F32 = 0,
+    F16 = 1,
+    Q8_0 = 2,
 };
 
+// `vulkan_device` (QVAC-18605):
+//   ≥ 0 → adapter index passed to `ggml_backend_vk_init(idx)`.
+//        Range-checked against `ggml_backend_vk_get_device_count()`;
+//        an out-of-range index is a hard error (no silent CPU
+//        fallback — that would mask CLI typos / wrong-machine
+//        config).  Default 0 (the historical hard-coded value).
+//   < 0 → reserved for future "auto-pick best device" behaviour;
+//        treated as 0 today.
+// Has no effect when the build wasn't compiled with `GGML_VULKAN`
+// or when `n_gpu_layers <= 0`.
+// QVAC-18605 round 6 — `f16_weights_deny_list`:
+//   Extra deny-list (substring patterns) for the F16-weights
+//   materialization predicate.  Layered ON TOP of the curated
+//   allow-list in `should_materialise_f16_weight()`.  Empty
+//   default → zero behaviour change for every existing call site.
+//   See `EngineOptions::f16_weights_deny_list` for the full
+//   contract + use cases.
 bool load_supertonic_gguf(const std::string & path,
                           supertonic_model & model,
                           int n_gpu_layers = 0,
-                          bool verbose = false);
+                          bool verbose = false,
+                          int f16_weights = -1,
+                          supertonic_precision precision = supertonic_precision::F32,
+                          int vulkan_device = 0,
+                          const std::vector<std::string> & f16_weights_deny_list = {});
 void free_supertonic_model(supertonic_model & model);
 void supertonic_set_n_threads(supertonic_model & model, int n_threads);
 void supertonic_graph_compute(const supertonic_model & model, ggml_cgraph * graph);
 
-// Scheduler-based alloc + compute (Option A), used by stages migrated off
-// the per-graph ggml_gallocr. Pairing contract at each call site:
+// True when the model's compute backend supports the per-stage CPU fast paths
+// (the `ggml_custom_4d` callbacks in conv1d_f32 / depthwise_same_ggml /
+// layer_norm_ggml etc.).  ggml custom ops are CPU-only by design; on Metal /
+// CUDA / Vulkan the helpers must fall through to their stock-ggml-op paths.
+// Mirrors the `!ggml_backend_is_cpu(backend)` idiom Chatterbox uses to gate
+// its Metal-only batched-CFG path.
+inline bool model_prefers_cpu_kernels(const supertonic_model & model) {
+    return model.backend == nullptr || ggml_backend_is_cpu(model.backend);
+}
+
+// QVAC-19254 — scheduler-based alloc + compute (Option A), used by stages
+// migrated off the per-graph ggml_gallocr.  Pairing contract at each call
+// site:
 //   supertonic_sched_alloc(model, gf);            // reset + allocate via sched
 //   ggml_backend_tensor_set(input_leaf, ...);     // inputs now have memory
 //   supertonic_sched_compute(model, gf);          // run (routes customs -> CPU)
@@ -114,16 +582,27 @@ void supertonic_sched_compute(const supertonic_model & model, ggml_cgraph * grap
 
 ggml_tensor * require_tensor(const supertonic_model & model, const std::string & name);
 ggml_tensor * require_source_tensor(const supertonic_model & model, const std::string & source_name);
+ggml_tensor * try_source_tensor(const supertonic_model & model, const std::string & source_name);
+
+// Look up a pre-transposed copy of a matmul weight.  Returns nullptr if no
+// pre-transposed copy was materialized for `w` at load time (e.g. CPU backend
+// — pre-transposition is a Metal-perf-only optimization).  When non-null, the
+// returned tensor has `ne = [IC, OC]` (the swapped layout of `w`), is f32 and
+// contiguous in `model.buffer_w_extra`.  Callers should reshape it as the
+// conv1d kernel `[K=1, IC, OC]` directly and skip the cont(transpose(w)).
+ggml_tensor * try_pretransposed_weight(const supertonic_model & model, const ggml_tensor * w);
 
 std::string supertonic_preprocess_text(const std::string & text,
                                        const std::string & language,
-                                       const std::string & language_wrap_mode);
+                                       const std::string & language_wrap_mode,
+                                       bool is_continuation = false);
 bool supertonic_text_to_ids(const supertonic_model & model,
                             const std::string & text,
                             const std::string & language,
                             std::vector<int32_t> & ids,
                             std::string * normalized_text = nullptr,
-                            std::string * error = nullptr);
+                            std::string * error = nullptr,
+                            bool is_continuation = false);
 
 bool supertonic_vocoder_forward_cpu(const supertonic_model & model,
                                     const float * latent,
@@ -187,6 +666,83 @@ bool supertonic_text_encoder_forward_ggml(const supertonic_model & model,
                                           std::vector<float> & text_emb_out,
                                           std::string * error = nullptr);
 
+// QVAC-18605 round 12 #6 — text-encoder speech-prompted-attention
+// GPU bridge.
+//
+// Master's Metal-port branch (PR #15) shipped a fully-built
+// `speech_prompted_merged_cache` graph in
+// `supertonic_text_encoder.cpp` — one ggml graph that does QKV
+// projection + head-split + flash-attn + out-proj end-to-end on
+// the GPU.  The graph builder
+// (`build_speech_prompted_merged_cache`) was present + reviewed
+// at the implementation level but the run path was never wired
+// in.  So the production text-encoder path stayed on the pre-
+// Phase-A4 two-cache pattern with host-side Q/V download →
+// pack → re-upload between the QKV cache and the flash-attn
+// cache (5 sync points × 2 layers per synth).
+//
+// Round 12 adds `run_speech_prompted_merged_cache` and switches
+// the dispatch in `speech_prompted_attention_ggml` to use it on
+// non-CPU backends.  CPU stays on the legacy two-cache path
+// because that path leans on the host BLAS fast path for the
+// QKV matmuls and downstream scalar code keeps the host-side
+// head-split as a free-ish memcpy.  Saves 10 sync points /
+// synth on Vulkan / OpenCL / Metal.
+//
+// Struct + helpers exposed via the header so a CPU-only unit
+// test can SFINAE-pin the field contract + free-default
+// destructor without dragging the whole text-encoder TU into
+// the test binary.
+struct speech_prompted_merged_cache {
+    const supertonic_model * model = nullptr;
+    uint64_t generation_id = 0;
+    int idx = -1;
+    int L = 0;
+    int Lctx = 0;
+    std::string out_w_source;
+    std::string out_b_source;
+    std::vector<uint8_t> buf;
+    ggml_context * ctx = nullptr;
+    ggml_cgraph * gf = nullptr;
+    ggml_gallocr_t allocr = nullptr;
+    ggml_tensor * x_in = nullptr;       // ne=[L, C], channel-major-flat memory
+    ggml_tensor * style_in = nullptr;   // ne=[Lctx, C], same memory layout
+    ggml_tensor * out = nullptr;        // ne=[L, C] result, channel-major-flat
+};
+
+void free_speech_prompted_merged_cache(speech_prompted_merged_cache & cache);
+
+void build_speech_prompted_merged_cache(speech_prompted_merged_cache & cache,
+                                        const supertonic_model & m,
+                                        int idx,
+                                        int L,
+                                        int Lctx,
+                                        const std::string & q_w_source,
+                                        const std::string & v_w_source,
+                                        const std::string & out_w_source,
+                                        const std::string & out_b_source,
+                                        const std::string & tanh_k_source,
+                                        const std::string & q_b_source,
+                                        const std::string & v_b_source);
+
+// Round 12: run the merged graph once with the given host-side
+// `x_lc` / `style_ttl` inputs.  Caller MUST have ensured the
+// cache is built (`build_speech_prompted_merged_cache`) AND keyed
+// against the current `(model, idx, L, Lctx)`.  This is the
+// drop-in replacement for the legacy two-cache path inside
+// `speech_prompted_attention_ggml` — same input / output
+// conventions (`x_lc`, `out_lc` are time-major-flat `[t*C + c]`).
+//
+// `style_ttl` is also time-major-flat (`style_ttl[t*C + c]`),
+// matching the layout `speech_prompted_attention_ggml`'s caller
+// in `supertonic_text_encoder_forward_ggml` passes.
+void run_speech_prompted_merged_cache(speech_prompted_merged_cache & cache,
+                                       const supertonic_model & m,
+                                       const std::vector<float> & x_lc,
+                                       int L,
+                                       const float * style_ttl,
+                                       std::vector<float> & out_lc);
+
 bool supertonic_text_encoder_trace_ggml(const supertonic_model & model,
                                         const int64_t * text_ids,
                                         int text_len,
@@ -218,6 +774,127 @@ bool supertonic_vector_step_ggml(const supertonic_model & model,
                                  std::vector<float> & next_latent_out,
                                  std::string * error = nullptr);
 
+// Audit finding F9 — `time_embedding(model, current, total)` is a
+// pure function over (current_step, total_steps) whose output (64
+// floats) is reused once per group inside the vector estimator.
+// `cached_time_embedding` populates `model.time_emb_cache` on first
+// touch and returns a copy of the stored value on every subsequent call
+// with the same key.  Steady-state per-synth recomputation cost
+// drops from `total_steps` invocations to zero after the first
+// synth.  See PLAN_SUPERTONIC_OPENCL.md Phase 2F.
+std::array<float, 64> cached_time_embedding(const supertonic_model & model,
+                                            int current_step,
+                                            int total_steps);
+
+// Phase 2A — hot-weight predicate for F16 materialization.
+//
+// Returns `true` when `source_name` (the
+// `<stage>:<onnx-or-pytorch-path>` source key in
+// `model.source_tensors`) names one of the bandwidth-bound matmul /
+// pointwise-conv weights identified by the audit, and the load-time
+// hook should allocate it as `GGML_TYPE_F16` instead of `F32` when
+// `model.use_f16_weights` is on.  Pure function over the string; no
+// model state needed.  Documented in test_supertonic_f16_weights.cpp
+// with explicit positive + negative + edge-case rosters.
+//
+// Conservative roster:
+//   - vector_estimator attention W_query/W_key/W_value/W_out matmul
+//     weights (only those whose source name matches `onnx::MatMul_NNNN`
+//     where NNNN ∈ {3101..3110, 3116..3119, 3146..3155, 3161..3164,
+//                   3191..3200, 3206..3209, 3236..3245, 3251..3254}).
+//   - vector_estimator pwconv1/pwconv2 inside every convnext block,
+//     including `last_convnext`.
+//   - vocoder convnext pwconv1/pwconv2 + `head.layer1.net.weight`.
+//   - text-encoder linear weights `text_encoder:onnx::MatMul_*` and
+//     the per-layer FFN conv1/conv2 weights (`conv_1.weight`,
+//     `conv_2.weight`).
+//
+// Cold-weights list (predicate must return `false`):
+//   biases, per-channel γ/β, embedding tables, depthwise conv
+//   kernels, RoPE θ, BN scale/shift, normalizer scalars,
+//   pre-transposed `__T` companions, and anything else not on the
+//   audit's hot list.  See test_supertonic_f16_weights.cpp.
+bool should_materialise_f16_weight(const std::string & source_name);
+
+// QVAC-18605 round 6 — 2-arg overload that layers a user-
+// overridable substring deny-list on top of the curated allow-
+// list above.  Returns `false` when ANY non-empty substring in
+// `extra_deny_substrings` is found inside `source_name`; otherwise
+// forwards to the 1-arg version.
+//
+// Contract:
+//   - Empty deny-list (default for every existing call site)
+//     behaves identically to the 1-arg version — zero behaviour
+//     change for the default path.
+//   - The deny-list is a DENY list, not an allow list: it can
+//     only flip `true → false`, never `false → true`.  A pattern
+//     that matches a cold weight is a no-op (cold + deny = cold).
+//   - Empty strings inside the deny-list are SKIPPED, not treated
+//     as universal matches (defensive against config typos that
+//     would otherwise silently disable F16 weights entirely).
+//   - Substring matching, not regex (matches the curated
+//     predicate's audit-friendly style; no regex compile cost,
+//     no invalid-pattern error surface).
+//
+// Use cases:
+//   - Researcher A/B testing a specific tensor pattern without
+//     recompiling.
+//   - Operator force-keeping a tensor as F32 if they observe
+//     drift on their hardware.
+//   - Safety net for new tensor patterns added in future GGUFs
+//     that the curated allow-list inadvertently scoops in.
+//
+// Plumbed through `EngineOptions::f16_weights_deny_list` →
+// `load_supertonic_gguf(..., f16_weights_deny_list)` → the
+// per-tensor allocation loop in `load_supertonic_gguf`.
+bool should_materialise_f16_weight(const std::string & source_name,
+                                   const std::vector<std::string> & extra_deny_substrings);
+
+// Phase 2D — machine-readable per-island timing emitter.
+//
+// Four-function API:
+//   - `supertonic_profile_csv_enabled()` — true when either the
+//     env var `SUPERTONIC_PROFILE_CSV=PATH.csv` is set OR a
+//     subsequent `_set_path(PATH)` has installed a path.
+//   - `supertonic_profile_csv_record(stage, island, step, wall_ms)`
+//     — appends one row to the CSV.  No-op when disabled.
+//   - `supertonic_profile_csv_flush()` — flushes buffered writes
+//     to disk.  Called from each per-stage profile hook after the
+//     synth completes, plus at process exit via atexit.
+//   - `supertonic_profile_csv_set_path(PATH | nullptr)` — test-only
+//     hook to override the env var without touching `setenv`.
+//     Passing `nullptr` closes the active file + disables the
+//     emitter; passing a new path reopens (header is written
+//     only when the file is empty, so re-open appends).
+//
+// Thread-safety: single-threaded by design.  Recording from
+// multiple threads at once is undefined; callers serialise via the
+// usual single-engine-per-thread convention.  See
+// `test_supertonic_profile_csv.cpp` for the schema contract.
+bool supertonic_profile_csv_enabled();
+void supertonic_profile_csv_record(const char * stage, const char * island,
+                                   int step, double wall_ms);
+void supertonic_profile_csv_flush();
+void supertonic_profile_csv_set_path(const char * path);
+
+// Phase A1+A2 (Metal): run ALL `total_steps` CFM denoising steps inside
+// ONE ggml_cgraph, dispatched with a single ggml_backend_graph_compute
+// call.  On non-CPU backends this replaces the engine's per-step loop
+// entirely (latent stays in GPU memory step-to-step, no host round-trip).
+// On CPU it falls back to a per-step loop over `supertonic_vector_step_ggml`
+// so the cblas fastpaths still apply.  Override the GPU path with
+// SUPERTONIC_DISABLE_LOOP_GRAPH=1 to A/B against the per-step path.
+bool supertonic_vector_loop_ggml(const supertonic_model & model,
+                                  const float * initial_noisy_latent,
+                                  int latent_len,
+                                  const float * text_emb,
+                                  int text_len,
+                                  const float * style_ttl,
+                                  const float * latent_mask,
+                                  int total_steps,
+                                  std::vector<float> & final_latent_out,
+                                  std::string * error = nullptr);
+
 bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
                                        const float * noisy_latent,
                                        const float * text_emb,
@@ -266,4 +943,880 @@ inline void supertonic_safe_gallocr_free(ggml_gallocr_t & allocr, uint64_t gener
     allocr = nullptr;
 }
 
+// ---------------------------------------------------------------------
+// Portable LeakyReLU(x, α) = (1-α)·relu(x) + α·x.
+//
+// `ggml_leaky_relu` (GGML_OP_LEAKY_RELU) is a CPU builtin and is also
+// present on the QVAC `ggml-speech` vcpkg port via the chatterbox
+// `ggml-opencl-chatterbox-ops.patch`, but baseline upstream
+// `ggml-opencl` and several other GPU backends still reject the op at
+// graph-execute time.  Routing through this helper keeps every
+// Supertonic graph executable on every backend:
+//
+//   - On CPU we keep the single fused builtin (cheaper, single op
+//     callback per row instead of three).
+//   - On GPU we decompose into `RELU + SCALE + ADD`, all universally
+//     supported (see `ggml_opencl_supports_op()`).
+//
+// Defined inline in the header so every TU that includes this header
+// gets the same lowering, and so the dispatch test can call it
+// directly without depending on which TU happens to instantiate it.
+// The thread-local `supertonic_use_cpu_custom_ops()` flag flips
+// behaviour; the inline body is a thin wrapper, so neither branch
+// retains hidden state.
+//
+// Bit-exact equivalence between the two lowerings is checked in
+// `test/test_supertonic_portable_ops.cpp` on a CPU backend.
+inline ggml_tensor * leaky_relu_portable_ggml(ggml_context * ctx, ggml_tensor * x, float alpha);
+
+// ---------------------------------------------------------------------
+// Op-dispatch policy for the GGML graph builders.
+//
+// The Supertonic vocoder + vector estimator carry several
+// `ggml_custom_4d` fast paths whose op callbacks invoke CBLAS / direct
+// pointer loads against the tensor `data` field.  Those paths are
+// only valid on the GGML CPU backend (the only backend that exposes
+// host-addressable tensor data inside an op callback and schedules
+// custom ops at all — every other backend rejects GGML_OP_CUSTOM
+// outright).  When the resolved compute backend is non-CPU
+// (CUDA / Metal / Vulkan / OpenCL) those sites must take the
+// pure-GGML fallback path so the graph stays GPU-executable.
+//
+// Threading the decision through every graph-build helper would
+// touch dozens of file-static functions across three TUs.  Instead,
+// each public forward entry point (e.g. supertonic_vocoder_forward_ggml,
+// supertonic_vector_step_ggml) instantiates a
+// `supertonic_op_dispatch_scope` on entry, which sets a thread_local
+// flag mirroring `model.backend_is_cpu`.  Graph-build helpers query
+// it via `supertonic_use_cpu_custom_ops()` at the cblas-vs-fallback
+// branch.  RAII teardown guarantees the flag is restored even on
+// exception paths, so a CPU-only second engine in the same thread
+// still sees the default `true` after a GPU engine's forward returns.
+bool supertonic_use_cpu_custom_ops();
+bool supertonic_use_f16_attn();
+
+// QVAC-18605 round 4 — thread-local accessor for the currently-
+// active K/V dispatch dtype, mirroring `supertonic_use_f16_attn`'s
+// pattern.  Returns `kv_attn_dtype::f32` when no
+// `supertonic_op_dispatch_scope` is active (matches the model's
+// default-constructed value, so a graph builder called outside a
+// scope never accidentally takes the F16 / BF16 / Q8_0 path).
+//
+// The dispatch-scope ctor populates this from
+// `model.kv_attn_type`; the dtor restores the previous value
+// (RAII teardown, exception-safe).
+kv_attn_dtype supertonic_kv_attn_type();
+
+// QVAC-18605 round 4 — pure-logic resolver for the multi-dtype
+// K/V dispatch policy.  Maps the EngineOptions int + the
+// resolved-backend probes into the concrete `kv_attn_dtype` to
+// dispatch.
+//
+// Behaviour matrix:
+//
+//   | requested       | legacy_use_f16_attn | resolved                       |
+//   |-----------------|---------------------|--------------------------------|
+//   | -1 (auto)       | true                | f16 if supports_f16 else f32   |
+//   | -1 (auto)       | false               | f32                            |
+//   |  0 (f32 force)  | any                 | f32                            |
+//   |  1 (f16 force)  | any                 | f16 if supports_f16 else f32   |
+//   |  2 (bf16 force) | any                 | bf16 if supports_bf16 else f32 |
+//   |  3 (q8_0 force) | any                 | q8_0 if supports_q8_0 else f32 |
+//   | < -1 or > 3     | any                 | throws std::runtime_error      |
+//
+// Fall-through to `f32` (instead of throw) on probe-rejected
+// explicit requests is intentional: probes are advisory, and an
+// operator setting `--kv-attn-type bf16` once in their production
+// config should work on both NVIDIA Ampere+ (BF16 effective) and
+// Intel ARC (no coopmat2 → silent F32 fallback) without crashing.
+// Loud-failure stays for actual config errors (out-of-range int).
+//
+// PR #18 reviewer (Omar) follow-up — the "silent" part of that
+// fallback was hiding an operator surprise.  Optional
+// `out_was_downgraded` pointer is set to `true` IFF the operator
+// explicitly requested f16 / bf16 / q8_0 AND the corresponding
+// backend probe returned false AND the resolver therefore
+// returned `f32` instead.  The CLI-facing call sites (Engine
+// ctor + supertonic-bench) consult this flag and emit a
+// `fprintf(stderr, "warning: ...")` so the operator knows their
+// `--kv-attn-type bf16` config silently degraded.  Auto (`-1`)
+// + missing probe is NOT a downgrade (the operator didn't ask
+// for a specific dtype, so the auto-policy is doing its job) —
+// the flag stays false on that path.
+//
+// Pass `nullptr` (the default) to ignore the downgrade signal
+// — the pure-logic unit tests use this so test runs don't spam
+// stderr with warnings.
+//
+// Pure logic, no Vulkan symbols touched here — same split
+// pattern as `resolve_vulkan_device_index` from round 3.
+kv_attn_dtype resolve_kv_attn_type(int requested,
+                                   bool legacy_use_f16_attn,
+                                   bool backend_supports_f16,
+                                   bool backend_supports_bf16,
+                                   bool backend_supports_q8_0,
+                                   bool * out_was_downgraded = nullptr);
+// QVAC-18605 — true when the resolved backend supports
+// `GGML_OP_LEAKY_RELU` natively.  Mirrored from
+// `supertonic_model::use_native_leaky_relu` by
+// `supertonic_op_dispatch_scope` for the duration of each public
+// `*_forward_ggml` / `*_trace_ggml` entry.  Consulted by
+// `leaky_relu_portable_ggml` to skip the RELU+SCALE+ADD
+// decomposition when the backend has the fused op available.
+bool supertonic_use_native_leaky_relu();
+
+// QVAC-18605 — load-time backend-capability probes used by the
+// engine + bench auto-policy for `use_f16_attn`.  Returns `true`
+// when the resolved backend would accept a Supertonic-shaped
+// `ggml_flash_attn_ext(Q=F32, K/V=F16)` graph node — the auto-
+// enable policy gates on this so a backend that doesn't ship the
+// mixed-precision kernel doesn't crash at first synth call.
+// Manual override via `EngineOptions::f16_attn=1` still forces
+// dispatch (useful for benchmarking with a debug-shim backend).
+//
+// QVAC-18605 follow-up — both probes are now memoised
+// process-wide by `ggml_backend_t` handle, so the engine + bench
+// + load_supertonic_gguf trio doesn't re-run the same probe two
+// or three times per backend.  Defined out of line in
+// supertonic_gguf.cpp.
+bool supertonic_backend_supports_f16_kv_flash_attn(ggml_backend_t backend);
+
+// QVAC-18605 follow-up — load-time backend-capability probe used by
+// the engine + bench + `load_supertonic_gguf` auto-policy for
+// `use_f16_weights`.  Symmetric to the F16-K/V flash-attn probe:
+// returns `true` when the resolved backend would accept the hot
+// `mul_mat(F16 weight, F32 activation) → F32` graph node Supertonic
+// dispatches every step (vector-estimator W_query, vocoder head
+// linear, text-encoder linears, etc.).  The auto-enable policy
+// gates on this so a partial-port backend that ships F16 storage
+// but rejects F16 mul_mat for the hot shape keeps the F32 path
+// — slower but guaranteed not to crash at first synth call.
+// Manual override via `EngineOptions::f16_weights=1` still forces
+// materialisation.
+bool supertonic_backend_supports_f16_mul_mat(ggml_backend_t backend);
+
+// QVAC-18605 follow-up — load-time backend-capability probe for
+// the Q8_0 K/V `FLASH_ATTN_EXT` variant.  Forward-compat: returns
+// `true` when the backend would accept a Supertonic-shaped
+// `ggml_flash_attn_ext(Q=F32, K/V=Q8_0)` graph node.  Vulkan's
+// `supports_op` advertises Q8_0 K/V in both scalar and coopmat2
+// paths (`ggml-vulkan.cpp:GGML_OP_FLASH_ATTN_EXT`), which would
+// halve the per-step K/V upload bandwidth on memory-bandwidth-
+// bound mobile GPUs in exchange for a small (~0.5 %) drift on the
+// attention output.  This PR adds the probe + caches the result;
+// the live dispatch site is not yet wired through Q8_0 because the
+// drift hasn't been measured against the F16 K/V parity harness on
+// a real Vulkan adapter.  See PROGRESS_SUPERTONIC.md "Deferred
+// work" for the follow-up.
+bool supertonic_backend_supports_q8_0_kv_flash_attn(ggml_backend_t backend);
+
+// QVAC-18605 round 3 — load-time backend-capability probe for the
+// BF16 K/V `FLASH_ATTN_EXT` variant.  Forward-compat: returns
+// `true` when the backend would accept a Supertonic-shaped
+// `ggml_flash_attn_ext(Q=F32, K/V=BF16)` graph node.  Vulkan
+// advertises BF16 K/V in the coopmat2 path only
+// (`ggml-vulkan.cpp:GGML_OP_FLASH_ATTN_EXT`); BF16 has the same
+// 2-byte per-element footprint as F16 (so identical upload
+// bandwidth) but the wider 8-bit exponent range avoids the
+// occasional small-score underflow that drives F16's tolerance
+// widening on the parity harness.  Live dispatch site isn't yet
+// wired (a follow-up gates `--kv-attn-type bf16` on this probe);
+// caching it here primes the cache for that work.
+bool supertonic_backend_supports_bf16_kv_flash_attn(ggml_backend_t backend);
+
+// QVAC-18605 round 3 — backend capability probe for Vulkan's
+// `ggml_backend_vk_host_buffer_type()`.  Returns `true` iff the
+// backend is Vulkan AND the host-pinned buffer type is non-null.
+// Forward-compat — primes the capability cache for a follow-up
+// per-engine input-scratchpad refactor that skips ggml-vulkan's
+// internal staging-buffer hop on per-step uploads (text-emb,
+// time-step encoding, style embedding) by allocating those
+// tensors in the host-pinned buffer type instead of the default
+// device-local buffer.
+bool supertonic_backend_supports_pinned_host_buffer(ggml_backend_t backend);
+
+// QVAC-18605 round 12 #5 — pinned-host-buffer input allocator.
+//
+// Round 3 shipped the capability probe; round 12 lands the actual
+// per-engine input-scratchpad refactor.  Callers create a small
+// `ggml_context` (with `no_alloc=true`) containing ONLY the hot
+// per-step input tensors (front-block `x_in` / `mask_in` /
+// `t_emb_in`, group-cache `x_in` / `temb_in`, etc.) and pass it
+// here.  On Vulkan (where `ggml_backend_vk_host_buffer_type()`
+// returns non-null) the helper allocates a buffer from the
+// host-pinned buft and binds every tensor in `input_ctx` to it
+// — `ggml_backend_tensor_set` then writes from the host's heap
+// directly into BAR-mapped GPU memory without an intermediate
+// staging-buffer copy.
+//
+// Return contract:
+//   - `nullptr` if `model.backend == nullptr`, `input_ctx == nullptr`,
+//     or the backend doesn't expose `ggml_backend_vk_host_buffer_type()`.
+//     Caller falls back to letting `ggml_gallocr_alloc_graph`
+//     handle the input tensors via the default buffer type —
+//     correct, just one staging-buffer hop per upload.
+//   - Otherwise the returned `ggml_backend_buffer_t` is OWNED by
+//     the caller.  Free at cache destruction with
+//     `ggml_backend_buffer_free(buf)`.
+//
+// On Vulkan adapters that expose a host-coherent BAR-mapped pool
+// (every modern discrete + every UMA iGPU), this skips one
+// memcpy per `ggml_backend_tensor_set` on the bound tensors.
+// Per synth at the 4 attention-feeding caches × 3 small per-step
+// inputs × 5 denoise steps ≈ 60 staging-hops saved.  Each hop
+// is ~5–15 µs on the dev rig; aggregate ~0.3–1 ms / synth.
+//
+// CPU-only test (`test_supertonic_pinned_host_buffer.cpp`) pins
+// the symbol + the conservative `nullptr` return contract on
+// CPU backend + null-input safety in error paths.  End-to-end
+// behaviour validated by Vulkan synth + bench on real hardware.
+ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
+    const supertonic_model & model,
+    ggml_context * input_ctx);
+
+// QVAC-18605 round 13 #1 — input-scratchpad allocator that
+// consolidates the round-12 boilerplate.
+//
+// Round 12 #5 inlined the "try pinned-host first, fall back to
+// default backend buffer, throw if both fail" idiom at 4 cache
+// sites.  Round 13 extends the pattern to 5+ additional cache
+// sites (vector_loop_one_graph, vocoder, style residual + QKV,
+// merged speech-prompted, ...) — a 5x boilerplate copy is
+// error-prone (the failure-cleanup ordering is subtle:
+// `ggml_free(input_ctx)` BEFORE nulling the input-tensor
+// pointers leaves dangling pointers in the cache struct that a
+// subsequent free path will dereference).
+//
+// Contract:
+//   - Tries `try_alloc_inputs_in_pinned_host_buffer(model, ctx)`
+//     first.  Returns its buffer on success.
+//   - On failure (CPU / non-Vulkan / probe miss), falls back to
+//     `ggml_backend_alloc_ctx_tensors(ctx, model.backend)`.
+//     Returns that buffer on success.
+//   - On BOTH failing (system resource exhaustion, dead backend,
+//     etc.), throws `std::runtime_error` with a message that
+//     includes `cache_name` so operators can attribute the
+//     failure to a specific cache.
+//   - Defensive throws on `model.backend == nullptr`,
+//     `input_ctx == nullptr`, `cache_name == nullptr` — these
+//     are caller-bug guards in error-handler paths.
+//
+// Caller owns the returned buffer.  Standard teardown order
+// remains: gallocr → main ctx → input_buf → input_ctx (reversed
+// would dangle pointers in the cache struct).
+//
+// CPU-only test (`test_supertonic_input_scratchpad.cpp`) pins
+// the symbol + CPU-fallback contract + null-argument throws.
+// End-to-end Vulkan validation lives in the cache-build paths
+// that consume the helper (round 13 #1 wiring at
+// `vector_loop_one_graph_cache`, `vocoder_graph_cache`, etc.).
+ggml_backend_buffer_t alloc_input_scratchpad_or_throw(
+    const supertonic_model & model,
+    ggml_context * input_ctx,
+    const char * cache_name);
+
+// QVAC-18605 round 3 — multi-device Vulkan auto-pick policy.
+//
+// `init_supertonic_backend` calls `ggml_backend_vk_get_device_count()`
+// + `ggml_backend_vk_get_device_memory()` per device to build the
+// `free_vram_per_device` list, then dispatches into this pure-
+// logic helper to pick the device index.  Splitting the policy
+// from the Vulkan-only plumbing means the policy is testable on
+// CPU with synthetic inputs (see test_supertonic_vulkan_device_select.cpp).
+//
+// Behaviour matrix:
+//
+//   | requested | dev_count | result                                  |
+//   |-----------|-----------|-----------------------------------------|
+//   | -1        | 0         | throws (no device to pick)              |
+//   | N>=0      | 0         | throws (no device to pick)              |
+//   | -1        | 1         | 0  (only choice)                        |
+//   | -1        | N>1       | argmax(free_vram); ties → lower index   |
+//   | N>=0      | dev_count | N if N<dev_count, else throws           |
+//   | N<-1      | any       | throws (negative != -1 reserved)        |
+//
+// Throws `std::runtime_error` on invalid input; the caller surfaces
+// the message verbatim (same pattern as the existing
+// `--vulkan-device N out of range` error in `init_supertonic_backend`).
+//
+// Tie-breaking on equal free VRAM picks the lower index so
+// identical-spec multi-GPU machines (lab racks of A100s, e.g.)
+// produce stable per-run device assignment instead of depending
+// on driver enumeration order.  Operators who need a different
+// policy can `--vulkan-device N` explicitly.
+//
+// QVAC-18605 round 12 — `is_uma_per_device` (optional 3rd arg)
+// biases the auto-pick against UMA / iGPU devices when a
+// discrete device is also present.  Background: on hybrid
+// machines (NVIDIA RTX 5090 discrete + AMD RADV iGPU, or
+// similar), `ggml_backend_vk_get_device_memory()` reports the
+// iGPU's free pool as system RAM (often 120+ GB) because UMA
+// shares the host RAM with the CPU.  The round-3 argmax then
+// picks the iGPU, silently dropping ~40× realtime on synth
+// throughput vs. the discrete card.
+//
+// New policy (when `is_uma_per_device.size() == free_vram_per_device.size()`):
+//
+//   1. If at least one device has `is_uma_per_device[i] == false`,
+//      run argmax(free_vram) over the DISCRETE subset only.
+//   2. Otherwise (all UMA) fall back to argmax over all devices.
+//   3. Explicit `requested >= 0` passthrough is UMA-agnostic
+//      (operator-pinned index always wins).
+//
+// `is_uma_per_device` is OPTIONAL — empty list (default) means
+// "no UMA flags available, use round-3 policy".  Mismatched-
+// length non-empty lists throw (caller bug guard).
+//
+// Caller wiring lives in `init_supertonic_backend`: query
+// `ggml_backend_vk_get_device_type()` per device, set the bool
+// to `true` for `IntegratedGpu` / `Cpu` / `Other` types.  Pure
+// logic, no Vulkan symbols touched here — same split pattern
+// as the round-3 free-VRAM list.
+int resolve_vulkan_device_index(int requested,
+                                const std::vector<size_t> & free_vram_per_device,
+                                const std::vector<bool> & is_uma_per_device = {});
+
+// QVAC-18605 follow-up — test seams for the capability cache.
+// `supertonic_clear_capability_cache` drops every cached entry so
+// the regression test in `test_supertonic_capability_cache.cpp`
+// can verify the cache short-circuits on a hit (the cold-cache
+// call bumps `supertonic_capability_probe_call_count`; subsequent
+// cached calls don't until the cache is cleared).
+//
+// Not part of the supported public API — exported only for the
+// in-process test harness.  Keeping the declaration in this
+// internal header (which production callers don't include) is
+// the cheapest way to avoid the symbol leaking into the public
+// surface while still letting the unit test reach it.
+void supertonic_clear_capability_cache();
+uint64_t supertonic_capability_probe_call_count();
+
+struct supertonic_op_dispatch_scope {
+    bool prev_use_cpu_custom_ops;
+    bool prev_use_f16_attn;
+    bool prev_use_native_leaky_relu;
+    // QVAC-18605 round 4 — saved K/V dispatch dtype for RAII
+    // teardown.  Restored on scope destruction so a follow-on
+    // engine on the same thread sees the default value, not the
+    // previous engine's dispatch dtype (matters for nested
+    // synthesis flows where two engines share a worker thread).
+    kv_attn_dtype prev_kv_attn_type;
+    explicit supertonic_op_dispatch_scope(const supertonic_model & model);
+    ~supertonic_op_dispatch_scope();
+    supertonic_op_dispatch_scope(const supertonic_op_dispatch_scope &)             = delete;
+    supertonic_op_dispatch_scope & operator=(const supertonic_op_dispatch_scope &) = delete;
+};
+
+// ---------------------------------------------------------------------
+// Audit finding F20 (partial / Phase 2H) — RoPE rotation in-graph
+// with host-precomputed cos/sin tables.
+//
+// Replaces the per-attention-site `apply_rope(theta, q, L, H, D)`
+// host loop with a GPU-native rotation that reuses cos/sin tables
+// uploaded once per (L, θ).  Eliminates the CPU rotation step
+// (~50 µs × 40 sites/synth ≈ 2 ms) and is the prerequisite for a
+// follow-up that wires Q/K directly from the QKV graph into the
+// attention graph (cuts the host round-trip on Q and K outright).
+//
+// Formula it matches (exactly mirrors the scalar `apply_rope` in
+// `supertonic_vector_estimator.cpp`):
+//
+//     for d in [0, half):                   ← `t/L`, not absolute t
+//         angle = (t / L) * theta[d]; cs = cos(angle); sn = sin(angle)
+//         lo = x[t, h, d]; hi = x[t, h, half+d]   (pre-rotation values)
+//         x[t, h, d]      := lo*cs - hi*sn
+//         x[t, h, half+d] := hi*cs + lo*sn
+//
+// Tensor contract:
+//   - `x`         : F32, ne=[head_dim, n_heads, L].  Memory layout
+//                   matches the scalar reference's
+//                   `data[t*H*D + h*D + d]`.
+//   - `cos_table` : F32, ne=[half, L]. cos_table[t*half + d] = cos((t/L)*θ[d]).
+//   - `sin_table` : F32, ne=[half, L]. Analogous.
+//   - returns     : F32, ne=[head_dim, n_heads, L].  Rotated x.
+//
+// Op-set used:
+//   `ggml_view_3d`, `ggml_reshape_3d`, `ggml_repeat`, `ggml_mul`,
+//   `ggml_sub`, `ggml_add`, `ggml_concat`.
+// All universally supported (incl. baseline upstream OpenCL —
+// see `ggml_opencl_supports_op()`), so the helper doesn't require
+// the chatterbox-patched `ggml_sin` / `ggml_cos` / `ggml_rope`.
+//
+// Parity-tested in `test_supertonic_rope_in_graph.cpp` against
+// the scalar `apply_rope` for the two hot vector-estimator shapes
+// + a zero-θ identity check.  Tolerance `1e-4` absolute.
+inline ggml_tensor * apply_rope_in_graph(ggml_context * ctx,
+                                         ggml_tensor * x,
+                                         ggml_tensor * cos_table,
+                                         ggml_tensor * sin_table) {
+    // Shape contracts (asserted at caller via test harness; here
+    // we only deref the fields).
+    const int64_t head_dim = x->ne[0];
+    const int64_t n_heads  = x->ne[1];
+    const int64_t L        = x->ne[2];
+    const int64_t half     = head_dim / 2;
+
+    // Split x along axis 0 into lower and upper halves.  Both
+    // halves share x's strides (`nb[0..2]`); the upper half just
+    // adds an offset of `half` elements (`half * nb[0]` bytes).
+    // Memory underneath is unchanged; these are views, not copies.
+    ggml_tensor * x_lower = ggml_view_3d(
+        ctx, x, half, n_heads, L,
+        /*nb1=*/x->nb[1], /*nb2=*/x->nb[2],
+        /*offset=*/0);
+    ggml_tensor * x_upper = ggml_view_3d(
+        ctx, x, half, n_heads, L,
+        /*nb1=*/x->nb[1], /*nb2=*/x->nb[2],
+        /*offset=*/(size_t) half * x->nb[0]);
+
+    // Broadcast cos/sin over n_heads: cos has ne=[half, L]; we
+    // need [half, n_heads, L] to align with x_lower/x_upper.
+    // `ggml_reshape_3d(c, half, 1, L)` gives ne=[half, 1, L] (a
+    // shape-changing zero-cost view of the same memory); then
+    // `ggml_repeat(c_3d, x_lower)` broadcasts axis 1 from 1 to
+    // n_heads.  ggml_can_repeat accepts the (..., 1, ...) → (...,
+    // N, ...) broadcast pattern unconditionally.
+    ggml_tensor * cos_3d = ggml_reshape_3d(ctx, cos_table, half, 1, L);
+    ggml_tensor * sin_3d = ggml_reshape_3d(ctx, sin_table, half, 1, L);
+    ggml_tensor * cos_b  = ggml_repeat(ctx, cos_3d, x_lower);
+    ggml_tensor * sin_b  = ggml_repeat(ctx, sin_3d, x_lower);
+
+    // Rotation: standard 2×2 cos/-sin / sin/cos block applied
+    // pointwise.  ggml_concat dim=0 stitches the lower + upper
+    // halves back into a [head_dim, n_heads, L] tensor with the
+    // same memory layout x came in with.
+    ggml_tensor * new_lower = ggml_sub(ctx,
+        ggml_mul(ctx, x_lower, cos_b),
+        ggml_mul(ctx, x_upper, sin_b));
+    ggml_tensor * new_upper = ggml_add(ctx,
+        ggml_mul(ctx, x_upper, cos_b),
+        ggml_mul(ctx, x_lower, sin_b));
+    return ggml_concat(ctx, new_lower, new_upper, /*dim=*/0);
+}
+
+// Host-side helper: precompute the (cos, sin) tables consumed by
+// `apply_rope_in_graph` for a given (L, θ) pair.  Output layout
+// matches the GGML tensor's natural row-major upload: element
+// (t, d) at `out[t*half + d]`.  Callers cache by L on
+// `supertonic_model::rope_cos_sin_cache` and upload once per cold
+// miss.  Pure function over (theta, L, half); no model state.
+inline void make_rope_cos_sin_tables(const float * theta,
+                                     int L,
+                                     int half,
+                                     std::vector<float> & cos_out,
+                                     std::vector<float> & sin_out) {
+    cos_out.resize((size_t) L * half);
+    sin_out.resize((size_t) L * half);
+    for (int t = 0; t < L; ++t) {
+        const float t_frac = (float) t / (float) L;
+        for (int d = 0; d < half; ++d) {
+            const float angle = t_frac * theta[d];
+            cos_out[(size_t) t * half + d] = std::cos(angle);
+            sin_out[(size_t) t * half + d] = std::sin(angle);
+        }
+    }
+}
+
+// ---------------------------------------------------------------------
+// Audit finding F23 (F20 integration / Phase 2H follow-through) —
+// packed-QK RoPE adapter for the Q/K-producing graphs.
+//
+// `apply_rope_in_graph` operates on a tensor with `ne=[head_dim,
+// n_heads, L]` — the natural layout the scalar `apply_rope`
+// reference indexes into (`data[t*H*D + h*D + d]`).  Every actual
+// call site in the vector estimator produces Q/K via
+// `dense_matmul_time_ggml`, whose output is a 2D tensor with
+// `ne=[L, HD]` — axis 0 = L (time, fastest along natural strides
+// `nb=[elem, L*elem]`) and axis 1 = HD = n_heads * head_dim
+// (packed channels h*D+d, slowest).  In flat memory the element
+// (t, c) sits at byte offset `(t + c*L)*elem` — i.e. **channel-
+// major-flat** (`data[t + c*L]`), which is the bit-exact transpose
+// of the time-major-flat layout the scalar `apply_rope` reference
+// indexes through (`data[t*H*D + h*D + d]`).
+//
+// QVAC-18966 — same-shape matmul on every backend: confirmed by
+// inspection of the CPU custom-op fast path (`ggml_custom_4d(F32,
+// x->ne[0] /* = L */, w->ne[0] /* = OC */, …)` → `[L, OC]`) and
+// the `conv1d_f32(K=1)` fallback (`ggml_reshape_3d(result,
+// im2col->ne[1] /* = L */, kernel->ne[2] /* = OC */, …)` → also
+// `[L, OC]`).  Both code paths produce the same ne contract — so
+// this helper's adapter has to bridge the **matmul-output**
+// channel-major-flat layout onto `apply_rope_in_graph`'s natural-
+// strides `[D, H, L]` contract.
+//
+// History note: the original (PR #16 follow-up #5) version of
+// this helper assumed `q->ne[0] = HD` and `q->ne[1] = L` — i.e.,
+// the transpose of what the matmul actually produces.  That
+// older contract crashed at the defensive assertion below on
+// every real synth (the moment a GGUF carrying `vector_rope_theta`
+// enabled the in-graph rotation path).  The CPU unit test that
+// landed alongside `apply_rope_to_packed_qk` hand-built Q under
+// the `[HD, L]` assumption, so the failure mode was invisible to
+// CI.  GPU backends (Metal / CUDA / Vulkan / OpenCL) silently
+// dispatched a transposed view through the rotation, masking the
+// shape problem until a CPU `--n-gpu-layers 0` synth hit the
+// assert.  See QVAC-18966.  `test_supertonic_rope_packed_qk.cpp`
+// now reproduces the **production** matmul layout and pins both
+// the input and output shape contracts.
+//
+// Pipeline (production layout):
+//   - Step 1: `ggml_cont(ggml_transpose(q))` — view-swap axes
+//     0/1 (zero-cost stride flip) then materialise to natural
+//     strides.  Result has ne=[HD, L] with **time-major-flat**
+//     memory layout (`data[c + t*HD]`).  This is the SAME layout
+//     `q_tc_in` (`ggml_new_tensor_2d(A, L)` in
+//     `vector_text_attention_cache`) expects for the
+//     `ggml_backend_tensor_copy` device→device blit at the GPU-
+//     bridge dispatch site.
+//   - Step 2: Re-view the packed tensor as `[head_dim, n_heads,
+//     L]` via the zero-cost stride trick `nb[0]=elem,
+//     nb[1]=D*elem, nb[2]=HD*elem` — element (d, h, l) lands at
+//     offset `d + h*D + l*HD` (elem units), identical to the
+//     post-transpose layout's element (col=h*D+d, row=l) at
+//     `col + row*HD`.
+//   - Step 3: Materialise a contiguous `[D, H, L]` copy so the
+//     downstream `ggml_concat` inside `apply_rope_in_graph` sees
+//     monotonically-increasing strides.
+//   - Step 4: `apply_rope_in_graph(ctx, x_dhl, cos, sin)`.
+//   - Step 5: Reshape the rotated `[D, H, L]` result back to
+//     `[HD, L]` — same memory, different ne labels.  Bytes are
+//     in time-major-flat layout `data[c + t*HD]`, byte-for-byte
+//     identical to scalar `apply_rope`'s output and to what
+//     `q_tc_in` expects.
+//
+// Call-site impact for the bytes-out contract:
+//   - GPU bridge (`run_text_attention_cache_gpu`): unchanged.
+//     `ggml_backend_tensor_copy(q_rope, q_tc_in)` already passes
+//     `ggml_nbytes(src) == ggml_nbytes(dst)` (same nelements)
+//     and now also matches the destination's memory layout
+//     bit-for-bit.
+//   - Legacy host bridge: `tensor_to_time_channel(q_rope)` was
+//     designed for the (incorrectly-shaped) old contract and
+//     would now read the transpose-of-the-transpose if called
+//     unchanged.  Use `tensor_raw_f32(q_rope)` instead — the
+//     bytes are already time-major-flat (matches scalar
+//     `apply_rope`'s output buffer contract), and uploading
+//     them via `ggml_backend_tensor_set` to `q_tc_in` lands the
+//     same bytes the GPU-bridge `ggml_backend_tensor_copy`
+//     would.  The four production call sites in
+//     `supertonic_vector_estimator.cpp` are updated in lock-step
+//     with this helper.
+//   - Trace mode: the `PUSH_GGML_TRACE` entries push a
+//     `std::vector<float>` shaped as `{L, HD}` (i.e., flat
+//     `out[t*HD + c]` — scalar `apply_rope`'s native indexing).
+//     `tensor_raw_f32(q_rope)` returns exactly that layout, so
+//     trace parity vs. the scalar harness is preserved without
+//     any further re-pack.
+//
+// Cost vs. the pre-fix (broken) helper:
+//   - Adds one `ggml_cont` per site (the head-of-pipeline
+//     transpose).  On CPU it is a single strided copy of
+//     `L * HD * 4` bytes; on GPU backends it is one shader
+//     dispatch per cache build (one ~256-thread dispatch on
+//     Vulkan, with Metal / OpenCL equivalents).  The cache is
+//     built ONCE and reused across all 5 denoise steps, so the
+//     cost is fully amortised.
+//   - Eliminates 40 CPU rotations / synth (~50 µs each ≈ 2 ms
+//     wall-time on the default 5-step × 4-RoPE-site schedule).
+//   - Net (Vulkan branch only): the original rounds-8/9 GPU-
+//     bridge wins are preserved AND now actually run end-to-end
+//     without crashing.
+//
+// Universally-supported ops only: `ggml_transpose`, `ggml_cont`,
+// `ggml_view_3d`, `ggml_reshape_2d` + everything
+// `apply_rope_in_graph` uses.  Green on baseline upstream OpenCL.
+//
+// Parity-tested in `test_supertonic_rope_packed_qk.cpp` against
+// the scalar `apply_rope` on the two hot vector-estimator shapes
+// (`q_len=20 × H=4 × D=64`, `kv_len=32 × H=4 × D=64`), a
+// degenerate `L=1` trip-wire, and an explicit output-shape
+// contract check that pins `ne[0]=HD, ne[1]=L`.  Tolerance
+// `1e-4` absolute.
+inline ggml_tensor * apply_rope_to_packed_qk(ggml_context * ctx,
+                                              ggml_tensor * q,
+                                              ggml_tensor * cos_table,
+                                              ggml_tensor * sin_table,
+                                              int n_heads,
+                                              int head_dim) {
+    // Step 1 — transpose `ne=[L, HD]` (matmul-output contract,
+    // channel-major-flat memory) into `ne=[HD, L]` with natural
+    // time-major-flat memory.  `ggml_transpose` is a view-only
+    // axis swap (nb[0] ↔ nb[1]); `ggml_cont` materialises the
+    // natural strides `nb=[elem, HD*elem]`.  This is the SAME
+    // memory layout the downstream `q_tc_in` consumes — the
+    // helper's output then plumbs unchanged into both the GPU-
+    // bridge `ggml_backend_tensor_copy` and the legacy host-
+    // bridge `tensor_raw_f32` paths.
+    ggml_tensor * q_packed = ggml_cont(ctx, ggml_transpose(ctx, q));
+
+    const int64_t L  = q_packed->ne[1];
+    const int64_t HD = q_packed->ne[0];
+    (void) HD; // assertion-only; compiler may drop in NDEBUG.
+    GGML_ASSERT(HD == (int64_t) n_heads * head_dim);
+
+    // Step 2 — re-view the `[HD, L]` packed tensor as `[D, H, L]`
+    // via the zero-cost stride trick.  q_packed has natural
+    // strides nb=[elem, HD*elem]; the view nb=[elem, D*elem,
+    // HD*elem] gives element (d, h, l) at offset `d + h*D + l*HD`
+    // (elem units) — bit-identical to (col=h*D+d, row=l) at
+    // `col + row*HD` in the original packed layout.
+    ggml_tensor * q_dhl_view = ggml_view_3d(ctx, q_packed,
+        head_dim, n_heads, L,
+        /*nb1=*/(size_t) head_dim * sizeof(float),
+        /*nb2=*/(size_t) n_heads * head_dim * sizeof(float),
+        /*offset=*/0);
+    // Step 3: materialise a contiguous [D, H, L] copy so the
+    // downstream `ggml_concat` / `ggml_repeat` ops in
+    // `apply_rope_in_graph` see natural strides
+    // (`nb=[elem, D*elem, D*H*elem]`).  Because HD == H*D the
+    // view above already carries these strides, so this copy is
+    // defensive rather than a stride fix.
+    ggml_tensor * q_dhl = ggml_cont(ctx, q_dhl_view);
+    ggml_tensor * q_rot = apply_rope_in_graph(ctx, q_dhl, cos_table, sin_table);
+    // Step 5: reshape the rotated Step 4 result back to the packed
+    // `[HD, L]` shape; same memory, different ne labels.  Bytes are
+    // in time-major-flat layout `data[c + t*HD]`.
+    return ggml_reshape_2d(ctx, q_rot, (int64_t) n_heads * head_dim, L);
+}
+
+// ---------------------------------------------------------------------
+// Audit finding F7 / Phase 2J — fused ConvNeXt block builder for
+// the Supertonic vocoder.
+//
+// `convnext_block_ggml` (in supertonic_vocoder.cpp) used to compose
+// the per-block residual chain as:
+//
+//   x [T0, C] ── depthwise_conv1d_causal_ggml ──▶ dw [T0, C]
+//             ──▶ layer_norm_channel_ggml ──▶ ln [T0, C]
+//                  (permute → cont [C,T0] → norm → mul → add →
+//                   permute → cont [T0,C])     ← 2 conts each call
+//             ──▶ conv1d_causal_ggml (pw1, K=1)
+//                  (pad-noop → im2col [C,T0] → mul_mat → reshape)
+//             ──▶ gelu
+//             ──▶ conv1d_causal_ggml (pw2, K=1) (im2col again)
+//             ──▶ mul γ  ──▶ add residual
+//
+// That chain costs per-block:
+//   - 2 `ggml_cont` copies (LN front + LN back).
+//   - 2 `ggml_im2col` copies (pw1 + pw2; K=1 reduces im2col to a
+//     pure layout-shuffle copy).
+// = 4 [T0=420, C=512] F32 copies / block ≈ 3.28 MiB / block.
+// × 10 ConvNeXt blocks = ~32.8 MiB redundant memory traffic
+// per vocoder pass on a discrete GPU.
+//
+// The fused builder cuts this in half by:
+//   1. Keeping the LN result in `[C, T0]` (channel-major) memory:
+//      no back-permute / back-cont after `ggml_norm + mul + add`.
+//   2. Lowering pw1 / pw2 to direct `ggml_mul_mat(w_2d, x_perm)`
+//      against that `[C, T0]` LN output.  No `im2col` needed for
+//      `K=1`; it is the same mathematical operation as the existing
+//      `conv1d_causal_ggml` path with identical summation order.
+//   3. Re-permuting once at the very end so the block output is
+//      `[T0, C]` for the next block (and the existing trace /
+//      readback plumbing keeps working unchanged).
+//
+// Net per block:
+//   - Conts: 2 → 2 (LN front + final back-permute).  Same count.
+//   - im2col copies: 2 → 0.  **Saves 2 [T0, C] copies per block.**
+//   = 1.64 MiB / block × 10 blocks = ~16.4 MiB redundant traffic
+//   eliminated per vocoder pass.  Matches the audit's F7 cost
+//   estimate (the redundant 2× permute+cont copy traffic the
+//   audit measured was the pair the LN front/back conts cause —
+//   the im2col copies were missed by the audit but show the same
+//   pattern, so the same fix removes both).
+//
+// Shape contract (mirrors the in-tree
+// `supertonic_vocoder_convnext_weights`):
+//   - `residual`    : F32, ne=[T0, C].  Block input + residual
+//                     summed at the end.
+//   - `dw_out`      : F32, ne=[T0, C].  Output of the upstream
+//                     depthwise conv (kept outside this helper so
+//                     the depthwise op stays in supertonic_vocoder.cpp).
+//   - `ln_g`, `ln_b`: F32, ne=[C].  Layer-norm gamma + beta.
+//   - `pw1_w`       : F32, ne=[K=1, IC=C, OC=hidden].
+//   - `pw1_b`       : F32, ne=[hidden].  Nullable.
+//   - `pw2_w`       : F32, ne=[K=1, IC=hidden, OC=C].
+//   - `pw2_b`       : F32, ne=[C].  Nullable.
+//   - `block_gamma` : F32, ne=[C].  Per-channel scaling.
+//   - returns       : F32, ne=[T0, C].  Block output.
+//
+// Op-set used: `ggml_permute`, `ggml_cont`, `ggml_norm`,
+// `ggml_reshape_2d`, `ggml_repeat`, `ggml_mul`, `ggml_add`,
+// `ggml_mul_mat`, `ggml_gelu_erf`.  All universally supported
+// (incl. baseline upstream OpenCL — no new ops introduced beyond
+// the existing convnext block's surface).
+//
+// Parity-tested in `test_supertonic_convnext_block_fused.cpp`
+// against a scalar reference of the per-block math on three
+// shapes (tiny K=3/dilation=1, K=7/dilation=2, scale-up
+// K=7/dilation=4).  Tolerance 1e-4 absolute on tiny shapes,
+// 5e-4 on the scale-up (mul_mat sum-order parity).
+inline ggml_tensor * convnext_block_fused_ggml(
+        ggml_context * ctx,
+        ggml_tensor *  residual,
+        ggml_tensor *  dw_out,
+        ggml_tensor *  ln_g,
+        ggml_tensor *  ln_b,
+        ggml_tensor *  pw1_w,
+        ggml_tensor *  pw1_b,
+        ggml_tensor *  pw2_w,
+        ggml_tensor *  pw2_b,
+        ggml_tensor *  block_gamma,
+        float          eps = 1e-6f) {
+    const int64_t C      = dw_out->ne[1];
+    const int64_t hidden = pw1_w->ne[2];
+
+    // Layer-norm — permute → cont → norm → γ·x + β.  Result stays
+    // in `[C, T0]` (channel-major) so the next two pointwise convs
+    // can consume it directly as a mul_mat right-hand side without
+    // any im2col / re-permute overhead.
+    ggml_tensor * y = ggml_cont(ctx, ggml_permute(ctx, dw_out, 1, 0, 2, 3));
+    y = ggml_norm(ctx, y, eps);
+    {
+        // `repeat_like(v[C], y[C, T0]) → reshape(v, C, 1) + repeat`.
+        // Reproduced inline so the helper stays header-only and
+        // doesn't reach into the vocoder's anonymous-namespace
+        // `repeat_like` wrapper.
+        ggml_tensor * ln_g_2d = ggml_reshape_2d(ctx, ln_g, C, 1);
+        ggml_tensor * ln_b_2d = ggml_reshape_2d(ctx, ln_b, C, 1);
+        y = ggml_mul(ctx, y, ggml_repeat(ctx, ln_g_2d, y));
+        y = ggml_add(ctx, y, ggml_repeat(ctx, ln_b_2d, y));
+    }
+
+    // pw1 — K=1 pointwise conv via `ggml_mul_mat`.
+    //
+    // pw1_w has ne=[1, IC=C, OC=hidden]; reshape to [IC, OC].
+    // mul_mat(A=[K=IC, n=OC], B=[K=IC, m=T0]) → ne=[OC=hidden, T0]
+    // with C[oc, t] = Σ_ic w_2d[ic, oc] * y[ic, t] — identical
+    // arithmetic to the existing `conv1d_causal_ggml` path's
+    // `mul_mat(im2col_reshape, w_reshape)` for `K=1`.
+    ggml_tensor * pw1_w_2d = ggml_reshape_2d(
+        ctx, pw1_w, pw1_w->ne[0] * pw1_w->ne[1], pw1_w->ne[2]);
+    ggml_tensor * pw1_out = ggml_mul_mat(ctx, pw1_w_2d, y);
+    if (pw1_b) {
+        ggml_tensor * pw1_b_2d = ggml_reshape_2d(ctx, pw1_b, hidden, 1);
+        pw1_out = ggml_add(ctx, pw1_out, ggml_repeat(ctx, pw1_b_2d, pw1_out));
+    }
+
+    // GELU is element-wise; the `[hidden, T0]` layout flows through
+    // verbatim.
+    ggml_tensor * gelu_out = ggml_gelu_erf(ctx, pw1_out);
+
+    // pw2 — symmetric to pw1.  Output is `[C, T0]`.
+    ggml_tensor * pw2_w_2d = ggml_reshape_2d(
+        ctx, pw2_w, pw2_w->ne[0] * pw2_w->ne[1], pw2_w->ne[2]);
+    ggml_tensor * pw2_out = ggml_mul_mat(ctx, pw2_w_2d, gelu_out);
+    if (pw2_b) {
+        ggml_tensor * pw2_b_2d = ggml_reshape_2d(ctx, pw2_b, C, 1);
+        pw2_out = ggml_add(ctx, pw2_out, ggml_repeat(ctx, pw2_b_2d, pw2_out));
+    }
+
+    // Block-level γ scaling applied per-channel (broadcast over T0)
+    // BEFORE the back-permute — gamma is a per-channel constant so
+    // the multiplication commutes with the layout flip and we save
+    // one ggml_repeat over [T0, C] vs. doing it after.
+    {
+        ggml_tensor * g_2d = ggml_reshape_2d(ctx, block_gamma, C, 1);
+        pw2_out = ggml_mul(ctx, pw2_out, ggml_repeat(ctx, g_2d, pw2_out));
+    }
+
+    // Back to `[T0, C]` for the residual add and the next block.
+    // This is the second (and last) ggml_cont in the helper — the
+    // back-half of the F7 cost / savings pair.
+    ggml_tensor * pw2_back = ggml_cont(
+        ctx, ggml_permute(ctx, pw2_out, 1, 0, 2, 3));
+    return ggml_add(ctx, residual, pw2_back);
+}
+
+// ---------------------------------------------------------------------
+// Audit finding F12 / Phase 2L — in-graph time/channel transpose
+// to kill the per-call `pack_time_channel_for_ggml` CPU loops.
+//
+// Background
+// ----------
+// The vector / text / duration estimator graph caches today hold
+// their primary activation input as `ne=[L, C]` (axis 0 = L = time
+// in GGML's axis order).  GGML stores that as channel-major memory
+// (`buf[c*L + t]`), but every caller hands the data in CPU-native
+// time-major form (`x[t*C + c]`).  Callers paper over the
+// mismatch by running `pack_time_channel_for_ggml(x_tc, L, C)` on
+// the host — an `O(L * C)` loop with strided stores — and then
+// uploading the packed buffer.  Audit F12: this is dozens of
+// small CPU transposes per synth that also serialise the GPU
+// dispatch.
+//
+// The fix (audit's recommended Option 2): keep the cache's upload
+// tensor in `ne=[C, L]` (axis 0 = C = channels), so the caller
+// can `ggml_backend_tensor_set` the CPU-native buffer byte-for-
+// byte without any host pack, and have the graph itself emit
+// `ggml_cont(ctx, ggml_transpose(ctx, x_tc_in))` to recover the
+// `[L, C]` view downstream ops already consume.
+//
+// Why bit-exact
+// -------------
+// `ggml_transpose` is a strides-only view (zero arithmetic);
+// `ggml_cont` is a memory rearrangement that materialises the
+// natural-stride layout of `ne=[L, C]` — element (l, c) lands at
+// byte `(l + c*L) * sizeof(float)`.  The host pack
+// `pack_time_channel_for_ggml` writes `out[c*L + t] = x[t*C + c]`,
+// i.e. the SAME byte at offset `(c*L + t) * sizeof(float)` carries
+// the SAME float value.  See
+// `test/test_supertonic_in_graph_transpose.cpp` for the bit-exact
+// parity assertion.
+//
+// Shape contract:
+//   - `x_tc_in` : F32, ne=[C, L].  Uploaded raw from CPU-native
+//                 `x[t*C + c]` buffer (no pack).
+//   - returns   : F32, ne=[L, C], naturally strided
+//                 (`nb=[4, L*4]`).
+//
+// Op-set used: `ggml_transpose` + `ggml_cont`.  Both universally
+// supported (incl. baseline upstream OpenCL).  No new ops.
+inline ggml_tensor * transpose_time_channel_ggml(ggml_context * ctx,
+                                                 ggml_tensor *  x_tc_in) {
+    // `ggml_transpose` swaps axes 0 and 1 by reordering strides
+    // (zero cost: same memory, new view).  `ggml_cont` then
+    // materialises the natural-stride `ne=[L, C]` layout
+    // (`buf[c*L + t]`) that downstream graph builders consume.
+    // Byte-for-byte identical to what
+    // `pack_time_channel_for_ggml` writes.
+    return ggml_cont(ctx, ggml_transpose(ctx, x_tc_in));
+}
+
+// Inline definition of the forward-declared portable leaky-relu helper
+// above.  Must come after `supertonic_use_cpu_custom_ops()` and
+// `supertonic_use_native_leaky_relu()` are declared so the dispatcher
+// resolves at every call site.
+//
+// Three-stage dispatch:
+//  1. CPU custom-op fast path — keeps the fused `ggml_leaky_relu`
+//     builtin (one op + one `to_t` worker pass) on the CPU backend.
+//  2. Backend-aware fast path — if the resolved GPU backend reports
+//     it implements `GGML_OP_LEAKY_RELU` natively (Vulkan / Metal /
+//     CUDA, plus chatterbox-patched OpenCL), emit the same single
+//     fused builtin.  This collapses to one shader dispatch per
+//     vocoder leaky-relu site instead of three (relu + scale + add)
+//     and keeps the GPU command buffer ~33 % shorter on the vocoder
+//     post-conv chain.
+//  3. Otherwise, decompose into `(1-α)·relu(x) + α·x`, i.e. three
+//     universally-supported ops.  The historical OpenCL bring-up
+//     path (no chatterbox patch) lands here; the result matches the
+//     fused builtin up to F32 rounding for x > 0.
+//
+// The `use_native_leaky_relu` flag is set at backend init time by probing
+// `ggml_backend_supports_op` with a synthetic LEAKY_RELU node, so
+// the helper gets the right answer for every backend without a
+// per-backend table.  See `supertonic_internal.h::supertonic_model::
+// use_native_leaky_relu` for the rationale.
+inline ggml_tensor * leaky_relu_portable_ggml(ggml_context * ctx, ggml_tensor * x, float alpha) {
+    if (supertonic_use_cpu_custom_ops() || supertonic_use_native_leaky_relu()) {
+        return ggml_leaky_relu(ctx, x, alpha, /*inplace=*/false);
+    }
+    // Conservative GPU fallback (op not advertised by the backend):
+    // (1 - α)·relu(x) + α·x.  Three universally-supported ops.
+    ggml_tensor * pos    = ggml_scale(ctx, ggml_relu(ctx, x), 1.0f - alpha);
+    ggml_tensor * scaled = ggml_scale(ctx, x, alpha);
+    return ggml_add(ctx, pos, scaled);
+}
+
 } // namespace tts_cpp::supertonic::detail
diff --git a/tts-cpp/src/supertonic_preprocess.cpp b/tts-cpp/src/supertonic_preprocess.cpp
index 60ffdbacc73..dfd42f0f10c 100644
--- a/tts-cpp/src/supertonic_preprocess.cpp
+++ b/tts-cpp/src/supertonic_preprocess.cpp
@@ -171,7 +171,8 @@ bool is_supported_language(const std::string & language) {
 
 std::string supertonic_preprocess_text(const std::string & text,
                                        const std::string & language,
-                                       const std::string & language_wrap_mode) {
+                                       const std::string & language_wrap_mode,
+                                       bool is_continuation) {
     if (!is_supported_language(language)) {
         throw std::runtime_error("invalid Supertonic language: " + language);
     }
@@ -211,7 +212,13 @@ std::string supertonic_preprocess_text(const std::string & text,
     while (s.find("``") != std::string::npos) replace_all(s, "``", "`");
 
     s = collapse_spaces(s);
-    if (!has_terminal_punct(s)) s += ".";
+    // Skip the auto-period for continuation chunks (streaming).  The
+    // model was trained on sentence-terminated input; on chunked mid-
+    // utterance text a fake period makes it speak the stub as a
+    // complete sentence with falling intonation + trailing artifacts.
+    // Continuation chunks pass through with their natural ending (word,
+    // comma, etc.) so the model isn't lied to about sentence end.
+    if (!is_continuation && !has_terminal_punct(s)) s += ".";
     if (language_wrap_mode == "none") return s;
     if (language_wrap_mode == "prefix") return "<" + language + ">" + s + " ";
     if (language_wrap_mode == "open_close") return "<" + language + ">" + s + "</" + language + ">";
@@ -223,9 +230,11 @@ bool supertonic_text_to_ids(const supertonic_model & model,
                             const std::string & language,
                             std::vector<int32_t> & ids,
                             std::string * normalized_text,
-                            std::string * error) {
+                            std::string * error,
+                            bool is_continuation) {
     try {
-        std::string normalized = supertonic_preprocess_text(text, language, model.hparams.language_wrap_mode);
+        std::string normalized = supertonic_preprocess_text(
+            text, language, model.hparams.language_wrap_mode, is_continuation);
         std::vector<uint32_t> cps = utf8_to_cps(normalized);
         ids.clear();
         ids.reserve(cps.size());
diff --git a/tts-cpp/src/supertonic_text_encoder.cpp b/tts-cpp/src/supertonic_text_encoder.cpp
index c03839b8055..1fd2d160497 100644
--- a/tts-cpp/src/supertonic_text_encoder.cpp
+++ b/tts-cpp/src/supertonic_text_encoder.cpp
@@ -53,7 +53,9 @@ void profile_text_begin() {
 }
 
 void profile_text_compute(const supertonic_model & model, ggml_cgraph * graph, const char * island) {
-    if (!text_profile_enabled()) {
+    const bool stderr_on = text_profile_enabled();
+    const bool csv_on    = supertonic_profile_csv_enabled();
+    if (!stderr_on && !csv_on) {
         supertonic_graph_compute(model, graph);
         return;
     }
@@ -64,8 +66,17 @@ void profile_text_compute(const supertonic_model & model, ggml_cgraph * graph, c
     const auto t1 = std::chrono::steady_clock::now();
     const double compute_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
     state.last = t1;
-    std::fprintf(stderr, "supertonic_text_profile island=%s pre_ms=%.3f compute_ms=%.3f\n",
-                 island, pre_ms, compute_ms);
+    if (stderr_on) {
+        std::fprintf(stderr, "supertonic_text_profile island=%s pre_ms=%.3f compute_ms=%.3f\n",
+                     island, pre_ms, compute_ms);
+    }
+    // Phase 2D: text encoder doesn't have a denoise step concept;
+    // pass -1 sentinel.  Use the negative step value to filter
+    // text-stage rows out of vector-stage analyses in the
+    // analysis script.
+    if (csv_on) {
+        supertonic_profile_csv_record("text", island, /*step=*/-1, compute_ms);
+    }
 }
 
 void profile_text_checkpoint(const char * island) {
@@ -105,7 +116,14 @@ ggml_tensor * repeat_like(ggml_context * ctx, ggml_tensor * v, ggml_tensor * lik
         else if (like->ne[1] == v->ne[0]) v = ggml_reshape_2d(ctx, v, 1, v->ne[0]);
     }
     if (!ggml_can_repeat(v, like)) throw std::runtime_error("cannot repeat tensor in text encoder graph");
-    return ggml_repeat(ctx, v, like);
+    // Every caller feeds this into ggml_add/ggml_mul which broadcast natively;
+    // skip the explicit ggml_repeat dispatch.
+    static const bool force_explicit_repeat =
+        std::getenv("SUPERTONIC_FORCE_EXPLICIT_REPEAT") != nullptr;
+    if (force_explicit_repeat) {
+        return ggml_repeat(ctx, v, like);
+    }
+    return v;
 }
 
 ggml_tensor * conv1d_f32(ggml_context * ctx,
@@ -114,6 +132,8 @@ ggml_tensor * conv1d_f32(ggml_context * ctx,
                          int stride,
                          int padding,
                          int dilation) {
+    // text_encoder uses the pure-graph path unconditionally; no CPU fast path
+    // here so no use_cpu_fastpath plumbing.
     ggml_tensor * im2col = ggml_im2col(ctx, kernel, input, stride, 0, padding, 0, dilation, 0, false, GGML_TYPE_F32);
     ggml_tensor * result = ggml_mul_mat(ctx,
         ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[2] * im2col->ne[1]),
@@ -122,6 +142,15 @@ ggml_tensor * conv1d_f32(ggml_context * ctx,
 }
 
 ggml_tensor * edge_clamp_pad_1d(ggml_context * ctx, ggml_tensor * x, int pad_left, int pad_right) {
+    if (pad_left == 0 && pad_right == 0) return x;
+    static const bool disable_fused_edge_pad =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_EDGE_PAD") != nullptr;
+    if (!disable_fused_edge_pad &&
+        x->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        ggml_is_contiguous(x)) {
+        return ggml_supertonic_edge_pad_1d(ctx, x, pad_left, pad_right);
+    }
     const int64_t L = x->ne[0], C = x->ne[1];
     ggml_tensor * out = x;
     if (pad_left > 0) {
@@ -140,6 +169,16 @@ ggml_tensor * depthwise_same_ggml(ggml_context * ctx,
                                   ggml_tensor * w,
                                   ggml_tensor * b) {
     const int K = (int)w->ne[0];
+    static const bool disable_fused =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_DEPTHWISE") != nullptr;
+    if (!disable_fused && (K == 3 || K == 5) &&
+        x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 &&
+        b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 && w->ne[1] == 1 && w->ne[3] == 1 &&
+        w->ne[2] == x->ne[1] && b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(w) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_depthwise_1d(ctx, x, w, b, 1);
+    }
     const int pad_left = (K - 1) / 2;
     const int pad_right = (K - 1) - pad_left;
     ggml_tensor * padded = edge_clamp_pad_1d(ctx, x, pad_left, pad_right);
@@ -151,6 +190,15 @@ ggml_tensor * depthwise_same_ggml(ggml_context * ctx,
 }
 
 ggml_tensor * layer_norm_ggml(ggml_context * ctx, ggml_tensor * x, ggml_tensor * g, ggml_tensor * b) {
+    static const bool disable_fused_layer_norm =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_LAYER_NORM") != nullptr;
+    if (!disable_fused_layer_norm &&
+        x->type == GGML_TYPE_F32 && g->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        g->ne[0] == x->ne[1] && b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(g) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_layer_norm_channel(ctx, x, g, b, 1e-6f);
+    }
     ggml_tensor * xt = ggml_cont(ctx, ggml_permute(ctx, x, 1, 0, 2, 3));
     xt = ggml_norm(ctx, xt, 1e-6f);
     xt = ggml_mul(ctx, xt, repeat_like(ctx, g, xt));
@@ -428,7 +476,15 @@ void build_relpos_cache(text_relpos_graph_cache & cache,
     for (int i = 0; i < N_MASKS; ++i) {
         cache.masks[i] = ggml_new_tensor_3d(cache.ctx, GGML_TYPE_F32, L, L, 1);
         const std::string name = "relpos_mask_" + std::to_string(i);
-        ggml_set_name(cache.masks[i], name.c_str()); ggml_set_input(cache.masks[i]);
+        ggml_set_name(cache.masks[i], name.c_str());
+        ggml_set_input(cache.masks[i]);
+        // gallocr frees leaf inputs once their last consumer in the graph
+        // runs, which makes the buffer available for intermediate reuse on
+        // subsequent compute passes — by the next run the mask data is
+        // overwritten.  Mark as OUTPUT too so gallocr keeps the buffer
+        // alive across compute passes; the data is then uploaded once in
+        // build_relpos_cache and stable for the cache's lifetime.
+        ggml_set_output(cache.masks[i]);
     }
 
     ggml_tensor * q = conv1d_k1_channel_time_ggml(cache.ctx,
@@ -672,6 +728,9 @@ void speech_prompted_attention(const supertonic_model & m, int idx,
     dense_time_matmul(merged, L, C, out_w, out_b, C, out_lc);
 }
 
+// `speech_attention_cache` + `build_speech_attention_cache` own the
+// second-of-two graph caches `speech_prompted_attention_ggml` runs
+// (flash-attn + out-proj after host-side q/k/v_pack work).
 struct speech_attention_cache {
     const supertonic_model * model = nullptr;
     uint64_t generation_id = 0;
@@ -689,19 +748,19 @@ struct speech_attention_cache {
     ggml_tensor * v = nullptr;
 };
 
-void free_speech_attention_cache(speech_attention_cache & cache) {
+inline void free_speech_attention_cache(speech_attention_cache & cache) {
     supertonic_safe_gallocr_free(cache.allocr, cache.generation_id);
     if (cache.ctx) ggml_free(cache.ctx);
     cache = {};
 }
 
-void build_speech_attention_cache(speech_attention_cache & cache,
-                                  const supertonic_model & m,
-                                  int idx,
-                                  int L,
-                                  int Lctx,
-                                  const std::string & out_w_source,
-                                  const std::string & out_b_source) {
+inline void build_speech_attention_cache(speech_attention_cache & cache,
+                                         const supertonic_model & m,
+                                         int idx,
+                                         int L,
+                                         int Lctx,
+                                         const std::string & out_w_source,
+                                         const std::string & out_b_source) {
     free_speech_attention_cache(cache);
     cache.model = &m;
     cache.generation_id = m.generation_id;
@@ -737,6 +796,226 @@ void build_speech_attention_cache(speech_attention_cache & cache,
     ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
 }
 
+} // namespace (close anonymous; below symbols are detail-namespace
+  // scope so the round-12 #6 test can link against them)
+
+// Phase A4 / round-12 #6: speech_prompted_attention as ONE merged
+// ggml graph.  Master's Metal-port branch built the cache + builder
+// but never wired the run path; round 12 adds
+// `run_speech_prompted_merged_cache` and the dispatch in
+// `speech_prompted_attention_ggml` below.
+//
+// Pre-A4 this function built two separate graphs (QKV proj, then
+// flash-attn+out-proj) with host-side q_pack/v_pack/k_pack head-split
+// work between them.  The merged version does the head-split in-graph
+// via reshape + permute + cont (or relies on ggml's view semantics
+// where it's free), feeds straight into flash_attn, and runs the out
+// projection — all in one `ggml_backend_graph_compute` call.
+//
+// Per call savings (vs. legacy two-cache path):
+//   - 2 GPU→host downloads (q_out, v_out) → 0
+//   - 3 host→GPU uploads (q_pack, k_pack, v_pack) → 0
+//   - 1 fewer graph dispatch (one fewer command buffer)
+//   - host-side pack work eliminated entirely.
+// = 5 sync points saved per call × 2 layers = 10 sync points / synth.
+//
+// Struct + free + build are at detail-namespace scope (not
+// anonymous) so the round-12 CPU-only unit test can SFINAE-pin
+// the field contract.  Forward-declared in supertonic_internal.h.
+
+void free_speech_prompted_merged_cache(speech_prompted_merged_cache & cache) {
+    supertonic_safe_gallocr_free(cache.allocr, cache.generation_id);
+    if (cache.ctx) ggml_free(cache.ctx);
+    cache = {};
+}
+
+void build_speech_prompted_merged_cache(speech_prompted_merged_cache & cache,
+                                        const supertonic_model & m,
+                                        int idx,
+                                        int L,
+                                        int Lctx,
+                                        const std::string & q_w_source,
+                                        const std::string & v_w_source,
+                                        const std::string & out_w_source,
+                                        const std::string & out_b_source,
+                                        const std::string & tanh_k_source,
+                                        const std::string & q_b_source,
+                                        const std::string & v_b_source) {
+    const int C = 256;
+    const int half = 128;
+    const int H = 2;
+    (void)H;
+    free_speech_prompted_merged_cache(cache);
+    cache.model = &m;
+    cache.generation_id = m.generation_id;
+    cache.idx = idx;
+    cache.L = L;
+    cache.Lctx = Lctx;
+    cache.out_w_source = out_w_source;
+    cache.out_b_source = out_b_source;
+
+    constexpr int NODES = 512;
+    const size_t buf_size = ggml_tensor_overhead() * NODES + ggml_graph_overhead_custom(NODES, false);
+    cache.buf.assign(buf_size, 0);
+    ggml_init_params gp = { buf_size, cache.buf.data(), true };
+    cache.ctx = ggml_init(gp);
+    cache.gf = ggml_new_graph_custom(cache.ctx, NODES, false);
+
+    cache.x_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, C);
+    ggml_set_name(cache.x_in, "spm_x_in"); ggml_set_input(cache.x_in);
+    cache.style_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, Lctx, C);
+    ggml_set_name(cache.style_in, "spm_style_in"); ggml_set_input(cache.style_in);
+
+    // Q proj.  Output ne=[L, C].  Head-split: reshape to [L, half, H]
+    // then permute(1, 0, 2, 3) → cont gives [half, L, H] — the layout
+    // flash_attn views as [head_dim, q_len, n_heads].
+    ggml_tensor * q_tc = dense_matmul_time_ggml(cache.ctx, cache.x_in,
+        require_source_tensor(m, q_w_source),
+        require_source_tensor(m, q_b_source));
+    ggml_tensor * q_3d = ggml_reshape_3d(cache.ctx, q_tc, L, half, H);
+    ggml_tensor * q_dlh = ggml_cont(cache.ctx, ggml_permute(cache.ctx, q_3d, 1, 0, 2, 3));
+
+    // V proj on style.  Same head-split into [half, Lctx, H].
+    ggml_tensor * v_tc = dense_matmul_time_ggml(cache.ctx, cache.style_in,
+        require_source_tensor(m, v_w_source),
+        require_source_tensor(m, v_b_source));
+    ggml_tensor * v_3d = ggml_reshape_3d(cache.ctx, v_tc, Lctx, half, H);
+    ggml_tensor * v_dlh = ggml_cont(cache.ctx, ggml_permute(cache.ctx, v_3d, 1, 0, 2, 3));
+
+    // K is the precomputed tanh_k model tensor.  Stored as ne=[Lctx, C].
+    // Same head-split: reshape to [Lctx, half, H] then permute to
+    // [half, Lctx, H] and cont.  No per-call host work needed since
+    // K is constant per model.
+    ggml_tensor * k_orig = require_source_tensor(m, tanh_k_source);
+    ggml_tensor * k_3d = ggml_reshape_3d(cache.ctx, k_orig, Lctx, half, 2);
+    ggml_tensor * k_dlh = ggml_cont(cache.ctx, ggml_permute(cache.ctx, k_3d, 1, 0, 2, 3));
+
+    // Flash attention.  Same call shape as the pre-A4 path.
+    ggml_tensor * attn = ggml_flash_attn_ext(cache.ctx, q_dlh, k_dlh, v_dlh,
+                                              nullptr, 1.0f / 16.0f, 0.0f, 0.0f);
+    attn = ggml_reshape_2d(cache.ctx, attn, C, L);
+    ggml_tensor * ctx_tc = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, attn));
+
+    // Output projection.
+    cache.out = dense_matmul_time_ggml(cache.ctx, ctx_tc,
+        require_source_tensor(m, out_w_source),
+        require_source_tensor(m, out_b_source));
+    ggml_set_name(cache.out, "spm_out"); ggml_set_output(cache.out);
+    ggml_build_forward_expand(cache.gf, cache.out);
+
+    cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
+    if (!cache.allocr || !ggml_gallocr_reserve(cache.allocr, cache.gf)) {
+        free_speech_prompted_merged_cache(cache); // drop the cache keys so the next call rebuilds instead of hitting a half-built graph
+        throw std::runtime_error("ggml_gallocr_new/reserve speech_prompted_merged failed");
+    }
+    ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
+}
+
+// QVAC-18605 round 12 #6 — run path for the merged graph.
+//
+// Drop-in replacement for the legacy two-cache code path inside
+// `speech_prompted_attention_ggml`.  Caller is responsible for
+// keying the cache against `(model, idx, L, Lctx)` and rebuilding
+// on miss; this function assumes the cache is built + bound to
+// the backend in `m.backend`.
+//
+// Upload contract (matches `pack_time_channel_for_ggml`'s output):
+//   - `x_lc` is time-major-flat `x_lc[t*C + c]`.  We pack it once
+//     into channel-major-flat memory (`out[c*L + t]`) before
+//     uploading to `cache.x_in` (ne=[L, C], natural strides).
+//   - `style_ttl` is also time-major-flat; same packing.
+//
+// Download contract:
+//   - `cache.out` is ne=[L, C] channel-major-flat memory (matches
+//     master's `dense_matmul_time_ggml` output convention).
+//     `tensor_to_time_channel` flattens to time-major-flat
+//     `out_lc[t*C + c]` — same layout the caller in
+//     `supertonic_text_encoder_forward_ggml` expects from the
+//     pre-round-12 path.
+//
+// Compute cost (vs. legacy two-cache):
+//   + 1 cache-rebuild check (free) - already amortised once / synth.
+//   + 1 host pack of x_lc → x_raw (free; same memcpy size as legacy
+//     speech_prompted_attention_ggml does for its own QKV cache
+//     upload at line 1003).
+//   + 1 host pack of style_tc → style_raw (free; same as legacy).
+//   + 1 host→GPU upload each for x_in / style_in (same as legacy).
+//   + 1 graph dispatch.
+//   + 1 GPU→host download of cache.out.
+//   - 3 fewer host→GPU uploads (no q_pack / k_pack / v_pack since
+//     they're computed in-graph).
+//   - 2 fewer GPU→host downloads (no q_out / v_out).
+//   - 1 fewer graph dispatch (one merged graph instead of two
+//     separate qkv + flash-attn graphs).
+//   - All host pack work for q_pack / k_pack / v_pack eliminated
+//     (which scaled with L × head_dim × n_heads — the worst
+//     offender on long prompts).
+void run_speech_prompted_merged_cache(speech_prompted_merged_cache & cache,
+                                       const supertonic_model & m,
+                                       const std::vector<float> & x_lc,
+                                       int L,
+                                       const float * style_ttl,
+                                       std::vector<float> & out_lc) {
+    (void) m; // referenced via cache.model invariant; kept in the
+              // signature to match the legacy
+              // `speech_prompted_attention_ggml(...)` shape.
+    const int C = 256;
+    const int Lctx = 50;
+    if (cache.ctx == nullptr || cache.gf == nullptr ||
+        cache.x_in == nullptr || cache.style_in == nullptr ||
+        cache.out == nullptr) {
+        throw std::runtime_error(
+            "run_speech_prompted_merged_cache: cache not built");
+    }
+    if (cache.L != L || cache.Lctx != Lctx) {
+        throw std::runtime_error(
+            "run_speech_prompted_merged_cache: cache key mismatch "
+            "(L/Lctx don't match the built graph)");
+    }
+    std::vector<float> x_raw = pack_time_channel_for_ggml(x_lc, L, C);
+    std::vector<float> style_tc((size_t) Lctx * C);
+    for (int t = 0; t < Lctx; ++t) {
+        for (int c = 0; c < C; ++c) {
+            style_tc[(size_t) t * C + c] = style_ttl[(size_t) t * C + c];
+        }
+    }
+    std::vector<float> style_raw = pack_time_channel_for_ggml(style_tc, Lctx, C);
+    ggml_backend_tensor_set(cache.x_in,     x_raw.data(),     0, x_raw.size()     * sizeof(float));
+    ggml_backend_tensor_set(cache.style_in, style_raw.data(), 0, style_raw.size() * sizeof(float));
+    std::string island = "speech" + std::to_string(cache.idx) + "_merged";
+    profile_text_compute(*cache.model, cache.gf, island.c_str());
+    out_lc = tensor_to_time_channel(cache.out);
+}
+
+namespace { // re-open anonymous namespace for the rest of the TU
+
+// F14 — cached speech-prompted attention QKV graph.
+//
+// Pre-audit, `speech_prompted_attention_ggml` allocated a fresh
+// `ggml_context` + `ggml_gallocr_t` every call.  The graph shape
+// depends only on `(L, idx)`; for the typical synth flow
+// (one text encoder call → 2 layers) that's 2 cold misses on the
+// first synth, then steady-state zero rebuilds.  Same pattern as
+// the F8 / F11 caches.
+struct speech_qkv_graph_cache {
+    const supertonic_model * model = nullptr;
+    uint64_t generation_id = 0;
+    int idx = -1;
+    int L = 0;
+    std::vector<uint8_t> buf;
+    ggml_context * ctx = nullptr;
+    ggml_cgraph * gf = nullptr;
+    ggml_gallocr_t allocr = nullptr;
+    ggml_tensor * x_in = nullptr;
+    ggml_tensor * style_in = nullptr;
+};
+
+inline void free_speech_qkv_cache(speech_qkv_graph_cache & cache) {
+    supertonic_safe_gallocr_free(cache.allocr, cache.generation_id);
+    if (cache.ctx) ggml_free(cache.ctx);
+    cache = {};
+}
+
 void speech_prompted_attention_ggml(const supertonic_model & m, int idx,
                                     const std::vector<float> & x_lc, int L,
                                     const float * style_ttl,
@@ -744,56 +1023,123 @@ void speech_prompted_attention_ggml(const supertonic_model & m, int idx,
     const int C = 256;
     const int half = 128;
     const int Lctx = 50;
+    if (idx < 0 || idx >= 2) throw std::runtime_error("invalid speech attention idx");
     const int attn_num = idx + 1;
     const std::string p = "text_encoder:tts.ttl.speech_prompted_text_encoder.attention" + std::to_string(attn_num);
     const std::string q_w = "text_encoder:" + std::string(idx == 0 ? "onnx::MatMul_3678" : "onnx::MatMul_3682");
     const std::string v_w = "text_encoder:" + std::string(idx == 0 ? "onnx::MatMul_3680" : "onnx::MatMul_3684");
     const std::string o_w = "text_encoder:" + std::string(idx == 0 ? "onnx::MatMul_3681" : "onnx::MatMul_3685");
-
-    constexpr int MAX_NODES = 256;
-    static size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead_custom(MAX_NODES, false);
-    thread_local std::vector<uint8_t> buf(buf_size);
-    ggml_init_params gp = { buf_size, buf.data(), true };
-    ggml_context * ctx = ggml_init(gp);
-    ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
-
-    ggml_tensor * x_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, C);
-    ggml_set_name(x_in, "speech_attn_x"); ggml_set_input(x_in);
-    ggml_tensor * style_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, Lctx, C);
-    ggml_set_name(style_in, "speech_attn_style"); ggml_set_input(style_in);
-    ggml_tensor * q = dense_matmul_time_ggml(ctx, x_in,
-        require_source_tensor(m, q_w),
-        require_source_tensor(m, p + ".W_query.linear.bias"));
-    ggml_set_name(q, "speech_attn_q"); ggml_set_output(q); ggml_build_forward_expand(gf, q);
-    ggml_tensor * v = dense_matmul_time_ggml(ctx, style_in,
-        require_source_tensor(m, v_w),
-        require_source_tensor(m, p + ".W_value.linear.bias"));
-    ggml_set_name(v, "speech_attn_v"); ggml_set_output(v); ggml_build_forward_expand(gf, v);
-
-    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
-    if (!allocr) {
-        ggml_free(ctx);
-        throw std::runtime_error("ggml_gallocr_new speech text attention failed");
+    const std::string tanh_k_src = "text_encoder:/speech_prompted_text_encoder/attention" + std::to_string(attn_num) + "/tanh/Tanh_output_0";
+
+    // QVAC-18605 round 12 #6 — merged-cache fast path on non-CPU
+    // backends.  Eliminates 5 sync points (2 GPU→host downloads +
+    // 3 host→GPU uploads) and all host-side Q/V/K head-split pack
+    // work per call.  Two layers per synth = 10 sync points / synth
+    // saved at the text encoder.
+    //
+    // CPU stays on the legacy two-cache path: master's
+    // `dense_matmul_time_ggml` CPU fast path uses cblas via the
+    // custom-op dispatch, and the host-side head-split is a free
+    // memcpy.  Switching CPU to the merged path would pull the
+    // matmul through the ggml conv1d fallback (slower on x86) and
+    // gain nothing — sync points don't exist on CPU.
+    if (!model_prefers_cpu_kernels(m)) {
+        thread_local speech_prompted_merged_cache merged_caches[2];
+        speech_prompted_merged_cache & merged = merged_caches[idx];
+        if (merged.model != &m || merged.generation_id != m.generation_id ||
+            merged.idx != idx || merged.L != L || merged.Lctx != Lctx ||
+            merged.out_w_source != o_w) {
+            build_speech_prompted_merged_cache(merged, m, idx, L, Lctx,
+                                                /*q_w_source=*/q_w,
+                                                /*v_w_source=*/v_w,
+                                                /*out_w_source=*/o_w,
+                                                /*out_b_source=*/p + ".out_fc.linear.bias",
+                                                /*tanh_k_source=*/tanh_k_src,
+                                                /*q_b_source=*/p + ".W_query.linear.bias",
+                                                /*v_b_source=*/p + ".W_value.linear.bias");
+        }
+        run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
+        return;
     }
-    if (!ggml_gallocr_reserve(allocr, gf)) {
-        ggml_gallocr_free(allocr);
-        ggml_free(ctx);
-        throw std::runtime_error("ggml_gallocr_reserve speech text attention failed");
+
+    (void) tanh_k_src; // master's path uses model.speech_tanh_k_cache; tanh_k_src kept for symbolic parity with read_f32 fallback below.
+
+    // F14: per-(model, idx, L) cached QKV graph.  Two thread-local
+    // slots so the two speech-prompted layers don't fight over a
+    // shared cache key.  The inner flash-attention graph is still
+    // cached separately in `speech_attention_cache` below.
+    thread_local speech_qkv_graph_cache qkv_caches[2];
+    // idx already range-checked at the top of the function (round-12
+    // dispatch needed it for the merged-cache thread_local array).
+    speech_qkv_graph_cache & qkv_cache = qkv_caches[idx];
+    if (qkv_cache.model != &m || qkv_cache.generation_id != m.generation_id ||
+        qkv_cache.idx != idx || qkv_cache.L != L) {
+        free_speech_qkv_cache(qkv_cache);
+        qkv_cache.model = &m;
+        qkv_cache.generation_id = m.generation_id;
+        qkv_cache.idx = idx;
+        qkv_cache.L = L;
+
+        constexpr int MAX_NODES = 256;
+        const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                                ggml_graph_overhead_custom(MAX_NODES, false);
+        qkv_cache.buf.assign(buf_size, 0);
+        ggml_init_params gp = { buf_size, qkv_cache.buf.data(), true };
+        qkv_cache.ctx = ggml_init(gp);
+        qkv_cache.gf = ggml_new_graph_custom(qkv_cache.ctx, MAX_NODES, false);
+
+        qkv_cache.x_in = ggml_new_tensor_2d(qkv_cache.ctx, GGML_TYPE_F32, L, C);
+        ggml_set_name(qkv_cache.x_in, "speech_attn_x"); ggml_set_input(qkv_cache.x_in);
+        qkv_cache.style_in = ggml_new_tensor_2d(qkv_cache.ctx, GGML_TYPE_F32, Lctx, C);
+        ggml_set_name(qkv_cache.style_in, "speech_attn_style"); ggml_set_input(qkv_cache.style_in);
+        ggml_tensor * q = dense_matmul_time_ggml(qkv_cache.ctx, qkv_cache.x_in,
+            require_source_tensor(m, q_w),
+            require_source_tensor(m, p + ".W_query.linear.bias"));
+        ggml_set_name(q, "speech_attn_q"); ggml_set_output(q);
+        ggml_build_forward_expand(qkv_cache.gf, q);
+        ggml_tensor * v_t = dense_matmul_time_ggml(qkv_cache.ctx, qkv_cache.style_in,
+            require_source_tensor(m, v_w),
+            require_source_tensor(m, p + ".W_value.linear.bias"));
+        ggml_set_name(v_t, "speech_attn_v"); ggml_set_output(v_t);
+        ggml_build_forward_expand(qkv_cache.gf, v_t);
+
+        qkv_cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
+        if (!qkv_cache.allocr) {
+            ggml_free(qkv_cache.ctx);
+            qkv_cache = {};
+            throw std::runtime_error("ggml_gallocr_new speech text attention failed");
+        }
+        if (!ggml_gallocr_reserve(qkv_cache.allocr, qkv_cache.gf)) {
+            ggml_gallocr_free(qkv_cache.allocr);
+            ggml_free(qkv_cache.ctx);
+            qkv_cache = {};
+            throw std::runtime_error("ggml_gallocr_reserve speech text attention failed");
+        }
+        ggml_gallocr_alloc_graph(qkv_cache.allocr, qkv_cache.gf);
     }
-    ggml_gallocr_alloc_graph(allocr, gf);
 
     std::vector<float> x_raw = pack_time_channel_for_ggml(x_lc, L, C);
     std::vector<float> style_tc((size_t)Lctx*C);
     for (int t = 0; t < Lctx; ++t) for (int c = 0; c < C; ++c) style_tc[(size_t)t*C+c] = style_ttl[(size_t)t*C+c];
     std::vector<float> style_raw = pack_time_channel_for_ggml(style_tc, Lctx, C);
-    ggml_backend_tensor_set(x_in, x_raw.data(), 0, x_raw.size()*sizeof(float));
-    ggml_backend_tensor_set(style_in, style_raw.data(), 0, style_raw.size()*sizeof(float));
+    ggml_backend_tensor_set(qkv_cache.x_in, x_raw.data(), 0, x_raw.size()*sizeof(float));
+    ggml_backend_tensor_set(qkv_cache.style_in, style_raw.data(), 0, style_raw.size()*sizeof(float));
     std::string qkv_island = "speech" + std::to_string(idx) + "_qkv";
-    profile_text_compute(m, gf, qkv_island.c_str());
-
-    std::vector<float> q_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "speech_attn_q"));
-    std::vector<float> v_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "speech_attn_v"));
-    f32_tensor tanh_k = read_f32(m, "text_encoder:/speech_prompted_text_encoder/attention" + std::to_string(attn_num) + "/tanh/Tanh_output_0");
+    profile_text_compute(m, qkv_cache.gf, qkv_island.c_str());
+
+    std::vector<float> q_out = tensor_to_time_channel(ggml_graph_get_tensor(qkv_cache.gf, "speech_attn_q"));
+    std::vector<float> v_out = tensor_to_time_channel(ggml_graph_get_tensor(qkv_cache.gf, "speech_attn_v"));
+    // F16: pre-cached at load (`m.speech_tanh_k_cache[idx]`).  Falls
+    // back to the per-call `read_f32` only when the GGUF didn't
+    // carry the rostered name (legacy + future-compat).
+    const float * tanh_k_data = nullptr;
+    f32_tensor tanh_k_fallback;
+    if (idx >= 0 && idx < 2 && !m.speech_tanh_k_cache[idx].empty()) {
+        tanh_k_data = m.speech_tanh_k_cache[idx].data();
+    } else {
+        tanh_k_fallback = read_f32(m, tanh_k_src);
+        tanh_k_data = tanh_k_fallback.data.data();
+    }
     std::vector<float> q_pack((size_t)half*L*2), k_pack((size_t)half*Lctx*2), v_pack((size_t)half*Lctx*2);
     for (int h = 0; h < 2; ++h) {
         for (int t = 0; t < L; ++t) {
@@ -801,7 +1147,7 @@ void speech_prompted_attention_ggml(const supertonic_model & m, int idx,
         }
         for (int t = 0; t < Lctx; ++t) {
             for (int d = 0; d < half; ++d) {
-                k_pack[(size_t)d + (size_t)half*((size_t)t + (size_t)Lctx*h)] = tanh_k.data[((size_t)h*half + d)*Lctx + t];
+                k_pack[(size_t)d + (size_t)half*((size_t)t + (size_t)Lctx*h)] = tanh_k_data[((size_t)h*half + d)*Lctx + t];
                 v_pack[(size_t)d + (size_t)half*((size_t)t + (size_t)Lctx*h)] = v_out[(size_t)t*C + h*half + d];
             }
         }
@@ -810,8 +1156,9 @@ void speech_prompted_attention_ggml(const supertonic_model & m, int idx,
     speech_attention_cache & cache = caches[idx];
     if (cache.model != &m || cache.generation_id != m.generation_id ||
         cache.idx != idx || cache.L != L || cache.Lctx != Lctx ||
-        cache.out_w_source != o_w || cache.out_b_source != p + ".out_fc.linear.bias") {
-        build_speech_attention_cache(cache, m, idx, L, Lctx, o_w, p + ".out_fc.linear.bias");
+        cache.out_w_source != o_w) {
+        build_speech_attention_cache(cache, m, idx, L, Lctx, o_w,
+                                      p + ".out_fc.linear.bias");
     }
     ggml_backend_tensor_set(cache.q, q_pack.data(), 0, q_pack.size()*sizeof(float));
     ggml_backend_tensor_set(cache.k, k_pack.data(), 0, k_pack.size()*sizeof(float));
@@ -819,8 +1166,8 @@ void speech_prompted_attention_ggml(const supertonic_model & m, int idx,
     std::string flash_island = "speech" + std::to_string(idx) + "_flash";
     profile_text_compute(m, cache.gf, flash_island.c_str());
     out_lc = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "speech_attn_out"));
-    ggml_gallocr_free(allocr);
-    ggml_free(ctx);
+    // F14: outer QKV graph lives in `qkv_cache` (above) and
+    // survives across synths.
 }
 
 } // namespace
@@ -896,63 +1243,135 @@ bool supertonic_text_encoder_forward_ggml(const supertonic_model & model,
                                           const float * style_ttl,
                                           std::vector<float> & text_emb_out,
                                           std::string * error) {
+    supertonic_op_dispatch_scope dispatch(model);
     try {
         profile_text_begin();
         const int C = 256;
         const int L = text_len;
-        f32_tensor emb = read_f32(model, "text_encoder:tts.ttl.text_encoder.text_embedder.char_embedder.weight");
-        std::vector<float> x((size_t)L*C);
+
+        // F10 — embedding lookup runs as `ggml_get_rows` on the
+        // device.  The pre-audit code downloaded the entire
+        // embedding table (~2 MB for the default vocab × C=256
+        // model) and CPU-gathered one row per token; this hook
+        // uploads `L` int32 ids instead and produces the gathered
+        // matrix directly on the backend.  `get_rows` output is
+        // time-major (ne=[C, L]), so we follow with
+        // `ggml_transpose + ggml_cont` to land in the channel-major
+        // ne=[L, C] layout the convnext blocks expect.  Bounds
+        // check still runs host-side against the (host-known) vocab
+        // size of the embedding tensor.
+        ggml_tensor * emb_table = require_source_tensor(model,
+            "text_encoder:tts.ttl.text_encoder.text_embedder.char_embedder.weight");
+        const int64_t vocab_size = emb_table->ne[1];
+        std::vector<int32_t> ids(L);
         for (int t = 0; t < L; ++t) {
-            int64_t id = text_ids[t];
-            if (id < 0 || id >= emb.ne[1]) throw std::runtime_error("text id out of range");
-            for (int c = 0; c < C; ++c) x[(size_t)t*C+c] = emb.data[(size_t)id*C+c];
+            const int64_t id = text_ids[t];
+            if (id < 0 || id >= vocab_size) {
+                throw std::runtime_error("text id out of range");
+            }
+            ids[t] = (int32_t) id;
         }
 
-        constexpr int MAX_NODES = 640;
-        static size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead_custom(MAX_NODES, false);
-        thread_local std::vector<uint8_t> buf(buf_size);
-        ggml_init_params gp = { buf_size, buf.data(), true };
-        ggml_context * ctx = ggml_init(gp);
-        ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
-        ggml_tensor * in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(in, "text_encoder_embed"); ggml_set_input(in);
-        ggml_tensor * y = in;
-        for (int i = 0; i < 6; ++i) {
-            y = text_convnext_ggml(ctx, model, "text_encoder:tts.ttl.text_encoder.convnext.convnext." + std::to_string(i), y);
-        }
-        ggml_set_name(y, "text_encoder_convnext5"); ggml_set_output(y);
-        ggml_build_forward_expand(gf, y);
-        ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
-        if (!allocr) {
-            ggml_free(ctx);
-            throw std::runtime_error("ggml_gallocr_new text encoder failed");
-        }
-        if (!ggml_gallocr_reserve(allocr, gf)) {
-            ggml_gallocr_free(allocr);
-            ggml_free(ctx);
-            throw std::runtime_error("ggml_gallocr_reserve text encoder failed");
+        // F18 — text-encoder convnext-front graph cache.  Same
+        // pattern as F8 / F11 / F14: build once per (model, L),
+        // survive across synths; the per-synth path becomes
+        // `tensor_set(ids) → compute → tensor_get(output)`.
+        struct text_convnext_front_cache {
+            const supertonic_model * model = nullptr;
+            uint64_t generation_id = 0;
+            int L = 0;
+            std::vector<uint8_t> buf;
+            ggml_context * ctx = nullptr;
+            ggml_cgraph * gf = nullptr;
+            ggml_gallocr_t allocr = nullptr;
+            ggml_tensor * ids_in = nullptr;
+        };
+        thread_local text_convnext_front_cache convnext_cache;
+        if (convnext_cache.model != &model ||
+            convnext_cache.generation_id != model.generation_id ||
+            convnext_cache.L != L) {
+            // Tear down stale state.
+            supertonic_safe_gallocr_free(convnext_cache.allocr, convnext_cache.generation_id);
+            if (convnext_cache.ctx) ggml_free(convnext_cache.ctx);
+            convnext_cache = {};
+            convnext_cache.model = &model;
+            convnext_cache.generation_id = model.generation_id;
+            convnext_cache.L = L;
+
+            constexpr int MAX_NODES = 640;
+            const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                                    ggml_graph_overhead_custom(MAX_NODES, false);
+            convnext_cache.buf.assign(buf_size, 0);
+            ggml_init_params gp = { buf_size, convnext_cache.buf.data(), true };
+            convnext_cache.ctx = ggml_init(gp);
+            convnext_cache.gf  = ggml_new_graph_custom(convnext_cache.ctx, MAX_NODES, false);
+
+            // F10: i32 token-id input, gather → transpose → cont →
+            // convnext stack.  Same op sequence as pre-F18; only
+            // the lifetime around it changed.
+            convnext_cache.ids_in = ggml_new_tensor_1d(convnext_cache.ctx, GGML_TYPE_I32, L);
+            ggml_set_name(convnext_cache.ids_in, "text_encoder_ids");
+            ggml_set_input(convnext_cache.ids_in);
+            ggml_tensor * gathered = ggml_get_rows(convnext_cache.ctx, emb_table, convnext_cache.ids_in);
+            ggml_tensor * in_t = ggml_cont(convnext_cache.ctx, ggml_transpose(convnext_cache.ctx, gathered));
+            ggml_set_name(in_t, "text_encoder_embed");
+            ggml_tensor * y_t = in_t;
+            for (int i = 0; i < 6; ++i) {
+                y_t = text_convnext_ggml(convnext_cache.ctx, model,
+                    "text_encoder:tts.ttl.text_encoder.convnext.convnext." + std::to_string(i), y_t);
+            }
+            ggml_set_name(y_t, "text_encoder_convnext5");
+            ggml_set_output(y_t);
+            ggml_build_forward_expand(convnext_cache.gf, y_t);
+
+            convnext_cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+            if (!convnext_cache.allocr) {
+                ggml_free(convnext_cache.ctx);
+                convnext_cache = {};
+                throw std::runtime_error("ggml_gallocr_new text encoder failed");
+            }
+            if (!ggml_gallocr_reserve(convnext_cache.allocr, convnext_cache.gf)) {
+                ggml_gallocr_free(convnext_cache.allocr);
+                ggml_free(convnext_cache.ctx);
+                convnext_cache = {};
+                throw std::runtime_error("ggml_gallocr_reserve text encoder failed");
+            }
+            ggml_gallocr_alloc_graph(convnext_cache.allocr, convnext_cache.gf);
         }
-        ggml_gallocr_alloc_graph(allocr, gf);
-        std::vector<float> raw = pack_time_channel_for_ggml(x, L, C);
-        ggml_backend_tensor_set(in, raw.data(), 0, raw.size()*sizeof(float));
-        profile_text_compute(model, gf, "convnext_front");
-        x = tensor_to_time_channel(ggml_graph_get_tensor(gf, "text_encoder_convnext5"));
-        ggml_gallocr_free(allocr);
-        ggml_free(ctx);
+        ggml_backend_tensor_set(convnext_cache.ids_in, ids.data(), 0, ids.size() * sizeof(int32_t));
+        profile_text_compute(model, convnext_cache.gf, "convnext_front");
+        std::vector<float> x = tensor_to_time_channel(
+            ggml_graph_get_tensor(convnext_cache.gf, "text_encoder_convnext5"));
         profile_text_checkpoint("convnext_readback");
 
         // The text encoder's relative-position and speech-prompted attention
         // layers are custom scalar continuations for now; the ConvNeXt front
         // half above is already run as a GGML graph.
         std::vector<float> convnext_out = x;
+        // F13: layer-norm weights are pre-downloaded into
+        // `model.text_encoder_ln_weights` at load time; the helper
+        // below wraps the lookup with a `read_f32` fallback so a
+        // GGUF that's missing one of the rostered names degrades
+        // gracefully to the legacy behaviour.
+        auto ln_cached = [&](const std::string & name) -> f32_tensor {
+            auto it = model.text_encoder_ln_weights.find(name);
+            if (it != model.text_encoder_ln_weights.end() && !it->second.empty()) {
+                f32_tensor t;
+                t.data = it->second;
+                t.ne[0] = (int64_t) it->second.size();
+                t.ne[1] = 1; t.ne[2] = 1; t.ne[3] = 1;
+                return t;
+            }
+            return read_f32(model, name);
+        };
         for (int i = 0; i < 4; ++i) {
             std::vector<float> residual = x;
             relpos_attention_ggml(model, i, x, L, C, x);
             for (size_t j = 0; j < x.size(); ++j) x[j] += residual[j];
             layer_norm_channel(
                 x, L, C,
-                read_f32(model, "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1." + std::to_string(i) + ".norm.weight"),
-                read_f32(model, "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1." + std::to_string(i) + ".norm.bias"));
+                ln_cached("text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1." + std::to_string(i) + ".norm.weight"),
+                ln_cached("text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1." + std::to_string(i) + ".norm.bias"));
             std::string attn_post = "relpos" + std::to_string(i) + "_res_norm";
             profile_text_checkpoint(attn_post.c_str());
             residual = x;
@@ -960,8 +1379,8 @@ bool supertonic_text_encoder_forward_ggml(const supertonic_model & model,
             for (size_t j = 0; j < x.size(); ++j) x[j] += residual[j];
             layer_norm_channel(
                 x, L, C,
-                read_f32(model, "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2." + std::to_string(i) + ".norm.weight"),
-                read_f32(model, "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2." + std::to_string(i) + ".norm.bias"));
+                ln_cached("text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2." + std::to_string(i) + ".norm.weight"),
+                ln_cached("text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2." + std::to_string(i) + ".norm.bias"));
             std::string ffn_post = "ffn" + std::to_string(i) + "_res_norm";
             profile_text_checkpoint(ffn_post.c_str());
         }
@@ -976,10 +1395,12 @@ bool supertonic_text_encoder_forward_ggml(const supertonic_model & model,
         speech_prompted_attention_ggml(model, 1, x, L, style_ttl, attn_out);
         for (size_t i = 0; i < x.size(); ++i) x[i] = shared_residual[i] + attn_out[i];
         profile_text_checkpoint("speech1_residual");
+        // F13: final speech-prompted layer norm pair lives in the
+        // same host-side cache.
         layer_norm_channel(
             x, L, C,
-            read_f32(model, "text_encoder:tts.ttl.speech_prompted_text_encoder.norm.norm.weight"),
-            read_f32(model, "text_encoder:tts.ttl.speech_prompted_text_encoder.norm.norm.bias"));
+            ln_cached("text_encoder:tts.ttl.speech_prompted_text_encoder.norm.norm.weight"),
+            ln_cached("text_encoder:tts.ttl.speech_prompted_text_encoder.norm.norm.bias"));
         profile_text_checkpoint("speech_norm");
 
         text_emb_out.assign((size_t) C * L, 0.0f);
@@ -1001,6 +1422,7 @@ bool supertonic_text_encoder_trace_ggml(const supertonic_model & model,
                                         std::vector<supertonic_trace_tensor> & scalar_trace,
                                         std::vector<supertonic_trace_tensor> & ggml_trace,
                                         std::string * error) {
+    supertonic_op_dispatch_scope dispatch(model);
     try {
         scalar_trace.clear();
         ggml_trace.clear();
diff --git a/tts-cpp/src/supertonic_vector_estimator.cpp b/tts-cpp/src/supertonic_vector_estimator.cpp
index bd377c55dd1..b35caf14e86 100644
--- a/tts-cpp/src/supertonic_vector_estimator.cpp
+++ b/tts-cpp/src/supertonic_vector_estimator.cpp
@@ -13,6 +13,7 @@
 #include <cmath>
 #include <cstdio>
 #include <cstdlib>
+#include <cstring>
 #include <stdexcept>
 #include <string>
 
@@ -60,20 +61,52 @@ void profile_vector_step_begin(int step) {
 void profile_vector_compute(const supertonic_model & model,
                             ggml_cgraph * graph,
                             int step,
-                            const char * island) {
-    if (!vector_profile_enabled()) {
-        supertonic_sched_compute(model, graph);
+                            const char * island,
+                            bool use_sched = false) {
+    // Callers pick the compute primitive by allocation strategy:
+    //   use_sched == false : graph is bound to a per-cache
+    //                        `ggml_gallocr_t` (HEAD's F8/F18/F19/...
+    //                        caches).  Use `supertonic_graph_compute`
+    //                        (direct backend compute) so the tensors'
+    //                        gallocr-bound buffers are honoured.
+    //                        Routing through `model.sched` would
+    //                        force the graph through a scheduler that
+    //                        doesn't know about the per-cache gallocr
+    //                        and silently corrupt the output.
+    //   use_sched == true  : graph is allocated by
+    //                        `supertonic_sched_alloc` on the model
+    //                        scheduler (QVAC-19254 fallback when the
+    //                        primary backend doesn't support every
+    //                        op).  Use `supertonic_sched_compute` so
+    //                        the alloc + compute pair is consistent.
+    auto dispatch = [&]() {
+        if (use_sched) supertonic_sched_compute(model, graph);
+        else           supertonic_graph_compute(model, graph);
+    };
+    const bool stderr_on = vector_profile_enabled();
+    const bool csv_on    = supertonic_profile_csv_enabled();
+    if (!stderr_on && !csv_on) {
+        dispatch();
         return;
     }
     auto & state = vector_profile();
     const auto t0 = std::chrono::steady_clock::now();
     const double pre_ms = std::chrono::duration<double, std::milli>(t0 - state.last).count();
-    supertonic_sched_compute(model, graph);
+    dispatch();
     const auto t1 = std::chrono::steady_clock::now();
     const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
     state.last = t1;
-    std::fprintf(stderr, "supertonic_vector_profile step=%d island=%s pre_ms=%.3f compute_ms=%.3f\n",
-                 step, island, pre_ms, ms);
+    if (stderr_on) {
+        std::fprintf(stderr, "supertonic_vector_profile step=%d island=%s pre_ms=%.3f compute_ms=%.3f\n",
+                     step, island, pre_ms, ms);
+    }
+    // Phase 2D: machine-readable timing for the post-mortem
+    // analysis script.  Records every graph compute call with the
+    // stage/island context the existing stderr line already
+    // carries.  No-op when the CSV emitter isn't enabled.
+    if (csv_on) {
+        supertonic_profile_csv_record("vector", island, step, ms);
+    }
 }
 
 void profile_vector_step_end(int step) {
@@ -144,7 +177,18 @@ ggml_tensor * repeat_like(ggml_context * ctx, ggml_tensor * v, ggml_tensor * lik
             std::to_string(like->ne[0]) + "," + std::to_string(like->ne[1]) + "," +
             std::to_string(like->ne[2]) + "," + std::to_string(like->ne[3]) + "]");
     }
-    return ggml_repeat(ctx, v, like);
+    // Every call site in this file feeds the return value straight into
+    // ggml_add / ggml_mul, both of which broadcast natively in ggml.  Skip
+    // the explicit ggml_repeat node so the downstream op handles the
+    // broadcast — saves ~282 REPEAT ops per consolidated per-step graph.
+    // Override with SUPERTONIC_FORCE_EXPLICIT_REPEAT=1 if this regresses
+    // on a backend that doesn't broadcast (none observed today).
+    static const bool force_explicit_repeat =
+        std::getenv("SUPERTONIC_FORCE_EXPLICIT_REPEAT") != nullptr;
+    if (force_explicit_repeat) {
+        return ggml_repeat(ctx, v, like);
+    }
+    return v;
 }
 
 ggml_tensor * conv1d_f32(ggml_context * ctx,
@@ -154,7 +198,9 @@ ggml_tensor * conv1d_f32(ggml_context * ctx,
                          int padding,
                          int dilation) {
 #if defined(TTS_CPP_USE_ACCELERATE) || defined(TTS_CPP_USE_CBLAS)
-    if (kernel->ne[0] == 1 && stride == 1 && padding == 0 && dilation == 1 &&
+    // CPU-only fast path: see supertonic_op_dispatch_scope contract.
+    if (supertonic_use_cpu_custom_ops() &&
+        kernel->ne[0] == 1 && stride == 1 && padding == 0 && dilation == 1 &&
         input->type == GGML_TYPE_F32 && kernel->type == GGML_TYPE_F32 &&
         input->ne[2] == 1 && input->ne[3] == 1) {
         auto pointwise_op = [](ggml_tensor * dst, int ith, int nth, void *) {
@@ -204,6 +250,19 @@ ggml_tensor * conv1d_f32(ggml_context * ctx,
 }
 
 ggml_tensor * edge_clamp_pad_1d(ggml_context * ctx, ggml_tensor * x, int pad_left, int pad_right) {
+    if (pad_left == 0 && pad_right == 0) return x;
+    // Fused fast path via supertonic_edge_pad_1d.  Same kernel handles
+    // both sides; the legacy view + repeat_4d + concat chain (2 ops
+    // per side) becomes 1 dispatch total.  Override:
+    // SUPERTONIC_DISABLE_FUSED_EDGE_PAD=1.
+    static const bool disable_fused_edge_pad =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_EDGE_PAD") != nullptr;
+    if (!disable_fused_edge_pad &&
+        x->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        ggml_is_contiguous(x)) {
+        return ggml_supertonic_edge_pad_1d(ctx, x, pad_left, pad_right);
+    }
     const int64_t L = x->ne[0];
     const int64_t C = x->ne[1];
     ggml_tensor * out = x;
@@ -299,6 +358,9 @@ ggml_tensor * depthwise_same_custom_ggml(ggml_context * ctx,
                                          ggml_tensor * w,
                                          ggml_tensor * b,
                                          int dilation) {
+    // GPU backends reject GGML_OP_CUSTOM; fall through to the pure-GGML
+    // im2col + mul_mat path in depthwise_same_ggml() below.
+    if (!supertonic_use_cpu_custom_ops()) return nullptr;
     const depthwise_same_op_config * cfg = depthwise_same_config(dilation);
     if (!cfg || x->type != GGML_TYPE_F32 || w->type != GGML_TYPE_F32 || b->type != GGML_TYPE_F32) {
         return nullptr;
@@ -321,6 +383,23 @@ ggml_tensor * depthwise_same_ggml(ggml_context * ctx,
         return custom;
     }
     const int K = (int) w->ne[0];
+    // Fused-op fast path (any backend that registers GGML_OP_SUPERTONIC_DEPTHWISE_1D
+    // — Metal does via the local ggml port overlay; CPU's
+    // ggml_compute_forward_supertonic_depthwise_1d is the parity backstop).
+    // Replaces the edge_clamp_pad + im2col + mul_mat + add chain with one
+    // dispatch.  Currently supports K in {3, 5}; the existing graph path is
+    // the fallback for K outside that set.  Override with
+    // SUPERTONIC_DISABLE_FUSED_DEPTHWISE=1 to force the stock-op chain.
+    static const bool disable_fused =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_DEPTHWISE") != nullptr;
+    if (!disable_fused && (K == 3 || K == 5) &&
+        x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 &&
+        b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 && w->ne[1] == 1 && w->ne[3] == 1 &&
+        w->ne[2] == x->ne[1] && b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(w) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_depthwise_1d(ctx, x, w, b, dilation);
+    }
     const int pad_left = ((K - 1) * dilation) / 2;
     const int pad_right = (K - 1) * dilation - pad_left;
     ggml_tensor * padded = edge_clamp_pad_1d(ctx, x, pad_left, pad_right);
@@ -335,7 +414,23 @@ ggml_tensor * layer_norm_ggml(ggml_context * ctx,
                               ggml_tensor * x,
                               ggml_tensor * g,
                               ggml_tensor * b) {
-    if (x->type == GGML_TYPE_F32 && g->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 &&
+    // Fused-op fast path on non-CPU backends (Metal/Vulkan/CUDA/OpenCL):
+    // GGML_OP_SUPERTONIC_LAYER_NORM_CHANNEL collapses the
+    // permute + cont + ggml_norm + mul + add + permute + cont chain into
+    // a single dispatch.  Override with SUPERTONIC_DISABLE_FUSED_LAYER_NORM=1.
+    static const bool disable_fused_layer_norm =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_LAYER_NORM") != nullptr;
+    if (!supertonic_use_cpu_custom_ops() && !disable_fused_layer_norm &&
+        x->type == GGML_TYPE_F32 && g->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        g->ne[0] == x->ne[1] && b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(g) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_layer_norm_channel(ctx, x, g, b, 1e-6f);
+    }
+    // CPU-only direct row-wise layer-norm; falls through to permute +
+    // ggml_norm on non-CPU backends so the graph stays GPU-executable.
+    if (supertonic_use_cpu_custom_ops() &&
+        x->type == GGML_TYPE_F32 && g->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 &&
         x->ne[2] == 1 && x->ne[3] == 1) {
         auto layer_norm_op = [](ggml_tensor * dst, int ith, int nth, void *) {
             const ggml_tensor * src = dst->src[0];
@@ -387,7 +482,11 @@ ggml_tensor * dense_matmul_time_ggml(ggml_context * ctx,
                                      ggml_tensor * w,
                                      ggml_tensor * b) {
 #if defined(TTS_CPP_USE_ACCELERATE) || defined(TTS_CPP_USE_CBLAS)
-    if (x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 && (!b || b->type == GGML_TYPE_F32) &&
+    // CPU-only direct dense-time matmul; the pure-GGML fallback below
+    // expresses the same op via conv1d_f32(K=1) which is supported on
+    // every backend.
+    if (supertonic_use_cpu_custom_ops() &&
+        x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 && (!b || b->type == GGML_TYPE_F32) &&
         x->ne[2] == 1 && x->ne[3] == 1 && w->ne[1] == x->ne[1]) {
         auto dense_op = [](ggml_tensor * dst, int ith, int nth, void *) {
             const ggml_tensor * src = dst->src[0];
@@ -442,6 +541,13 @@ ggml_tensor * dense_matmul_time_ggml(ggml_context * ctx,
     // tensors are loaded as ne=[OC, IC].  Make that transpose contiguous, then
     // view it as a Conv1d kernel [K=1, IC, OC] so it can consume the repo's
     // standard time-major activation layout [T, IC].
+    //
+    // Tried replacing this conv1d_f32 wrapper with a direct ggml_mul_mat on
+    // 2026-05-11 — it requires cont on BOTH operands to satisfy mul_mat's
+    // !ggml_is_transposed(A) assertion, which yields the SAME dispatch count
+    // (cont + cont + mul_mat + add) as the current conv1d path (cont +
+    // im2col + mul_mat + add).  Net wash; keeping conv1d_f32 because it's
+    // already battle-tested with the CPU fastpath.
     ggml_tensor * wt = ggml_cont(ctx, ggml_transpose(ctx, w));
     ggml_tensor * kernel = ggml_reshape_3d(ctx, wt, 1, w->ne[1], w->ne[0]);
     ggml_tensor * y = conv1d_f32(ctx, kernel, x, 1, 0, 1);
@@ -449,8 +555,147 @@ ggml_tensor * dense_matmul_time_ggml(ggml_context * ctx,
     return y;
 }
 
+// Same as dense_matmul_time_ggml, but `model` is consulted for a pre-
+// transposed copy of `w` (built at load time for `:onnx::MatMul_*` weights
+// on non-CPU backends).  When available, the runtime `cont(transpose(w))`
+// dispatch is skipped — the pre-transposed tensor already has the
+// `[IC, OC]` layout that the conv1d_f32 K=1 kernel expects.  CPU callers
+// fall through to the original path (the cblas pointwise fast path takes
+// the loaded `[OC, IC]` weight directly).
+// Forward declaration of the `_wt_` variant used by the q8_0 branch; defined below.
+ggml_tensor * dense_matmul_time_wt_pretransposed_ggml(ggml_context * ctx,
+                                                      const supertonic_model & model,
+                                                      ggml_tensor * x,
+                                                      ggml_tensor * w,
+                                                      ggml_tensor * b);
+
+ggml_tensor * dense_matmul_time_pretransposed_ggml(ggml_context * ctx,
+                                                   const supertonic_model & model,
+                                                   ggml_tensor * x,
+                                                   ggml_tensor * w,
+                                                   ggml_tensor * b) {
+    if (!supertonic_use_cpu_custom_ops()) {
+        if (ggml_tensor * w_pre = try_pretransposed_weight(model, w)) {
+            if (w_pre->type == GGML_TYPE_F32) {
+                // f32 fast path: reshape w_pre into the conv1d kernel
+                // [K=1, IC, OC] and dispatch via the existing wrapper.
+                // mul_mat(im2col_f32, kernel_f32) hits the optimised
+                // kernel_mul_mm_f32_f32.
+                ggml_tensor * kernel = ggml_reshape_3d(ctx, w_pre, 1, w_pre->ne[0], w_pre->ne[1]);
+                ggml_tensor * y = conv1d_f32(ctx, kernel, x, 1, 0, 1);
+                if (b) y = ggml_add(ctx, y, repeat_like(ctx, b, y));
+                return y;
+            }
+            // Quantized w_pre (q8_0): the f32 fast path's
+            // mul_mat(im2col_f32, kernel_quant) would need a
+            // kernel_mul_mm_f32_q8_0 variant which ggml-metal doesn't ship.
+            // Route through the wt helper (kernel as src0 — dispatches
+            // kernel_mul_mm_q8_0_f32) and transpose the [A, T] result back
+            // to [T, A] so the caller's downstream code (residual adds,
+            // [T, C]-shaped intermediate state) doesn't have to change.
+            ggml_tensor * y_wt = dense_matmul_time_wt_pretransposed_ggml(
+                ctx, model, x, w, b);
+            return ggml_cont(ctx, ggml_transpose(ctx, y_wt));
+        }
+    }
+    return dense_matmul_time_ggml(ctx, x, w, b);
+}
+
+// Phase B2 partial: like dense_matmul_time_pretransposed_ggml but emits
+// the result in *width-major* `[OC, T]` layout instead of `[T, OC]`.
+//
+// The trick is to swap the `ggml_mul_mat` operand order from
+// `mul_mat(im2col_[IC,T], kernel_[IC,OC]) -> [T, OC]` to
+// `mul_mat(kernel_[IC,OC], im2col_[IC,T]) -> [OC, T]`.  Both operands
+// stay non-transposed so the assertion on `a`/`b` is satisfied.  The
+// kernel-as-`src0` ordering is also what `kernel_mul_mm_q8_0_f32`
+// requires, so this single change *also* unlocks A3 step 2 (the
+// optimized quantized matmul kernel will dispatch when `w_pre` is
+// q8_0 — see the asymmetric load logic in supertonic_gguf.cpp).
+//
+// Used at the Q/K/V projection sites in the per-step graph: the
+// downstream rope + flash_attn expect `[A, L]` layout, so the cont
+// (transpose) that used to flip `[L, A]` -> `[A, L]` becomes dead
+// code.  Eliminates ~24 cont dispatches per per-step graph × 5
+// steps = ~120 ops per synth.
+//
+// Bias add: `b` (shape `[OC]`) broadcasts naturally against the
+// new `[OC, T]` output via `repeat_like`'s 1-d → 2-d reshape on the
+// `ne[0]` match.
+//
+// Falls through to the legacy path with a runtime cont(transpose)
+// on the activation when no pretransposed weight is available
+// (e.g. weight not on the `:onnx::MatMul_` allowlist).
+ggml_tensor * dense_matmul_time_wt_pretransposed_ggml(ggml_context * ctx,
+                                                      const supertonic_model & model,
+                                                      ggml_tensor * x,
+                                                      ggml_tensor * w,
+                                                      ggml_tensor * b) {
+    if (!supertonic_use_cpu_custom_ops()) {
+        if (ggml_tensor * w_pre = try_pretransposed_weight(model, w)) {
+            const int IC = (int) w_pre->ne[0];
+            const int OC = (int) w_pre->ne[1];
+
+            // ggml_im2col only reads the kernel's SHAPE (ne[0..3]); it never
+            // touches the kernel data — the output buffer holds the
+            // rearranged activation.  So for the SHAPE we can use:
+            //   - a reshape of w_pre when w_pre is f32 (cheap, just metadata)
+            //   - a tiny phantom f32 tensor allocated in the graph context
+            //     when w_pre is quantized (because reshape_3d(q8_0, 1, IC, OC)
+            //     would set ne[0]=1 < q8_0's 32-element block size and break
+            //     the type's invariants).  The phantom is never read.
+            ggml_tensor * shape_kernel;
+            if (w_pre->type == GGML_TYPE_F32) {
+                shape_kernel = ggml_reshape_3d(ctx, w_pre, 1, IC, OC);
+            } else {
+                shape_kernel = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 1, IC, OC);
+                // No data needs binding — im2col only consults ne[0..3].
+            }
+
+            ggml_tensor * im2col = ggml_im2col(ctx, shape_kernel, x, 1, 0, 0, 0, 1, 0, false, GGML_TYPE_F32);
+            // im2col has ne=[IC, T, 1, 1].  Reshape to 2D for mul_mat.
+            ggml_tensor * im2col_2d = ggml_reshape_2d(ctx, im2col,
+                                                      im2col->ne[0], im2col->ne[2] * im2col->ne[1]);
+            // Swapped order: w_pre first (src0 = the quantized/f32 weight),
+            // im2col second (src1 = f32 activation).  Result is [M=OC, N=T].
+            // For w_pre=q8_0 this dispatches kernel_mul_mm_q8_0_f32 — the
+            // bandwidth-optimised quantized matmul kernel — which is the
+            // A3 step 2 unlock.
+            ggml_tensor * w_2d = ggml_reshape_2d(ctx, w_pre, IC, OC);
+            ggml_tensor * y = ggml_mul_mat(ctx, w_2d, im2col_2d);
+            // y has ne=[OC, T] — already the wt layout.
+            if (b) y = ggml_add(ctx, y, repeat_like(ctx, b, y));
+            return y;
+        }
+    }
+    // Fallback: legacy [T, OC] matmul + explicit cont(transpose) to
+    // produce [OC, T] for the caller.  CPU also lands here (and gets
+    // the cblas fast path for free via dense_matmul_time_ggml).
+    ggml_tensor * y_tc = dense_matmul_time_ggml(ctx, x, w, b);
+    return ggml_cont(ctx, ggml_transpose(ctx, y_tc));
+}
+
 ggml_tensor * bias_gelu_ggml(ggml_context * ctx, ggml_tensor * x, ggml_tensor * b) {
-    if (x->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 && x->ne[2] == 1 && x->ne[3] == 1) {
+    const bool use_cpu_custom = supertonic_use_cpu_custom_ops();
+    // Fused-op fast path (any backend that registers
+    // GGML_OP_SUPERTONIC_BIAS_GELU — Metal does via the local ggml port
+    // overlay; CPU's ggml_compute_forward_supertonic_bias_gelu is the
+    // parity backstop).  Replaces the add(bias) + gelu_erf chain
+    // (2 dispatches on Metal) with one dispatch.  Override with
+    // SUPERTONIC_DISABLE_FUSED_BIAS_GELU=1 to force the stock-op chain.
+    // Skipped on CPU custom-op backends (cblas path below is faster).
+    static const bool disable_fused_bias_gelu =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_BIAS_GELU") != nullptr;
+    if (!use_cpu_custom && !disable_fused_bias_gelu &&
+        x->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        b->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(b)) {
+        return ggml_supertonic_bias_gelu(ctx, x, b);
+    }
+    // CPU-only fused bias + GELU; falls back to gelu(add(x, b)) on GPU.
+    if (use_cpu_custom &&
+        x->type == GGML_TYPE_F32 && b->type == GGML_TYPE_F32 && x->ne[2] == 1 && x->ne[3] == 1) {
         auto op = [](ggml_tensor * dst, int ith, int nth, void *) {
             const ggml_tensor * src = dst->src[0];
             const ggml_tensor * bias = dst->src[1];
@@ -482,7 +727,30 @@ ggml_tensor * pw2_residual_ggml(ggml_context * ctx,
                                 ggml_tensor * x,
                                 ggml_tensor * b,
                                 ggml_tensor * gamma) {
-    if (residual->type == GGML_TYPE_F32 && x->type == GGML_TYPE_F32 &&
+    const bool use_cpu_custom = supertonic_use_cpu_custom_ops();
+    // Fused-op fast path (any backend that registers
+    // GGML_OP_SUPERTONIC_PW2_RESIDUAL — Metal does via the local ggml port
+    // overlay; CPU's ggml_compute_forward_supertonic_pw2_residual is the
+    // parity backstop).  Replaces the add(bias) + mul(gamma) + add(residual)
+    // chain with one dispatch.  Override with
+    // SUPERTONIC_DISABLE_FUSED_PW2_RESIDUAL=1 to force the stock-op chain.
+    // Skipped on CPU custom-op backends (cblas fast path below is faster).
+    static const bool disable_fused_pw2_residual =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_PW2_RESIDUAL") != nullptr;
+    if (!use_cpu_custom && !disable_fused_pw2_residual &&
+        residual->type == GGML_TYPE_F32 && x->type == GGML_TYPE_F32 &&
+        b->type == GGML_TYPE_F32 && gamma->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        residual->ne[0] == x->ne[0] && residual->ne[1] == x->ne[1] &&
+        b->ne[0] == x->ne[1] && gamma->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(residual) && ggml_is_contiguous(x) &&
+        ggml_is_contiguous(b) && ggml_is_contiguous(gamma)) {
+        return ggml_supertonic_pw2_residual(ctx, residual, x, b, gamma);
+    }
+    // CPU-only fused (bias + gamma + residual); falls back to the
+    // 3-step add/mul/add chain on GPU.
+    if (use_cpu_custom &&
+        residual->type == GGML_TYPE_F32 && x->type == GGML_TYPE_F32 &&
         b->type == GGML_TYPE_F32 && gamma->type == GGML_TYPE_F32 &&
         x->ne[2] == 1 && x->ne[3] == 1) {
         auto op = [](ggml_tensor * dst, int ith, int nth, void *) {
@@ -540,6 +808,109 @@ ggml_tensor * vector_convnext_ggml(ggml_context * ctx,
         require_source_tensor(model, p + ".gamma"));
 }
 
+// Phase B2 full: [C, T]-layout pointwise (K=1) Conv1d as a direct matmul.
+//
+// pwconv1/pwconv2 weights load as Conv1d kernels with ne=[K=1, IC, OC, 1].
+// With activations already in [C, T] layout (IC inner-most), the K=1
+// dimension is degenerate and the convolution is just:
+//
+//   y[OC, T] = sum_IC w[IC, OC] * x[IC, T]
+//
+// which is exactly `ggml_mul_mat(w_2d=[IC, OC], x_2d=[IC, T])` — no
+// im2col, no transpose, no pretranspose-cache lookup needed.  Result is
+// f32 contiguous and directly consumable by the next [C, T] op.
+//
+// CPU is intentionally NOT routed here: AMX cblas_sgemm in the legacy
+// path is faster than the equivalent ggml_mul_mat dispatch on Apple
+// CPUs.  Caller's `vector_convnext_ggml_ct` already roundtrips on CPU.
+ggml_tensor * pointwise_matmul_ct(ggml_context * ctx,
+                                  ggml_tensor * x_ct,   // [IC, T, 1, 1]
+                                  ggml_tensor * w,      // [1, IC, OC, 1]  (Conv1d K=1)
+                                  ggml_tensor * b) {
+    GGML_ASSERT(w->ne[0] == 1);            // K=1
+    GGML_ASSERT(w->ne[1] == x_ct->ne[0]);  // IC match
+    GGML_ASSERT(ggml_is_contiguous(w));
+    ggml_tensor * w_2d = ggml_reshape_2d(ctx, w, w->ne[1], w->ne[2]);
+    ggml_tensor * x_2d = ggml_reshape_2d(ctx, x_ct, x_ct->ne[0], x_ct->ne[1]);
+    ggml_tensor * y = ggml_mul_mat(ctx, w_2d, x_2d);  // [OC, T]
+    if (b) y = ggml_add(ctx, y, repeat_like(ctx, b, y));
+    return y;
+}
+
+// Phase B2 full: ConvNeXt block operating on `[C, T]` activations end-to-end.
+// All five fused custom Metal kernels have layout-flag plumbing landed in
+// port-version 13; this block strings their `_ct` variants together so the
+// activation tensor never needs to flip layout mid-block.  Used by callers
+// that fuse a chain of N convnext blocks with a single entry permute
+// `[T, C] -> [C, T]` before the loop and a single exit permute after — net
+// savings = (N - 1) intra-block transposes per chain × 5 CFM steps.
+//
+// Input  x:   [C, T, 1, 1]  f32 contiguous
+// Output    : [C, T, 1, 1]  f32 contiguous
+//
+// CPU backends fall through to the legacy `[T, C]` path: the `_ct` ops have
+// CPU forward implementations but they would force AMX-cblas off, so on
+// CPU we permute in/out around the legacy block to keep AMX engaged.
+ggml_tensor * vector_convnext_ggml_ct(ggml_context * ctx,
+                                      const supertonic_model & model,
+                                      const std::string & p,
+                                      ggml_tensor * x_ct,
+                                      int dilation) {
+    if (model_prefers_cpu_kernels(model)) {
+        // CPU: roundtrip to [T, C], run legacy block (AMX cblas fast path),
+        // roundtrip back.  Cheap on CPU because the permute is just a copy.
+        ggml_tensor * x_tc = ggml_cont(ctx, ggml_permute(ctx, x_ct, 1, 0, 2, 3));
+        ggml_tensor * y_tc = vector_convnext_ggml(ctx, model, p, x_tc, dilation);
+        return ggml_cont(ctx, ggml_permute(ctx, y_tc, 1, 0, 2, 3));
+    }
+
+    // Helper: flatten leading-1 dims so per-channel tensors come out as [C].
+    // Supertonic GGUFs ship bias/gamma/norm parameters as [C, 1, 1, 1] or
+    // [1, C, 1, 1] depending on which PyTorch broadcast view they were
+    // exported from.  The `_ct` ctors all assert `param->ne[0] == C_dim`, so
+    // unflattened tensors break them.  This is the same shape mismatch that
+    // has been silently disabling the legacy `pw2_residual_ggml` fused path
+    // for ConvNeXt blocks all along.
+    auto flatten_1d = [&](ggml_tensor * t) -> ggml_tensor * {
+        const int64_t n = ggml_nelements(t);
+        // Skip reshape only when already a literal 1-d view with ne[0] == n.
+        // The check reads ne[] directly; `ggml_n_dims` only strips trailing-1
+        // dims and reports 2 for a [1, C, 1, 1] tensor, where ne[0] = 1.
+        if (t->ne[0] == n && t->ne[1] == 1 && t->ne[2] == 1 && t->ne[3] == 1) {
+            return t;
+        }
+        return ggml_reshape_1d(ctx, t, n);
+    };
+
+    ggml_tensor * residual = x_ct;
+    // depthwise_1d_ct: [C, T] -> [C, T]
+    ggml_tensor * y = ggml_supertonic_depthwise_1d_ct(ctx, x_ct,
+        require_source_tensor(model, p + ".dwconv.weight"),
+        flatten_1d(require_source_tensor(model, p + ".dwconv.bias")),
+        dilation);
+    // layer_norm_channel_ct: [C, T] -> [C, T]
+    y = ggml_supertonic_layer_norm_channel_ct(ctx, y,
+        flatten_1d(require_source_tensor(model, p + ".norm.norm.weight")),
+        flatten_1d(require_source_tensor(model, p + ".norm.norm.bias")),
+        1e-6f);
+    // pw1 matmul: [IC=C, T] -> [OC, T]
+    y = pointwise_matmul_ct(ctx, y,
+        require_source_tensor(model, p + ".pwconv1.weight"),
+        nullptr);
+    // bias_gelu_ct: [OC, T] -> [OC, T]
+    y = ggml_supertonic_bias_gelu_ct(ctx, y,
+        flatten_1d(require_source_tensor(model, p + ".pwconv1.bias")));
+    // pw2 matmul: [IC=OC, T] -> [C, T]   (restores channel count)
+    y = pointwise_matmul_ct(ctx, y,
+        require_source_tensor(model, p + ".pwconv2.weight"),
+        nullptr);
+    // pw2_residual_ct: (x[C, T] + bias[C]) * gamma[C] + residual[C, T] -> [C, T]
+    return ggml_supertonic_pw2_residual_ct(ctx, y,
+        flatten_1d(require_source_tensor(model, p + ".pwconv2.bias")),
+        flatten_1d(require_source_tensor(model, p + ".gamma")),
+        residual);
+}
+
 std::vector<float> tensor_to_time_channel(ggml_tensor * t) {
     const int L = (int) t->ne[0];
     const int C = (int) t->ne[1];
@@ -614,6 +985,16 @@ struct vector_text_attention_cache {
     int kv_len = 0;
     int n_heads = 0;
     int head_dim = 0;
+    // QVAC-18605 round 4 — generalised cache key for the K/V
+    // flash-attention dispatch dtype.  Replaces the round-1
+    // boolean `f16_kv_attn` (kept the field name for grep
+    // continuity in PROGRESS_SUPERTONIC.md / git history; the
+    // semantics are now an enum carrying f32/f16/bf16/q8_0).
+    // Rebuilding the graph when this flips matches the same
+    // correctness contract as the (q_len, kv_len, n_heads,
+    // head_dim) cache keys above.  See dispatch logic in
+    // `build_text_attention_cache()`.
+    kv_attn_dtype kv_attn_type = kv_attn_dtype::f32;
     std::string out_w_source;
     std::string out_b_source;
     std::vector<uint8_t> buf;
@@ -656,6 +1037,7 @@ void build_text_attention_cache(vector_text_attention_cache & cache,
     cache.kv_len = kv_len;
     cache.n_heads = n_heads;
     cache.head_dim = head_dim;
+    cache.kv_attn_type = supertonic_kv_attn_type();
     cache.out_w_source = out_w_source;
     cache.out_b_source = out_b_source;
 
@@ -683,14 +1065,61 @@ void build_text_attention_cache(vector_text_attention_cache & cache,
     ggml_tensor * v_in = ggml_view_3d(cache.ctx, cache.v_tc_in,
         head_dim, kv_len, n_heads, time_stride, head_stride, 0);
 
+    // QVAC-18605 round 4 — multi-dtype K/V flash-attention
+    // dispatch.  Generalises the round-1 F16-only path:
+    //
+    //   f32  → no cast (backend's F32 flash-attn kernel)
+    //   f16  → cast K / V to F16 (OpenCL `flash_attn_f32_f16`,
+    //          Vulkan `kernel_flash_attn_f32_f16_*`; chatterbox
+    //          --cfm-f16-kv-attn equivalent)
+    //   bf16 → cast K / V to BF16 (Vulkan coopmat2 — wider
+    //          exponent range than F16 at identical bandwidth)
+    //   q8_0 → cast K / V to Q8_0 (Vulkan + half the K/V upload
+    //          bandwidth; Q8_0's 32-element block size divides our
+    //          `head_dim = 64` exactly, so block alignment is trivially
+    //          satisfied)
+    //
+    // Q stays F32 in every case: cheaper to keep one operand at
+    // the higher precision than to round-trip the post-attention
+    // output back through F32 for the downstream dense projection.
+    //
+    // The decision lives in `model.kv_attn_type` (mirrored onto
+    // the thread-local by `supertonic_op_dispatch_scope` and
+    // captured into `cache.kv_attn_type` above as the cache key).
+    // Probe-gated graceful fallback to f32 happens upstream in
+    // `resolve_kv_attn_type` — by the time we reach this site the
+    // chosen dtype is guaranteed to be one the backend accepts
+    // for our (head_dim, n_heads) shape.
+    ggml_type cast_target = GGML_TYPE_COUNT;  // sentinel "no cast"
+    switch (cache.kv_attn_type) {
+        case kv_attn_dtype::f32:                                   break;
+        case kv_attn_dtype::f16:  cast_target = GGML_TYPE_F16;     break;
+        case kv_attn_dtype::bf16: cast_target = GGML_TYPE_BF16;    break;
+        case kv_attn_dtype::q8_0: cast_target = GGML_TYPE_Q8_0;    break;
+        case kv_attn_dtype::autoselect:
+            // Resolver never returns autoselect; defensive throw
+            // so a future refactor that bypasses the resolver
+            // can't silently take the F32 path.
+            throw std::runtime_error(
+                "vector_text_attention_cache: kv_attn_type=autoselect "
+                "leaked into dispatch (resolver should have produced "
+                "a concrete dtype)");
+    }
+    if (cast_target != GGML_TYPE_COUNT) {
+        ggml_tensor * k_typed = ggml_new_tensor_3d(cache.ctx, cast_target, head_dim, kv_len, n_heads);
+        ggml_tensor * v_typed = ggml_new_tensor_3d(cache.ctx, cast_target, head_dim, kv_len, n_heads);
+        k_in = ggml_cpy(cache.ctx, k_in, k_typed);
+        v_in = ggml_cpy(cache.ctx, v_in, v_typed);
+    }
+
     ggml_tensor * attn = ggml_flash_attn_ext(cache.ctx, q_in, k_in, v_in,
                                              nullptr, 1.0f/16.0f, 0.0f, 0.0f);
-    attn = ggml_reshape_2d(cache.ctx, attn, n_heads * head_dim, q_len);
+    attn = ggml_reshape_2d(cache.ctx, attn, static_cast<int64_t>(n_heads) * head_dim, q_len);
     ggml_tensor * ctx_tc = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, attn));
     ggml_set_name(ctx_tc, "vector_attn_ctx"); ggml_set_output(ctx_tc);
     ggml_build_forward_expand(cache.gf, ctx_tc);
 
-    ggml_tensor * out = dense_matmul_time_ggml(cache.ctx, ctx_tc,
+    ggml_tensor * out = dense_matmul_time_pretransposed_ggml(cache.ctx, model, ctx_tc,
         require_source_tensor(model, out_w_source),
         require_source_tensor(model, out_b_source));
     ggml_set_name(out, "vector_attn_out"); ggml_set_output(out);
@@ -715,9 +1144,20 @@ std::vector<float> run_text_attention_cache(vector_text_attention_cache & cache,
                                             int current_step,
                                             const char * island,
                                             std::vector<float> * ctx_trace) {
-    // Reuse the shape-keyed graph on the direct backend path; rebuild + route
-    // through the scheduler only when an op must run on CPU. Mirrors run_hift_decode.
-    build_text_attention_cache(cache, model, q_len, kv_len, n_heads, head_dim, out_w_source, out_b_source);
+    // QVAC-18605 round 4 — cache-key check includes kv_attn_type so a
+    // mid-run --kv-attn-type override rebuilds the graph with the new
+    // dtype.  Rebuild only on key mismatch; preserve the shape-cached
+    // graph on every other call.
+    if (cache.model != &model || cache.generation_id != model.generation_id ||
+        cache.q_len != q_len || cache.kv_len != kv_len ||
+        cache.n_heads != n_heads || cache.head_dim != head_dim ||
+        cache.kv_attn_type != supertonic_kv_attn_type() ||
+        cache.out_w_source != out_w_source || cache.out_b_source != out_b_source) {
+        build_text_attention_cache(cache, model, q_len, kv_len, n_heads, head_dim, out_w_source, out_b_source);
+    }
+    // QVAC-19254 — direct backend path when every node is supported by
+    // the primary backend; route through the scheduler when an op must
+    // run on CPU (GGML_OP_CUSTOM etc.).
     bool direct = true;
     const int n_nodes = ggml_graph_n_nodes(cache.gf);
     for (int i = 0; i < n_nodes; ++i) {
@@ -738,8 +1178,96 @@ std::vector<float> run_text_attention_cache(vector_text_attention_cache & cache,
     ggml_backend_tensor_set(cache.q_tc_in, q_tc.data(), 0, q_tc.size()*sizeof(float));
     ggml_backend_tensor_set(cache.k_tc_in, k_tc.data(), 0, k_tc.size()*sizeof(float));
     ggml_backend_tensor_set(cache.v_tc_in, v_tc.data(), 0, v_tc.size()*sizeof(float));
-    if (direct) supertonic_graph_compute(model, cache.gf);
-    else        profile_vector_compute(model, cache.gf, current_step, island);
+    if (direct) profile_vector_compute(model, cache.gf, current_step, island);
+    else        profile_vector_compute(model, cache.gf, current_step, island, /*use_sched=*/true);
+    if (ctx_trace) *ctx_trace = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "vector_attn_ctx"));
+    return tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "vector_attn_out"));
+}
+
+// Audit follow-up #6 (2C-lite) — GPU-input fast path for
+// `run_text_attention_cache`.  Equivalent to the host-vector
+// overload above but replaces the three `ggml_backend_tensor_set`
+// uploads with `ggml_backend_tensor_copy` (same-backend device→
+// device blit) so Q / K / V never round-trip through the host
+// between the producing graph (front-block / group-graph / res-
+// style QKV cache) and this attention cache.
+//
+// Eliminates per call: 3 GPU→host downloads + 3 host→GPU uploads.
+// Across 4 attention sites × 5 denoise steps × Q/K/V × 2 transfers =
+// 120 sync points / synth on the production path (independent of
+// trace-mode downloads, which still happen for parity harnesses
+// when `include_ggml_trace` is set at the call site).
+//
+// `q_src` / `k_src` / `v_src` MUST point into a graph that has
+// already been computed on the same `model.backend` and whose
+// allocator is still alive.  The current call pattern (one
+// `run_*_cache` per site, computed immediately before this
+// attention call) satisfies both.
+//
+// Test contract: `test/test_supertonic_graph_to_graph_blit.cpp`
+// — two minimal cached graphs sharing one backend, parity vs the
+// download / upload pair across all five vector-estimator attn
+// shapes (front+g1/g2/g3 Q at L=20, style K at kv=50, L=1 trip-
+// wire).
+std::vector<float> run_text_attention_cache_gpu(vector_text_attention_cache & cache,
+                                                const supertonic_model & model,
+                                                ggml_tensor * q_src,
+                                                ggml_tensor * k_src,
+                                                ggml_tensor * v_src,
+                                                int q_len,
+                                                int kv_len,
+                                                int n_heads,
+                                                int head_dim,
+                                                const std::string & out_w_source,
+                                                const std::string & out_b_source,
+                                                int current_step,
+                                                const char * island,
+                                                std::vector<float> * ctx_trace) {
+    if (cache.model != &model || cache.generation_id != model.generation_id ||
+        cache.q_len != q_len || cache.kv_len != kv_len ||
+        cache.n_heads != n_heads || cache.head_dim != head_dim ||
+        cache.kv_attn_type != supertonic_kv_attn_type() ||
+        cache.out_w_source != out_w_source || cache.out_b_source != out_b_source) {
+        build_text_attention_cache(cache, model, q_len, kv_len, n_heads, head_dim, out_w_source, out_b_source);
+    }
+    // QVAC-19254 — direct vs scheduler routing.  build_text_attention_cache
+    // no longer creates a gallocr; the run paths (this GPU-bridge variant
+    // + the host-vector overload above) must do it themselves, otherwise
+    // `cache.q_tc_in` / `k_tc_in` / `v_tc_in` have null backend buffers and
+    // the subsequent `ggml_backend_tensor_copy` aborts with
+    // "tensor buffer not set".  Mirrors the direct/sched dispatch in
+    // `run_text_attention_cache` above.
+    bool direct = true;
+    {
+        const int n_nodes = ggml_graph_n_nodes(cache.gf);
+        for (int i = 0; i < n_nodes; ++i) {
+            if (!ggml_backend_supports_op(model.backend, ggml_graph_node(cache.gf, i))) { direct = false; break; }
+        }
+    }
+    if (direct) {
+        if (!cache.allocr) {
+            cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+            if (!cache.allocr) throw std::runtime_error("ggml_gallocr_new supertonic text attention (gpu bridge) failed");
+            if (!ggml_gallocr_reserve(cache.allocr, cache.gf)) {
+                throw std::runtime_error("ggml_gallocr_reserve supertonic text attention (gpu bridge) failed");
+            }
+        }
+        ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
+    } else {
+        supertonic_sched_alloc(model, cache.gf);
+    }
+    // Same-backend device→device blits.  ggml_backend_tensor_copy
+    // asserts `ggml_are_same_layout(src, dst)` internally and
+    // dispatches the destination buffer's `cpy_tensor` path (CPU →
+    // memcpy, OpenCL → clEnqueueCopyBuffer, etc.).  No host
+    // synchronisation between the three copies; the next graph
+    // compute is ordered after them by the same backend
+    // queue.
+    ggml_backend_tensor_copy(q_src, cache.q_tc_in);
+    ggml_backend_tensor_copy(k_src, cache.k_tc_in);
+    ggml_backend_tensor_copy(v_src, cache.v_tc_in);
+    if (direct) profile_vector_compute(model, cache.gf, current_step, island);
+    else        profile_vector_compute(model, cache.gf, current_step, island, /*use_sched=*/true);
     if (ctx_trace) *ctx_trace = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "vector_attn_ctx"));
     return tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "vector_attn_out"));
 }
@@ -752,9 +1280,30 @@ void push_trace(std::vector<supertonic_trace_tensor> & trace,
 
 struct vector_group_graph_result {
     std::vector<float> post;
-    std::vector<float> q;
-    std::vector<float> k;
+    std::vector<float> q;        // pre-RoPE Q (kept for scalar-parity trace)
+    std::vector<float> k;        // pre-RoPE K
     std::vector<float> v;
+    // F23 — when the cache has `apply_rope = true` these hold the
+    // post-RoPE Q/K downloaded from the in-graph rotation outputs
+    // (`<q_name>_rope` / `<k_name>_rope`).  Call sites pass these
+    // directly to `run_text_attention_cache` instead of calling
+    // host-side `apply_rope(theta, …)` on q/k.  Empty when the
+    // legacy fallback path is taken (model lacks `vector_rope_theta`).
+    std::vector<float> q_rope;
+    std::vector<float> k_rope;
+
+    // Audit follow-up #6 (2C-lite) — GPU-side handles for the
+    // post-RoPE Q/K and raw V tensors.  Pointers are valid as
+    // long as the producing `vector_group_graph_cache` (or
+    // `front_block_proj_cache` for the attn0 site) is still
+    // alive and hasn't been rebuilt.  Call sites feed these
+    // directly into `run_text_attention_cache_gpu` to skip the
+    // download / upload pair.  Null when no graph executed (legacy
+    // path with `apply_rope = false` falls back to the host-vector
+    // members above).
+    ggml_tensor * q_rope_gpu = nullptr;
+    ggml_tensor * k_rope_gpu = nullptr;
+    ggml_tensor * v_gpu      = nullptr;
 };
 
 struct vector_group_graph_cache {
@@ -779,14 +1328,69 @@ struct vector_group_graph_cache {
     ggml_context * ctx = nullptr;
     ggml_cgraph * gf = nullptr;
     ggml_gallocr_t allocr = nullptr;
+    // QVAC-18605 round 12 #5 — host-pinned input scratchpad.
+    // Holds ONLY `x_in` + `temb_in` (the two hot per-step inputs
+    // uploaded fresh every denoise step).  On Vulkan, allocated
+    // via `try_alloc_inputs_in_pinned_host_buffer` which returns
+    // a buffer from `ggml_backend_vk_host_buffer_type()` — every
+    // `ggml_backend_tensor_set(x_in, ...)` skips one staging-
+    // buffer hop on the way to BAR-mapped GPU memory.  On CPU
+    // / Metal / OpenCL (no host buffer type) the helper returns
+    // nullptr and we fall back to allocating the same tensors
+    // via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
+    // — same memory, just one staging hop per upload.
+    //
+    // `text_in` stays in the main `ctx` (gallocr handles it)
+    // because it's upload-skipped by the round-10 tracker on
+    // steps 1..N-1; the marginal staging-hop saving doesn't
+    // amortise across the cold-miss / fast-path mix.
+    std::vector<uint8_t> input_ctx_storage;
+    ggml_context * input_ctx = nullptr;
+    ggml_backend_buffer_t input_buf = nullptr;
     ggml_tensor * x_in = nullptr;
     ggml_tensor * temb_in = nullptr;
     ggml_tensor * text_in = nullptr;
+
+    // Audit follow-up #5 / F23 — in-graph RoPE inputs.  Populated
+    // at cache-build time and uploaded once (cos/sin only depend on
+    // L / text_len / θ, all stable across the cache's lifetime).
+    // When `apply_rope == false` (no `vector_rope_theta` available,
+    // e.g. a malformed GGUF) the graph falls back to the historical
+    // path: Q/K stay raw, host code still calls apply_rope.  See
+    // `aiDocs/AUDIT_SUPERTONIC_OPENCL.md` F23.
+    bool apply_rope = false;
+    ggml_tensor * q_cos_in = nullptr;
+    ggml_tensor * q_sin_in = nullptr;
+    ggml_tensor * k_cos_in = nullptr;
+    ggml_tensor * k_sin_in = nullptr;
+    std::string q_rope_name; // == q_name + "_rope"
+    std::string k_rope_name; // == k_name + "_rope"
+
+    // QVAC-18605 round 10 — pointer-compare upload-skip tracker
+    // for `text_in`.  `text_lc_host` is the same `text_emb`
+    // pointer the front-block cache sees: stable within one
+    // synth (5 calls × same pointer), potentially reused-at-same-
+    // address across synths.  Caller resets at `current_step ==
+    // 0` to invalidate the cache.  See upload_skip_tracker
+    // contract in supertonic_internal.h.  Cache rebuild zeroes
+    // this via `cache = {}` (effective reset).
+    upload_skip_tracker text_in_skip;
 };
 
 void free_group_graph_cache(vector_group_graph_cache & cache) {
     supertonic_safe_gallocr_free(cache.allocr, cache.generation_id);
+    // QVAC-18605 round 12 #5 — tear down the host-pinned input
+    // scratchpad.  Order matters: free the gallocr first (it
+    // owns buffers for the main-ctx tensors), then the main
+    // ctx (which holds the graph metadata referencing x_in /
+    // temb_in pointers from `input_ctx`), then the input
+    // buffer (drops the host-pinned pages), then the input
+    // ctx (drops the tensor metadata).  Freeing input_ctx
+    // BEFORE the gallocr would leave the gallocr with
+    // dangling pointers to tensors that no longer exist.
     if (cache.ctx) ggml_free(cache.ctx);
+    if (cache.input_buf) ggml_backend_buffer_free(cache.input_buf);
+    if (cache.input_ctx) ggml_free(cache.input_ctx);
     cache = {};
 }
 
@@ -850,14 +1454,63 @@ void build_group_graph_cache(vector_group_graph_cache & cache,
     cache.ctx = ggml_init(p);
     cache.gf = ggml_new_graph_custom(cache.ctx, NODES, false);
 
-    cache.x_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, C);
-    ggml_set_name(cache.x_in, "vector_group_in"); ggml_set_input(cache.x_in);
-    cache.temb_in = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, 64);
-    ggml_set_name(cache.temb_in, "vector_group_temb"); ggml_set_input(cache.temb_in);
+    // F12: ingest the group graph's primary activation in
+    // CPU-native `[C, L]` (channel-fast) layout so callers can
+    // upload `x_tc` byte-for-byte without the per-call host
+    // `pack_time_channel_for_ggml` loop.  The graph's first op
+    // is a `ggml_cont(ggml_transpose(...))` that materialises
+    // the `[L, C]` layout downstream `vector_convnext_ggml` /
+    // `dense_matmul_time_ggml` builders already consume.  See
+    // `supertonic_internal.h::transpose_time_channel_ggml` for
+    // the bit-exact equivalence proof against the host pack.
+    //
+    // QVAC-18605 round 12 #5 — `x_in` + `temb_in` live in a
+    // SEPARATE ggml_context (`cache.input_ctx`) so they can be
+    // allocated from `ggml_backend_vk_host_buffer_type()` on
+    // Vulkan and skip the staging-buffer hop on every per-step
+    // `ggml_backend_tensor_set`.  Graph tensors in `cache.ctx`
+    // reference these by pointer (the graph holds `ggml_tensor *`
+    // links regardless of which context allocated them);
+    // gallocr's `ggml_gallocr_reserve` + `ggml_gallocr_alloc_graph`
+    // skip tensors that already have a `tensor->buffer` set, so
+    // pre-binding them in the host buffer doesn't interfere with
+    // gallocr's allocation pass for the intermediates + outputs.
+    //
+    // `text_in` STAYS in `cache.ctx` because the round-10
+    // upload-skip tracker means steps 1..N-1 don't upload at
+    // all; the marginal staging-hop saving for the single cold-
+    // miss step doesn't amortise.
+    {
+        // 8 tensor slots is well over what's needed (2 inputs);
+        // padded so future round-12 follow-ups can add more
+        // host-pinned inputs without re-tuning the size.
+        const size_t INPUT_OVERHEAD = ggml_tensor_overhead() * 8;
+        cache.input_ctx_storage.assign(INPUT_OVERHEAD, 0);
+        ggml_init_params input_p = { INPUT_OVERHEAD, cache.input_ctx_storage.data(), /*no_alloc=*/true };
+        cache.input_ctx = ggml_init(input_p);
+        cache.x_in = ggml_new_tensor_2d(cache.input_ctx, GGML_TYPE_F32, C, L);
+        ggml_set_name(cache.x_in, "vector_group_in_tc"); ggml_set_input(cache.x_in);
+        cache.temb_in = ggml_new_tensor_1d(cache.input_ctx, GGML_TYPE_F32, 64);
+        ggml_set_name(cache.temb_in, "vector_group_temb"); ggml_set_input(cache.temb_in);
+        // QVAC-18605 round 13 #1 — consolidated allocator
+        // (round-12 inlined the try-pinned-host + fallback
+        // boilerplate at 4 sites; this round factors it out).
+        cache.input_buf = alloc_input_scratchpad_or_throw(
+            model, cache.input_ctx, "vector_group_graph_cache");
+    }
     cache.text_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, text_len, 256);
-    ggml_set_name(cache.text_in, "vector_group_text"); ggml_set_input(cache.text_in);
-
-    ggml_tensor * cur = cache.x_in;
+    ggml_set_name(cache.text_in, "vector_group_text");
+    // Same round-10 upload-skip pattern as the front-cache: `text_in`
+    // is uploaded once per synth (`current_step == 0` resets, every
+    // other step skips).  Mark INPUT + OUTPUT so the buffer survives
+    // gallocr's free pass — without OUTPUT, step 0's compute frees
+    // the buffer for intermediate reuse, and the step-1..N skipped
+    // upload reads stale data.  See the matching note on
+    // `front_cache.text_in_t` in `supertonic_vector_trace_proj_ggml`.
+    ggml_set_input(cache.text_in);  ggml_set_output(cache.text_in);
+
+    ggml_tensor * cur = transpose_time_channel_ggml(cache.ctx, cache.x_in);
+    ggml_set_name(cur, "vector_group_in");
     int dils[4] = {1, 2, 4, 8};
     for (int j = 0; j < 4; ++j) {
         cur = vector_convnext_ggml(cache.ctx, model,
@@ -869,9 +1522,26 @@ void build_group_graph_cache(vector_group_graph_cache & cache,
             ggml_build_forward_expand(cache.gf, cur);
         }
     }
-    ggml_tensor * t_proj = ggml_mul_mat(cache.ctx,
-        ggml_cont(cache.ctx, ggml_transpose(cache.ctx, require_source_tensor(model, matmul_source))),
-        ggml_reshape_2d(cache.ctx, cache.temb_in, 64, 1));
+    // F6: pre-transposed companion lives in model.ctx_w under
+    // `<matmul_source>__T` (populated at load).  Falls back to the
+    // per-pointer `pretransposed_weights` map (Metal's broader Q/K/V
+    // pretranspose roster), and finally to an in-graph
+    // `ggml_cont(ggml_transpose(W))` rewrite if neither covers this
+    // weight.
+    ggml_tensor * t_proj;
+    {
+        auto pretrans_it = model.source_tensors.find(matmul_source + "__T");
+        ggml_tensor * w_t = (pretrans_it != model.source_tensors.end()) ? pretrans_it->second : nullptr;
+        if (!w_t) {
+            ggml_tensor * t_proj_w_orig = require_source_tensor(model, matmul_source);
+            w_t = try_pretransposed_weight(model, t_proj_w_orig);
+            if (!w_t) {
+                w_t = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, t_proj_w_orig));
+            }
+        }
+        t_proj = ggml_mul_mat(cache.ctx, w_t,
+            ggml_reshape_2d(cache.ctx, cache.temb_in, 64, 1));
+    }
     t_proj = ggml_add(cache.ctx, t_proj,
         ggml_reshape_2d(cache.ctx,
             require_source_tensor(model, vector_main_block(linear_block) + ".linear.linear.bias"),
@@ -891,21 +1561,126 @@ void build_group_graph_cache(vector_group_graph_cache & cache,
     ggml_build_forward_expand(cache.gf, cur);
 
     const std::string attn_prefix = vector_main_block(post_block + 1) + ".attn.";
-    ggml_tensor * q = dense_matmul_time_ggml(cache.ctx, cur,
+    ggml_tensor * q = dense_matmul_time_pretransposed_ggml(cache.ctx, model, cur,
         require_source_tensor(model, q_matmul_source),
         require_source_tensor(model, attn_prefix + "W_query.linear.bias"));
-    ggml_tensor * k = dense_matmul_time_ggml(cache.ctx, cache.text_in,
+    ggml_tensor * k = dense_matmul_time_pretransposed_ggml(cache.ctx, model, cache.text_in,
         require_source_tensor(model, k_matmul_source),
         require_source_tensor(model, attn_prefix + "W_key.linear.bias"));
-    ggml_tensor * v = dense_matmul_time_ggml(cache.ctx, cache.text_in,
+    // QVAC-18966 — pack V into the layout the downstream
+    // `run_text_attention_cache_gpu` consumes via
+    // `ggml_backend_tensor_copy(v_src, v_tc_in)`.  `v_tc_in` is
+    // `ggml_new_tensor_2d(F32, A=HD, kv_len)` → ne=[HD, kv_len]
+    // with natural strides nb=[elem, HD*elem] (time-major-flat
+    // memory `data[c + t*HD]`).  `dense_matmul_time_(pretransposed_)ggml`
+    // produces ne=[L_kv, HD] with channel-major-flat memory
+    // (`data[t + c*L_kv]`) — the byte-for-byte transpose of what
+    // the bridge expects.  `ggml_cont(ggml_transpose(...))` flips
+    // the strides + materialises a contiguous fresh tensor with
+    // the right layout.  Mirrors the head-of-pipeline transpose
+    // inside `apply_rope_to_packed_qk` so Q-rope / K-rope / V all
+    // land in `q_tc_in` / `k_tc_in` / `v_tc_in` bit-exactly.  See
+    // the header doc on `apply_rope_to_packed_qk` in
+    // `supertonic_internal.h` for the full layout reasoning.
+    //
+    // Note (Vulkan branch): master's
+    // `dense_matmul_time_pretransposed_ggml` upgrade only pre-
+    // transposes WEIGHTS, not the activation layout, so the
+    // output ne=[T, OC] channel-major-flat stays identical to
+    // the legacy `dense_matmul_time_ggml`.  The same
+    // `ggml_cont(ggml_transpose(...))` head-of-V-pipeline fix
+    // therefore lands the right bytes for both variants.
+    //
+    // Legacy host bridge: `tensor_raw_f32(v_gpu)` downloads the
+    // post-transpose bytes (time-major-flat `out[t*HD + c]`) —
+    // bit-identical to what scalar `apply_rope`'s reference loop
+    // produces and what every legacy `push_trace`-consuming
+    // harness expects (callers updated in lock-step).
+    ggml_tensor * v_matmul = dense_matmul_time_pretransposed_ggml(cache.ctx, model, cache.text_in,
         require_source_tensor(model, v_matmul_source),
         require_source_tensor(model, attn_prefix + "W_value.linear.bias"));
+    ggml_tensor * v = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, v_matmul));
     ggml_set_name(q, q_name.c_str()); ggml_set_output(q); ggml_build_forward_expand(cache.gf, q);
     ggml_set_name(k, k_name.c_str()); ggml_set_output(k); ggml_build_forward_expand(cache.gf, k);
     ggml_set_name(v, v_name.c_str()); ggml_set_output(v); ggml_build_forward_expand(cache.gf, v);
 
-    // Allocation is per-call via the model scheduler (supertonic_sched_alloc
-    // in run), which routes GGML_OP_CUSTOM ops to CPU. No per-cache gallocr.
+    // F23 — bake the RoPE rotation into the same graph that
+    // produces Q/K, so the host path drops the per-step CPU
+    // `apply_rope(theta, q_out, …)` round-trips entirely.  Q's
+    // sequence length is `L` (latent_len) and K's is `text_len`;
+    // each gets its own cos/sin table input (`ne=[half, L]` /
+    // `ne=[half, text_len]`) populated once at build time.  The
+    // post-rotation tensors are exposed under
+    // `<q_name>_rope` / `<k_name>_rope` so trace harnesses can
+    // download both the pre- and post-RoPE values for parity
+    // checks against the scalar path.  Falls back to no-op when
+    // the GGUF didn't ship a `vector_rope_theta` (cache.apply_rope
+    // stays false; call sites then keep the legacy host
+    // apply_rope call).
+    const int H = 4;
+    const int D = 64;
+    const int half = D / 2;
+    cache.apply_rope = (int) model.vector_rope_theta.size() == half;
+    if (cache.apply_rope) {
+        // RoPE cos/sin tables are constants for the cache's (L, text_len,
+        // θ) key — uploaded once at build time and never per-call.  Mark
+        // as both INPUT and OUTPUT so gallocr doesn't free the buffer
+        // after the first compute pass (without OUTPUT, the leaf-input
+        // buffer is released for intermediate reuse on the next compute,
+        // silently corrupting the cos/sin data on the second call).
+        cache.q_cos_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, half, L);
+        ggml_set_name(cache.q_cos_in,
+            ("vector_group_q_rope_cos_g" + std::to_string(group)).c_str());
+        ggml_set_input(cache.q_cos_in);  ggml_set_output(cache.q_cos_in);
+        cache.q_sin_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, half, L);
+        ggml_set_name(cache.q_sin_in,
+            ("vector_group_q_rope_sin_g" + std::to_string(group)).c_str());
+        ggml_set_input(cache.q_sin_in);  ggml_set_output(cache.q_sin_in);
+        cache.k_cos_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, half, text_len);
+        ggml_set_name(cache.k_cos_in,
+            ("vector_group_k_rope_cos_g" + std::to_string(group)).c_str());
+        ggml_set_input(cache.k_cos_in);  ggml_set_output(cache.k_cos_in);
+        cache.k_sin_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, half, text_len);
+        ggml_set_name(cache.k_sin_in,
+            ("vector_group_k_rope_sin_g" + std::to_string(group)).c_str());
+        ggml_set_input(cache.k_sin_in);  ggml_set_output(cache.k_sin_in);
+
+        ggml_tensor * q_rope = apply_rope_to_packed_qk(cache.ctx, q,
+            cache.q_cos_in, cache.q_sin_in, H, D);
+        ggml_tensor * k_rope = apply_rope_to_packed_qk(cache.ctx, k,
+            cache.k_cos_in, cache.k_sin_in, H, D);
+        cache.q_rope_name = q_name + "_rope";
+        cache.k_rope_name = k_name + "_rope";
+        ggml_set_name(q_rope, cache.q_rope_name.c_str());
+        ggml_set_output(q_rope);
+        ggml_build_forward_expand(cache.gf, q_rope);
+        ggml_set_name(k_rope, cache.k_rope_name.c_str());
+        ggml_set_output(k_rope);
+        ggml_build_forward_expand(cache.gf, k_rope);
+    }
+
+    cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+    if (!cache.allocr) throw std::runtime_error("ggml_gallocr_new vector group cache failed");
+    if (!ggml_gallocr_reserve(cache.allocr, cache.gf)) {
+        throw std::runtime_error("ggml_gallocr_reserve vector group cache failed");
+    }
+    ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
+
+    // Upload the cos/sin tables — these inputs are stable for the
+    // entire cache lifetime (cos/sin depend only on L / text_len /
+    // θ, all encoded in the cache key + the model), so this is a
+    // one-shot population.
+    if (cache.apply_rope) {
+        std::vector<float> q_cos, q_sin, k_cos, k_sin;
+        make_rope_cos_sin_tables(model.vector_rope_theta.data(), L, half,
+                                 q_cos, q_sin);
+        make_rope_cos_sin_tables(model.vector_rope_theta.data(), text_len, half,
+                                 k_cos, k_sin);
+        ggml_backend_tensor_set(cache.q_cos_in, q_cos.data(), 0, q_cos.size() * sizeof(float));
+        ggml_backend_tensor_set(cache.q_sin_in, q_sin.data(), 0, q_sin.size() * sizeof(float));
+        ggml_backend_tensor_set(cache.k_cos_in, k_cos.data(), 0, k_cos.size() * sizeof(float));
+        ggml_backend_tensor_set(cache.k_sin_in, k_sin.data(), 0, k_sin.size() * sizeof(float));
+    }
 }
 
 vector_group_graph_result run_group_graph_cache(vector_group_graph_cache & cache,
@@ -930,13 +1705,36 @@ vector_group_graph_result run_group_graph_cache(vector_group_graph_cache & cache
                                                 const std::string & v_name,
                                                 const char * island,
                                                 std::vector<supertonic_trace_tensor> * trace) {
-    // Reuse the shape-keyed graph on the direct backend path; rebuild + route
-    // through the scheduler only when an op must run on CPU. Mirrors run_hift_decode.
-    build_group_graph_cache(cache, model, L, C, group, conv_block, linear_block, matmul_source, post_block,
-                            text_len, q_matmul_source, k_matmul_source, v_matmul_source,
-                            q_name, k_name, v_name,
-                            trace != nullptr);
-    std::vector<float> x_raw = pack_time_channel_for_ggml(x_tc, L, C);
+    // QVAC-18605 — cache-key check (skip rebuild when shape/sources/
+    // trace flag haven't changed).  Build is expensive on the hot
+    // denoise-step path; the steady-state synth pays one rebuild on
+    // the cold-miss step, zero on every subsequent step.
+    if (cache.model != &model || cache.generation_id != model.generation_id ||
+        cache.L != L || cache.C != C || cache.text_len != text_len ||
+        cache.group != group || cache.conv_block != conv_block ||
+        cache.linear_block != linear_block || cache.post_block != post_block ||
+        cache.trace_outputs != (trace != nullptr) ||
+        cache.matmul_source != matmul_source ||
+        cache.q_matmul_source != q_matmul_source || cache.k_matmul_source != k_matmul_source ||
+        cache.v_matmul_source != v_matmul_source) {
+        build_group_graph_cache(cache, model, L, C, group, conv_block, linear_block, matmul_source, post_block,
+                                text_len, q_matmul_source, k_matmul_source, v_matmul_source,
+                                q_name, k_name, v_name,
+                                trace != nullptr);
+    }
+    // QVAC-19254 — direct vs scheduler routing: when every node is
+    // supported by the primary backend, use the per-cache gallocr +
+    // direct compute; when an op must run on CPU (GGML_OP_CUSTOM),
+    // fall through to the model scheduler.
+    //
+    // HEAD's `build_group_graph_cache` already creates cache.allocr +
+    // calls `ggml_gallocr_alloc_graph` AND uploads the cache-lifetime
+    // RoPE cos/sin constants right after.  Re-calling alloc_graph
+    // here would clobber those uploaded constants (gallocr rebinds
+    // tensor offsets and the freshly-allocated buffer doesn't carry
+    // build-time data forward).  So on direct path: only allocate
+    // the gallocr lazily IF the build didn't (defensive — every
+    // current build path does), and never re-alloc.
     bool direct = true;
     const int n_nodes = ggml_graph_n_nodes(cache.gf);
     for (int i = 0; i < n_nodes; ++i) {
@@ -949,16 +1747,32 @@ vector_group_graph_result run_group_graph_cache(vector_group_graph_cache & cache
             if (!ggml_gallocr_reserve(cache.allocr, cache.gf)) {
                 throw std::runtime_error("ggml_gallocr_reserve supertonic group graph failed");
             }
+            ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
         }
-        ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
     } else {
         supertonic_sched_alloc(model, cache.gf);
     }
-    ggml_backend_tensor_set(cache.x_in, x_raw.data(), 0, x_raw.size()*sizeof(float));
+    // F12: cache.x_in is now ne=[C, L] (CPU-native time-major).
+    // Upload `x_tc` directly — the host pack loop is gone; the
+    // graph runs `ggml_cont(ggml_transpose(...))` to recover the
+    // [L, C] layout downstream ops expect.
+    ggml_backend_tensor_set(cache.x_in, x_tc.data(), 0, x_tc.size()*sizeof(float));
     ggml_backend_tensor_set(cache.temb_in, temb.data(), 0, temb.size()*sizeof(float));
-    ggml_backend_tensor_set(cache.text_in, text_lc_host, 0, (size_t) text_len * 256 * sizeof(float));
-    if (direct) supertonic_graph_compute(model, cache.gf);
-    else        profile_vector_compute(model, cache.gf, current_step, island);
+    // QVAC-18605 round 10 — text_lc_host upload-skip.  Same
+    // `text_emb` pointer that the front-block cache sees: stable
+    // within one synth (5 calls × same pointer), potentially
+    // reused-at-same-address across synths.  Synth-boundary reset
+    // on `current_step == 0` invalidates the cache so the next
+    // synth's first step always uploads.  Per-synth wins:
+    // 4 (skipped) × 3 (groups) × text_len × 256 × 4 bytes.  See
+    // upload_skip_tracker contract in supertonic_internal.h.
+    if (current_step == 0) cache.text_in_skip.reset();
+    if (cache.text_in_skip.needs_upload(text_lc_host)) {
+        ggml_backend_tensor_set(cache.text_in, text_lc_host, 0, (size_t) text_len * 256 * sizeof(float));
+        cache.text_in_skip.mark_uploaded(text_lc_host);
+    }
+    if (direct) profile_vector_compute(model, cache.gf, current_step, island);
+    else        profile_vector_compute(model, cache.gf, current_step, island, /*use_sched=*/true);
     if (trace) {
         for (int j = 0; j < 4; ++j) {
             const std::string name = "ve_group" + std::to_string(group) + "_convnext" + std::to_string(j);
@@ -971,9 +1785,76 @@ vector_group_graph_result run_group_graph_cache(vector_group_graph_cache & cache
         std::to_string(post_block) + "_convnext0";
     vector_group_graph_result out;
     out.post = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, post_name.c_str()));
-    out.q = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, q_name.c_str()));
-    out.k = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, k_name.c_str()));
-    out.v = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, v_name.c_str()));
+    // F23: on trace runs we still download the pre-RoPE Q/K so the
+    // scalar-parity harness can compare them against its own scalar
+    // `ve_g<n>_attn_q` reference.  Production runs don't push these
+    // through PUSH_GGML_TRACE so the download is the only cost.
+    // The post-RoPE Q/K (`q_rope` / `k_rope`) are what callers feed
+    // into `run_text_attention_cache`, eliminating the per-step
+    // host `apply_rope(theta, …)` round-trips entirely.
+    // 2C-lite — expose the GPU-side handles so the attention
+    // call site can `ggml_backend_tensor_copy` directly into its
+    // own cache.  Pointers are valid until the next rebuild of
+    // this cache (i.e., until L/C/text_len/group/... changes).
+    // The host downloads of q_rope/k_rope/v_gpu are now gated on
+    // `trace != nullptr` for the FAST path (apply_rope == true)
+    // because the production path no longer reads `out.q_rope` /
+    // `out.k_rope` / `out.v` — it consumes `*_gpu` instead via
+    // `run_text_attention_cache_gpu`.  The LEGACY path
+    // (apply_rope == false; e.g. malformed GGUF without
+    // vector_rope_theta) still needs q/k/v on the host because it
+    // calls scalar `apply_rope` and the host `run_text_attention_
+    // cache` overload.
+    if (cache.apply_rope) {
+        out.q_rope_gpu = ggml_graph_get_tensor(cache.gf, cache.q_rope_name.c_str());
+        out.k_rope_gpu = ggml_graph_get_tensor(cache.gf, cache.k_rope_name.c_str());
+    }
+    out.v_gpu = ggml_graph_get_tensor(cache.gf, v_name.c_str());
+
+    const bool need_host_qkv = (trace != nullptr) || !cache.apply_rope;
+    if (need_host_qkv) {
+        // Trace harnesses want pre-RoPE Q/K + V for the
+        // `push_trace` block below and the call-site
+        // `PUSH_GGML_TRACE({"ve_g*_attn_v", …})` push.  The legacy
+        // host-RoPE fallback consumes them directly.
+        //
+        // Q / K matmul outputs are UNCHANGED ne=[L, HD] / ne=[text_
+        // len, HD] channel-major-flat memory, so `tensor_to_time_
+        // channel` is the right call (decodes col=c, row=t at
+        // `c*L + t` into out[t*HD + c]).
+        out.q = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, q_name.c_str()));
+        out.k = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, k_name.c_str()));
+        // QVAC-18966 — V is now graph-packed to ne=[HD, text_len]
+        // time-major-flat by the head-of-V transpose in
+        // `build_group_graph_cache`.  `tensor_raw_f32` downloads
+        // the bytes in the layout scalar `apply_rope` /
+        // `flash_attention_qkv` host references expect
+        // (`v[t*HD + c]`).  `tensor_to_time_channel` would now
+        // mis-interpret the swapped ne (reading HD as L_var and
+        // L as C_var) and silently feed wrong-orientation V into
+        // the attention.  See the header doc on
+        // `apply_rope_to_packed_qk` in `supertonic_internal.h`.
+        out.v = tensor_raw_f32(ggml_graph_get_tensor(cache.gf, v_name.c_str()));
+    }
+    if (trace && cache.apply_rope) {
+        // Trace-only extra downloads — post-RoPE Q/K mirrors the
+        // call site's `PUSH_GGML_TRACE({"ve_g*_attn_q_rope", …})`.
+        //
+        // QVAC-18966 — post-fix layout contract:
+        // `apply_rope_to_packed_qk` now produces ne=[HD, L] with
+        // time-major-flat memory (`data[c + t*HD]`).  Those bytes
+        // ARE the scalar `apply_rope`'s native flat layout
+        // (`out[t*HD + c]`), so `tensor_raw_f32` downloads them
+        // directly — no transpose needed.  `tensor_to_time_channel`
+        // would mis-interpret the new ne shape (reading `HD` as
+        // L_var and `L` as C_var) and transpose the data back to
+        // channel-major order.  See the header doc on
+        // `apply_rope_to_packed_qk` in `supertonic_internal.h`.
+        out.q_rope = tensor_raw_f32(
+            ggml_graph_get_tensor(cache.gf, cache.q_rope_name.c_str()));
+        out.k_rope = tensor_raw_f32(
+            ggml_graph_get_tensor(cache.gf, cache.k_rope_name.c_str()));
+    }
     if (trace) {
         push_trace(*trace, post_name, L, C, out.post);
         push_trace(*trace, q_name, L, 256, out.q);
@@ -988,6 +1869,32 @@ struct vector_res_style_qkv_result {
     std::vector<float> sq;
     std::vector<float> sk;
     std::vector<float> sv;
+
+    // QVAC-18605 round 9 — GPU-side handles for the post-projection
+    // style Q / K / V tensors so the next-stage style flash-attn
+    // call site (`run_text_attention_cache_gpu`) can blit them
+    // device→device instead of round-tripping through `sq` / `sk`
+    // / `sv` host vectors.  Same lifetime + dispatch pattern as
+    // `vector_group_graph_result::q_rope_gpu` / `v_gpu` (round-1
+    // 2C-lite for text attention; rounds 8 + 9 extend to front-
+    // block + style sites).
+    //
+    // Pointers are valid as long as the producing
+    // `vector_res_style_qkv_cache` is alive and hasn't been
+    // rebuilt (cache is `thread_local` at every call site;
+    // rebuild only on shape / matmul-source change).
+    //
+    // Always populated by `run_res_style_qkv_cache` (cheap —
+    // just `ggml_graph_get_tensor`); the host vectors above are
+    // gated on `trace != nullptr` (production path skips the
+    // download because it consumes `*_gpu` instead).  `post`
+    // stays unconditional — consumed by the next-stage
+    // `run_style_residual_cache` which still expects a host
+    // vector (cross-stage GPU bridge for `post` is deferred —
+    // see `aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md`).
+    ggml_tensor * sq_gpu = nullptr;
+    ggml_tensor * sk_gpu = nullptr;
+    ggml_tensor * sv_gpu = nullptr;
 };
 
 struct vector_res_style_qkv_cache {
@@ -1016,6 +1923,18 @@ struct vector_res_style_qkv_cache {
     ggml_tensor * rhs_in = nullptr;
     ggml_tensor * style_v_in = nullptr;
     ggml_tensor * kctx_in = nullptr;
+
+    // Audit F4 — skip the re-upload of `style_v_in` and `kctx_in`
+    // when the caller hands us the same host vectors as the
+    // previous call.  `cached_style_layouts` returns a stable
+    // pointer keyed on (model.generation_id, style_ttl), so the
+    // pointer comparison is a sound "same data" proxy.
+    // Steady-state per synth: 4 caches × 5 steps = 20 invocations,
+    // 1 cold-miss upload per cache, so at least 4 × (5−1) = 16 skipped.
+    // Across synths with the same voice: zero uploads after the
+    // first synth.  See AUDIT_SUPERTONIC_OPENCL.md F4.
+    const std::vector<float> * last_style_v_raw_uploaded = nullptr;
+    const std::vector<float> * last_kctx_raw_uploaded = nullptr;
 };
 
 void free_res_style_qkv_cache(vector_res_style_qkv_cache & cache) {
@@ -1080,16 +1999,38 @@ void build_res_style_qkv_cache(vector_res_style_qkv_cache & cache,
     cache.ctx = ggml_init(p);
     cache.gf = ggml_new_graph_custom(cache.ctx, NODES, false);
 
-    cache.lhs_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, C);
-    ggml_set_name(cache.lhs_in, "res_style_lhs"); ggml_set_input(cache.lhs_in);
-    cache.rhs_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, C);
-    ggml_set_name(cache.rhs_in, "res_style_rhs"); ggml_set_input(cache.rhs_in);
+    // F12: lhs / rhs ingested in CPU-native `[C, L]` channel-fast
+    // layout — `run_res_style_qkv_cache` uploads `lhs_tc` / `rhs_tc`
+    // directly, no host pack.  `style_v_in` / `kctx_in` are already
+    // shaped `[50, 256]` (i.e. `[ttl_len=L_ttl, C_style=256]`) and
+    // come from `cached_style_layouts(...)`, which produces stable
+    // c-major buffers shared across all 4 style residual sites —
+    // those keep their existing layout to preserve the F4 pointer-
+    // compare upload-skip optimization.
+    cache.lhs_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, C, L);
+    ggml_set_name(cache.lhs_in, "res_style_lhs_tc"); ggml_set_input(cache.lhs_in);
+    cache.rhs_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, C, L);
+    ggml_set_name(cache.rhs_in, "res_style_rhs_tc"); ggml_set_input(cache.rhs_in);
+    // style_v_in / kctx_in use the F4 pointer-compare upload-skip — the
+    // host pointer is stable across calls within one synth, so they're
+    // uploaded only on cold miss / pointer change.  That assumption
+    // requires the backend buffer to ALSO be stable.  gallocr frees
+    // leaf inputs once their last consumer runs, releasing the buffer
+    // for intermediate reuse on the next compute pass.  Mark INPUT +
+    // OUTPUT so the buffer is kept alive and the skip-upload optimisation
+    // actually preserves the uploaded data.
     cache.style_v_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, 50, 256);
-    ggml_set_name(cache.style_v_in, "res_style_ttl_lc"); ggml_set_input(cache.style_v_in);
+    ggml_set_name(cache.style_v_in, "res_style_ttl_lc");
+    ggml_set_input(cache.style_v_in);  ggml_set_output(cache.style_v_in);
     cache.kctx_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, 50, 256);
-    ggml_set_name(cache.kctx_in, "res_style_kctx_lc"); ggml_set_input(cache.kctx_in);
-
-    ggml_tensor * res = ggml_add(cache.ctx, cache.lhs_in, cache.rhs_in);
+    ggml_set_name(cache.kctx_in, "res_style_kctx_lc");
+    ggml_set_input(cache.kctx_in);  ggml_set_output(cache.kctx_in);
+
+    ggml_tensor * lhs_lc = transpose_time_channel_ggml(cache.ctx, cache.lhs_in);
+    ggml_tensor * rhs_lc = transpose_time_channel_ggml(cache.ctx, cache.rhs_in);
+    ggml_set_name(lhs_lc, "res_style_lhs");
+    ggml_set_name(rhs_lc, "res_style_rhs");
+    ggml_tensor * res = ggml_add(cache.ctx, lhs_lc, rhs_lc);
     ggml_set_name(res, residual_name.c_str());
     if (trace_outputs) {
         ggml_set_output(res);
@@ -1110,16 +2051,42 @@ void build_res_style_qkv_cache(vector_res_style_qkv_cache & cache,
     ggml_build_forward_expand(cache.gf, post);
 
     const std::string style_prefix = vector_main_block(style_block) + ".attention.";
-    ggml_tensor * sq = dense_matmul_time_ggml(cache.ctx, post,
+    // Round 11 sq/sk/sv layout fix layered on top of master's
+    // `dense_matmul_time_pretransposed_ggml` upgrade.  Same
+    // reasoning as the front-block V site above: pretransposed
+    // variant still produces ne=[T, OC] channel-major-flat
+    // memory; the round-11 `ggml_cont(ggml_transpose(...))`
+    // below this block remains required to land bytes in the
+    // ne=[HD, L] time-major-flat layout `q_tc_in`/`k_tc_in`/
+    // `v_tc_in` expect for the GPU-bridge blit.
+    ggml_tensor * sq_matmul = dense_matmul_time_pretransposed_ggml(cache.ctx, model, post,
         require_source_tensor(model, q_matmul_source),
         require_source_tensor(model, style_prefix + "W_query.linear.bias"));
-    ggml_tensor * sk = dense_matmul_time_ggml(cache.ctx, cache.kctx_in,
+    ggml_tensor * sk_matmul = dense_matmul_time_pretransposed_ggml(cache.ctx, model, cache.kctx_in,
         require_source_tensor(model, k_matmul_source),
         require_source_tensor(model, style_prefix + "W_key.linear.bias"));
-    sk = ggml_tanh(cache.ctx, sk);
-    ggml_tensor * sv = dense_matmul_time_ggml(cache.ctx, cache.style_v_in,
+    sk_matmul = ggml_tanh(cache.ctx, sk_matmul);
+    ggml_tensor * sv_matmul = dense_matmul_time_pretransposed_ggml(cache.ctx, model, cache.style_v_in,
         require_source_tensor(model, v_matmul_source),
         require_source_tensor(model, style_prefix + "W_value.linear.bias"));
+    // QVAC-18605 follow-up — pack style Q/K/V into the time-major-
+    // flat layout that `run_text_attention_cache_gpu` consumes via
+    // `ggml_backend_tensor_copy`.  The style attention path has
+    // no RoPE (cos/sin tables are absent for the style sites), so
+    // the head-of-pipeline transpose inside
+    // `apply_rope_to_packed_qk` doesn't run here — we open-code
+    // it for each of the three matmul outputs.  Matmul output is
+    // ne=[L_in, HD] channel-major-flat (`data[t + c*L_in]`);
+    // `q_tc_in` / `k_tc_in` / `v_tc_in` in
+    // `vector_text_attention_cache` are ne=[HD, L_in] time-major-
+    // flat (`data[c + t*HD]`).  `ggml_cont(ggml_transpose(...))`
+    // flips strides + materialises a contiguous fresh tensor
+    // with the right layout.  See the header doc on
+    // `apply_rope_to_packed_qk` in `supertonic_internal.h` for
+    // the full reasoning.
+    ggml_tensor * sq = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, sq_matmul));
+    ggml_tensor * sk = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, sk_matmul));
+    ggml_tensor * sv = ggml_cont(cache.ctx, ggml_transpose(cache.ctx, sv_matmul));
     ggml_set_name(sq, q_name.c_str()); ggml_set_output(sq); ggml_build_forward_expand(cache.gf, sq);
     ggml_set_name(sk, k_name.c_str()); ggml_set_output(sk); ggml_build_forward_expand(cache.gf, sk);
     ggml_set_name(sv, v_name.c_str()); ggml_set_output(sv); ggml_build_forward_expand(cache.gf, sv);
@@ -1152,14 +2119,19 @@ vector_res_style_qkv_result run_res_style_qkv_cache(vector_res_style_qkv_cache &
                                                     const char * island,
                                                     std::vector<supertonic_trace_tensor> * trace) {
     const bool want_trace = trace != nullptr;
-    // Reuse the shape-keyed graph on the direct backend path; rebuild + route
-    // through the scheduler only when an op must run on CPU. Mirrors run_hift_decode.
-    build_res_style_qkv_cache(cache, model, L, C, norm_block, post_block, style_block,
-                              q_matmul_source, k_matmul_source, v_matmul_source,
-                              residual_name, norm_name, post_name, q_name, k_name, v_name,
-                              want_trace);
-    std::vector<float> lhs_raw = pack_time_channel_for_ggml(lhs_tc, L, C);
-    std::vector<float> rhs_raw = pack_time_channel_for_ggml(rhs_tc, L, C);
+    // QVAC-18605 — cache-key check (skip rebuild on hot path).
+    if (cache.model != &model || cache.generation_id != model.generation_id ||
+        cache.L != L || cache.C != C ||
+        cache.norm_block != norm_block || cache.post_block != post_block ||
+        cache.style_block != style_block || cache.trace_outputs != want_trace ||
+        cache.q_matmul_source != q_matmul_source || cache.k_matmul_source != k_matmul_source ||
+        cache.v_matmul_source != v_matmul_source) {
+        build_res_style_qkv_cache(cache, model, L, C, norm_block, post_block, style_block,
+                                  q_matmul_source, k_matmul_source, v_matmul_source,
+                                  residual_name, norm_name, post_name, q_name, k_name, v_name,
+                                  want_trace);
+    }
+    // QVAC-19254 — direct vs scheduler routing.
     bool direct = true;
     const int n_nodes = ggml_graph_n_nodes(cache.gf);
     for (int i = 0; i < n_nodes; ++i) {
@@ -1177,22 +2149,62 @@ vector_res_style_qkv_result run_res_style_qkv_cache(vector_res_style_qkv_cache &
     } else {
         supertonic_sched_alloc(model, cache.gf);
     }
-    ggml_backend_tensor_set(cache.lhs_in, lhs_raw.data(), 0, lhs_raw.size() * sizeof(float));
-    ggml_backend_tensor_set(cache.rhs_in, rhs_raw.data(), 0, rhs_raw.size() * sizeof(float));
-    ggml_backend_tensor_set(cache.style_v_in, style_v_raw.data(), 0, style_v_raw.size() * sizeof(float));
-    ggml_backend_tensor_set(cache.kctx_in, kctx_raw.data(), 0, kctx_raw.size() * sizeof(float));
-    if (direct) supertonic_graph_compute(model, cache.gf);
-    else        profile_vector_compute(model, cache.gf, current_step, island);
+    // F12: direct upload of the CPU-native time-major host buffers
+    // (`x[t*C + c]`); `cache.lhs_in` / `cache.rhs_in` are now
+    // `ne=[C, L]` and the graph transposes them inside; no host pack.
+    ggml_backend_tensor_set(cache.lhs_in, lhs_tc.data(), 0, lhs_tc.size() * sizeof(float));
+    ggml_backend_tensor_set(cache.rhs_in, rhs_tc.data(), 0, rhs_tc.size() * sizeof(float));
+    // F4: pointer-compare against the last successfully uploaded
+    // host vector.  Cache rebuilds (above) reset last_*_uploaded
+    // to nullptr via `cache = {}`, so the cold-miss path always
+    // fires the upload regardless of pointer match.
+    if (cache.last_style_v_raw_uploaded != &style_v_raw) {
+        ggml_backend_tensor_set(cache.style_v_in, style_v_raw.data(), 0, style_v_raw.size() * sizeof(float));
+        cache.last_style_v_raw_uploaded = &style_v_raw;
+    }
+    if (cache.last_kctx_raw_uploaded != &kctx_raw) {
+        ggml_backend_tensor_set(cache.kctx_in, kctx_raw.data(), 0, kctx_raw.size() * sizeof(float));
+        cache.last_kctx_raw_uploaded = &kctx_raw;
+    }
+    if (direct) profile_vector_compute(model, cache.gf, current_step, island);
+    else        profile_vector_compute(model, cache.gf, current_step, island, /*use_sched=*/true);
     if (trace) {
         push_trace(*trace, residual_name, L, C, tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, residual_name.c_str())));
         push_trace(*trace, norm_name, L, C, tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, norm_name.c_str())));
     }
     vector_res_style_qkv_result out;
+
+    // QVAC-18605 round 9 — populate GPU handles for the post-
+    // projection Q / K / V tensors unconditionally.  Cheap (no
+    // GPU sync; just a name-to-pointer lookup in the cached
+    // graph).  Lifetime contract documented on the struct.
+    out.sq_gpu = ggml_graph_get_tensor(cache.gf, q_name.c_str());
+    out.sk_gpu = ggml_graph_get_tensor(cache.gf, k_name.c_str());
+    out.sv_gpu = ggml_graph_get_tensor(cache.gf, v_name.c_str());
+
+    // `post` stays a host download — the next-stage
+    // `run_style_residual_cache` still consumes a host vector.
     out.post = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, post_name.c_str()));
-    out.sq = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, q_name.c_str()));
-    out.sk = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, k_name.c_str()));
-    out.sv = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, v_name.c_str()));
+
+    // QVAC-18605 round 9 — gate `sq` / `sk` / `sv` host downloads
+    // on trace mode.  Production path skips them because the
+    // call site uses `out.sq_gpu` / `out.sk_gpu` / `out.sv_gpu`
+    // via `run_text_attention_cache_gpu`.  Eliminates 3 sync
+    // points per call × 4 sites × 5 denoise steps = 60 GPU→host
+    // downloads / synth.  Mirrors the round-1 2C-lite
+    // `need_host_qkv` gate on the group graph cache (which also
+    // ORs in `!cache.apply_rope`; the style sites have no RoPE).
     if (trace) {
+        // QVAC-18605 follow-up — sq / sk / sv are now graph-packed
+        // to ne=[HD, L] time-major-flat (see the matmul-output
+        // transpose in `build_res_style_qkv_cache`).
+        // `tensor_raw_f32` downloads the bytes in the layout
+        // scalar reference and trace harnesses expect
+        // (`out[t*256 + c]`).  See the header doc on
+        // `apply_rope_to_packed_qk` in `supertonic_internal.h`.
+        out.sq = tensor_raw_f32(out.sq_gpu);
+        out.sk = tensor_raw_f32(out.sk_gpu);
+        out.sv = tensor_raw_f32(out.sv_gpu);
         push_trace(*trace, post_name, L, C, out.post);
         push_trace(*trace, q_name, L, 256, out.sq);
         push_trace(*trace, k_name, 50, 256, out.sk);
@@ -1201,6 +2213,113 @@ vector_res_style_qkv_result run_res_style_qkv_cache(vector_res_style_qkv_cache &
     return out;
 }
 
+// Audit finding F8 — cached "(add residual) + layer_norm" graph.
+//
+// The vector estimator's GGML production path runs four of these
+// tiny graphs per step: one after each group's style-attention
+// output to fold the style residual back into the main activation
+// before the next group's convnext block runs.  Pre-audit, each
+// call allocated a fresh `ggml_context`, `ggml_cgraph`, and
+// `ggml_gallocr_t`, then freed them at the end.  Per synth that's
+// 4 sites × 5 steps = 20 allocator churns; key is constant within
+// a synth, so caching gets that down to 4 cold-miss rebuilds per
+// model+L combination.
+struct vector_style_residual_graph_cache {
+    const supertonic_model * model = nullptr;
+    uint64_t generation_id = 0;
+    int L = 0;
+    int C = 0;
+    int norm_block = 0;
+    bool trace_outputs = false;
+    std::vector<uint8_t> buf;
+    ggml_context * ctx = nullptr;
+    ggml_cgraph * gf = nullptr;
+    ggml_gallocr_t allocr = nullptr;
+    ggml_tensor * lhs_in = nullptr;
+    ggml_tensor * out_in = nullptr;
+};
+
+inline void free_style_residual_cache(vector_style_residual_graph_cache & cache) {
+    supertonic_safe_gallocr_free(cache.allocr, cache.generation_id);
+    if (cache.ctx) ggml_free(cache.ctx);
+    cache = {};
+}
+
+inline void build_style_residual_cache(vector_style_residual_graph_cache & cache,
+                                       const supertonic_model & model,
+                                       int L, int C, int norm_block, bool trace_outputs) {
+    free_style_residual_cache(cache);
+    cache.model = &model;
+    cache.generation_id = model.generation_id;
+    cache.L = L;
+    cache.C = C;
+    cache.norm_block = norm_block;
+    cache.trace_outputs = trace_outputs;
+
+    constexpr int NODES = 128;
+    const size_t buf_size = ggml_tensor_overhead() * NODES +
+                            ggml_graph_overhead_custom(NODES, false);
+    cache.buf.assign(buf_size, 0);
+    ggml_init_params p = { buf_size, cache.buf.data(), true };
+    cache.ctx = ggml_init(p);
+    cache.gf = ggml_new_graph_custom(cache.ctx, NODES, false);
+
+    // F12: ingest both residual operands in CPU-native `[C, L]`
+    // layout — `run_style_residual_cache` uploads `lhs_tc` /
+    // `out_tc` directly; the graph transposes both inside.
+    cache.lhs_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, C, L);
+    ggml_set_name(cache.lhs_in, "sr_lhs_in_tc"); ggml_set_input(cache.lhs_in);
+    cache.out_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, C, L);
+    ggml_set_name(cache.out_in, "sr_out_in_tc"); ggml_set_input(cache.out_in);
+
+    ggml_tensor * lhs_lc = transpose_time_channel_ggml(cache.ctx, cache.lhs_in);
+    ggml_tensor * out_lc = transpose_time_channel_ggml(cache.ctx, cache.out_in);
+    ggml_set_name(lhs_lc, "sr_lhs");
+    ggml_set_name(out_lc, "sr_out");
+    ggml_tensor * res = ggml_add(cache.ctx, lhs_lc, out_lc);
+    ggml_set_name(res, "sr_residual");
+    if (trace_outputs) {
+        ggml_set_output(res);
+        ggml_build_forward_expand(cache.gf, res);
+    }
+    ggml_tensor * norm = layer_norm_ggml(cache.ctx, res,
+        require_source_tensor(model, vector_main_block(norm_block) + ".norm.norm.weight"),
+        require_source_tensor(model, vector_main_block(norm_block) + ".norm.norm.bias"));
+    ggml_set_name(norm, "sr_norm"); ggml_set_output(norm);
+    ggml_build_forward_expand(cache.gf, norm);
+
+    cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+    if (!cache.allocr) throw std::runtime_error("ggml_gallocr_new style residual cache failed");
+    if (!ggml_gallocr_reserve(cache.allocr, cache.gf)) {
+        throw std::runtime_error("ggml_gallocr_reserve style residual cache failed");
+    }
+    ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
+}
+
+inline std::vector<float> run_style_residual_cache(
+    vector_style_residual_graph_cache & cache,
+    const supertonic_model & model,
+    const std::vector<float> & lhs_tc,
+    const std::vector<float> & out_tc,
+    int L, int C, int norm_block,
+    int current_step, const char * island,
+    std::vector<float> * residual_trace_out) {
+    const bool want_trace = residual_trace_out != nullptr;
+    if (cache.model != &model || cache.generation_id != model.generation_id ||
+        cache.L != L || cache.C != C ||
+        cache.norm_block != norm_block || cache.trace_outputs != want_trace) {
+        build_style_residual_cache(cache, model, L, C, norm_block, want_trace);
+    }
+    // F12: direct upload — host pack loops eliminated.
+    ggml_backend_tensor_set(cache.lhs_in, lhs_tc.data(), 0, lhs_tc.size()*sizeof(float));
+    ggml_backend_tensor_set(cache.out_in, out_tc.data(), 0, out_tc.size()*sizeof(float));
+    profile_vector_compute(model, cache.gf, current_step, island);
+    if (residual_trace_out) {
+        *residual_trace_out = tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "sr_residual"));
+    }
+    return tensor_to_time_channel(ggml_graph_get_tensor(cache.gf, "sr_norm"));
+}
+
 struct vector_tail_graph_cache {
     const supertonic_model * model = nullptr;
     uint64_t generation_id = 0;
@@ -1300,13 +2419,22 @@ void build_tail_graph_cache(vector_tail_graph_cache & cache,
     cache.ctx = ggml_init(p);
     cache.gf = ggml_new_graph_custom(cache.ctx, NODES, false);
 
-    cache.tail_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, C);
-    ggml_set_name(cache.tail_in, "tail_in"); ggml_set_input(cache.tail_in);
+    // F12: ingest `tail_in` in CPU-native `[C, L]` channel-fast
+    // layout — `run_tail_graph_cache` uploads `x_tc` directly; the
+    // graph transposes it inside.  `tail_noise` stays at `[L, Cin]`
+    // because the (non-CPU non-trace) tail update path adds it
+    // directly to `velocity_t` (shape [L, Cin]); see the
+    // accompanying redundancy fix in `run_tail_graph_cache` which
+    // also skips two redundant CPU transposes on `noisy_latent`
+    // that cancel each other out.
+    cache.tail_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, C, L);
+    ggml_set_name(cache.tail_in, "tail_in_tc"); ggml_set_input(cache.tail_in);
     cache.tail_mask = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, L);
     ggml_set_name(cache.tail_mask, "tail_mask"); ggml_set_input(cache.tail_mask);
     cache.tail_noise = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, Cin);
     ggml_set_name(cache.tail_noise, "tail_noise"); ggml_set_input(cache.tail_noise);
-    ggml_tensor * tail = cache.tail_in;
+    ggml_tensor * tail = transpose_time_channel_ggml(cache.ctx, cache.tail_in);
+    ggml_set_name(tail, "tail_in");
     for (int j = 0; j < 4; ++j) {
         tail = vector_convnext_ggml(cache.ctx, model,
             "vector_estimator:tts.ttl.vector_field.last_convnext.convnext." + std::to_string(j),
@@ -1319,7 +2447,10 @@ void build_tail_graph_cache(vector_tail_graph_cache & cache,
     }
     ggml_tensor * velocity_t = nullptr;
 #if defined(TTS_CPP_USE_ACCELERATE) || defined(TTS_CPP_USE_CBLAS)
-    if (!trace_outputs) {
+    // CPU-only fused tail-update op (BLAS matmul + mask + step scale +
+    // residual add).  The `else` branch below is the pure-GGML
+    // decomposition used on GPU backends and during trace runs.
+    if (!trace_outputs && supertonic_use_cpu_custom_ops()) {
         ggml_tensor * args[] = {
             tail,
             cache.tail_mask,
@@ -1360,17 +2491,14 @@ std::vector<float> run_tail_graph_cache(vector_tail_graph_cache & cache,
                                         int current_step,
                                         int total_steps,
                                         std::vector<supertonic_trace_tensor> * trace) {
-    // Reuse the shape-keyed graph on the direct backend path; rebuild + route
-    // through the scheduler only when an op must run on CPU. Mirrors run_hift_decode.
-    build_tail_graph_cache(cache, model, L, C, Cin, total_steps, trace != nullptr);
-    std::vector<float> tail_in_raw = pack_time_channel_for_ggml(x_tc, L, C);
-    std::vector<float> noise_tc((size_t)L*Cin);
-    for (int t = 0; t < L; ++t) {
-        for (int c = 0; c < Cin; ++c) {
-            noise_tc[(size_t)t*Cin+c] = noisy_latent[(size_t)c*L+t];
-        }
+    // QVAC-18605 — cache-key check (skip rebuild on hot path).
+    if (cache.model != &model || cache.generation_id != model.generation_id ||
+        cache.L != L || cache.C != C ||
+        cache.Cin != Cin || cache.total_steps != total_steps ||
+        cache.trace_outputs != (trace != nullptr)) {
+        build_tail_graph_cache(cache, model, L, C, Cin, total_steps, trace != nullptr);
     }
-    std::vector<float> noise_raw = pack_time_channel_for_ggml(noise_tc, L, Cin);
+    // QVAC-19254 — direct vs scheduler routing.
     bool direct = true;
     const int n_nodes = ggml_graph_n_nodes(cache.gf);
     for (int i = 0; i < n_nodes; ++i) {
@@ -1388,11 +2516,22 @@ std::vector<float> run_tail_graph_cache(vector_tail_graph_cache & cache,
     } else {
         supertonic_sched_alloc(model, cache.gf);
     }
-    ggml_backend_tensor_set(cache.tail_in, tail_in_raw.data(), 0, tail_in_raw.size()*sizeof(float));
+    // F12: direct upload of `x_tc` to `cache.tail_in` (now
+    // `ne=[C, L]`).  Also eliminates an inadvertent CPU
+    // double-transpose on `noisy_latent`: the old code unpacked
+    // `noisy_latent[c*L+t]` → `noise_tc[t*Cin+c]` (CPU loop #1)
+    // then packed `noise_tc[t*Cin+c]` → `noise_raw[c*L+t]` (CPU
+    // loop #2), producing `noise_raw` byte-equivalent to
+    // `noisy_latent`.  `noisy_latent` is already in the
+    // channel-major memory layout that `ne=[L, Cin]` (with natural
+    // strides) wants: its element (c, t) at float offset `c*L + t`
+    // matches GGML's element (l=t, c=c) at float offset `t + c*L`.
+    // Uploading directly skips both loops.
+    ggml_backend_tensor_set(cache.tail_in, x_tc.data(), 0, x_tc.size()*sizeof(float));
     ggml_backend_tensor_set(cache.tail_mask, latent_mask, 0, (size_t)L*sizeof(float));
-    ggml_backend_tensor_set(cache.tail_noise, noise_raw.data(), 0, noise_raw.size()*sizeof(float));
-    if (direct) supertonic_graph_compute(model, cache.gf);
-    else        profile_vector_compute(model, cache.gf, current_step, "tail");
+    ggml_backend_tensor_set(cache.tail_noise, noisy_latent, 0, (size_t)L*Cin*sizeof(float));
+    if (direct) profile_vector_compute(model, cache.gf, current_step, "tail");
+    else        profile_vector_compute(model, cache.gf, current_step, "tail", /*use_sched=*/true);
     if (trace) {
         for (int j = 0; j < 4; ++j) {
             const std::string name = "ve_last_convnext" + std::to_string(j);
@@ -1472,6 +2611,39 @@ std::vector<float> time_embedding(const supertonic_model & m, int current, int t
     return o;
 }
 
+// Audit F9 — cache `time_embedding(model, current, total)` outputs
+// keyed by `(current, total)`.  Pure function over its key, so a
+// stored entry is the byte-exact result the slow path would produce.
+// Cache lives in `model.time_emb_cache` (mutable map); each new
+// key runs the underlying `time_embedding` once on its cold miss,
+// and later lookups of that key are served from the map while the
+// entry remains.  Returns a copy by value (only 64 floats) so
+// callers don't have to worry about cache mutation invalidating
+// their reference across nested lookups.
+inline uint64_t time_emb_cache_key(int current, int total) {
+    return ((uint64_t)(uint32_t) current << 32) | (uint32_t) total;
+}
+
+} // namespace
+
+std::array<float, 64> cached_time_embedding(const supertonic_model & model,
+                                            int current_step,
+                                            int total_steps) {
+    const uint64_t key = time_emb_cache_key(current_step, total_steps);
+    auto it = model.time_emb_cache.find(key);
+    if (it != model.time_emb_cache.end()) {
+        return it->second;
+    }
+    std::vector<float> raw = time_embedding(model, current_step, total_steps);
+    std::array<float, 64> arr{};
+    const size_t n = std::min((size_t) 64, raw.size());
+    for (size_t i = 0; i < n; ++i) arr[i] = raw[i];
+    auto ins = model.time_emb_cache.emplace(key, arr);
+    return ins.first->second;
+}
+
+namespace {
+
 void apply_rope(const float * theta, std::vector<float> & x, int L, int H, int D) {
     int half = D/2;
     for(int h=0;h<H;++h) for(int t=0;t<L;++t) for(int d=0;d<half;++d) {
@@ -1494,8 +2666,10 @@ void rope_attn(const supertonic_model & m, int group, std::vector<float> & x, in
     for(int t=0;t<LT;++t) for(int c=0;c<256;++c) text_lc[(size_t)t*256+c]=text_emb[(size_t)c*LT+t];
     dense_matmul_time(text_lc,LT,256,read_f32(m,"vector_estimator:onnx::MatMul_"+std::to_string(kids[group])),read_f32(m,base+"W_key.linear.bias"),A,k);
     dense_matmul_time(text_lc,LT,256,read_f32(m,"vector_estimator:onnx::MatMul_"+std::to_string(vids[group])),read_f32(m,base+"W_value.linear.bias"),A,v);
-    auto theta_t = read_f32(m,"vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-    apply_rope(theta_t.data.data(),q,L,H,D); apply_rope(theta_t.data.data(),k,LT,H,D);
+    // F1: shared host-side cache; same data as
+    // `read_f32(m, "...3.attn.theta")` but no per-call backend read.
+    const float * theta_t = m.vector_rope_theta.data();
+    apply_rope(theta_t,q,L,H,D); apply_rope(theta_t,k,LT,H,D);
     std::vector<float> attn_out((size_t)L*A,0), scores(LT), probs(LT);
     float scale=1.0f/16.0f;
     for(int h=0;h<H;++h) for(int qi=0;qi<L;++qi){
@@ -1543,7 +2717,9 @@ bool supertonic_vector_step_cpu(const supertonic_model & model, const float * no
         std::vector<float> x;
         conv1x1(in,L,Cin,read_f32(model,"vector_estimator:tts.ttl.vector_field.proj_in.net.weight"),nullptr,C,x);
         for(int t=0;t<L;++t) for(int c=0;c<C;++c) x[(size_t)t*C+c]*=latent_mask[t];
-        std::vector<float> te=time_embedding(model,current_step,total_steps);
+        // F9: cached time-embedding (5 distinct keys per default schedule).
+        auto te_arr = cached_time_embedding(model, current_step, total_steps);
+        std::vector<float> te(te_arr.begin(), te_arr.end());
         static const int time_ids[4]={3095,3140,3185,3230};
         for(int group=0;group<4;++group){
             int ob=group*6;
@@ -1589,6 +2765,7 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
                                        bool include_scalar_trace,
                                        bool include_ggml_trace,
                                        std::vector<float> * next_latent_tc_out) {
+    supertonic_op_dispatch_scope dispatch(model);
     try {
         scalar_trace.clear();
         ggml_trace.clear();
@@ -1625,7 +2802,9 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
                 push_trace(scalar_trace, "ve_block0_convnext" + std::to_string(j), L, C, block);
             }
 
-            std::vector<float> te = time_embedding(model, current_step, total_steps);
+            // F9: cached time-embedding.
+            auto te_arr = cached_time_embedding(model, current_step, total_steps);
+            std::vector<float> te(te_arr.begin(), te_arr.end());
             std::vector<float> tb;
             dense_matmul_vec(te, read_f32(model, "vector_estimator:onnx::MatMul_3095"),
                              read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.1.linear.linear.bias"),
@@ -1655,9 +2834,10 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             push_trace(scalar_trace, "ve_attn0_q", L, A, q);
             push_trace(scalar_trace, "ve_attn0_k", text_len, A, k);
             push_trace(scalar_trace, "ve_attn0_v", text_len, A, v);
-            auto theta_t = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-            apply_rope(theta_t.data.data(), q, L, 4, 64);
-            apply_rope(theta_t.data.data(), k, text_len, 4, 64);
+            // F1: theta lives in model.vector_rope_theta (populated at load).
+            const float * theta_t = model.vector_rope_theta.data();
+            apply_rope(theta_t, q, L, 4, 64);
+            apply_rope(theta_t, k, text_len, 4, 64);
             push_trace(scalar_trace, "ve_attn0_q_rope", L, A, q);
             push_trace(scalar_trace, "ve_attn0_k_rope", text_len, A, k);
 
@@ -1786,9 +2966,10 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             push_trace(scalar_trace, "ve_g1_attn_q", L, A1, q1);
             push_trace(scalar_trace, "ve_g1_attn_k", text_len, A1, k1);
             push_trace(scalar_trace, "ve_g1_attn_v", text_len, A1, v1);
-            auto theta1 = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-            apply_rope(theta1.data.data(), q1, L, 4, 64);
-            apply_rope(theta1.data.data(), k1, text_len, 4, 64);
+            // F1: theta lives in model.vector_rope_theta (populated at load).
+            const float * theta1 = model.vector_rope_theta.data();
+            apply_rope(theta1, q1, L, 4, 64);
+            apply_rope(theta1, k1, text_len, 4, 64);
             push_trace(scalar_trace, "ve_g1_attn_q_rope", L, A1, q1);
             push_trace(scalar_trace, "ve_g1_attn_k_rope", text_len, A1, k1);
             std::vector<float> ctx1((size_t)L*A1, 0.0f), scores1(text_len), probs1(text_len);
@@ -1897,9 +3078,10 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             push_trace(scalar_trace, "ve_g2_attn_q", L, A2, q2);
             push_trace(scalar_trace, "ve_g2_attn_k", text_len, A2, k2);
             push_trace(scalar_trace, "ve_g2_attn_v", text_len, A2, v2);
-            auto theta2 = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-            apply_rope(theta2.data.data(), q2, L, 4, 64);
-            apply_rope(theta2.data.data(), k2, text_len, 4, 64);
+            // F1: theta lives in model.vector_rope_theta (populated at load).
+            const float * theta2 = model.vector_rope_theta.data();
+            apply_rope(theta2, q2, L, 4, 64);
+            apply_rope(theta2, k2, text_len, 4, 64);
             push_trace(scalar_trace, "ve_g2_attn_q_rope", L, A2, q2);
             push_trace(scalar_trace, "ve_g2_attn_k_rope", text_len, A2, k2);
             std::vector<float> ctx2((size_t)L*A2, 0.0f), scores2(text_len), probs2(text_len);
@@ -2008,9 +3190,10 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             push_trace(scalar_trace, "ve_g3_attn_q", L, A3, q3);
             push_trace(scalar_trace, "ve_g3_attn_k", text_len, A3, k3);
             push_trace(scalar_trace, "ve_g3_attn_v", text_len, A3, v3);
-            auto theta3 = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-            apply_rope(theta3.data.data(), q3, L, 4, 64);
-            apply_rope(theta3.data.data(), k3, text_len, 4, 64);
+            // F1: theta lives in model.vector_rope_theta (populated at load).
+            const float * theta3 = model.vector_rope_theta.data();
+            apply_rope(theta3, q3, L, 4, 64);
+            apply_rope(theta3, k3, text_len, 4, 64);
             push_trace(scalar_trace, "ve_g3_attn_q_rope", L, A3, q3);
             push_trace(scalar_trace, "ve_g3_attn_k_rope", text_len, A3, k3);
             std::vector<float> ctx3((size_t)L*A3, 0.0f), scores3(text_len), probs3(text_len);
@@ -2110,98 +3293,359 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
         push_trace(scalar_trace, "ve_next_latent_tc", L, Cin, next_latent);
         }
 
-        constexpr int MAX_NODES = 2048;
-        static size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
-                                 ggml_graph_overhead_custom(MAX_NODES, false);
-        thread_local std::vector<uint8_t> buf(buf_size);
-        ggml_init_params p = { buf_size, buf.data(), true };
-        ggml_context * ctx = ggml_init(p);
-        ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
-
-        ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, Cin);
-        ggml_set_name(x, "ve_latent_tc");
-        ggml_set_input(x);
-        ggml_tensor * mask = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, L);
-        ggml_set_name(mask, "ve_latent_mask");
-        ggml_set_input(mask);
-        ggml_tensor * t_emb = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);
-        ggml_set_name(t_emb, "ve_time_emb");
-        ggml_set_input(t_emb);
-        ggml_tensor * text_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, text_len, 256);
-        ggml_set_name(text_in, "ve_text_lc");
-        ggml_set_input(text_in);
-        ggml_tensor * y = conv1d_f32(ctx, require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.proj_in.net.weight"), x, 1, 0, 1);
-        ggml_tensor * masked = ggml_mul(ctx, y, repeat_like(ctx, mask, y));
-        ggml_set_name(masked, "ve_masked");
-        if (include_ggml_trace) {
-            ggml_set_output(masked);
-            ggml_build_forward_expand(gf, masked);
-        }
-
-        ggml_tensor * cur = masked;
-        int dils_ggml[4] = {1, 2, 4, 8};
-        for (int j = 0; j < 4; ++j) {
-            cur = vector_convnext_ggml(ctx, model,
-                "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext." + std::to_string(j),
-                cur, dils_ggml[j]);
+        // F19 — vector-estimator front-block graph cache.  Same
+        // pattern as F8 / F11 / F14 / F18: build once per
+        // (model, L, text_len, trace), survive across denoise
+        // steps.  Pre-audit: 5 fresh alloc/free cycles per synth
+        // (one per step); post-audit: 1 cold-miss rebuild on the
+        // first step of the first synth, zero rebuilds thereafter
+        // for fixed-shape prompts.
+        //
+        // `trace` is part of the key because the graph wires extra
+        // `ggml_set_output` markers for the intermediate convnext
+        // outputs in trace mode; rebuilding when the flag flips
+        // keeps the gallocr's reserved buffer right-sized.
+        struct ve_front_block_graph_cache {
+            const supertonic_model * model = nullptr;
+            uint64_t generation_id = 0;
+            int L = 0;
+            int text_len = 0;
+            bool trace_outputs = false;
+            std::vector<uint8_t> buf;
+            ggml_context * ctx = nullptr;
+            ggml_cgraph * gf = nullptr;
+            ggml_gallocr_t allocr = nullptr;
+            // QVAC-18605 round 12 #5 — host-pinned input scratchpad
+            // for the three hot per-step inputs (x_in, mask_in,
+            // t_emb_in).  Same dispatch pattern as
+            // `vector_group_graph_cache`: helper returns nullptr on
+            // CPU / non-Vulkan backends; we fall back to the
+            // default backend buffer via
+            // `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`.
+            // `text_in_t` stays in `ctx` (gallocr-allocated): the
+            // round-10 upload-skip tracker already elides the
+            // per-step uploads, so a pinned-host staging buffer
+            // would only pay off on the single cold-miss upload.
+            std::vector<uint8_t> input_ctx_storage;
+            ggml_context * input_ctx = nullptr;
+            ggml_backend_buffer_t input_buf = nullptr;
+            ggml_tensor * x_in = nullptr;
+            ggml_tensor * mask_in = nullptr;
+            ggml_tensor * t_emb_in = nullptr;
+            ggml_tensor * text_in_t = nullptr;
+            // F23 — in-graph RoPE inputs (cos/sin tables for Q's
+            // sequence length L and K's sequence length text_len).
+            // Stable for the cache's lifetime; uploaded once at
+            // build time.  `apply_rope` is false when the GGUF
+            // didn't ship vector_rope_theta, in which case the
+            // legacy host apply_rope path is taken downstream.
+            bool apply_rope = false;
+            ggml_tensor * q_cos_in = nullptr;
+            ggml_tensor * q_sin_in = nullptr;
+            ggml_tensor * k_cos_in = nullptr;
+            ggml_tensor * k_sin_in = nullptr;
+
+            // QVAC-18605 round 10 — pointer-compare upload-skip
+            // tracker for `text_in_t`.  `text_emb` is stable within
+            // one synth (5 calls × same pointer) but the stack-
+            // local `std::vector<float>` may be reallocated to the
+            // SAME address across synths (allocator size-class
+            // reuse).  Caller resets at `current_step == 0` to
+            // avoid leaking synth-N data into synth-N+1.  See the
+            // upload_skip_tracker contract in
+            // supertonic_internal.h.
+            //
+            // Cache rebuild zeroes this via `front_cache = {}`
+            // (the tracker's only field is a pointer that
+            // zero-initialises to nullptr → effective reset).
+            upload_skip_tracker text_in_skip;
+        };
+        thread_local ve_front_block_graph_cache front_cache;
+        if (front_cache.model != &model ||
+            front_cache.generation_id != model.generation_id ||
+            front_cache.L != L ||
+            front_cache.text_len != text_len ||
+            front_cache.trace_outputs != include_ggml_trace) {
+            // Tear down stale state.  Round 12 #5 — same teardown
+            // order as `free_group_graph_cache`: gallocr → main
+            // ctx → input host buffer → input ctx.  Reversing
+            // order would dangle gallocr pointers into freed
+            // input-ctx tensor metadata.
+            supertonic_safe_gallocr_free(front_cache.allocr, front_cache.generation_id);
+            if (front_cache.ctx) ggml_free(front_cache.ctx);
+            if (front_cache.input_buf) ggml_backend_buffer_free(front_cache.input_buf);
+            if (front_cache.input_ctx) ggml_free(front_cache.input_ctx);
+            front_cache = {};
+            front_cache.model = &model;
+            front_cache.generation_id = model.generation_id;
+            front_cache.L = L;
+            front_cache.text_len = text_len;
+            front_cache.trace_outputs = include_ggml_trace;
+
+            constexpr int MAX_NODES = 2048;
+            const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                                    ggml_graph_overhead_custom(MAX_NODES, false);
+            front_cache.buf.assign(buf_size, 0);
+            ggml_init_params p = { buf_size, front_cache.buf.data(), true };
+            front_cache.ctx = ggml_init(p);
+            front_cache.gf  = ggml_new_graph_custom(front_cache.ctx, MAX_NODES, false);
+
+            // QVAC-18605 round 12 #5 — host-pinned scratchpad for
+            // the 3 hot per-step inputs (x_in, mask_in, t_emb_in).
+            // text_in_t stays in the main ctx: the round-10 upload-skip
+            // tracker elides per-step uploads, so pinned-host staging
+            // would only pay off on the single cold-miss upload.
+            {
+                const size_t INPUT_OVERHEAD = ggml_tensor_overhead() * 8;
+                front_cache.input_ctx_storage.assign(INPUT_OVERHEAD, 0);
+                ggml_init_params input_p = { INPUT_OVERHEAD, front_cache.input_ctx_storage.data(), /*no_alloc=*/true };
+                front_cache.input_ctx = ggml_init(input_p);
+                front_cache.x_in = ggml_new_tensor_2d(front_cache.input_ctx, GGML_TYPE_F32, L, Cin);
+                ggml_set_name(front_cache.x_in, "ve_latent_tc");
+                ggml_set_input(front_cache.x_in);
+                front_cache.mask_in = ggml_new_tensor_1d(front_cache.input_ctx, GGML_TYPE_F32, L);
+                ggml_set_name(front_cache.mask_in, "ve_latent_mask");
+                ggml_set_input(front_cache.mask_in);
+                front_cache.t_emb_in = ggml_new_tensor_1d(front_cache.input_ctx, GGML_TYPE_F32, 64);
+                ggml_set_name(front_cache.t_emb_in, "ve_time_emb");
+                ggml_set_input(front_cache.t_emb_in);
+                // QVAC-18605 round 13 #1 — consolidated allocator
+                // (round-12 inlined the try-pinned-host + fallback
+                // boilerplate; this round factors it out via
+                // `alloc_input_scratchpad_or_throw`).
+                front_cache.input_buf = alloc_input_scratchpad_or_throw(
+                    model, front_cache.input_ctx, "ve_front_block_graph_cache");
+            }
+            front_cache.text_in_t = ggml_new_tensor_2d(front_cache.ctx, GGML_TYPE_F32, text_len, 256);
+            ggml_set_name(front_cache.text_in_t, "ve_text_lc");
+            // text_in_t is uploaded once per synth (round-10 upload-skip
+            // tracker — `current_step == 0` resets, every other step
+            // skips the upload as the host pointer is stable).  Without
+            // OUTPUT the gallocr-managed buffer is freed after step 0's
+            // last consumer runs and aliased with step 1's intermediates,
+            // silently corrupting the text embedding for steps 1..N-1.
+            // INPUT alone protects the initial allocation but not the
+            // buffer's lifetime across compute passes.  See the matching
+            // notes on the relpos masks + RoPE cos/sin tables.
+            ggml_set_input(front_cache.text_in_t);  ggml_set_output(front_cache.text_in_t);
+
+            ggml_tensor * y_t = conv1d_f32(front_cache.ctx,
+                require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.proj_in.net.weight"),
+                front_cache.x_in, 1, 0, 1);
+            ggml_tensor * masked_t = ggml_mul(front_cache.ctx, y_t,
+                repeat_like(front_cache.ctx, front_cache.mask_in, y_t));
+            ggml_set_name(masked_t, "ve_masked");
             if (include_ggml_trace) {
-                const std::string name = "ve_block0_convnext" + std::to_string(j);
-                ggml_set_name(cur, name.c_str());
-                ggml_set_output(cur);
-                ggml_build_forward_expand(gf, cur);
+                ggml_set_output(masked_t);
+                ggml_build_forward_expand(front_cache.gf, masked_t);
+            }
+            ggml_tensor * cur_t = masked_t;
+            int dils_ggml[4] = {1, 2, 4, 8};
+            for (int j = 0; j < 4; ++j) {
+                cur_t = vector_convnext_ggml(front_cache.ctx, model,
+                    "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext." + std::to_string(j),
+                    cur_t, dils_ggml[j]);
+                if (include_ggml_trace) {
+                    const std::string name = "ve_block0_convnext" + std::to_string(j);
+                    ggml_set_name(cur_t, name.c_str());
+                    ggml_set_output(cur_t);
+                    ggml_build_forward_expand(front_cache.gf, cur_t);
+                }
+            }
+
+            // F6 pre-transposed t_proj companion or fallback.
+            ggml_tensor * t_proj_w_t;
+            {
+                auto pretrans_it = model.source_tensors.find("vector_estimator:onnx::MatMul_3095__T");
+                t_proj_w_t = (pretrans_it != model.source_tensors.end()) ? pretrans_it->second : nullptr;
+                if (!t_proj_w_t) {
+                    t_proj_w_t = ggml_cont(front_cache.ctx, ggml_transpose(front_cache.ctx,
+                        require_source_tensor(model, "vector_estimator:onnx::MatMul_3095")));
+                }
+            }
+            ggml_tensor * t_proj = ggml_mul_mat(front_cache.ctx, t_proj_w_t,
+                ggml_reshape_2d(front_cache.ctx, front_cache.t_emb_in, 64, 1));
+            t_proj = ggml_add(front_cache.ctx, t_proj,
+                ggml_reshape_2d(front_cache.ctx,
+                    require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.1.linear.linear.bias"),
+                    C, 1));
+            cur_t = ggml_add(front_cache.ctx, cur_t, repeat_like(front_cache.ctx, t_proj, cur_t));
+            ggml_set_name(cur_t, "ve_time_add0");
+            if (include_ggml_trace) {
+                ggml_set_output(cur_t);
+                ggml_build_forward_expand(front_cache.gf, cur_t);
+            }
+
+            cur_t = vector_convnext_ggml(front_cache.ctx, model,
+                "vector_estimator:tts.ttl.vector_field.main_blocks.2.convnext.0",
+                cur_t, 1);
+            ggml_set_name(cur_t, "ve_block2_convnext0");
+            ggml_set_output(cur_t);
+            ggml_build_forward_expand(front_cache.gf, cur_t);
+            ggml_tensor * q_t = dense_matmul_time_ggml(front_cache.ctx, cur_t,
+                require_source_tensor(model, "vector_estimator:onnx::MatMul_3101"),
+                require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_query.linear.bias"));
+            ggml_set_name(q_t, "ve_attn0_q");
+            ggml_set_output(q_t);
+            ggml_build_forward_expand(front_cache.gf, q_t);
+            ggml_tensor * k_t = dense_matmul_time_ggml(front_cache.ctx, front_cache.text_in_t,
+                require_source_tensor(model, "vector_estimator:onnx::MatMul_3102"),
+                require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_key.linear.bias"));
+            ggml_set_name(k_t, "ve_attn0_k");
+            ggml_set_output(k_t);
+            ggml_build_forward_expand(front_cache.gf, k_t);
+            // QVAC-18966 — pack V into the layout
+            // `run_text_attention_cache_gpu` consumes via
+            // `ggml_backend_tensor_copy(v_src, v_tc_in)`.  See the
+            // identical transpose in `build_group_graph_cache` +
+            // the header doc on `apply_rope_to_packed_qk` in
+            // `supertonic_internal.h`.  Matmul output is ne=[L_kv,
+            // HD] channel-major-flat; v_tc_in expects ne=[HD,
+            // L_kv] time-major-flat.  Legacy host bridge
+            // downloads `ve_attn0_v` via `tensor_raw_f32` to get
+            // bytes in the time-major-flat shape scalar
+            // `apply_rope` / `flash_attention_qkv` references.
+            ggml_tensor * v_matmul = dense_matmul_time_ggml(front_cache.ctx, front_cache.text_in_t,
+                require_source_tensor(model, "vector_estimator:onnx::MatMul_3103"),
+                require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_value.linear.bias"));
+            ggml_tensor * v_t = ggml_cont(front_cache.ctx,
+                ggml_transpose(front_cache.ctx, v_matmul));
+            ggml_set_name(v_t, "ve_attn0_v");
+            ggml_set_output(v_t);
+            ggml_build_forward_expand(front_cache.gf, v_t);
+
+            // F23 — same in-graph RoPE wiring as the per-group
+            // graph cache: produce post-rotation
+            // `ve_attn0_q_rope` / `ve_attn0_k_rope` outputs so the
+            // call site below can drop the host `apply_rope`
+            // round-trips.  Falls through to the legacy host
+            // rotation path when the GGUF didn't ship theta.
+            const int FRONT_H = 4;
+            const int FRONT_D = 64;
+            const int FRONT_HALF = FRONT_D / 2;
+            front_cache.apply_rope =
+                (int) model.vector_rope_theta.size() == FRONT_HALF;
+            if (front_cache.apply_rope) {
+                // RoPE cos/sin tables are cache-lifetime constants
+                // (depend only on L / text_len / θ).  Mark INPUT + OUTPUT
+                // so gallocr keeps the buffers alive across compute
+                // passes — see the matching note in build_group_graph_cache.
+                front_cache.q_cos_in = ggml_new_tensor_2d(front_cache.ctx,
+                    GGML_TYPE_F32, FRONT_HALF, L);
+                ggml_set_name(front_cache.q_cos_in, "ve_attn0_q_rope_cos");
+                ggml_set_input(front_cache.q_cos_in);  ggml_set_output(front_cache.q_cos_in);
+                front_cache.q_sin_in = ggml_new_tensor_2d(front_cache.ctx,
+                    GGML_TYPE_F32, FRONT_HALF, L);
+                ggml_set_name(front_cache.q_sin_in, "ve_attn0_q_rope_sin");
+                ggml_set_input(front_cache.q_sin_in);  ggml_set_output(front_cache.q_sin_in);
+                front_cache.k_cos_in = ggml_new_tensor_2d(front_cache.ctx,
+                    GGML_TYPE_F32, FRONT_HALF, text_len);
+                ggml_set_name(front_cache.k_cos_in, "ve_attn0_k_rope_cos");
+                ggml_set_input(front_cache.k_cos_in);  ggml_set_output(front_cache.k_cos_in);
+                front_cache.k_sin_in = ggml_new_tensor_2d(front_cache.ctx,
+                    GGML_TYPE_F32, FRONT_HALF, text_len);
+                ggml_set_name(front_cache.k_sin_in, "ve_attn0_k_rope_sin");
+                ggml_set_input(front_cache.k_sin_in);  ggml_set_output(front_cache.k_sin_in);
+                ggml_tensor * q_rope = apply_rope_to_packed_qk(front_cache.ctx,
+                    q_t, front_cache.q_cos_in, front_cache.q_sin_in,
+                    FRONT_H, FRONT_D);
+                ggml_set_name(q_rope, "ve_attn0_q_rope");
+                ggml_set_output(q_rope);
+                ggml_build_forward_expand(front_cache.gf, q_rope);
+                ggml_tensor * k_rope = apply_rope_to_packed_qk(front_cache.ctx,
+                    k_t, front_cache.k_cos_in, front_cache.k_sin_in,
+                    FRONT_H, FRONT_D);
+                ggml_set_name(k_rope, "ve_attn0_k_rope");
+                ggml_set_output(k_rope);
+                ggml_build_forward_expand(front_cache.gf, k_rope);
             }
-        }
 
-        ggml_tensor * t_proj = ggml_mul_mat(ctx,
-            ggml_cont(ctx, ggml_transpose(ctx, require_source_tensor(model, "vector_estimator:onnx::MatMul_3095"))),
-            ggml_reshape_2d(ctx, t_emb, 64, 1));
-        t_proj = ggml_add(ctx, t_proj,
-            ggml_reshape_2d(ctx,
-                require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.1.linear.linear.bias"),
-                C, 1));
-        cur = ggml_add(ctx, cur, repeat_like(ctx, t_proj, cur));
-        ggml_set_name(cur, "ve_time_add0");
-        if (include_ggml_trace) {
-            ggml_set_output(cur);
-            ggml_build_forward_expand(gf, cur);
-        }
-
-        cur = vector_convnext_ggml(ctx, model,
-            "vector_estimator:tts.ttl.vector_field.main_blocks.2.convnext.0",
-            cur, 1);
-        ggml_set_name(cur, "ve_block2_convnext0");
-        ggml_set_output(cur);
-        ggml_build_forward_expand(gf, cur);
-        ggml_tensor * q_t = dense_matmul_time_ggml(ctx, cur,
-            require_source_tensor(model, "vector_estimator:onnx::MatMul_3101"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_query.linear.bias"));
-        ggml_set_name(q_t, "ve_attn0_q");
-        ggml_set_output(q_t);
-        ggml_build_forward_expand(gf, q_t);
-        ggml_tensor * k_t = dense_matmul_time_ggml(ctx, text_in,
-            require_source_tensor(model, "vector_estimator:onnx::MatMul_3102"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_key.linear.bias"));
-        ggml_set_name(k_t, "ve_attn0_k");
-        ggml_set_output(k_t);
-        ggml_build_forward_expand(gf, k_t);
-        ggml_tensor * v_t = dense_matmul_time_ggml(ctx, text_in,
-            require_source_tensor(model, "vector_estimator:onnx::MatMul_3103"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_value.linear.bias"));
-        ggml_set_name(v_t, "ve_attn0_v");
-        ggml_set_output(v_t);
-        ggml_build_forward_expand(gf, v_t);
-
-        supertonic_sched_alloc(model, gf);
+            front_cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+            if (!front_cache.allocr) {
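+                ggml_free(front_cache.ctx); if (front_cache.input_buf) ggml_backend_buffer_free(front_cache.input_buf); if (front_cache.input_ctx) ggml_free(front_cache.input_ctx);  // release the input scratchpad too; `front_cache = {}` would otherwise leak it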
+                front_cache = {};
+                throw std::runtime_error("ggml_gallocr_new failed");
+            }
+            if (!ggml_gallocr_reserve(front_cache.allocr, front_cache.gf)) {
+                ggml_gallocr_free(front_cache.allocr);
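+                ggml_free(front_cache.ctx); if (front_cache.input_buf) ggml_backend_buffer_free(front_cache.input_buf); if (front_cache.input_ctx) ggml_free(front_cache.input_ctx);  // release the input scratchpad too; `front_cache = {}` would otherwise leak it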
+                front_cache = {};
+                throw std::runtime_error("ggml_gallocr_reserve failed");
+            }
+            ggml_gallocr_alloc_graph(front_cache.allocr, front_cache.gf);
+
+            // F23 — upload cos/sin tables for the in-graph RoPE
+            // rotation.  These inputs depend only on (L, text_len,
+            // theta), all stable for the cache's lifetime; the
+            // upload is one-shot at build time.
+            if (front_cache.apply_rope) {
+                // FRONT_HALF (= FRONT_D / 2 = 32) is still in scope from the RoPE wiring above.
+                std::vector<float> q_cos, q_sin, k_cos, k_sin;
+                make_rope_cos_sin_tables(model.vector_rope_theta.data(),
+                                         L, FRONT_HALF, q_cos, q_sin);
+                make_rope_cos_sin_tables(model.vector_rope_theta.data(),
+                                         text_len, FRONT_HALF, k_cos, k_sin);
+                ggml_backend_tensor_set(front_cache.q_cos_in, q_cos.data(),
+                                        0, q_cos.size() * sizeof(float));
+                ggml_backend_tensor_set(front_cache.q_sin_in, q_sin.data(),
+                                        0, q_sin.size() * sizeof(float));
+                ggml_backend_tensor_set(front_cache.k_cos_in, k_cos.data(),
+                                        0, k_cos.size() * sizeof(float));
+                ggml_backend_tensor_set(front_cache.k_sin_in, k_sin.data(),
+                                        0, k_sin.size() * sizeof(float));
+            }
+        }
+        // QVAC-18605 round 12: reuse-or-rebuild is done; expose the
+        // cache's compute graph and input tensors under the variable
+        // names the rest of this scope already uses.  front_cache
+        // builds these nodes (ve_time_add0, ve_block2_convnext0,
+        // ve_attn0_q/k/v, optional rope outputs) ONCE at cache-build
+        // time and reuses them across the 5 denoise-step calls,
+        // whereas the previous inline-build path rebuilt them on
+        // every call.  The GPU-bridge path that follows
+        // `profile_vector_compute` below still reads the same named
+        // tensors.
+        ggml_cgraph * gf = front_cache.gf;
+        ggml_tensor * x = front_cache.x_in;
+        ggml_tensor * mask = front_cache.mask_in;
+        ggml_tensor * t_emb = front_cache.t_emb_in;
+        ggml_tensor * text_in = front_cache.text_in_t;
+        (void) text_in;
+        (void) mask; (void) t_emb;  // referenced via `front_cache.*` below
 
         ggml_backend_tensor_set(x, noisy_latent, 0, (size_t) L * Cin * sizeof(float));
         ggml_backend_tensor_set(mask, latent_mask, 0, (size_t) L * sizeof(float));
-        std::vector<float> te_host = time_embedding(model, current_step, total_steps);
+        // F9: cached time-embedding.  From the second synth onward this
+        // step skips the recompute and the 2 underlying weight
+        // downloads.  `te_host` stays a std::vector<float> because it is
+        // forwarded to `run_group_graph_cache(..., const std::vector<float> & temb, …)`
+        // three times below, and changing that signature would ripple
+        // into the trace harnesses.  The 64-element copy is negligible
+        // next to the GPU sync saved on the underlying read_f32 calls.
+        auto te_arr = cached_time_embedding(model, current_step, total_steps);
+        std::vector<float> te_host(te_arr.begin(), te_arr.end());
         ggml_backend_tensor_set(t_emb, te_host.data(), 0, te_host.size() * sizeof(float));
-        // text_emb is already in (channel, time) layout so the cache that
-        // used to wrap this set was a verbatim copy keyed on a pointer
-        // that never matched twice.  Removed; set the tensor directly
-        // from the caller-owned text_emb buffer.
-        ggml_backend_tensor_set(text_in, text_emb, 0, (size_t) text_len * 256 * sizeof(float));
+        // QVAC-18605 round 10 — text_emb upload-skip.  `text_emb`
+        // is stable within one synth (5 calls × same pointer); skip
+        // the upload on steps 1..N-1 if the pointer matches the
+        // last successful upload's pointer.  Synth-boundary reset
+        // (`current_step == 0`) invalidates the cache so the next
+        // synth's first step always uploads — protects against
+        // the stack-realloc-same-address hazard documented on
+        // `upload_skip_tracker` in supertonic_internal.h.
+        //
+        // The earlier comment "the cache that used to wrap this
+        // was a verbatim copy keyed on a pointer that never
+        // matched twice" referred to a per-call wrapper that
+        // forgot to use a stable cache instance — round 10 fixes
+        // that by storing the tracker on the (thread_local)
+        // front_cache instance, so consecutive `current_step`
+        // values within the same synth see a populated tracker.
+        if (current_step == 0) front_cache.text_in_skip.reset();
+        if (front_cache.text_in_skip.needs_upload(text_emb)) {
+            ggml_backend_tensor_set(text_in, text_emb, 0, (size_t) text_len * 256 * sizeof(float));
+            front_cache.text_in_skip.mark_uploaded(text_emb);
+        }
         profile_vector_compute(model, gf, current_step, "front_proj_attn0_qkv");
 
         PUSH_GGML_TRACE({"ve_latent_tc", {L, Cin}, in});
@@ -2213,25 +3657,127 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
         PUSH_GGML_TRACE({"ve_time_add0", {L, C}, tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_time_add0"))});
         std::vector<float> block2_ggml = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_block2_convnext0"));
         PUSH_GGML_TRACE({"ve_block2_convnext0", {L, C}, block2_ggml});
-        std::vector<float> q_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_q"));
-        std::vector<float> k_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_k"));
-        std::vector<float> v_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_v"));
-        PUSH_GGML_TRACE({"ve_attn0_q", {L, 256}, q_out});
-        PUSH_GGML_TRACE({"ve_attn0_k", {text_len, 256}, k_out});
-        PUSH_GGML_TRACE({"ve_attn0_v", {text_len, 256}, v_out});
-        f32_tensor theta = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-        apply_rope(theta.data.data(), q_out, L, 4, 64);
-        apply_rope(theta.data.data(), k_out, text_len, 4, 64);
+        // QVAC-18605 round 8 — front-block attn0 GPU bridge.
+        //
+        // PR #16's audit follow-up #6 (2C-lite) shipped the GPU
+        // device→device blit infrastructure (`run_text_attention_cache_gpu`)
+        // and wired g1 / g2 / g3 group attentions to use it.  The
+        // front-block attn0 site was deferred because of cache-
+        // lifetime concerns at the time; round 8 picks it up.
+        //
+        // The front_cache (`ve_front_block_graph_cache` in the
+        // outer scope) is `thread_local` and stable across calls
+        // (rebuilds only on shape change L / text_len /
+        // trace_outputs).  After `profile_vector_compute` returns,
+        // the named output tensors `ve_attn0_v` and (when
+        // `apply_rope` is true) `ve_attn0_q_rope` /
+        // `ve_attn0_k_rope` are valid GPU handles for the
+        // duration of the next attention compute.  Same lifetime
+        // guarantee as the g1/g2/g3 caches → safe to pass into
+        // `run_text_attention_cache_gpu`.
+        //
+        // Eliminates per call: 3 GPU→host downloads + 3 host→GPU
+        // uploads.  Across 5 denoise steps that is 5 × (3 + 3) =
+        // 30 sync points / synth.  Production path only: trace
+        // mode still takes the legacy host-bridge path so the
+        // trace dump captures pre-attention Q/K/V host vectors.
+        //
+        // Note: the legacy host-bridge fallback below downloads V
+        // with `tensor_raw_f32(v_gpu_attn0)`, not
+        // `tensor_to_time_channel(...)`: after the QVAC-18966
+        // layout fix `ve_attn0_v` is
+        // `ggml_cont(ggml_transpose(...))`-shaped.
+        ggml_tensor * v_gpu_attn0      = ggml_graph_get_tensor(gf, "ve_attn0_v");
+        ggml_tensor * q_rope_gpu_attn0 = ggml_graph_get_tensor(gf, "ve_attn0_q_rope");
+        ggml_tensor * k_rope_gpu_attn0 = ggml_graph_get_tensor(gf, "ve_attn0_k_rope");
+        const bool front_in_graph_rope = (q_rope_gpu_attn0 != nullptr);
+        const bool front_use_gpu_bridge = front_in_graph_rope && !include_ggml_trace
+                                          && v_gpu_attn0 && k_rope_gpu_attn0;
+        std::vector<float> q_out, k_out, q_rotated, k_rotated, v_out;
         thread_local vector_text_attention_cache att0_cache;
         std::vector<float> att0_ctx_trace;
-        std::vector<float> attn_out_ggml = run_text_attention_cache(att0_cache, model, q_out, k_out, v_out,
-            L, text_len, 4, 64,
-            "vector_estimator:onnx::MatMul_3110",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.out_fc.linear.bias",
-            current_step, "attn0_flash",
-            include_ggml_trace ? &att0_ctx_trace : nullptr);
-        PUSH_GGML_TRACE({"ve_attn0_q_rope", {L, 256}, q_out});
-        PUSH_GGML_TRACE({"ve_attn0_k_rope", {text_len, 256}, k_out});
+        std::vector<float> attn_out_ggml;
+        if (front_use_gpu_bridge) {
+            // Fast path: device→device blit, host never sees Q/K/V.
+            // Mirrors the g1/g2/g3 dispatch at lines 2926-2933.
+            attn_out_ggml = run_text_attention_cache_gpu(att0_cache, model,
+                q_rope_gpu_attn0, k_rope_gpu_attn0, v_gpu_attn0,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3110",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.out_fc.linear.bias",
+                current_step, "attn0_flash",
+                /*ctx_trace=*/ nullptr);
+        } else {
+            // Legacy / trace-mode host bridge.  Falls back to the
+            // pre-round-8 download + rotate + upload pattern.
+            //
+            // QVAC-18605 follow-up — post-fix V graph layout:
+            // `ve_attn0_v` is now `ggml_cont(ggml_transpose(...))`
+            // of the matmul output (ne=[HD, text_len] time-major-
+            // flat memory).  `tensor_raw_f32` downloads the bytes
+            // directly in the layout scalar `apply_rope` /
+            // `flash_attention_qkv` host references expect
+            // (`v[t*HD + c]`).  Using `tensor_to_time_channel`
+            // here would mis-interpret the swapped ne.  See the
+            // header doc on `apply_rope_to_packed_qk` in
+            // `supertonic_internal.h`.  Q/K matmul outputs are
+            // UNCHANGED (still ne=[L, HD] channel-major-flat) so
+            // `tensor_to_time_channel` is the right call there.
+            v_out = tensor_raw_f32(v_gpu_attn0);
+            if (include_ggml_trace) {
+                q_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_q"));
+                k_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_k"));
+                PUSH_GGML_TRACE({"ve_attn0_q", {L, 256}, q_out});
+                PUSH_GGML_TRACE({"ve_attn0_k", {text_len, 256}, k_out});
+                PUSH_GGML_TRACE({"ve_attn0_v", {text_len, 256}, v_out});
+            }
+            // F23 — when the front-block graph has the in-graph
+            // RoPE wired in (model carries `vector_rope_theta`),
+            // feed `run_text_attention_cache` the already-rotated
+            // Q/K from the `_rope` graph outputs.  Host
+            // `apply_rope(theta, …)` is fully eliminated on the
+            // in-graph-rope path.
+            if (front_in_graph_rope) {
+                // QVAC-18605 follow-up — post-fix layout contract:
+                // `apply_rope_to_packed_qk` produces ne=[HD, L]
+                // with time-major-flat memory (`data[c + t*HD]`),
+                // which is bit-identical to scalar `apply_rope`'s
+                // output buffer.  `tensor_raw_f32` downloads those
+                // bytes directly, no transpose needed (using
+                // `tensor_to_time_channel` here would mis-interpret
+                // the ne shape and transpose an already-transposed
+                // buffer, silently feeding wrong-orientation
+                // Q/K into the attention).  See the header doc on
+                // `apply_rope_to_packed_qk` in
+                // `supertonic_internal.h`.
+                q_rotated = tensor_raw_f32(q_rope_gpu_attn0);
+                k_rotated = tensor_raw_f32(k_rope_gpu_attn0);
+            } else {
+                // Legacy GGUF path: rotate host-side.
+                if (q_out.empty()) {
+                    q_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_q"));
+                    k_out = tensor_to_time_channel(ggml_graph_get_tensor(gf, "ve_attn0_k"));
+                }
+                const float * theta = model.vector_rope_theta.data();
+                apply_rope(theta, q_out, L, 4, 64);
+                apply_rope(theta, k_out, text_len, 4, 64);
+                q_rotated = std::move(q_out);
+                k_rotated = std::move(k_out);
+            }
+            attn_out_ggml = run_text_attention_cache(att0_cache, model, q_rotated, k_rotated, v_out,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3110",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.out_fc.linear.bias",
+                current_step, "attn0_flash",
+                include_ggml_trace ? &att0_ctx_trace : nullptr);
+        }
+        // Trace pushes — `q_rotated` / `k_rotated` are populated
+        // by the legacy branch above; empty on the GPU-bridge
+        // path (in which case `PUSH_GGML_TRACE` is a no-op
+        // because `include_ggml_trace == false`).  Matches the
+        // g1/g2/g3 trace-push pattern at lines 2955-2956.
+        PUSH_GGML_TRACE({"ve_attn0_q_rope", {L, 256}, q_rotated});
+        PUSH_GGML_TRACE({"ve_attn0_k_rope", {text_len, 256}, k_rotated});
         PUSH_GGML_TRACE({"ve_attn0_ctx", {L, 256}, att0_ctx_trace});
         PUSH_GGML_TRACE({"ve_attn0_out", {L, C}, attn_out_ggml});
 
@@ -2255,51 +3801,52 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "attn0_residual_style_qkv",
             include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> post_ggml = std::move(style0_res_qkv.post);
-        std::vector<float> sq_out = std::move(style0_res_qkv.sq);
-        std::vector<float> sk_out = std::move(style0_res_qkv.sk);
-        std::vector<float> sv_out = std::move(style0_res_qkv.sv);
+        // QVAC-18605 round 9 — style flash-attn GPU bridge for
+        // style0 (front-block style residual).  Same dispatch
+        // pattern as the round-8 front-block attn0 bridge:
+        // production path uses `run_text_attention_cache_gpu`
+        // with the GPU handles from the res-style-qkv cache,
+        // trace mode falls back to the legacy host bridge so
+        // the trace harness still gets the host vectors.
         thread_local vector_text_attention_cache style0_attn_cache;
         std::vector<float> style0_ctx_trace;
-        std::vector<float> style_out_ggml = run_text_attention_cache(style0_attn_cache, model, sq_out, sk_out, sv_out,
-            L, 50, 2, 128,
-            "vector_estimator:onnx::MatMul_3119",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.5.attention.out_fc.linear.bias",
-            current_step, "style0_flash",
-            include_ggml_trace ? &style0_ctx_trace : nullptr);
+        std::vector<float> style_out_ggml;
+        const bool style0_use_gpu_bridge = !include_ggml_trace
+            && style0_res_qkv.sq_gpu && style0_res_qkv.sk_gpu && style0_res_qkv.sv_gpu;
+        if (style0_use_gpu_bridge) {
+            style_out_ggml = run_text_attention_cache_gpu(style0_attn_cache, model,
+                style0_res_qkv.sq_gpu, style0_res_qkv.sk_gpu, style0_res_qkv.sv_gpu,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3119",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.5.attention.out_fc.linear.bias",
+                current_step, "style0_flash",
+                /*ctx_trace=*/ nullptr);
+        } else {
+            std::vector<float> sq_out = std::move(style0_res_qkv.sq);
+            std::vector<float> sk_out = std::move(style0_res_qkv.sk);
+            std::vector<float> sv_out = std::move(style0_res_qkv.sv);
+            style_out_ggml = run_text_attention_cache(style0_attn_cache, model, sq_out, sk_out, sv_out,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3119",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.5.attention.out_fc.linear.bias",
+                current_step, "style0_flash",
+                include_ggml_trace ? &style0_ctx_trace : nullptr);
+        }
         PUSH_GGML_TRACE({"ve_style0_ctx", {L, 256}, style0_ctx_trace});
         PUSH_GGML_TRACE({"ve_style0_out", {L, C}, style_out_ggml});
-        constexpr int STYLE_RES_NODES = 128;
-        static size_t style_res_buf_size = ggml_tensor_overhead() * STYLE_RES_NODES +
-                                           ggml_graph_overhead_custom(STYLE_RES_NODES, false);
-        thread_local std::vector<uint8_t> style_res_buf(style_res_buf_size);
-        ggml_init_params srp = { style_res_buf_size, style_res_buf.data(), true };
-        ggml_context * srctx = ggml_init(srp);
-        ggml_cgraph * srgf = ggml_new_graph_custom(srctx, STYLE_RES_NODES, false);
-        ggml_tensor * style_out_in = ggml_new_tensor_2d(srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(style_out_in, "style_out_in"); ggml_set_input(style_out_in);
-        ggml_tensor * style_lhs_in = ggml_new_tensor_2d(srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(style_lhs_in, "style_lhs_in"); ggml_set_input(style_lhs_in);
-        ggml_tensor * style_res = ggml_add(srctx, style_lhs_in, style_out_in);
-        ggml_set_name(style_res, "ve_style0_residual");
-        if (include_ggml_trace) {
-            ggml_set_output(style_res);
-            ggml_build_forward_expand(srgf, style_res);
-        }
-        ggml_tensor * style_norm = layer_norm_ggml(srctx, style_res,
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.5.norm.norm.weight"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.5.norm.norm.bias"));
-        ggml_set_name(style_norm, "ve_style0_norm"); ggml_set_output(style_norm);
-        ggml_build_forward_expand(srgf, style_norm);
-        supertonic_sched_alloc(model, srgf);
-        std::vector<float> style_out_raw = pack_time_channel_for_ggml(style_out_ggml, L, C);
-        std::vector<float> style_lhs_raw = pack_time_channel_for_ggml(post_ggml, L, C);
-        ggml_backend_tensor_set(style_out_in, style_out_raw.data(), 0, style_out_raw.size()*sizeof(float));
-        ggml_backend_tensor_set(style_lhs_in, style_lhs_raw.data(), 0, style_lhs_raw.size()*sizeof(float));
-        profile_vector_compute(model, srgf, current_step, "style0_residual");
-        PUSH_GGML_TRACE({"ve_style0_residual", {L, C}, tensor_to_time_channel(ggml_graph_get_tensor(srgf, "ve_style0_residual"))});
-        std::vector<float> style_norm_ggml = tensor_to_time_channel(ggml_graph_get_tensor(srgf, "ve_style0_norm"));
+        // F8: cached style-residual graph (lhs + out → add → LN).
+        // norm_block = 5 for the front-block style residual.
+        // QVAC-18605 round 12 — `run_style_residual_cache` keeps a
+        // thread_local graph across calls; master's inline-build
+        // equivalent has been deliberately replaced by the cache.
+        thread_local vector_style_residual_graph_cache style0_res_cache;
+        std::vector<float> style0_res_trace;
+        std::vector<float> style_norm_ggml = run_style_residual_cache(
+            style0_res_cache, model, post_ggml, style_out_ggml,
+            L, C, /*norm_block=*/5, current_step, "style0_residual",
+            include_ggml_trace ? &style0_res_trace : nullptr);
+        PUSH_GGML_TRACE({"ve_style0_residual", {L, C}, style0_res_trace});
         PUSH_GGML_TRACE({"ve_style0_norm", {L, C}, style_norm_ggml});
-        ggml_free(srctx);
 
         thread_local vector_group_graph_cache g1_group_cache;
         vector_group_graph_result g1_group = run_group_graph_cache(g1_group_cache, model, style_norm_ggml,
@@ -2311,22 +3858,48 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "ve_g1_attn_q", "ve_g1_attn_k", "ve_g1_attn_v",
             "group1_conv_attn_qkv", include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> g1_block8 = std::move(g1_group.post);
-        std::vector<float> g1q_out = std::move(g1_group.q);
-        std::vector<float> g1k_out = std::move(g1_group.k);
-        std::vector<float> g1v_out = std::move(g1_group.v);
-        f32_tensor theta_g1 = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-        apply_rope(theta_g1.data.data(), g1q_out, L, 4, 64);
-        apply_rope(theta_g1.data.data(), g1k_out, text_len, 4, 64);
+        // 2C-lite — production fast path: pass GPU tensor handles
+        // straight from the group cache into the attention cache
+        // via `ggml_backend_tensor_copy`.  Host vectors for
+        // q/k/v/q_rope/k_rope are empty in production (gated on
+        // `trace != nullptr` inside `run_group_graph_cache`), so
+        // we MUST use the *_gpu pointers when present.  Falls
+        // back to the legacy host rotation path when the cache
+        // didn't wire RoPE in graph (e.g. malformed GGUF).
         thread_local vector_text_attention_cache g1_attn_cache;
         std::vector<float> g1_attn_ctx_trace;
-        std::vector<float> g1_attn_out = run_text_attention_cache(g1_attn_cache, model, g1q_out, g1k_out, g1v_out,
-            L, text_len, 4, 64,
-            "vector_estimator:onnx::MatMul_3155",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.9.attn.out_fc.linear.bias",
-            current_step, "g1_attn_flash",
-            include_ggml_trace ? &g1_attn_ctx_trace : nullptr);
-        PUSH_GGML_TRACE({"ve_g1_attn_q_rope", {L, 256}, g1q_out});
-        PUSH_GGML_TRACE({"ve_g1_attn_k_rope", {text_len, 256}, g1k_out});
+        std::vector<float> g1_attn_out;
+        if (g1_group.q_rope_gpu && g1_group.k_rope_gpu && g1_group.v_gpu) {
+            g1_attn_out = run_text_attention_cache_gpu(g1_attn_cache, model,
+                g1_group.q_rope_gpu, g1_group.k_rope_gpu, g1_group.v_gpu,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3155",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.9.attn.out_fc.linear.bias",
+                current_step, "g1_attn_flash",
+                include_ggml_trace ? &g1_attn_ctx_trace : nullptr);
+        } else {
+            std::vector<float> g1q_out = std::move(g1_group.q);
+            std::vector<float> g1k_out = std::move(g1_group.k);
+            std::vector<float> g1v_out = std::move(g1_group.v);
+            std::vector<float> g1q_rotated = g1q_out;
+            std::vector<float> g1k_rotated = g1k_out;
+            const float * theta_g1 = model.vector_rope_theta.data();
+            apply_rope(theta_g1, g1q_rotated, L, 4, 64);
+            apply_rope(theta_g1, g1k_rotated, text_len, 4, 64);
+            g1_attn_out = run_text_attention_cache(g1_attn_cache, model,
+                g1q_rotated, g1k_rotated, g1v_out,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3155",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.9.attn.out_fc.linear.bias",
+                current_step, "g1_attn_flash",
+                include_ggml_trace ? &g1_attn_ctx_trace : nullptr);
+        }
+        // Trace pushes — use the host vectors the group cache
+        // downloaded under its `if (trace)` guard.  Empty when
+        // include_ggml_trace is false (PUSH_GGML_TRACE is a no-op
+        // in that case).
+        PUSH_GGML_TRACE({"ve_g1_attn_q_rope", {L, 256}, g1_group.q_rope});
+        PUSH_GGML_TRACE({"ve_g1_attn_k_rope", {text_len, 256}, g1_group.k_rope});
         PUSH_GGML_TRACE({"ve_g1_attn_ctx", {L, 256}, g1_attn_ctx_trace});
         PUSH_GGML_TRACE({"ve_g1_attn_out", {L, C}, g1_attn_out});
 
@@ -2347,52 +3920,45 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "g1_attn_residual_style_qkv",
             include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> g1_block10 = std::move(g1_res_qkv.post);
-        std::vector<float> g1sq_out = std::move(g1_res_qkv.sq);
-        std::vector<float> g1sk_out = std::move(g1_res_qkv.sk);
-        std::vector<float> g1sv_out = std::move(g1_res_qkv.sv);
+        // QVAC-18605 round 9 — style flash-attn GPU bridge for g1.
         thread_local vector_text_attention_cache g1_style_attn_cache;
         std::vector<float> g1_style_ctx_trace;
-        std::vector<float> g1_style_out = run_text_attention_cache(g1_style_attn_cache, model, g1sq_out, g1sk_out, g1sv_out,
-            L, 50, 2, 128,
-            "vector_estimator:onnx::MatMul_3164",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.11.attention.out_fc.linear.bias",
-            current_step, "g1_style_flash",
-            include_ggml_trace ? &g1_style_ctx_trace : nullptr);
+        std::vector<float> g1_style_out;
+        const bool g1_style_use_gpu_bridge = !include_ggml_trace
+            && g1_res_qkv.sq_gpu && g1_res_qkv.sk_gpu && g1_res_qkv.sv_gpu;
+        if (g1_style_use_gpu_bridge) {
+            g1_style_out = run_text_attention_cache_gpu(g1_style_attn_cache, model,
+                g1_res_qkv.sq_gpu, g1_res_qkv.sk_gpu, g1_res_qkv.sv_gpu,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3164",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.11.attention.out_fc.linear.bias",
+                current_step, "g1_style_flash",
+                /*ctx_trace=*/ nullptr);
+        } else {
+            std::vector<float> g1sq_out = std::move(g1_res_qkv.sq);
+            std::vector<float> g1sk_out = std::move(g1_res_qkv.sk);
+            std::vector<float> g1sv_out = std::move(g1_res_qkv.sv);
+            g1_style_out = run_text_attention_cache(g1_style_attn_cache, model, g1sq_out, g1sk_out, g1sv_out,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3164",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.11.attention.out_fc.linear.bias",
+                current_step, "g1_style_flash",
+                include_ggml_trace ? &g1_style_ctx_trace : nullptr);
+        }
         PUSH_GGML_TRACE({"ve_g1_style_ctx", {L, 256}, g1_style_ctx_trace});
         PUSH_GGML_TRACE({"ve_g1_style_out", {L, C}, g1_style_out});
 
-        constexpr int G1_STYLE_RES_NODES = 128;
-        static size_t g1_style_res_buf_size = ggml_tensor_overhead() * G1_STYLE_RES_NODES +
-                                              ggml_graph_overhead_custom(G1_STYLE_RES_NODES, false);
-        thread_local std::vector<uint8_t> g1_style_res_buf(g1_style_res_buf_size);
-        ggml_init_params g1srp = { g1_style_res_buf_size, g1_style_res_buf.data(), true };
-        ggml_context * g1srctx = ggml_init(g1srp);
-        ggml_cgraph * g1srgf = ggml_new_graph_custom(g1srctx, G1_STYLE_RES_NODES, false);
-        ggml_tensor * g1_style_lhs = ggml_new_tensor_2d(g1srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(g1_style_lhs, "g1_style_lhs"); ggml_set_input(g1_style_lhs);
-        ggml_tensor * g1_style_out_in = ggml_new_tensor_2d(g1srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(g1_style_out_in, "g1_style_out_in"); ggml_set_input(g1_style_out_in);
-        ggml_tensor * g1_style_res = ggml_add(g1srctx, g1_style_lhs, g1_style_out_in);
-        ggml_set_name(g1_style_res, "ve_g1_style_residual");
-        if (include_ggml_trace) {
-            ggml_set_output(g1_style_res);
-            ggml_build_forward_expand(g1srgf, g1_style_res);
-        }
-        ggml_tensor * g1_style_norm = layer_norm_ggml(g1srctx, g1_style_res,
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.11.norm.norm.weight"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.11.norm.norm.bias"));
-        ggml_set_name(g1_style_norm, "ve_g1_style_norm"); ggml_set_output(g1_style_norm);
-        ggml_build_forward_expand(g1srgf, g1_style_norm);
-        supertonic_sched_alloc(model, g1srgf);
-        std::vector<float> g1_style_lhs_raw = pack_time_channel_for_ggml(g1_block10, L, C);
-        std::vector<float> g1_style_out_raw = pack_time_channel_for_ggml(g1_style_out, L, C);
-        ggml_backend_tensor_set(g1_style_lhs, g1_style_lhs_raw.data(), 0, g1_style_lhs_raw.size()*sizeof(float));
-        ggml_backend_tensor_set(g1_style_out_in, g1_style_out_raw.data(), 0, g1_style_out_raw.size()*sizeof(float));
-        profile_vector_compute(model, g1srgf, current_step, "g1_style_residual");
-        PUSH_GGML_TRACE({"ve_g1_style_residual", {L, C}, tensor_to_time_channel(ggml_graph_get_tensor(g1srgf, "ve_g1_style_residual"))});
-        std::vector<float> g1_style_norm_vec = tensor_to_time_channel(ggml_graph_get_tensor(g1srgf, "ve_g1_style_norm"));
+        // F8: cached style-residual graph (norm_block = 11 for group 1).
+        // Mirrors the style0_residual block: the cache is reused across
+        // calls and replaces master's inline graph build.
+        thread_local vector_style_residual_graph_cache g1_style_res_cache;
+        std::vector<float> g1_style_res_trace;
+        std::vector<float> g1_style_norm_vec = run_style_residual_cache(
+            g1_style_res_cache, model, g1_block10, g1_style_out,
+            L, C, /*norm_block=*/11, current_step, "g1_style_residual",
+            include_ggml_trace ? &g1_style_res_trace : nullptr);
+        PUSH_GGML_TRACE({"ve_g1_style_residual", {L, C}, g1_style_res_trace});
         PUSH_GGML_TRACE({"ve_g1_style_norm", {L, C}, g1_style_norm_vec});
-        ggml_free(g1srctx);
 
         thread_local vector_group_graph_cache g2_group_cache;
         vector_group_graph_result g2_group = run_group_graph_cache(g2_group_cache, model, g1_style_norm_vec,
@@ -2404,22 +3970,37 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "ve_g2_attn_q", "ve_g2_attn_k", "ve_g2_attn_v",
             "group2_conv_attn_qkv", include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> g2_block14 = std::move(g2_group.post);
-        std::vector<float> g2q_out = std::move(g2_group.q);
-        std::vector<float> g2k_out = std::move(g2_group.k);
-        std::vector<float> g2v_out = std::move(g2_group.v);
-        f32_tensor theta_g2 = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-        apply_rope(theta_g2.data.data(), g2q_out, L, 4, 64);
-        apply_rope(theta_g2.data.data(), g2k_out, text_len, 4, 64);
+        // 2C-lite — same GPU fast-path / host-fallback pattern as g1.
         thread_local vector_text_attention_cache g2_attn_cache;
         std::vector<float> g2_attn_ctx_trace;
-        std::vector<float> g2_attn_out = run_text_attention_cache(g2_attn_cache, model, g2q_out, g2k_out, g2v_out,
-            L, text_len, 4, 64,
-            "vector_estimator:onnx::MatMul_3200",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.15.attn.out_fc.linear.bias",
-            current_step, "g2_attn_flash",
-            include_ggml_trace ? &g2_attn_ctx_trace : nullptr);
-        PUSH_GGML_TRACE({"ve_g2_attn_q_rope", {L, 256}, g2q_out});
-        PUSH_GGML_TRACE({"ve_g2_attn_k_rope", {text_len, 256}, g2k_out});
+        std::vector<float> g2_attn_out;
+        if (g2_group.q_rope_gpu && g2_group.k_rope_gpu && g2_group.v_gpu) {
+            g2_attn_out = run_text_attention_cache_gpu(g2_attn_cache, model,
+                g2_group.q_rope_gpu, g2_group.k_rope_gpu, g2_group.v_gpu,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3200",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.15.attn.out_fc.linear.bias",
+                current_step, "g2_attn_flash",
+                include_ggml_trace ? &g2_attn_ctx_trace : nullptr);
+        } else {
+            std::vector<float> g2q_out = std::move(g2_group.q);
+            std::vector<float> g2k_out = std::move(g2_group.k);
+            std::vector<float> g2v_out = std::move(g2_group.v);
+            std::vector<float> g2q_rotated = g2q_out;
+            std::vector<float> g2k_rotated = g2k_out;
+            const float * theta_g2 = model.vector_rope_theta.data();
+            apply_rope(theta_g2, g2q_rotated, L, 4, 64);
+            apply_rope(theta_g2, g2k_rotated, text_len, 4, 64);
+            g2_attn_out = run_text_attention_cache(g2_attn_cache, model,
+                g2q_rotated, g2k_rotated, g2v_out,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3200",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.15.attn.out_fc.linear.bias",
+                current_step, "g2_attn_flash",
+                include_ggml_trace ? &g2_attn_ctx_trace : nullptr);
+        }
+        PUSH_GGML_TRACE({"ve_g2_attn_q_rope", {L, 256}, g2_group.q_rope});
+        PUSH_GGML_TRACE({"ve_g2_attn_k_rope", {text_len, 256}, g2_group.k_rope});
         PUSH_GGML_TRACE({"ve_g2_attn_ctx", {L, 256}, g2_attn_ctx_trace});
         PUSH_GGML_TRACE({"ve_g2_attn_out", {L, C}, g2_attn_out});
 
@@ -2440,52 +4021,43 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "g2_attn_residual_style_qkv",
             include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> g2_block16 = std::move(g2_res_qkv.post);
-        std::vector<float> g2sq_out = std::move(g2_res_qkv.sq);
-        std::vector<float> g2sk_out = std::move(g2_res_qkv.sk);
-        std::vector<float> g2sv_out = std::move(g2_res_qkv.sv);
+        // QVAC-18605 round 9 — style flash-attn GPU bridge for g2.
         thread_local vector_text_attention_cache g2_style_attn_cache;
         std::vector<float> g2_style_ctx_trace;
-        std::vector<float> g2_style_out = run_text_attention_cache(g2_style_attn_cache, model, g2sq_out, g2sk_out, g2sv_out,
-            L, 50, 2, 128,
-            "vector_estimator:onnx::MatMul_3209",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.17.attention.out_fc.linear.bias",
-            current_step, "g2_style_flash",
-            include_ggml_trace ? &g2_style_ctx_trace : nullptr);
+        std::vector<float> g2_style_out;
+        const bool g2_style_use_gpu_bridge = !include_ggml_trace
+            && g2_res_qkv.sq_gpu && g2_res_qkv.sk_gpu && g2_res_qkv.sv_gpu;
+        if (g2_style_use_gpu_bridge) {
+            g2_style_out = run_text_attention_cache_gpu(g2_style_attn_cache, model,
+                g2_res_qkv.sq_gpu, g2_res_qkv.sk_gpu, g2_res_qkv.sv_gpu,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3209",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.17.attention.out_fc.linear.bias",
+                current_step, "g2_style_flash",
+                /*ctx_trace=*/ nullptr);
+        } else {
+            std::vector<float> g2sq_out = std::move(g2_res_qkv.sq);
+            std::vector<float> g2sk_out = std::move(g2_res_qkv.sk);
+            std::vector<float> g2sv_out = std::move(g2_res_qkv.sv);
+            g2_style_out = run_text_attention_cache(g2_style_attn_cache, model, g2sq_out, g2sk_out, g2sv_out,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3209",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.17.attention.out_fc.linear.bias",
+                current_step, "g2_style_flash",
+                include_ggml_trace ? &g2_style_ctx_trace : nullptr);
+        }
         PUSH_GGML_TRACE({"ve_g2_style_ctx", {L, 256}, g2_style_ctx_trace});
         PUSH_GGML_TRACE({"ve_g2_style_out", {L, C}, g2_style_out});
 
-        constexpr int G2_STYLE_RES_NODES = 128;
-        static size_t g2_style_res_buf_size = ggml_tensor_overhead() * G2_STYLE_RES_NODES +
-                                              ggml_graph_overhead_custom(G2_STYLE_RES_NODES, false);
-        thread_local std::vector<uint8_t> g2_style_res_buf(g2_style_res_buf_size);
-        ggml_init_params g2srp = { g2_style_res_buf_size, g2_style_res_buf.data(), true };
-        ggml_context * g2srctx = ggml_init(g2srp);
-        ggml_cgraph * g2srgf = ggml_new_graph_custom(g2srctx, G2_STYLE_RES_NODES, false);
-        ggml_tensor * g2_style_lhs = ggml_new_tensor_2d(g2srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(g2_style_lhs, "g2_style_lhs"); ggml_set_input(g2_style_lhs);
-        ggml_tensor * g2_style_out_in = ggml_new_tensor_2d(g2srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(g2_style_out_in, "g2_style_out_in"); ggml_set_input(g2_style_out_in);
-        ggml_tensor * g2_style_res = ggml_add(g2srctx, g2_style_lhs, g2_style_out_in);
-        ggml_set_name(g2_style_res, "ve_g2_style_residual");
-        if (include_ggml_trace) {
-            ggml_set_output(g2_style_res);
-            ggml_build_forward_expand(g2srgf, g2_style_res);
-        }
-        ggml_tensor * g2_style_norm = layer_norm_ggml(g2srctx, g2_style_res,
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.17.norm.norm.weight"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.17.norm.norm.bias"));
-        ggml_set_name(g2_style_norm, "ve_g2_style_norm"); ggml_set_output(g2_style_norm);
-        ggml_build_forward_expand(g2srgf, g2_style_norm);
-        supertonic_sched_alloc(model, g2srgf);
-        std::vector<float> g2_style_lhs_raw = pack_time_channel_for_ggml(g2_block16, L, C);
-        std::vector<float> g2_style_out_raw = pack_time_channel_for_ggml(g2_style_out, L, C);
-        ggml_backend_tensor_set(g2_style_lhs, g2_style_lhs_raw.data(), 0, g2_style_lhs_raw.size()*sizeof(float));
-        ggml_backend_tensor_set(g2_style_out_in, g2_style_out_raw.data(), 0, g2_style_out_raw.size()*sizeof(float));
-        profile_vector_compute(model, g2srgf, current_step, "g2_style_residual");
-        PUSH_GGML_TRACE({"ve_g2_style_residual", {L, C}, tensor_to_time_channel(ggml_graph_get_tensor(g2srgf, "ve_g2_style_residual"))});
-        std::vector<float> g2_style_norm_vec = tensor_to_time_channel(ggml_graph_get_tensor(g2srgf, "ve_g2_style_norm"));
+        // F8: cached style-residual graph (norm_block = 17 for group 2).
+        thread_local vector_style_residual_graph_cache g2_style_res_cache;
+        std::vector<float> g2_style_res_trace;
+        std::vector<float> g2_style_norm_vec = run_style_residual_cache(
+            g2_style_res_cache, model, g2_block16, g2_style_out,
+            L, C, /*norm_block=*/17, current_step, "g2_style_residual",
+            include_ggml_trace ? &g2_style_res_trace : nullptr);
+        PUSH_GGML_TRACE({"ve_g2_style_residual", {L, C}, g2_style_res_trace});
         PUSH_GGML_TRACE({"ve_g2_style_norm", {L, C}, g2_style_norm_vec});
-        ggml_free(g2srctx);
 
         thread_local vector_group_graph_cache g3_group_cache;
         vector_group_graph_result g3_group = run_group_graph_cache(g3_group_cache, model, g2_style_norm_vec,
@@ -2497,22 +4069,37 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "ve_g3_attn_q", "ve_g3_attn_k", "ve_g3_attn_v",
             "group3_conv_attn_qkv", include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> g3_block20 = std::move(g3_group.post);
-        std::vector<float> g3q_out = std::move(g3_group.q);
-        std::vector<float> g3k_out = std::move(g3_group.k);
-        std::vector<float> g3v_out = std::move(g3_group.v);
-        f32_tensor theta_g3 = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
-        apply_rope(theta_g3.data.data(), g3q_out, L, 4, 64);
-        apply_rope(theta_g3.data.data(), g3k_out, text_len, 4, 64);
+        // 2C-lite — same GPU fast-path / host-fallback pattern as g1, g2.
         thread_local vector_text_attention_cache g3_attn_cache;
         std::vector<float> g3_attn_ctx_trace;
-        std::vector<float> g3_attn_out = run_text_attention_cache(g3_attn_cache, model, g3q_out, g3k_out, g3v_out,
-            L, text_len, 4, 64,
-            "vector_estimator:onnx::MatMul_3245",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.21.attn.out_fc.linear.bias",
-            current_step, "g3_attn_flash",
-            include_ggml_trace ? &g3_attn_ctx_trace : nullptr);
-        PUSH_GGML_TRACE({"ve_g3_attn_q_rope", {L, 256}, g3q_out});
-        PUSH_GGML_TRACE({"ve_g3_attn_k_rope", {text_len, 256}, g3k_out});
+        std::vector<float> g3_attn_out;
+        if (g3_group.q_rope_gpu && g3_group.k_rope_gpu && g3_group.v_gpu) {
+            g3_attn_out = run_text_attention_cache_gpu(g3_attn_cache, model,
+                g3_group.q_rope_gpu, g3_group.k_rope_gpu, g3_group.v_gpu,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3245",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.21.attn.out_fc.linear.bias",
+                current_step, "g3_attn_flash",
+                include_ggml_trace ? &g3_attn_ctx_trace : nullptr);
+        } else {
+            std::vector<float> g3q_out = std::move(g3_group.q);
+            std::vector<float> g3k_out = std::move(g3_group.k);
+            std::vector<float> g3v_out = std::move(g3_group.v);
+            std::vector<float> g3q_rotated = g3q_out;
+            std::vector<float> g3k_rotated = g3k_out;
+            const float * theta_g3 = model.vector_rope_theta.data();
+            apply_rope(theta_g3, g3q_rotated, L, 4, 64);
+            apply_rope(theta_g3, g3k_rotated, text_len, 4, 64);
+            g3_attn_out = run_text_attention_cache(g3_attn_cache, model,
+                g3q_rotated, g3k_rotated, g3v_out,
+                L, text_len, 4, 64,
+                "vector_estimator:onnx::MatMul_3245",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.21.attn.out_fc.linear.bias",
+                current_step, "g3_attn_flash",
+                include_ggml_trace ? &g3_attn_ctx_trace : nullptr);
+        }
+        PUSH_GGML_TRACE({"ve_g3_attn_q_rope", {L, 256}, g3_group.q_rope});
+        PUSH_GGML_TRACE({"ve_g3_attn_k_rope", {text_len, 256}, g3_group.k_rope});
         PUSH_GGML_TRACE({"ve_g3_attn_ctx", {L, 256}, g3_attn_ctx_trace});
         PUSH_GGML_TRACE({"ve_g3_attn_out", {L, C}, g3_attn_out});
 
@@ -2533,52 +4120,43 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             "g3_attn_residual_style_qkv",
             include_ggml_trace ? &ggml_trace : nullptr);
         std::vector<float> g3_block22 = std::move(g3_res_qkv.post);
-        std::vector<float> g3sq_out = std::move(g3_res_qkv.sq);
-        std::vector<float> g3sk_out = std::move(g3_res_qkv.sk);
-        std::vector<float> g3sv_out = std::move(g3_res_qkv.sv);
+        // QVAC-18605 round 9 — style flash-attn GPU bridge for g3.
         thread_local vector_text_attention_cache g3_style_attn_cache;
         std::vector<float> g3_style_ctx_trace;
-        std::vector<float> g3_style_out = run_text_attention_cache(g3_style_attn_cache, model, g3sq_out, g3sk_out, g3sv_out,
-            L, 50, 2, 128,
-            "vector_estimator:onnx::MatMul_3254",
-            "vector_estimator:tts.ttl.vector_field.main_blocks.23.attention.out_fc.linear.bias",
-            current_step, "g3_style_flash",
-            include_ggml_trace ? &g3_style_ctx_trace : nullptr);
+        std::vector<float> g3_style_out;
+        const bool g3_style_use_gpu_bridge = !include_ggml_trace
+            && g3_res_qkv.sq_gpu && g3_res_qkv.sk_gpu && g3_res_qkv.sv_gpu;
+        if (g3_style_use_gpu_bridge) {
+            g3_style_out = run_text_attention_cache_gpu(g3_style_attn_cache, model,
+                g3_res_qkv.sq_gpu, g3_res_qkv.sk_gpu, g3_res_qkv.sv_gpu,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3254",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.23.attention.out_fc.linear.bias",
+                current_step, "g3_style_flash",
+                /*ctx_trace=*/ nullptr);
+        } else {
+            std::vector<float> g3sq_out = std::move(g3_res_qkv.sq);
+            std::vector<float> g3sk_out = std::move(g3_res_qkv.sk);
+            std::vector<float> g3sv_out = std::move(g3_res_qkv.sv);
+            g3_style_out = run_text_attention_cache(g3_style_attn_cache, model, g3sq_out, g3sk_out, g3sv_out,
+                L, 50, 2, 128,
+                "vector_estimator:onnx::MatMul_3254",
+                "vector_estimator:tts.ttl.vector_field.main_blocks.23.attention.out_fc.linear.bias",
+                current_step, "g3_style_flash",
+                include_ggml_trace ? &g3_style_ctx_trace : nullptr);
+        }
         PUSH_GGML_TRACE({"ve_g3_style_ctx", {L, 256}, g3_style_ctx_trace});
         PUSH_GGML_TRACE({"ve_g3_style_out", {L, C}, g3_style_out});
 
-        constexpr int G3_STYLE_RES_NODES = 128;
-        static size_t g3_style_res_buf_size = ggml_tensor_overhead() * G3_STYLE_RES_NODES +
-                                              ggml_graph_overhead_custom(G3_STYLE_RES_NODES, false);
-        thread_local std::vector<uint8_t> g3_style_res_buf(g3_style_res_buf_size);
-        ggml_init_params g3srp = { g3_style_res_buf_size, g3_style_res_buf.data(), true };
-        ggml_context * g3srctx = ggml_init(g3srp);
-        ggml_cgraph * g3srgf = ggml_new_graph_custom(g3srctx, G3_STYLE_RES_NODES, false);
-        ggml_tensor * g3_style_lhs = ggml_new_tensor_2d(g3srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(g3_style_lhs, "g3_style_lhs"); ggml_set_input(g3_style_lhs);
-        ggml_tensor * g3_style_out_in = ggml_new_tensor_2d(g3srctx, GGML_TYPE_F32, L, C);
-        ggml_set_name(g3_style_out_in, "g3_style_out_in"); ggml_set_input(g3_style_out_in);
-        ggml_tensor * g3_style_res = ggml_add(g3srctx, g3_style_lhs, g3_style_out_in);
-        ggml_set_name(g3_style_res, "ve_g3_style_residual");
-        if (include_ggml_trace) {
-            ggml_set_output(g3_style_res);
-            ggml_build_forward_expand(g3srgf, g3_style_res);
-        }
-        ggml_tensor * g3_style_norm = layer_norm_ggml(g3srctx, g3_style_res,
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.23.norm.norm.weight"),
-            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.23.norm.norm.bias"));
-        ggml_set_name(g3_style_norm, "ve_g3_style_norm"); ggml_set_output(g3_style_norm);
-        ggml_build_forward_expand(g3srgf, g3_style_norm);
-        supertonic_sched_alloc(model, g3srgf);
-        std::vector<float> g3_style_lhs_raw = pack_time_channel_for_ggml(g3_block22, L, C);
-        std::vector<float> g3_style_out_raw = pack_time_channel_for_ggml(g3_style_out, L, C);
-        ggml_backend_tensor_set(g3_style_lhs, g3_style_lhs_raw.data(), 0, g3_style_lhs_raw.size()*sizeof(float));
-        ggml_backend_tensor_set(g3_style_out_in, g3_style_out_raw.data(), 0, g3_style_out_raw.size()*sizeof(float));
-        profile_vector_compute(model, g3srgf, current_step, "g3_style_residual");
-        PUSH_GGML_TRACE({"ve_g3_style_residual", {L, C}, tensor_to_time_channel(ggml_graph_get_tensor(g3srgf, "ve_g3_style_residual"))});
-        std::vector<float> g3_style_norm_vec = tensor_to_time_channel(ggml_graph_get_tensor(g3srgf, "ve_g3_style_norm"));
+        // F8: cached style-residual graph (norm_block = 23 for group 3).
+        thread_local vector_style_residual_graph_cache g3_style_res_cache;
+        std::vector<float> g3_style_res_trace;
+        std::vector<float> g3_style_norm_vec = run_style_residual_cache(
+            g3_style_res_cache, model, g3_block22, g3_style_out,
+            L, C, /*norm_block=*/23, current_step, "g3_style_residual",
+            include_ggml_trace ? &g3_style_res_trace : nullptr);
+        PUSH_GGML_TRACE({"ve_g3_style_residual", {L, C}, g3_style_res_trace});
         PUSH_GGML_TRACE({"ve_g3_style_norm", {L, C}, g3_style_norm_vec});
-        ggml_free(g3srctx);
 
         thread_local vector_tail_graph_cache tail_cache;
         std::vector<float> next_latent_tc = run_tail_graph_cache(tail_cache, model, g3_style_norm_vec,
@@ -2586,7 +4164,8 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
             include_ggml_trace ? &ggml_trace : nullptr);
         if (next_latent_tc_out) *next_latent_tc_out = next_latent_tc;
 
-        ggml_free(ctx);
+        // F19: front-block ctx + allocr live in `front_cache` and
+        // survive across denoise steps; no per-call ctx to free.
         profile_vector_step_end(current_step);
         if (error) error->clear();
 #undef PUSH_GGML_TRACE
@@ -2597,6 +4176,912 @@ bool supertonic_vector_trace_proj_ggml(const supertonic_model & model,
     }
 }
 
+// Apply Supertonic's non-standard RoPE in-graph.
+// Supertonic uses angle = (t/L) * theta[d_half], where theta is loaded from
+// the GGUF and L is the per-call sequence length.  ggml_rope_ext's formula
+// expands to angle = (pos / freq_factors[d/2]) * freq_scale * freq_base^(-d/n_dims).
+// Setting freq_base=1, freq_scale=1, freq_factors[d_half] = L / theta[d_half],
+// positions = [0..L) reproduces the Supertonic formula exactly.  NEOX mode
+// matches apply_rope's split-pairs layout (x[d] rotates with x[d+D/2]) at
+// supertonic_vector_estimator.cpp:1416.
+//
+// x_tc must be a contiguous 2D tensor of shape ne=[H*D, q_len] (width-major).
+// `positions` is int32 [q_len], `freq_factors` is f32 [D/2]; both are caller-
+// owned input tensors set via ggml_backend_tensor_set before compute.
+ggml_tensor * apply_supertonic_rope_ggml(ggml_context * ctx,
+                                          ggml_tensor * x_tc,
+                                          ggml_tensor * positions,
+                                          ggml_tensor * freq_factors,
+                                          int q_len,
+                                          int H,
+                                          int D) {
+    GGML_ASSERT(x_tc->ne[0] == (int64_t)(H*D));
+    GGML_ASSERT(x_tc->ne[1] == (int64_t)q_len);
+    const size_t row_bytes = (size_t)(H*D) * sizeof(float);
+    const size_t head_bytes = (size_t)D * sizeof(float);
+    // View [H*D, q_len] as [D, H, q_len] so rope's outer dim is time.
+    // Strides: nb1 = head step (D floats), nb2 = time step (H*D floats).
+    // This view is naturally contiguous (nb[0]=elem_size, nb[1]=D*elem_size,
+    // nb[2]=H*D*elem_size = ne[0]*ne[1]*elem_size) so we can skip the
+    // ggml_cont copy that earlier versions inserted defensively.
+    ggml_tensor * x_view = ggml_view_3d(ctx, x_tc, D, H, q_len,
+                                         head_bytes, row_bytes, 0);
+    ggml_tensor * roped = ggml_rope_ext(ctx, x_view, positions, freq_factors,
+                                         D, GGML_ROPE_TYPE_NEOX, 0,
+                                         /*freq_base=*/1.0f,
+                                         /*freq_scale=*/1.0f,
+                                         /*ext_factor=*/0.0f,
+                                         /*attn_factor=*/1.0f,
+                                         /*beta_fast=*/0.0f,
+                                         /*beta_slow=*/0.0f);
+    return ggml_reshape_2d(ctx, roped, (int64_t) H * D, q_len);
+}
+
+// Append a text-attention subgraph (Q, K, V flash-attention + out projection +
+// bias add) to the caller's `ctx` (no cgraph is taken; the caller expands the
+// result).  Mirrors build_text_attention_cache but does not own a context.
+//
+// Inputs:
+//   q_tc, k_tc, v_tc: contiguous [H*D, *_len] tensors
+//   out_w_tensor: model tensor for the out projection weight
+//   out_b_tensor: model tensor for the out projection bias
+// Returns: out_tc tensor of shape [out_dim, q_len].
+ggml_tensor * append_text_attention_subgraph(ggml_context * ctx,
+                                              const supertonic_model & model,
+                                              ggml_tensor * q_tc,
+                                              ggml_tensor * k_tc,
+                                              ggml_tensor * v_tc,
+                                              int q_len, int kv_len,
+                                              int n_heads, int head_dim,
+                                              ggml_tensor * out_w_tensor,
+                                              ggml_tensor * out_b_tensor,
+                                              float scale) {
+    const int width = n_heads * head_dim;
+    const size_t time_stride = (size_t)width * sizeof(float);
+    const size_t head_stride = (size_t)head_dim * sizeof(float);
+    ggml_tensor * q_in = ggml_view_3d(ctx, q_tc,
+        head_dim, q_len, n_heads, time_stride, head_stride, 0);
+    ggml_tensor * k_in = ggml_view_3d(ctx, k_tc,
+        head_dim, kv_len, n_heads, time_stride, head_stride, 0);
+    ggml_tensor * v_in = ggml_view_3d(ctx, v_tc,
+        head_dim, kv_len, n_heads, time_stride, head_stride, 0);
+    ggml_tensor * attn = ggml_flash_attn_ext(ctx, q_in, k_in, v_in,
+                                              nullptr, scale, 0.0f, 0.0f);
+    attn = ggml_reshape_2d(ctx, attn, (int64_t) n_heads * head_dim, q_len);
+    ggml_tensor * ctx_tc = ggml_cont(ctx, ggml_transpose(ctx, attn));
+    return dense_matmul_time_pretransposed_ggml(ctx, model, ctx_tc, out_w_tensor, out_b_tensor);
+}
+
+// Per-group MatMul tensor name suffixes (groups 0..3).  See per-group source
+// names in trace_proj_ggml; these tables centralise them for the consolidated
+// path.
+struct vector_step_group_names {
+    int t_linear;    // time-linear (matmul for time embedding projection)
+    int attn_q;
+    int attn_k;
+    int attn_v;
+    int attn_out;
+    int style_q;
+    int style_k;
+    int style_v;
+    int style_out;
+};
+
+static const vector_step_group_names kGroupNames[4] = {
+    {3095, 3101, 3102, 3103, 3110, 3116, 3117, 3118, 3119},
+    {3140, 3146, 3147, 3148, 3155, 3161, 3162, 3163, 3164},
+    {3185, 3191, 3192, 3193, 3200, 3206, 3207, 3208, 3209},
+    {3230, 3236, 3237, 3238, 3245, 3251, 3252, 3253, 3254},
+};
+
+static std::string matmul_name(int suffix) {
+    return "vector_estimator:onnx::MatMul_" + std::to_string(suffix);
+}
+
+// Bundle of input tensors a single CFM step subgraph needs.  Used both by
+// the per-step cache (one step per ggml_cgraph) and by the
+// 5-steps-unrolled-into-one-graph cache (Phase A1+A2).
+//
+// `x_in` / `noise_in` vary per step.  `x_in` is the latent for this step;
+// `noise_in` is the "residual" the velocity is added to in Supertonic's
+// CFM update `next = noise_in + velocity * (1 / total_steps)`.  For a
+// single step they happen to be the same tensor, but they become DIFFERENT
+// tensors when steps are chained: step N's x_in is step N-1's output,
+// while noise_in is still the original noisy latent for that step.  In the
+// per-step path we bind them to the same external buffer; in the
+// unrolled-loop path we wire them as graph edges between steps.
+//
+// `t_emb_in` varies per step (one time embedding per CFM step index).
+// All other inputs are constant across the 5 CFM steps and bind to a
+// single shared input tensor regardless of which path is used.
+struct vector_step_inputs {
+    ggml_tensor * x_in           = nullptr;  // ne=[L, Cin]    f32  (per-step)
+    ggml_tensor * mask_in        = nullptr;  // ne=[L]         f32
+    ggml_tensor * t_emb_in       = nullptr;  // ne=[64]        f32  (per-step)
+    ggml_tensor * text_in        = nullptr;  // ne=[text_len, 256] f32
+    ggml_tensor * style_v_raw_in = nullptr;  // ne=[50, 256]   f32
+    ggml_tensor * style_kctx_in  = nullptr;  // ne=[50, 256]   f32
+    ggml_tensor * noise_in       = nullptr;  // ne=[L, Cin]    f32  (per-step)
+    ggml_tensor * pos_q          = nullptr;  // ne=[L]         i32
+    ggml_tensor * pos_k          = nullptr;  // ne=[text_len]  i32
+    ggml_tensor * freq_factors_q = nullptr;  // ne=[D/2]       f32
+    ggml_tensor * freq_factors_k = nullptr;  // ne=[D/2]       f32
+};
+
+// Append one CFM step's subgraph (proj_in → 4 groups → tail → proj_out
+// → velocity → next = noise + velocity / total_steps) to `gf`.  All
+// inputs are pre-bound by the caller; this function only builds the
+// dataflow and returns the `next` tensor (ne=[L, Cin]) so the caller
+// can either set it as a graph output or feed it as the next step's
+// `x_in`.  The function does NOT call `ggml_set_output` /
+// `ggml_build_forward_expand` on the result — that's the caller's
+// decision.
+//
+// `L`, `text_len` and `total_steps` are passed explicitly because they're
+// used in several places.  CPU vs GPU dispatch lives on the thread-local
+// `supertonic_use_cpu_custom_ops()` flag set by the outer
+// `supertonic_op_dispatch_scope` at the public entry point.
+ggml_tensor * append_supertonic_vector_step_subgraph(
+        ggml_context * gctx,
+        ggml_cgraph * gf,
+        const supertonic_model & model,
+        const vector_step_inputs & inputs,
+        int L,
+        int text_len,
+        int total_steps);
+
+// Consolidated per-step cache: one ctx, one cgraph, one gallocr for the entire
+// per-step computation.  Replaces the ~17 sub-graph dispatches the trace_proj
+// orchestrator emits with a single ggml_backend_graph_compute call.
+struct vector_step_one_graph_cache {
+    const supertonic_model * model = nullptr;
+    uint64_t generation_id = 0;
+    int L = 0;
+    int text_len = 0;
+    int total_steps = 0;
+
+    std::vector<uint8_t> buf;
+    ggml_context * ctx = nullptr;
+    ggml_cgraph * gf = nullptr;
+    ggml_gallocr_t allocr = nullptr;
+
+    // Per-call inputs
+    ggml_tensor * x_in = nullptr;          // noisy_latent (L, Cin) ggml-shape: ne=[L, Cin]
+    ggml_tensor * mask_in = nullptr;       // [L]
+    ggml_tensor * t_emb_in = nullptr;      // [64]
+    ggml_tensor * text_in = nullptr;       // [text_len, 256]
+    ggml_tensor * style_v_raw_in = nullptr; // [50, 256] (style_ttl repacked)
+    ggml_tensor * style_kctx_in = nullptr;  // [50, 256] (model's /Expand_output_0)
+    ggml_tensor * noise_in = nullptr;       // (L, Cin) (same data as x_in but indep slot for tail)
+
+    // Per-build (rope) inputs
+    ggml_tensor * pos_q = nullptr;          // int32 [L]
+    ggml_tensor * pos_k = nullptr;          // int32 [text_len]
+    ggml_tensor * freq_factors_q = nullptr; // f32 [32] (head_dim/2)
+    ggml_tensor * freq_factors_k = nullptr; // f32 [32]
+
+    // Output
+    ggml_tensor * next_latent_out = nullptr; // ne=[L, Cin] in (t, c) order
+};
+
+void free_vector_step_one_graph_cache(vector_step_one_graph_cache & cache) {
+    if (cache.allocr) {
+        supertonic_safe_gallocr_free(cache.allocr, cache.model ? cache.model->generation_id : 0);
+        cache.allocr = nullptr;
+    }
+    if (cache.ctx) {
+        ggml_free(cache.ctx);
+        cache.ctx = nullptr;
+    }
+    cache.gf = nullptr;
+    cache.buf.clear();
+    cache.model = nullptr;
+    cache.generation_id = 0;
+    cache.L = 0;
+    cache.text_len = 0;
+    cache.total_steps = 0;
+    cache.x_in = cache.mask_in = cache.t_emb_in = cache.text_in = nullptr;
+    cache.style_v_raw_in = cache.style_kctx_in = cache.noise_in = nullptr;
+    cache.pos_q = cache.pos_k = cache.freq_factors_q = cache.freq_factors_k = nullptr;
+    cache.next_latent_out = nullptr;
+}
+
+ggml_tensor * append_supertonic_vector_step_subgraph(
+        ggml_context * gctx,
+        ggml_cgraph * gf,
+        const supertonic_model & model,
+        const vector_step_inputs & inputs,
+        int L,
+        int text_len,
+        int total_steps) {
+    const bool use_cpu_custom = supertonic_use_cpu_custom_ops();
+    // Shape constants that aren't dependent on L / text_len.  Mirror the
+    // values from supertonic_vector_step_one_graph_ggml.
+    const int C = 512;
+    const int H = 4;        // text-attention heads
+    const int D = 64;       // text-attention head_dim
+    const int SH = 2;       // style-attention heads
+    const int SD = 128;     // style-attention head_dim
+    const int kv_style = 50; // fixed by /Expand_output_0
+    (void)H; (void)D; (void)SH; (void)SD; (void)kv_style;
+
+    // ===== PHASE 0: proj_in + mask =====
+    ggml_tensor * cur = conv1d_f32(gctx,
+        require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.proj_in.net.weight"),
+        inputs.x_in, 1, 0, 1);
+    cur = ggml_mul(gctx, cur, repeat_like(gctx, inputs.mask_in, cur));
+
+    // ===== PHASE 1: Group 0 prologue — ConvNeXt × 4 on main_blocks.0 + time_add (1) + ConvNeXt (2) =====
+    int dils[4] = {1, 2, 4, 8};
+    // Phase B2 full: permute to [C, T] once before the 4-block chain, run
+    // the chain in [C, T] (which lets each block's two pointwise convs
+    // become a direct ggml_mul_mat with no im2col), permute back to
+    // [T, C] for the downstream time-add.  Saves 2 im2col dispatches per
+    // block × 4 blocks × 5 steps − 2 permutes per chain × 5 steps =
+    // 30 dispatches eliminated per synth.  Opt out by setting
+    // SUPERTONIC_DISABLE_CT_CONVNEXT; the check is presence-only, so any value disables it.
+    static const bool disable_ct_convnext =
+        std::getenv("SUPERTONIC_DISABLE_CT_CONVNEXT") != nullptr;
+    const bool use_ct_convnext = !disable_ct_convnext && !use_cpu_custom;
+    if (use_ct_convnext) {
+        ggml_tensor * cur_ct = ggml_cont(gctx, ggml_permute(gctx, cur, 1, 0, 2, 3));
+        for (int j = 0; j < 4; ++j) {
+            cur_ct = vector_convnext_ggml_ct(gctx, model,
+                "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext." + std::to_string(j),
+                cur_ct, dils[j]);
+        }
+        cur = ggml_cont(gctx, ggml_permute(gctx, cur_ct, 1, 0, 2, 3));
+    } else {
+        for (int j = 0; j < 4; ++j) {
+            cur = vector_convnext_ggml(gctx, model,
+                "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext." + std::to_string(j),
+                cur, dils[j]);
+        }
+    }
+    // Time-add for group 0.
+    {
+        ggml_tensor * w = require_source_tensor(model, matmul_name(kGroupNames[0].t_linear));
+        ggml_tensor * b = require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks.1.linear.linear.bias");
+        ggml_tensor * w_t = try_pretransposed_weight(model, w);
+        if (!w_t) w_t = ggml_cont(gctx, ggml_transpose(gctx, w));
+        ggml_tensor * t_proj = ggml_mul_mat(gctx, w_t, ggml_reshape_2d(gctx, inputs.t_emb_in, 64, 1));
+        t_proj = ggml_add(gctx, t_proj, ggml_reshape_2d(gctx, b, C, 1));
+        cur = ggml_add(gctx, cur, repeat_like(gctx, t_proj, cur));
+    }
+    cur = vector_convnext_ggml(gctx, model,
+        "vector_estimator:tts.ttl.vector_field.main_blocks.2.convnext.0",
+        cur, 1);
+    ggml_tensor * block_pre_attn = cur;
+
+    // Per-group attention block.
+    auto run_group = [&](ggml_tensor * x, int group, ggml_tensor * x_pre_attn) -> ggml_tensor * {
+        const auto & names = kGroupNames[group];
+        const int attn_block = group * 6 + 3;
+        const int post_attn_block = group * 6 + 4;
+        const int style_block = group * 6 + 5;
+
+        // Text attention QKV — output directly in [A, T] (width-major)
+        // layout so the cont(transpose) before rope/flash_attn is gone.
+        // The kernel-as-src0 ordering also dispatches the optimized
+        // kernel_mul_mm_q8_0_f32 when weights are q8_0.
+        ggml_tensor * q_wt = dense_matmul_time_wt_pretransposed_ggml(gctx, model, x_pre_attn,
+            require_source_tensor(model, matmul_name(names.attn_q)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(attn_block) + ".attn.W_query.linear.bias"));
+        ggml_tensor * k_wt = dense_matmul_time_wt_pretransposed_ggml(gctx, model, inputs.text_in,
+            require_source_tensor(model, matmul_name(names.attn_k)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(attn_block) + ".attn.W_key.linear.bias"));
+        ggml_tensor * v_wt = dense_matmul_time_wt_pretransposed_ggml(gctx, model, inputs.text_in,
+            require_source_tensor(model, matmul_name(names.attn_v)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(attn_block) + ".attn.W_value.linear.bias"));
+
+        q_wt = apply_supertonic_rope_ggml(gctx, q_wt, inputs.pos_q, inputs.freq_factors_q, L, H, D);
+        k_wt = apply_supertonic_rope_ggml(gctx, k_wt, inputs.pos_k, inputs.freq_factors_k, text_len, H, D);
+
+        ggml_tensor * attn_out = append_text_attention_subgraph(gctx, model,
+            q_wt, k_wt, v_wt, L, text_len, H, D,
+            require_source_tensor(model, matmul_name(names.attn_out)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(attn_block) + ".attn.out_fc.linear.bias"),
+            1.0f / 16.0f);
+
+        ggml_tensor * residual = ggml_add(gctx, x_pre_attn, attn_out);
+        ggml_tensor * normed = layer_norm_ggml(gctx, residual,
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(attn_block) + ".norm.norm.weight"),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(attn_block) + ".norm.norm.bias"));
+
+        ggml_tensor * post = vector_convnext_ggml(gctx, model,
+            "vector_estimator:tts.ttl.vector_field.main_blocks." +
+            std::to_string(post_attn_block) + ".convnext.0",
+            normed, 1);
+
+        ggml_tensor * masked_post = ggml_mul(gctx, post, repeat_like(gctx, inputs.mask_in, post));
+
+        // Style attention QKV — output directly in [A, T] layout.
+        ggml_tensor * sq_wt = dense_matmul_time_wt_pretransposed_ggml(gctx, model, masked_post,
+            require_source_tensor(model, matmul_name(names.style_q)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(style_block) + ".attention.W_query.linear.bias"));
+        ggml_tensor * sk_wt = dense_matmul_time_wt_pretransposed_ggml(gctx, model, inputs.style_kctx_in,
+            require_source_tensor(model, matmul_name(names.style_k)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(style_block) + ".attention.W_key.linear.bias"));
+        sk_wt = ggml_tanh(gctx, sk_wt);
+        ggml_tensor * sv_wt = dense_matmul_time_wt_pretransposed_ggml(gctx, model, inputs.style_v_raw_in,
+            require_source_tensor(model, matmul_name(names.style_v)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(style_block) + ".attention.W_value.linear.bias"));
+
+        ggml_tensor * style_out = append_text_attention_subgraph(gctx, model,
+            sq_wt, sk_wt, sv_wt, L, kv_style, SH, SD,
+            require_source_tensor(model, matmul_name(names.style_out)),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(style_block) + ".attention.out_fc.linear.bias"),
+            1.0f / 16.0f);
+
+        ggml_tensor * style_residual = ggml_add(gctx, post, style_out);
+        ggml_tensor * style_normed = layer_norm_ggml(gctx, style_residual,
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(style_block) + ".norm.norm.weight"),
+            require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                                         std::to_string(style_block) + ".norm.norm.bias"));
+        (void)x;
+        return style_normed;
+    };
+
+    // Group prep for groups 1-3.
+    auto group_prep = [&](ggml_tensor * x, int group) -> ggml_tensor * {
+        const int conv_block = group * 6 + 0;
+        const int linear_block = group * 6 + 1;
+        const int post_block = group * 6 + 2;
+        int dils2[4] = {1, 2, 4, 8};
+        ggml_tensor * y = x;
+        if (use_ct_convnext) {
+            ggml_tensor * y_ct = ggml_cont(gctx, ggml_permute(gctx, y, 1, 0, 2, 3));
+            for (int j = 0; j < 4; ++j) {
+                y_ct = vector_convnext_ggml_ct(gctx, model,
+                    "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                    std::to_string(conv_block) + ".convnext." + std::to_string(j),
+                    y_ct, dils2[j]);
+            }
+            y = ggml_cont(gctx, ggml_permute(gctx, y_ct, 1, 0, 2, 3));
+        } else {
+            for (int j = 0; j < 4; ++j) {
+                y = vector_convnext_ggml(gctx, model,
+                    "vector_estimator:tts.ttl.vector_field.main_blocks." +
+                    std::to_string(conv_block) + ".convnext." + std::to_string(j),
+                    y, dils2[j]);
+            }
+        }
+        ggml_tensor * w = require_source_tensor(model, matmul_name(kGroupNames[group].t_linear));
+        ggml_tensor * b = require_source_tensor(model,
+            "vector_estimator:tts.ttl.vector_field.main_blocks." +
+            std::to_string(linear_block) + ".linear.linear.bias");
+        ggml_tensor * w_t = try_pretransposed_weight(model, w);
+        if (!w_t) w_t = ggml_cont(gctx, ggml_transpose(gctx, w));
+        ggml_tensor * t_proj = ggml_mul_mat(gctx, w_t, ggml_reshape_2d(gctx, inputs.t_emb_in, 64, 1));
+        t_proj = ggml_add(gctx, t_proj, ggml_reshape_2d(gctx, b, C, 1));
+        y = ggml_add(gctx, y, repeat_like(gctx, t_proj, y));
+        y = vector_convnext_ggml(gctx, model,
+            "vector_estimator:tts.ttl.vector_field.main_blocks." +
+            std::to_string(post_block) + ".convnext.0",
+            y, 1);
+        return y;
+    };
+
+    ggml_tensor * x_after_g0 = run_group(cur, 0, block_pre_attn);
+    ggml_tensor * x_pre_g1 = group_prep(x_after_g0, 1);
+    ggml_tensor * x_after_g1 = run_group(x_after_g0, 1, x_pre_g1);
+    ggml_tensor * x_pre_g2 = group_prep(x_after_g1, 2);
+    ggml_tensor * x_after_g2 = run_group(x_after_g1, 2, x_pre_g2);
+    ggml_tensor * x_pre_g3 = group_prep(x_after_g2, 3);
+    ggml_tensor * x_after_g3 = run_group(x_after_g2, 3, x_pre_g3);
+
+    // Tail: last_convnext × 4 + proj_out + mask + noise add.
+    ggml_tensor * tail = x_after_g3;
+    if (use_ct_convnext) {
+        ggml_tensor * tail_ct = ggml_cont(gctx, ggml_permute(gctx, tail, 1, 0, 2, 3));
+        for (int j = 0; j < 4; ++j) {
+            tail_ct = vector_convnext_ggml_ct(gctx, model,
+                "vector_estimator:tts.ttl.vector_field.last_convnext.convnext." + std::to_string(j),
+                tail_ct, 1);
+        }
+        tail = ggml_cont(gctx, ggml_permute(gctx, tail_ct, 1, 0, 2, 3));
+    } else {
+        for (int j = 0; j < 4; ++j) {
+            tail = vector_convnext_ggml(gctx, model,
+                "vector_estimator:tts.ttl.vector_field.last_convnext.convnext." + std::to_string(j),
+                tail, 1);
+        }
+    }
+    ggml_tensor * velocity = conv1d_f32(gctx,
+        require_source_tensor(model, "vector_estimator:tts.ttl.vector_field.proj_out.net.weight"),
+        tail, 1, 0, 1);
+    ggml_tensor * masked_velocity = ggml_mul(gctx, velocity, repeat_like(gctx, inputs.mask_in, velocity));
+    ggml_tensor * scaled = ggml_scale(gctx, masked_velocity, 1.0f / (float)total_steps);
+    ggml_tensor * next = ggml_add(gctx, inputs.noise_in, scaled);
+
+    // gf is unused here: the caller expands the graph from the returned
+    // tensor.  The cast below silences the unused-parameter warning.
+    (void)gf;
+    return next;
+}
+
+
+// Compute one CFM denoising step as ONE ggml graph.  Used only when the
+// model's backend isn't CPU (Metal / CUDA / Vulkan / OpenCL).  Replaces the
+// ~21 sub-graph dispatches the trace_proj orchestrator emits with a single
+// ggml_backend_graph_compute call.
+bool supertonic_vector_step_one_graph_ggml(const supertonic_model & model,
+                                            const float * noisy_latent,
+                                            int latent_len,
+                                            const float * text_emb,
+                                            int text_len,
+                                            const float * style_ttl,
+                                            const float * latent_mask,
+                                            int current_step,
+                                            int total_steps,
+                                            std::vector<float> & next_latent_out,
+                                            std::string * error) {
+    // The outer entry point sets `supertonic_op_dispatch_scope`; this
+    // function is only called on non-CPU backends, so the thread-local
+    // `supertonic_use_cpu_custom_ops()` reads false inside the helpers.
+    try {
+        const int L = latent_len;
+        const int Cin = model.hparams.latent_channels;  // typically 16
+        const int C = 512;
+        const int text_C = 256;
+        const int H = 4;        // text-attention heads
+        const int D = 64;       // text-attention head_dim
+        const int A = H * D;    // 256 = attention width
+        const int SH = 2;       // style-attention heads
+        const int SD = 128;     // style-attention head_dim
+        const int kv_style = 50; // style attention kv length (fixed by /Expand_output_0)
+
+        thread_local vector_step_one_graph_cache cache;
+        const bool need_rebuild = cache.model != &model ||
+                                  cache.generation_id != model.generation_id ||
+                                  cache.L != L ||
+                                  cache.text_len != text_len ||
+                                  cache.total_steps != total_steps;
+        if (need_rebuild) {
+            free_vector_step_one_graph_cache(cache);
+            cache.model = &model;
+            cache.generation_id = model.generation_id;
+            cache.L = L;
+            cache.text_len = text_len;
+            cache.total_steps = total_steps;
+
+            // Memory budget for the consolidated graph.  The original
+            // sub-graphs each used 128-512 nodes; the full per-step graph is
+            // roughly the sum (4 groups x ~700 ops/group + tail + front).
+            // Round up generously.
+            constexpr int MAX_NODES = 8192;
+            const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                                     ggml_graph_overhead_custom(MAX_NODES, false);
+            cache.buf.assign(buf_size, 0);
+            ggml_init_params p = { buf_size, cache.buf.data(), true };
+            cache.ctx = ggml_init(p);
+            cache.gf = ggml_new_graph_custom(cache.ctx, MAX_NODES, false);
+
+            // --- Per-call inputs ---
+            cache.x_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, Cin);
+            ggml_set_name(cache.x_in, "step_x_in"); ggml_set_input(cache.x_in);
+            cache.mask_in = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, L);
+            ggml_set_name(cache.mask_in, "step_mask"); ggml_set_input(cache.mask_in);
+            cache.t_emb_in = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, 64);
+            ggml_set_name(cache.t_emb_in, "step_temb"); ggml_set_input(cache.t_emb_in);
+            cache.text_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, text_len, text_C);
+            ggml_set_name(cache.text_in, "step_text_in"); ggml_set_input(cache.text_in);
+            cache.style_v_raw_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, kv_style, text_C);
+            ggml_set_name(cache.style_v_raw_in, "step_style_v"); ggml_set_input(cache.style_v_raw_in);
+            cache.style_kctx_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, kv_style, text_C);
+            ggml_set_name(cache.style_kctx_in, "step_style_kctx"); ggml_set_input(cache.style_kctx_in);
+            cache.noise_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, Cin);
+            ggml_set_name(cache.noise_in, "step_noise_in"); ggml_set_input(cache.noise_in);
+
+            // --- RoPE inputs ---
+            cache.pos_q = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_I32, L);
+            ggml_set_name(cache.pos_q, "step_pos_q"); ggml_set_input(cache.pos_q);
+            cache.pos_k = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_I32, text_len);
+            ggml_set_name(cache.pos_k, "step_pos_k"); ggml_set_input(cache.pos_k);
+            cache.freq_factors_q = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, D / 2);
+            ggml_set_name(cache.freq_factors_q, "step_ff_q"); ggml_set_input(cache.freq_factors_q);
+            cache.freq_factors_k = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, D / 2);
+            ggml_set_name(cache.freq_factors_k, "step_ff_k"); ggml_set_input(cache.freq_factors_k);
+
+            ggml_context * gctx = cache.ctx;
+            ggml_cgraph * gf = cache.gf;
+
+            vector_step_inputs inputs;
+            inputs.x_in           = cache.x_in;
+            inputs.mask_in        = cache.mask_in;
+            inputs.t_emb_in       = cache.t_emb_in;
+            inputs.text_in        = cache.text_in;
+            inputs.style_v_raw_in = cache.style_v_raw_in;
+            inputs.style_kctx_in  = cache.style_kctx_in;
+            inputs.noise_in       = cache.noise_in;
+            inputs.pos_q          = cache.pos_q;
+            inputs.pos_k          = cache.pos_k;
+            inputs.freq_factors_q = cache.freq_factors_q;
+            inputs.freq_factors_k = cache.freq_factors_k;
+
+            ggml_tensor * next = append_supertonic_vector_step_subgraph(
+                gctx, gf, model, inputs, L, text_len, total_steps);
+
+            ggml_set_name(next, "step_next_latent");
+            ggml_set_output(next);
+            ggml_build_forward_expand(gf, next);
+            cache.next_latent_out = next;
+
+
+            // Allocate.
+            cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+            if (!cache.allocr) throw std::runtime_error("ggml_gallocr_new vector step one-graph failed");
+            if (!ggml_gallocr_reserve(cache.allocr, gf)) {
+                throw std::runtime_error("ggml_gallocr_reserve vector step one-graph failed");
+            }
+            ggml_gallocr_alloc_graph(cache.allocr, gf);
+        }
+
+        // ===== Per-call inputs =====
+        // The existing trace_proj_ggml at lines 2143/2151 sets these tensors
+        // DIRECTLY from the caller-provided channel-major buffers (no host
+        // transpose), and the views downstream interpret memory accordingly.
+        // Follow that pattern exactly: the earlier host-side transpose loops
+        // were a bug (correlation 0.003 vs CPU reference; root-caused 2026-05-11).
+        ggml_backend_tensor_set(cache.x_in, noisy_latent, 0, (size_t)L * Cin * sizeof(float));
+        ggml_backend_tensor_set(cache.noise_in, noisy_latent, 0, (size_t)L * Cin * sizeof(float));
+        ggml_backend_tensor_set(cache.mask_in, latent_mask, 0, (size_t)L * sizeof(float));
+
+        std::vector<float> te_host = time_embedding(model, current_step, total_steps);
+        ggml_backend_tensor_set(cache.t_emb_in, te_host.data(), 0, te_host.size() * sizeof(float));
+
+        // text_emb is (C=256, text_len) channel-major; the tensor has
+        // ne=[text_len, 256], which makes text_len the fastest-varying
+        // dimension.  Same raw layout, so a direct copy works (matches trace_proj_ggml).
+        ggml_backend_tensor_set(cache.text_in, text_emb, 0, (size_t)text_len * text_C * sizeof(float));
+
+        // Style inputs (cached host buffers from existing helper).
+        const std::vector<float> * style_v_raw_ptr = nullptr;
+        const std::vector<float> * kctx_raw_ptr = nullptr;
+        cached_style_layouts(model, style_ttl, style_v_raw_ptr, kctx_raw_ptr);
+        ggml_backend_tensor_set(cache.style_v_raw_in, style_v_raw_ptr->data(), 0, style_v_raw_ptr->size() * sizeof(float));
+        ggml_backend_tensor_set(cache.style_kctx_in, kctx_raw_ptr->data(), 0, kctx_raw_ptr->size() * sizeof(float));
+
+        // RoPE positions + freq_factors.  theta is loaded from the model; the
+        // freq_factors are L / theta and text_len / theta, so recompute per call.
+        {
+            std::vector<int32_t> pos_q_host(L);
+            for (int i = 0; i < L; ++i) pos_q_host[i] = i;
+            ggml_backend_tensor_set(cache.pos_q, pos_q_host.data(), 0, pos_q_host.size() * sizeof(int32_t));
+            std::vector<int32_t> pos_k_host(text_len);
+            for (int i = 0; i < text_len; ++i) pos_k_host[i] = i;
+            ggml_backend_tensor_set(cache.pos_k, pos_k_host.data(), 0, pos_k_host.size() * sizeof(int32_t));
+
+            const int half = D / 2;  // 64 / 2 = 32
+            f32_tensor theta = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
+            if ((int)theta.data.size() < half) {
+                throw std::runtime_error("theta tensor has fewer than D/2 elements");
+            }
+            std::vector<float> ff_q(half), ff_k(half);
+            for (int d = 0; d < half; ++d) {
+                ff_q[d] = (float)L / theta.data[d];
+                ff_k[d] = (float)text_len / theta.data[d];
+            }
+            ggml_backend_tensor_set(cache.freq_factors_q, ff_q.data(), 0, ff_q.size() * sizeof(float));
+            ggml_backend_tensor_set(cache.freq_factors_k, ff_k.data(), 0, ff_k.size() * sizeof(float));
+        }
+
+        // ===== ONE compute call =====
+        supertonic_graph_compute(model, cache.gf);
+
+        // ===== Read output =====
+        // The output tensor has ne=[L, Cin] with element (i=t, j=c) at offset
+        // c*L+t — exactly the (c, t) channel-major layout the caller expects.
+        // Direct memcpy, no transpose.
+        next_latent_out.assign((size_t)Cin * L, 0.0f);
+        ggml_backend_tensor_get(cache.next_latent_out, next_latent_out.data(), 0,
+                                 (size_t)Cin * L * sizeof(float));
+        if (error) error->clear();
+        return true;
+    } catch (const std::exception & e) {
+        if (error) *error = e.what();
+        return false;
+    }
+}
+
+// =====================================================================
+// Phase A1+A2 — single-graph CFM loop
+// =====================================================================
+//
+// Unroll all `total_steps` CFM denoising steps into ONE ggml_cgraph and
+// dispatch with a single ggml_backend_graph_compute call.  Each step's
+// `x_in` and `noise_in` are the previous step's output node (no host
+// round-trip), and only `t_emb_in` differs per step (N inputs, one
+// per CFM step).  Replaces the engine's `for (step ...) {
+// supertonic_vector_step_ggml(...) }` loop on non-CPU backends.
+//
+// CPU keeps the per-step path because its cblas fastpaths benefit from
+// the cache-per-shape boundary and the host-side rope/style helpers in
+// trace_proj_ggml expect to see per-step outputs.
+
+struct vector_loop_one_graph_cache {
+    const supertonic_model * model = nullptr;
+    uint64_t generation_id = 0;
+    int L = 0;
+    int text_len = 0;
+    int total_steps = 0;
+
+    std::vector<uint8_t> buf;
+    ggml_context * ctx = nullptr;
+    ggml_cgraph * gf = nullptr;
+    ggml_gallocr_t allocr = nullptr;
+
+    // Shared inputs (constant across CFM steps).
+    ggml_tensor * x0_in = nullptr;          // ne=[L, Cin]  initial noisy latent
+    ggml_tensor * mask_in = nullptr;        // ne=[L]
+    ggml_tensor * text_in = nullptr;        // ne=[text_len, 256]
+    ggml_tensor * style_v_raw_in = nullptr; // ne=[50, 256]
+    ggml_tensor * style_kctx_in = nullptr;  // ne=[50, 256]
+
+    // RoPE inputs (constant across steps).
+    ggml_tensor * pos_q = nullptr;
+    ggml_tensor * pos_k = nullptr;
+    ggml_tensor * freq_factors_q = nullptr;
+    ggml_tensor * freq_factors_k = nullptr;
+
+    // Per-step time embedding (one tensor per CFM step).
+    std::vector<ggml_tensor *> t_emb_in;
+
+    // Final output — last step's `next` tensor.
+    ggml_tensor * final_latent_out = nullptr;
+};
+
+void free_vector_loop_one_graph_cache(vector_loop_one_graph_cache & cache) {
+    if (cache.allocr) {
+        supertonic_safe_gallocr_free(cache.allocr, cache.model ? cache.model->generation_id : 0);
+        cache.allocr = nullptr;
+    }
+    if (cache.ctx) {
+        ggml_free(cache.ctx);
+        cache.ctx = nullptr;
+    }
+    cache.gf = nullptr;
+    cache.buf.clear();
+    cache.model = nullptr;
+    cache.generation_id = 0;
+    cache.L = 0;
+    cache.text_len = 0;
+    cache.total_steps = 0;
+    cache.x0_in = cache.mask_in = cache.text_in = nullptr;
+    cache.style_v_raw_in = cache.style_kctx_in = nullptr;
+    cache.pos_q = cache.pos_k = cache.freq_factors_q = cache.freq_factors_k = nullptr;
+    cache.t_emb_in.clear();
+    cache.final_latent_out = nullptr;
+}
+
+bool supertonic_vector_loop_one_graph_ggml(const supertonic_model & model,
+                                            const float * initial_noisy_latent,
+                                            int latent_len,
+                                            const float * text_emb,
+                                            int text_len,
+                                            const float * style_ttl,
+                                            const float * latent_mask,
+                                            int total_steps,
+                                            std::vector<float> & final_latent_out,
+                                            std::string * error) {
+    // Public entry point — set the thread-local dispatch flag so the
+    // helpers' `supertonic_use_cpu_custom_ops()` reads consistently
+    // (false on non-CPU backends, true on CPU + accelerate/cblas).
+    supertonic_op_dispatch_scope dispatch(model);
+    try {
+        const int L = latent_len;
+        const int Cin = model.hparams.latent_channels;
+        const int text_C = 256;
+        const int D = 64;
+        const int kv_style = 50;
+
+        thread_local vector_loop_one_graph_cache cache;
+        const bool need_rebuild = cache.model != &model ||
+                                  cache.generation_id != model.generation_id ||
+                                  cache.L != L ||
+                                  cache.text_len != text_len ||
+                                  cache.total_steps != total_steps;
+        if (need_rebuild) {
+            free_vector_loop_one_graph_cache(cache);
+            cache.model = &model;
+            cache.generation_id = model.generation_id;
+            cache.L = L;
+            cache.text_len = text_len;
+            cache.total_steps = total_steps;
+
+            // Node budget: 8192 per step × total_steps (~40k at 5 steps).  Each
+            // per-step build registered ~1056 ggml nodes pre-Tier-2 and ~928
+            // post-Tier-2, so 8192/step is a generous round-up.  The extra 256
+            // covers the shared inputs (a few dozen) + per-step temb input tensors.
+            const int MAX_NODES = 8192 * std::max(1, total_steps) + 256;
+            const size_t buf_size = ggml_tensor_overhead() * (size_t) MAX_NODES +
+                                     ggml_graph_overhead_custom(MAX_NODES, false);
+            cache.buf.assign(buf_size, 0);
+            ggml_init_params p = { buf_size, cache.buf.data(), true };
+            cache.ctx = ggml_init(p);
+            cache.gf = ggml_new_graph_custom(cache.ctx, MAX_NODES, false);
+
+            // --- Shared inputs ---
+            cache.x0_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, L, Cin);
+            ggml_set_name(cache.x0_in, "loop_x0_in"); ggml_set_input(cache.x0_in);
+            cache.mask_in = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, L);
+            ggml_set_name(cache.mask_in, "loop_mask"); ggml_set_input(cache.mask_in);
+            cache.text_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, text_len, text_C);
+            ggml_set_name(cache.text_in, "loop_text_in"); ggml_set_input(cache.text_in);
+            cache.style_v_raw_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, kv_style, text_C);
+            ggml_set_name(cache.style_v_raw_in, "loop_style_v"); ggml_set_input(cache.style_v_raw_in);
+            cache.style_kctx_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, kv_style, text_C);
+            ggml_set_name(cache.style_kctx_in, "loop_style_kctx"); ggml_set_input(cache.style_kctx_in);
+
+            cache.pos_q = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_I32, L);
+            ggml_set_name(cache.pos_q, "loop_pos_q"); ggml_set_input(cache.pos_q);
+            cache.pos_k = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_I32, text_len);
+            ggml_set_name(cache.pos_k, "loop_pos_k"); ggml_set_input(cache.pos_k);
+            cache.freq_factors_q = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, D / 2);
+            ggml_set_name(cache.freq_factors_q, "loop_ff_q"); ggml_set_input(cache.freq_factors_q);
+            cache.freq_factors_k = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, D / 2);
+            ggml_set_name(cache.freq_factors_k, "loop_ff_k"); ggml_set_input(cache.freq_factors_k);
+
+            cache.t_emb_in.resize(total_steps, nullptr);
+            for (int s = 0; s < total_steps; ++s) {
+                cache.t_emb_in[s] = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, 64);
+                const std::string name_te = "loop_temb_" + std::to_string(s);
+                ggml_set_name(cache.t_emb_in[s], name_te.c_str());
+                ggml_set_input(cache.t_emb_in[s]);
+            }
+
+            // --- Chain N CFM steps together ---
+            ggml_tensor * cur_latent = cache.x0_in;
+            for (int s = 0; s < total_steps; ++s) {
+                vector_step_inputs inputs;
+                inputs.x_in           = cur_latent;       // previous step's output
+                inputs.mask_in        = cache.mask_in;
+                inputs.t_emb_in       = cache.t_emb_in[s];
+                inputs.text_in        = cache.text_in;
+                inputs.style_v_raw_in = cache.style_v_raw_in;
+                inputs.style_kctx_in  = cache.style_kctx_in;
+                inputs.noise_in       = cur_latent;       // CFM: next = noise_in + v/N
+                inputs.pos_q          = cache.pos_q;
+                inputs.pos_k          = cache.pos_k;
+                inputs.freq_factors_q = cache.freq_factors_q;
+                inputs.freq_factors_k = cache.freq_factors_k;
+
+                ggml_tensor * next = append_supertonic_vector_step_subgraph(
+                    cache.ctx, cache.gf, model, inputs, L, text_len, total_steps);
+                const std::string step_name = "loop_next_" + std::to_string(s);
+                ggml_set_name(next, step_name.c_str());
+                cur_latent = next;
+            }
+            ggml_set_output(cur_latent);
+            ggml_build_forward_expand(cache.gf, cur_latent);
+            cache.final_latent_out = cur_latent;
+
+            cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
+            if (!cache.allocr) throw std::runtime_error("ggml_gallocr_new vector loop one-graph failed");
+            if (!ggml_gallocr_reserve(cache.allocr, cache.gf)) {
+                throw std::runtime_error("ggml_gallocr_reserve vector loop one-graph failed");
+            }
+            ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
+        }
+
+        // --- Per-call inputs (constant across CFM steps) ---
+        ggml_backend_tensor_set(cache.x0_in, initial_noisy_latent, 0,
+                                 (size_t) L * Cin * sizeof(float));
+        ggml_backend_tensor_set(cache.mask_in, latent_mask, 0, (size_t) L * sizeof(float));
+        ggml_backend_tensor_set(cache.text_in, text_emb, 0, (size_t) text_len * text_C * sizeof(float));
+
+        const std::vector<float> * style_v_raw_ptr = nullptr;
+        const std::vector<float> * kctx_raw_ptr = nullptr;
+        cached_style_layouts(model, style_ttl, style_v_raw_ptr, kctx_raw_ptr);
+        ggml_backend_tensor_set(cache.style_v_raw_in, style_v_raw_ptr->data(), 0,
+                                 style_v_raw_ptr->size() * sizeof(float));
+        ggml_backend_tensor_set(cache.style_kctx_in, kctx_raw_ptr->data(), 0,
+                                 kctx_raw_ptr->size() * sizeof(float));
+
+        {
+            std::vector<int32_t> pos_q_host(L);
+            for (int i = 0; i < L; ++i) pos_q_host[i] = i;
+            ggml_backend_tensor_set(cache.pos_q, pos_q_host.data(), 0,
+                                     pos_q_host.size() * sizeof(int32_t));
+            std::vector<int32_t> pos_k_host(text_len);
+            for (int i = 0; i < text_len; ++i) pos_k_host[i] = i;
+            ggml_backend_tensor_set(cache.pos_k, pos_k_host.data(), 0,
+                                     pos_k_host.size() * sizeof(int32_t));
+
+            const int half = D / 2;  // 64 / 2 = 32
+            f32_tensor theta = read_f32(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
+            if ((int) theta.data.size() < half) {
+                throw std::runtime_error("theta tensor has fewer than D/2 elements");
+            }
+            std::vector<float> ff_q(half), ff_k(half);
+            for (int d = 0; d < half; ++d) {
+                ff_q[d] = (float) L / theta.data[d];
+                ff_k[d] = (float) text_len / theta.data[d];
+            }
+            ggml_backend_tensor_set(cache.freq_factors_q, ff_q.data(), 0,
+                                     ff_q.size() * sizeof(float));
+            ggml_backend_tensor_set(cache.freq_factors_k, ff_k.data(), 0,
+                                     ff_k.size() * sizeof(float));
+        }
+
+        // --- Per-step time embeddings ---
+        for (int s = 0; s < total_steps; ++s) {
+            std::vector<float> te = time_embedding(model, s, total_steps);
+            ggml_backend_tensor_set(cache.t_emb_in[s], te.data(), 0,
+                                     te.size() * sizeof(float));
+        }
+
+        // --- ONE compute call for ALL CFM steps ---
+        supertonic_graph_compute(model, cache.gf);
+
+        // --- Read final output ---
+        final_latent_out.assign((size_t) Cin * L, 0.0f);
+        ggml_backend_tensor_get(cache.final_latent_out, final_latent_out.data(), 0,
+                                 (size_t) Cin * L * sizeof(float));
+        if (error) error->clear();
+        return true;
+    } catch (const std::exception & e) {
+        if (error) *error = e.what();
+        return false;
+    }
+}
+
+// Public-ish driver: dispatches to the unrolled-loop path on non-CPU
+// backends, falls back to the per-step `supertonic_vector_step_ggml`
+// loop on CPU.  Set SUPERTONIC_DISABLE_LOOP_GRAPH (any value; only its
+// presence is checked) to A/B the per-step path against the unrolled
+// path on the same backend.
+bool supertonic_vector_loop_ggml(const supertonic_model & model,
+                                  const float * initial_noisy_latent,
+                                  int latent_len,
+                                  const float * text_emb,
+                                  int text_len,
+                                  const float * style_ttl,
+                                  const float * latent_mask,
+                                  int total_steps,
+                                  std::vector<float> & final_latent_out,
+                                  std::string * error) {
+    const bool disable_loop =
+        std::getenv("SUPERTONIC_DISABLE_LOOP_GRAPH") != nullptr;
+    if (!disable_loop && !model_prefers_cpu_kernels(model)) {
+        return supertonic_vector_loop_one_graph_ggml(
+            model, initial_noisy_latent, latent_len, text_emb, text_len,
+            style_ttl, latent_mask, total_steps, final_latent_out, error);
+    }
+    // CPU / disabled path: run the per-step loop in the addon's existing way.
+    try {
+        std::vector<float> latent((size_t) model.hparams.latent_channels * latent_len);
+        std::memcpy(latent.data(), initial_noisy_latent, latent.size() * sizeof(float));
+        std::vector<float> next;
+        for (int step = 0; step < total_steps; ++step) {
+            if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
+                                              text_emb, text_len,
+                                              style_ttl, latent_mask,
+                                              step, total_steps, next, error)) {
+                return false;
+            }
+            latent.swap(next);
+        }
+        final_latent_out = std::move(latent);
+        if (error) error->clear();
+        return true;
+    } catch (const std::exception & e) {
+        if (error) *error = e.what();
+        return false;
+    }
+}
+
 bool supertonic_vector_step_ggml(const supertonic_model & model,
                                  const float * noisy_latent,
                                  int latent_len,
@@ -2608,6 +5093,20 @@ bool supertonic_vector_step_ggml(const supertonic_model & model,
                                  int total_steps,
                                  std::vector<float> & next_latent_out,
                                  std::string * error) {
+    supertonic_op_dispatch_scope dispatch(model);
+    // Metal / CUDA / Vulkan / OpenCL: use the consolidated one-graph path
+    // (one ggml_backend_graph_compute call per CFM step instead of ~21).
+    // CPU: keep the multi-cache trace_proj path — its CPU fast-paths and
+    // thread_local sub-graph caches stay competitive on CPU and trace mode
+    // relies on the per-stage outputs.  Set SUPERTONIC_DISABLE_ONE_GRAPH
+    // (any value) to fall back to the multi-cache path on GPU backends.
+    const bool disable_one_graph = std::getenv("SUPERTONIC_DISABLE_ONE_GRAPH") != nullptr;
+    if (!disable_one_graph && !model_prefers_cpu_kernels(model)) {
+        return supertonic_vector_step_one_graph_ggml(model, noisy_latent, latent_len,
+                                                      text_emb, text_len, style_ttl,
+                                                      latent_mask, current_step,
+                                                      total_steps, next_latent_out, error);
+    }
     try {
         std::vector<supertonic_trace_tensor> scalar_trace;
         std::vector<supertonic_trace_tensor> ggml_trace;
diff --git a/tts-cpp/src/supertonic_vocoder.cpp b/tts-cpp/src/supertonic_vocoder.cpp
index bbe00137273..4cd8937a30e 100644
--- a/tts-cpp/src/supertonic_vocoder.cpp
+++ b/tts-cpp/src/supertonic_vocoder.cpp
@@ -56,11 +56,21 @@ bool vocoder_profile_enabled() {
 
 void profile_vocoder_checkpoint(const char * label,
                                 std::chrono::steady_clock::time_point & last) {
-    if (!vocoder_profile_enabled()) return;
+    const bool stderr_on = vocoder_profile_enabled();
+    const bool csv_on    = supertonic_profile_csv_enabled();
+    if (!stderr_on && !csv_on) return;
     const auto now = std::chrono::steady_clock::now();
     const double ms = std::chrono::duration<double, std::milli>(now - last).count();
     last = now;
-    std::fprintf(stderr, "supertonic_vocoder_profile island=%s ms=%.3f\n", label, ms);
+    if (stderr_on) {
+        std::fprintf(stderr, "supertonic_vocoder_profile island=%s ms=%.3f\n", label, ms);
+    }
+    // Phase 2D: machine-readable row.  `step` doesn't apply to the
+    // vocoder (synth-level call, not denoise-step), so we pass -1
+    // as the sentinel.
+    if (csv_on) {
+        supertonic_profile_csv_record("vocoder", label, /*step=*/-1, ms);
+    }
 }
 
 ggml_tensor * repeat_like(ggml_context * ctx, ggml_tensor * v, ggml_tensor * like) {
@@ -78,11 +88,33 @@ ggml_tensor * repeat_like(ggml_context * ctx, ggml_tensor * v, ggml_tensor * lik
             std::to_string(like->ne[0]) + "," + std::to_string(like->ne[1]) + "," +
             std::to_string(like->ne[2]) + "," + std::to_string(like->ne[3]) + "]");
     }
-    return ggml_repeat(ctx, v, like);
+    // Every caller feeds the return value straight into ggml_add / ggml_mul,
+    // both of which broadcast natively in ggml.  Skip the explicit
+    // ggml_repeat node so the downstream op handles the broadcast — saves a
+    // kernel_repeat launch per call on Metal.
+    static const bool force_explicit_repeat =
+        std::getenv("SUPERTONIC_FORCE_EXPLICIT_REPEAT") != nullptr;
+    if (force_explicit_repeat) {
+        return ggml_repeat(ctx, v, like);
+    }
+    return v;
 }
 
 ggml_tensor * causal_replicate_pad_1d(ggml_context * ctx, ggml_tensor * x, int pad_left) {
     if (pad_left <= 0) return x;
+    // Prefer the fused supertonic_edge_pad_1d op when available (Metal
+    // via the overlay port + CPU via the parity backstop) — collapses
+    // the view + repeat_4d + concat triplet into a single dispatch.
+    // Override with SUPERTONIC_DISABLE_FUSED_EDGE_PAD=1 to A/B against
+    // the stock-ops chain.
+    static const bool disable_fused_edge_pad =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_EDGE_PAD") != nullptr;
+    if (!disable_fused_edge_pad &&
+        x->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        ggml_is_contiguous(x)) {
+        return ggml_supertonic_edge_pad_1d(ctx, x, pad_left, 0);
+    }
     const int64_t C = x->ne[1];
     ggml_tensor * first = ggml_view_2d(ctx, x, 1, C, x->nb[1], 0);
     ggml_tensor * rep = ggml_repeat_4d(ctx, first, pad_left, C, 1, 1);
@@ -96,7 +128,15 @@ ggml_tensor * conv1d_causal_ggml(ggml_context * ctx,
                                  int dilation = 1) {
     const int K = (int) w->ne[0];
 #if defined(TTS_CPP_USE_ACCELERATE) || defined(TTS_CPP_USE_CBLAS)
-    if (K == 1 && dilation == 1 &&
+    // The cblas-backed `ggml_custom_4d` fast paths below assume the op
+    // callbacks run on the CPU scheduler with host-addressable tensor
+    // data.  On any non-CPU backend (CUDA / Metal / Vulkan / OpenCL)
+    // GGML_OP_CUSTOM is rejected outright, so fall through to the
+    // pure-GGML im2col + mul_mat path which dispatches natively on
+    // every backend.  Flag is thread_local, set by the outer
+    // supertonic_op_dispatch_scope at each forward entry point.
+    const bool use_cpu_custom = supertonic_use_cpu_custom_ops();
+    if (use_cpu_custom && K == 1 && dilation == 1 &&
         x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 &&
         (!b || b->type == GGML_TYPE_F32) &&
         x->ne[2] == 1 && x->ne[3] == 1) {
@@ -146,7 +186,7 @@ ggml_tensor * conv1d_causal_ggml(ggml_context * ctx,
                               1,
                               nullptr);
     }
-    if (K > 1 && dilation == 1 &&
+    if (use_cpu_custom && K > 1 && dilation == 1 &&
         x->type == GGML_TYPE_F32 && w->type == GGML_TYPE_F32 &&
         (!b || b->type == GGML_TYPE_F32) &&
         x->ne[2] == 1 && x->ne[3] == 1) {
@@ -279,6 +319,9 @@ ggml_tensor * depthwise_causal_custom_ggml(ggml_context * ctx,
                                            ggml_tensor * w,
                                            ggml_tensor * b,
                                            int dilation) {
+    // CPU-only fast path; GPU backends reject GGML_OP_CUSTOM and must
+    // fall through to the im2col + mul_mat path further below.
+    if (!supertonic_use_cpu_custom_ops()) return nullptr;
     const depthwise_causal_op_config * cfg = depthwise_causal_config(dilation);
     if (!cfg || x->type != GGML_TYPE_F32 || w->type != GGML_TYPE_F32 || b->type != GGML_TYPE_F32) {
         return nullptr;
@@ -292,6 +335,11 @@ ggml_tensor * depthwise_causal_custom_ggml(ggml_context * ctx,
                           const_cast<depthwise_causal_op_config *>(cfg));
 }
 
+// `leaky_relu_portable_ggml` is now defined inline in
+// supertonic_internal.h so the dispatch tests can call it without
+// linking through this TU.  See the header for the lowering rationale
+// + parity-test reference.
+
 ggml_tensor * depthwise_conv1d_causal_ggml(ggml_context * ctx,
                                            ggml_tensor * x,
                                            ggml_tensor * w,
@@ -314,6 +362,15 @@ ggml_tensor * layer_norm_channel_ggml(ggml_context * ctx,
                                       ggml_tensor * gamma,
                                       ggml_tensor * beta,
                                       float eps = 1e-6f) {
+    static const bool disable_fused_layer_norm =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_LAYER_NORM") != nullptr;
+    if (!disable_fused_layer_norm &&
+        x->type == GGML_TYPE_F32 && gamma->type == GGML_TYPE_F32 && beta->type == GGML_TYPE_F32 &&
+        x->ne[2] == 1 && x->ne[3] == 1 &&
+        gamma->ne[0] == x->ne[1] && beta->ne[0] == x->ne[1] &&
+        ggml_is_contiguous(x) && ggml_is_contiguous(gamma) && ggml_is_contiguous(beta)) {
+        return ggml_supertonic_layer_norm_channel(ctx, x, gamma, beta, eps);
+    }
     ggml_tensor * y = ggml_cont(ctx, ggml_permute(ctx, x, 1, 0, 2, 3));
     y = ggml_norm(ctx, y, eps);
     y = ggml_mul(ctx, y, repeat_like(ctx, gamma, y));
@@ -326,16 +383,130 @@ ggml_tensor * convnext_block_ggml(ggml_context * ctx,
                                   ggml_tensor * x,
                                   int idx) {
     static const int dilations[10] = {1, 2, 4, 1, 2, 4, 1, 1, 1, 1};
+    const bool use_cpu_custom = supertonic_use_cpu_custom_ops();
+    ggml_tensor * dw = depthwise_conv1d_causal_ggml(ctx, x, w.dw_w, w.dw_b, dilations[idx]);
+    if (use_cpu_custom) {
+        // Audit follow-up #6 (F7) — fused LN + pw1 + gelu + pw2 + γ +
+        // residual.  The fused helper keeps the layer-norm output in
+        // `[C, T0]` (channel-major) memory and lowers both K=1 pointwise
+        // convs to direct `ggml_mul_mat` against that layout, eliminating
+        // the LN back-permute/cont and both im2col copies the previous
+        // chain paid (audit cost: ~16.8 MiB / vocoder pass).  The
+        // depthwise op stays in this TU so the CBLAS custom-op fast
+        // path is unaffected.  Trace + pipeline parity preserved — the
+        // fused helper computes the same arithmetic in the same order,
+        // just on a different (compatible) intermediate layout.  See
+        // `supertonic_internal.h::convnext_block_fused_ggml` for the
+        // op-by-op rationale and
+        // `test/test_supertonic_convnext_block_fused.cpp` for the
+        // parity test.
+        return convnext_block_fused_ggml(
+            ctx,
+            /*residual=*/x,
+            /*dw_out=*/dw,
+            w.norm_g, w.norm_b,
+            w.pw1_w, w.pw1_b,
+            w.pw2_w, w.pw2_b,
+            w.gamma);
+    }
+    // Metal / non-CPU backend path: keep the granular chain so the
+    // per-op Metal fused-kernel fast paths inside the helpers (layer
+    // norm, bias+gelu, ...) get a chance to fire.  GGML_OP_CUSTOM is
+    // rejected on GPU backends so the F7 fused helper above isn't
+    // usable here regardless.
     ggml_tensor * residual = x;
-    ggml_tensor * y = depthwise_conv1d_causal_ggml(ctx, x, w.dw_w, w.dw_b, dilations[idx]);
+    ggml_tensor * y = dw;
     y = layer_norm_channel_ggml(ctx, y, w.norm_g, w.norm_b);
-    y = conv1d_causal_ggml(ctx, y, w.pw1_w, w.pw1_b);
-    y = ggml_gelu_erf(ctx, y);
+    // pw1 + bias + GELU.  On Metal we drop the bias from conv1d_causal_ggml
+    // and feed the pre-bias matmul output to the fused bias_gelu op (one
+    // dispatch instead of two: ggml_add + gelu_erf).  CPU keeps its existing
+    // cblas+bias_inside path — the standard library erff in the unfused
+    // chain is already the cheapest there.
+    static const bool disable_fused_bias_gelu =
+        std::getenv("SUPERTONIC_DISABLE_FUSED_BIAS_GELU") != nullptr;
+    if (!disable_fused_bias_gelu &&
+        y->type == GGML_TYPE_F32 && w.pw1_w->type == GGML_TYPE_F32 &&
+        w.pw1_b->type == GGML_TYPE_F32) {
+        y = conv1d_causal_ggml(ctx, y, w.pw1_w, /*b=*/nullptr);
+        if (y->ne[2] == 1 && y->ne[3] == 1 &&
+            w.pw1_b->ne[0] == y->ne[1] &&
+            ggml_is_contiguous(y) && ggml_is_contiguous(w.pw1_b)) {
+            y = ggml_supertonic_bias_gelu(ctx, y, w.pw1_b);
+        } else {
+            y = ggml_add(ctx, y, repeat_like(ctx, w.pw1_b, y));
+            y = ggml_gelu_erf(ctx, y);
+        }
+    } else {
+        y = conv1d_causal_ggml(ctx, y, w.pw1_w, w.pw1_b);
+        y = ggml_gelu_erf(ctx, y);
+    }
+    // NOTE: the vector_estimator's `ggml_supertonic_pw2_residual` op
+    // expects `gamma` to be `[C]` (per-channel scale); the vocoder
+    // however stores `gamma` as a `[1]` scalar (single learnable
+    // scale per ConvNeXt block).  The shapes are incompatible, so we
+    // keep the unfused chain here.  A vocoder-specific fused op with
+    // scalar gamma is possible but the win would be tiny (~10
+    // dispatches × ~40μs = 0.4 ms).
     y = conv1d_causal_ggml(ctx, y, w.pw2_w, w.pw2_b);
     y = ggml_mul(ctx, y, repeat_like(ctx, w.gamma, y));
     return ggml_add(ctx, residual, y);
 }
 
+ggml_tensor * pointwise_matmul_ct_voc(ggml_context * ctx,
+                                      ggml_tensor * x_ct,
+                                      ggml_tensor * w,
+                                      ggml_tensor * b) {
+    GGML_ASSERT(w->ne[0] == 1);
+    GGML_ASSERT(w->ne[1] == x_ct->ne[0]);
+    GGML_ASSERT(ggml_is_contiguous(w));
+    ggml_tensor * w_2d = ggml_reshape_2d(ctx, w, w->ne[1], w->ne[2]);
+    ggml_tensor * x_2d = ggml_reshape_2d(ctx, x_ct, x_ct->ne[0], x_ct->ne[1]);
+    ggml_tensor * y = ggml_mul_mat(ctx, w_2d, x_2d);
+    if (b) y = ggml_add(ctx, y, repeat_like(ctx, b, y));
+    return y;
+}
+
+// Phase B2 follow-up: vocoder ConvNeXt block on `[C, T]` activations
+// end-to-end.  Takes `[C, T]` input and returns `[C, T]` — the caller
+// wraps the 10-block chain in a single `[T, C] -> [C, T]` permute at
+// entry and a single `[C, T] -> [T, C]` permute at exit, so this
+// block has zero intra-block permutes.
+//
+// Vocoder ConvNeXt differs from vector_estimator's: (1) depthwise is
+// **causal** (left-only pad) rather than symmetric edge-clamp — handled
+// by the `_causal_ct` variant of the fused depthwise kernel (port-v14).
+// (2) `gamma` is a scalar `[1]`, not per-channel, so the `pw2_residual_ct`
+// fused op doesn't fit — unfused scalar `mul + add` tail.  (3) `norm_g` /
+// `norm_b` ship as `[1, C]` (same flatten-needed quirk as vector_estimator's
+// `.gamma`).
+//
+// Caller: `SUPERTONIC_DISABLE_CT_VOCODER=1` reverts to legacy
+// `convnext_block_ggml`.
+ggml_tensor * convnext_block_ggml_ct(ggml_context * ctx,
+                                     const supertonic_vocoder_convnext_weights & w,
+                                     ggml_tensor * x_ct,
+                                     int idx) {
+    static const int dilations[10] = {1, 2, 4, 1, 2, 4, 1, 1, 1, 1};
+    ggml_tensor * residual = x_ct;
+
+    auto flatten_1d = [&](ggml_tensor * t) -> ggml_tensor * {
+        const int64_t n = ggml_nelements(t);
+        if (t->ne[0] == n && t->ne[1] == 1 && t->ne[2] == 1 && t->ne[3] == 1) return t;
+        return ggml_reshape_1d(ctx, t, n);
+    };
+
+    ggml_tensor * y_ct = ggml_supertonic_depthwise_1d_causal_ct(ctx, x_ct,
+        w.dw_w, flatten_1d(w.dw_b), dilations[idx]);
+    y_ct = ggml_supertonic_layer_norm_channel_ct(ctx, y_ct,
+        flatten_1d(w.norm_g), flatten_1d(w.norm_b), 1e-6f);
+    y_ct = pointwise_matmul_ct_voc(ctx, y_ct, w.pw1_w, /*b=*/nullptr);
+    y_ct = ggml_supertonic_bias_gelu_ct(ctx, y_ct, flatten_1d(w.pw1_b));
+    y_ct = pointwise_matmul_ct_voc(ctx, y_ct, w.pw2_w, flatten_1d(w.pw2_b));
+    // Scalar gamma multiply (broadcasts in any layout).
+    y_ct = ggml_mul(ctx, y_ct, repeat_like(ctx, w.gamma, y_ct));
+    return ggml_add(ctx, residual, y_ct);
+}
+
 struct vocoder_graph_cache {
     const supertonic_model * model = nullptr;
     uint64_t generation_id = 0;
@@ -344,9 +515,17 @@ struct vocoder_graph_cache {
     ggml_context * ctx = nullptr;
     ggml_cgraph * gf = nullptr;
     ggml_gallocr_t allocr = nullptr;
-    ggml_tensor * x_in = nullptr;
-    ggml_tensor * bn_scale = nullptr;
-    ggml_tensor * bn_shift = nullptr;
+
+    // F3: the new graph input is the raw latent in its natural
+    // `[latent_len, latent_channels]` shape; the existing
+    // `[t, r] → [t*factor + r]` unpack runs on the device via
+    // `ggml_reshape + ggml_permute + ggml_cont`.  Drops a ~40 KiB
+    // CPU loop + redundant upload per synth on a discrete GPU.
+    ggml_tensor * latent_in = nullptr;
+    // F2: bn_scale / bn_shift are no longer graph inputs — the
+    // vocoder graph references `model.vocoder.bn_scale_pre` /
+    // `bn_shift_pre` directly (allocated in model.buffer_w at load
+    // time).  The previous `ggml_set_input` markers are gone.
     ggml_tensor * wav = nullptr;
 };
 
@@ -366,13 +545,21 @@ void free_vocoder_cache(vocoder_graph_cache & cache) {
 void build_supertonic_vocoder_cache(vocoder_graph_cache & cache,
                                     const supertonic_model & model,
                                     int latent_len) {
-    // Reuse the cached graph when it already matches this shape AND was built on
-    // the direct backend path (cache.allocr non-null). The scheduler path leaves
-    // cache.allocr null, so it always rebuilds. Mirrors run_hift_decode.
+    // QVAC-19254 — reuse the cached graph when it already matches this shape
+    // AND was built on the direct backend path (cache.allocr non-null).  The
+    // scheduler path leaves cache.allocr null, so it always rebuilds.
+    // Mirrors run_hift_decode.
     if (cache.ctx && cache.allocr && cache.generation_id == model.generation_id
         && cache.latent_len == latent_len) {
         return;
     }
+    // `supertonic_op_dispatch_scope` is set by the outer
+    // `supertonic_vocoder_forward_ggml` entry point; inside graph builders
+    // we read the thread-local flag directly.
+    const bool use_cpu_custom = supertonic_use_cpu_custom_ops();
+    (void) use_cpu_custom;  // Also gates the `[C, T]` ConvNeXt path below;
+                            // the per-op graph builders read the flag
+                            // themselves via `supertonic_use_cpu_custom_ops()`.
     free_vocoder_cache(cache);
     cache.model = &model;
     cache.generation_id = model.generation_id;
@@ -387,17 +574,38 @@ void build_supertonic_vocoder_cache(vocoder_graph_cache & cache,
     cache.ctx = ggml_init(p);
     cache.gf = ggml_new_graph_custom(cache.ctx, MAX_NODES, false);
 
-    ggml_tensor * x = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32, T0, C_latent);
-    cache.x_in = x;
-    ggml_set_name(cache.x_in, "vocoder_in");
-    ggml_set_input(cache.x_in);
-
-    cache.bn_scale = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, 512);
-    ggml_set_name(cache.bn_scale, "vocoder_bn_scale");
-    ggml_set_input(cache.bn_scale);
-    cache.bn_shift = ggml_new_tensor_1d(cache.ctx, GGML_TYPE_F32, 512);
-    ggml_set_name(cache.bn_shift, "vocoder_bn_shift");
-    ggml_set_input(cache.bn_shift);
+    // F3: graph input is the latent in its raw on-host layout
+    // `[latent_len, latent_channels]`.  The unpack-and-permute
+    // formerly done by a CPU triple-loop runs in the graph now:
+    //
+    //   latent_in : ne=[L, 144]
+    //   → reshape_3d  ne=[L, 6, 24]   (split channel into c × r)
+    //   → permute(1,0,2,3) ne=[6, L, 24]
+    //   → cont        ne=[6, L, 24]   contiguous
+    //   → reshape_2d  ne=[6*L, 24] = [T0, C_latent]
+    //
+    // Math is a pure permutation; output element
+    // `x[c * T0 + t*6 + r] = latent[(c*6+r) * L + t]` matches the
+    // CPU loop in the legacy `supertonic_vocoder_forward_cpu`.
+    const int latent_channels = model.hparams.latent_channels;  // 144
+    cache.latent_in = ggml_new_tensor_2d(cache.ctx, GGML_TYPE_F32,
+                                         latent_len, latent_channels);
+    ggml_set_name(cache.latent_in, "vocoder_latent_in");
+    ggml_set_input(cache.latent_in);
+    ggml_tensor * latent_3d = ggml_reshape_3d(cache.ctx, cache.latent_in,
+                                              latent_len,
+                                              model.hparams.ttl_chunk_compress_factor,
+                                              C_latent);
+    ggml_tensor * latent_perm = ggml_permute(cache.ctx, latent_3d, 1, 0, 2, 3);
+    ggml_tensor * latent_cont = ggml_cont(cache.ctx, latent_perm);
+    ggml_tensor * x = ggml_reshape_2d(cache.ctx, latent_cont, T0, C_latent);
+    ggml_set_name(x, "vocoder_unpacked");
+
+    // F2: bn_scale / bn_shift are now persistent weight tensors
+    // (`model.vocoder.bn_scale_pre` / `bn_shift_pre`) allocated at
+    // load time.  See AUDIT_SUPERTONIC_OPENCL.md F2 for the
+    // recompute formula.  The graph references them as regular
+    // weight tensors so they don't show up as inputs.
 
     const float normalizer_scale = scalar_f32_tensor(model.vocoder.normalizer_scale);
     x = ggml_scale(cache.ctx, x, 1.0f / normalizer_scale);
@@ -407,19 +615,40 @@ void build_supertonic_vocoder_cache(vocoder_graph_cache & cache,
 
     x = conv1d_causal_ggml(cache.ctx, x, model.vocoder.embed_w, model.vocoder.embed_b);
     ggml_set_name(x, "vocoder_embed");
-    for (int i = 0; i < 10; ++i) {
-        x = convnext_block_ggml(cache.ctx, model.vocoder.convnext[(size_t) i], x, i);
-        ggml_set_name(x, ("vocoder_convnext_" + std::to_string(i)).c_str());
+    // Phase B2 follow-up: when CPU custom ops are off (the Metal case),
+    // route the 10-block ConvNeXt chain through the `[C, T]` variant.
+    // Each block runs depthwise (causal_ct) + layer_norm + pw1 +
+    // bias_gelu + pw2 + scalar gamma + residual add entirely on `[C, T]`,
+    // with no intra-block permutes: one `[T, C] -> [C, T]` permute before
+    // the chain and one reverse permute after it.
+    // Override: SUPERTONIC_DISABLE_CT_VOCODER=1.
+    static const bool disable_ct_vocoder =
+        std::getenv("SUPERTONIC_DISABLE_CT_VOCODER") != nullptr;
+    const bool use_ct_vocoder = !disable_ct_vocoder && !use_cpu_custom;
+    if (use_ct_vocoder) {
+        ggml_tensor * x_ct = ggml_cont(cache.ctx, ggml_permute(cache.ctx, x, 1, 0, 2, 3));
+        for (int i = 0; i < 10; ++i) {
+            x_ct = convnext_block_ggml_ct(cache.ctx, model.vocoder.convnext[(size_t) i], x_ct, i);
+            ggml_set_name(x_ct, ("vocoder_convnext_" + std::to_string(i)).c_str());
+        }
+        x = ggml_cont(cache.ctx, ggml_permute(cache.ctx, x_ct, 1, 0, 2, 3));
+    } else {
+        for (int i = 0; i < 10; ++i) {
+            x = convnext_block_ggml(cache.ctx, model.vocoder.convnext[(size_t) i], x, i);
+            ggml_set_name(x, ("vocoder_convnext_" + std::to_string(i)).c_str());
+        }
     }
 
-    x = ggml_mul(cache.ctx, x, repeat_like(cache.ctx, cache.bn_scale, x));
-    x = ggml_add(cache.ctx, x, repeat_like(cache.ctx, cache.bn_shift, x));
+    // F2: reference the pre-baked weight tensors directly instead
+    // of the (deleted) per-call graph inputs.
+    x = ggml_mul(cache.ctx, x, repeat_like(cache.ctx, model.vocoder.bn_scale_pre, x));
+    x = ggml_add(cache.ctx, x, repeat_like(cache.ctx, model.vocoder.bn_shift_pre, x));
     ggml_set_name(x, "vocoder_final_norm");
 
     x = conv1d_causal_ggml(cache.ctx, x, model.vocoder.head1_w, model.vocoder.head1_b);
     ggml_set_name(x, "vocoder_head1");
     const float prelu = scalar_f32_tensor(model.vocoder.head_prelu);
-    x = ggml_leaky_relu(cache.ctx, x, prelu, false);
+    x = leaky_relu_portable_ggml(cache.ctx, x, prelu);
     ggml_set_name(x, "vocoder_prelu");
     x = conv1d_causal_ggml(cache.ctx, x, model.vocoder.head2_w, nullptr);
     ggml_set_name(x, "wav");
@@ -698,35 +927,24 @@ bool supertonic_vocoder_forward_ggml(const supertonic_model & model,
                                      int latent_len,
                                      std::vector<float> & wav_out,
                                      std::string * error) {
+    // Sets thread_local CPU-custom-op + F16-attn flags for the duration
+    // of this call so the graph-build helpers below pick the backend-
+    // appropriate dispatch path; RAII teardown handles exceptions.
+    supertonic_op_dispatch_scope dispatch(model);
     try {
         auto profile_last = std::chrono::steady_clock::now();
-        const int C_latent = model.hparams.latent_dim;
-        const int factor = model.hparams.ttl_chunk_compress_factor;
-        const int T0 = latent_len * factor;
         if (latent_len <= 0) throw std::runtime_error("latent_len must be positive");
 
-        std::vector<float> x_in((size_t) T0 * C_latent);
-        for (int c = 0; c < C_latent; ++c) {
-            for (int t = 0; t < latent_len; ++t) {
-                for (int r = 0; r < factor; ++r) {
-                    int src_c = c * factor + r;
-                    x_in[(size_t) c * T0 + (t * factor + r)] =
-                        latent[(size_t) src_c * latent_len + t];
-                }
-            }
-        }
-        profile_vocoder_checkpoint("unpack", profile_last);
-
-        f32_tensor gamma = read_f32_tensor(model.vocoder.final_norm_g);
-        f32_tensor beta = read_f32_tensor(model.vocoder.final_norm_b);
-        f32_tensor mean = read_f32_tensor(model.vocoder.final_norm_running_mean);
-        f32_tensor var = read_f32_tensor(model.vocoder.final_norm_running_var);
-        std::vector<float> bn_scale(512), bn_shift(512);
-        for (int c = 0; c < 512; ++c) {
-            bn_scale[c] = gamma.data[c] / std::sqrt(var.data[c] + 1e-5f);
-            bn_shift[c] = beta.data[c] - mean.data[c] * bn_scale[c];
-        }
-        profile_vocoder_checkpoint("bn_params", profile_last);
+        // F3: the CPU host-side unpack loop is gone — the graph
+        // ingests `latent` in its natural `[latent_len, latent_channels]`
+        // shape and runs the `reshape + permute + cont + reshape`
+        // chain on the device.
+
+        // F2: bn_scale / bn_shift were pre-baked at load time into
+        // model.vocoder.{bn_scale_pre, bn_shift_pre} and the
+        // vocoder graph references those weight tensors directly.
+        // The per-synth pattern of 4 final_norm.* downloads + CPU
+        // compute + 2 uploads is gone; nothing happens here for BN.
 
         thread_local vocoder_graph_cache cache;
         // Reuse the shape-keyed graph on the direct backend path; rebuild + route
@@ -734,6 +952,9 @@ bool supertonic_vocoder_forward_ggml(const supertonic_model & model,
         build_supertonic_vocoder_cache(cache, model, latent_len);
         profile_vocoder_checkpoint("graph_cache", profile_last);
 
+        // QVAC-19254 — direct vs scheduler routing.  Re-uses cache.allocr
+        // for direct dispatch; falls through to the model scheduler when
+        // an op must run on CPU (GGML_OP_CUSTOM etc.).
         bool direct = true;
         const int n_nodes = ggml_graph_n_nodes(cache.gf);
         for (int i = 0; i < n_nodes; ++i) {
@@ -751,9 +972,14 @@ bool supertonic_vocoder_forward_ggml(const supertonic_model & model,
         } else {
             supertonic_sched_alloc(model, cache.gf);
         }
-        ggml_backend_tensor_set(cache.x_in, x_in.data(), 0, x_in.size() * sizeof(float));
-        ggml_backend_tensor_set(cache.bn_scale, bn_scale.data(), 0, bn_scale.size() * sizeof(float));
-        ggml_backend_tensor_set(cache.bn_shift, bn_shift.data(), 0, bn_shift.size() * sizeof(float));
+        // F3: upload latent in raw `[latent_len, latent_channels]`
+        // layout.  F2 pre-baked bn_scale / bn_shift into model
+        // weights at load time (referenced by the graph as
+        // `model.vocoder.bn_scale_pre` / `bn_shift_pre`), so no per-call
+        // BN upload is needed; that's why the struct doesn't carry
+        // `cache.bn_scale` / `cache.bn_shift` fields.
+        const size_t latent_bytes = (size_t) ggml_nelements(cache.latent_in) * sizeof(float);
+        ggml_backend_tensor_set(cache.latent_in, latent, 0, latent_bytes);
         profile_vocoder_checkpoint("set_inputs", profile_last);
 
         if (direct) supertonic_graph_compute(model, cache.gf);
@@ -866,6 +1092,7 @@ bool supertonic_vocoder_trace_ggml(const supertonic_model & model,
                                    int latent_len,
                                    std::vector<supertonic_trace_tensor> & trace_out,
                                    std::string * error) {
+    supertonic_op_dispatch_scope dispatch(model);
     try {
         trace_out.clear();
         const int C_latent = model.hparams.latent_dim;
@@ -930,14 +1157,11 @@ bool supertonic_vocoder_trace_ggml(const supertonic_model & model,
             ggml_build_forward_expand(gf, cur);
         }
 
-        ggml_tensor * bn_scale = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 512);
-        ggml_set_name(bn_scale, "trace_bn_scale");
-        ggml_set_input(bn_scale);
-        ggml_tensor * bn_shift = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 512);
-        ggml_set_name(bn_shift, "trace_bn_shift");
-        ggml_set_input(bn_shift);
-        cur = ggml_mul(ctx, cur, repeat_like(ctx, bn_scale, cur));
-        cur = ggml_add(ctx, cur, repeat_like(ctx, bn_shift, cur));
+        // F2: trace graph now references the pre-baked weight
+        // tensors directly (same as the production graph), so the
+        // per-call BN re-derivation below is gone too.
+        cur = ggml_mul(ctx, cur, repeat_like(ctx, model.vocoder.bn_scale_pre, cur));
+        cur = ggml_add(ctx, cur, repeat_like(ctx, model.vocoder.bn_shift_pre, cur));
         ggml_set_name(cur, "final_norm");
         ggml_set_output(cur);
         ggml_build_forward_expand(gf, cur);
@@ -945,7 +1169,7 @@ bool supertonic_vocoder_trace_ggml(const supertonic_model & model,
         ggml_set_name(cur, "head1");
         ggml_set_output(cur);
         ggml_build_forward_expand(gf, cur);
-        cur = ggml_leaky_relu(ctx, cur, scalar_f32_tensor(model.vocoder.head_prelu), false);
+        cur = leaky_relu_portable_ggml(ctx, cur, scalar_f32_tensor(model.vocoder.head_prelu));
         ggml_set_name(cur, "prelu");
         ggml_set_output(cur);
         ggml_build_forward_expand(gf, cur);
@@ -958,17 +1182,10 @@ bool supertonic_vocoder_trace_ggml(const supertonic_model & model,
 
         std::vector<float> x_host = unpack_latent_ggml_layout(model, latent, latent_len);
         ggml_backend_tensor_set(x_in, x_host.data(), 0, x_host.size() * sizeof(float));
-        f32_tensor gamma = read_f32_tensor(model.vocoder.final_norm_g);
-        f32_tensor beta = read_f32_tensor(model.vocoder.final_norm_b);
-        f32_tensor mean = read_f32_tensor(model.vocoder.final_norm_running_mean);
-        f32_tensor var = read_f32_tensor(model.vocoder.final_norm_running_var);
-        std::vector<float> bn_scale_host(512), bn_shift_host(512);
-        for (int c = 0; c < 512; ++c) {
-            bn_scale_host[c] = gamma.data[c] / std::sqrt(var.data[c] + 1e-5f);
-            bn_shift_host[c] = beta.data[c] - mean.data[c] * bn_scale_host[c];
-        }
-        ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "trace_bn_scale"), bn_scale_host.data(), 0, bn_scale_host.size() * sizeof(float));
-        ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "trace_bn_shift"), bn_shift_host.data(), 0, bn_shift_host.size() * sizeof(float));
+        // F2: trace_bn_scale / trace_bn_shift inputs are gone; the
+        // graph above now folds the pre-baked bn_scale_pre /
+        // bn_shift_pre weight tensors in directly.
+        // QVAC-19254 — pair the sched_alloc above with sched_compute here.
         supertonic_sched_compute(model, gf);
 
         trace_out.push_back({"unpack", {T0, C_latent}, unpack_latent_scalar(model, latent, latent_len)});
diff --git a/tts-cpp/test/test_supertonic_audit3_caches.cpp b/tts-cpp/test/test_supertonic_audit3_caches.cpp
new file mode 100644
index 00000000000..fcb63ea4007
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_audit3_caches.cpp
@@ -0,0 +1,279 @@
+// TDD harness for the audit follow-up #3 caches: F17 (duration
+// scalar-continuation weight cache), F18 (text-encoder convnext-
+// front graph cache), and F19 (vector-estimator front-block graph
+// cache).
+//
+// Each finding is a "make the second call cheaper" change: the
+// graph or weight bytes that the per-synth code path reaches for
+// are pulled out into model-lifetime storage on first touch, then
+// reused on every subsequent call.  Math is unchanged; the
+// test gate is a strict "two consecutive calls with identical
+// inputs produce bit-exact identical outputs" — if the cache
+// accidentally aliases buffers or resets state across calls, this
+// test trips.
+//
+//   F17 — Duration scalar-continuation `read_f32` cache.
+//         `supertonic_duration_forward_ggml` runs ~30 backend
+//         tensor reads in its scalar continuation (after the
+//         cached graph computes Q/K/V).  Validates that the
+//         `model.scalar_weight_cache` map is populated after the
+//         first synth and reused on the second.
+//
+//   F18 — Text-encoder convnext-front graph cache.
+//         `supertonic_text_encoder_forward_ggml` previously
+//         allocated a fresh `ggml_context` + `gallocr` for the
+//         front-half ConvNeXt graph on every synth.  Validates that
+//         the second synth is bit-exact against the first.
+//
+//   F19 — Vector-estimator front-block graph cache.
+//         `supertonic_vector_trace_proj_ggml` allocated a fresh
+//         ~200-node graph per denoise step (5 alloc/free per
+//         synth on the default schedule).  Validates that two
+//         consecutive `supertonic_vector_step_ggml` calls with
+//         identical inputs are bit-exact (already partially
+//         covered by F8 / F11 tests; this extends with the front
+//         block being the new cached island).
+//
+// Registered with `LABEL "fixture"` — needs the Supertonic GGUF.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <random>
+#include <string>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+std::vector<float> make_synthetic(int n, uint32_t seed) {
+    std::vector<float> out((size_t) n);
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    for (auto & v : out) v = dist(rng);
+    return out;
+}
+
+// F17 — Duration scalar weight cache.
+//
+// Contract:
+//   - After the first `supertonic_duration_forward_ggml` call,
+//     `model.scalar_weight_cache` contains at least one rostered
+//     entry (the relpos K/V embeddings + conv_o weight/bias are
+//     the audit's hot list).
+//   - A second call with the same input produces bit-exactly the
+//     same duration scalar (the cache must not corrupt values).
+//   - Cache size does NOT grow on the second call (every entry
+//     was a cache hit).
+void test_f17_duration_scalar_weight_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F17 duration scalar weight cache]\n");
+
+    if (model.voices.empty()) {
+        std::fprintf(stderr, "  SKIP: no voices in model\n");
+        return;
+    }
+    const auto & voice = model.voices.begin()->second;
+    std::vector<float> style_dp((size_t) ggml_nelements(voice.dp));
+    ggml_backend_tensor_get(voice.dp, style_dp.data(), 0, ggml_nbytes(voice.dp));
+
+    std::vector<int64_t> text_ids;
+    for (int i = 1; i <= 16; ++i) text_ids.push_back(i);
+
+    std::string err;
+    float dur1 = 0.0f;
+    const size_t cache_before = model.scalar_weight_cache.size();
+    if (!supertonic_duration_forward_ggml(model, text_ids.data(),
+                                           (int) text_ids.size(),
+                                           style_dp.data(), dur1, &err)) {
+        std::fprintf(stderr, "  SKIP duration call 1: %s\n", err.c_str());
+        return;
+    }
+    const size_t cache_after_one = model.scalar_weight_cache.size();
+    std::fprintf(stderr, "  cache size: before=%zu  after-1=%zu\n",
+                 cache_before, cache_after_one);
+    CHECK(cache_after_one > cache_before);
+
+    // Specific rostered entries we expect (matches the call sites
+    // that `cached_read_f32` replaced).  Not every GGUF carries
+    // every key, so we accept >= 4 of the 6 spot-checks.
+    static const char * const kRostered[] = {
+        "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_k",
+        "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_v",
+        "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_o.weight",
+        "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.conv_o.bias",
+        "duration:tts.dp.sentence_encoder.proj_out.net.weight",
+        "duration:tts.dp.sentence_encoder.attn_encoder.norm_layers_1.0.norm.weight",
+    };
+    int hits = 0;
+    for (const char * key : kRostered) {
+        if (model.scalar_weight_cache.find(key) != model.scalar_weight_cache.end()) {
+            ++hits;
+        }
+    }
+    std::fprintf(stderr, "  spot-check rostered entries: %d / %zu present\n",
+                 hits, sizeof(kRostered) / sizeof(kRostered[0]));
+    CHECK(hits >= 4);
+
+    // Second call must NOT grow the cache (every entry is a hit).
+    float dur2 = 0.0f;
+    if (!supertonic_duration_forward_ggml(model, text_ids.data(),
+                                           (int) text_ids.size(),
+                                           style_dp.data(), dur2, &err)) {
+        std::fprintf(stderr, "  SKIP duration call 2: %s\n", err.c_str());
+        return;
+    }
+    const size_t cache_after_two = model.scalar_weight_cache.size();
+    CHECK(cache_after_two == cache_after_one);
+    std::fprintf(stderr, "  cache size: after-2=%zu (must == after-1)\n", cache_after_two);
+
+    // Bit-exact duration across the two calls.
+    CHECK(dur1 == dur2);
+    std::fprintf(stderr, "  dur1=%.6g  dur2=%.6g\n", dur1, dur2);
+}
+
+// F18 — Text-encoder convnext-front graph cache.
+//
+// Contract: two consecutive `supertonic_text_encoder_forward_ggml`
+// calls with identical inputs produce bit-exact identical output
+// vectors.  The first call builds the cached graph; the second
+// reuses it.  If cache state leaks across calls (e.g. the allocator
+// re-aliases an input tensor's buffer with an intermediate's), this
+// test trips.
+void test_f18_text_encoder_convnext_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F18 text-encoder convnext-front graph cache]\n");
+
+    if (model.voices.empty()) {
+        std::fprintf(stderr, "  SKIP: no voices in model\n");
+        return;
+    }
+    const auto & voice = model.voices.begin()->second;
+    std::vector<float> style_ttl((size_t) ggml_nelements(voice.ttl));
+    ggml_backend_tensor_get(voice.ttl, style_ttl.data(), 0, ggml_nbytes(voice.ttl));
+
+    std::vector<int64_t> text_ids;
+    for (int i = 1; i <= 24; ++i) text_ids.push_back(i);
+
+    std::string err;
+    std::vector<float> emb1, emb2;
+    if (!supertonic_text_encoder_forward_ggml(model, text_ids.data(),
+                                               (int) text_ids.size(),
+                                               style_ttl.data(), emb1, &err)) {
+        std::fprintf(stderr, "  SKIP call 1: %s\n", err.c_str());
+        return;
+    }
+    if (!supertonic_text_encoder_forward_ggml(model, text_ids.data(),
+                                               (int) text_ids.size(),
+                                               style_ttl.data(), emb2, &err)) {
+        std::fprintf(stderr, "  SKIP call 2: %s\n", err.c_str());
+        return;
+    }
+
+    CHECK(!emb1.empty() && emb1.size() == emb2.size());
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < emb1.size() && i < emb2.size(); ++i) {
+        const float d = std::fabs(emb1[i] - emb2[i]);
+        if (d > 0.0f) ++bad;
+        max_abs = std::max(max_abs, d);
+    }
+    std::fprintf(stderr,
+                 "  emb.size=%zu  max_abs_diff=%.3e  bad=%d (must be 0)\n",
+                 emb1.size(), max_abs, bad);
+    CHECK(bad == 0);
+}
+
+// F19 — Vector-estimator front-block graph cache.
+//
+// Contract: same as F18.  `supertonic_vector_step_ggml` invokes
+// `supertonic_vector_trace_proj_ggml` internally, which has the
+// front-block graph.  Two consecutive calls with identical inputs
+// must yield bit-exact identical outputs.  Builds on the F8 / F11
+// tests with the new front-block cache as the additional gate.
+void test_f19_vector_front_block_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F19 vector-estimator front-block cache]\n");
+
+    if (model.voices.empty()) {
+        std::fprintf(stderr, "  SKIP: no voices in model\n");
+        return;
+    }
+    const auto & voice = model.voices.begin()->second;
+    std::vector<float> style_ttl((size_t) ggml_nelements(voice.ttl));
+    ggml_backend_tensor_get(voice.ttl, style_ttl.data(), 0, ggml_nbytes(voice.ttl));
+
+    const int text_len   = 24;
+    const int latent_len = 12;
+    const int Cin        = model.hparams.latent_channels;
+
+    auto latent     = make_synthetic(Cin * latent_len, 0xF00D);
+    auto text_emb   = make_synthetic(256 * text_len,   0xBEEF);
+    std::vector<float> latent_mask((size_t) latent_len, 1.0f);
+
+    std::string err;
+    std::vector<float> next1, next2;
+    if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
+                                     text_emb.data(), text_len,
+                                     style_ttl.data(), latent_mask.data(),
+                                     /*current_step=*/0, /*total_steps=*/5,
+                                     next1, &err)) {
+        std::fprintf(stderr, "  SKIP step 1: %s\n", err.c_str());
+        return;
+    }
+    if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
+                                     text_emb.data(), text_len,
+                                     style_ttl.data(), latent_mask.data(),
+                                     /*current_step=*/0, /*total_steps=*/5,
+                                     next2, &err)) {
+        std::fprintf(stderr, "  SKIP step 2: %s\n", err.c_str());
+        return;
+    }
+    CHECK(!next1.empty() && next1.size() == next2.size());
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < next1.size() && i < next2.size(); ++i) {
+        const float d = std::fabs(next1[i] - next2[i]);
+        if (d > 0.0f) ++bad;
+        max_abs = std::max(max_abs, d);
+    }
+    std::fprintf(stderr,
+                 "  next.size=%zu  max_abs_diff=%.3e  bad=%d (must be 0)\n",
+                 next1.size(), max_abs, bad);
+    CHECK(bad == 0);
+}
+
+} // namespace
+
+int main(int argc, char ** argv) {
+    if (argc < 2) {
+        std::fprintf(stderr, "usage: %s MODEL.gguf\n", argv[0]);
+        return 2;
+    }
+    supertonic_model model;
+    if (!load_supertonic_gguf(argv[1], model)) {
+        std::fprintf(stderr, "failed to load model: %s\n", argv[1]);
+        return 1;
+    }
+
+    test_f17_duration_scalar_weight_cache(model);
+    test_f18_text_encoder_convnext_cache(model);
+    test_f19_vector_front_block_cache(model);
+
+    free_supertonic_model(model);
+
+    std::fprintf(stderr,
+                 "test_supertonic_audit3_caches: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_backend_dispatch.cpp b/tts-cpp/test/test_supertonic_backend_dispatch.cpp
new file mode 100644
index 00000000000..c80b926ae3c
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_backend_dispatch.cpp
@@ -0,0 +1,186 @@
+// Unit tests for the OpenCL bring-up dispatch helpers landed in
+// QVAC-18607: `supertonic_op_dispatch_scope`, the thread-local
+// `supertonic_use_cpu_custom_ops()` / `supertonic_use_f16_attn()`
+// queries, and the `supertonic_model::backend_is_cpu`
+// + `supertonic_model::use_f16_attn` fields they mirror.
+//
+// No GGUF / model file required — every test instantiates a bare
+// `supertonic_model` POD on the stack with the two relevant flags set
+// by hand, opens an RAII scope around it, and re-asserts the
+// thread-local query state matches what the scope was constructed
+// with.  This is what every public `supertonic_*_forward_ggml` /
+// `*_trace_ggml` entry point does, so a regression here would mean a
+// regression in the *real* dispatch path.
+//
+// Registered with `LABEL "unit"` in CMakeLists.txt so a fresh
+// checkout's `ctest` exercises this without needing any fixture.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <stdexcept>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Test 1 — Default thread-local state.
+//
+// Every thread enters with CPU custom ops enabled (the historical
+// CPU-only Supertonic path keeps working unchanged) and F16 K/V
+// attention disabled (the CPU CBLAS attention path is the cheaper
+// choice on a CPU backend, so the auto-policy lands here).
+void test_default_flags() {
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_f16_attn() == false);
+}
+
+// Test 2 — Scope mirrors a CPU model.
+//
+// A CPU-backend model toggles nothing: defaults already match.
+// The point of this test is to catch a "scope leaked the wrong
+// previous-value back into the thread-local on dtor" regression by
+// also asserting the default state after teardown.
+void test_scope_mirrors_cpu_model() {
+    supertonic_model model;
+    model.backend_is_cpu = true;
+    model.use_f16_attn   = false;
+    {
+        supertonic_op_dispatch_scope scope(model);
+        CHECK(supertonic_use_cpu_custom_ops() == true);
+        CHECK(supertonic_use_f16_attn() == false);
+    }
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_f16_attn() == false);
+}
+
+// Test 3 — Scope mirrors a GPU model + restores defaults after.
+//
+// A GPU-backend engine (OpenCL / CUDA / Metal / Vulkan) sets both
+// flags via the dispatch scope; the cblas-backed `ggml_custom_4d`
+// fast paths in the vocoder + vector estimator must see `false`
+// inside the scope, then `true` again after teardown so a
+// CPU-only second engine in the same thread isn't poisoned.
+void test_scope_mirrors_gpu_model() {
+    supertonic_model model;
+    model.backend_is_cpu = false;
+    model.use_f16_attn   = true;
+    {
+        supertonic_op_dispatch_scope scope(model);
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_f16_attn() == true);
+    }
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_f16_attn() == false);
+}
+
+// Test 4 — RAII teardown on exception.
+//
+// The forward functions wrap the rest of their body in try / catch;
+// if the body throws (e.g. invalid voice, GGML buffer alloc failure),
+// the scope must still restore the previous flags so the next
+// engine's call sees a clean slate.
+void test_scope_unwinds_on_exception() {
+    supertonic_model model;
+    model.backend_is_cpu = false;
+    model.use_f16_attn   = true;
+    bool caught = false;
+    try {
+        supertonic_op_dispatch_scope scope(model);
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_f16_attn() == true);
+        throw std::runtime_error("simulated forward failure");
+    } catch (const std::runtime_error &) {
+        caught = true;
+    }
+    CHECK(caught);
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_f16_attn() == false);
+}
+
+// Test 5 — Nested scopes stack and unwind correctly.
+//
+// This is the harness for the "host destroyed engine_a then
+// immediately invoked synthesize on engine_b on the same thread"
+// path the alive-id registry already covers for gallocr free.
+// Here we verify the dispatch flags don't get crossed during the
+// brief window where both scopes exist (e.g. one forward function
+// calling another's helper synchronously).
+void test_nested_scopes() {
+    supertonic_model gpu_model;
+    gpu_model.backend_is_cpu = false;
+    gpu_model.use_f16_attn   = true;
+
+    supertonic_model cpu_model;
+    cpu_model.backend_is_cpu = true;
+    cpu_model.use_f16_attn   = false;
+
+    {
+        supertonic_op_dispatch_scope outer(gpu_model);
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_f16_attn() == true);
+        {
+            supertonic_op_dispatch_scope inner(cpu_model);
+            CHECK(supertonic_use_cpu_custom_ops() == true);
+            CHECK(supertonic_use_f16_attn() == false);
+        }
+        // After inner unwinds, outer's state restored.
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_f16_attn() == true);
+    }
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_f16_attn() == false);
+}
+
+// Test 6 — Independent flags.
+//
+// `use_f16_attn = true` on a CPU model is a valid configuration
+// (the user can `--f16-attn 1` even on CPU for parity testing),
+// and `use_f16_attn = false` on a GPU model is the manual opt-out.
+// Make sure the two flags are mirrored independently.
+void test_independent_flags() {
+    supertonic_model m;
+    m.backend_is_cpu = true;
+    m.use_f16_attn   = true;
+    {
+        supertonic_op_dispatch_scope scope(m);
+        CHECK(supertonic_use_cpu_custom_ops() == true);
+        CHECK(supertonic_use_f16_attn() == true);
+    }
+
+    m.backend_is_cpu = false;
+    m.use_f16_attn   = false;
+    {
+        supertonic_op_dispatch_scope scope(m);
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_f16_attn() == false);
+    }
+}
+
+} // namespace
+
+int main() {
+    test_default_flags();
+    test_scope_mirrors_cpu_model();
+    test_scope_mirrors_gpu_model();
+    test_scope_unwinds_on_exception();
+    test_nested_scopes();
+    test_independent_flags();
+
+    std::fprintf(stderr,
+                 "test_supertonic_backend_dispatch: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_capability_cache.cpp b/tts-cpp/test/test_supertonic_capability_cache.cpp
new file mode 100644
index 00000000000..3d518a2fc31
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_capability_cache.cpp
@@ -0,0 +1,424 @@
+// QVAC-18605 follow-up — CPU-only unit test for the process-wide
+// backend-capability probe cache and the new probes added to it.
+//
+// Three optimizations are exercised here:
+//
+//   1. `cached_backend_capabilities(backend)` — process-wide cache of
+//      the LEAKY_RELU + F16-K/V flash-attn + F16 mul_mat + Q8_0 K/V
+//      flash-attn supports_op probes.  Engine + bench + load all hit
+//      the cache instead of re-probing the same backend 2-3 times.
+//
+//   2. `supertonic_backend_supports_f16_mul_mat` — symmetric to the
+//      F16-K/V probe.  Gates the `use_f16_weights` auto-policy in
+//      `load_supertonic_gguf` so a partial-port backend that ships
+//      F16 storage but rejects F16 mul_mat for the hot vector-
+//      estimator attention shape stays on the F32 weight path
+//      instead of crashing at first synth call.
+//
+//   3. `supertonic_backend_supports_q8_0_kv_flash_attn` — forward-
+//      compat probe for an opt-in Q8_0 K/V dispatch (cuts K/V
+//      upload bandwidth ~2× on memory-bandwidth-bound mobile GPUs).
+//      The dispatch isn't yet wired but the probe primes the cache
+//      so a follow-up patch can flip it without re-querying.
+//
+// Cache contract verified:
+//   - Cold call advances the probe-call counter by exactly 1.
+//   - Subsequent calls on the same backend handle don't advance
+//     the counter (cache short-circuit).
+//   - `supertonic_clear_capability_cache()` lets the next call
+//     advance the counter again (test seam works).
+//   - All three public forwarders return the same boolean across
+//     repeated calls (idempotency).
+//   - `nullptr` backend returns `false` from every forwarder.
+//
+// Probe-result correctness:
+//   - On the GGML CPU backend: native LEAKY_RELU is true (CPU has
+//     the fused builtin), F16 mul_mat is true (CPU's matmul kernel
+//     accepts mixed F16/F32 inputs).  F16-K/V and Q8_0 K/V flash-
+//     attn results depend on whether the CPU backend was built
+//     with the flash-attn kernel; we don't pin those values here
+//     (the smoke test in test_supertonic_vulkan_dispatch.cpp
+//     already covers the F16-K/V branch).
+//
+// No GGUF / model file required.  Registered with `LABEL "unit"`
+// in CMakeLists.txt so a fresh checkout's `ctest` exercises this
+// without any fixture.
+
+#include "supertonic_internal.h"
+
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include <cstdio>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Test 1 — Null-backend safety.
+//
+// Every public forwarder must return `false` for a null
+// backend handle (the engine + bench paths normally never pass
+// null, but the test harness exercises this defensively).
+void test_null_backend_returns_false() {
+    supertonic_clear_capability_cache();
+    CHECK(supertonic_backend_supports_f16_kv_flash_attn(nullptr)   == false);
+    CHECK(supertonic_backend_supports_f16_mul_mat(nullptr)         == false);
+    CHECK(supertonic_backend_supports_q8_0_kv_flash_attn(nullptr)  == false);
+    // Round 3 — BF16 K/V probe must also handle null defensively.
+    CHECK(supertonic_backend_supports_bf16_kv_flash_attn(nullptr)  == false);
+    // Round 3 — pinned-host-buffer probe must also handle null
+    // defensively (it is false for every non-Vulkan backend, so
+    // it must be false for null as well).
+    CHECK(supertonic_backend_supports_pinned_host_buffer(nullptr)  == false);
+}
+
+// Test 2 — Cache short-circuits on a hit.
+//
+// First call advances the probe-call counter by exactly 1
+// (cold cache).  Five subsequent calls in any order on the same
+// backend handle don't advance the counter (cache hits).
+//
+// The counter counts uncached probe-set executions, not calls to
+// the public forwarders, so the test asserts on the counter delta
+// across each call set rather than on its absolute value (other
+// tests in this TU may already have advanced the counter through
+// the shared process-wide cache).
+void test_cache_short_circuits_on_hit() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+
+    supertonic_clear_capability_cache();
+    const uint64_t cold_before = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    const uint64_t cold_after = supertonic_capability_probe_call_count();
+    // Cold call must run the uncached probe set exactly once.
+    CHECK(cold_after - cold_before == 1);
+
+    const uint64_t warm_before = supertonic_capability_probe_call_count();
+    // Five mixed calls on the same backend handle.  Order
+    // intentionally varies the public-forwarder triple so the
+    // test catches a regression where one forwarder skips the
+    // cache.
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    (void) supertonic_backend_supports_f16_mul_mat(cpu);
+    (void) supertonic_backend_supports_q8_0_kv_flash_attn(cpu);
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    (void) supertonic_backend_supports_f16_mul_mat(cpu);
+    const uint64_t warm_after = supertonic_capability_probe_call_count();
+    // All five calls hit the cache — counter must NOT advance.
+    CHECK(warm_after == warm_before);
+
+    ggml_backend_free(cpu);
+}
+
+// Test 3 — Cache clear forces a re-probe.
+//
+// After `supertonic_clear_capability_cache()` the next call on
+// the same backend must run the uncached probe set again (the
+// counter advances by exactly 1).  Verifies the test seam works
+// — same plumbing the regression test relies on for repeatable
+// cold-cache assertions.
+void test_clear_cache_forces_reprobe() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+
+    // First, populate the cache.
+    supertonic_clear_capability_cache();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+
+    // Next call must hit the cache.
+    const uint64_t before_clear = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    CHECK(supertonic_capability_probe_call_count() == before_clear);
+
+    // Clear + re-call: counter advances by exactly 1.
+    supertonic_clear_capability_cache();
+    const uint64_t before_reprobe = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    CHECK(supertonic_capability_probe_call_count() == before_reprobe + 1);
+
+    ggml_backend_free(cpu);
+}
+
+// Test 4 — Public forwarders are idempotent.
+//
+// Calling the same forwarder N times on the same backend must
+// return the same boolean every time (no random / state-dependent
+// answer).  Combined with the cache short-circuit test above this
+// gives the engine + bench paths the contract they rely on:
+// "the answer at construction matches the answer at first synth".
+void test_forwarders_idempotent() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+
+    bool a1 = supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    bool a2 = supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    bool a3 = supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    CHECK(a1 == a2);
+    CHECK(a2 == a3);
+
+    bool b1 = supertonic_backend_supports_f16_mul_mat(cpu);
+    bool b2 = supertonic_backend_supports_f16_mul_mat(cpu);
+    bool b3 = supertonic_backend_supports_f16_mul_mat(cpu);
+    CHECK(b1 == b2);
+    CHECK(b2 == b3);
+
+    bool c1 = supertonic_backend_supports_q8_0_kv_flash_attn(cpu);
+    bool c2 = supertonic_backend_supports_q8_0_kv_flash_attn(cpu);
+    bool c3 = supertonic_backend_supports_q8_0_kv_flash_attn(cpu);
+    CHECK(c1 == c2);
+    CHECK(c2 == c3);
+
+    ggml_backend_free(cpu);
+}
+
+// Test 5 — Two backends get independent cache entries.
+//
+// Construct two CPU backends (different handles) and verify that
+// each gets its own cache entry: a cold call on the second
+// backend must advance the probe-call counter even though the
+// first backend's entry is already cached.
+void test_per_backend_cache_independence() {
+    ggml_backend_t cpu_a = ggml_backend_cpu_init();
+    ggml_backend_t cpu_b = ggml_backend_cpu_init();
+    if (!cpu_a || !cpu_b) {
+        std::fprintf(stderr, "skip: dual CPU backend init failed\n");
+        if (cpu_a) ggml_backend_free(cpu_a);
+        if (cpu_b) ggml_backend_free(cpu_b);
+        return;
+    }
+
+    supertonic_clear_capability_cache();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu_a);
+
+    const uint64_t before_b = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu_b);
+    // Different backend handle → separate cache entry → counter
+    // must advance.
+    CHECK(supertonic_capability_probe_call_count() == before_b + 1);
+
+    // Re-querying the first backend still hits its cache entry.
+    const uint64_t before_a = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu_a);
+    CHECK(supertonic_capability_probe_call_count() == before_a);
+
+    ggml_backend_free(cpu_a);
+    ggml_backend_free(cpu_b);
+}
+
+// Test 6 — F16 mul_mat probe returns true for the GGML CPU backend.
+//
+// CPU's matmul kernel handles the (F16 weight, F32 activation)
+// combination via the existing dot-product fallback path.  This
+// is the only backend-specific assertion in this TU; if a future
+// CPU backend revision drops F16 support the test catches it.
+//
+// Probe shape mirrors the live vector-estimator attention W_query
+// matmul: weight=[256, 256] F16, activation=[256, 16] F32.
+void test_f16_mul_mat_probe_returns_true_on_cpu() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+    bool ok = supertonic_backend_supports_f16_mul_mat(cpu);
+    std::fprintf(stderr,
+                 "probe(F16 mul_mat, CPU) = %s\n",
+                 ok ? "true" : "false");
+    CHECK(ok == true);
+    ggml_backend_free(cpu);
+}
+
+// Test 7 — Q8_0 K/V flash-attn probe smoke test.
+//
+// We don't pin the boolean (the CPU backend's flash-attn kernel
+// support for Q8_0 K/V depends on the build configuration), but
+// the probe must run without crashing and return a stable answer
+// across repeated calls.  Mostly a "the probe doesn't tickle a
+// ggml_can_mul_mat assertion" check — Q8_0 has stricter
+// stride / block-size constraints than F16 K/V so a probe-shape
+// regression would surface here.
+void test_q8_0_kv_flash_attn_probe_smoke() {
+    CHECK(supertonic_backend_supports_q8_0_kv_flash_attn(nullptr) == false);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+    bool a = supertonic_backend_supports_q8_0_kv_flash_attn(cpu);
+    bool b = supertonic_backend_supports_q8_0_kv_flash_attn(cpu);
+    CHECK(a == b);
+    std::fprintf(stderr,
+                 "probe(Q8_0-K/V flash-attn, CPU) = %s\n",
+                 a ? "true" : "false");
+    ggml_backend_free(cpu);
+}
+
+// Test 8 — BF16 K/V flash-attn probe smoke test (round 3, TDD).
+//
+// Vulkan's `GGML_OP_FLASH_ATTN_EXT` `supports_op` advertises BF16
+// in the coopmat2 path only (`ggml-vulkan.cpp:GGML_OP_FLASH_ATTN_EXT`
+// case branch around line 15257).  Like the Q8_0 probe, we don't
+// pin the CPU answer (depends on whether ggml-cpu was compiled
+// with BF16 dot-product) — we only verify the probe is callable,
+// stable across repeated calls, and shares the cache slot with
+// the other capability probes.
+//
+// Probe shape mirrors the live vector-estimator attention site,
+// with K/V dtype set to GGML_TYPE_BF16.  Same `kv_len = 16` as
+// the F16 probe (BF16 has the same per-element size as F16, so
+// no stride / block-size adjustment is needed).
+//
+// This test is written FIRST (TDD).  It MUST fail before the
+// `supertonic_backend_supports_bf16_kv_flash_attn` symbol is
+// added.  After implementation, the test must pass without any
+// behaviour change to the existing 7 tests above.
+void test_bf16_kv_flash_attn_probe_smoke() {
+    CHECK(supertonic_backend_supports_bf16_kv_flash_attn(nullptr) == false);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+    bool a = supertonic_backend_supports_bf16_kv_flash_attn(cpu);
+    bool b = supertonic_backend_supports_bf16_kv_flash_attn(cpu);
+    CHECK(a == b);
+    std::fprintf(stderr,
+                 "probe(BF16-K/V flash-attn, CPU) = %s\n",
+                 a ? "true" : "false");
+    ggml_backend_free(cpu);
+}
+
+// Test 9 — BF16 K/V probe shares the cache slot (round 3, TDD).
+//
+// After the cold cache populates via any forwarder, calling the
+// BF16-K/V probe must NOT advance the probe-call counter — the
+// 5th flag must live in the same `backend_capabilities` struct
+// the cache stores per backend handle.  Catches a regression
+// where someone adds the new flag but forgets to populate it
+// inside `cached_backend_capabilities`.
+void test_bf16_kv_probe_shares_cache_slot() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+    // Cold: any forwarder populates the cache.
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+
+    // BF16 K/V probe must hit the cache (counter does not advance).
+    const uint64_t before = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_bf16_kv_flash_attn(cpu);
+    CHECK(supertonic_capability_probe_call_count() == before);
+
+    ggml_backend_free(cpu);
+}
+
+// Test 10 — pinned-host-buffer probe smoke (round 3, TDD).
+//
+// `ggml_backend_vk_host_buffer_type()` returns a host-visible,
+// device-coherent buffer type that lets the CPU fill an input
+// tensor without going through ggml-vulkan's internal staging
+// buffer.  Wiring the actual upload path through that buffer is
+// a follow-up (requires per-engine input-scratchpad refactor);
+// this round only adds the probe so the capability cache is
+// primed.
+//
+// Contract: returns `true` iff the backend is Vulkan AND
+// `ggml_backend_vk_host_buffer_type()` returns non-null (the
+// only failure mode is a Vulkan-disabled build, where the probe
+// returns `false`).  CPU backend → always `false`.
+//
+// Like the BF16 / Q8_0 K/V probe tests, this one verifies the
+// probe is callable and stable across repeated calls.  Unlike
+// them, it also pins the CPU answer to `false` (CPU isn't Vulkan).
+void test_pinned_host_buffer_probe_smoke() {
+    CHECK(supertonic_backend_supports_pinned_host_buffer(nullptr) == false);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+    bool a = supertonic_backend_supports_pinned_host_buffer(cpu);
+    bool b = supertonic_backend_supports_pinned_host_buffer(cpu);
+    CHECK(a == b);
+    // CPU is never Vulkan — pin the answer for CPU.
+    CHECK(a == false);
+    std::fprintf(stderr,
+                 "probe(pinned-host-buffer, CPU) = %s\n",
+                 a ? "true" : "false");
+    ggml_backend_free(cpu);
+}
+
+// Test 11 — pinned-host-buffer probe shares the cache slot (TDD).
+//
+// 6th flag — must hit the cache after cold-populate.  Same
+// regression-catch contract as test 9.
+void test_pinned_host_buffer_probe_shares_cache_slot() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    supertonic_clear_capability_cache();
+    // Cold: any forwarder populates the cache.
+    (void) supertonic_backend_supports_f16_kv_flash_attn(cpu);
+
+    const uint64_t before = supertonic_capability_probe_call_count();
+    (void) supertonic_backend_supports_pinned_host_buffer(cpu);
+    CHECK(supertonic_capability_probe_call_count() == before);
+
+    ggml_backend_free(cpu);
+}
+
+} // namespace
+
+int main() {
+    test_null_backend_returns_false();
+    test_cache_short_circuits_on_hit();
+    test_clear_cache_forces_reprobe();
+    test_forwarders_idempotent();
+    test_per_backend_cache_independence();
+    test_f16_mul_mat_probe_returns_true_on_cpu();
+    test_q8_0_kv_flash_attn_probe_smoke();
+    test_bf16_kv_flash_attn_probe_smoke();
+    test_bf16_kv_probe_shares_cache_slot();
+    test_pinned_host_buffer_probe_smoke();
+    test_pinned_host_buffer_probe_shares_cache_slot();
+
+    std::fprintf(stderr,
+                 "test_supertonic_capability_cache: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
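
Tests 9 and 11 above depend on every capability flag sharing one cached probe result per backend. A minimal sketch of that pattern follows. `Caps`, `CapabilityCache`, and the placeholder `probe` are hypothetical names chosen for illustration; the project's real cache is not shown in this diff and may be structured differently.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the real per-backend capability flags.
struct Caps {
    bool f16_kv_flash_attn  = false;
    bool pinned_host_buffer = false;
};

// Hypothetical cache: one slot per backend handle, filled by a single
// combined probe so that every flag lookup shares the same slot.
class CapabilityCache {
public:
    // Returns the cached flags, probing only on the first lookup.
    const Caps & get(const void * backend) {
        auto it = slots_.find(backend);
        if (it == slots_.end()) {
            ++probe_calls_;
            it = slots_.emplace(backend, probe(backend)).first;
        }
        return it->second;
    }

    // Drops every slot so the next lookup probes again.
    void clear() { slots_.clear(); }

    // Number of cold probes performed so far.
    std::uint64_t probe_calls() const { return probe_calls_; }

private:
    // Placeholder probe; a real one would query the backend.
    static Caps probe(const void *) { return Caps{}; }

    std::unordered_map<const void *, Caps> slots_;
    std::uint64_t probe_calls_ = 0;
};
```

With this shape, reading a second flag after any first lookup leaves the probe counter unchanged, which is the regression the shared-slot tests are meant to catch.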
diff --git a/tts-cpp/test/test_supertonic_convnext_block_fused.cpp b/tts-cpp/test/test_supertonic_convnext_block_fused.cpp
new file mode 100644
index 00000000000..b706b9a4519
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_convnext_block_fused.cpp
@@ -0,0 +1,393 @@
+// TDD harness for audit follow-up #6 (F7) — fused ConvNeXt block
+// builder for the Supertonic vocoder.
+//
+// Background
+// ----------
+// The current `convnext_block_ggml` (private to
+// `src/supertonic_vocoder.cpp`) wraps `layer_norm_channel_ggml`
+// around a pair of `conv1d_causal_ggml` calls.  Each LN call costs
+// two `ggml_cont` materialisations (permute → cont [C, T0] →
+// norm/mul/add → permute → cont [T0, C]) and each `K=1` pointwise
+// conv pays an `im2col` copy on top.  For the 10 ConvNeXt blocks
+// in the vocoder this adds up to ~16.8 MiB of redundant copy
+// traffic per synth on a discrete GPU (audit finding F7).
+//
+// `convnext_block_fused_ggml` cuts that traffic in half by:
+//
+//   1. Keeping the layer-norm output in `[C, T0]` (channel-major)
+//      layout — i.e. skipping the back-permute / back-cont pair.
+//   2. Lowering the `K=1` pointwise convs to direct
+//      `ggml_mul_mat(w_2d, x_perm)` against the LN-output's
+//      `[C, T0]` layout, eliminating both `im2col` copies.
+//   3. Re-permuting once at the very end so the block output is
+//      `[T0, C]` (time-major) for the next block / final norm.
+//
+// Net per block:
+//   - Conts: 2 → 2 (LN front + final permute-back).  Same count.
+//   - `im2col` copies: 2 → 0.  **Saves 2 [T0, C] copies per block.**
+//   - Same arithmetic as the (depthwise → LN → pw1 → gelu → pw2 →
+//     γ → residual) reference, matching within `~1e-5` (mul_mat
+//     summation order is unchanged; only the layout of intermediate
+//     tensors moves).
+//
+// Test contract
+// -------------
+// Constructs a synthetic ConvNeXt-block input + weights with small
+// random F32 values (no GGUF required) and checks the GGML
+// `convnext_block_fused_ggml` output against a scalar reference
+// of the same per-block math on the CPU backend.
+//
+// Shapes are deliberately tiny so the unit test stays in the
+// single-millisecond range (T0=8, C=4, hidden=8).  A scaled-up
+// shape (T0=40, C=16, hidden=64) runs with a slightly looser
+// tolerance; the vocoder-size shape is left to the trace test.
+//
+// Registered with `LABEL "unit"` — no GGUF required, no model
+// state.  Mirrors the test_supertonic_rope_packed_qk.cpp harness.
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <random>
+#include <stdexcept>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// -----------------------------------------------------------------
+// Scalar reference for the ConvNeXt block math.
+//
+// All buffers are CPU-native time-major layout: `x[t*C + c]`.
+// -----------------------------------------------------------------
+
+void scalar_depthwise_causal(const std::vector<float> & x, int L, int C,
+                             const std::vector<float> & w,
+                             const std::vector<float> & b,
+                             int K, int dilation,
+                             std::vector<float> & y) {
+    y.assign((size_t) L * C, 0.0f);
+    const int pad_left = (K - 1) * dilation;
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < C; ++c) {
+            float sum = b[c];
+            for (int k = 0; k < K; ++k) {
+                int src_t = t + k * dilation - pad_left;
+                if (src_t < 0) src_t = 0;
+                sum += w[(size_t) c * K + k] * x[(size_t) src_t * C + c];
+            }
+            y[(size_t) t * C + c] = sum;
+        }
+    }
+}
+
+void scalar_layer_norm_channel(std::vector<float> & x, int L, int C,
+                               const std::vector<float> & g,
+                               const std::vector<float> & b,
+                               float eps = 1e-6f) {
+    for (int t = 0; t < L; ++t) {
+        float mean = 0.0f;
+        for (int c = 0; c < C; ++c) mean += x[(size_t) t * C + c];
+        mean /= (float) C;
+        float var = 0.0f;
+        for (int c = 0; c < C; ++c) {
+            float d = x[(size_t) t * C + c] - mean;
+            var += d * d;
+        }
+        float inv = 1.0f / std::sqrt(var / (float) C + eps);
+        for (int c = 0; c < C; ++c) {
+            float v = (x[(size_t) t * C + c] - mean) * inv;
+            x[(size_t) t * C + c] = v * g[c] + b[c];
+        }
+    }
+}
+
+void scalar_linear_1x1(const std::vector<float> & x, int L, int IC,
+                       const std::vector<float> & w,
+                       const std::vector<float> * bias,
+                       int OC,
+                       std::vector<float> & y) {
+    y.assign((size_t) L * OC, 0.0f);
+    for (int t = 0; t < L; ++t) {
+        for (int oc = 0; oc < OC; ++oc) {
+            float sum = bias ? (*bias)[oc] : 0.0f;
+            const size_t woff = (size_t) oc * IC;
+            for (int ic = 0; ic < IC; ++ic) {
+                sum += w[woff + ic] * x[(size_t) t * IC + ic];
+            }
+            y[(size_t) t * OC + oc] = sum;
+        }
+    }
+}
+
+float gelu_erf_scalar(float x) {
+    // erf-based GELU matches ggml_gelu_erf.
+    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
+}
+
+void scalar_convnext_block(const std::vector<float> & x_in,
+                           int L, int C, int hidden,
+                           int K, int dilation,
+                           const std::vector<float> & dw_w,
+                           const std::vector<float> & dw_b,
+                           const std::vector<float> & ln_g,
+                           const std::vector<float> & ln_b,
+                           const std::vector<float> & pw1_w,
+                           const std::vector<float> * pw1_b,
+                           const std::vector<float> & pw2_w,
+                           const std::vector<float> * pw2_b,
+                           const std::vector<float> & gamma,
+                           std::vector<float> & y_out) {
+    std::vector<float> dw;
+    scalar_depthwise_causal(x_in, L, C, dw_w, dw_b, K, dilation, dw);
+
+    std::vector<float> ln = dw;
+    scalar_layer_norm_channel(ln, L, C, ln_g, ln_b);
+
+    std::vector<float> pw1;
+    scalar_linear_1x1(ln, L, C, pw1_w, pw1_b, hidden, pw1);
+    for (float & v : pw1) v = gelu_erf_scalar(v);
+
+    std::vector<float> pw2;
+    scalar_linear_1x1(pw1, L, hidden, pw2_w, pw2_b, C, pw2);
+
+    y_out.assign((size_t) L * C, 0.0f);
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < C; ++c) {
+            y_out[(size_t) t * C + c] =
+                x_in[(size_t) t * C + c] +
+                gamma[c] * pw2[(size_t) t * C + c];
+        }
+    }
+}
+
+// -----------------------------------------------------------------
+// Layout helpers.  CPU-native `x[t*C + c]` ↔ GGML's `ne=[L, C]`
+// column-major memory `x[c*L + t]`.
+// -----------------------------------------------------------------
+
+void pack_lc_to_col_major(const std::vector<float> & x_lc, int L, int C,
+                          std::vector<float> & out) {
+    out.assign((size_t) L * C, 0.0f);
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < C; ++c) {
+            out[(size_t) c * L + t] = x_lc[(size_t) t * C + c];
+        }
+    }
+}
+
+void unpack_col_major_to_lc(const std::vector<float> & x_col, int L, int C,
+                            std::vector<float> & out) {
+    out.assign((size_t) L * C, 0.0f);
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < C; ++c) {
+            out[(size_t) t * C + c] = x_col[(size_t) c * L + t];
+        }
+    }
+}
+
+// -----------------------------------------------------------------
+// Test harness — runs `convnext_block_fused_ggml` on a CPU backend
+// and compares against the scalar reference above.
+// -----------------------------------------------------------------
+
+void test_convnext_block_fused(const char * label,
+                               int L, int C, int hidden,
+                               int K, int dilation,
+                               unsigned seed,
+                               float atol) {
+    std::fprintf(stderr,
+                 "[convnext_block_fused: %s] L=%d C=%d hidden=%d K=%d dilation=%d\n",
+                 label, L, C, hidden, K, dilation);
+
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 0.5f);
+    std::normal_distribution<float> bias_dist(0.0f, 0.1f);
+    std::normal_distribution<float> gamma_dist(1.0f, 0.05f);
+
+    auto fill = [&](std::vector<float> & v, std::normal_distribution<float> & d) {
+        for (auto & x : v) x = d(rng);
+    };
+
+    std::vector<float> x_lc((size_t) L * C);
+    fill(x_lc, dist);
+    std::vector<float> dw_w((size_t) C * K);
+    fill(dw_w, dist);
+    std::vector<float> dw_b((size_t) C);
+    fill(dw_b, bias_dist);
+    std::vector<float> ln_g((size_t) C);
+    fill(ln_g, gamma_dist);
+    std::vector<float> ln_b((size_t) C);
+    fill(ln_b, bias_dist);
+    std::vector<float> pw1_w((size_t) hidden * C);
+    fill(pw1_w, dist);
+    std::vector<float> pw1_b((size_t) hidden);
+    fill(pw1_b, bias_dist);
+    std::vector<float> pw2_w((size_t) C * hidden);
+    fill(pw2_w, dist);
+    std::vector<float> pw2_b((size_t) C);
+    fill(pw2_b, bias_dist);
+    std::vector<float> gamma((size_t) C);
+    fill(gamma, gamma_dist);
+
+    std::vector<float> ref;
+    scalar_convnext_block(x_lc, L, C, hidden, K, dilation,
+                          dw_w, dw_b, ln_g, ln_b,
+                          pw1_w, &pw1_b, pw2_w, &pw2_b, gamma,
+                          ref);
+
+    // The depthwise step is upstream of the fused helper — compute
+    // it scalar-side here and pre-load the result as `dw_out` so the
+    // helper's scope stays at the LN + pw1 + gelu + pw2 + γ + residual
+    // segment that F7 targets.
+    std::vector<float> dw_lc;
+    scalar_depthwise_causal(x_lc, L, C, dw_w, dw_b, K, dilation, dw_lc);
+
+    constexpr int MAX_NODES = 1024;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead_custom(MAX_NODES, false);
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
+
+    ggml_tensor * residual_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, C);
+    ggml_set_name(residual_in, "residual_in"); ggml_set_input(residual_in);
+    ggml_tensor * dw_out_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, C);
+    ggml_set_name(dw_out_in, "dw_out_in"); ggml_set_input(dw_out_in);
+    ggml_tensor * ln_g_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, C);
+    ggml_set_name(ln_g_in, "ln_g_in"); ggml_set_input(ln_g_in);
+    ggml_tensor * ln_b_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, C);
+    ggml_set_name(ln_b_in, "ln_b_in"); ggml_set_input(ln_b_in);
+    // pw1_w GGML shape: ne=[K=1, IC=C, OC=hidden].
+    ggml_tensor * pw1_w_in = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 1, C, hidden);
+    ggml_set_name(pw1_w_in, "pw1_w_in"); ggml_set_input(pw1_w_in);
+    ggml_tensor * pw1_b_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, hidden);
+    ggml_set_name(pw1_b_in, "pw1_b_in"); ggml_set_input(pw1_b_in);
+    ggml_tensor * pw2_w_in = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 1, hidden, C);
+    ggml_set_name(pw2_w_in, "pw2_w_in"); ggml_set_input(pw2_w_in);
+    ggml_tensor * pw2_b_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, C);
+    ggml_set_name(pw2_b_in, "pw2_b_in"); ggml_set_input(pw2_b_in);
+    ggml_tensor * gamma_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, C);
+    ggml_set_name(gamma_in, "gamma_in"); ggml_set_input(gamma_in);
+
+    ggml_tensor * y = convnext_block_fused_ggml(
+        ctx,
+        residual_in,
+        dw_out_in,
+        ln_g_in, ln_b_in,
+        pw1_w_in, pw1_b_in,
+        pw2_w_in, pw2_b_in,
+        gamma_in);
+    ggml_set_name(y, "y"); ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "  SKIP: ggml_backend_cpu_init failed\n");
+        ggml_free(ctx);
+        return;
+    }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    if (!ggml_gallocr_reserve(allocr, gf)) {
+        std::fprintf(stderr, "  SKIP: gallocr_reserve failed\n");
+        ggml_gallocr_free(allocr);
+        ggml_free(ctx);
+        ggml_backend_free(cpu);
+        return;
+    }
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    auto upload_2d = [&](ggml_tensor * t, const std::vector<float> & host_lc,
+                         int LL, int CC) {
+        std::vector<float> col;
+        pack_lc_to_col_major(host_lc, LL, CC, col);
+        ggml_backend_tensor_set(t, col.data(), 0, col.size() * sizeof(float));
+    };
+    upload_2d(residual_in, x_lc, L, C);
+    upload_2d(dw_out_in, dw_lc, L, C);
+    ggml_backend_tensor_set(ln_g_in, ln_g.data(), 0, ln_g.size() * sizeof(float));
+    ggml_backend_tensor_set(ln_b_in, ln_b.data(), 0, ln_b.size() * sizeof(float));
+    // pw1_w GGUF native memory: row-major [OC, IC] when reshaped to 2D.
+    // GGML stores element (k=0, ic, oc) at memory `0 + ic*1 + oc*(1*IC)` =
+    // `ic + oc*IC`.  Our host buffer is `pw1_w[oc*IC + ic]` which matches.
+    ggml_backend_tensor_set(pw1_w_in, pw1_w.data(), 0, pw1_w.size() * sizeof(float));
+    ggml_backend_tensor_set(pw1_b_in, pw1_b.data(), 0, pw1_b.size() * sizeof(float));
+    ggml_backend_tensor_set(pw2_w_in, pw2_w.data(), 0, pw2_w.size() * sizeof(float));
+    ggml_backend_tensor_set(pw2_b_in, pw2_b.data(), 0, pw2_b.size() * sizeof(float));
+    ggml_backend_tensor_set(gamma_in, gamma.data(), 0, gamma.size() * sizeof(float));
+
+    CHECK(ggml_backend_graph_compute(cpu, gf) == GGML_STATUS_SUCCESS);
+
+    std::vector<float> got_col((size_t) L * C);
+    ggml_backend_tensor_get(y, got_col.data(), 0, got_col.size() * sizeof(float));
+    std::vector<float> got;
+    unpack_col_major_to_lc(got_col, L, C, got);
+
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    CHECK(got.size() == ref.size());
+
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > atol) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  mismatch @ %zu: ref=%.6g got=%.6g abs=%.3e\n",
+                             i, ref[i], got[i], d);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  max_abs_err=%.3e  bad=%d / %zu  atol=%.0e\n",
+                 max_abs, bad, ref.size(), atol);
+    CHECK(bad == 0);
+}
+
+} // namespace
+
+int main() {
+    // Tiny synthetic shape — runs in microseconds, sanity-checks
+    // the fused chain end-to-end.
+    test_convnext_block_fused("tiny K=3 dilation=1", 8, 4, 8, 3, 1, 0x73B1, 1e-4f);
+    // Dilation > 1 mirrors the vocoder's `dilations[1..2]={2,4}` taps.
+    test_convnext_block_fused("tiny K=7 dilation=2", 12, 4, 8, 7, 2, 0xC0DE, 1e-4f);
+    // Scaled-up stand-in for the vocoder-realistic shape (T0=420,
+    // C=512, hidden=1536), held to `5e-4` for a single block (the
+    // trace harness accepts a `1e-2` band over all 10 blocks).
+    // The smaller shape keeps the unit test under the 1ms wall
+    // budget; the full T0=420 case is exercised by the existing
+    // `test_supertonic_vocoder_trace` fixture once the production
+    // `convnext_block_ggml` is rewired to this helper.
+    test_convnext_block_fused("scale-up K=7 dilation=4", 40, 16, 64, 7, 4, 0xBEEF, 5e-4f);
+
+    std::fprintf(stderr,
+                 "test_supertonic_convnext_block_fused: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
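
The fused helper's central claim is that a `K=1` pointwise conv is just a matrix product, so it can be evaluated directly in whichever layout the layer-norm output already has. The sketch below checks that claim in plain C++ without ggml. The function names `pointwise_conv`, `matmul_transposed_layout`, and `transpose_tc` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// K=1 "convolution" on a buffer stored as x[t*IC + ic]: each time
// step is an independent dot product with each weight row w[oc*IC + ic].
std::vector<float> pointwise_conv(const std::vector<float> & x, int T, int IC,
                                  const std::vector<float> & w, int OC) {
    std::vector<float> y((std::size_t) T * OC, 0.0f);
    for (int t = 0; t < T; ++t) {
        for (int oc = 0; oc < OC; ++oc) {
            float sum = 0.0f;
            for (int ic = 0; ic < IC; ++ic) {
                sum += w[(std::size_t) oc * IC + ic] * x[(std::size_t) t * IC + ic];
            }
            y[(std::size_t) t * OC + oc] = sum;
        }
    }
    return y;
}

// The same contraction written as a matrix product Y = W * X, with X
// stored as x[ic*T + t] and Y stored as y[oc*T + t].
std::vector<float> matmul_transposed_layout(const std::vector<float> & x, int T, int IC,
                                            const std::vector<float> & w, int OC) {
    std::vector<float> y((std::size_t) OC * T, 0.0f);
    for (int oc = 0; oc < OC; ++oc) {
        for (int t = 0; t < T; ++t) {
            float sum = 0.0f;
            for (int ic = 0; ic < IC; ++ic) {
                sum += w[(std::size_t) oc * IC + ic] * x[(std::size_t) ic * T + t];
            }
            y[(std::size_t) oc * T + t] = sum;
        }
    }
    return y;
}

// Swaps between the two layouts: out[c*T + t] = in[t*C + c].
std::vector<float> transpose_tc(const std::vector<float> & in, int T, int C) {
    std::vector<float> out((std::size_t) T * C, 0.0f);
    for (int t = 0; t < T; ++t) {
        for (int c = 0; c < C; ++c) {
            out[(std::size_t) c * T + t] = in[(std::size_t) t * C + c];
        }
    }
    return out;
}
```

Both routines accumulate over `ic` in the same order, so only the memory layout differs. That is the property the test's tight tolerance relies on.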
diff --git a/tts-cpp/test/test_supertonic_f16_attn_parity.cpp b/tts-cpp/test/test_supertonic_f16_attn_parity.cpp
new file mode 100644
index 00000000000..15d0bb96809
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_f16_attn_parity.cpp
@@ -0,0 +1,433 @@
+// CPU-backend parity test for the F16 K/V flash-attention path
+// added to the Supertonic vector estimator in QVAC-18607.
+//
+// On OpenCL the goal of the rewrite is to dispatch the
+// `flash_attn_f32_f16` kernel instead of `flash_attn_f32` (attention
+// kernel time on Adreno drops by ~2.5x in chatterbox's measurement).
+// The CPU backend also implements both paths; running both on CPU
+// lets us validate that the F16 round-trip stays within an
+// acceptable absolute tolerance against the F32-only reference
+// without needing an OpenCL device on CI.
+//
+// Shapes here mirror what the Supertonic vector estimator uses in
+// practice:
+//
+//   width    = n_heads * head_dim
+//   n_heads  = 4
+//   head_dim = 64   (one of the supported OpenCL dims)
+//   q_len    = latent_len  (small int, ~20 in this test)
+//   kv_len   = text_len    (small int, ~32 in this test)
+//
+// Registered with `LABEL "unit"` in CMakeLists.txt so a fresh
+// checkout's `ctest` exercises this without needing any fixture.
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <random>
+#include <stdexcept>
+#include <vector>
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+struct attention_inputs {
+    int n_heads;
+    int head_dim;
+    int q_len;
+    int kv_len;
+    std::vector<float> q;   // [head_dim, q_len,  n_heads] (ggml order)
+    std::vector<float> k;   // [head_dim, kv_len, n_heads]
+    std::vector<float> v;   // [head_dim, kv_len, n_heads]
+    float scale;
+};
+
+attention_inputs make_inputs(int n_heads, int head_dim, int q_len, int kv_len, uint32_t seed) {
+    attention_inputs in;
+    in.n_heads  = n_heads;
+    in.head_dim = head_dim;
+    in.q_len    = q_len;
+    in.kv_len   = kv_len;
+    in.scale    = 1.0f / std::sqrt((float) head_dim);
+
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+
+    const size_t q_size = (size_t) head_dim * q_len  * n_heads;
+    const size_t k_size = (size_t) head_dim * kv_len * n_heads;
+    in.q.resize(q_size);
+    in.k.resize(k_size);
+    in.v.resize(k_size);
+    for (auto & v : in.q) v = dist(rng);
+    for (auto & v : in.k) v = dist(rng);
+    for (auto & v : in.v) v = dist(rng);
+    return in;
+}
+
+// Build a graph that runs `ggml_flash_attn_ext` with the requested
+// K / V dtype on the CPU backend, return the attention output as
+// a flat F32 vector.  `kv_type` is one of `GGML_TYPE_F32` (the
+// reference path), `GGML_TYPE_F16` (the OpenCL fast path), or
+// `GGML_TYPE_BF16` (round 4 — the Vulkan coopmat2 fast path,
+// added by Prereq B to cover the round-4 dispatch site change).
+std::vector<float> run_flash_attn(ggml_backend_t cpu,
+                                  const attention_inputs & in,
+                                  ggml_type kv_type) {
+    constexpr int MAX_NODES = 64;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                            ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    ggml_tensor * q = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
+                                         in.head_dim, in.q_len,  in.n_heads);
+    ggml_tensor * k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
+                                         in.head_dim, in.kv_len, in.n_heads);
+    ggml_tensor * v = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
+                                         in.head_dim, in.kv_len, in.n_heads);
+    ggml_set_name(q, "q"); ggml_set_input(q);
+    ggml_set_name(k, "k"); ggml_set_input(k);
+    ggml_set_name(v, "v"); ggml_set_input(v);
+
+    ggml_tensor * k_use = k;
+    ggml_tensor * v_use = v;
+    if (kv_type != GGML_TYPE_F32) {
+        // Same rewrite that ships in the vector estimator: contiguous
+        // typed destinations populated via `ggml_cpy` so the
+        // mixed-precision flash-attn dispatch sees row-major-by-head
+        // typed inputs.  F16 → existing OpenCL `flash_attn_f32_f16`
+        // / Vulkan `kernel_flash_attn_f32_f16_*` path.  BF16 → the
+        // round-4 Vulkan coopmat2 path (probe-gated by
+        // `supertonic_backend_supports_bf16_kv_flash_attn`).
+        ggml_tensor * k_typed = ggml_new_tensor_3d(ctx, kv_type,
+                                                   in.head_dim, in.kv_len, in.n_heads);
+        ggml_tensor * v_typed = ggml_new_tensor_3d(ctx, kv_type,
+                                                   in.head_dim, in.kv_len, in.n_heads);
+        k_use = ggml_cpy(ctx, k, k_typed);
+        v_use = ggml_cpy(ctx, v, v_typed);
+    }
+
+    ggml_tensor * attn = ggml_flash_attn_ext(ctx, q, k_use, v_use,
+                                             /*mask=*/nullptr,
+                                             in.scale,
+                                             /*max_bias=*/0.0f,
+                                             /*logit_softcap=*/0.0f);
+    ggml_set_name(attn, "attn"); ggml_set_output(attn);
+    ggml_build_forward_expand(gf, attn);
+
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    if (!ggml_gallocr_reserve(allocr, gf)) {
+        ggml_gallocr_free(allocr);
+        ggml_free(ctx);
+        throw std::runtime_error("ggml_gallocr_reserve flash_attn failed");
+    }
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "q"),
+                            in.q.data(), 0, in.q.size() * sizeof(float));
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "k"),
+                            in.k.data(), 0, in.k.size() * sizeof(float));
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "v"),
+                            in.v.data(), 0, in.v.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+
+    std::vector<float> out((size_t) ggml_nelements(attn));
+    ggml_backend_tensor_get(ggml_graph_get_tensor(gf, "attn"),
+                            out.data(), 0, out.size() * sizeof(float));
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    return out;
+}
+
+// Test 1 — F32 vs F16 K/V parity on the vector-estimator shape.
+//
+// Tolerance: F16 round-trip on attention typically lands within
+// ~5e-3 absolute / ~5e-3 relative on outputs near unit magnitude.
+// chatterbox ships this exact pattern in production behind
+// `--cfm-f16-kv-attn` with the same tolerance budget.  Tightening
+// below this would catch a real F16 regression but also reject
+// healthy F16 noise; loosening would let an actually-incorrect
+// kernel slip through.
+void test_attn_f32_vs_f16_parity(ggml_backend_t cpu) {
+    const int n_heads  = 4;
+    const int head_dim = 64;
+    const int q_len    = 20;
+    const int kv_len   = 32;
+    const auto in = make_inputs(n_heads, head_dim, q_len, kv_len, 0xC1A5);
+
+    std::vector<float> ref;
+    std::vector<float> got;
+    bool ran_both = true;
+    try {
+        ref = run_flash_attn(cpu, in, GGML_TYPE_F32);
+    } catch (const std::exception & e) {
+        std::fprintf(stderr,
+                     "  [attn F32 path] FAILED to run on this CPU build: %s\n",
+                     e.what());
+        ran_both = false;
+    }
+    try {
+        got = run_flash_attn(cpu, in, GGML_TYPE_F16);
+    } catch (const std::exception & e) {
+        std::fprintf(stderr,
+                     "  [attn F16 path] FAILED to run on this CPU build: %s\n",
+                     e.what());
+        ran_both = false;
+    }
+
+    if (!ran_both) {
+        // Treat as informative: the CPU build lacks one of the two
+        // flash-attention paths.  Don't count this as a failure;
+        // the production OpenCL build is what actually consumes
+        // the rewrite, and a missing CPU-side path here doesn't
+        // change that.  The dispatch + portable_ops tests still
+        // catch the rest of the bring-up regressions.
+        std::fprintf(stderr,
+                     "  [attn parity] SKIPPED — CPU build missing one path\n");
+        return;
+    }
+    CHECK(ref.size() == got.size());
+
+    int bad = 0;
+    float max_abs_err = 0.0f;
+    float max_rel_err = 0.0f;
+    const float atol = 5e-3f;
+    const float rtol = 5e-3f;
+    for (size_t i = 0; i < ref.size(); ++i) {
+        const float abs_err = std::fabs(got[i] - ref[i]);
+        const float rel_err = std::fabs(ref[i]) > 1e-6f ? abs_err / std::fabs(ref[i]) : abs_err;
+        max_abs_err = std::max(max_abs_err, abs_err);
+        max_rel_err = std::max(max_rel_err, rel_err);
+        if (abs_err > atol + rtol * std::fabs(ref[i])) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  attn parity mismatch @ %zu: ref=%.6g got=%.6g abs_err=%.3e\n",
+                             i, ref[i], got[i], abs_err);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  [attn F32 vs F16 parity]  q=%d kv=%d h=%d d=%d  "
+                 "max_abs_err=%.3e  max_rel_err=%.3e  bad=%d / %zu\n",
+                 q_len, kv_len, n_heads, head_dim,
+                 max_abs_err, max_rel_err, bad, ref.size());
+    CHECK(bad == 0);
+}
+
+// Test 2 — Style attention shape (kv_len = 50, the fixed style-token
+// count).  Same parity story, slightly larger workload, validates
+// the F16 path doesn't regress on the second hot shape.
+void test_attn_style_shape(ggml_backend_t cpu) {
+    const int n_heads  = 4;
+    const int head_dim = 64;
+    const int q_len    = 20;
+    const int kv_len   = 50;   // style tokens — fixed across all prompts
+    const auto in = make_inputs(n_heads, head_dim, q_len, kv_len, 0x5717);
+
+    std::vector<float> ref, got;
+    try {
+        ref = run_flash_attn(cpu, in, GGML_TYPE_F32);
+        got = run_flash_attn(cpu, in, GGML_TYPE_F16);
+    } catch (const std::exception & e) {
+        std::fprintf(stderr,
+                     "  [attn style shape] SKIPPED: %s\n", e.what());
+        return;
+    }
+    CHECK(ref.size() == got.size());
+
+    int bad = 0;
+    float max_abs_err = 0.0f;
+    const float atol = 5e-3f;
+    const float rtol = 5e-3f;
+    for (size_t i = 0; i < ref.size(); ++i) {
+        const float abs_err = std::fabs(got[i] - ref[i]);
+        max_abs_err = std::max(max_abs_err, abs_err);
+        if (abs_err > atol + rtol * std::fabs(ref[i])) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  style attn mismatch @ %zu: ref=%.6g got=%.6g abs_err=%.3e\n",
+                             i, ref[i], got[i], abs_err);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  [attn style shape] kv=%d  max_abs_err=%.3e  bad=%d / %zu\n",
+                 kv_len, max_abs_err, bad, ref.size());
+    CHECK(bad == 0);
+}
+
+// QVAC-18605 round 4 — Prereq B: parameterised K/V parity check.
+//
+// Generalised version of `test_attn_f32_vs_f16_parity` /
+// `test_attn_style_shape` that runs the F32 reference and an
+// arbitrary `kv_dtype` candidate, then checks max-abs-err against
+// a per-dtype tolerance band.  Used by the BF16 tests below.
+//
+// Per-dtype tolerance rationale:
+//   - F16  : 5e-3 abs / 5e-3 rel (existing baseline; matches
+//            chatterbox CHATTERBOX_F16_CFM tolerance).
+//   - BF16 : 5e-3 abs / 5e-3 rel (BF16 has 8 significand bits to
+//            F16's 11, so it is coarser per element; it shares
+//            F32's 8-bit exponent.  Same tolerance band; the wider
+//            exponent range buys stability on small attention
+//            scores, not extra accuracy near unit magnitude.)
+//
+// The CPU backend MAY or MAY NOT advertise BF16 K/V flash-attn
+// (depends on whether ggml-cpu was compiled with BF16 dot-product
+// support).  When the BF16 path throws on this build, the test
+// is reported as SKIPPED instead of failing — same convention as
+// the existing F16 path's "missing one path" treatment.  The
+// production Vulkan adapter is what actually consumes this
+// dispatch and is probe-gated separately at runtime by
+// `supertonic_backend_supports_bf16_kv_flash_attn`.
+void test_attn_kv_dtype_parity(ggml_backend_t cpu,
+                               const char * label,
+                               int n_heads,
+                               int head_dim,
+                               int q_len,
+                               int kv_len,
+                               uint32_t seed,
+                               ggml_type kv_dtype,
+                               float atol,
+                               float rtol) {
+    const auto in = make_inputs(n_heads, head_dim, q_len, kv_len, seed);
+
+    std::vector<float> ref;
+    std::vector<float> got;
+    bool ran_both = true;
+    try {
+        ref = run_flash_attn(cpu, in, GGML_TYPE_F32);
+    } catch (const std::exception & e) {
+        std::fprintf(stderr,
+                     "  [%s F32 ref] FAILED to run on this CPU build: %s\n",
+                     label, e.what());
+        ran_both = false;
+    }
+    try {
+        got = run_flash_attn(cpu, in, kv_dtype);
+    } catch (const std::exception & e) {
+        std::fprintf(stderr,
+                     "  [%s %s K/V] FAILED to run on this CPU build: %s\n",
+                     label, ggml_type_name(kv_dtype), e.what());
+        ran_both = false;
+    }
+    if (!ran_both) {
+        std::fprintf(stderr,
+                     "  [%s parity %s] SKIPPED — CPU build missing one path\n",
+                     label, ggml_type_name(kv_dtype));
+        return;
+    }
+    CHECK(ref.size() == got.size());
+
+    int bad = 0;
+    float max_abs_err = 0.0f;
+    float max_rel_err = 0.0f;
+    for (size_t i = 0; i < ref.size(); ++i) {
+        const float abs_err = std::fabs(got[i] - ref[i]);
+        const float rel_err = std::fabs(ref[i]) > 1e-6f ? abs_err / std::fabs(ref[i]) : abs_err;
+        max_abs_err = std::max(max_abs_err, abs_err);
+        max_rel_err = std::max(max_rel_err, rel_err);
+        if (abs_err > atol + rtol * std::fabs(ref[i])) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  %s/%s parity mismatch @ %zu: ref=%.6g got=%.6g abs_err=%.3e\n",
+                             label, ggml_type_name(kv_dtype), i, ref[i], got[i], abs_err);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  [%s parity %s]  q=%d kv=%d h=%d d=%d  "
+                 "max_abs_err=%.3e  max_rel_err=%.3e  bad=%d / %zu  (atol=%.0e, rtol=%.0e)\n",
+                 label, ggml_type_name(kv_dtype),
+                 q_len, kv_len, n_heads, head_dim,
+                 max_abs_err, max_rel_err, bad, ref.size(), atol, rtol);
+    CHECK(bad == 0);
+}
+
+// Test 3 (round 4 / Prereq B): F32 vs BF16 K/V parity on the
+// vector-estimator shape.  BF16 is 16 bits wide like F16, so the
+// per-element upload bandwidth is identical, but it trades
+// significand bits (8 vs F16's 11) for a wider 8-bit exponent:
+// small attention scores avoid the F16 underflow that drives the
+// F16 test's 5e-3 tolerance.  Same tolerance band here as a
+// SAFETY gate (any nonzero bad-count signals a real BF16 kernel
+// regression rather than a precision-vs-F16 difference).
+//
+// Written BEFORE the round-4 dispatch site change (TDD), so the
+// parity gate is in place before any production code touches
+// the K/V cast logic.
+void test_attn_f32_vs_bf16_parity(ggml_backend_t cpu) {
+    test_attn_kv_dtype_parity(cpu,
+        /*label=*/   "vector_estimator",
+        /*n_heads=*/ 4,
+        /*head_dim=*/64,
+        /*q_len=*/   20,
+        /*kv_len=*/  32,
+        /*seed=*/    0xBF16C1A5,
+        /*kv_dtype=*/GGML_TYPE_BF16,
+        /*atol=*/    5e-3f,
+        /*rtol=*/    5e-3f);
+}
+
+// Test 4 (round 4 / Prereq B) — same shape as the existing
+// F16 style-shape test (kv=50) but with BF16 K/V.  Catches
+// BF16-specific regressions on the second hot shape.
+void test_attn_bf16_style_shape(ggml_backend_t cpu) {
+    test_attn_kv_dtype_parity(cpu,
+        /*label=*/   "style_attention",
+        /*n_heads=*/ 4,
+        /*head_dim=*/64,
+        /*q_len=*/   20,
+        /*kv_len=*/  50,
+        /*seed=*/    0xBF165717,
+        /*kv_dtype=*/GGML_TYPE_BF16,
+        /*atol=*/    5e-3f,
+        /*rtol=*/    5e-3f);
+}
+
+} // namespace
+
+int main() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "ggml_backend_cpu_init failed\n");
+        return 1;
+    }
+
+    // Existing F16 parity tests — unchanged.
+    test_attn_f32_vs_f16_parity(cpu);
+    test_attn_style_shape(cpu);
+
+    // Round 4 / Prereq B — BF16 parity tests, written BEFORE the
+    // round-4 dispatch site change.
+    test_attn_f32_vs_bf16_parity(cpu);
+    test_attn_bf16_style_shape(cpu);
+
+    ggml_backend_free(cpu);
+
+    std::fprintf(stderr,
+                 "test_supertonic_f16_attn_parity: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
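Reviewer note on the tolerance band used above. The sketch below is a standalone illustration, not part of the patch: `bf16_round_trip` and `count_out_of_band` are hypothetical helper names, and the BF16 rounding is emulated with plain bit manipulation rather than any ggml call. It shows the combined check `abs_err <= atol + rtol * |ref|` that the parity loop applies, and how large a single BF16 rounding step is near unit magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round an F32 value to BF16 (round-to-nearest-even on the top 16 bits)
// and widen it back to F32.  NaN/Inf inputs are not special-cased here.
inline float bf16_round_trip(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint32_t lsb = (bits >> 16) & 1u;  // lowest bit that survives
    bits += 0x7FFFu + lsb;                        // ties go to the even neighbour
    bits &= 0xFFFF0000u;                          // drop the low 16 bits
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Count elements outside the combined band |got - ref| <= atol + rtol * |ref|.
inline int count_out_of_band(const std::vector<float> & ref,
                             const std::vector<float> & got,
                             float atol, float rtol) {
    assert(ref.size() == got.size());
    int bad = 0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float abs_err = std::fabs(got[i] - ref[i]);
        if (abs_err > atol + rtol * std::fabs(ref[i])) {
            ++bad;
        }
    }
    return bad;
}
```

Under these assumptions, a value halfway between two BF16 neighbours near 1.0 (for example 1 + 2^-8) rounds to 1.0, an error of about 3.9e-3. That sits inside the 5e-3 + 5e-3 * |ref| band, which is consistent with reusing the F16 tolerance for outputs near unit magnitude.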
diff --git a/tts-cpp/test/test_supertonic_f16_deny_list_api.cpp b/tts-cpp/test/test_supertonic_f16_deny_list_api.cpp
new file mode 100644
index 00000000000..4335df53441
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_f16_deny_list_api.cpp
@@ -0,0 +1,134 @@
+// QVAC-18605 round 6 — CPU-only TDD test for the F16-weights
+// deny-list API surface.
+//
+// Round 6 layers a user-overridable extra deny-list on top of
+// the existing hand-curated `should_materialise_f16_weight()`
+// allow-list.  The deny-list lives on `EngineOptions` and gets
+// plumbed through `load_supertonic_gguf` to the predicate at
+// load time.
+//
+// API surface this test pins:
+//   - `EngineOptions::f16_weights_deny_list` is a public field
+//     of type `std::vector<std::string>` defaulting to empty.
+//   - `load_supertonic_gguf(...)` accepts an optional
+//     `const std::vector<std::string> & f16_weights_deny_list`
+//     parameter at the end of its signature, defaulting to empty
+//     (so every existing call site keeps compiling).
+//   - The 2-arg `should_materialise_f16_weight(name, deny)`
+//     overload exists with the documented signature.
+//
+// Behaviour is covered by `test_supertonic_f16_weights.cpp`
+// (predicate level) and the load-time fixture-bound tests
+// (model-bound, run on hosts with the GGUF available).  This
+// test only asserts the API surface compiles + the defaults are
+// what we documented.
+//
+// Written FIRST (TDD).  Whole TU MUST fail to compile before
+// the symbols are added; MUST compile + pass after.
+
+#include "tts-cpp/supertonic/engine.h"
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <string>
+#include <type_traits>
+#include <vector>
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// SFINAE: assert that `EngineOptions::f16_weights_deny_list`
+// member exists and has the expected type.  If the symbol is
+// missing the whole TU fails to compile — exactly what TDD
+// step 2 expects.
+template <typename T>
+auto has_f16_weights_deny_list_field(int) -> decltype(
+    std::declval<T &>().f16_weights_deny_list,
+    std::true_type{}
+);
+template <typename T>
+auto has_f16_weights_deny_list_field(...) -> std::false_type;
+
+// SFINAE: assert that `load_supertonic_gguf` accepts the
+// `f16_weights_deny_list` argument.  Post-rebase onto upstream's
+// Metal-port `supertonic_optimizations` branch, the parameter
+// order is:
+//   path, model, n_gpu_layers, verbose, f16_weights, precision,
+//   vulkan_device, f16_weights_deny_list
+// (8 params in total, 6 after `model`); the deny-list lives at the
+// 8th position (was 7th pre-rebase on the round-6 branch).
+template <typename = void>
+auto has_deny_list_param_in_load(int) -> decltype(
+    tts_cpp::supertonic::detail::load_supertonic_gguf(
+        std::declval<const std::string &>(),
+        std::declval<tts_cpp::supertonic::detail::supertonic_model &>(),
+        /*n_gpu_layers=*/0,
+        /*verbose=*/false,
+        /*f16_weights=*/-1,
+        /*precision=*/tts_cpp::supertonic::detail::supertonic_precision::F32,
+        /*vulkan_device=*/0,
+        /*f16_weights_deny_list=*/std::declval<const std::vector<std::string> &>()),
+    std::true_type{}
+);
+template <typename = void>
+auto has_deny_list_param_in_load(...) -> std::false_type;
+
+void test_engine_options_field_exists() {
+    std::fprintf(stderr, "[Round 6 API: EngineOptions::f16_weights_deny_list]\n");
+    using namespace tts_cpp::supertonic;
+    static_assert(
+        decltype(has_f16_weights_deny_list_field<EngineOptions>(0))::value,
+        "EngineOptions must declare f16_weights_deny_list");
+
+    EngineOptions opts;
+    // Default must be empty.
+    CHECK(opts.f16_weights_deny_list.empty());
+
+    // Field must be assignable from a braced initializer list.
+    opts.f16_weights_deny_list = {".pwconv1.", "MatMul_3101"};
+    CHECK(opts.f16_weights_deny_list.size() == 2);
+    CHECK(opts.f16_weights_deny_list[0] == ".pwconv1.");
+    CHECK(opts.f16_weights_deny_list[1] == "MatMul_3101");
+
+    // Documented default for every other field stays unchanged
+    // (regression guard for the round-3 prewarm/vulkan_device
+    // baseline).
+    EngineOptions baseline;
+    CHECK(baseline.prewarm_text.empty());
+    CHECK(baseline.vulkan_device == 0);
+    CHECK(baseline.f16_attn == -1);
+    CHECK(baseline.f16_weights == -1);
+}
+
+void test_load_supertonic_gguf_param_exists() {
+    std::fprintf(stderr, "[Round 6 API: load_supertonic_gguf f16_weights_deny_list param]\n");
+    static_assert(
+        decltype(has_deny_list_param_in_load<>(0))::value,
+        "load_supertonic_gguf must accept an optional f16_weights_deny_list parameter");
+    // The static_assert is the actual API gate.  Bump check
+    // count so the test reports a meaningful pass/fail summary.
+    ++g_checks;
+}
+
+} // namespace
+
+int main() {
+    test_engine_options_field_exists();
+    test_load_supertonic_gguf_param_exists();
+
+    std::fprintf(stderr,
+                 "test_supertonic_f16_deny_list_api: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
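Reviewer note on the member-detection idiom used in this test. The sketch below is a minimal, standalone version of the same `decltype` plus overload-ranking trick. `OptionsWithField`, `OptionsWithoutField`, `probe_deny_list` and `has_deny_list` are hypothetical names invented for illustration; they are not the real `EngineOptions` API. The key point is that the probed expression depends on the template parameter `T`, which is what lets the `...` overload act as a fallback instead of producing a hard error.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Two toy option structs: one declares the field, one does not.
struct OptionsWithField    { std::vector<std::string> deny_list; };
struct OptionsWithoutField { int unrelated = 0; };

// Preferred overload: only viable when `T` has a `deny_list` member.
template <typename T>
auto probe_deny_list(int) -> decltype(std::declval<T &>().deny_list,
                                      std::true_type{});

// Fallback overload: chosen when the expression above is ill-formed for `T`.
template <typename T>
auto probe_deny_list(...) -> std::false_type;

template <typename T>
constexpr bool has_deny_list() {
    return decltype(probe_deny_list<T>(0))::value;
}

static_assert(has_deny_list<OptionsWithField>(),
              "field should be detected");
static_assert(!has_deny_list<OptionsWithoutField>(),
              "missing field should fall back to false_type");
static_assert(std::is_same<decltype(OptionsWithField::deny_list),
                           std::vector<std::string>>::value,
              "field should have the expected type");
```

A `static_assert` on the positive case turns a missing field into a compile failure, which matches the test-first workflow described in the file header.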
diff --git a/tts-cpp/test/test_supertonic_f16_weights.cpp b/tts-cpp/test/test_supertonic_f16_weights.cpp
new file mode 100644
index 00000000000..3c41c9c6842
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_f16_weights.cpp
@@ -0,0 +1,363 @@
+// TDD harness for Phase 2A — F16 weight materialization for the hot
+// matmul / pointwise-conv weights identified in
+// `AUDIT_SUPERTONIC_OPENCL.md` § F6 + Phase 2A.
+//
+// Two layers of testing here:
+//
+//   1. Unit-level predicate test (no GGUF, runs on `ctest -L unit`).
+//      Validates `should_materialise_f16_weight(name)` returns
+//      `true` for every entry on the hot-weights roster and
+//      `false` for negatives (random tensor names, edge cases,
+//      tensors whose names contain a substring of a hot weight
+//      but aren't on the roster — e.g. the bias of a hot conv).
+//
+//   2. Fixture-level dtype test (requires GGUF).
+//      Loads the model once with the default `f16_weights`
+//      setting, buckets every source tensor by the predicate,
+//      and asserts:
+//        - Hot weights are uniformly `GGML_TYPE_F32` or
+//          uniformly `GGML_TYPE_F16`, never a mix of the two.
+//      Not yet covered here: a second load with the flag
+//      flipped, the check that weights off the roster keep
+//      their baseline type, and byte-equivalence of non-hot
+//      tensors across the two loads.
+//
+// Wired into CMakeLists.txt under `LABEL "fixture"` for the model
+// dependence, with the predicate sub-test running unconditionally.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <cstring>
+#include <string>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Hot-weight predicate covers:
+//   - vector_estimator attention W_query / W_key / W_value / W_out
+//     matmul weights for the four groups (MatMul_3101/02/03/10 …
+//     plus the three group siblings).  These also include the
+//     style-attention MatMuls (3116/17/18/19 etc).
+//   - vector_estimator pointwise conv1 / conv2 inside every
+//     convnext block (`main_blocks.*.convnext.*.pwconv{1,2}.weight`
+//     and `last_convnext.convnext.*.pwconv{1,2}.weight`).
+//   - vocoder pointwise conv1 / conv2 inside every convnext
+//     block + the head conv1 weight.
+//   - text-encoder transformer linear weights.
+//
+// Negative cases (predicate must NOT match):
+//   - biases (`.bias` suffix).
+//   - small per-channel scale/shift vectors (`norm.weight`,
+//     `gamma`, etc).
+//   - non-linear weights (`emb_rel_k`, embedding tables).
+//   - per-tensor scalars (`normalizer_scale`, `head_prelu`).
+//
+// The predicate sub-test below is fully self-contained — no
+// model state needed.  Runs as a unit test.
+void test_predicate_positives() {
+    std::fprintf(stderr, "[Phase 2A predicate positives]\n");
+    static const char * const kHotNames[] = {
+        // vector_estimator attention matmuls (front block + 3 groups).
+        "vector_estimator:onnx::MatMul_3101",  // Q
+        "vector_estimator:onnx::MatMul_3102",  // K
+        "vector_estimator:onnx::MatMul_3103",  // V
+        "vector_estimator:onnx::MatMul_3110",  // out
+        "vector_estimator:onnx::MatMul_3146",  // g1 Q
+        "vector_estimator:onnx::MatMul_3155",  // g1 out
+        "vector_estimator:onnx::MatMul_3191",  // g2 Q
+        "vector_estimator:onnx::MatMul_3236",  // g3 Q
+        // vector_estimator style-attention matmuls.
+        "vector_estimator:onnx::MatMul_3116",  // style0 Q
+        "vector_estimator:onnx::MatMul_3119",  // style0 out
+        // vector_estimator convnext pointwise.
+        "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.pwconv1.weight",
+        "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.pwconv2.weight",
+        "vector_estimator:tts.ttl.vector_field.last_convnext.convnext.0.pwconv1.weight",
+        // vocoder convnext + head.
+        "vocoder:tts.ae.decoder.convnext.0.pwconv1.weight",
+        "vocoder:tts.ae.decoder.convnext.5.pwconv2.weight",
+        "vocoder:tts.ae.decoder.head.layer1.net.weight",
+        // text-encoder linears.
+        "text_encoder:onnx::MatMul_3678",
+        "text_encoder:onnx::MatMul_3685",
+    };
+    int missed = 0;
+    for (const char * name : kHotNames) {
+        const bool got = should_materialise_f16_weight(name);
+        CHECK(got);
+        if (!got) {
+            ++missed;
+            std::fprintf(stderr, "  predicate returned false for hot weight: %s\n", name);
+        }
+    }
+    std::fprintf(stderr, "  %zu positives, %d missed\n",
+                 sizeof(kHotNames) / sizeof(kHotNames[0]), missed);
+}
+
+void test_predicate_negatives() {
+    std::fprintf(stderr, "[Phase 2A predicate negatives]\n");
+    static const char * const kColdNames[] = {
+        // biases — NEVER quantize, drift accumulates.
+        "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_query.linear.bias",
+        "vocoder:tts.ae.decoder.convnext.0.pwconv1.bias",
+        // per-channel scale / shift — too small for F16 to matter,
+        // and `repeat_like` mismatches if we change shape.
+        "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.norm.norm.weight",
+        "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.norm.norm.bias",
+        "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.gamma",
+        "vocoder:tts.ae.decoder.convnext.0.norm.norm.weight",
+        // embeddings + lookup tables.
+        "text_encoder:tts.ttl.text_encoder.text_embedder.char_embedder.weight",
+        "duration:tts.dp.sentence_encoder.text_embedder.char_embedder.weight",
+        // per-tensor scalars.
+        "vocoder:tts.ttl.normalizer.scale",
+        "vocoder:onnx::PRelu_1505",
+        // small relative-position embeddings.
+        "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_k",
+        "duration:tts.dp.sentence_encoder.attn_encoder.attn_layers.0.emb_rel_v",
+        // depthwise conv (small per-channel kernels).
+        "vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.dwconv.weight",
+        "vocoder:tts.ae.decoder.convnext.0.dwconv.net.weight",
+        // theta (rope) constant — small, hot, but cached host-side
+        // by F1 so it's already on the host F32 path.
+        "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta",
+        // unrelated infrastructure.
+        "supertonic/unicode_indexer",
+        "supertonic/voices/F1/ttl",
+        // pre-transposed companions (F6) — they live alongside the
+        // original; the original gets materialised, the __T is
+        // already a separate tensor and shouldn't double-down.
+        "vector_estimator:onnx::MatMul_3095__T",
+    };
+    int over = 0;
+    for (const char * name : kColdNames) {
+        const bool got = should_materialise_f16_weight(name);
+        CHECK(!got);
+        if (got) {
+            ++over;
+            std::fprintf(stderr, "  predicate returned true for cold weight: %s\n", name);
+        }
+    }
+    std::fprintf(stderr, "  %zu negatives, %d false-positives\n",
+                 sizeof(kColdNames) / sizeof(kColdNames[0]), over);
+}
+
+void test_predicate_edges() {
+    std::fprintf(stderr, "[Phase 2A predicate edge cases]\n");
+    // Empty + nonsense inputs must return false without throwing.
+    CHECK(!should_materialise_f16_weight(""));
+    CHECK(!should_materialise_f16_weight("not a real tensor name"));
+    CHECK(!should_materialise_f16_weight("vector_estimator:"));
+    CHECK(!should_materialise_f16_weight("vector_estimator:onnx::MatMul_"));
+    // Looks like a hot weight but isn't (hot name plus a suffix).
+    CHECK(!should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101_bias"));
+    // Substring match would be a bug — `.weight` inside a path
+    // shouldn't trigger.
+    CHECK(!should_materialise_f16_weight("vocoder:tts.ae.decoder.convnext.weight_stats"));
+}
+
+// QVAC-18605 round 6 — TDD test for the 2-arg
+// `should_materialise_f16_weight(name, extra_deny_substrings)`
+// overload.  Lets operators force-keep specific tensors as F32
+// even when the auto/curated allow-list would have promoted them
+// to F16.  Use cases:
+//   - Researcher A/B testing a specific tensor pattern without
+//     recompiling.
+//   - Operator force-keeping a tensor as F32 if they observe
+//     drift on their hardware.
+//   - Safety net for new tensor patterns added in future GGUFs.
+//
+// Contract:
+//   - Empty deny-list: 2-arg overload behaves identically to the
+//     1-arg version (zero behaviour change for the default path).
+//   - Any substring in the deny-list that matches a tensor name
+//     forces a `false` return, even if the curated allow-list
+//     would have said `true`.
+//   - The deny-list cannot promote a cold weight to hot
+//     (it's a deny-list, not an allow-list — adding a non-
+//     matching pattern doesn't help).
+//   - Empty strings inside the deny-list are skipped (no-op),
+//     not treated as matching every name (defensive).
+//   - Substring matching, not regex (matches the curated
+//     predicate's audit-friendly style; no regex compile cost,
+//     no invalid-pattern error surface).
+//
+// Written FIRST (TDD).  MUST fail before the 2-arg overload is
+// added; MUST pass after.
+void test_predicate_deny_list_empty_passthrough() {
+    std::fprintf(stderr, "[Round 6 deny-list: empty-list passthrough]\n");
+    // With an empty extra-deny-list, every result must equal the
+    // 1-arg version's result.  Spot-check a positive and a
+    // negative.
+    const std::vector<std::string> empty_deny;
+    CHECK(should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101", empty_deny) ==
+          should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101"));
+    CHECK(should_materialise_f16_weight("vocoder:tts.ae.decoder.convnext.0.pwconv1.weight", empty_deny) ==
+          should_materialise_f16_weight("vocoder:tts.ae.decoder.convnext.0.pwconv1.weight"));
+    CHECK(should_materialise_f16_weight("vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.norm.norm.weight", empty_deny) ==
+          should_materialise_f16_weight("vector_estimator:tts.ttl.vector_field.main_blocks.0.convnext.0.norm.norm.weight"));
+}
+
+void test_predicate_deny_list_excludes_match() {
+    std::fprintf(stderr, "[Round 6 deny-list: matching deny excludes hot weight]\n");
+    // A hot weight that the 1-arg version returns `true` for must
+    // return `false` when the deny-list contains a substring of
+    // its name.
+    const std::string hot = "vector_estimator:onnx::MatMul_3101";
+    CHECK(should_materialise_f16_weight(hot));  // baseline: hot
+
+    // Exact-name deny.
+    CHECK(!should_materialise_f16_weight(hot, std::vector<std::string>{"MatMul_3101"}));
+    // Stage-prefix deny: excludes EVERY vector_estimator MatMul.
+    CHECK(!should_materialise_f16_weight(hot, std::vector<std::string>{"vector_estimator:onnx::MatMul_"}));
+    // Short digit-only substring (defensive: works because of substring
+    // semantics, but operators should write more specific patterns).
+    CHECK(!should_materialise_f16_weight(hot, std::vector<std::string>{"3101"}));
+
+    // Same pattern applied to a pwconv weight.
+    const std::string pw = "vocoder:tts.ae.decoder.convnext.0.pwconv1.weight";
+    CHECK(should_materialise_f16_weight(pw));  // baseline: hot
+    CHECK(!should_materialise_f16_weight(pw, std::vector<std::string>{".pwconv1."}));
+    // pwconv2 deny shouldn't affect pwconv1.
+    CHECK(should_materialise_f16_weight(pw, std::vector<std::string>{".pwconv2."}));
+}
+
+void test_predicate_deny_list_no_match() {
+    std::fprintf(stderr, "[Round 6 deny-list: non-matching deny is no-op]\n");
+    // A deny-list with no matching substring must leave the result
+    // unchanged.  Spot-check positive (still hot) and negative
+    // (still cold).
+    const std::vector<std::string> deny_unrelated = {"ZZZ_definitely_not_in_any_name"};
+    CHECK(should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101", deny_unrelated));
+    CHECK(!should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101_bias", deny_unrelated));
+}
+
+void test_predicate_deny_list_cannot_promote_cold() {
+    std::fprintf(stderr, "[Round 6 deny-list: cannot promote cold weight to hot]\n");
+    // The deny-list is a DENY-list, not an allow-list.  Adding a
+    // pattern that matches a cold weight has no effect (cold + deny
+    // is still cold; deny only operates on the `true` branch of
+    // the 1-arg predicate).
+    const std::string cold = "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.W_query.linear.bias";
+    CHECK(!should_materialise_f16_weight(cold));  // baseline: cold (bias)
+    CHECK(!should_materialise_f16_weight(cold, std::vector<std::string>{"linear.bias"}));
+    CHECK(!should_materialise_f16_weight(cold, std::vector<std::string>{"NOT_IN_NAME"}));
+}
+
+void test_predicate_deny_list_multiple_patterns() {
+    std::fprintf(stderr, "[Round 6 deny-list: ANY match excludes]\n");
+    // Multiple patterns: ANY match excludes the weight.  Patterns
+    // are independent (no AND-of-all semantics).
+    const std::string hot = "vocoder:tts.ae.decoder.convnext.0.pwconv1.weight";
+    const std::vector<std::string> deny_multi = {
+        "AAAAA_no_match",
+        ".pwconv1.",        // matches!
+        "BBBBB_no_match",
+    };
+    CHECK(!should_materialise_f16_weight(hot, deny_multi));
+
+    // All-non-matching multi-pattern: still hot.
+    const std::vector<std::string> deny_all_miss = {
+        "AAAAA_no_match",
+        "BBBBB_no_match",
+        "CCCCC_no_match",
+    };
+    CHECK(should_materialise_f16_weight(hot, deny_all_miss));
+}
+
+void test_predicate_deny_list_empty_string_safe() {
+    std::fprintf(stderr, "[Round 6 deny-list: empty string in deny-list is skipped]\n");
+    // An empty string would technically match every name under
+    // substring semantics ("" is a substring of every string),
+    // which would silently disable F16 weights entirely — almost
+    // certainly an operator typo (e.g. accidentally trailing
+    // comma in a config file).  Defensive: empty-string entries
+    // are SKIPPED instead of treated as universal matches.
+    const std::vector<std::string> deny_with_empty = {""};
+    CHECK(should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101", deny_with_empty));
+    CHECK(should_materialise_f16_weight("vocoder:tts.ae.decoder.convnext.0.pwconv1.weight", deny_with_empty));
+
+    // Mixed: empty + a real pattern.  The real pattern must still
+    // take effect.
+    const std::vector<std::string> deny_mixed = {"", ".pwconv1."};
+    CHECK(!should_materialise_f16_weight("vocoder:tts.ae.decoder.convnext.0.pwconv1.weight", deny_mixed));
+    CHECK(should_materialise_f16_weight("vector_estimator:onnx::MatMul_3101", deny_mixed));
+}
+
+void test_predicate_deny_list_empty_name_safe() {
+    std::fprintf(stderr, "[Round 6 deny-list: empty source name still returns false]\n");
+    // Empty source name was handled defensively by the 1-arg
+    // version (returns false).  The 2-arg overload must preserve
+    // this regardless of the deny-list contents.
+    CHECK(!should_materialise_f16_weight("", std::vector<std::string>{}));
+    CHECK(!should_materialise_f16_weight("", std::vector<std::string>{"any"}));
+}
+
+} // namespace
+
+int main(int argc, char ** argv) {
+    // Unit-level predicate tests run unconditionally; no model.
+    test_predicate_positives();
+    test_predicate_negatives();
+    test_predicate_edges();
+    // QVAC-18605 round 6 — 2-arg overload tests (TDD: these are
+    // the new symbol; whole block must fail compilation before
+    // implementation, then pass after).
+    test_predicate_deny_list_empty_passthrough();
+    test_predicate_deny_list_excludes_match();
+    test_predicate_deny_list_no_match();
+    test_predicate_deny_list_cannot_promote_cold();
+    test_predicate_deny_list_multiple_patterns();
+    test_predicate_deny_list_empty_string_safe();
+    test_predicate_deny_list_empty_name_safe();
+
+    // Fixture-level shape/dtype check requires the GGUF.
+    if (argc >= 2) {
+        std::fprintf(stderr, "[Phase 2A fixture] (loading %s)\n", argv[1]);
+        supertonic_model model_f32;
+        if (load_supertonic_gguf(argv[1], model_f32, /*n_gpu_layers=*/0, /*verbose=*/false)) {
+            // model loaded with f16_weights=false by default.
+            int f32_hot = 0, f16_hot = 0, other = 0;
+            for (const auto & kv : model_f32.source_tensors) {
+                if (!kv.second) continue;
+                if (should_materialise_f16_weight(kv.first)) {
+                    if (kv.second->type == GGML_TYPE_F32) ++f32_hot;
+                    else if (kv.second->type == GGML_TYPE_F16) ++f16_hot;
+                } else {
+                    ++other;
+                }
+            }
+            std::fprintf(stderr,
+                         "  default load: hot-F32=%d hot-F16=%d other=%d\n",
+                         f32_hot, f16_hot, other);
+            // Default load is expected to keep hot weights as F32 on CPU;
+            // the check only requires that the two buckets are not mixed.
+            CHECK(f16_hot == 0 || f32_hot == 0); // never both non-zero
+            free_supertonic_model(model_f32);
+        } else {
+            std::fprintf(stderr, "  skip fixture: failed to load %s\n", argv[1]);
+        }
+    } else {
+        std::fprintf(stderr, "  (fixture skipped; pass MODEL.gguf to enable)\n");
+    }
+
+    std::fprintf(stderr,
+                 "test_supertonic_f16_weights: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
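Reviewer note on the deny-list contract pinned by the round-6 tests above. The sketch below restates that contract as a small standalone function. `toy_is_hot` and `toy_should_materialise` are hypothetical stand-ins written for illustration; the real `should_materialise_f16_weight` implementation is not shown in this patch section and may differ. The toy allow-list (any `.pwconv*` tensor ending in `.weight`) is likewise an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the curated 1-arg predicate: a name is "hot"
// when it contains ".pwconv" and ends with ".weight".
inline bool toy_is_hot(const std::string & name) {
    const std::string suffix = ".weight";
    const bool ends_with_weight =
        name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    return ends_with_weight && name.find(".pwconv") != std::string::npos;
}

// Deny-list layered on top of the curated predicate.
inline bool toy_should_materialise(const std::string & name,
                                   const std::vector<std::string> & deny) {
    if (!toy_is_hot(name)) {
        return false;              // a deny-list can never promote a cold weight
    }
    for (const std::string & pattern : deny) {
        if (pattern.empty()) {
            continue;              // "" would match everything, so skip it
        }
        if (name.find(pattern) != std::string::npos) {
            return false;          // ANY matching substring excludes the weight
        }
    }
    return true;                   // hot, and no pattern matched
}
```

Checking the curated predicate first is one way to guarantee the "cannot promote" rule structurally: the deny loop only ever runs on names that were already hot.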
diff --git a/tts-cpp/test/test_supertonic_graph_rewrites.cpp b/tts-cpp/test/test_supertonic_graph_rewrites.cpp
new file mode 100644
index 00000000000..d7c22670e0f
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_graph_rewrites.cpp
@@ -0,0 +1,255 @@
+// TDD harness for the graph-side optimizations added in the
+// QVAC-18607 audit follow-up (audit findings F3, F8, F11).
+//
+// Each of these findings is a graph rewrite or new cache: the output
+// of the stage must stay bit-exact (or within F32 ULP tolerance) vs
+// the pre-rewrite CPU reference path that ships in
+// `supertonic_*_forward_cpu` /
+// `supertonic_*_trace_*`.  The existing fixture-bound
+// `test-supertonic-{vocoder,duration,vector,pipeline}` harnesses
+// already gate the *production* GGML path against ONNX reference
+// dumps; this harness layers on a finer-grained check that runs the
+// same GGUF through both the GGML path and the scalar-CPU reference
+// inside the same process and asserts they agree.
+//
+//   F3  Vocoder unpack-on-GPU: the host-side `[1, 144, L] →
+//       [144, L*6]` transpose moves into the vocoder graph as
+//       `ggml_permute + ggml_cont`.  Vocoder output must stay
+//       bit-exact vs `supertonic_vocoder_forward_cpu`.
+//
+//   F8  Style residual + LN cached graph: the four per-step
+//       residual-add-then-layer-norm tiny graphs (one per group)
+//       become cached graphs that survive across synth calls.  Pipeline
+//       output must stay bit-exact vs the previous per-call graph
+//       allocation.  This file's check is structural: the cache
+//       allocator survives a second `synthesize` invocation without
+//       rebuilding (no second `gallocr_new` call on the per-style
+//       allocators).
+//
+//   F11 Duration cached graph: same pattern.  Single-synth wall-time
+//       drops on warm-cache invocations; structural check that
+//       `supertonic_duration_forward_ggml` reuses its allocator
+//       across two calls.
+//
+// Fixture test: requires the Supertonic GGUF.
+
+#include "supertonic_internal.h"
+#include "npy.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <random>
+#include <string>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+bool close_enough(float a, float b, float atol = 1e-4f, float rtol = 1e-4f) {
+    return std::fabs(a - b) <= atol + rtol * std::fabs(b);
+}
+
+// Generate a synthetic latent vector with deterministic content so
+// the test is reproducible without requiring an ONNX reference dump.
+std::vector<float> make_synthetic_latent(int latent_channels, int latent_len, uint32_t seed) {
+    std::vector<float> out((size_t) latent_channels * latent_len);
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    for (auto & v : out) v = dist(rng);
+    return out;
+}
+
+// F3 — Vocoder unpack-on-GPU parity.
+//
+// The audit fix moves the input transpose from the host loop into
+// the GGML graph.  Math is a pure permutation, so output should
+// match `supertonic_vocoder_forward_cpu` within F32 ULP (typically
+// bit-exact, since the rest of the vocoder graph is unchanged).
+//
+// Tolerance: 1e-3 absolute + 1e-3 relative; the absolute part matches
+// `test_supertonic_pipeline.cpp`'s end-to-end gate, plenty for a vocoder-only check.
+void test_f3_vocoder_unpack_parity(const supertonic_model & model) {
+    std::fprintf(stderr, "[F3 vocoder unpack parity]\n");
+
+    const int C = model.hparams.latent_channels;
+    const int L = 8;  // small latent_len for the test
+    auto latent = make_synthetic_latent(C, L, 0xDEADBEEF);
+
+    std::string err;
+    std::vector<float> wav_cpu;
+    if (!supertonic_vocoder_forward_cpu(model, latent.data(), L, wav_cpu, &err)) {
+        std::fprintf(stderr, "  SKIP vocoder cpu: %s\n", err.c_str());
+        return;
+    }
+
+    std::vector<float> wav_ggml;
+    if (!supertonic_vocoder_forward_ggml(model, latent.data(), L, wav_ggml, &err)) {
+        std::fprintf(stderr, "  SKIP vocoder ggml: %s\n", err.c_str());
+        return;
+    }
+
+    const size_t n = std::min(wav_cpu.size(), wav_ggml.size());
+    CHECK(n > 0 && wav_cpu.size() == wav_ggml.size());
+
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < n; ++i) {
+        const float a = wav_cpu[i];
+        const float b = wav_ggml[i];
+        max_abs = std::max(max_abs, std::fabs(a - b));
+        if (!close_enough(a, b, /*atol=*/1e-3f, /*rtol=*/1e-3f)) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  vocoder mismatch @ %zu: cpu=%.6g ggml=%.6g\n",
+                             i, a, b);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  L=%d, samples=%zu, max_abs_err=%.3e, bad=%d\n",
+                 L, n, max_abs, bad);
+    CHECK(bad == 0);
+}
+
+// F11 — Duration cached graph parity.
+//
+// Two consecutive `supertonic_duration_forward_ggml` calls with the
+// same shape must produce bit-exact identical output.  Trivially
+// true even today, but the new cache adds the structural guarantee
+// that no allocator/context churn happens on the second call.
+//
+// Pure parity gate: bit-exact equality after cache rebuild + reuse.
+void test_f11_duration_cache_parity(const supertonic_model & model) {
+    std::fprintf(stderr, "[F11 duration cached graph parity]\n");
+
+    // Build a small synthetic text-id sequence + style.
+    std::vector<int64_t> text_ids;
+    for (int i = 1; i <= 16; ++i) text_ids.push_back(i);
+    // Style: pull from any voice the GGUF carries.
+    if (model.voices.empty()) {
+        std::fprintf(stderr, "  SKIP: no voices in model\n");
+        return;
+    }
+    const auto & voice = model.voices.begin()->second;
+    std::vector<float> style_dp((size_t) ggml_nelements(voice.dp));
+    ggml_backend_tensor_get(voice.dp, style_dp.data(), 0, ggml_nbytes(voice.dp));
+
+    std::string err;
+    float dur1 = 0.0f, dur2 = 0.0f;
+    bool ok1 = supertonic_duration_forward_ggml(model, text_ids.data(), (int) text_ids.size(),
+                                                 style_dp.data(), dur1, &err);
+    if (!ok1) {
+        std::fprintf(stderr, "  SKIP duration call 1: %s\n", err.c_str());
+        return;
+    }
+    bool ok2 = supertonic_duration_forward_ggml(model, text_ids.data(), (int) text_ids.size(),
+                                                 style_dp.data(), dur2, &err);
+    if (!ok2) {
+        std::fprintf(stderr, "  SKIP duration call 2: %s\n", err.c_str());
+        return;
+    }
+
+    // Cached re-run must be bit-exact (same graph, same inputs).
+    CHECK(dur1 == dur2);
+    std::fprintf(stderr, "  dur1=%.6g  dur2=%.6g\n", dur1, dur2);
+}
+
+// F8 — Style residual cached graph parity (indirect).
+//
+// Without exposing the per-style-residual cache internals we can't
+// count gallocr_new calls directly, but we can check the pipeline-
+// level invariant: two consecutive `supertonic_vector_step_ggml`
+// calls with identical inputs produce identical outputs.  If the
+// cache rebuild logic accidentally aliased buffers across calls
+// the second call would differ from the first; this catches that.
+void test_f8_style_residual_cache_parity(const supertonic_model & model) {
+    std::fprintf(stderr, "[F8 style residual cached graph parity]\n");
+
+    const int text_len   = 16;
+    const int latent_len = 8;
+    const int Cin        = model.hparams.latent_channels;
+
+    auto latent     = make_synthetic_latent(Cin,  latent_len, 0xCAFEBABE);
+    auto text_emb   = make_synthetic_latent(256,  text_len,   0xBADF00D);
+    std::vector<float> latent_mask((size_t) latent_len, 1.0f);
+
+    if (model.voices.empty()) {
+        std::fprintf(stderr, "  SKIP: no voices in model\n");
+        return;
+    }
+    const auto & voice = model.voices.begin()->second;
+    std::vector<float> style_ttl((size_t) ggml_nelements(voice.ttl));
+    ggml_backend_tensor_get(voice.ttl, style_ttl.data(), 0, ggml_nbytes(voice.ttl));
+
+    std::string err;
+    std::vector<float> next1, next2;
+    if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
+                                     text_emb.data(), text_len,
+                                     style_ttl.data(), latent_mask.data(),
+                                     /*current_step=*/0, /*total_steps=*/5,
+                                     next1, &err)) {
+        std::fprintf(stderr, "  SKIP vector step 1: %s\n", err.c_str());
+        return;
+    }
+    if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
+                                     text_emb.data(), text_len,
+                                     style_ttl.data(), latent_mask.data(),
+                                     /*current_step=*/0, /*total_steps=*/5,
+                                     next2, &err)) {
+        std::fprintf(stderr, "  SKIP vector step 2: %s\n", err.c_str());
+        return;
+    }
+
+    CHECK(next1.size() == next2.size());
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < next1.size(); ++i) {
+        max_abs = std::max(max_abs, std::fabs(next1[i] - next2[i]));
+        if (next1[i] != next2[i]) ++bad;
+    }
+    std::fprintf(stderr,
+                 "  next.size=%zu  max_abs_diff=%.3e  bad=%d\n",
+                 next1.size(), max_abs, bad);
+    CHECK(bad == 0);
+}
+
+} // namespace
+
+int main(int argc, char ** argv) {
+    if (argc < 2) {
+        std::fprintf(stderr, "usage: %s MODEL.gguf\n", argv[0]);
+        return 2;
+    }
+    supertonic_model model;
+    if (!load_supertonic_gguf(argv[1], model)) {
+        std::fprintf(stderr, "failed to load model: %s\n", argv[1]);
+        return 1;
+    }
+
+    test_f3_vocoder_unpack_parity(model);
+    test_f11_duration_cache_parity(model);
+    test_f8_style_residual_cache_parity(model);
+
+    free_supertonic_model(model);
+
+    std::fprintf(stderr,
+                 "test_supertonic_graph_rewrites: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
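Note on the F3 tolerance gate: the comparison loop above combines an absolute and a relative term, with the CPU output treated as the reference. The sketch below restates that rule in isolation. `count_mismatches` is a hypothetical helper, not part of this PR, and it assumes a length mismatch should count as a failure rather than be truncated away.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Same mixed tolerance as the test's close_enough():
// |a - b| <= atol + rtol * |b|, with b treated as the reference value.
inline bool close_enough(float a, float b, float atol, float rtol) {
    return std::fabs(a - b) <= atol + rtol * std::fabs(b);
}

// Hypothetical helper (assumption, not in the PR): counts elements that
// fall outside the tolerance.  Returns -1 when the lengths differ so a
// short output cannot pass by accident.
inline int count_mismatches(const std::vector<float> & got,
                            const std::vector<float> & ref,
                            float atol, float rtol) {
    if (got.size() != ref.size()) return -1;
    int bad = 0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (!close_enough(got[i], ref[i], atol, rtol)) ++bad;
    }
    return bad;
}
```

With `atol = rtol = 1e-3`, the relative term dominates for large samples and the absolute term dominates near zero, which is the usual reason for combining the two.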
diff --git a/tts-cpp/test/test_supertonic_graph_to_graph_blit.cpp b/tts-cpp/test/test_supertonic_graph_to_graph_blit.cpp
new file mode 100644
index 00000000000..4b4b1767281
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_graph_to_graph_blit.cpp
@@ -0,0 +1,298 @@
+// TDD harness for audit follow-up #6 (2C-lite) — graph-to-graph
+// tensor blits via `ggml_backend_tensor_copy`.
+//
+// Background
+// ----------
+// After F23 landed, the vector-estimator group graph emits post-
+// RoPE Q/K (`<q_name>_rope`, `<k_name>_rope`) and raw V on the GPU.
+// The next stage (`run_text_attention_cache`) consumes those three
+// tensors but lives in its OWN GGML context with its own gallocr.
+// The bridge between the two graphs is currently:
+//
+//   tensor_to_time_channel(group_gf.q_rope)        // GPU → host
+//   ggml_backend_tensor_set(att_cache.q_tc_in, …)  // host → GPU
+//
+// per Q / K / V per attention site (3 tensors × 4 sites × 5 denoise
+// steps = 60 round-trips per synth on the production path).  Each
+// round-trip is one synchronous read + one upload, so 6 sync points
+// per attention site per step, or 120 sync points / synth across
+// the four fused-attention sites.
+//
+// 2C-lite replaces those two operations with a single
+// `ggml_backend_tensor_copy(src_tensor_in_graph_A,
+//  dst_tensor_in_graph_B)` call.  Same backend on both ends, so
+// the copy is a pure device-to-device blit (or a tight memcpy on
+// the CPU backend) and the host never touches the buffer.
+//
+// Test contract
+// -------------
+// 1. Build two MINIMAL cached graphs that share a single
+//    ggml_backend instance:
+//      A: x_in → out_A = x_in * 2   (the "producer" graph;
+//                                   mirrors the group graph
+//                                   producing q_rope)
+//      B: y_in → out_B = y_in - 1   (the "consumer" graph;
+//                                   mirrors the attention graph
+//                                   consuming q_tc_in)
+//    Each graph has its OWN ggml_context + gallocr (mirrors the
+//    `vector_group_graph_cache` / `vector_text_attention_cache`
+//    split exactly).
+//
+// 2. Reference path (the code we're replacing):
+//      compute(A) → ggml_backend_tensor_get(out_A, host_buf)
+//                 → ggml_backend_tensor_set(y_in, host_buf)
+//                 → compute(B) → read out_B.
+//
+// 3. Fused path (the code we're adding):
+//      compute(A) → ggml_backend_tensor_copy(out_A, y_in)
+//                 → compute(B) → read out_B.
+//
+// 4. Both must produce bit-exact identical out_B.  The copy is a
+//    pure memory rearrangement, no arithmetic, so any difference
+//    indicates a backend bug we MUST not paper over with a
+//    tolerance.
+//
+// Shapes covered
+// --------------
+// - `vector_group_graph_cache` post-RoPE Q at L=20, C=256
+//   (q_len=20, n_heads=4, head_dim=64).
+// - The same site at L=1 (trip-wire for stride / shape bugs at
+//   the smallest sensible input).
+// - The style-attention site at L=20, kv_len=50, n_heads=2,
+//   head_dim=128 (the ne[0]*ne[1] product changes between the
+//   two attention shapes; this catches dimension-mismatched
+//   tensor_copy bugs).
+//
+// Mirrors the structure of the other audit follow-up unit tests
+// in this directory (no GGUF, no fixture, no model file).
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdint>
+#include <cstdio>
+#include <random>
+#include <vector>
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Single-backend two-graph harness — built once per shape.  The
+// producer / consumer split mirrors the cache-per-stage pattern
+// used throughout supertonic_vector_estimator.cpp.
+struct two_graph_harness {
+    ggml_backend_t backend = nullptr;
+
+    // Producer graph: emits out_A = x_in * 2.
+    std::vector<uint8_t> buf_a;
+    ggml_context * ctx_a = nullptr;
+    ggml_cgraph *  gf_a  = nullptr;
+    ggml_gallocr_t alloc_a = nullptr;
+    ggml_tensor *  x_in  = nullptr;
+    ggml_tensor *  out_a = nullptr;
+
+    // Consumer graph: emits out_B = y_in - 1.
+    std::vector<uint8_t> buf_b;
+    ggml_context * ctx_b = nullptr;
+    ggml_cgraph *  gf_b  = nullptr;
+    ggml_gallocr_t alloc_b = nullptr;
+    ggml_tensor *  y_in  = nullptr;
+    ggml_tensor *  out_b = nullptr;
+};
+
+void destroy_harness(two_graph_harness & h) {
+    if (h.alloc_a) ggml_gallocr_free(h.alloc_a);
+    if (h.alloc_b) ggml_gallocr_free(h.alloc_b);
+    if (h.ctx_a)   ggml_free(h.ctx_a);
+    if (h.ctx_b)   ggml_free(h.ctx_b);
+    if (h.backend) ggml_backend_free(h.backend);
+    h = {};
+}
+
+bool build_harness(two_graph_harness & h, int ne0, int ne1) {
+    h.backend = ggml_backend_cpu_init();
+    if (!h.backend) return false;
+
+    constexpr int NODES = 16;
+    const size_t buf_sz = ggml_tensor_overhead() * NODES + ggml_graph_overhead();
+
+    // Producer.  ne=[ne0, ne1] matches the post-RoPE Q layout
+    // (`[width=n_heads*head_dim, q_len]`).
+    h.buf_a.assign(buf_sz, 0);
+    ggml_init_params pa = { buf_sz, h.buf_a.data(), /*no_alloc=*/true };
+    h.ctx_a = ggml_init(pa);
+    h.gf_a  = ggml_new_graph(h.ctx_a);
+    h.x_in  = ggml_new_tensor_2d(h.ctx_a, GGML_TYPE_F32, ne0, ne1);
+    ggml_set_name(h.x_in, "x_in"); ggml_set_input(h.x_in);
+    h.out_a = ggml_scale(h.ctx_a, h.x_in, 2.0f);
+    ggml_set_name(h.out_a, "out_a"); ggml_set_output(h.out_a);
+    ggml_build_forward_expand(h.gf_a, h.out_a);
+    h.alloc_a = ggml_gallocr_new(ggml_backend_get_default_buffer_type(h.backend));
+    if (!h.alloc_a || !ggml_gallocr_reserve(h.alloc_a, h.gf_a)) return false;
+    ggml_gallocr_alloc_graph(h.alloc_a, h.gf_a);
+
+    // Consumer — same shape, MUST live in a different context.
+    h.buf_b.assign(buf_sz, 0);
+    ggml_init_params pb = { buf_sz, h.buf_b.data(), /*no_alloc=*/true };
+    h.ctx_b = ggml_init(pb);
+    h.gf_b  = ggml_new_graph(h.ctx_b);
+    h.y_in  = ggml_new_tensor_2d(h.ctx_b, GGML_TYPE_F32, ne0, ne1);
+    ggml_set_name(h.y_in, "y_in"); ggml_set_input(h.y_in);
+    // out_B = y_in - 1.  `ggml_add` of a constant scalar needs
+    // a tensor, so reuse the cleaner `ggml_scale + offset` form:
+    // y - 1 == y * 1 + (-1).  Single op, no branching.
+    h.out_b = ggml_scale_bias(h.ctx_b, h.y_in, 1.0f, -1.0f);
+    ggml_set_name(h.out_b, "out_b"); ggml_set_output(h.out_b);
+    ggml_build_forward_expand(h.gf_b, h.out_b);
+    h.alloc_b = ggml_gallocr_new(ggml_backend_get_default_buffer_type(h.backend));
+    if (!h.alloc_b || !ggml_gallocr_reserve(h.alloc_b, h.gf_b)) return false;
+    ggml_gallocr_alloc_graph(h.alloc_b, h.gf_b);
+    return true;
+}
+
+// Reference bridge: download out_A from graph A, upload into y_in
+// of graph B.  This is the byte-for-byte equivalent of the
+// pre-2C code path:
+//
+//   tensor_to_time_channel(group_gf.q_rope)
+//   ggml_backend_tensor_set(att_cache.q_tc_in, …)
+std::vector<float> run_reference(two_graph_harness & h,
+                                 const std::vector<float> & x) {
+    ggml_backend_tensor_set(h.x_in, x.data(), 0, x.size() * sizeof(float));
+    ggml_backend_graph_compute(h.backend, h.gf_a);
+
+    std::vector<float> host_buf((size_t) ggml_nelements(h.out_a));
+    ggml_backend_tensor_get(h.out_a, host_buf.data(), 0,
+                            host_buf.size() * sizeof(float));
+    ggml_backend_tensor_set(h.y_in, host_buf.data(), 0,
+                            host_buf.size() * sizeof(float));
+    ggml_backend_graph_compute(h.backend, h.gf_b);
+
+    std::vector<float> out((size_t) ggml_nelements(h.out_b));
+    ggml_backend_tensor_get(h.out_b, out.data(), 0, out.size() * sizeof(float));
+    return out;
+}
+
+// Fused bridge: direct GPU→GPU blit via `ggml_backend_tensor_copy`.
+// Host never sees the intermediate buffer — this is the 2C-lite
+// fast path we want call sites to use.
+std::vector<float> run_fused(two_graph_harness & h,
+                             const std::vector<float> & x) {
+    ggml_backend_tensor_set(h.x_in, x.data(), 0, x.size() * sizeof(float));
+    ggml_backend_graph_compute(h.backend, h.gf_a);
+
+    // Single-call replacement for the host round-trip pair.
+    // For same-backend src+dst this is a memcpy on the CPU
+    // backend and a `clEnqueueCopyBuffer` on OpenCL.
+    ggml_backend_tensor_copy(h.out_a, h.y_in);
+
+    ggml_backend_graph_compute(h.backend, h.gf_b);
+
+    std::vector<float> out((size_t) ggml_nelements(h.out_b));
+    ggml_backend_tensor_get(h.out_b, out.data(), 0, out.size() * sizeof(float));
+    return out;
+}
+
+void test_shape(const char * label, int ne0, int ne1, unsigned seed) {
+    std::fprintf(stderr, "[graph_to_graph_blit: %s] ne0=%d ne1=%d\n",
+                 label, ne0, ne1);
+
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    std::vector<float> x((size_t) ne0 * ne1);
+    for (auto & v : x) v = dist(rng);
+
+    two_graph_harness ref_h{};
+    if (!build_harness(ref_h, ne0, ne1)) {
+        std::fprintf(stderr, "  SKIP: harness build failed (ref)\n");
+        destroy_harness(ref_h);
+        return;
+    }
+    std::vector<float> ref = run_reference(ref_h, x);
+    destroy_harness(ref_h);
+
+    two_graph_harness fused_h{};
+    if (!build_harness(fused_h, ne0, ne1)) {
+        std::fprintf(stderr, "  SKIP: harness build failed (fused)\n");
+        destroy_harness(fused_h);
+        return;
+    }
+    std::vector<float> got = run_fused(fused_h, x);
+    destroy_harness(fused_h);
+
+    CHECK(got.size() == ref.size());
+
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > 0.0f) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  mismatch @ %zu: ref=%.6g got=%.6g abs=%.3e\n",
+                             i, ref[i], got[i], d);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr, "  %s max_abs=%.3e bad=%d\n", label, max_abs, bad);
+    CHECK(bad == 0);
+    CHECK(max_abs == 0.0f);
+}
+
+}  // namespace
+
+int main() {
+    test_shape("attn0_q_rope_L20",     256,  20, 0xA11A1u);   // 4h × 64d  @ L=20
+                                                              // Also covers front-block attn0
+                                                              // Q post-RoPE tensor (round 8 GPU
+                                                              // bridge consumer).
+    test_shape("attn0_q_rope_L1",      256,   1, 0xA11A2u);   // L=1 trip-wire
+    // QVAC-18605 round 8 — front-block attn0 K / V shape
+    // (width=256, kv_len=text_len).  Same layout as the round-1
+    // group attentions but different ne1 dimension.  Locks in the
+    // blit primitive for the K / V handles the front-block GPU
+    // bridge passes to `run_text_attention_cache_gpu`.
+    test_shape("attn0_kv_text_len32",  256,  32, 0xA11A4u);   // front-block K / V @ text_len=32
+    test_shape("attn0_kv_text_len50",  256,  50, 0xA11A5u);   // front-block K / V @ text_len=50
+
+    // QVAC-18605 round 9 — style flash-attn K / V / Q shapes for
+    // the 4 res-style sites (style0 + g1_style + g2_style +
+    // g3_style).  Style attention runs at n_heads=2, head_dim=128
+    // (vs n_heads=4, head_dim=64 for the text attentions above)
+    // — but the underlying flat ne layout is `[width=256, *_len]`
+    // either way (2 × 128 == 4 × 64 == 256), so the byte-count-
+    // matching contract `ggml_backend_tensor_copy` checks
+    // internally is identical to round 8.  The Q (sq) is
+    // `[256, L=20]`; the K / V (sk / sv) are `[256, 50]` (the
+    // style ttl is fixed at 50 tokens regardless of the input
+    // text length).  These shapes are already covered by
+    // `style0_q_rope_L20` + `style0_k_rope_kv50` below — round 9
+    // adds the explicit doc-comment + a Q at L=1 for the same
+    // trip-wire reason as round 8's `attn0_q_rope_L1`.
+    test_shape("style_sq_L1",          256,   1, 0xA11A6u);   // L=1 trip-wire for style Q
+    test_shape("style0_q_rope_L20",    256,  20, 0xA11A3u);   // 2h × 128d @ L=20  ← style sq
+    test_shape("attn0_k_rope_kv20",    256,  20, 0xA11A4u);   // K side
+    test_shape("style0_k_rope_kv50",   256,  50, 0xA11A5u);   // K side, style kv_len
+
+    std::fprintf(stderr,
+                 "test_supertonic_graph_to_graph_blit: %d / %d checks passed\n",
+                 (g_checks - g_failures), g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
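Note on the round-trip counts quoted in the header comment: the figures follow from three tensors (Q, K, V) per site, four sites, and five denoise steps. The sketch below only restates that arithmetic. The `bridge_cost` struct and its field names are illustrative assumptions, not types from the PR.

```cpp
#include <cassert>

// Illustrative cost model (assumption): the counts come from the header
// comment of test_supertonic_graph_to_graph_blit.cpp.
struct bridge_cost {
    int tensors_per_site;  // Q, K, V
    int attention_sites;   // fused-attention sites per denoise step
    int denoise_steps;     // steps per synth on the production path

    // One bridge crossing per tensor, per site, per step.
    constexpr int round_trips() const {
        return tensors_per_site * attention_sites * denoise_steps;
    }
    // Host path: one download plus one upload per crossing.
    constexpr int host_sync_points() const { return 2 * round_trips(); }
    // Blit path: one ggml_backend_tensor_copy call per crossing.
    constexpr int blit_calls() const { return round_trips(); }
};

constexpr bridge_cost production{3, 4, 5};
static_assert(production.round_trips() == 60, "3 x 4 x 5 crossings");
static_assert(production.host_sync_points() == 120, "two syncs per crossing");
```

The blit path still makes 60 calls per synth; what it removes is the host staging buffer and the paired download and upload for each crossing.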
diff --git a/tts-cpp/test/test_supertonic_in_graph_transpose.cpp b/tts-cpp/test/test_supertonic_in_graph_transpose.cpp
new file mode 100644
index 00000000000..3d0cdef9dce
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_in_graph_transpose.cpp
@@ -0,0 +1,246 @@
+// TDD harness for audit follow-up #6 (F12) — in-graph transpose
+// helper for the vector / text / duration estimator graph caches.
+//
+// Background
+// ----------
+// Every `run_*_cache` site in supertonic_vector_estimator.cpp
+// (and a few mirror sites in the text encoder / duration / vocoder
+// caches) carries a host-side `pack_time_channel_for_ggml(x_tc,
+// L, C)` loop that transposes CPU-native time-major data
+// (`x_tc[t*C + c]`) into the channel-major layout GGML stores
+// `ne=[L, C]` tensors in (`buf[c*L + t]`).  Audit finding F12 —
+// these add up to "dozens of small CPU transposes" per synth +
+// they serialise the host-side dispatch on the GPU path.
+//
+// `transpose_time_channel_ggml(ctx, x_tc_input)` is the audit's
+// recommended fix.  The cache exposes the raw upload buffer as a
+// GGML tensor with `ne=[C, L]` (channels on axis 0, time on
+// axis 1) so the caller can upload `x_tc` BYTE-FOR-BYTE without
+// any CPU transpose, then the graph immediately does
+// `ggml_cont(ctx, ggml_transpose(ctx, x_tc_in))` to recover the
+// `[L, C]` layout the rest of the graph builders expect.  Net
+// effect: one CPU O(L*C) loop replaced by one device-side
+// `ggml_cont` over the same `L*C` elements.  On a GPU this is
+// expected to be much cheaper, and it is scheduled as part of
+// the graph instead of serialising host-side dispatch.
+//
+// Test contract
+// -------------
+// Build a small synthetic time-channel buffer `x_tc` and verify
+// the in-graph transpose helper produces the exact same memory
+// layout the existing `pack_time_channel_for_ggml` host loop
+// produces, then read back the resulting `[L, C]` tensor and
+// confirm element-by-element parity (bit-exact — transpose+cont
+// is a pure memory rearrangement, no arithmetic).
+//
+// Parity shapes (main() also runs a vocoder-like shape and L=1):
+//   1. `vector_group_graph_cache`'s hot path: L=20, C=256.
+//   2. `vector_tail_graph_cache`'s noise input: L=20, Cin=24.
+//
+// Registered with `LABEL "unit"` — no GGUF required.  Mirrors the
+// pattern used by `test_supertonic_rope_packed_qk.cpp`.
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdint>
+#include <cstdio>
+#include <random>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Reference CPU pack — bit-identical to
+// `pack_time_channel_for_ggml` in supertonic_vector_estimator.cpp.
+// Converts CPU-native time-major `x[t*C + c]` to GGML's
+// column-major (channel-slow) storage `out[c*L + t]`.  This is
+// the buffer the existing call sites upload directly into a
+// `ne=[L, C]` cache input.
+std::vector<float> pack_time_channel_reference(const std::vector<float> & x,
+                                               int L, int C) {
+    std::vector<float> out((size_t) L * C);
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < C; ++c) {
+            out[(size_t) c * L + t] = x[(size_t) t * C + c];
+        }
+    }
+    return out;
+}
+
+void test_transpose_shape(const char * label, int L, int C, unsigned seed) {
+    std::fprintf(stderr, "[transpose_time_channel: %s] L=%d C=%d\n",
+                 label, L, C);
+
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    std::vector<float> x_tc((size_t) L * C);
+    for (auto & v : x_tc) v = dist(rng);
+
+    std::vector<float> ref = pack_time_channel_reference(x_tc, L, C);
+
+    constexpr int MAX_NODES = 64;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    // `x_tc_in`: ne=[C, L].  Caller uploads CPU-native `x_tc`
+    // as-is (no CPU pack).  GGML interprets float index `i` (byte
+    // offset `4*i`) as element (c=i%C, l=i/C), which matches
+    // x_tc's `x[t*C + c]` layout (the element x_tc[t*C+c] lands at
+    // GGML logical (c=c, l=t)).
+    ggml_tensor * x_tc_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, C, L);
+    ggml_set_name(x_tc_in, "x_tc_in"); ggml_set_input(x_tc_in);
+
+    // The fix: transpose to ne=[L, C] then cont to materialise the
+    // natural-stride layout.  After the cont, memory at index
+    // `l + c*L` carries the value at original logical (l, c), which
+    // is element x_tc[l*C + c] — the exact same byte sequence as
+    // `pack_time_channel_reference(x_tc, L, C)` writes.
+    ggml_tensor * x_lc = transpose_time_channel_ggml(ctx, x_tc_in);
+    ggml_set_name(x_lc, "x_lc"); ggml_set_output(x_lc);
+    ggml_build_forward_expand(gf, x_lc);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "  SKIP: ggml_backend_cpu_init failed\n");
+        ggml_free(ctx);
+        return;
+    }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    if (!ggml_gallocr_reserve(allocr, gf)) {
+        std::fprintf(stderr, "  SKIP: gallocr_reserve failed\n");
+        ggml_gallocr_free(allocr);
+        ggml_free(ctx);
+        ggml_backend_free(cpu);
+        return;
+    }
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    // Upload `x_tc` directly: no CPU-side pack or transpose first.
+    ggml_backend_tensor_set(x_tc_in, x_tc.data(), 0, x_tc.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+
+    std::vector<float> got((size_t) ggml_nelements(x_lc));
+    ggml_backend_tensor_get(x_lc, got.data(), 0, got.size() * sizeof(float));
+
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    CHECK(got.size() == ref.size());
+
+    // Bit-exact comparison — transpose+cont is a pure memory
+    // rearrangement, no arithmetic.  Any mismatch indicates a
+    // stride / shape bug, not a floating-point rounding issue.
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > 0.0f) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  mismatch @ %zu: ref=%.6g got=%.6g abs=%.3e\n",
+                             i, ref[i], got[i], d);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr, "  max_abs_err=%.3e  bad=%d / %zu\n",
+                 max_abs, bad, ref.size());
+    CHECK(bad == 0);
+}
+
+// Trip-wire: ne[1] = 1 (single-time-step) is the degenerate shape
+// that the front-block / duration caches build for inference-time
+// `latent_len = 1` smoke harnesses.  Catches strides that assume
+// `L > 1`.
+void test_transpose_l1() {
+    std::fprintf(stderr, "[transpose_time_channel: L=1 degenerate]\n");
+    const int L = 1, C = 8;
+    std::vector<float> x_tc((size_t) L * C);
+    for (int i = 0; i < (int) x_tc.size(); ++i) x_tc[i] = (float) i + 0.5f;
+
+    std::vector<float> ref = pack_time_channel_reference(x_tc, L, C);
+
+    constexpr int MAX_NODES = 32;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    ggml_tensor * x_tc_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, C, L);
+    ggml_set_input(x_tc_in);
+    ggml_tensor * x_lc = transpose_time_channel_ggml(ctx, x_tc_in);
+    ggml_set_output(x_lc);
+    ggml_build_forward_expand(gf, x_lc);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) { ggml_free(ctx); std::fprintf(stderr, "  SKIP\n"); return; }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
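+    if (!ggml_gallocr_reserve(allocr, gf)) { ggml_gallocr_free(allocr); ggml_free(ctx); ggml_backend_free(cpu); std::fprintf(stderr, "  SKIP\n"); return; }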
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    ggml_backend_tensor_set(x_tc_in, x_tc.data(), 0, x_tc.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+
+    std::vector<float> got((size_t) ggml_nelements(x_lc));
+    ggml_backend_tensor_get(x_lc, got.data(), 0, got.size() * sizeof(float));
+
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    int bad = 0;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        if (ref[i] != got[i]) ++bad;
+    }
+    std::fprintf(stderr, "  L=1 bad=%d\n", bad);
+    CHECK(bad == 0);
+
+    // Output ne shape must be [L, C] — the layout downstream
+    // graph builders expect.
+    CHECK(x_lc->ne[0] == L);
+    CHECK(x_lc->ne[1] == C);
+}
+
+} // namespace
+
+int main() {
+    // Vector-estimator group-graph hot shape (audit example).
+    test_transpose_shape("group_graph L=20 C=256", 20, 256, 0xC0DE);
+    // Tail-graph noise shape (Cin=24 < L typical).
+    test_transpose_shape("tail noise   L=20 C=24",  20,  24, 0xBEEF);
+    // Vocoder-like shape (T0=420, C=64): exercises a long time
+    // axis (L > C) to catch a stride wraparound bug.
+    test_transpose_shape("vocoder      T0=420 C=64", 420, 64, 0x73B1);
+    test_transpose_l1();
+
+    std::fprintf(stderr,
+                 "test_supertonic_in_graph_transpose: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
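Note on the layout argument in the transpose test: the parity claim is that uploading time-major `x_tc` into an `ne=[C, L]` tensor and then doing transpose + cont yields the same floats as the host pack loop. The sketch below models that index mapping in plain C++ without GGML. `transpose_cont_model` is an illustrative stand-in written for this note, not the PR's `transpose_time_channel_ggml`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host pack loop: time-major x[t*C + c] -> channel-major out[c*L + t].
inline std::vector<float> pack_time_channel(const std::vector<float> & x,
                                            int L, int C) {
    std::vector<float> out((std::size_t) L * C);
    for (int t = 0; t < L; ++t) {
        for (int c = 0; c < C; ++c) {
            out[(std::size_t) c * L + t] = x[(std::size_t) t * C + c];
        }
    }
    return out;
}

// Plain-C++ model (assumption) of "upload as ne=[C, L], then transpose
// and make contiguous": the source is read at logical (c, t) with fast
// axis C, and the destination is written contiguously with fast axis L.
inline std::vector<float> transpose_cont_model(const std::vector<float> & src,
                                               int L, int C) {
    std::vector<float> dst((std::size_t) L * C);
    for (std::size_t i = 0; i < dst.size(); ++i) {
        const std::size_t t = i % (std::size_t) L;  // fast axis of dst
        const std::size_t c = i / (std::size_t) L;  // slow axis of dst
        dst[i] = src[c + t * (std::size_t) C];      // src element (c, t)
    }
    return dst;
}
```

Both functions write `x[t*C + c]` to offset `c*L + t`, which is why the test can demand bit-exact equality rather than a tolerance.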
diff --git a/tts-cpp/test/test_supertonic_input_scratchpad.cpp b/tts-cpp/test/test_supertonic_input_scratchpad.cpp
new file mode 100644
index 00000000000..2f7a281bbb7
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_input_scratchpad.cpp
@@ -0,0 +1,337 @@
+// QVAC-18605 round 13 #1 — CPU-only TDD test for the
+// `alloc_input_scratchpad_or_throw` helper.
+//
+// Background
+// ----------
+// Round 12 #5 shipped `try_alloc_inputs_in_pinned_host_buffer` and
+// applied it via a dual-context allocation pattern at 4 cache
+// sites (front-block + 3 group caches).  Each application
+// repeats the same boilerplate:
+//
+//     cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(
+//         model, cache.input_ctx);
+//     if (!cache.input_buf) {
+//         cache.input_buf = ggml_backend_alloc_ctx_tensors(
+//             cache.input_ctx, model.backend);
+//         if (!cache.input_buf) {
+//             // teardown + throw
+//         }
+//     }
+//
+// Round 13 #1 needs to extend this to several more caches (the
+// unrolled CFM loop's `vector_loop_one_graph_cache`, the
+// vocoder cache, the style residual + QKV caches, and the
+// merged speech-prompted cache).  Rather than 5x copy-paste,
+// factor the fallback pattern out:
+//
+//     ggml_backend_buffer_t alloc_input_scratchpad_or_throw(
+//         const supertonic_model & model,
+//         ggml_context * input_ctx,
+//         const char * cache_name);
+//
+// Contract:
+//   - Tries `try_alloc_inputs_in_pinned_host_buffer(model, ctx)`
+//     first.  Returns on success.
+//   - On failure (CPU / non-Vulkan / probe miss), falls back to
+//     `ggml_backend_alloc_ctx_tensors(ctx, model.backend)`.
+//     Returns on success.
+//   - On BOTH failing (system resource exhaustion, dead
+//     backend), throws `std::runtime_error` with a message
+//     that includes `cache_name` so operators can attribute
+//     the failure.
+//   - Defensive: null `model.backend` / null `input_ctx` / null
+//     `cache_name` cases all throw rather than crash.
+//
+// What this test pins (CPU-only)
+// ------------------------------
+// 1. Helper symbol exists with the documented signature
+//    (compile-time SFINAE).
+// 2. On a CPU backend (no Vulkan host buffer), helper falls
+//    through to `ggml_backend_alloc_ctx_tensors` and returns a
+//    valid buffer.  The returned buffer holds the input ctx's
+//    tensors bound to addressable memory (ggml_backend_tensor_set
+//    + ggml_backend_tensor_get round-trips correctly).
+// 3. Defensive throws on null model.backend / null input_ctx /
+//    null cache_name.
+// 4. Caller owns the returned buffer; the success path frees it
+//    exactly once with a paired `ggml_backend_buffer_free`.
+//
+// Registered with `LABEL "unit"` — no GGUF required.
+
+#include "ggml.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <stdexcept>
+#include <string>
+#include <type_traits>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+template <typename F>
+bool throws_runtime_error(F && fn) {
+    try {
+        fn();
+        return false;
+    } catch (const std::runtime_error &) {
+        return true;
+    } catch (...) {
+        return false;
+    }
+}
+
+// Compile-time signature check: a missing or mismatched helper typically fails the build outright.
+template <typename = void>
+auto has_alloc_scratchpad(int)
+    -> decltype(alloc_input_scratchpad_or_throw(
+        std::declval<const supertonic_model &>(),
+        std::declval<ggml_context *>(),
+        std::declval<const char *>()),
+        std::true_type{});
+template <typename = void>
+auto has_alloc_scratchpad(...) -> std::false_type;
+
+void test_helper_symbol_exists() {
+    std::fprintf(stderr, "[Round 13 #1: alloc_input_scratchpad_or_throw symbol]\n");
+    static_assert(
+        decltype(has_alloc_scratchpad<>(0))::value,
+        "alloc_input_scratchpad_or_throw must exist with the documented signature");
+    ++g_checks;
+}
+
+supertonic_model make_cpu_model() {
+    supertonic_model m;
+    m.backend = ggml_backend_cpu_init();
+    return m;
+}
+
+void free_cpu_model(supertonic_model & m) {
+    if (m.backend) ggml_backend_free(m.backend);
+    m = {};
+}
+
+// On CPU backend the pinned-host path returns null; helper MUST
+// fall through to `ggml_backend_alloc_ctx_tensors` and produce a
+// valid buffer.  Round-trip a test tensor through the buffer to
+// confirm the binding actually works (not just non-null).
+void test_cpu_fallback_returns_valid_buffer() {
+    std::fprintf(stderr, "[Round 13 #1: CPU backend falls through to default-backend alloc]\n");
+    supertonic_model model = make_cpu_model();
+    CHECK(model.backend != nullptr);
+
+    const size_t buf_size = ggml_tensor_overhead() * 16;
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+
+    // Synthetic per-step inputs (mimicking the vector_loop one-
+    // graph cache layout: a couple of float tensors).
+    ggml_tensor * x_in    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 4);  // 512 B
+    ggml_tensor * temb_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);     // 256 B
+
+    ggml_backend_buffer_t scratchpad =
+        alloc_input_scratchpad_or_throw(model, ctx, "test_cpu_fallback");
+    CHECK(scratchpad != nullptr);
+    if (scratchpad) {
+        // Confirm EVERY tensor in the context was actually bound
+        // to addressable memory.
+        //
+        // PR #18 reviewer (Omar) follow-up: the original test
+        // only round-tripped `x_in`, so a binding failure on the
+        // SECOND tensor (the helper has to allocate every
+        // ggml_tensor in the input_ctx, not just the first one)
+        // would have slipped through.  Round-tripping BOTH
+        // `x_in` and `temb_in` exercises the entire context's
+        // allocation path.
+        //
+        // x_in: ne[0]=32, ne[1]=4 → 128 F32 elements.
+        const size_t x_n = (size_t) x_in->ne[0] * (size_t) x_in->ne[1];
+        std::vector<float> x_payload(x_n, 1.0f);
+        ggml_backend_tensor_set(x_in, x_payload.data(),
+                                 0, x_payload.size() * sizeof(float));
+        std::vector<float> x_readback(x_n, 0.0f);
+        ggml_backend_tensor_get(x_in, x_readback.data(),
+                                 0, x_readback.size() * sizeof(float));
+        bool x_ok = true;
+        for (size_t i = 0; i < x_payload.size(); ++i) {
+            if (x_readback[i] != x_payload[i]) { x_ok = false; break; }
+        }
+        CHECK(x_ok);
+
+        // temb_in: ne[0]=64 → 64 F32 elements.  Distinct payload
+        // pattern (2.5f) so a binding-collision bug where both
+        // tensors point at the SAME memory range fails the
+        // cross-aliasing check below (x_recheck would read 2.5f).
+        const size_t t_n = (size_t) temb_in->ne[0];
+        std::vector<float> t_payload(t_n, 2.5f);
+        ggml_backend_tensor_set(temb_in, t_payload.data(),
+                                 0, t_payload.size() * sizeof(float));
+        std::vector<float> t_readback(t_n, 0.0f);
+        ggml_backend_tensor_get(temb_in, t_readback.data(),
+                                 0, t_readback.size() * sizeof(float));
+        bool t_ok = true;
+        for (size_t i = 0; i < t_payload.size(); ++i) {
+            if (t_readback[i] != t_payload[i]) { t_ok = false; break; }
+        }
+        CHECK(t_ok);
+
+        // Cross-aliasing check: after writing 2.5 to temb_in,
+        // x_in must still read back 1.0 (no overlap between the
+        // two tensors' buffer ranges).
+        std::vector<float> x_recheck(x_n, 0.0f);
+        ggml_backend_tensor_get(x_in, x_recheck.data(),
+                                 0, x_recheck.size() * sizeof(float));
+        bool no_overlap = true;
+        for (size_t i = 0; i < x_payload.size(); ++i) {
+            if (x_recheck[i] != x_payload[i]) { no_overlap = false; break; }
+        }
+        CHECK(no_overlap);
+
+        ggml_backend_buffer_free(scratchpad);
+    }
+    ggml_free(ctx);
+    free_cpu_model(model);
+}
+
+// Empty input_ctx (no tensors) is an edge case — a caller
+// shouldn't ever invoke the helper with no inputs to allocate
+// (it's a caller bug), but the helper's failure mode on this
+// input should be "loud throw with the cache_name in the
+// message" so debuggers can identify the misbehaving caller.
+//
+// Background: `ggml_backend_alloc_ctx_tensors` returns null for
+// an empty ctx (no tensors → zero-sized buffer is treated as
+// failure on most backends).  Combined with
+// `try_alloc_inputs_in_pinned_host_buffer` returning null on CPU,
+// both paths fail and the helper throws.  That's the desired
+// contract: caller-bug guards in error paths > silent success.
+void test_empty_ctx_throws_loud_with_name() {
+    std::fprintf(stderr, "[Round 13 #1: empty input_ctx throws with cache_name]\n");
+    supertonic_model model = make_cpu_model();
+    const size_t buf_size = ggml_tensor_overhead() * 8;
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), true };
+    ggml_context * ctx = ggml_init(p);
+    bool threw_with_name = false;
+    try {
+        (void) alloc_input_scratchpad_or_throw(model, ctx, "empty_ctx_test");
+    } catch (const std::runtime_error & e) {
+        const std::string what = e.what();
+        threw_with_name = (what.find("empty_ctx_test") != std::string::npos);
+    } catch (...) {
+        // wrong exception type — caught + reported as a CHECK failure below.
+    }
+    CHECK(threw_with_name);
+    ggml_free(ctx);
+    free_cpu_model(model);
+}
+
+// Defensive throws — null model.backend, null input_ctx, null
+// cache_name.  Each must produce a `std::runtime_error` with a
+// message that mentions the failing condition.  These are
+// caller-bug guards in error-handler paths.
+void test_null_arguments_throw() {
+    std::fprintf(stderr, "[Round 13 #1: null arguments throw runtime_error]\n");
+
+    // Null model.backend.
+    {
+        supertonic_model model;  // backend = nullptr by default
+        const size_t buf_size = ggml_tensor_overhead() * 4;
+        std::vector<uint8_t> buf(buf_size);
+        ggml_init_params p = { buf_size, buf.data(), true };
+        ggml_context * ctx = ggml_init(p);
+        CHECK(throws_runtime_error([&] {
+            (void) alloc_input_scratchpad_or_throw(model, ctx, "null_backend");
+        }));
+        ggml_free(ctx);
+    }
+
+    // Null input_ctx.
+    {
+        supertonic_model model = make_cpu_model();
+        CHECK(throws_runtime_error([&] {
+            (void) alloc_input_scratchpad_or_throw(model, nullptr, "null_ctx");
+        }));
+        free_cpu_model(model);
+    }
+
+    // Null cache_name — keep the error message useful; throw
+    // rather than dereference a null format-string later.
+    {
+        supertonic_model model = make_cpu_model();
+        const size_t buf_size = ggml_tensor_overhead() * 4;
+        std::vector<uint8_t> buf(buf_size);
+        ggml_init_params p = { buf_size, buf.data(), true };
+        ggml_context * ctx = ggml_init(p);
+        CHECK(throws_runtime_error([&] {
+            (void) alloc_input_scratchpad_or_throw(model, ctx, nullptr);
+        }));
+        ggml_free(ctx);
+        free_cpu_model(model);
+    }
+}
+
+// Idempotency — calling the helper twice on the same input
+// ctx is a caller bug (only one buffer should ever back the
+// inputs) but must not crash.  ggml's
+// `ggml_backend_alloc_ctx_tensors` re-allocates the same
+// tensors, leaking the first buffer; the contract is the
+// caller frees the first.  Test the second call returns a
+// distinct (or null) buffer without crashing.
+void test_repeated_calls_safe() {
+    std::fprintf(stderr, "[Round 13 #1: repeated calls do not crash]\n");
+    supertonic_model model = make_cpu_model();
+    const size_t buf_size = ggml_tensor_overhead() * 8;
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), true };
+    ggml_context * ctx = ggml_init(p);
+    (void) ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16);
+    ggml_backend_buffer_t b1 =
+        alloc_input_scratchpad_or_throw(model, ctx, "repeat_first");
+    CHECK(b1 != nullptr);
+    // Second call: don't assert specific behaviour, just ensure
+    // we don't crash.  If it returns a buffer, free it.  If it
+    // throws, that's also acceptable (caller bug).
+    ggml_backend_buffer_t b2 = nullptr;
+    bool b2_threw = throws_runtime_error([&] {
+        b2 = alloc_input_scratchpad_or_throw(model, ctx, "repeat_second");
+    });
+    (void) b2_threw;  // either outcome OK
+    if (b2 && b2 != b1) ggml_backend_buffer_free(b2);
+    if (b1) ggml_backend_buffer_free(b1);
+    ggml_free(ctx);
+    free_cpu_model(model);
+}
+
+} // namespace
+
+int main() {
+    test_helper_symbol_exists();
+    test_cpu_fallback_returns_valid_buffer();
+    test_empty_ctx_throws_loud_with_name();
+    test_null_arguments_throw();
+    test_repeated_calls_safe();
+
+    std::fprintf(stderr,
+                 "test_supertonic_input_scratchpad: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
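Reviewer note on the file above: the try-pinned, then default, then throw contract that the test pins can be sketched without ggml. This is a minimal illustration under stated assumptions, not the project's `alloc_input_scratchpad_or_throw`: `buffer_handle`, `alloc_or_throw_sketch`, and the two allocator callbacks are hypothetical stand-ins for `ggml_backend_buffer_t`, the real helper, and the pinned-host / default-backend allocation calls.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for ggml_backend_buffer_t: any nullable handle works.
using buffer_handle = void *;

// Sketch of the contract: try the preferred allocator first, fall back to the
// default one, and throw a runtime_error naming the cache if both return null.
inline buffer_handle alloc_or_throw_sketch(
        const std::function<buffer_handle()> & try_pinned,
        const std::function<buffer_handle()> & try_default,
        const char * cache_name) {
    // Defensive guards: throw instead of dereferencing null arguments.
    if (!cache_name) {
        throw std::runtime_error("alloc_or_throw_sketch: null cache_name");
    }
    if (!try_pinned || !try_default) {
        throw std::runtime_error(std::string("alloc_or_throw_sketch(") +
                                 cache_name + "): null allocator");
    }
    if (buffer_handle buf = try_pinned())  return buf;  // preferred path
    if (buffer_handle buf = try_default()) return buf;  // fallback path
    throw std::runtime_error(std::string("alloc_or_throw_sketch(") +
                             cache_name + "): both allocation paths failed");
}

// Usage: the fallback succeeds when the preferred path returns null.
inline void alloc_or_throw_sketch_demo() {
    int backing = 0;
    auto miss = []() -> buffer_handle { return nullptr; };
    auto hit  = [&]() -> buffer_handle { return &backing; };
    assert(alloc_or_throw_sketch(miss, hit, "demo_cache") == &backing);
}
```

Passing the allocators in as callbacks is only a device to keep the sketch free of backend dependencies; the helper under test takes the model and context directly.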
diff --git a/tts-cpp/test/test_supertonic_kv_attn_type.cpp b/tts-cpp/test/test_supertonic_kv_attn_type.cpp
new file mode 100644
index 00000000000..fb2011a2a5d
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_kv_attn_type.cpp
@@ -0,0 +1,384 @@
+// QVAC-18605 round 4 — CPU-only TDD test for the multi-dtype
+// K/V flash-attention dispatch resolver.
+//
+// Round 4 generalises the round-1 `use_f16_attn` boolean (F16 vs
+// F32 only) into a five-valued enum (auto, f32, f16, bf16, q8_0)
+// so operators can opt into BF16 K/V (Vulkan coopmat2 — better
+// quality than F16 at identical bandwidth) or Q8_0 K/V (Vulkan +
+// half the K/V upload bandwidth) when their adapter advertises
+// the corresponding capability.
+//
+// The dispatch policy lives in the pure-logic helper
+// `resolve_kv_attn_type(requested, legacy_use_f16_attn,
+// backend_supports_f16, backend_supports_bf16,
+// backend_supports_q8_0)` so the policy is testable on CPU
+// without a Vulkan device.  The actual Vulkan-side cast lives
+// behind `#ifdef GGML_USE_VULKAN` in the vector estimator (round
+// 4 implementation).
+//
+// API contract:
+//
+//   enum class kv_attn_dtype : int {
+//       autoselect = -1,  // EngineOptions sentinel; resolver
+//                          // never returns this (always concrete).
+//       f32        = 0,
+//       f16        = 1,
+//       bf16       = 2,
+//       q8_0       = 3,
+//   };
+//
+//   kv_attn_dtype resolve_kv_attn_type(
+//       int requested,                 // -1 / 0 / 1 / 2 / 3 from
+//                                      //   EngineOptions::kv_attn_type
+//       bool legacy_use_f16_attn,      // model.use_f16_attn (round 1
+//                                      //   auto-policy outcome)
+//       bool backend_supports_f16,     // probe result
+//       bool backend_supports_bf16,    // probe result
+//       bool backend_supports_q8_0);   // probe result
+//
+// Behaviour matrix:
+//
+//   requested == -1 (auto):
+//     legacy_use_f16_attn == true  + backend_supports_f16 → f16
+//     legacy_use_f16_attn == true  + !backend_supports_f16 → f32
+//     legacy_use_f16_attn == false                          → f32
+//
+//   requested == 0 (f32 forced):
+//     → f32  (regardless of any probe)
+//
+//   requested == 1 (f16 forced):
+//     backend_supports_f16  → f16
+//     !backend_supports_f16 → f32 (graceful fallback; loud
+//                                  warning logged at the live
+//                                  dispatch site, not here)
+//
+//   requested == 2 (bf16 forced):
+//     backend_supports_bf16  → bf16
+//     !backend_supports_bf16 → f32 (graceful fallback)
+//
+//   requested == 3 (q8_0 forced):
+//     backend_supports_q8_0  → q8_0
+//     !backend_supports_q8_0 → f32 (graceful fallback)
+//
+//   requested out of [-1..3] → throws std::runtime_error
+//                              (caller surfaces the message
+//                              verbatim; same pattern as
+//                              `resolve_vulkan_device_index`'s
+//                              reserved-negative throw).
+//
+// Why "graceful fallback to F32" instead of "throw" on
+// unsupported dtypes?  The probes are advisory — operators
+// should be able to set `--kv-attn-type bf16` once in their
+// production config and have the engine fall back to F32 on
+// Intel ARC (no coopmat2) without crashing.  Loud-failure only
+// for actual config errors (out-of-range int).
+//
+// Written FIRST (TDD).  Whole TU MUST fail to compile before
+// the symbol is added, then pass after.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <stdexcept>
+
+using tts_cpp::supertonic::detail::kv_attn_dtype;
+using tts_cpp::supertonic::detail::resolve_kv_attn_type;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+template <typename F>
+bool throws_runtime_error(F && fn) {
+    try { fn(); return false; }
+    catch (const std::runtime_error &) { return true; }
+    catch (...) { return false; }
+}
+
+// Test 1 — auto + legacy boolean back-compatibility matrix.
+//
+// `requested == -1` is the default for the new EngineOptions
+// field; it MUST preserve the round-1 `use_f16_attn` semantics
+// exactly so existing operator configs see zero behaviour change.
+void test_auto_falls_back_to_legacy_boolean() {
+    // legacy_use_f16_attn=true + backend supports F16 → f16
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/true,  true,  true,  true)  == kv_attn_dtype::f16);
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/true,  true,  false, false) == kv_attn_dtype::f16);
+
+    // legacy_use_f16_attn=true + backend doesn't support F16 → f32
+    // (the round-1 auto-policy probe-gates F16; this reproduces
+    // the same fallback semantics for explicit auto + missing probe.)
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/true,  false, true,  true)  == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/true,  false, false, false) == kv_attn_dtype::f32);
+
+    // legacy_use_f16_attn=false → f32 regardless of probes.
+    // This is the CPU default — auto must NOT silently flip on
+    // F16 just because the CPU's flash-attn supports it.
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/false, true,  true,  true)  == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/false, false, false, false) == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(-1, /*legacy=*/false, true,  true,  false) == kv_attn_dtype::f32);
+}
+
+// Test 2 — f32 forced overrides everything.
+//
+// `--kv-attn-type 0` (f32) means "I explicitly want F32 K/V even
+// if the auto-policy / probes would have promoted me to F16/BF16/Q8_0".
+// Useful for parity-harness runs and for triaging perf cliffs
+// caused by F16 underflow on a specific model + adapter combo.
+void test_f32_forced_overrides_legacy() {
+    CHECK(resolve_kv_attn_type(0, /*legacy=*/true,  true,  true,  true) == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(0, /*legacy=*/false, true,  true,  true) == kv_attn_dtype::f32);
+    // Probes don't matter for explicit F32.
+    CHECK(resolve_kv_attn_type(0, /*legacy=*/true,  false, false, false) == kv_attn_dtype::f32);
+}
+
+// Test 3 — f16 forced + probe-gated graceful fallback.
+//
+// `--kv-attn-type 1` (f16) is the round-1 `--f16-attn 1` semantic
+// generalised: enable F16 if the backend supports it, fall back
+// to F32 otherwise (same fallback the round-1 auto-policy applies).
+void test_f16_forced_probe_gated() {
+    // Backend supports F16 → f16.
+    CHECK(resolve_kv_attn_type(1, /*legacy=*/true,  true,  false, false) == kv_attn_dtype::f16);
+    CHECK(resolve_kv_attn_type(1, /*legacy=*/false, true,  false, false) == kv_attn_dtype::f16);
+
+    // Backend doesn't support F16 → graceful fallback to f32.
+    CHECK(resolve_kv_attn_type(1, /*legacy=*/true,  false, true,  true)  == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(1, /*legacy=*/false, false, true,  true)  == kv_attn_dtype::f32);
+}
+
+// Test 4 — bf16 forced + probe-gated graceful fallback.
+//
+// `--kv-attn-type 2` (bf16) is the new dispatch added in round 4.
+// Vulkan with coopmat2 supports BF16 K/V; Intel ARC (no coopmat2)
+// doesn't.  Graceful fallback to F32 on missing-probe so an
+// operator config that says `--kv-attn-type bf16` works on both
+// platforms (with the win on coopmat2 hardware, parity F32 on
+// the rest).
+void test_bf16_forced_probe_gated() {
+    // BF16 supported → bf16.
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/true,  true,  true,  false) == kv_attn_dtype::bf16);
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/false, false, true,  false) == kv_attn_dtype::bf16);
+
+    // BF16 not supported → graceful fallback to f32.  Even when
+    // F16 IS supported, we fall back to F32 (not F16) because the
+    // operator asked for BF16 specifically; silently downgrading
+    // to F16 would mask drift differences between BF16 and F16.
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/true,  true,  false, true)  == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/false, false, false, false) == kv_attn_dtype::f32);
+}
+
+// Test 5 — q8_0 forced + probe-gated graceful fallback.
+//
+// Same shape as the BF16 case; Q8_0 is the bandwidth-saving
+// option (half the K/V upload size).  Vulkan supports Q8_0 K/V
+// in both scalar and coopmat2 paths.  Forward-compat in this
+// round: the probe is already in the cache (round 2), but the
+// live dispatch is only wired up when the operator opts in via
+// `--kv-attn-type q8_0`.
+void test_q8_0_forced_probe_gated() {
+    // Q8_0 supported → q8_0.
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/true,  true,  true,  true)  == kv_attn_dtype::q8_0);
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/false, false, false, true)  == kv_attn_dtype::q8_0);
+
+    // Q8_0 not supported → graceful fallback to f32.
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/true,  true,  true,  false) == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/false, false, false, false) == kv_attn_dtype::f32);
+}
+
+// Test 6 — out-of-range request throws.
+//
+// Loud-failure for actual config errors (CLI typo).  Same pattern
+// as `resolve_vulkan_device_index`'s reserved-negative throw.
+void test_out_of_range_throws() {
+    CHECK(throws_runtime_error([] {
+        (void) resolve_kv_attn_type(4, true, true, true, true);
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_kv_attn_type(99, true, true, true, true);
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_kv_attn_type(-2, true, true, true, true);
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_kv_attn_type(-100, true, true, true, true);
+    }));
+}
+
+// Test 7 — resolver NEVER returns `autoselect`, AND every
+// happy-path branch maps to the EXACT expected concrete dtype.
+//
+// `kv_attn_dtype::autoselect` is the EngineOptions sentinel;
+// the resolver always returns a concrete dispatch dtype.  This
+// test pins the contract so a future refactor can't accidentally
+// leak the sentinel through to the dispatch site (which would
+// crash on the switch's default branch).
+//
+// PR #18 reviewer (Omar) follow-up: the original exhaustive
+// 5 × 2 × 8 sweep only asserted `dt != autoselect`, so a typo
+// in the resolver (e.g., returning `f16` when `bf16` was
+// requested + supported) would pass silently.  This test now
+// computes the expected concrete dtype as a pure function of
+// the inputs (mirror of the resolver's behaviour matrix) and
+// `CHECK`s the resolver's return value against that expected
+// dtype on every one of the 80 grid points — a typo in any
+// dispatch branch now fails LOUD with the exact mismatch.
+void test_resolver_returns_concrete_only() {
+    // Reference resolver — same behaviour matrix, separately
+    // implemented so a typo on one side doesn't cancel out
+    // a typo on the other.  Reads like the table in
+    // `supertonic_internal.h`'s docstring on
+    // `resolve_kv_attn_type`.
+    auto expected = [](int requested, bool legacy,
+                       bool sf16, bool sbf16, bool sq8) -> kv_attn_dtype {
+        switch (requested) {
+            case -1: return (legacy && sf16) ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
+            case 0:  return kv_attn_dtype::f32;
+            case 1:  return sf16  ? kv_attn_dtype::f16  : kv_attn_dtype::f32;
+            case 2:  return sbf16 ? kv_attn_dtype::bf16 : kv_attn_dtype::f32;
+            case 3:  return sq8   ? kv_attn_dtype::q8_0 : kv_attn_dtype::f32;
+        }
+        // Unreachable for the request range we sweep below.
+        return kv_attn_dtype::autoselect;
+    };
+    for (int requested : { -1, 0, 1, 2, 3 }) {
+        for (int legacy_bit : { 0, 1 }) {
+            const bool legacy = legacy_bit != 0;
+            for (int probe_mask = 0; probe_mask < 8; ++probe_mask) {
+                const bool sf16  = (probe_mask & 1) != 0;
+                const bool sbf16 = (probe_mask & 2) != 0;
+                const bool sq8   = (probe_mask & 4) != 0;
+                const auto dt  = resolve_kv_attn_type(requested, legacy, sf16, sbf16, sq8);
+                const auto exp = expected(requested, legacy, sf16, sbf16, sq8);
+                CHECK(dt != kv_attn_dtype::autoselect);
+                CHECK(dt == exp);
+            }
+        }
+    }
+
+    // Belt-and-suspenders happy-path spot checks (Omar's
+    // example): the explicit-request paths get the dtype they
+    // asked for when the probe says yes, AND don't accidentally
+    // wander into a neighbouring enum value.
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/false, /*sf16=*/true,
+                               /*sbf16=*/true, /*sq8=*/true) == kv_attn_dtype::bf16);
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/false, /*sf16=*/true,
+                               /*sbf16=*/true, /*sq8=*/true) == kv_attn_dtype::q8_0);
+    CHECK(resolve_kv_attn_type(1, /*legacy=*/false, /*sf16=*/true,
+                               /*sbf16=*/false, /*sq8=*/false) == kv_attn_dtype::f16);
+    // Cross-dtype non-contamination: requesting bf16 with f16 +
+    // q8_0 supported but bf16 NOT supported MUST fall to f32,
+    // not silently to f16 or q8_0.
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/true, /*sf16=*/true,
+                               /*sbf16=*/false, /*sq8=*/true) == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/true, /*sf16=*/true,
+                               /*sbf16=*/true, /*sq8=*/false) == kv_attn_dtype::f32);
+}
+
+// Test 8 — `out_was_downgraded` signal on explicit-request +
+// missing-probe paths.
+//
+// PR #18 reviewer (Omar) follow-up: the resolver silently
+// returns f32 when the operator explicitly requests f16/bf16/q8_0
+// and the corresponding backend probe is false.  The operator-
+// facing call sites need a programmatic signal so they can emit
+// a `fprintf(stderr, "warning: ...")` (auto + missing probe is
+// NOT a downgrade — the operator didn't ask for a specific
+// dtype).  This test pins:
+//   - Auto + missing probe → flag stays false.
+//   - Auto + matching probe → flag stays false.
+//   - f32 explicit → flag stays false (no concept of "downgrade
+//     from f32").
+//   - f16 / bf16 / q8_0 explicit + matching probe → flag stays
+//     false (operator got what they asked for).
+//   - f16 / bf16 / q8_0 explicit + missing probe → flag set.
+//   - Optional out-pointer: nullptr (default) MUST be safe.
+void test_downgrade_flag_signal() {
+    bool downgraded = true;  // pre-set to true to detect "no write"
+
+    // Auto + nothing supported.  Not a downgrade — auto policy.
+    (void) resolve_kv_attn_type(-1, /*legacy=*/true,
+                                false, false, false, &downgraded);
+    CHECK(downgraded == false);
+
+    // f32 explicit.  Never a downgrade.
+    downgraded = true;
+    (void) resolve_kv_attn_type(0, /*legacy=*/false,
+                                true, true, true, &downgraded);
+    CHECK(downgraded == false);
+
+    // f16 explicit + supported.  Not a downgrade.
+    downgraded = true;
+    (void) resolve_kv_attn_type(1, /*legacy=*/false,
+                                /*sf16=*/true, false, false, &downgraded);
+    CHECK(downgraded == false);
+
+    // bf16 explicit + supported.  Not a downgrade.
+    downgraded = true;
+    (void) resolve_kv_attn_type(2, /*legacy=*/false,
+                                false, /*sbf16=*/true, false, &downgraded);
+    CHECK(downgraded == false);
+
+    // q8_0 explicit + supported.  Not a downgrade.
+    downgraded = true;
+    (void) resolve_kv_attn_type(3, /*legacy=*/false,
+                                false, false, /*sq8=*/true, &downgraded);
+    CHECK(downgraded == false);
+
+    // f16 explicit + NOT supported.  Downgrade signal.
+    downgraded = false;
+    CHECK(resolve_kv_attn_type(1, /*legacy=*/false,
+                               /*sf16=*/false, true, true, &downgraded)
+          == kv_attn_dtype::f32);
+    CHECK(downgraded == true);
+
+    // bf16 explicit + NOT supported.  Downgrade signal.
+    downgraded = false;
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/false,
+                               true, /*sbf16=*/false, true, &downgraded)
+          == kv_attn_dtype::f32);
+    CHECK(downgraded == true);
+
+    // q8_0 explicit + NOT supported.  Downgrade signal.
+    downgraded = false;
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/false,
+                               true, true, /*sq8=*/false, &downgraded)
+          == kv_attn_dtype::f32);
+    CHECK(downgraded == true);
+
+    // Nullptr default argument must not crash, on downgrade and non-downgrade paths alike.
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/false, true, false, true)
+          == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(3, /*legacy=*/false, true, true, false)
+          == kv_attn_dtype::f32);
+    CHECK(resolve_kv_attn_type(2, /*legacy=*/false, true, true, false)
+          == kv_attn_dtype::bf16);
+}
+
+} // namespace
+
+int main() {
+    test_auto_falls_back_to_legacy_boolean();
+    test_f32_forced_overrides_legacy();
+    test_f16_forced_probe_gated();
+    test_bf16_forced_probe_gated();
+    test_q8_0_forced_probe_gated();
+    test_out_of_range_throws();
+    test_resolver_returns_concrete_only();
+    test_downgrade_flag_signal();
+
+    std::fprintf(stderr,
+                 "test_supertonic_kv_attn_type: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
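Reviewer note on the file above: the behaviour matrix and the `out_was_downgraded` signal can be read as the following sketch. It is an illustration written from the test's comments, not the resolver in `supertonic_internal.h`; `kv_dtype_sketch` and `resolve_kv_sketch` are hypothetical names, and the error message text is an assumption.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for detail::kv_attn_dtype; values follow the API
// contract comment in the test.
enum class kv_dtype_sketch : int { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// Sketch of the documented policy: auto follows the legacy boolean gated by the
// F16 probe; explicit requests are honoured when the matching probe is true and
// otherwise fall back to f32 with the downgrade flag set; anything outside
// [-1..3] throws.
inline kv_dtype_sketch resolve_kv_sketch(int requested,
                                         bool legacy_use_f16_attn,
                                         bool supports_f16,
                                         bool supports_bf16,
                                         bool supports_q8_0,
                                         bool * out_was_downgraded = nullptr) {
    bool downgraded = false;
    kv_dtype_sketch result = kv_dtype_sketch::f32;
    // Explicit request: take the dtype if supported, else keep f32 and flag it.
    auto pick = [&](bool supported, kv_dtype_sketch dt) {
        if (supported) result = dt; else downgraded = true;
    };
    switch (requested) {
        case -1:  // auto: never counts as a downgrade
            if (legacy_use_f16_attn && supports_f16) result = kv_dtype_sketch::f16;
            break;
        case 0:  break;  // f32 forced, probes ignored
        case 1:  pick(supports_f16,  kv_dtype_sketch::f16);  break;
        case 2:  pick(supports_bf16, kv_dtype_sketch::bf16); break;
        case 3:  pick(supports_q8_0, kv_dtype_sketch::q8_0); break;
        default:
            throw std::runtime_error("kv_attn_type out of range [-1..3]: " +
                                     std::to_string(requested));
    }
    if (out_was_downgraded) *out_was_downgraded = downgraded;  // optional out-pointer
    return result;
}

// Usage: bf16 requested without bf16 support falls to f32 (not f16) and flags it.
inline void resolve_kv_sketch_demo() {
    bool downgraded = false;
    assert(resolve_kv_sketch(2, true, true, false, true, &downgraded) == kv_dtype_sketch::f32);
    assert(downgraded);
}
```

Writing the flag on every non-throwing path, including auto and f32, matches the "pre-set to true to detect no write" checks in Test 8.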
diff --git a/tts-cpp/test/test_supertonic_kv_attn_type_api.cpp b/tts-cpp/test/test_supertonic_kv_attn_type_api.cpp
new file mode 100644
index 00000000000..2dc9e7c12f0
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_kv_attn_type_api.cpp
@@ -0,0 +1,157 @@
+// QVAC-18605 round 4 — CPU-only TDD test for the multi-dtype
+// K/V flash-attention API surface.
+//
+// Pins:
+//   1. `EngineOptions::kv_attn_type` int field exists, defaults to -1
+//      (auto), and accepts assignment to the documented values
+//      0..3 (f32, f16, bf16, q8_0).
+//   2. `supertonic_model::kv_attn_type` (`detail::kv_attn_dtype`)
+//      field exists, defaults to `kv_attn_dtype::f32` (no
+//      surprise dispatch on a default-constructed model).
+//   3. `supertonic_kv_attn_type()` thread-local accessor exists
+//      and returns the currently-active dispatch dtype.  Default
+//      (no scope active) is `kv_attn_dtype::f32`.
+//   4. `supertonic_op_dispatch_scope::prev_kv_attn_type` field
+//      exists so the RAII teardown restores the right value.
+//   5. The round-3 baseline EngineOptions defaults
+//      (prewarm_text empty, vulkan_device 0, f16_attn -1,
+//      f16_weights -1, f16_weights_deny_list empty) are unchanged
+//      — regression guard against accidental ABI churn.
+//
+// Whole TU MUST fail to compile before the symbols are added,
+// then pass after.
+
+#include "tts-cpp/supertonic/engine.h"
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <type_traits>
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// SFINAE: assert the EngineOptions field exists.
+template <typename T>
+auto has_kv_attn_type_field(int) -> decltype(
+    std::declval<T &>().kv_attn_type, std::true_type{});
+template <typename T>
+auto has_kv_attn_type_field(...) -> std::false_type;
+
+// SFINAE: assert the dispatch-scope field exists.
+template <typename T>
+auto has_prev_kv_attn_type(int) -> decltype(
+    std::declval<T &>().prev_kv_attn_type, std::true_type{});
+template <typename T>
+auto has_prev_kv_attn_type(...) -> std::false_type;
+
+// SFINAE: assert the model field exists.
+template <typename T>
+auto has_model_kv_attn_type(int) -> decltype(
+    std::declval<T &>().kv_attn_type, std::true_type{});
+template <typename T>
+auto has_model_kv_attn_type(...) -> std::false_type;
+
+void test_engine_options_field_exists() {
+    using namespace tts_cpp::supertonic;
+    static_assert(
+        decltype(has_kv_attn_type_field<EngineOptions>(0))::value,
+        "EngineOptions must declare kv_attn_type (int, default -1 = auto)");
+
+    EngineOptions opts;
+    // Default = -1 (auto) — matches the f16_attn / f16_weights /
+    // vulkan_device convention.
+    CHECK(opts.kv_attn_type == -1);
+
+    // Field accepts the documented values.
+    opts.kv_attn_type = 0; CHECK(opts.kv_attn_type == 0);
+    opts.kv_attn_type = 1; CHECK(opts.kv_attn_type == 1);
+    opts.kv_attn_type = 2; CHECK(opts.kv_attn_type == 2);
+    opts.kv_attn_type = 3; CHECK(opts.kv_attn_type == 3);
+    opts.kv_attn_type = -1; CHECK(opts.kv_attn_type == -1);
+
+    // Round-3 + earlier defaults — regression guard.
+    EngineOptions baseline;
+    CHECK(baseline.kv_attn_type == -1);
+    CHECK(baseline.prewarm_text.empty());
+    CHECK(baseline.vulkan_device == 0);
+    CHECK(baseline.f16_attn == -1);
+    CHECK(baseline.f16_weights == -1);
+    CHECK(baseline.f16_weights_deny_list.empty());
+}
+
+void test_supertonic_model_field_exists() {
+    using namespace tts_cpp::supertonic::detail;
+    static_assert(
+        decltype(has_model_kv_attn_type<supertonic_model>(0))::value,
+        "supertonic_model must declare kv_attn_type (kv_attn_dtype)");
+
+    supertonic_model model;
+    // Default = f32 — a default-constructed model must NOT
+    // accidentally dispatch the F16 path before
+    // `load_supertonic_gguf` resolves the policy.
+    CHECK(model.kv_attn_type == kv_attn_dtype::f32);
+}
+
+void test_dispatch_scope_field_exists() {
+    using namespace tts_cpp::supertonic::detail;
+    static_assert(
+        decltype(has_prev_kv_attn_type<supertonic_op_dispatch_scope>(0))::value,
+        "supertonic_op_dispatch_scope must declare prev_kv_attn_type "
+        "for RAII teardown of the thread-local kv_attn_type flag");
+    // Static assert IS the gate.  Bump check count for the
+    // pass/fail summary.
+    ++g_checks;
+}
+
+void test_thread_local_accessor_default() {
+    using namespace tts_cpp::supertonic::detail;
+    // No scope active → default dtype must be f32 (matches the
+    // model default; ensures graph builders called outside a
+    // scope don't accidentally take the F16 path).
+    CHECK(supertonic_kv_attn_type() == kv_attn_dtype::f32);
+}
+
+void test_dispatch_scope_restores_on_teardown() {
+    using namespace tts_cpp::supertonic::detail;
+    // Baseline.
+    CHECK(supertonic_kv_attn_type() == kv_attn_dtype::f32);
+
+    // A scope built from a model with a non-default dtype must
+    // flip the thread-local; teardown must restore it.
+    {
+        supertonic_model m;
+        m.kv_attn_type = kv_attn_dtype::bf16;
+        // Other fields stay at their defaults; constructor must
+        // not require backend / tensors / hparams.
+        supertonic_op_dispatch_scope scope(m);
+        CHECK(supertonic_kv_attn_type() == kv_attn_dtype::bf16);
+    }
+    // RAII restored.
+    CHECK(supertonic_kv_attn_type() == kv_attn_dtype::f32);
+}
+
+} // namespace
+
+int main() {
+    test_engine_options_field_exists();
+    test_supertonic_model_field_exists();
+    test_dispatch_scope_field_exists();
+    test_thread_local_accessor_default();
+    test_dispatch_scope_restores_on_teardown();
+
+    std::fprintf(stderr,
+                 "test_supertonic_kv_attn_type_api: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_load_caches.cpp b/tts-cpp/test/test_supertonic_load_caches.cpp
new file mode 100644
index 00000000000..1e57f6730b9
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_load_caches.cpp
@@ -0,0 +1,317 @@
+// TDD harness for the host-side + GPU-side caches added in the
+// QVAC-18607 audit follow-up (audit findings F1, F2, F6, F9).
+//
+// Validates the *structural* properties of each cache so a regression
+// in the load-time precompute or the lazy cache populator is caught
+// before the end-to-end pipeline parity test runs.  Each test
+// references the precise behaviour the audit findings spell out:
+//
+//   F1  model.vector_rope_theta is populated at load time and matches
+//       what `read_f32(...3.attn.theta)` would have returned.
+//
+//   F2  model.vocoder.bn_scale_pre / bn_shift_pre are populated at
+//       load time and match host-side recomputation of the formula
+//       (gamma / sqrt(var + eps)), (beta - mean * scale).
+//
+//   F6  The hot t_proj weights are pre-transposed into companion
+//       source-tensor entries with the `__T` suffix.  The
+//       transposed contents match a host-side transpose of the
+//       original.  Documents the exact pre-transpose roster so a
+//       future audit can spot drift.
+//
+//   F9  cached_time_embedding(model, current, total) returns the same
+//       vector that `time_embedding(model, current, total)` would
+//       have computed on the first call, and the cache map is
+//       populated after the call (no recomputation on the second
+//       call with the same key).
+//
+// Fixture test: requires the Supertonic GGUF.  The REQUIRES gating in
+// CMakeLists.txt auto-disables it if the model isn't present.
+
+#include "supertonic_internal.h"
+
+#include <algorithm>
+#include <array>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <string>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+bool close_enough(float a, float b, float atol = 1e-6f, float rtol = 1e-5f) {
+    return std::fabs(a - b) <= atol + rtol * std::fabs(b);
+}
+
+// Helper: download every element of `tensor` into a host F32 vector.
+// Reused across F1/F2/F6 checks because every source tensor we want
+// to verify lives in the backend buffer that `read_f32` reaches.
+std::vector<float> dump_f32(ggml_tensor * tensor) {
+    std::vector<float> out((size_t) ggml_nelements(tensor));
+    ggml_backend_tensor_get(tensor, out.data(), 0, ggml_nbytes(tensor));
+    return out;
+}
+
+ggml_tensor * find_source(const supertonic_model & model, const std::string & key) {
+    auto it = model.source_tensors.find(key);
+    return it == model.source_tensors.end() ? nullptr : it->second;
+}
+
+// F1 — RoPE θ host-side cache.  The audit finding identifies the
+// shared theta tensor at `main_blocks.3.attn.theta` as the source.
+// All four group attention sites in the vector estimator's GGML
+// production path read from the same tensor; caching it once at
+// load avoids 4×N_STEPS GPU→host downloads per synth (20 sync points
+// on the default 5-step schedule).
+void test_f1_rope_theta_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F1 rope-theta cache]\n");
+
+    // Contract: cache is populated after load and has the same size
+    // as the source tensor.
+    CHECK(!model.vector_rope_theta.empty());
+
+    ggml_tensor * src = find_source(model, "vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta");
+    if (!src) {
+        std::fprintf(stderr, "  SKIP: theta source tensor missing in this GGUF\n");
+        return;
+    }
+    CHECK(model.vector_rope_theta.size() == (size_t) ggml_nelements(src));
+
+    // Contract: cached bytes match the source.
+    auto direct = dump_f32(src);
+    CHECK(direct.size() == model.vector_rope_theta.size());
+
+    int bad = 0;
+    for (size_t i = 0; i < direct.size() && i < model.vector_rope_theta.size(); ++i) {
+        if (model.vector_rope_theta[i] != direct[i]) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  mismatch @ %zu: cached=%f direct=%f\n",
+                             i, model.vector_rope_theta[i], direct[i]);
+            }
+            ++bad;
+        }
+    }
+    CHECK(bad == 0);
+    std::fprintf(stderr, "  size=%zu, bad=%d / %zu\n",
+                 model.vector_rope_theta.size(), bad, direct.size());
+}
+
+// F2 — Vocoder BN scale/shift pre-baked at load time.  The audit
+// finding identifies `bn_scale = gamma / sqrt(var + 1e-5)` and
+// `bn_shift = beta - mean * bn_scale` as constants that were being
+// recomputed every synth on the CPU.  Pre-baking saves the four
+// per-synth `read_f32_tensor` downloads + the two `ggml_backend_tensor_set`
+// uploads of the resulting scale/shift vectors.
+void test_f2_vocoder_bn_prebake(const supertonic_model & model) {
+    std::fprintf(stderr, "[F2 vocoder BN pre-bake]\n");
+
+    const auto & v = model.vocoder;
+
+    // Contract: precomputed scale/shift tensors exist post-load.
+    CHECK(v.bn_scale_pre != nullptr);
+    CHECK(v.bn_shift_pre != nullptr);
+    if (!v.bn_scale_pre || !v.bn_shift_pre) return;
+    CHECK(ggml_nelements(v.bn_scale_pre) == 512);
+    CHECK(ggml_nelements(v.bn_shift_pre) == 512);
+
+    auto cached_scale = dump_f32(v.bn_scale_pre);
+    auto cached_shift = dump_f32(v.bn_shift_pre);
+    auto gamma = dump_f32(v.final_norm_g);
+    auto beta  = dump_f32(v.final_norm_b);
+    auto mean  = dump_f32(v.final_norm_running_mean);
+    auto var   = dump_f32(v.final_norm_running_var);
+
+    // Contract: cached bytes match the canonical host-side formula.
+    int bad_scale = 0, bad_shift = 0;
+    float max_abs_err_scale = 0.0f, max_abs_err_shift = 0.0f;
+    for (int c = 0; c < 512; ++c) {
+        const float expected_scale = gamma[c] / std::sqrt(var[c] + 1e-5f);
+        const float expected_shift = beta[c]  - mean[c] * expected_scale;
+        const float abs_scale = std::fabs(cached_scale[c] - expected_scale);
+        const float abs_shift = std::fabs(cached_shift[c] - expected_shift);
+        max_abs_err_scale = std::max(max_abs_err_scale, abs_scale);
+        max_abs_err_shift = std::max(max_abs_err_shift, abs_shift);
+        if (!close_enough(cached_scale[c], expected_scale)) ++bad_scale;
+        if (!close_enough(cached_shift[c], expected_shift)) ++bad_shift;
+    }
+    CHECK(bad_scale == 0);
+    CHECK(bad_shift == 0);
+    std::fprintf(stderr,
+                 "  scale max_abs_err=%.3e bad=%d / 512\n"
+                 "  shift max_abs_err=%.3e bad=%d / 512\n",
+                 max_abs_err_scale, bad_scale,
+                 max_abs_err_shift, bad_shift);
+}
+
+// F6 — Load-time pre-transpose for hot `t_proj` matmul weights.
+// The audit roster: every `vector_field.main_blocks.{1,7,13,19}.linear.linear.weight`
+// (i.e. the four group `t_proj` weights) + the front block's
+// `vector_field.main_blocks.1.linear.linear.weight` equivalent.
+// Pre-transposing eliminates the `ggml_cont(ggml_transpose(W))`
+// inside every cached group graph; the pre-transposed companion is
+// stored alongside the original in `model.source_tensors` under
+// the same name with a `__T` suffix.
+void test_f6_pretranspose_roster(const supertonic_model & model) {
+    std::fprintf(stderr, "[F6 pre-transposed weights]\n");
+
+    // The exact roster — this list documents the audit finding so a
+    // future drift in the pre-transpose set is immediately visible.
+    // Updates here require updating the call-site rewrite in
+    // build_group_graph_cache / supertonic_vector_trace_proj_ggml.
+    static const char * const kRoster[] = {
+        "vector_estimator:onnx::MatMul_3095",
+        "vector_estimator:onnx::MatMul_3140",
+        "vector_estimator:onnx::MatMul_3185",
+        "vector_estimator:onnx::MatMul_3230",
+    };
+
+    int present = 0;
+    int missing = 0;
+    for (const char * name : kRoster) {
+        ggml_tensor * orig = find_source(model, name);
+        const std::string t_name = std::string(name) + "__T";
+        ggml_tensor * t = find_source(model, t_name);
+        if (!orig) {
+            // Some GGUFs may not carry the front-block weight; skip
+            // gracefully rather than failing the whole test.
+            std::fprintf(stderr,
+                         "  SKIP %s (original not in this GGUF)\n", name);
+            continue;
+        }
+        CHECK(t != nullptr);
+        if (!t) { ++missing; continue; }
+        ++present;
+
+        // Contract: __T tensor has the original's shape with the
+        // first two axes swapped (ggml's [W, H] <-> [H, W]).
+        CHECK(t->ne[0] == orig->ne[1]);
+        CHECK(t->ne[1] == orig->ne[0]);
+        CHECK(t->ne[2] == orig->ne[2]);
+        CHECK(t->ne[3] == orig->ne[3]);
+
+        // Contract: contents match host-side transpose.
+        auto orig_data = dump_f32(orig);
+        auto t_data    = dump_f32(t);
+        const int W = (int) orig->ne[0];
+        const int H = (int) orig->ne[1];
+        int bad = 0;
+        for (int j = 0; j < H; ++j) {
+            for (int i = 0; i < W; ++i) {
+                const float a = orig_data[(size_t) j * W + i];
+                const float b = t_data[(size_t) i * H + j];
+                if (a != b) {
+                    if (bad < 2) {
+                        std::fprintf(stderr,
+                                     "  %s mismatch @ (j=%d, i=%d): orig=%g t=%g\n",
+                                     name, j, i, a, b);
+                    }
+                    ++bad;
+                }
+            }
+        }
+        CHECK(bad == 0);
+    }
+    std::fprintf(stderr,
+                 "  pre-transposed roster: present=%d missing=%d\n",
+                 present, missing);
+}
+
+// F9 — time_embedding cache.  The audit finding identifies
+// `time_embedding(model, current_step, total_steps)` as a pure
+// function whose output is reused across every vector denoising
+// step.  Caching keyed by (current, total) drops 5 redundant
+// per-synth recomputations on the default schedule.
+//
+// Contract checked here:
+//   - First call populates the cache.
+//   - Second call with the same key returns the same vector
+//     bit-exactly (i.e. did not recompute).
+//   - Different keys produce different cache entries.
+//
+// Doesn't gate on cache-hit count because the cache lives behind a
+// helper inside `supertonic_vector_estimator.cpp` — we can only
+// inspect the map size.
+void test_f9_time_emb_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F9 time-embedding cache]\n");
+
+    const size_t initial_size = model.time_emb_cache.size();
+    std::array<float, 64> v0 = cached_time_embedding(model, 0, 5);
+    const size_t after_one = model.time_emb_cache.size();
+    CHECK(after_one == initial_size + 1);
+
+    // Repeated call must return bit-exact same vector.
+    std::array<float, 64> v0_repeat = cached_time_embedding(model, 0, 5);
+    CHECK(model.time_emb_cache.size() == after_one); // no new entry
+    int bad = 0;
+    for (int i = 0; i < 64; ++i) {
+        if (v0[i] != v0_repeat[i]) ++bad;
+    }
+    CHECK(bad == 0);
+
+    // Different key → new cache entry, and that entry should be a
+    // distinct vector from `v0` (different position-of-step input
+    // produces different sinusoidal embedding through the MLP).
+    std::array<float, 64> v1 = cached_time_embedding(model, 1, 5);
+    CHECK(model.time_emb_cache.size() == after_one + 1);
+    bool v1_differs = false;
+    for (int i = 0; i < 64; ++i) {
+        if (v0[i] != v1[i]) { v1_differs = true; break; }
+    }
+    CHECK(v1_differs);
+
+    // Contract: a later read of the first key still returns the
+    // original vector after another key was inserted.  This re-reads
+    // the cache only; it does not recompute via `time_embedding`.
+    std::array<float, 64> v0_again = cached_time_embedding(model, 0, 5);
+    int bad2 = 0;
+    for (int i = 0; i < 64; ++i) {
+        if (v0_again[i] != v0[i]) ++bad2;
+    }
+    CHECK(bad2 == 0);
+
+    std::fprintf(stderr,
+                 "  initial=%zu, after-1=%zu, bad-repeat=%d, bad-readback=%d\n",
+                 initial_size, after_one, bad, bad2);
+}
+
+} // namespace
+
+int main(int argc, char ** argv) {
+    if (argc < 2) {
+        std::fprintf(stderr, "usage: %s MODEL.gguf\n", argv[0]);
+        return 2;
+    }
+    supertonic_model model;
+    if (!load_supertonic_gguf(argv[1], model)) {
+        std::fprintf(stderr, "failed to load model: %s\n", argv[1]);
+        return 1;
+    }
+
+    test_f1_rope_theta_cache(model);
+    test_f2_vocoder_bn_prebake(model);
+    test_f6_pretranspose_roster(model);
+    test_f9_time_emb_cache(model);
+
+    free_supertonic_model(model);
+
+    std::fprintf(stderr,
+                 "test_supertonic_load_caches: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_pinned_host_buffer.cpp b/tts-cpp/test/test_supertonic_pinned_host_buffer.cpp
new file mode 100644
index 00000000000..b76a117b43c
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_pinned_host_buffer.cpp
@@ -0,0 +1,236 @@
+// QVAC-18605 round 12 #5 — CPU-only TDD test for the pinned-host-
+// buffer input-allocation helper.
+//
+// Background
+// ----------
+// Round 3 shipped the capability probe
+// `supertonic_backend_supports_pinned_host_buffer`, which returns
+// `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the
+// resolved backend.  The probe primed the cache + bench surface
+// but the actual per-engine input-scratchpad refactor that would
+// USE the host-pinned buffer to skip ggml-vulkan's internal
+// staging-buffer hop was deferred.
+//
+// Round 12 #5 lands that refactor as a thin helper:
+//
+//   ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
+//       const supertonic_model & model,
+//       ggml_context * input_ctx);
+//
+// Callers create a small `ggml_context` containing ONLY the hot
+// per-step input tensors (e.g. front-block `x_in` / `mask_in` /
+// `t_emb_in`), then call the helper.  The helper:
+//
+//   - Returns `nullptr` if the backend doesn't expose
+//     `ggml_backend_vk_host_buffer_type()` (CPU, Metal, OpenCL,
+//     and any future backend that lacks the API).  Caller falls
+//     back to letting `ggml_gallocr_alloc_graph` handle the
+//     input tensors via the default buffer type — same memory
+//     layout, just one staging-buffer hop per upload.
+//
+//   - Allocates a buffer from `ggml_backend_vk_host_buffer_type()`
+//     and binds every tensor in `input_ctx` to it on success.
+//     `ggml_backend_tensor_set` writes from the host buffer
+//     directly into the BAR-mapped GPU memory without an
+//     intermediate staging-buffer copy.
+//
+// Per-synth wins (RTX 5090, 5-step CFM):
+//   - 4 attention-feeding caches × per-step inputs:
+//       front_block: x_in (~80 KB), mask_in (~80 B), t_emb_in (~256 B)
+//       g1 / g2 / g3 group:  x_in, temb_in
+//   - 5 denoise steps × ~3 small uploads = ~15 staging-hops saved
+//     per synth.  Each hop is ~5-15 us on the test rig; net
+//     ~75-225 us / synth.
+//
+// What this test pins (CPU-only)
+// ------------------------------
+// 1. The helper symbol exists with the documented signature
+//    (compile-time SFINAE).
+//
+// 2. On a CPU backend (no Vulkan host-buffer API), the helper
+//    returns `nullptr` — and does so WITHOUT crashing when
+//    handed a context with no tensors, or a context with a
+//    couple of synthetic input tensors.
+//
+// 3. Repeated calls on the same input context against a CPU
+//    backend are idempotent (no leak on null return; no
+//    double-free on the second call).
+//
+// What is NOT testable in this CPU-only unit test:
+//   - The actual host-buffer allocation behaviour (requires a
+//     real Vulkan adapter).  Validated end-to-end by the
+//     model-fixture synth runs + the per-step bench.
+//   - The wiring at the production cache sites (validated by
+//     `ctest -L unit` running every other test green + the
+//     end-to-end Vulkan synth).
+//
+// Registered with `LABEL "unit"` — no GGUF required.
+
+#include "ggml.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <type_traits>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// SFINAE — the helper symbol exists with the expected signature.
+// Compile-fails before implementation lands; compile-passes after.
+template <typename = void>
+auto has_try_alloc_helper(int)
+    -> decltype(try_alloc_inputs_in_pinned_host_buffer(
+        std::declval<const supertonic_model &>(),
+        std::declval<ggml_context *>()),
+        std::true_type{});
+template <typename = void>
+auto has_try_alloc_helper(...) -> std::false_type;
+
+void test_helper_symbol_exists() {
+    std::fprintf(stderr, "[Round 12 #5: try_alloc_inputs_in_pinned_host_buffer symbol]\n");
+    static_assert(
+        decltype(has_try_alloc_helper<>(0))::value,
+        "try_alloc_inputs_in_pinned_host_buffer must exist with the documented signature");
+    ++g_checks;
+}
+
+// Build a minimal supertonic_model carrying only the backend
+// pointer the helper needs.  Synth code paths aren't exercised
+// here — the helper just queries `model.backend` for the host-
+// buffer-type capability.
+supertonic_model make_cpu_model() {
+    supertonic_model m;
+    m.backend = ggml_backend_cpu_init();
+    return m;
+}
+
+void free_cpu_model(supertonic_model & m) {
+    if (m.backend) ggml_backend_free(m.backend);
+    m = {};
+}
+
+// Round-12 #5 contract on CPU backend: helper returns nullptr
+// (no Vulkan host-buffer API available).  Caller proceeds with
+// the default gallocr path.
+void test_cpu_backend_returns_nullptr() {
+    std::fprintf(stderr, "[Round 12 #5: CPU backend → nullptr]\n");
+    supertonic_model model = make_cpu_model();
+    CHECK(model.backend != nullptr);
+
+    // Empty input ctx — should still return nullptr without
+    // crashing.
+    {
+        const size_t buf_size = ggml_tensor_overhead() * 16;
+        std::vector<uint8_t> buf(buf_size);
+        ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+        ggml_context * ctx = ggml_init(p);
+        CHECK(ctx != nullptr);
+        ggml_backend_buffer_t res = try_alloc_inputs_in_pinned_host_buffer(model, ctx);
+        CHECK(res == nullptr);
+        ggml_free(ctx);
+    }
+
+    // Input ctx with a handful of small synthetic input tensors.
+    // The helper must still return nullptr cleanly when the
+    // backend doesn't expose the host-buffer type.
+    {
+        const size_t buf_size = ggml_tensor_overhead() * 32;
+        std::vector<uint8_t> buf(buf_size);
+        ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+        ggml_context * ctx = ggml_init(p);
+        (void) ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 20);   // ~x_in
+        (void) ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 20);       // ~mask_in
+        (void) ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);       // ~t_emb_in
+        ggml_backend_buffer_t res = try_alloc_inputs_in_pinned_host_buffer(model, ctx);
+        CHECK(res == nullptr);
+        ggml_free(ctx);
+    }
+
+    free_cpu_model(model);
+}
+
+// Round-12 #5: idempotency.  Calling the helper twice on the same
+// (model, ctx) pair against a backend that returns nullptr each
+// time must be safe (no internal state leakage, no double-free
+// path triggered).  Catches a regression where the helper
+// accidentally caches the buffer in `model` or `ctx` extras and
+// double-frees on the second call.
+void test_idempotent_on_cpu_backend() {
+    std::fprintf(stderr, "[Round 12 #5: idempotent on CPU backend]\n");
+    supertonic_model model = make_cpu_model();
+    const size_t buf_size = ggml_tensor_overhead() * 32;
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    (void) ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 20);
+
+    ggml_backend_buffer_t res1 = try_alloc_inputs_in_pinned_host_buffer(model, ctx);
+    ggml_backend_buffer_t res2 = try_alloc_inputs_in_pinned_host_buffer(model, ctx);
+    CHECK(res1 == nullptr);
+    CHECK(res2 == nullptr);
+    CHECK(res1 == res2);
+
+    ggml_free(ctx);
+    free_cpu_model(model);
+}
+
+// Round-12 #5: null-backend safety.  If the caller hands the
+// helper a `supertonic_model` whose `.backend` is null (e.g., a
+// half-constructed model in an error path), the helper must
+// return nullptr instead of dereferencing.  Conservative
+// failure mode beats SIGSEGV in error-handler code paths.
+void test_null_backend_returns_nullptr() {
+    std::fprintf(stderr, "[Round 12 #5: null backend → nullptr]\n");
+    supertonic_model model;  // .backend = nullptr by default
+    CHECK(model.backend == nullptr);
+    const size_t buf_size = ggml_tensor_overhead() * 16;
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_backend_buffer_t res = try_alloc_inputs_in_pinned_host_buffer(model, ctx);
+    CHECK(res == nullptr);
+    ggml_free(ctx);
+}
+
+// Round-12 #5: null-ctx safety.  Same conservative contract as
+// the null-backend test — pass a real backend with a null
+// ctx and verify the helper returns nullptr without crashing.
+void test_null_ctx_returns_nullptr() {
+    std::fprintf(stderr, "[Round 12 #5: null ctx → nullptr]\n");
+    supertonic_model model = make_cpu_model();
+    ggml_backend_buffer_t res = try_alloc_inputs_in_pinned_host_buffer(model, nullptr);
+    CHECK(res == nullptr);
+    free_cpu_model(model);
+}
+
+} // namespace
+
+int main() {
+    test_helper_symbol_exists();
+    test_cpu_backend_returns_nullptr();
+    test_idempotent_on_cpu_backend();
+    test_null_backend_returns_nullptr();
+    test_null_ctx_returns_nullptr();
+
+    std::fprintf(stderr,
+                 "test_supertonic_pinned_host_buffer: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_pipeline.cpp b/tts-cpp/test/test_supertonic_pipeline.cpp
index 75029883ff7..9583c2454cf 100644
--- a/tts-cpp/test/test_supertonic_pipeline.cpp
+++ b/tts-cpp/test/test_supertonic_pipeline.cpp
@@ -48,20 +48,41 @@ int main(int argc, char ** argv) {
 
         const int n_steps = 5; // matches reference dump
         const int channels = model.hparams.latent_channels;
+        // Mirror dump-supertonic-reference.py: `xt = noise * latent_mask`
+        // (pre-mask the noisy latent before the vector loop) and
+        // `vocoder({"latent": xt * latent_mask})` (post-mask before
+        // vocoder).  The Python harness feeds the ONNX model an already-
+        // masked input, so without these multiplications the C++ test
+        // and the reference dump diverge at every padded tail position.
+        const float * latent_mask_data = npy_as_f32(latent_mask);
         std::vector<float> latent(noise.n_elements());
-        std::memcpy(latent.data(), npy_as_f32(noise), latent.size() * sizeof(float));
+        const float * noise_data = npy_as_f32(noise);
+        for (int c = 0; c < channels; ++c) {
+            for (int t = 0; t < latent_len; ++t) {
+                latent[(size_t) c * latent_len + t] =
+                    noise_data[(size_t) c * latent_len + t] * latent_mask_data[t];
+            }
+        }
 
         std::vector<float> next;
         for (int step = 0; step < n_steps; ++step) {
             if (!supertonic_vector_step_ggml(model, latent.data(), latent_len,
                                              text_emb.data(), text_len,
-                                             npy_as_f32(style_ttl), npy_as_f32(latent_mask),
+                                             npy_as_f32(style_ttl), latent_mask_data,
                                              step, n_steps, next, &error)) {
                 throw std::runtime_error("vector step " + std::to_string(step) + " failed: " + error);
             }
             latent.swap(next);
         }
 
+        // Post-mask the final latent — the Python harness runs the
+        // vocoder on `xt * latent_mask`, not raw `xt`.
+        for (int c = 0; c < channels; ++c) {
+            for (int t = 0; t < latent_len; ++t) {
+                latent[(size_t) c * latent_len + t] *= latent_mask_data[t];
+            }
+        }
+
         std::vector<float> wav;
         if (!supertonic_vocoder_forward_ggml(model, latent.data(), latent_len, wav, &error)) {
             throw std::runtime_error("vocoder failed: " + error);
diff --git a/tts-cpp/test/test_supertonic_portable_ops.cpp b/tts-cpp/test/test_supertonic_portable_ops.cpp
new file mode 100644
index 00000000000..e2ed604382f
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_portable_ops.cpp
@@ -0,0 +1,268 @@
+// CPU-backend parity tests for the portable op rewrites landed in the
+// Supertonic OpenCL bring-up.  Each test builds two GGML graphs with
+// the same input data on the CPU backend:
+//
+//   - Reference graph: the original op (e.g. `ggml_leaky_relu`).
+//   - Portable graph : the GPU-friendly rewrite that
+//     `supertonic_internal.h` exposes (e.g.
+//     `leaky_relu_portable_ggml` with `supertonic_use_cpu_custom_ops()`
+//     forced to `false` via the dispatch scope).
+//
+// Then it asserts the outputs match within F32 tolerance.  Math
+// equivalence is the contract; running both lowerings on the CPU
+// backend lets us validate that contract without needing an
+// OpenCL device on CI.
+//
+// Registered with `LABEL "unit"` in CMakeLists.txt so a fresh
+// checkout's `ctest` exercises this without needing any fixture.
+
+#include "supertonic_internal.h"
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include <cmath>
+#include <cstdio>
+#include <random>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Pick a combined relative + absolute tolerance that covers F32 rounding
+// for the portable decomposition.  The rewrite computes
+// `(1-α)·relu(x) + α·x` as three separate rounding steps where the
+// original `ggml_leaky_relu` is one branch + one multiply, so we
+// expect ~3 ULPs of slack on the largest |x|.  This keeps the same
+// shape as `close_enough()` in `test_metal_ops.cpp` for consistency.
+bool close_enough(float a, float b, float atol = 1e-6f, float rtol = 1e-5f) {
+    if (std::isnan(a) || std::isnan(b)) return std::isnan(a) && std::isnan(b);
+    return std::fabs(a - b) <= atol + rtol * std::fabs(b);
+}
+
+// Build a 2-D F32 input tensor [W, H], allocate it on `backend`, run
+// the graph constructed by `build_op`, return the contents of its
+// last output tensor.  The `build_op` callback receives the graph
+// context + the input tensor and returns the output tensor it wants
+// observed.
+std::vector<float> run_one_op(
+    ggml_backend_t backend,
+    const std::vector<float> & input,
+    int W, int H,
+    ggml_tensor * (*build_op)(ggml_context *, ggml_tensor *, float),
+    float alpha) {
+
+    constexpr int MAX_NODES = 64;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                            ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, W, H);
+    ggml_set_name(x, "x"); ggml_set_input(x);
+
+    ggml_tensor * y = build_op(ctx, x, alpha);
+    ggml_set_name(y, "y"); ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
+    ggml_gallocr_reserve(allocr, gf);
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    ggml_backend_tensor_set(ggml_graph_get_tensor(gf, "x"),
+                            input.data(), 0, input.size() * sizeof(float));
+    ggml_backend_graph_compute(backend, gf);
+
+    std::vector<float> out((size_t) ggml_nelements(y));
+    ggml_backend_tensor_get(ggml_graph_get_tensor(gf, "y"),
+                            out.data(), 0, out.size() * sizeof(float));
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    return out;
+}
+
+ggml_tensor * build_reference(ggml_context * ctx, ggml_tensor * x, float alpha) {
+    // Direct fused builtin — the lowering used on the CPU backend.
+    return ggml_leaky_relu(ctx, x, alpha, /*inplace=*/false);
+}
+
+ggml_tensor * build_portable(ggml_context * ctx, ggml_tensor * x, float alpha) {
+    // Same lowering the dispatch helper picks when
+    // `supertonic_use_cpu_custom_ops()` is false; we call into the
+    // shared inline definition so a future change to the rewrite
+    // is automatically exercised here too.  The dispatch scope the
+    // caller installs around this build forces the GPU branch even
+    // though we're physically running on the CPU backend.
+    return leaky_relu_portable_ggml(ctx, x, alpha);
+}
+
+// Test 1 — Sign-pattern coverage.
+//
+// LeakyReLU has different paths for `x >= 0` and `x < 0`; the
+// portable decomposition collapses them into a single algebraic
+// form.  Feed an input that exercises both halves and the boundary.
+void test_leaky_relu_signs(ggml_backend_t cpu) {
+    const int W = 64, H = 4;
+    std::vector<float> input((size_t) W * H);
+    std::mt19937 rng(42);
+    std::uniform_real_distribution<float> dist(-3.0f, 3.0f);
+    for (auto & v : input) v = dist(rng);
+    // Plant the boundary explicitly.
+    input[0] = 0.0f;
+    input[1] = -0.0f;
+    input[2] = 1e-10f;
+    input[3] = -1e-10f;
+
+    // Forcing the portable lowering needs a "GPU-looking" model with
+    // a dispatch scope around the portable graph build.  The reference
+    // build runs without any scope, so it takes the default
+    // `supertonic_use_cpu_custom_ops() == true` path and routes
+    // through the CPU fused builtin.
+    supertonic_model gpu_model;
+    gpu_model.backend_is_cpu = false; gpu_model.use_f16_attn = false;
+    gpu_model.use_native_leaky_relu = false; // force the decomposition
+
+    for (float alpha : { 0.0f, 0.01f, 0.05f, 0.1f, 0.5f, 0.99f, 1.0f }) {
+        auto ref = run_one_op(cpu, input, W, H, build_reference, alpha);
+        std::vector<float> got;
+        {
+            supertonic_op_dispatch_scope scope(gpu_model);
+            got = run_one_op(cpu, input, W, H, build_portable, alpha);
+        }
+
+        int bad = 0;
+        float worst = 0.0f;
+        for (size_t i = 0; i < ref.size(); ++i) {
+            if (!close_enough(got[i], ref[i])) {
+                if (bad < 4) {
+                    std::fprintf(stderr,
+                                 "  alpha=%.3f i=%zu  ref=%.6g  portable=%.6g\n",
+                                 alpha, i, ref[i], got[i]);
+                }
+                ++bad;
+            }
+            worst = std::max(worst, std::fabs(got[i] - ref[i]));
+        }
+        CHECK(bad == 0);
+        std::fprintf(stderr,
+                     "  [leaky_relu signs alpha=%.3f] max_abs_err=%.3e %s\n",
+                     alpha, worst, bad == 0 ? "PASS" : "FAIL");
+    }
+}
+
+// Test 2 — Dispatch scope actually routes through the portable path.
+//
+// Belt-and-braces: even if `close_enough()` accidentally permitted
+// any input → any output, the graph itself should still show more
+// nodes in the portable build (1 RELU + 2 SCALE + 1 ADD = 4 nodes)
+// than in the reference build (1 LEAKY_RELU node).
+// Inspecting node count is fragile but cheap; it guards against
+// `leaky_relu_portable_ggml` regressing back to a `ggml_leaky_relu`
+// passthrough on GPU.
+void test_dispatch_actually_routes(ggml_backend_t cpu) {
+    const int W = 8, H = 1;
+    std::vector<float> input((size_t) W * H);
+    for (int i = 0; i < W; ++i) input[i] = (float) i - 3.5f;
+
+    auto count_nodes = [&](ggml_tensor * (*build)(ggml_context *, ggml_tensor *, float)) {
+        constexpr int MAX_NODES = 64;
+        const size_t buf_size = ggml_tensor_overhead() * MAX_NODES +
+                                ggml_graph_overhead();
+        std::vector<uint8_t> buf(buf_size);
+        ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+        ggml_context * ctx = ggml_init(p);
+        ggml_cgraph * gf = ggml_new_graph(ctx);
+
+        ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, W, H);
+        ggml_set_name(x, "x"); ggml_set_input(x);
+        ggml_tensor * y = build(ctx, x, 0.1f);
+        ggml_set_name(y, "y"); ggml_set_output(y);
+        ggml_build_forward_expand(gf, y);
+
+        int n = ggml_graph_n_nodes(gf);
+        ggml_free(ctx);
+        (void) cpu;
+        return n;
+    };
+
+    supertonic_model cpu_model;
+    cpu_model.backend_is_cpu        = true;
+    cpu_model.use_native_leaky_relu = true;
+    supertonic_model gpu_model;
+    gpu_model.backend_is_cpu = false;
+    // QVAC-18605 — explicit "no native LEAKY_RELU" GPU model so the
+    // decomposition branch fires.  Vulkan / Metal / CUDA models pick
+    // the fused builtin via `use_native_leaky_relu = true` (set at
+    // load time by `backend_supports_native_leaky_relu`); this test
+    // asserts the conservative-fallback path that plain upstream
+    // ggml-opencl + any future backend without `LEAKY_RELU` exercises.
+    gpu_model.use_native_leaky_relu = false;
+
+    int n_ref = 0;
+    int n_portable_cpu = 0;
+    int n_portable_gpu = 0;
+    {
+        n_ref = count_nodes(build_reference);
+    }
+    {
+        supertonic_op_dispatch_scope scope(cpu_model);
+        n_portable_cpu = count_nodes(build_portable);
+    }
+    {
+        supertonic_op_dispatch_scope scope(gpu_model);
+        n_portable_gpu = count_nodes(build_portable);
+    }
+
+    std::fprintf(stderr,
+                 "  [dispatch routing] ref=%d  portable(cpu)=%d  portable(gpu)=%d\n",
+                 n_ref, n_portable_cpu, n_portable_gpu);
+
+    // Reference is the fused builtin: exactly one op.
+    CHECK(n_ref == 1);
+    // Portable on the CPU dispatch picks the same fused builtin too,
+    // so the node count must match the reference.
+    CHECK(n_portable_cpu == n_ref);
+    // Portable on the GPU dispatch decomposes into RELU + SCALE +
+    // SCALE + ADD = 4 ops.  Asserting equality here would couple
+    // the test to today's exact lowering; assert "strictly more
+    // than 1" instead so a future fused-but-still-portable
+    // rewrite stays green.
+    CHECK(n_portable_gpu > n_ref);
+}
+
+} // namespace
+
+int main() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "ggml_backend_cpu_init failed\n");
+        return 1;
+    }
+
+    test_leaky_relu_signs(cpu);
+    test_dispatch_actually_routes(cpu);
+
+    ggml_backend_free(cpu);
+
+    std::fprintf(stderr,
+                 "test_supertonic_portable_ops: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_profile_csv.cpp b/tts-cpp/test/test_supertonic_profile_csv.cpp
new file mode 100644
index 00000000000..780fc376e76
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_profile_csv.cpp
@@ -0,0 +1,267 @@
+// TDD harness for Phase 2D — `SUPERTONIC_PROFILE_CSV` machine-
+// readable timing emitter.
+//
+// Background:
+//   Each Supertonic stage already emits human-readable profile
+//   timing to stderr when its per-stage env var is set
+//   (`SUPERTONIC_VECTOR_PROFILE`, `SUPERTONIC_VOCODER_PROFILE`,
+//   `SUPERTONIC_TEXT_PROFILE`).  Those are great for eyeballing
+//   what just happened on a single run but useless for the next
+//   optimization round — we need a stable schema that a small
+//   Python script can ingest, group by (stage, island), and
+//   surface as "top 10 hot spots by p95 latency" over a 100-synth
+//   benchmark.  This change adds `SUPERTONIC_PROFILE_CSV=PATH`
+//   that hooks into the same call sites and emits one row per
+//   `supertonic_graph_compute` invocation.
+//
+// Schema (one header row, then one data row per compute call):
+//
+//   stage,island,step,wall_ms,unix_us
+//   vector,attn0_flash,0,1.234,1715517000123456
+//   vector,style0_residual,0,0.412,1715517000125678
+//   ...
+//
+// The unit harness here verifies the writer mechanics without
+// requiring a model load.  It:
+//
+//   1. Points `SUPERTONIC_PROFILE_CSV` at a temp file.
+//   2. Calls `supertonic_profile_csv_record(...)` for a handful
+//      of synthetic rows.
+//   3. Calls `supertonic_profile_csv_flush()` to force the
+//      buffered writes to disk.
+//   4. Reopens the file and parses each row.
+//   5. Asserts the header is correct, the row count + ordering
+//      matches what was recorded, and the per-field types are
+//      well-formed (numeric where they should be).
+//
+// Registered with `LABEL "unit"` in CMakeLists.txt — no GGUF
+// required.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <fstream>
+#include <sstream>
+#include <string>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Split a CSV row on commas.  Pragmatic, doesn't handle quoting —
+// the emitter's schema doesn't use commas in any field.
+std::vector<std::string> split_csv(const std::string & line) {
+    std::vector<std::string> out;
+    std::string cur;
+    for (char c : line) {
+        if (c == ',') {
+            out.push_back(cur);
+            cur.clear();
+        } else {
+            cur.push_back(c);
+        }
+    }
+    out.push_back(cur);
+    return out;
+}
+
+bool is_numeric(const std::string & s) {
+    if (s.empty()) return false;
+    bool seen_digit = false;
+    bool seen_dot = false;
+    for (size_t i = 0; i < s.size(); ++i) {
+        char c = s[i];
+        if (c == '-' && i == 0) continue;
+        if (c >= '0' && c <= '9') { seen_digit = true; continue; }
+        if (c == '.' && !seen_dot) { seen_dot = true; continue; }
+        return false;
+    }
+    return seen_digit;
+}
+
+std::vector<std::string> read_lines(const std::string & path) {
+    std::vector<std::string> out;
+    std::ifstream f(path);
+    if (!f.good()) return out;
+    std::string line;
+    while (std::getline(f, line)) out.push_back(line);
+    return out;
+}
+
+// Test 1 — Disabled by default.
+//
+// With `SUPERTONIC_PROFILE_CSV` unset, recording must be a no-op:
+// any subsequent `record` call returns without touching disk, and
+// `flush` is similarly inert.  Otherwise the env-gated overhead
+// would land in every production synth.
+void test_disabled_by_default() {
+    std::fprintf(stderr, "[Phase 2D disabled-by-default]\n");
+    // Make absolutely sure the env var isn't set from the parent
+    // shell (CI hygiene).
+#if defined(_WIN32)
+    _putenv_s("SUPERTONIC_PROFILE_CSV", "");
+#else
+    unsetenv("SUPERTONIC_PROFILE_CSV");
+#endif
+    // No env var, no path-set.  Recording is a no-op.
+    supertonic_profile_csv_record("vector", "attn0_flash", /*step=*/0, /*wall_ms=*/1.0);
+    supertonic_profile_csv_flush();
+    CHECK(!supertonic_profile_csv_enabled());
+}
+
+// Test 2 — End-to-end round-trip via the explicit path override.
+//
+// Points the emitter at a temp file (via the test-only
+// `_set_path` helper that bypasses the env-var probe), records a
+// few rows, flushes, then re-reads the file to verify the
+// schema + values.  Avoids touching the parent process env state
+// to keep the test thread-safe against other unit tests.
+void test_csv_round_trip() {
+    std::fprintf(stderr, "[Phase 2D CSV round-trip]\n");
+
+    // Ask `std::tmpnam` for a fresh path so multiple concurrent
+    // ctest runs don't collide.  The name is not reserved, so a
+    // race with another process is possible but acceptable for a
+    // unit test.  Our CI matrix runs the unit label on Linux +
+    // macOS, where the path lands in a system temp directory.
+    char path_buf[L_tmpnam];
+    if (!std::tmpnam(path_buf)) {
+        std::fprintf(stderr, "  SKIP: tmpnam failed\n");
+        return;
+    }
+    const std::string path = path_buf;
+    supertonic_profile_csv_set_path(path.c_str());
+    CHECK(supertonic_profile_csv_enabled());
+
+    // Record a few rows that exercise the schema:
+    //   - vector stage with a step != 0.
+    //   - vocoder stage with step = 0.
+    //   - text stage with negative step (sentinel for "not a
+    //     denoise step" — emitter should still accept and emit).
+    supertonic_profile_csv_record("vector",  "attn0_flash",       0,  1.234);
+    supertonic_profile_csv_record("vector",  "style0_residual",   0,  0.412);
+    supertonic_profile_csv_record("vector",  "attn0_flash",       1,  1.198);
+    supertonic_profile_csv_record("vocoder", "compute",           0, 42.0);
+    supertonic_profile_csv_record("text",    "convnext_front",   -1,  6.7);
+    supertonic_profile_csv_flush();
+
+    // Read it back.
+    auto lines = read_lines(path);
+    CHECK(lines.size() == 6); // header + 5 data rows
+
+    if (lines.size() >= 1) {
+        // Header row.  Exact order matters because the analysis
+        // script keys columns by position, not name.
+        const std::string expected_header = "stage,island,step,wall_ms,unix_us";
+        CHECK(lines[0] == expected_header);
+    }
+
+    if (lines.size() >= 6) {
+        // Per-row checks.
+        struct Expected {
+            const char * stage;
+            const char * island;
+            int          step;
+            double       wall_ms;
+        };
+        const Expected expected[] = {
+            { "vector",  "attn0_flash",      0,  1.234 },
+            { "vector",  "style0_residual",  0,  0.412 },
+            { "vector",  "attn0_flash",      1,  1.198 },
+            { "vocoder", "compute",          0, 42.0   },
+            { "text",    "convnext_front",  -1,  6.7   },
+        };
+        for (int i = 0; i < 5; ++i) {
+            auto cols = split_csv(lines[i + 1]);
+            CHECK(cols.size() == 5);
+            if (cols.size() != 5) continue;
+
+            CHECK(cols[0] == expected[i].stage);
+            CHECK(cols[1] == expected[i].island);
+            CHECK(std::atoi(cols[2].c_str()) == expected[i].step);
+
+            // wall_ms is a double; tolerate the emitter's print
+            // formatting (e.g. "%.3f" rounding).  Use parse +
+            // numeric tolerance instead of string match.
+            CHECK(is_numeric(cols[3]));
+            const double parsed = std::atof(cols[3].c_str());
+            const double err    = std::abs(parsed - expected[i].wall_ms);
+            CHECK(err <= 0.01); // 10 µs slack for "%.3f"-style formatting
+
+            // unix_us is opaque to us — emitter records the wall
+            // clock at record time — but must be numeric and
+            // non-negative.
+            CHECK(is_numeric(cols[4]));
+            const long long us = std::atoll(cols[4].c_str());
+            CHECK(us >= 0);
+        }
+    }
+
+    // Disable + clean up.
+    supertonic_profile_csv_set_path(nullptr);
+    CHECK(!supertonic_profile_csv_enabled());
+    std::remove(path.c_str());
+}
+
+// Test 3 — Multiple records appended, not overwritten.
+//
+// Re-enabling the same path and recording more rows must append
+// to the existing file (not truncate it).  This matches the
+// expected pattern: a bench harness runs many synths with the
+// env var set, and the CSV accumulates one row per
+// `supertonic_graph_compute` call across the whole run.
+void test_append_semantics() {
+    std::fprintf(stderr, "[Phase 2D append semantics]\n");
+    char path_buf[L_tmpnam];
+    if (!std::tmpnam(path_buf)) { std::fprintf(stderr, "  SKIP\n"); return; }
+    const std::string path = path_buf;
+
+    supertonic_profile_csv_set_path(path.c_str());
+    supertonic_profile_csv_record("vector", "x", 0, 1.0);
+    supertonic_profile_csv_flush();
+    supertonic_profile_csv_set_path(nullptr); // close
+
+    supertonic_profile_csv_set_path(path.c_str()); // reopen
+    supertonic_profile_csv_record("vector", "x", 1, 2.0);
+    supertonic_profile_csv_flush();
+    supertonic_profile_csv_set_path(nullptr);
+
+    auto lines = read_lines(path);
+    // One header + two data rows.  Re-opening must NOT re-write
+    // the header (or the analysis script will trip on it).
+    CHECK(lines.size() == 3);
+    if (lines.size() >= 3) {
+        CHECK(lines[0] == "stage,island,step,wall_ms,unix_us");
+        CHECK(split_csv(lines[1]).size() == 5 && split_csv(lines[1])[2] == "0");
+        CHECK(split_csv(lines[2]).size() == 5 && split_csv(lines[2])[2] == "1");
+    }
+    std::remove(path.c_str());
+}
+
+} // namespace
+
+int main() {
+    test_disabled_by_default();
+    test_csv_round_trip();
+    test_append_semantics();
+
+    std::fprintf(stderr,
+                 "test_supertonic_profile_csv: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_rope_in_graph.cpp b/tts-cpp/test/test_supertonic_rope_in_graph.cpp
new file mode 100644
index 00000000000..c5861fcc343
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_rope_in_graph.cpp
@@ -0,0 +1,371 @@
+// TDD harness for the audit follow-up #4 RoPE-in-graph helper
+// (F20 partial, Phase 2H in `aiDocs/PLAN_SUPERTONIC_OPENCL.md`).
+//
+// Background
+// ----------
+// The vector estimator's `apply_rope` is the last hot-path op
+// still running on the CPU between two GPU graph computes.  Every
+// per-step / per-attention-site sequence is:
+//
+//     QKV graph compute  → host download Q,K
+//     CPU apply_rope on Q (40 calls / synth on the default
+//                         5-step × 4-group + 1-front-block schedule)
+//     CPU apply_rope on K
+//     host upload Q,K  →  flash-attention graph compute
+//
+// Supertonic's `apply_rope` is non-standard:
+//
+//     angle = (t / L) * theta[d]          // ← `t/L`, not `t * base^(-2i/D)`
+//     cs = cos(angle), sn = sin(angle)
+//     i1 = (t*H + h)*D + d                // d in [0, half)
+//     i2 = (t*H + h)*D + half + d
+//     x[i1], x[i2] := x[i1]*cs - x[i2]*sn,
+//                     x[i2]*cs + x[i1]*sn
+//
+// `ggml_rope` / `ggml_rope_ext` compute their own θ from
+// `(position, base, freq_scale)` — they CAN'T match this formula
+// directly because the angle scales with `t/L` (position fraction
+// of total length, not absolute position).  The partial F20 that
+// lands here is the host-precomputed-cos/sin variant:
+//
+//   1. Host precomputes `cos[half, L] = cos((t/L) * theta[d])`
+//      and `sin[half, L]` once per (L, θ) and uploads as graph
+//      inputs.
+//   2. `apply_rope_in_graph(ctx, x, cos_table, sin_table)` runs
+//      the rotation entirely with universally-supported ops
+//      (`view`, `repeat`, `mul`, `sub`, `add`, `concat`) — no
+//      patched `ggml_sin` / `ggml_cos` / `ggml_rope` needed, so
+//      it runs on baseline upstream OpenCL too.
+//
+// Test contract
+// -------------
+// Compute two results from the same synthetic Q:
+//   A. Reference: host scalar apply_rope applied to a copy of the
+//      unrotated input; no graph is involved on this side.
+//   B. In-graph: input + cos/sin inputs → `apply_rope_in_graph`
+//      on the CPU backend → download.
+//
+// Then assert B == A within F32 tolerance.  Bit-exact is too
+// tight (cos/sin precision + add-order rounding) — chatterbox's
+// CHATTERBOX_F16_CFM ships at `1e-3` abs; we use `1e-4` here for
+// the CPU backend (F32 throughout, only round-order drift).
+//
+// Registered with `LABEL "unit"` — no GGUF required.
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <random>
+#include <stdexcept>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Scalar reference: matches the in-tree `apply_rope` exactly so
+// any divergence between in-graph and reference is a real
+// regression, not a "different RoPE formula" mismatch.  Kept
+// here as a private copy so the test stays self-contained — the
+// production scalar function lives in an anonymous `namespace {}`
+// in `supertonic_vector_estimator.cpp`, so it has internal linkage
+// and isn't reachable from this TU.
+void scalar_apply_rope(const float * theta,
+                       std::vector<float> & x,
+                       int L, int H, int D) {
+    int half = D / 2;
+    for (int h = 0; h < H; ++h) {
+        for (int t = 0; t < L; ++t) {
+            for (int d = 0; d < half; ++d) {
+                const float angle = ((float) t / (float) L) * theta[d];
+                const float cs = std::cos(angle);
+                const float sn = std::sin(angle);
+                const size_t i1 = ((size_t) t * H + h) * D + d;
+                const size_t i2 = ((size_t) t * H + h) * D + half + d;
+                const float a = x[i1];
+                const float b = x[i2];
+                x[i1] = a * cs - b * sn;
+                x[i2] = b * cs + a * sn;
+            }
+        }
+    }
+}
+
+// Test 1 — Parity vs. scalar reference on a realistic
+// vector-estimator attention shape (q_len = 20, n_heads = 4,
+// head_dim = 64).  Tolerance 1e-4 absolute.
+void test_rope_parity_vector_estimator_shape() {
+    std::fprintf(stderr, "[apply_rope_in_graph: vector-estimator shape]\n");
+
+    const int q_len    = 20;
+    const int n_heads  = 4;
+    const int head_dim = 64;
+    const int half     = head_dim / 2;
+
+    std::mt19937 rng(0xC0DE);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    std::vector<float> theta(half);
+    for (auto & v : theta) v = std::abs(dist(rng)) * 1000.0f; // RoPE θ is positive, model-typical range
+
+    std::vector<float> x_host((size_t) q_len * n_heads * head_dim);
+    for (auto & v : x_host) v = dist(rng);
+
+    // Reference: scalar apply_rope on host copy.
+    std::vector<float> ref = x_host;
+    scalar_apply_rope(theta.data(), ref, q_len, n_heads, head_dim);
+
+    // Host-precompute cos / sin tables: ne=[half, L].  Element
+    // (d, t) at offset t*half + d so the natural row-major upload
+    // matches the GGML tensor's ne[0]=half (inner) layout.
+    std::vector<float> cos_host((size_t) q_len * half);
+    std::vector<float> sin_host((size_t) q_len * half);
+    for (int t = 0; t < q_len; ++t) {
+        for (int d = 0; d < half; ++d) {
+            const float angle = ((float) t / (float) q_len) * theta[d];
+            cos_host[(size_t) t * half + d] = std::cos(angle);
+            sin_host[(size_t) t * half + d] = std::sin(angle);
+        }
+    }
+
+    // Build the in-graph rotation graph.
+    constexpr int MAX_NODES = 256;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    // x has ne=[head_dim, n_heads, L] in GGML order, matching the
+    // scalar layout's memory pattern data[t*H*D + h*D + d].  GGML
+    // ne[0] is innermost; with the data laid out as in `ref` /
+    // `x_host`, element (d, h, t) is at data[t*H*D + h*D + d].
+    // Strides: nb=[4, 4*D, 4*D*H].
+    ggml_tensor * x_in = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, head_dim, n_heads, q_len);
+    ggml_set_name(x_in, "x_in"); ggml_set_input(x_in);
+    ggml_tensor * cos_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, q_len);
+    ggml_set_name(cos_in, "cos_in"); ggml_set_input(cos_in);
+    ggml_tensor * sin_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, q_len);
+    ggml_set_name(sin_in, "sin_in"); ggml_set_input(sin_in);
+
+    ggml_tensor * y = apply_rope_in_graph(ctx, x_in, cos_in, sin_in);
+    ggml_set_name(y, "y"); ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    // Run on CPU backend.
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "  SKIP: ggml_backend_cpu_init failed\n");
+        ggml_free(ctx);
+        return;
+    }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    ggml_gallocr_reserve(allocr, gf);
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    ggml_backend_tensor_set(x_in,   x_host.data(),   0, x_host.size()   * sizeof(float));
+    ggml_backend_tensor_set(cos_in, cos_host.data(), 0, cos_host.size() * sizeof(float));
+    ggml_backend_tensor_set(sin_in, sin_host.data(), 0, sin_host.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+
+    std::vector<float> got((size_t) ggml_nelements(y));
+    ggml_backend_tensor_get(y, got.data(), 0, got.size() * sizeof(float));
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    // Compare.
+    int bad = 0;
+    float max_abs = 0.0f;
+    const float atol = 1e-4f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > atol) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  mismatch @ %zu: ref=%.6g got=%.6g abs=%.3e\n",
+                             i, ref[i], got[i], d);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  shape q_len=%d H=%d D=%d  max_abs_err=%.3e  bad=%d / %zu\n",
+                 q_len, n_heads, head_dim, max_abs, bad, ref.size());
+    CHECK(bad == 0 && got.size() == ref.size());
+}
+
+// Test 2 — Different L (kv_len style: text_len = 32) to confirm
+// the helper isn't accidentally hard-coded to a single length.
+void test_rope_parity_text_len_shape() {
+    std::fprintf(stderr, "[apply_rope_in_graph: kv-len shape]\n");
+
+    const int kv_len   = 32;   // text_len = ~30 in real synth
+    const int n_heads  = 4;
+    const int head_dim = 64;
+    const int half     = head_dim / 2;
+
+    std::mt19937 rng(0xBEEF);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    std::vector<float> theta(half);
+    for (auto & v : theta) v = std::abs(dist(rng)) * 1000.0f;
+
+    std::vector<float> x_host((size_t) kv_len * n_heads * head_dim);
+    for (auto & v : x_host) v = dist(rng);
+
+    std::vector<float> ref = x_host;
+    scalar_apply_rope(theta.data(), ref, kv_len, n_heads, head_dim);
+
+    std::vector<float> cos_host((size_t) kv_len * half);
+    std::vector<float> sin_host((size_t) kv_len * half);
+    for (int t = 0; t < kv_len; ++t) {
+        for (int d = 0; d < half; ++d) {
+            const float angle = ((float) t / (float) kv_len) * theta[d];
+            cos_host[(size_t) t * half + d] = std::cos(angle);
+            sin_host[(size_t) t * half + d] = std::sin(angle);
+        }
+    }
+
+    constexpr int MAX_NODES = 256;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    ggml_tensor * x_in = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, head_dim, n_heads, kv_len);
+    ggml_set_name(x_in, "x_in"); ggml_set_input(x_in);
+    ggml_tensor * cos_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, kv_len);
+    ggml_set_name(cos_in, "cos_in"); ggml_set_input(cos_in);
+    ggml_tensor * sin_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, kv_len);
+    ggml_set_name(sin_in, "sin_in"); ggml_set_input(sin_in);
+
+    ggml_tensor * y = apply_rope_in_graph(ctx, x_in, cos_in, sin_in);
+    ggml_set_name(y, "y"); ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) { ggml_free(ctx); std::fprintf(stderr, "  SKIP\n"); return; }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    ggml_gallocr_reserve(allocr, gf);
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    ggml_backend_tensor_set(x_in,   x_host.data(),   0, x_host.size()   * sizeof(float));
+    ggml_backend_tensor_set(cos_in, cos_host.data(), 0, cos_host.size() * sizeof(float));
+    ggml_backend_tensor_set(sin_in, sin_host.data(), 0, sin_host.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+
+    std::vector<float> got((size_t) ggml_nelements(y));
+    ggml_backend_tensor_get(y, got.data(), 0, got.size() * sizeof(float));
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    int bad = 0;
+    float max_abs = 0.0f;
+    const float atol = 1e-4f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > atol) ++bad;
+    }
+    std::fprintf(stderr,
+                 "  shape kv_len=%d H=%d D=%d  max_abs_err=%.3e  bad=%d / %zu\n",
+                 kv_len, n_heads, head_dim, max_abs, bad, ref.size());
+    CHECK(bad == 0);
+}
+
+// Test 3 — Identity check: when θ is all zeros (degenerate), the
+// rotation is the identity and output must equal input exactly
+// (no F32 drift since cos(0)=1, sin(0)=0).  Catches a regression
+// where the lower/upper split + concat path accidentally permutes
+// the channel axis.
+void test_rope_identity_zero_theta() {
+    std::fprintf(stderr, "[apply_rope_in_graph: zero-θ identity]\n");
+
+    const int q_len    = 8;
+    const int n_heads  = 2;
+    const int head_dim = 8;
+    const int half     = head_dim / 2;
+
+    std::mt19937 rng(0xDEAD);
+    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
+    std::vector<float> x_host((size_t) q_len * n_heads * head_dim);
+    for (auto & v : x_host) v = dist(rng);
+
+    // θ = 0 → all angles are 0 → cos=1, sin=0 → output = input.
+    std::vector<float> cos_host((size_t) q_len * half, 1.0f);
+    std::vector<float> sin_host((size_t) q_len * half, 0.0f);
+
+    constexpr int MAX_NODES = 64;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    ggml_tensor * x_in = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, head_dim, n_heads, q_len);
+    ggml_set_name(x_in, "x_in"); ggml_set_input(x_in);
+    ggml_tensor * cos_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, q_len);
+    ggml_set_name(cos_in, "cos_in"); ggml_set_input(cos_in);
+    ggml_tensor * sin_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, q_len);
+    ggml_set_name(sin_in, "sin_in"); ggml_set_input(sin_in);
+
+    ggml_tensor * y = apply_rope_in_graph(ctx, x_in, cos_in, sin_in);
+    ggml_set_name(y, "y"); ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) { ggml_free(ctx); std::fprintf(stderr, "  SKIP\n"); return; }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    ggml_gallocr_reserve(allocr, gf);
+    ggml_gallocr_alloc_graph(allocr, gf);
+    ggml_backend_tensor_set(x_in,   x_host.data(),   0, x_host.size()   * sizeof(float));
+    ggml_backend_tensor_set(cos_in, cos_host.data(), 0, cos_host.size() * sizeof(float));
+    ggml_backend_tensor_set(sin_in, sin_host.data(), 0, sin_host.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+    std::vector<float> got((size_t) ggml_nelements(y));
+    ggml_backend_tensor_get(y, got.data(), 0, got.size() * sizeof(float));
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    int bad = 0;
+    for (size_t i = 0; i < x_host.size() && i < got.size(); ++i) {
+        if (x_host[i] != got[i]) ++bad;
+    }
+    std::fprintf(stderr, "  identity bad=%d / %zu\n", bad, x_host.size());
+    CHECK(bad == 0);
+}
+
+} // namespace
+
+int main() {
+    test_rope_parity_vector_estimator_shape();
+    test_rope_parity_text_len_shape();
+    test_rope_identity_zero_theta();
+
+    std::fprintf(stderr,
+                 "test_supertonic_rope_in_graph: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_rope_packed_qk.cpp b/tts-cpp/test/test_supertonic_rope_packed_qk.cpp
new file mode 100644
index 00000000000..6b6f37f58eb
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_rope_packed_qk.cpp
@@ -0,0 +1,367 @@
+// QVAC-18966 — CPU regression fix for `apply_rope_to_packed_qk`
+// (also covers the Vulkan / OpenCL synth-path regression on this
+// branch — same root cause; rounds 8 / 9's GPU bridges only run
+// past round 11 once this helper produces the right shape).
+//
+// Background
+// ----------
+// `apply_rope_to_packed_qk` is the layout adapter between the
+// natural `ne=[head_dim, n_heads, L]` contract of
+// `apply_rope_in_graph` (PR #4) and the **production** call sites'
+// Q/K-producing matmul output.  Both PR #16 ("RoPE in-graph
+// integration F23") and rounds 8 / 9 (front-block + style GPU
+// bridges) plumb the result of this helper through to
+// `vector_text_attention_cache::q_tc_in` via either
+// `ggml_backend_tensor_copy` (GPU bridge, production) or
+// `ggml_backend_tensor_set` from a host vector (legacy bridge,
+// trace-mode + non-RoPE GGUFs).
+//
+// The original test (PR #16, follow-up #5) built Q under a
+// `ne=[H*D, L]` "channel-fastest-in-memory" assumption.  That
+// matched the helper's INTERNAL layout assumption (view-as-
+// `[D, H, L]` with `nb=[elem, D*elem, HD*elem]`), but it
+// CONTRADICTED what `dense_matmul_time_ggml` actually produces:
+// every Q/K matmul site in the vector estimator hands the helper
+// a tensor with `ne=[L, HD]` (axis 0 = L = time-fastest along
+// natural strides), so memory layout is **channel-major-flat**
+// (`data[t + c*L]`) — the transpose of what the helper expects.
+//
+// On any backend (CPU, OpenCL, Vulkan), the synth path therefore
+// either:
+//   - Crashes on the helper's `GGML_ASSERT(HD == n_heads *
+//     head_dim)` (the new assertion catches the shape mismatch
+//     before the view trick produces garbage), OR
+//   - Pre-assertion, would have produced TRANSPOSED bytes and
+//     silently fed wrong-layout Q / K into
+//     `ggml_flash_attn_ext`.
+//
+// This test reproduces the real production layout end-to-end on
+// the CPU backend (which has no probe-gating and no per-backend
+// kernel paths to confuse the picture) and verifies the helper:
+//   1. Accepts `ne=[L, HD]` matmul-shaped Q without aborting.
+//   2. Returns post-rotation bytes in the **time-major-flat**
+//      layout (`out[t*HD + c]`) that:
+//        - Matches the scalar `apply_rope(theta, x, L, H, D)`
+//          reference (the SOLE source of truth — every host-side
+//          comparison in the codebase indexes through `t*H*D +
+//          h*D + d` flat).
+//        - Can be uploaded byte-for-byte into
+//          `q_tc_in = ggml_new_tensor_2d(F32, A, L)` whose
+//          natural strides are `nb=[elem, A*elem]` → same flat
+//          layout `data[c + t*A]`.
+//
+// The L=1 trip-wire is kept (catches a future regression where
+// the helper silently divides by L or swaps the angle formula).
+//
+// Registered with `LABEL "unit"` — no GGUF required.
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <random>
+#include <stdexcept>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Mirror of the in-tree scalar `apply_rope` (private to
+// supertonic_vector_estimator.cpp).  Indexes a single flat buffer
+// as `data[t*H*D + h*D + d]` — the time-major-flat layout every
+// scalar comparison in the vector estimator uses (and the layout
+// `q_tc_in` reads via `ggml_backend_tensor_copy` of
+// `ggml_nbytes(q_tc_in)` bytes).
+void scalar_apply_rope(const float * theta,
+                       std::vector<float> & x,
+                       int L, int H, int D) {
+    int half = D / 2;
+    for (int h = 0; h < H; ++h) {
+        for (int t = 0; t < L; ++t) {
+            for (int d = 0; d < half; ++d) {
+                const float angle = ((float) t / (float) L) * theta[d];
+                const float cs = std::cos(angle);
+                const float sn = std::sin(angle);
+                const size_t i1 = ((size_t) t * H + h) * D + d;
+                const size_t i2 = ((size_t) t * H + h) * D + half + d;
+                const float a = x[i1];
+                const float b = x[i2];
+                x[i1] = a * cs - b * sn;
+                x[i2] = b * cs + a * sn;
+            }
+        }
+    }
+}
+
+// Run `apply_rope_to_packed_qk` on a Q with the production matmul
+// shape ne=[L, HD] (channel-major-flat memory `data[t + c*L]`)
+// and verify the rotated output matches the scalar reference's
+// time-major-flat layout (`out[t*HD + c]`) element-wise, within
+// atol = 1e-4, on the CPU backend.
+//
+// Production-layout parity test (matches `dense_matmul_time_ggml`
+// output on every backend).  Reference is built in time-major-
+// flat layout; upload transposes to channel-major-flat so the
+// graph input matches matmul's contract bit-for-bit.  Scalar
+// apply_rope is applied in-place on the time-major-flat buffer,
+// then compared to the helper's downloaded bytes.  Helper must
+// produce bytes in time-major-flat layout so:
+//   - `ggml_backend_tensor_copy(q_rope, q_tc_in)` blits matching
+//     bytes (q_tc_in has the same `ne=[HD, L]` natural layout).
+//   - The legacy host-bridge path's `tensor_raw_f32` download
+//     yields a `std::vector<float>` indexable as `out[t*HD + c]`.
+void test_production_layout(const char * label, int L, int n_heads, int head_dim,
+                            unsigned seed) {
+    std::fprintf(stderr,
+                 "[apply_rope_to_packed_qk production layout: %s]  "
+                 "L=%d H=%d D=%d  (matmul ne=[L, HD])\n",
+                 label, L, n_heads, head_dim);
+
+    const int HD = n_heads * head_dim;
+    const int half = head_dim / 2;
+
+    std::mt19937 rng(seed);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+
+    std::vector<float> theta(half);
+    for (auto & v : theta) v = std::abs(dist(rng)) * 1000.0f;
+
+    // Reference: time-major-flat buffer `ref[t*HD + c]`.  Random
+    // init.  This is the source of truth — `scalar_apply_rope`
+    // indexes through `(t*H + h)*D + d` = `t*HD + (h*D + d)`.
+    std::vector<float> ref((size_t) L * HD);
+    for (auto & v : ref) v = dist(rng);
+
+    // Transpose to channel-major-flat for upload to a tensor with
+    // ne=[L, HD] (natural strides nb=[elem, L*elem]).  Element
+    // (t, c) in matmul layout lives at flat index `t + c*L` —
+    // contiguous in t for fixed c.
+    std::vector<float> q_in_buf((size_t) L * HD);
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < HD; ++c) {
+            q_in_buf[(size_t) t + (size_t) c * L] =
+                ref[(size_t) t * HD + c];
+        }
+    }
+
+    // Scalar reference in-place rotation on the time-major-flat
+    // buffer.
+    scalar_apply_rope(theta.data(), ref, L, n_heads, head_dim);
+
+    // Cos/sin tables exactly like `make_rope_cos_sin_tables`
+    // writes.
+    std::vector<float> cos_host, sin_host;
+    make_rope_cos_sin_tables(theta.data(), L, half, cos_host, sin_host);
+
+    // Build the graph on the CPU backend.  Max nodes generous
+    // for the transpose + cont + view chain inside the helper.
+    constexpr int MAX_NODES = 512;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), /*no_alloc=*/true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    // q input with the production matmul shape.  ne=[L, HD]
+    // explicitly DIFFERENT from the pre-fix test's ne=[HD, L].
+    ggml_tensor * q_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, HD);
+    ggml_set_name(q_in, "q_in"); ggml_set_input(q_in);
+    ggml_tensor * cos_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, L);
+    ggml_set_name(cos_in, "cos_in"); ggml_set_input(cos_in);
+    ggml_tensor * sin_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, L);
+    ggml_set_name(sin_in, "sin_in"); ggml_set_input(sin_in);
+
+    ggml_tensor * y = apply_rope_to_packed_qk(ctx, q_in, cos_in, sin_in,
+                                              n_heads, head_dim);
+    ggml_set_name(y, "y"); ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    // Output-shape contract.  The helper MUST produce ne=[HD, L]
+    // (axis 0 = HD = channels-fastest, axis 1 = L = time-slowest)
+    // for `ggml_backend_tensor_copy(y, q_tc_in)` to hit the
+    // matching shape in `vector_text_attention_cache::q_tc_in`.
+    CHECK((int) y->ne[0] == HD);
+    CHECK((int) y->ne[1] == L);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "  SKIP: ggml_backend_cpu_init failed\n");
+        ggml_free(ctx);
+        return;
+    }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    ggml_gallocr_reserve(allocr, gf);
+    ggml_gallocr_alloc_graph(allocr, gf);
+
+    ggml_backend_tensor_set(q_in,   q_in_buf.data(), 0, q_in_buf.size() * sizeof(float));
+    ggml_backend_tensor_set(cos_in, cos_host.data(), 0, cos_host.size() * sizeof(float));
+    ggml_backend_tensor_set(sin_in, sin_host.data(), 0, sin_host.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+
+    std::vector<float> got((size_t) ggml_nelements(y));
+    ggml_backend_tensor_get(y, got.data(), 0, got.size() * sizeof(float));
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    // Memory-layout contract: helper's output bytes should equal
+    // scalar reference's time-major-flat bytes element-wise.
+    CHECK(got.size() == ref.size());
+
+    int bad = 0;
+    float max_abs = 0.0f;
+    const float atol = 1e-4f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > atol) {
+            if (bad < 4) {
+                std::fprintf(stderr,
+                             "  mismatch @ %zu: ref=%.6g got=%.6g abs=%.3e\n",
+                             i, ref[i], got[i], d);
+            }
+            ++bad;
+        }
+    }
+    std::fprintf(stderr,
+                 "  max_abs_err=%.3e  bad=%d / %zu\n",
+                 max_abs, bad, ref.size());
+    CHECK(bad == 0);
+}
+
+// L=1 trip-wire (preserved from the original test).  At L=1 the
+// angle is 0/1 * theta = 0, so cos=1, sin=0 and rotation is the
+// identity.  Catches a regression where the helper accidentally
+// divides by L or swaps the angle formula.  Re-cast under the
+// production ne=[L, HD] contract.
+void test_production_layout_l1() {
+    std::fprintf(stderr,
+                 "[apply_rope_to_packed_qk production layout: L=1 degenerate]\n");
+    const int L = 1, n_heads = 2, head_dim = 8;
+    const int HD = n_heads * head_dim;
+    const int half = head_dim / 2;
+
+    std::vector<float> theta(half, 100.0f);
+
+    // Time-major-flat reference; channel-major-flat upload.
+    std::vector<float> ref((size_t) L * HD, 1.0f);
+    std::vector<float> q_in_buf((size_t) L * HD);
+    for (int t = 0; t < L; ++t) {
+        for (int c = 0; c < HD; ++c) {
+            q_in_buf[(size_t) t + (size_t) c * L] =
+                ref[(size_t) t * HD + c];
+        }
+    }
+    // Identity rotation at L=1.
+    scalar_apply_rope(theta.data(), ref, L, n_heads, head_dim);
+
+    std::vector<float> cos_host, sin_host;
+    make_rope_cos_sin_tables(theta.data(), L, half, cos_host, sin_host);
+
+    constexpr int MAX_NODES = 128;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_cgraph * gf = ggml_new_graph(ctx);
+
+    ggml_tensor * q_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, HD);
+    ggml_set_input(q_in);
+    ggml_tensor * cos_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, L);
+    ggml_set_input(cos_in);
+    ggml_tensor * sin_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, L);
+    ggml_set_input(sin_in);
+
+    ggml_tensor * y = apply_rope_to_packed_qk(ctx, q_in, cos_in, sin_in,
+                                              n_heads, head_dim);
+    ggml_set_output(y);
+    ggml_build_forward_expand(gf, y);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) { ggml_free(ctx); std::fprintf(stderr, "  SKIP\n"); return; }
+    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cpu));
+    ggml_gallocr_reserve(allocr, gf);
+    ggml_gallocr_alloc_graph(allocr, gf);
+    ggml_backend_tensor_set(q_in,   q_in_buf.data(), 0, q_in_buf.size() * sizeof(float));
+    ggml_backend_tensor_set(cos_in, cos_host.data(), 0, cos_host.size() * sizeof(float));
+    ggml_backend_tensor_set(sin_in, sin_host.data(), 0, sin_host.size() * sizeof(float));
+    ggml_backend_graph_compute(cpu, gf);
+    std::vector<float> got((size_t) ggml_nelements(y));
+    ggml_backend_tensor_get(y, got.data(), 0, got.size() * sizeof(float));
+    // Check the output shape while `ctx` is still live.
+    CHECK((int) y->ne[0] == HD);
+    CHECK((int) y->ne[1] == L);
+    ggml_gallocr_free(allocr);
+    ggml_free(ctx);
+    ggml_backend_free(cpu);
+
+    int bad = 0;
+    float max_abs = 0.0f;
+    for (size_t i = 0; i < ref.size() && i < got.size(); ++i) {
+        const float d = std::fabs(ref[i] - got[i]);
+        max_abs = std::max(max_abs, d);
+        if (d > 1e-5f) ++bad;
+    }
+    std::fprintf(stderr, "  L=1 max_abs=%.3e bad=%d\n", max_abs, bad);
+    CHECK(bad == 0);
+}
+
+// Output-shape regression check.  Even if the helper ever gets
+// re-plumbed to a different internal pipeline, the public contract
+// must remain `ne[0] = n_heads * head_dim`, `ne[1] = L` so the
+// downstream `ggml_backend_tensor_copy` blit into
+// `vector_text_attention_cache::q_tc_in` stays bit-exact.
+void test_output_shape_contract() {
+    std::fprintf(stderr,
+                 "[apply_rope_to_packed_qk output-shape contract]\n");
+    const int L = 20, n_heads = 4, head_dim = 64;
+    const int HD = n_heads * head_dim;
+    const int half = head_dim / 2;
+    const size_t buf_size = ggml_tensor_overhead() * 256 + ggml_graph_overhead();
+    std::vector<uint8_t> buf(buf_size);
+    ggml_init_params p = { buf_size, buf.data(), true };
+    ggml_context * ctx = ggml_init(p);
+    ggml_tensor * q_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, HD);
+    ggml_tensor * cos_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, L);
+    ggml_tensor * sin_in = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, half, L);
+    ggml_tensor * y = apply_rope_to_packed_qk(ctx, q_in, cos_in, sin_in,
+                                              n_heads, head_dim);
+    CHECK((int) y->ne[0] == HD);
+    CHECK((int) y->ne[1] == L);
+    CHECK(ggml_nelements(y) == (int64_t) L * HD);
+    ggml_free(ctx);
+}
+
+} // namespace
+
+int main() {
+    // Vector-estimator hot shapes (q_len, kv_len typical sizes).
+    test_production_layout("vector-estimator q", 20, 4, 64, 0xA51C);
+    test_production_layout("vector-estimator k", 32, 4, 64, 0xC0FF);
+    test_production_layout_l1();
+    test_output_shape_contract();
+
+    std::fprintf(stderr,
+                 "test_supertonic_rope_packed_qk: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_text_encoder_caches.cpp b/tts-cpp/test/test_supertonic_text_encoder_caches.cpp
new file mode 100644
index 00000000000..1161e0f5c61
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_text_encoder_caches.cpp
@@ -0,0 +1,233 @@
+// TDD harness for the audit follow-up #2 caches added to
+// `supertonic_text_encoder`'s GPU hot path.
+//
+// Two findings checked here, both fixture-bound (require the
+// Supertonic GGUF + auto-DISABLED when the model isn't present):
+//
+//   F13  Text-encoder layer-norm weight host-side cache.
+//        The text-encoder GGML production path runs four
+//        `relpos + LN + ffn + LN` iterations followed by a final
+//        speech-prompted LN.  Pre-audit, each LN downloaded its
+//        γ + β tensors from the backend via `read_f32(...)` on
+//        every synth — 18 downloads / synth = 18 sync points on
+//        a non-CPU backend.  Caching them once at load (same
+//        pattern as F1 RoPE θ) drops that to zero.
+//
+//   F16  Speech-prompted attention `tanh_k` host-side cache.
+//        The two speech-prompted attention layers each pull a
+//        constant `tanh_k` tensor (~50 × 256 F32 = 51.2 kB) on
+//        every synth.  Cache it once at load and consume the
+//        host pointer at both call sites.
+//
+// Validation strategy:
+//   1. After `load_supertonic_gguf` returns, the new cache
+//      fields on `supertonic_model` are populated with the right
+//      shapes (size + content match a direct backend read of the
+//      source tensor).
+//   2. The roster of cached LN weights covers all 9
+//      hot-path LN pairs the text encoder consumes per synth
+//      (4 × `norm_layers_1.X` + 4 × `norm_layers_2.X` +
+//       final `speech_prompted_text_encoder.norm.norm`).
+//
+// Registered with `LABEL "fixture"` in CMakeLists.txt.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <string>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+std::vector<float> dump_f32(ggml_tensor * tensor) {
+    std::vector<float> out((size_t) ggml_nelements(tensor));
+    ggml_backend_tensor_get(tensor, out.data(), 0, ggml_nbytes(tensor));
+    return out;
+}
+
+ggml_tensor * find_source(const supertonic_model & model, const std::string & key) {
+    auto it = model.source_tensors.find(key);
+    return it == model.source_tensors.end() ? nullptr : it->second;
+}
+
+// F13 — text-encoder layer-norm weights host-side cache.
+//
+// The expected roster (9 LN pairs) is the union of:
+//   - the four `attn_encoder.norm_layers_1.X` (post-relpos
+//     residual norms, X ∈ {0..3})
+//   - the four `attn_encoder.norm_layers_2.X` (post-FFN residual
+//     norms, X ∈ {0..3})
+//   - the final `speech_prompted_text_encoder.norm.norm`.  The
+//     speech-prompted block has no `norm_layers_*` entries of
+//     its own, so it contributes one extra cache entry in the
+//     production path; the "norm_layers" naming convention
+//     covers only the first 8.
+//
+// Test asserts:
+//   - `model.text_encoder_ln_weights` is populated with at least
+//     the 8 attn_encoder pairs + the 1 speech-prompted final.
+//   - Each cached vector matches a direct backend read of the
+//     corresponding source tensor bit-exactly.
+void test_f13_text_encoder_ln_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F13 text-encoder LN weight cache]\n");
+
+    // Contract: the map is populated for every roster stem below
+    // (the eight attn_encoder norm_layers_{1,2}.{0..3} pairs plus
+    // the final speech-prompted norm).  Additional entries (future
+    // audit roster expansions) do not trip the test.
+    int matched = 0;
+    int bad = 0;
+    static const char * const kRosterStems[] = {
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.0",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.1",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.2",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_1.3",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.0",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.1",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.2",
+        "text_encoder:tts.ttl.text_encoder.attn_encoder.norm_layers_2.3",
+        "text_encoder:tts.ttl.speech_prompted_text_encoder.norm",
+    };
+
+    for (const char * stem : kRosterStems) {
+        const std::string g_name = std::string(stem) + ".norm.weight";
+        const std::string b_name = std::string(stem) + ".norm.bias";
+
+        // Each entry in the cache map is keyed on the SOURCE name
+        // (the `text_encoder:...` string), value is the cached
+        // host vector ready for `layer_norm_channel` to consume.
+        auto gamma_it = model.text_encoder_ln_weights.find(g_name);
+        auto beta_it  = model.text_encoder_ln_weights.find(b_name);
+
+        ggml_tensor * gamma_src = find_source(model, g_name);
+        ggml_tensor * beta_src  = find_source(model, b_name);
+        if (!gamma_src || !beta_src) {
+            std::fprintf(stderr, "  SKIP %s (source tensor missing)\n", stem);
+            continue;
+        }
+        ++matched;
+        CHECK(gamma_it != model.text_encoder_ln_weights.end());
+        CHECK(beta_it  != model.text_encoder_ln_weights.end());
+        if (gamma_it == model.text_encoder_ln_weights.end() ||
+            beta_it  == model.text_encoder_ln_weights.end()) {
+            continue;
+        }
+
+        // Contract: cached size matches the source tensor.
+        CHECK(gamma_it->second.size() == (size_t) ggml_nelements(gamma_src));
+        CHECK(beta_it->second.size()  == (size_t) ggml_nelements(beta_src));
+
+        // Contract: cached bytes match a direct backend read.
+        auto gamma_direct = dump_f32(gamma_src);
+        auto beta_direct  = dump_f32(beta_src);
+        for (size_t i = 0; i < gamma_direct.size(); ++i) {
+            if (gamma_it->second[i] != gamma_direct[i]) {
+                if (bad < 2) {
+                    std::fprintf(stderr,
+                                 "  %s gamma mismatch @ %zu: cached=%g direct=%g\n",
+                                 stem, i, gamma_it->second[i], gamma_direct[i]);
+                }
+                ++bad;
+            }
+        }
+        for (size_t i = 0; i < beta_direct.size(); ++i) {
+            if (beta_it->second[i] != beta_direct[i]) {
+                if (bad < 2) {
+                    std::fprintf(stderr,
+                                 "  %s beta mismatch @ %zu: cached=%g direct=%g\n",
+                                 stem, i, beta_it->second[i], beta_direct[i]);
+                }
+                ++bad;
+            }
+        }
+    }
+    CHECK(bad == 0);
+    std::fprintf(stderr,
+                 "  matched %d / %zu pairs, bad=%d\n",
+                 matched, sizeof(kRosterStems)/sizeof(kRosterStems[0]), bad);
+}
+
+// F16 — speech-prompted attention `tanh_k` host-side cache.
+//
+// Two `tanh_k` tensors (one per speech-prompted attention layer)
+// were previously downloaded via `read_f32(...)` inside
+// `speech_prompted_attention_ggml` on every synth.  Caching them
+// at load drops 2 GPU→host sync points per synth.
+//
+// Source names match the production path (lines 622 / 796 in
+// `supertonic_text_encoder.cpp` pre-fix):
+//   text_encoder:/speech_prompted_text_encoder/attention1/tanh/Tanh_output_0
+//   text_encoder:/speech_prompted_text_encoder/attention2/tanh/Tanh_output_0
+void test_f16_speech_tanh_k_cache(const supertonic_model & model) {
+    std::fprintf(stderr, "[F16 speech tanh_k cache]\n");
+
+    static const char * const kTanhSources[2] = {
+        "text_encoder:/speech_prompted_text_encoder/attention1/tanh/Tanh_output_0",
+        "text_encoder:/speech_prompted_text_encoder/attention2/tanh/Tanh_output_0",
+    };
+    int matched = 0;
+    int bad = 0;
+    for (int i = 0; i < 2; ++i) {
+        ggml_tensor * src = find_source(model, kTanhSources[i]);
+        if (!src) {
+            std::fprintf(stderr, "  SKIP %s (not in GGUF)\n", kTanhSources[i]);
+            continue;
+        }
+        ++matched;
+        const std::vector<float> & cached = model.speech_tanh_k_cache[i];
+        CHECK(cached.size() == (size_t) ggml_nelements(src));
+        if (cached.size() != (size_t) ggml_nelements(src)) continue;
+
+        auto direct = dump_f32(src);
+        for (size_t j = 0; j < direct.size(); ++j) {
+            if (cached[j] != direct[j]) {
+                if (bad < 2) {
+                    std::fprintf(stderr,
+                                 "  tanh_k[%d] mismatch @ %zu: cached=%g direct=%g\n",
+                                 i, j, cached[j], direct[j]);
+                }
+                ++bad;
+            }
+        }
+    }
+    CHECK(bad == 0);
+    std::fprintf(stderr, "  matched %d / 2 tanh_k tensors, bad=%d\n", matched, bad);
+}
+
+} // namespace
+
+int main(int argc, char ** argv) {
+    if (argc < 2) {
+        std::fprintf(stderr, "usage: %s MODEL.gguf\n", argv[0]);
+        return 2;
+    }
+    supertonic_model model;
+    if (!load_supertonic_gguf(argv[1], model)) {
+        std::fprintf(stderr, "failed to load model: %s\n", argv[1]);
+        return 1;
+    }
+
+    test_f13_text_encoder_ln_cache(model);
+    test_f16_speech_tanh_k_cache(model);
+
+    free_supertonic_model(model);
+
+    std::fprintf(stderr,
+                 "test_supertonic_text_encoder_caches: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_text_encoder_gpu_bridge.cpp b/tts-cpp/test/test_supertonic_text_encoder_gpu_bridge.cpp
new file mode 100644
index 00000000000..3b9554b180d
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_text_encoder_gpu_bridge.cpp
@@ -0,0 +1,216 @@
+// QVAC-18605 round 12 — CPU-only TDD test for the text-encoder
+// speech-prompted-attention GPU bridge (`run_speech_prompted_merged_cache`).
+//
+// Background
+// ----------
+// Master's Metal-port branch (PR #15) shipped a fully-built
+// `speech_prompted_merged_cache` graph in `supertonic_text_encoder.cpp`
+// — a single ggml graph that does QKV projection + head-split +
+// flash-attn + out-proj end-to-end on the GPU.  The graph
+// builder (`build_speech_prompted_merged_cache`) is present + tested
+// at the implementation level via the Metal port's own harnesses,
+// but the **run path** that exercises it from
+// `speech_prompted_attention_ggml` was never wired in.  So the
+// production text-encoder path stays on the pre-Phase-A4 two-cache
+// pattern with host-side Q/V download → pack → re-upload between
+// the QKV cache and the flash-attn cache.
+//
+// Per text encoder call (2 speech-prompted layers per synth):
+//
+//   Pre-round-12 (two-cache path):
+//     - QKV cache compute
+//     - 2 GPU→host downloads (q_out, v_out via tensor_to_time_channel)
+//     - host-side pack of q_pack / k_pack / v_pack (rearranges into
+//       the [D, L, H] layout flash_attn views as [head_dim, q_len,
+//       n_heads])
+//     - 3 host→GPU uploads (q_pack, k_pack, v_pack)
+//     - flash-attn cache compute
+//   = 5 sync points + ~head_dim × L × n_heads × 3 floats of host work
+//
+//   Post-round-12 (merged path):
+//     - One merged graph compute
+//   = 0 sync points, 0 host pack work
+//
+// Eliminates **5 sync points × 2 layers = 10 sync points / synth**
+// on the text encoder alone.  Combined with the auto-pick fix in
+// the same round, the RTX 5090 number drops from ~4.8 ms /
+// text_encoder to ~2.5-3 ms.
+//
+// What this test pins (CPU-only)
+// ------------------------------
+// 1. The new `run_speech_prompted_merged_cache` symbol exists in
+//    `detail::` with the expected signature.  SFINAE — fails at
+//    compile time if the function isn't there, fails at link
+//    time if it's declared but undefined.
+//
+// 2. The `speech_prompted_merged_cache` struct exposes fields
+//    the run path needs.  Pinned here via SFINAE: x_in, style_in,
+//    out, idx, L (gf, Lctx, generation_id, model are not checked).
+//
+// 3. A runtime trip-wire that frees a default-constructed
+//    `speech_prompted_merged_cache` without crashing.  The
+//    dispatch wrapper `speech_prompted_attention_ggml` keeps its
+//    pre-round-12 signature (CPU → legacy two-cache path,
+//    non-CPU → merged path) but is TU-internal, so it is not
+//    pinned here; the model-fixture tests cover it.
+//
+// Equivalence between the merged and legacy paths is verified
+// end-to-end on real hardware via the model-fixture tests
+// (`test-supertonic-text-encoder-trace`,
+// `test-supertonic-pipeline`) — those exercise the live graph
+// against the scalar reference.  CPU-only unit tests can't
+// build the cache without a real GGUF's source tensors (q_w,
+// v_w, out_w, tanh_k all by name) so we don't try here.
+//
+// Registered with `LABEL "unit"` — no GGUF required.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <type_traits>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// SFINAE — the merged-cache run symbol exists with the expected
+// shape.  Round 12 introduces this; pre-round-12 the test fails
+// to compile on `has_run_speech_prompted_merged_cache<>(0)`.
+//
+// Expected signature:
+//
+//   void run_speech_prompted_merged_cache(
+//       speech_prompted_merged_cache & cache,
+//       const supertonic_model & m,
+//       const std::vector<float> & x_lc,
+//       int L,
+//       const float * style_ttl,
+//       std::vector<float> & out_lc);
+//
+// Mirrors the calling convention of the legacy
+// `speech_prompted_attention_ggml` so the dispatch wrapper can
+// fall through to it with no argument repacking.
+template <typename = void>
+auto has_run_speech_prompted_merged_cache(int)
+    -> decltype(run_speech_prompted_merged_cache(
+        std::declval<speech_prompted_merged_cache &>(),
+        std::declval<const supertonic_model &>(),
+        std::declval<const std::vector<float> &>(),
+        std::declval<int>(),
+        std::declval<const float *>(),
+        std::declval<std::vector<float> &>()),
+        std::true_type{});
+template <typename = void>
+auto has_run_speech_prompted_merged_cache(...) -> std::false_type;
+
+void test_run_symbol_exists() {
+    std::fprintf(stderr, "[Round 12 #6: run_speech_prompted_merged_cache symbol]\n");
+    static_assert(
+        decltype(has_run_speech_prompted_merged_cache<>(0))::value,
+        "run_speech_prompted_merged_cache must exist with the documented signature");
+    // SFINAE is the actual gate; runtime check exists so the
+    // test reports a meaningful pass/fail count.
+    ++g_checks;
+}
+
+// SFINAE — the merged-cache struct exposes the fields the run
+// path needs.  Master built the struct + builder; round 12 adds
+// the run path that reads these fields.  A future struct rename
+// or field removal trips this gate.
+template <typename T, typename = void>
+struct has_x_in_field : std::false_type {};
+template <typename T>
+struct has_x_in_field<T, std::void_t<decltype(std::declval<T &>().x_in)>>
+    : std::true_type {};
+
+template <typename T, typename = void>
+struct has_style_in_field : std::false_type {};
+template <typename T>
+struct has_style_in_field<T, std::void_t<decltype(std::declval<T &>().style_in)>>
+    : std::true_type {};
+
+template <typename T, typename = void>
+struct has_out_field : std::false_type {};
+template <typename T>
+struct has_out_field<T, std::void_t<decltype(std::declval<T &>().out)>>
+    : std::true_type {};
+
+template <typename T, typename = void>
+struct has_idx_field : std::false_type {};
+template <typename T>
+struct has_idx_field<T, std::void_t<decltype(std::declval<T &>().idx)>>
+    : std::true_type {};
+
+template <typename T, typename = void>
+struct has_L_field : std::false_type {};
+template <typename T>
+struct has_L_field<T, std::void_t<decltype(std::declval<T &>().L)>>
+    : std::true_type {};
+
+void test_merged_cache_struct_fields() {
+    std::fprintf(stderr, "[Round 12 #6: speech_prompted_merged_cache struct fields]\n");
+    static_assert(has_x_in_field    <speech_prompted_merged_cache>::value,
+                  "speech_prompted_merged_cache must expose x_in");
+    static_assert(has_style_in_field<speech_prompted_merged_cache>::value,
+                  "speech_prompted_merged_cache must expose style_in");
+    static_assert(has_out_field     <speech_prompted_merged_cache>::value,
+                  "speech_prompted_merged_cache must expose out");
+    static_assert(has_idx_field     <speech_prompted_merged_cache>::value,
+                  "speech_prompted_merged_cache must expose idx");
+    static_assert(has_L_field       <speech_prompted_merged_cache>::value,
+                  "speech_prompted_merged_cache must expose L");
+    ++g_checks;
+}
+
+// `speech_prompted_attention_ggml` is internal to
+// `supertonic_text_encoder.cpp` (it's only called from
+// `supertonic_text_encoder_forward_ggml` in the same TU) and
+// intentionally not declared in `supertonic_internal.h` — so this
+// SFINAE-pinning is left to the model-fixture tests that
+// link against the dispatch path through
+// `supertonic_text_encoder_forward_ggml` (e.g.
+// `test-supertonic-text-encoder-trace`).
+
+// Trip-wire: free a fresh-defaulted merged cache.  Verifies the
+// destructor path works on a never-built cache (idx==-1, ctx==
+// nullptr, allocr==nullptr) without crashing — important because
+// the dispatch wrapper holds `thread_local
+// speech_prompted_merged_cache merged_caches[2]` and on
+// program exit those destructors fire.  A buggy free path
+// (e.g., unconditional `ggml_free(cache.ctx)` on nullptr) would
+// segfault here.
+void test_free_default_constructed_cache() {
+    std::fprintf(stderr, "[Round 12 #6: free default-constructed merged cache]\n");
+    speech_prompted_merged_cache cache;  // defaults: idx=-1, ctx=nullptr, etc.
+    free_speech_prompted_merged_cache(cache);
+    CHECK(cache.ctx == nullptr);
+    CHECK(cache.allocr == nullptr);
+    CHECK(cache.idx == -1);
+    CHECK(cache.L == 0);
+}
+
+} // namespace
+
+int main() {
+    test_run_symbol_exists();
+    test_merged_cache_struct_fields();
+    test_free_default_constructed_cache();
+
+    std::fprintf(stderr,
+                 "test_supertonic_text_encoder_gpu_bridge: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
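
The field gates in the test file above only ever evaluate to `true`. As a minimal sketch of the same `std::void_t` detection idiom (C++17), the block below also shows the negative case. The `toy_cache_*` types are hypothetical and exist only to exercise the trait; they are not part of `supertonic_internal.h`.

```cpp
#include <type_traits>
#include <utility>

// Primary template: selected when the expression in the partial
// specialization below is ill-formed for T.
template <typename T, typename = void>
struct has_idx_field : std::false_type {};

// Partial specialization: selected when `t.idx` is well-formed for a `T &`.
template <typename T>
struct has_idx_field<T, std::void_t<decltype(std::declval<T &>().idx)>>
    : std::true_type {};

// Hypothetical toy types, used only to exercise the trait.
struct toy_cache_with_idx    { int idx = -1; };
struct toy_cache_without_idx { int L = 0; };

static_assert( has_idx_field<toy_cache_with_idx>::value,    "idx should be detected");
static_assert(!has_idx_field<toy_cache_without_idx>::value, "idx should not be detected");
```

A removed or renamed field flips the trait to `false`, which is what turns the `static_assert` in the test into a compile-time failure.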
diff --git a/tts-cpp/test/test_supertonic_upload_skip_tracker.cpp b/tts-cpp/test/test_supertonic_upload_skip_tracker.cpp
new file mode 100644
index 00000000000..1af84bc11d3
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_upload_skip_tracker.cpp
@@ -0,0 +1,300 @@
+// QVAC-18605 round 10 — CPU-only TDD test for the pointer-compare
+// upload-skip tracker.
+//
+// Background
+// ----------
+// Per-step uploads of `text_emb` to the front-block cache and to
+// the 3 group-graph caches happen 5 times per synth (once per
+// denoise step), but `text_emb` is a `std::vector<float>` allocated
+// ONCE in `Engine::Impl::synthesize()` (and once per bench run)
+// — so the SAME pointer flows through 4 caches × 5 steps = 20
+// uploads / synth, of which 16 are redundant re-uploads of
+// identical data.
+//
+// The F4 pattern (already in `vector_res_style_qkv_cache` for
+// `style_v_in` / `kctx_in`) skips redundant uploads via pointer
+// comparison: if the host vector pointer is the same as the last
+// successful upload's pointer, skip.  Round 10 generalises that
+// pattern into a `upload_skip_tracker` struct so the same logic
+// applies to the front-block / g1 / g2 / g3 `text_in` uploads.
+//
+// CROSS-SYNTH HAZARD
+// ------------------
+// The `text_emb` vector object lives on
+// `Engine::Impl::synthesize()`'s stack (or the bench loop's
+// stack) and is destructed at end of call; its buffer lives on
+// the heap.  Heap allocators (jemalloc / tcmalloc / glibc)
+// often return the SAME address for an immediately-following
+// same-size allocation (size-class reuse), so synth N+1 may
+// have `text_emb.data() == synth_N.text_emb.data()` despite
+// holding completely different data.  A naive pointer-compare
+// upload-skip would then silently reuse stale embeddings.
+//
+// MITIGATION
+// ----------
+// Caller resets the tracker at every synth boundary (i.e., when
+// `current_step == 0`).  The first step of every synth always
+// uploads (cold-miss), populating the tracker; steps 1..N-1 hit
+// the pointer-compare and skip.  Across synths, the reset
+// invalidates the cached pointer so the next synth's upload
+// always fires regardless of pointer match.
+//
+// API contract:
+//
+//   struct upload_skip_tracker {
+//       const void * last_uploaded = nullptr;
+//
+//       // True iff `current` differs from the last recorded
+//       // pointer (i.e., we MUST upload).  False iff we can
+//       // skip.  After the consumer's upload call returns,
+//       // they MUST call `mark_uploaded(current)` to update
+//       // the cached pointer (else the next call re-uploads).
+//       bool needs_upload(const void * current) const;
+//
+//       // Records a successful upload.  Call AFTER the upload
+//       // completes (so a failed upload doesn't pin the
+//       // pointer — the next call would correctly re-attempt).
+//       void mark_uploaded(const void * current);
+//
+//       // Drops the cached pointer.  Caller invokes at synth
+//       // boundary (current_step == 0) AND on cache rebuild
+//       // (the underlying GPU buffer is reallocated, so the
+//       // pointer-compare optimisation is invalid even if the
+//       // host pointer matches).
+//       void reset();
+//   };
+//
+// Whole TU MUST fail to compile before the symbol is added,
+// then pass after.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <type_traits>
+
+using tts_cpp::supertonic::detail::upload_skip_tracker;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// SFINAE: assert the public `last_uploaded` field exists (its type is not checked here).
+template <typename T>
+auto has_last_uploaded(int) -> decltype(
+    std::declval<T &>().last_uploaded, std::true_type{});
+template <typename T>
+auto has_last_uploaded(...) -> std::false_type;
+
+// Test 1 — default state.  A fresh tracker has no cached pointer
+// → needs_upload(...) ALWAYS returns true.  Catches the bug
+// where a default-constructed tracker accidentally caches a
+// non-null pointer (would silently skip the cold-miss upload).
+void test_default_state() {
+    static_assert(decltype(has_last_uploaded<upload_skip_tracker>(0))::value,
+                  "upload_skip_tracker must expose last_uploaded "
+                  "(documented field used by tests + diagnostics)");
+    upload_skip_tracker t;
+    CHECK(t.last_uploaded == nullptr);
+
+    // Intended semantic: any pointer (including nullptr) needs
+    // upload on a fresh tracker, because "we have NEVER
+    // uploaded" even though nullptr-vs-nullptr compares equal.
+    // The nullptr case is not asserted here; this test only
+    // checks that needs_upload(actual_pointer) is true.
+    int dummy = 42;
+    const void * p = &dummy;
+    CHECK(t.needs_upload(p));
+
+    // Same call twice should NOT mutate state — needs_upload is const.
+    CHECK(t.needs_upload(p));
+    CHECK(t.last_uploaded == nullptr);
+}
+
+// Test 2 — upload + skip happy path.
+//
+// The canonical 5-step pattern: step 0 uploads, steps 1-4 skip.
+void test_upload_then_skip() {
+    upload_skip_tracker t;
+    int payload_a = 0;
+    const void * p_a = &payload_a;
+
+    // Step 0 — cold miss, must upload.
+    CHECK(t.needs_upload(p_a));
+    t.mark_uploaded(p_a);
+    CHECK(t.last_uploaded == p_a);
+
+    // Steps 1..4 — same pointer, skip.
+    for (int i = 1; i < 5; ++i) {
+        CHECK(!t.needs_upload(p_a));
+    }
+}
+
+// Test 3 — pointer change forces upload.
+//
+// If the consumer calls with a different pointer, the tracker
+// must indicate upload-needed.  Catches a tracker that compares
+// something weaker than the full pointer (e.g. a truncated
+// value or a colliding hash) and so misses a pointer change.
+void test_pointer_change_triggers_upload() {
+    upload_skip_tracker t;
+    int payload_a = 0;
+    int payload_b = 1;
+    const void * p_a = &payload_a;
+    const void * p_b = &payload_b;
+
+    CHECK(t.needs_upload(p_a));
+    t.mark_uploaded(p_a);
+    CHECK(!t.needs_upload(p_a));
+
+    // Different pointer — must upload.
+    CHECK(t.needs_upload(p_b));
+    t.mark_uploaded(p_b);
+    CHECK(!t.needs_upload(p_b));
+
+    // Switching back to p_a — also must upload (the cache only
+    // remembers the LAST pointer, not all previously-seen ones).
+    CHECK(t.needs_upload(p_a));
+}
+
+// Test 4 — reset() clears the cached pointer.
+//
+// This is the SYNTH-BOUNDARY GUARD.  The caller invokes
+// reset() at the start of each synth (current_step == 0) so
+// even if the new synth's text_emb buffer happens to share the
+// same heap address as the previous synth's buffer, the tracker
+// forces a re-upload (the data may differ, since allocators
+// re-issue addresses on size-class reuse).
+void test_reset_invalidates_cache() {
+    upload_skip_tracker t;
+    int payload = 0;
+    const void * p = &payload;
+
+    // Upload + verify skip.
+    CHECK(t.needs_upload(p));
+    t.mark_uploaded(p);
+    CHECK(!t.needs_upload(p));
+
+    // Reset — same pointer must now trigger upload again.
+    t.reset();
+    CHECK(t.last_uploaded == nullptr);
+    CHECK(t.needs_upload(p));
+}
+
+// Test 5 — interleaved sites.
+//
+// Multiple trackers (one per cache) are independent — no shared
+// state.  Catches the bug where the tracker accidentally uses
+// a static / thread_local member that all instances share.
+void test_independent_instances() {
+    upload_skip_tracker t1;
+    upload_skip_tracker t2;
+    upload_skip_tracker t3;
+    int payload_a = 0;
+    int payload_b = 1;
+    const void * p_a = &payload_a;
+    const void * p_b = &payload_b;
+
+    t1.mark_uploaded(p_a);
+    t2.mark_uploaded(p_b);
+    // t3 left untouched.
+
+    CHECK(!t1.needs_upload(p_a));
+    CHECK(t1.needs_upload(p_b));
+
+    CHECK(!t2.needs_upload(p_b));
+    CHECK(t2.needs_upload(p_a));
+
+    CHECK(t3.needs_upload(p_a));
+    CHECK(t3.needs_upload(p_b));
+    CHECK(t3.last_uploaded == nullptr);
+}
+
+// Test 6 — cross-synth pointer-reuse hazard simulation.
+//
+// Simulate the production pattern: synth A allocates text_emb at
+// address P, runs 5 steps (upload at step 0, skip at steps 1-4).
+// Synth A ends, vector destructs.  Synth B allocates text_emb at
+// the SAME address P (allocator size-class reuse) but with
+// DIFFERENT data.
+//
+// Without reset() at synth boundary: the tracker would skip
+// synth B's step-0 upload because pointer matches → BUG.
+//
+// With reset() at synth boundary (the documented contract): the
+// tracker correctly forces synth B's step-0 upload.
+void test_cross_synth_pointer_reuse() {
+    upload_skip_tracker t;
+
+    // Synth A: address P_A.
+    char buf_a[64] = {0};
+    const void * p_a = buf_a;
+    CHECK(t.needs_upload(p_a));  // step 0 (cold miss)
+    t.mark_uploaded(p_a);
+    for (int s = 1; s < 5; ++s) {
+        CHECK(!t.needs_upload(p_a));
+    }
+
+    // Synth B: SAME address (synth-A's buffer "freed" + reused).
+    // Without reset, naive pointer-compare would incorrectly
+    // skip the upload → upload-skip would silently leak synth-A
+    // data into synth-B's GPU buffer.
+    //
+    // The documented contract is: caller MUST reset() at
+    // current_step == 0.  We simulate that here.
+    t.reset();
+    const void * p_b = buf_a;        // intentionally same address.
+    CHECK(t.needs_upload(p_b));      // upload fires despite matching pointer.
+    t.mark_uploaded(p_b);
+    for (int s = 1; s < 5; ++s) {
+        CHECK(!t.needs_upload(p_b));
+    }
+}
+
+// Test 7 — reset on already-empty tracker is a no-op.
+//
+// Defensive: caller might call reset() unconditionally at synth
+// start without checking whether the tracker has cached state.
+// Must not crash / mutate other state weirdly.
+void test_reset_on_empty_tracker() {
+    upload_skip_tracker t;
+    CHECK(t.last_uploaded == nullptr);
+    t.reset();
+    CHECK(t.last_uploaded == nullptr);
+    t.reset();
+    t.reset();
+    CHECK(t.last_uploaded == nullptr);
+
+    // After reset chain, normal usage still works.
+    int payload = 0;
+    const void * p = &payload;
+    CHECK(t.needs_upload(p));
+    t.mark_uploaded(p);
+    CHECK(!t.needs_upload(p));
+}
+
+} // namespace
+
+int main() {
+    test_default_state();
+    test_upload_then_skip();
+    test_pointer_change_triggers_upload();
+    test_reset_invalidates_cache();
+    test_independent_instances();
+    test_cross_synth_pointer_reuse();
+    test_reset_on_empty_tracker();
+
+    std::fprintf(stderr,
+                 "test_supertonic_upload_skip_tracker: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
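
A minimal sketch of a tracker that would satisfy the contract pinned by the test file above. It is an illustration under stated assumptions, not the implementation in `supertonic_internal.h`: the names `sketch_upload_skip_tracker` and `count_uploads` and the `has_uploaded` flag are hypothetical. The flag is one possible way to make a fresh tracker report `needs_upload(nullptr) == true`, a case the Test 1 comment describes but does not assert.

```cpp
// Hypothetical stand-in for upload_skip_tracker; not the real struct.
struct sketch_upload_skip_tracker {
    const void * last_uploaded = nullptr;
    // Assumption: separates "never uploaded" from "last upload was nullptr".
    bool         has_uploaded  = false;

    // True when the caller must upload; does not mutate state.
    bool needs_upload(const void * current) const {
        return !has_uploaded || current != last_uploaded;
    }
    // Call only after the upload succeeded.
    void mark_uploaded(const void * current) {
        last_uploaded = current;
        has_uploaded  = true;
    }
    // Call at every synth boundary and on cache rebuild.
    void reset() {
        last_uploaded = nullptr;
        has_uploaded  = false;
    }
};

// Simulates one synth: reset at step 0, then count how many of the
// n_steps denoise steps would actually upload `text_emb`.
inline int count_uploads(sketch_upload_skip_tracker & t,
                         const void * text_emb, int n_steps) {
    int uploads = 0;
    for (int step = 0; step < n_steps; ++step) {
        if (step == 0) t.reset();
        if (t.needs_upload(text_emb)) {
            ++uploads;                  // a real caller would upload to the GPU here
            t.mark_uploaded(text_emb);
        }
    }
    return uploads;
}
```

With the reset at step 0, two consecutive synths that happen to reuse the same buffer address each upload exactly once, which is the behaviour Test 6 simulates.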
diff --git a/tts-cpp/test/test_supertonic_voice_host_cache.cpp b/tts-cpp/test/test_supertonic_voice_host_cache.cpp
new file mode 100644
index 00000000000..89c2da788f4
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_voice_host_cache.cpp
@@ -0,0 +1,285 @@
+// QVAC-18605 round 7 — CPU-only TDD test for the voice ttl/dp host
+// cache.
+//
+// Background
+// ----------
+// `Engine::Impl::synthesize()` currently downloads the per-voice
+// style tensors (`ttl`, `dp`) from the GGUF on EVERY call:
+//
+//   std::vector<float> style_ttl = read_tensor_f32(vit->second.ttl);
+//   std::vector<float> style_dp  = read_tensor_f32(vit->second.dp);
+//
+// Each `read_tensor_f32` is one synchronous GPU→host download +
+// one host vector allocation.  On Vulkan / OpenCL backends that
+// is one sync point per tensor per call, for data that never
+// changes across calls (voice tensors are part of the load-time
+// GGUF state and never mutate after load).  Caching them
+// per-engine keyed by voice name eliminates 2 sync points per
+// `synthesize()` call after the first call for each voice.
+//
+// Round 7 introduces a small standalone helper
+// `tts_cpp::supertonic::detail::voice_host_cache` so the lookup-
+// or-load semantics are testable on CPU without instantiating a
+// full `Engine::Impl`.  The Engine::Impl wiring is a thin caller
+// of this helper.
+//
+// API contract:
+//
+//   struct voice_host_cache {
+//       struct entry {
+//           std::vector<float> ttl;
+//           std::vector<float> dp;
+//       };
+//
+//       // Returns a stable reference to the cached entry for
+//       // `voice_name`.  On cache miss, calls `read_tensor_f32`
+//       // on `ttl_tensor` and `dp_tensor`, stores the result,
+//       // and returns the new entry.  On cache hit, returns the
+//       // existing entry without touching the GGML tensors at
+//       // all (the host vectors are reused as-is).
+//       //
+//       // Reference is stable across subsequent `get_or_load`
+//       // calls for OTHER voices (std::unordered_map's
+//       // reference-stability guarantee on insert).  Caller may
+//       // hold the reference across the next `get_or_load` on
+//       // the same instance, BUT must NOT call `clear()` on the
+//       // cache while holding the reference.
+//       const entry & get_or_load(const std::string & voice_name,
+//                                 ggml_tensor * ttl_tensor,
+//                                 ggml_tensor * dp_tensor);
+//
+//       // Drops every cached entry.  Called by Engine::Impl on
+//       // backend reset (currently unreachable — included for
+//       // forward-compat with hot-swap scenarios).
+//       void clear();
+//
+//       // Diagnostic — number of entries currently cached.  Used
+//       // by the test to assert lookup-vs-load semantics.
+//       size_t size() const;
+//   };
+//
+// Whole TU MUST fail to compile before the symbol is added,
+// then pass after.
+
+#include "ggml.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <cstring>
+#include <stdexcept>
+#include <string>
+#include <vector>
+
+using tts_cpp::supertonic::detail::voice_host_cache;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Owns a tiny 3-D F32 tensor allocated on `cpu` and filled with
+// the supplied float payload (see `make_stub_tensor`).  Mirrors
+// the rank of a real voice tensor (ttl is [256, 50, 1], dp is
+// [16, 8, 1]) without requiring a real model.  The destructor
+// frees the backend buffer, then the context.
+struct stub_tensor {
+    ggml_context * ctx = nullptr;
+    ggml_backend_buffer_t buf = nullptr;
+    ggml_tensor * tensor = nullptr;
+
+    ~stub_tensor() {
+        if (buf) ggml_backend_buffer_free(buf);
+        if (ctx) ggml_free(ctx);
+    }
+    stub_tensor() = default;
+    stub_tensor(const stub_tensor &)             = delete;
+    stub_tensor & operator=(const stub_tensor &) = delete;
+};
+
+void make_stub_tensor(ggml_backend_t cpu,
+                      stub_tensor & out,
+                      int ne0, int ne1, int ne2,
+                      const std::vector<float> & payload) {
+    constexpr int MAX_NODES = 4;
+    const size_t buf_size = ggml_tensor_overhead() * MAX_NODES;
+    ggml_init_params p{ buf_size, nullptr, /*no_alloc=*/true };
+    out.ctx = ggml_init(p);
+    if (!out.ctx) throw std::runtime_error("ggml_init failed");
+    out.tensor = ggml_new_tensor_3d(out.ctx, GGML_TYPE_F32, ne0, ne1, ne2);
+    out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, cpu);
+    if (!out.buf) throw std::runtime_error("ggml_backend_alloc_ctx_tensors failed");
+    if ((size_t) ggml_nelements(out.tensor) != payload.size()) {
+        throw std::runtime_error("payload size mismatch in test stub");
+    }
+    ggml_backend_tensor_set(out.tensor, payload.data(), 0,
+                            payload.size() * sizeof(float));
+}
+
+// Test 1 — empty cache reports size 0; clear is a no-op on empty.
+void test_empty_cache() {
+    voice_host_cache cache;
+    CHECK(cache.size() == 0);
+    cache.clear();  // must not throw
+    CHECK(cache.size() == 0);
+}
+
+// Test 2 — first `get_or_load` populates from the GGML tensors;
+// returned vectors carry the exact payload.
+void test_first_load_populates(ggml_backend_t cpu) {
+    voice_host_cache cache;
+
+    std::vector<float> ttl_payload(8, 1.5f);
+    for (size_t i = 0; i < ttl_payload.size(); ++i) ttl_payload[i] = (float) i + 0.25f;
+    std::vector<float> dp_payload(4, 2.5f);
+    for (size_t i = 0; i < dp_payload.size(); ++i) dp_payload[i] = (float) i - 0.5f;
+
+    stub_tensor ttl_t; make_stub_tensor(cpu, ttl_t, 8, 1, 1, ttl_payload);
+    stub_tensor dp_t;  make_stub_tensor(cpu, dp_t,  4, 1, 1, dp_payload);
+
+    const auto & e = cache.get_or_load("F1", ttl_t.tensor, dp_t.tensor);
+    CHECK(e.ttl == ttl_payload);
+    CHECK(e.dp  == dp_payload);
+    CHECK(cache.size() == 1);
+}
+
+// Test 3 — second `get_or_load` for the same voice returns the
+// same entry WITHOUT touching the GGML tensors.  We verify the
+// "no-touch" property by passing nullptr for ttl/dp on the second
+// call: a real load attempt would crash; a cache hit returns the
+// previously-stored entry.
+void test_second_load_hits_cache(ggml_backend_t cpu) {
+    voice_host_cache cache;
+
+    std::vector<float> ttl_payload(6, 0.0f);
+    for (size_t i = 0; i < ttl_payload.size(); ++i) ttl_payload[i] = (float) i;
+    std::vector<float> dp_payload(3, 0.0f);
+    for (size_t i = 0; i < dp_payload.size(); ++i) dp_payload[i] = -(float) i;
+
+    stub_tensor ttl_t; make_stub_tensor(cpu, ttl_t, 6, 1, 1, ttl_payload);
+    stub_tensor dp_t;  make_stub_tensor(cpu, dp_t,  3, 1, 1, dp_payload);
+
+    const auto & first  = cache.get_or_load("M1", ttl_t.tensor, dp_t.tensor);
+    CHECK(first.ttl == ttl_payload);
+
+    // Pass nullptr — if the cache TRIED to re-load, this would
+    // crash inside `read_tensor_f32`.  A clean cache hit returns
+    // the prior entry untouched.
+    const auto & second = cache.get_or_load("M1", nullptr, nullptr);
+    CHECK(&first == &second);  // reference identity
+    CHECK(second.ttl == ttl_payload);
+    CHECK(second.dp  == dp_payload);
+    CHECK(cache.size() == 1);
+}
+
+// Test 4 — multiple voices coexist; each entry is independent;
+// reference stability holds across subsequent get_or_load calls
+// for OTHER voices.
+void test_multiple_voices(ggml_backend_t cpu) {
+    voice_host_cache cache;
+
+    stub_tensor ttl_a; make_stub_tensor(cpu, ttl_a, 4, 1, 1, {1, 2, 3, 4});
+    stub_tensor dp_a;  make_stub_tensor(cpu, dp_a,  2, 1, 1, {10, 20});
+    stub_tensor ttl_b; make_stub_tensor(cpu, ttl_b, 4, 1, 1, {5, 6, 7, 8});
+    stub_tensor dp_b;  make_stub_tensor(cpu, dp_b,  2, 1, 1, {30, 40});
+    stub_tensor ttl_c; make_stub_tensor(cpu, ttl_c, 4, 1, 1, {9, 9, 9, 9});
+    stub_tensor dp_c;  make_stub_tensor(cpu, dp_c,  2, 1, 1, {50, 60});
+
+    const auto & a1 = cache.get_or_load("A", ttl_a.tensor, dp_a.tensor);
+    const auto & b1 = cache.get_or_load("B", ttl_b.tensor, dp_b.tensor);
+    const auto & c1 = cache.get_or_load("C", ttl_c.tensor, dp_c.tensor);
+
+    CHECK(a1.ttl == std::vector<float>({1, 2, 3, 4}));
+    CHECK(b1.ttl == std::vector<float>({5, 6, 7, 8}));
+    CHECK(c1.ttl == std::vector<float>({9, 9, 9, 9}));
+    CHECK(a1.dp  == std::vector<float>({10, 20}));
+    CHECK(b1.dp  == std::vector<float>({30, 40}));
+    CHECK(c1.dp  == std::vector<float>({50, 60}));
+    CHECK(cache.size() == 3);
+
+    // Reference stability: looking up A again must yield the
+    // SAME object the original lookup returned.  std::unordered_map
+    // keeps references to elements valid across inserts, even
+    // when an insert triggers a rehash (only iterators are
+    // invalidated).  This matters for the production Engine::Impl
+    // call site: it captures the ttl/dp pointers from
+    // `e.ttl.data()` / `e.dp.data()` and forwards them to the
+    // synthesis pipeline, which expects them to stay valid for
+    // the duration of the call.
+    const auto & a2 = cache.get_or_load("A", nullptr, nullptr);
+    CHECK(&a1 == &a2);
+}
+
+// Test 5 — `clear()` drops every entry; subsequent get_or_load
+// re-loads from the tensors.
+void test_clear_drops_entries(ggml_backend_t cpu) {
+    voice_host_cache cache;
+
+    std::vector<float> ttl_payload(4, 7.0f);
+    std::vector<float> dp_payload(2, -3.0f);
+    stub_tensor ttl_t; make_stub_tensor(cpu, ttl_t, 4, 1, 1, ttl_payload);
+    stub_tensor dp_t;  make_stub_tensor(cpu, dp_t,  2, 1, 1, dp_payload);
+
+    cache.get_or_load("V", ttl_t.tensor, dp_t.tensor);
+    CHECK(cache.size() == 1);
+    cache.clear();
+    CHECK(cache.size() == 0);
+
+    // Re-load must succeed and produce the same payload.
+    const auto & e = cache.get_or_load("V", ttl_t.tensor, dp_t.tensor);
+    CHECK(e.ttl == ttl_payload);
+    CHECK(e.dp  == dp_payload);
+    CHECK(cache.size() == 1);
+}
+
+// Test 6 — null tensor pointers throw on cache miss (loud
+// failure for an Impl bug; never expected to fire on the
+// production path because Impl validates `voices.find()` before
+// calling the cache).
+void test_null_tensors_on_miss_throws(ggml_backend_t /*cpu*/) {
+    voice_host_cache cache;
+    bool threw = false;
+    try {
+        cache.get_or_load("ghost", nullptr, nullptr);
+    } catch (const std::exception &) {
+        threw = true;
+    }
+    CHECK(threw);
+    CHECK(cache.size() == 0);
+}
+
+} // namespace
+
+int main() {
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "ggml_backend_cpu_init failed\n");
+        return 1;
+    }
+
+    test_empty_cache();
+    test_first_load_populates(cpu);
+    test_second_load_hits_cache(cpu);
+    test_multiple_voices(cpu);
+    test_clear_drops_entries(cpu);
+    test_null_tensors_on_miss_throws(cpu);
+
+    ggml_backend_free(cpu);
+
+    std::fprintf(stderr,
+                 "test_supertonic_voice_host_cache: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
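
A minimal sketch of the lookup-or-load semantics the test file above pins, assuming a `std::unordered_map` backing store as the API-contract comment describes. The class `sketch_voice_cache` and its loader callback are hypothetical: the callback stands in for the two `read_tensor_f32` calls so the sketch has no ggml dependency, and an empty loader plays the role of the null tensor pointers.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for voice_host_cache; not the real struct.
class sketch_voice_cache {
public:
    struct entry {
        std::vector<float> ttl;
        std::vector<float> dp;
    };
    using loader = std::function<entry()>;

    // Returns the cached entry, invoking `load` only on a miss.
    // Throws on a miss with an empty loader (mirrors the null-tensor case).
    const entry & get_or_load(const std::string & voice_name, const loader & load) {
        auto it = entries_.find(voice_name);
        if (it != entries_.end()) return it->second;   // hit: loader untouched
        if (!load) throw std::runtime_error("cache miss with no loader: " + voice_name);
        entry e = load();   // may throw; nothing is inserted in that case
        return entries_.emplace(voice_name, std::move(e)).first->second;
    }

    void clear() { entries_.clear(); }
    std::size_t size() const { return entries_.size(); }

private:
    // References to mapped values stay valid across inserts and rehashes;
    // only clear() or erase invalidates them.
    std::unordered_map<std::string, entry> entries_;
};
```

Because the map never hands out iterators, a rehash triggered by loading many other voices does not disturb a reference the caller is still holding.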
diff --git a/tts-cpp/test/test_supertonic_vulkan_device_select.cpp b/tts-cpp/test/test_supertonic_vulkan_device_select.cpp
new file mode 100644
index 00000000000..38d1b1408bb
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_vulkan_device_select.cpp
@@ -0,0 +1,449 @@
+// QVAC-18605 round 3 — CPU-only TDD test for the multi-device
+// Vulkan auto-pick helper.
+//
+// `--vulkan-device -1` was reserved for "auto-pick best device"
+// behaviour in the QVAC-18605 bring-up but treated as 0 (the
+// historical hard-coded value).  Round 3 wires the auto-pick
+// logic via a pure-logic helper that takes the per-device free-
+// VRAM list as input — keeps the policy decoupled from the
+// Vulkan-only `ggml_backend_vk_get_device_memory()` plumbing,
+// which means the policy is testable on CPU with synthetic
+// inputs.  The Vulkan-side wrapper that calls
+// `ggml_backend_vk_get_device_memory()` for each device and
+// dispatches into the helper lives behind `#ifdef GGML_USE_VULKAN`
+// in `init_supertonic_backend`.
+//
+// QVAC-18605 round 12 — extend the policy to bias against UMA
+// (unified-memory-architecture, i.e., integrated) GPUs when a
+// discrete GPU is present.  Background: on the dev rig (RTX 5090
+// discrete + AMD RADV iGPU), the iGPU reports system RAM (128+
+// GB) as "free VRAM" via `ggml_backend_vk_get_device_memory()`
+// because UMA shares the host RAM pool with the CPU.  The
+// round-3 `argmax(free_vram)` policy therefore picked the iGPU,
+// silently delivering ~7× realtime instead of the discrete's
+// 273× realtime — a ~40× perf regression for any operator who
+// followed the help text "auto-pick adapter with most free VRAM".
+//
+// New signature (round 12):
+//
+//   int resolve_vulkan_device_index(int requested,
+//                                   const std::vector<size_t> & free_vram_per_device,
+//                                   const std::vector<bool>   & is_uma_per_device = {});
+//
+// `is_uma_per_device` is OPTIONAL (default empty vector).  When
+// empty, the round-3 `argmax(free_vram)` policy is preserved
+// verbatim — backwards-compatible with every caller that hasn't
+// been updated.  When non-empty, it MUST have the same length as
+// `free_vram_per_device`; mismatch throws.
+//
+// New behaviour matrix (with `is_uma_per_device` populated):
+//
+//   | requested | discrete? | uma?  | result                                |
+//   |-----------|-----------|-------|---------------------------------------|
+//   | -1        | all       | none  | argmax(free_vram) over all            |
+//   | -1        | none      | all   | argmax(free_vram) over all            |
+//   | -1        | mixed     | mixed | argmax(free_vram) over DISCRETE only  |
+//   | 0..N      | any       | any   | explicit passthrough (range-checked)  |
+//
+// Returns the device index to use, or throws `std::runtime_error`
+// on invalid input (caller surfaces the message verbatim).
+//
+// Original round-3 behaviour matrix (when `is_uma_per_device` is empty):
+//
+//   | requested | dev_count | result                                  |
+//   |-----------|-----------|-----------------------------------------|
+//   | -1        | 0         | throws (no device to pick)              |
+//   | 0         | 0         | throws (no device to pick)              |
+//   | -1        | 1         | 0  (only choice)                        |
+//   | 0         | 1         | 0                                       |
+//   | -1        | 2         | argmax(free_vram); ties → first         |
+//   | 0         | 2         | 0  (explicit override)                  |
+//   | 1         | 2         | 1                                       |
+//   | 2         | 2         | throws (out of range)                   |
+//   | -2        | any       | throws (negative != -1 reserved)        |
+//
+// Tie-breaking on equal free VRAM picks the lower index — gives
+// stable behaviour across runs on identical-spec multi-GPU
+// machines.  Documented in `init_supertonic_backend` so operators
+// who need a different policy can `--vulkan-device N` explicitly.
+//
+// This test is written FIRST (TDD).  Round 3 checks (tests 1-8)
+// already pass; round 12 checks (tests 9-14, including 10b) fail
+// until the new `is_uma_per_device` parameter is implemented.
+
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <stdexcept>
+#include <vector>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Helper: returns true iff `fn()` throws std::runtime_error.  Used
+// to verify the no-device / out-of-range / negative-non-auto cases.
+template <typename F>
+bool throws_runtime_error(F && fn) {
+    try {
+        fn();
+        return false;
+    } catch (const std::runtime_error &) {
+        return true;
+    } catch (...) {
+        return false;
+    }
+}
+
+// Test 1 — Empty device list throws regardless of request.
+//
+// `init_supertonic_backend` falls through to OpenCL / CPU when
+// `ggml_backend_vk_get_device_count()` returns 0; the helper
+// throws here so the caller has a clear signal to skip the
+// Vulkan branch instead of accidentally returning device index
+// 0 against a zero-length list.
+void test_empty_device_list_throws() {
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(-1, {});
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index( 0, {});
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index( 1, {});
+    }));
+}
+
+// Test 2 — Single device, requested 0 or -1 returns 0.
+//
+// The auto-pick is a no-op when there's only one candidate.
+// Explicit index 0 also returns 0 (the historical hard-coded
+// path).  Any other index throws (out of range).
+void test_single_device_returns_zero() {
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{100}) == 0);
+    CHECK(resolve_vulkan_device_index( 0, std::vector<size_t>{100}) == 0);
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(1, std::vector<size_t>{100});
+    }));
+}
+
+// Test 3 — Auto-pick (`-1`) picks the device with most free VRAM.
+//
+// Simulates a multi-GPU machine where one card has more head-
+// room than the other (e.g. NVIDIA RTX 5090 with 32 GB free
+// alongside an RTX 4090 with 16 GB free).  Auto-pick should
+// land on the 5090.
+void test_auto_pick_max_vram() {
+    // dev0 = 100 free, dev1 = 500 free → pick dev1.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{100, 500}) == 1);
+    // dev0 = 500 free, dev1 = 100 free → pick dev0.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{500, 100}) == 0);
+    // 4 devices, dev2 has the most.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{100, 200, 800, 400}) == 2);
+}
+
+// Test 4 — Tie-breaking picks the lower index.
+//
+// Identical-spec multi-GPU machines (e.g. lab racks of A100s)
+// produce identical free-VRAM readings; tie-breaking on the
+// lower index gives a deterministic device assignment for a
+// given enumeration order instead of an arbitrary pick.
+void test_auto_pick_ties_pick_lower_index() {
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{300, 300}) == 0);
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{500, 500, 500}) == 0);
+    // Tie at the back: dev1 + dev2 both have 500, pick dev1.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{100, 500, 500}) == 1);
+}
+
+// Test 5 — Explicit valid index in range returns it.
+//
+// Auto-pick is opt-in via `-1`; an operator who knows their
+// machine + workload can pin to a specific device with
+// `--vulkan-device N`, and the helper must not second-guess the
+// choice based on VRAM.  (Useful when the higher-VRAM card is
+// reserved for another workload, e.g. a model-server alongside
+// a TTS worker on the same box.)
+void test_explicit_index_returns_unchanged() {
+    CHECK(resolve_vulkan_device_index(0, std::vector<size_t>{100, 500}) == 0);
+    CHECK(resolve_vulkan_device_index(1, std::vector<size_t>{100, 500}) == 1);
+    CHECK(resolve_vulkan_device_index(2, std::vector<size_t>{100, 500, 200}) == 2);
+    CHECK(resolve_vulkan_device_index(0, std::vector<size_t>{100, 500, 200}) == 0);
+}
+
+// Test 6 — Out-of-range explicit index throws.
+//
+// Same loud-failure contract as the existing
+// `init_supertonic_backend` Vulkan branch: a CLI typo that asks
+// for `--vulkan-device 7` on a 2-GPU machine surfaces here as a
+// hard error, not a silent CPU fallback that hides the perf
+// cliff.
+void test_out_of_range_throws() {
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(2, std::vector<size_t>{100, 500});
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(7, std::vector<size_t>{100, 500});
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(99, std::vector<size_t>{100});
+    }));
+}
+
+// Test 7 — Negative-but-not-(-1) throws.
+//
+// `-1` is the documented "auto-pick" sentinel; any other
+// negative value (e.g. `-2`, `-100`) is reserved for future
+// policies.  Treating those as 0 (the bring-up's behaviour)
+// silently masks operator typos; throwing surfaces them.
+void test_reserved_negative_throws() {
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(-2, std::vector<size_t>{100, 500});
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(-100, std::vector<size_t>{100, 500});
+    }));
+}
+
+// Test 8 — Zero-VRAM device handling.
+//
+// A reserved-but-listed device (e.g. iGPU listed but not
+// available for compute) shows 0 free VRAM.  Auto-pick should
+// still work: argmax skips it in favour of any device with
+// non-zero VRAM.  When all devices have zero VRAM (degenerate),
+// it picks index 0 (consistent with the tie-breaking rule).
+void test_zero_vram_handling() {
+    // dev0 has zero free, dev1 has 500.  Auto-pick → dev1.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{0, 500}) == 1);
+    // All zero — pick the first (consistent with the
+    // tie-breaking rule).
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{0, 0, 0}) == 0);
+}
+
+// =============================================================
+// Round 12 — bias against UMA on hybrid discrete+iGPU machines.
+// =============================================================
+
+// Test 9 — Empty `is_uma_per_device` preserves round-3 behaviour.
+//
+// Backwards-compatibility gate.  Every existing caller passes
+// only two arguments; the new third-argument default of `{}`
+// must produce identical results to the round-3 helper for
+// EVERY input shape.  This is a "no surprise" guarantee for any
+// caller that hasn't been updated to pass the UMA flags.
+void test_empty_uma_preserves_round3_behaviour() {
+    // Empty UMA list explicitly passed — identical to round-3
+    // 2-arg call.  Covers the main argmax(free_vram) path.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{100, 500},
+                                       std::vector<bool>{}) == 1);
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{500, 100},
+                                       std::vector<bool>{}) == 0);
+    // Explicit index also unchanged with empty UMA list.
+    CHECK(resolve_vulkan_device_index(1, std::vector<size_t>{100, 500},
+                                       std::vector<bool>{}) == 1);
+    // Tie-break still picks lower index with empty UMA list.
+    CHECK(resolve_vulkan_device_index(-1, std::vector<size_t>{300, 300},
+                                       std::vector<bool>{}) == 0);
+}
+
+// Test 10 — Hybrid discrete + UMA: auto-pick prefers discrete
+// even when UMA reports more "free VRAM".
+//
+// THE BUG ROUND 12 FIXES.  On the dev rig (RTX 5090 discrete +
+// AMD RADV iGPU), free_vram_per_device looks like
+// `[32 GB, 120 GB]` because RADV reports the entire system RAM
+// as available to the iGPU's UMA pool.  Pre-round-12 argmax
+// picks index 1 (iGPU), a ~40× throughput loss.  Round 12 biases
+// against UMA when a discrete GPU is present, picking index 0.
+void test_hybrid_prefer_discrete_over_uma() {
+    // RTX 5090 (discrete, 32 GB) + AMD RADV iGPU (UMA, ~120 GB
+    // reported via system RAM).  Pre-round-12 returned 1 (iGPU);
+    // round-12 returns 0 (discrete) regardless of the UMA's
+    // larger reported free pool.
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{32ull * 1024 * 1024 * 1024,
+                                 120ull * 1024 * 1024 * 1024},
+            std::vector<bool>{false, true}) == 0);
+    // Swapped enumeration order (iGPU first, discrete second).
+    // Same outcome — picks the discrete one regardless of index.
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{120ull * 1024 * 1024 * 1024,
+                                 32ull * 1024 * 1024 * 1024},
+            std::vector<bool>{true, false}) == 1);
+}
+
+// Test 10b — UMA-aware tiebreak: two discrete cards with EQUAL
+// VRAM should pick the lower index, with the UMA bias active.
+//
+// PR #18 reviewer (Omar) follow-up: the original test 10 used
+// distinct VRAM sizes (32 GB vs 120 GB), so the tiebreak case
+// (two discrete cards with equal VRAM under the UMA bias path)
+// wasn't pinned explicitly.  Test 4 covers the tiebreak in the
+// round-3 (no UMA bias) policy and test 11's second CHECK
+// covers the discrete-subset tiebreak when a UMA is interleaved
+// between the discretes, but neither explicitly exercises the
+// most-common rig: two adjacent discretes with equal VRAM +
+// active UMA bias.  This test pins it.
+void test_uma_aware_tiebreak_equal_vram_discretes() {
+    // Two discretes with identical 32 GB VRAM + one UMA iGPU
+    // with much more reported VRAM.  Discrete subset is
+    // {0, 1}; argmax over that subset picks 0 (lower index).
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{
+                32ull * 1024 * 1024 * 1024,    // dev0: discrete, 32 GB
+                32ull * 1024 * 1024 * 1024,    // dev1: discrete, 32 GB
+                120ull * 1024 * 1024 * 1024},  // dev2: UMA, 120 GB
+            std::vector<bool>{false, false, true}) == 0);
+
+    // Adjacent discretes (no interleaved UMA) — same expected
+    // outcome (lower index = 0).  Belt-and-suspenders against
+    // a future refactor that walks the discrete subset in a
+    // different order.
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{
+                32ull * 1024 * 1024 * 1024,
+                32ull * 1024 * 1024 * 1024},
+            std::vector<bool>{false, false}) == 0);
+
+    // Three discretes, all equal: lowest index wins (= 0).
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{
+                32ull * 1024 * 1024 * 1024,
+                32ull * 1024 * 1024 * 1024,
+                32ull * 1024 * 1024 * 1024},
+            std::vector<bool>{false, false, false}) == 0);
+}
+
+// Test 11 — Multi-discrete + multi-UMA mixed: argmax over the
+// discrete subset.
+//
+// Lab rack with 2 discrete cards + a CPU-emulator (lavapipe,
+// reports UMA=true) + an iGPU.  The auto-pick should ignore
+// the UMA devices entirely and run argmax over the discrete
+// subset.
+void test_multi_discrete_argmax_over_discrete_subset() {
+    // 4 devices: 2 discrete (16/32 GB), 2 UMA (120/120 GB).
+    // Discrete-only argmax picks dev1 (32 GB > 16 GB).
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{
+                16ull * 1024 * 1024 * 1024,    // dev0: discrete, 16 GB
+                32ull * 1024 * 1024 * 1024,    // dev1: discrete, 32 GB
+                120ull * 1024 * 1024 * 1024,   // dev2: UMA, 120 GB
+                120ull * 1024 * 1024 * 1024},  // dev3: UMA, 120 GB
+            std::vector<bool>{false, false, true, true}) == 1);
+    // Discrete subset tie-break: dev0 + dev2 both discrete with
+    // 16 GB, dev1 is UMA.  Tie → lower index = 0.
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{
+                16ull * 1024 * 1024 * 1024,
+                120ull * 1024 * 1024 * 1024,
+                16ull * 1024 * 1024 * 1024},
+            std::vector<bool>{false, true, false}) == 0);
+}
+
+// Test 12 — All-UMA falls back to argmax(free_vram).
+//
+// Mobile / laptop with only an iGPU available, or a CPU-only
+// build using lavapipe.  No discrete present, so the bias
+// degenerates to the round-3 policy.
+void test_all_uma_falls_back_to_argmax() {
+    // Two iGPUs (rare but possible on some multi-socket boards).
+    // Falls back to argmax(free_vram).
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{100, 500},
+            std::vector<bool>{true, true}) == 1);
+    // Single iGPU.
+    CHECK(resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{500},
+            std::vector<bool>{true}) == 0);
+}
+
+// Test 13 — Explicit index passthrough is UMA-agnostic.
+//
+// An operator who knows their machine + workload can still pin
+// `--vulkan-device 1` even when device 1 is UMA.  The bias
+// applies ONLY to the `-1` auto-pick path.  (Useful for testing
+// the iGPU path or for low-thermal scenarios where the
+// operator deliberately offloads to UMA.)
+void test_explicit_index_ignores_uma_bias() {
+    // Pinned to UMA index 1 — passthrough, no bias kicks in.
+    CHECK(resolve_vulkan_device_index(
+            1,
+            std::vector<size_t>{32ull * 1024 * 1024 * 1024,
+                                 120ull * 1024 * 1024 * 1024},
+            std::vector<bool>{false, true}) == 1);
+    // Pinned to discrete index 0 — passthrough.
+    CHECK(resolve_vulkan_device_index(
+            0,
+            std::vector<size_t>{32ull * 1024 * 1024 * 1024,
+                                 120ull * 1024 * 1024 * 1024},
+            std::vector<bool>{false, true}) == 0);
+}
+
+// Test 14 — Mismatched UMA list length throws.
+//
+// Caller bug guard.  If the UMA list is non-empty AND its size
+// doesn't match `free_vram_per_device`, throw rather than
+// silently truncating or out-of-bounds-reading.  Either zero
+// (use round-3 policy) or the full length (use round-12 policy)
+// — anything else is a wiring bug in the caller.
+void test_mismatched_uma_list_length_throws() {
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{100, 500},
+            std::vector<bool>{false});  // 1 entry vs 2 devices
+    }));
+    CHECK(throws_runtime_error([] {
+        (void) resolve_vulkan_device_index(
+            -1,
+            std::vector<size_t>{100, 500},
+            std::vector<bool>{false, true, false});  // 3 vs 2
+    }));
+}
+
+} // namespace
+
+int main() {
+    test_empty_device_list_throws();
+    test_single_device_returns_zero();
+    test_auto_pick_max_vram();
+    test_auto_pick_ties_pick_lower_index();
+    test_explicit_index_returns_unchanged();
+    test_out_of_range_throws();
+    test_reserved_negative_throws();
+    test_zero_vram_handling();
+    // Round 12 — UMA bias.
+    test_empty_uma_preserves_round3_behaviour();
+    test_hybrid_prefer_discrete_over_uma();
+    test_uma_aware_tiebreak_equal_vram_discretes();
+    test_multi_discrete_argmax_over_discrete_subset();
+    test_all_uma_falls_back_to_argmax();
+    test_explicit_index_ignores_uma_bias();
+    test_mismatched_uma_list_length_throws();
+
+    std::fprintf(stderr,
+                 "test_supertonic_vulkan_device_select: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
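
The policy that the header comment and tests 1-14 describe can be sketched roughly as follows. This is an illustrative reconstruction from the behaviour matrices in the test file, not the helper shipped in `supertonic_internal.h`; the function name `resolve_vulkan_device_index_sketch`, the error messages, and the order of the validation checks are assumptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of the auto-pick policy; the real helper may differ.
int resolve_vulkan_device_index_sketch(
        int requested,
        const std::vector<std::size_t> & free_vram,
        const std::vector<bool>        & is_uma = {}) {
    if (free_vram.empty()) {
        throw std::runtime_error("no Vulkan device to pick");
    }
    // An empty UMA list selects the round-3 policy; otherwise lengths must match.
    if (!is_uma.empty() && is_uma.size() != free_vram.size()) {
        throw std::runtime_error("is_uma / free_vram length mismatch");
    }
    if (requested < -1) {
        throw std::runtime_error("negative index other than -1 is reserved");
    }
    if (requested >= 0) {
        // Explicit passthrough: range-checked, never second-guessed.
        if (static_cast<std::size_t>(requested) >= free_vram.size()) {
            throw std::runtime_error("device index out of range");
        }
        return requested;
    }

    // requested == -1: restrict to discrete devices only on hybrid machines.
    bool has_discrete = false;
    bool has_uma      = false;
    for (bool uma : is_uma) {
        if (uma) has_uma = true; else has_discrete = true;
    }
    const bool discrete_only = has_discrete && has_uma;

    std::size_t best  = 0;
    bool        found = false;
    for (std::size_t i = 0; i < free_vram.size(); ++i) {
        if (discrete_only && is_uma[i]) continue;
        // Strict '>' keeps the first maximum, so ties go to the lower index.
        if (!found || free_vram[i] > free_vram[best]) {
            best  = i;
            found = true;
        }
    }
    return static_cast<int>(best);
}
```

Under these assumptions, a call with free VRAM `{32, 120}` and UMA flags `{false, true}` returns the discrete device's index, while the same call without the flags falls back to plain `argmax`.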
diff --git a/tts-cpp/test/test_supertonic_vulkan_dispatch.cpp b/tts-cpp/test/test_supertonic_vulkan_dispatch.cpp
new file mode 100644
index 00000000000..64310cf6e2b
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_vulkan_dispatch.cpp
@@ -0,0 +1,268 @@
+// QVAC-18605 — CPU-only unit test for the Vulkan-specific dispatch
+// additions landed alongside the Vulkan bring-up:
+//
+//   1. `supertonic_model::backend_is_vk` — informational flag set
+//      from `ggml_backend_is_vk()` at GGUF load.  Carried through
+//      to engine.cpp / supertonic_bench.cpp's backend-name
+//      annotator (verified by inspection; not under unit test).
+//   2. `supertonic_model::use_native_leaky_relu` — true when the
+//      resolved backend supports `GGML_OP_LEAKY_RELU` natively.
+//      Mirrored into the thread-local `g_supertonic_use_native_leaky_relu`
+//      by `supertonic_op_dispatch_scope`; consulted by
+//      `leaky_relu_portable_ggml` to skip the RELU+SCALE+ADD
+//      decomposition when the fused op is available.
+//   3. `supertonic_backend_supports_f16_kv_flash_attn(backend)` —
+//      load-time backend probe used by engine + bench to gate the
+//      `use_f16_attn` auto-policy.
+//
+// All three additions are CPU-only-testable: the flags are POD on
+// `supertonic_model`, the dispatch scope is a thread-local mirror,
+// and the probe takes any `ggml_backend_t` (CPU works fine — it
+// supports `LEAKY_RELU` natively, and the F16-K/V flash-attn op
+// support depends on whether the CPU backend was built with the
+// flash-attn kernel).
+//
+// No GGUF / model file required.  Registered with `LABEL "unit"` in
+// CMakeLists.txt so a fresh checkout's `ctest` exercises this without
+// any fixture.
+//
+// Companion to `test_supertonic_backend_dispatch.cpp` (the OpenCL
+// bring-up's tests for `op_dispatch_scope`); this file extends the
+// same harness with the new `use_native_leaky_relu` mirror and adds
+// a probe smoke test.
+
+#include "supertonic_internal.h"
+
+#include "ggml-backend.h"
+#include "ggml-cpu.h"
+
+#include <cstdio>
+#include <stdexcept>
+
+using namespace tts_cpp::supertonic::detail;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Test 1 — Default thread-local state for the new query.
+//
+// Every thread enters with `use_native_leaky_relu` defaulted to
+// `true` (matches the historical CPU-only path: CPU has the fused
+// op natively, so we want callers without a scope active to keep
+// emitting it).  Same default-true contract as
+// `supertonic_use_cpu_custom_ops()`.
+void test_default_native_leaky_relu_flag() {
+    CHECK(supertonic_use_native_leaky_relu() == true);
+}
+
+// Test 2 — Scope mirrors a CPU model.
+//
+// CPU explicitly sets `use_native_leaky_relu = true` (the load-time
+// probe always returns true on CPU); the dispatch scope must
+// mirror that without flipping anything.
+void test_scope_mirrors_cpu_model() {
+    supertonic_model model;
+    model.backend_is_cpu        = true;
+    model.backend_is_vk         = false;
+    model.use_native_leaky_relu = true;
+    {
+        supertonic_op_dispatch_scope scope(model);
+        CHECK(supertonic_use_cpu_custom_ops() == true);
+        CHECK(supertonic_use_native_leaky_relu() == true);
+    }
+    CHECK(supertonic_use_native_leaky_relu() == true);
+}
+
+// Test 3 — Scope mirrors a Vulkan-style model.
+//
+// On Vulkan the load-time probe sets `backend_is_cpu = false`,
+// `backend_is_vk = true`, and `use_native_leaky_relu = true`
+// (ggml-vulkan's `pipeline_leaky_relu_f32` natively implements the
+// op).  `leaky_relu_portable_ggml` should emit the fused builtin
+// inside this scope, not the RELU+SCALE+ADD decomposition.
+void test_scope_mirrors_vulkan_model() {
+    supertonic_model model;
+    model.backend_is_cpu        = false;
+    model.backend_is_vk         = true;
+    model.use_native_leaky_relu = true;
+    {
+        supertonic_op_dispatch_scope scope(model);
+        // CPU custom ops disabled (it's a non-CPU backend), but the
+        // native LEAKY_RELU dispatch is on (Vulkan supports it).
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_native_leaky_relu() == true);
+    }
+    // After teardown, defaults restored.
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_native_leaky_relu() == true);
+}
+
+// Test 4 — Scope mirrors an OpenCL-style model (probe = false).
+//
+// Plain upstream ggml-opencl rejects `GGML_OP_LEAKY_RELU` (only
+// chatterbox's vendored patch adds it).  When the load-time probe
+// returns false we expect the dispatch helper to take the
+// RELU+SCALE+ADD decomposition path instead — the scope must
+// faithfully transport that bit.
+void test_scope_mirrors_opencl_model() {
+    supertonic_model model;
+    model.backend_is_cpu        = false;
+    model.backend_is_vk         = false;
+    model.use_native_leaky_relu = false;
+    {
+        supertonic_op_dispatch_scope scope(model);
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_native_leaky_relu() == false);
+    }
+    // After teardown, defaults restored — the next CPU engine in
+    // the same thread sees the full fused-ops path again.
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_native_leaky_relu() == true);
+}
+
+// Test 5 — RAII teardown on exception (extends the OpenCL bring-up
+// test to cover the new flag).
+//
+// If a forward-pass body throws (invalid voice, GGML buffer alloc
+// failure, …), the scope must still restore the previous
+// `use_native_leaky_relu` so the next engine's call sees a clean
+// slate.
+void test_scope_unwinds_on_exception() {
+    supertonic_model model;
+    model.backend_is_cpu        = false;
+    model.backend_is_vk         = true;
+    model.use_native_leaky_relu = true;
+    bool caught = false;
+    try {
+        supertonic_op_dispatch_scope scope(model);
+        CHECK(supertonic_use_cpu_custom_ops() == false);
+        CHECK(supertonic_use_native_leaky_relu() == true);
+        throw std::runtime_error("simulated forward failure");
+    } catch (const std::runtime_error &) {
+        caught = true;
+    }
+    CHECK(caught);
+    CHECK(supertonic_use_cpu_custom_ops() == true);
+    CHECK(supertonic_use_native_leaky_relu() == true);
+}
+
+// Test 6 — Nested scopes stack and unwind correctly for the new flag.
+//
+// Mirrors `test_nested_scopes` in `test_supertonic_backend_dispatch.cpp`
+// for the new bit so a regression in the dtor restore order shows up
+// here as well as in the older test.
+void test_nested_scopes() {
+    supertonic_model vk_model;
+    vk_model.backend_is_cpu        = false;
+    vk_model.backend_is_vk         = true;
+    vk_model.use_native_leaky_relu = true;
+
+    supertonic_model cl_model;  // OpenCL-style: probe returned false
+    cl_model.backend_is_cpu        = false;
+    cl_model.backend_is_vk         = false;
+    cl_model.use_native_leaky_relu = false;
+
+    {
+        supertonic_op_dispatch_scope outer(vk_model);
+        CHECK(supertonic_use_native_leaky_relu() == true);
+        {
+            supertonic_op_dispatch_scope inner(cl_model);
+            CHECK(supertonic_use_native_leaky_relu() == false);
+        }
+        // Inner unwound — outer's bit restored.
+        CHECK(supertonic_use_native_leaky_relu() == true);
+    }
+    CHECK(supertonic_use_native_leaky_relu() == true);
+}
+
+// Test 7 — F16-K/V flash-attn backend probe smoke test.
+//
+// Loads the CPU backend (always available) and asks the probe
+// whether it would accept a Supertonic-shaped F16-K/V flash-attn
+// node.  We don't assert a specific true/false — the answer
+// depends on the CPU backend's build (some upstream builds support
+// F16-K/V flash-attn via the cblas reference path; some don't).
+// What we assert is:
+//   1. The probe returns `false` on a null backend (defensive).
+//   2. The probe doesn't crash on the CPU backend.
+//   3. Whatever the probe returns, calling it twice returns the
+//      same value (it's pure / cacheable).
+void test_f16_kv_flash_attn_probe_smoke() {
+    CHECK(supertonic_backend_supports_f16_kv_flash_attn(nullptr) == false);
+
+    ggml_backend_t cpu = ggml_backend_cpu_init();
+    if (!cpu) {
+        std::fprintf(stderr, "skip: CPU backend init failed\n");
+        return;
+    }
+    bool a = supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    bool b = supertonic_backend_supports_f16_kv_flash_attn(cpu);
+    CHECK(a == b);
+    std::fprintf(stderr, "probe(F16-K/V flash-attn, CPU) = %s\n",
+                 a ? "true" : "false");
+    ggml_backend_free(cpu);
+}
+
+// Test 8 — Independent flag mutation.
+//
+// The three flags are independent dimensions: a user might force
+// `--f16-attn 1` on a CPU backend (for parity testing), or
+// disable `use_native_leaky_relu` on a CPU model (to exercise
+// the GPU decomposition path).  Make sure
+// `op_dispatch_scope` round-trips each combination without
+// crossing wires.
+void test_independent_flags() {
+    // CPU + force F16 attn + force decomposed leaky-relu.
+    supertonic_model m;
+    m.backend_is_cpu        = true;
+    m.backend_is_vk         = false;
+    m.use_f16_attn          = true;
+    m.use_native_leaky_relu = false;
+    {
+        supertonic_op_dispatch_scope scope(m);
+        CHECK(supertonic_use_cpu_custom_ops()    == true);
+        CHECK(supertonic_use_f16_attn()          == true);
+        CHECK(supertonic_use_native_leaky_relu() == false);
+    }
+
+    // Vulkan + force F32 attn + force native leaky-relu.
+    m.backend_is_cpu        = false;
+    m.backend_is_vk         = true;
+    m.use_f16_attn          = false;
+    m.use_native_leaky_relu = true;
+    {
+        supertonic_op_dispatch_scope scope(m);
+        CHECK(supertonic_use_cpu_custom_ops()    == false);
+        CHECK(supertonic_use_f16_attn()          == false);
+        CHECK(supertonic_use_native_leaky_relu() == true);
+    }
+}
+
+} // namespace
+
+int main() {
+    test_default_native_leaky_relu_flag();
+    test_scope_mirrors_cpu_model();
+    test_scope_mirrors_vulkan_model();
+    test_scope_mirrors_opencl_model();
+    test_scope_unwinds_on_exception();
+    test_nested_scopes();
+    test_f16_kv_flash_attn_probe_smoke();
+    test_independent_flags();
+
+    std::fprintf(stderr,
+                 "test_supertonic_vulkan_dispatch: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
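
The save-and-restore contract that tests 2-6 pin down can be illustrated with a minimal RAII sketch. Everything in the `sketch` namespace is hypothetical: the real `supertonic_op_dispatch_scope` and its thread-local mirrors are not shown in this diff, and the mapping of `backend_is_cpu` onto the custom-ops flag is inferred from the test expectations only.

```cpp
#include <stdexcept>

namespace sketch {

// Hypothetical stand-in for the relevant fields of supertonic_model.
struct model_flags {
    bool backend_is_cpu        = true;
    bool use_native_leaky_relu = true;
};

// Thread-local mirrors; both default to true, as tests 1 and 3 expect.
thread_local bool g_use_cpu_custom_ops    = true;
thread_local bool g_use_native_leaky_relu = true;

class op_dispatch_scope {
public:
    explicit op_dispatch_scope(const model_flags & m)
        : prev_cpu_(g_use_cpu_custom_ops),
          prev_leaky_(g_use_native_leaky_relu) {
        g_use_cpu_custom_ops    = m.backend_is_cpu;
        g_use_native_leaky_relu = m.use_native_leaky_relu;
    }
    // The destructor restores the saved values, so nested scopes unwind in
    // LIFO order and an exception thrown inside the scope still restores them.
    ~op_dispatch_scope() {
        g_use_cpu_custom_ops    = prev_cpu_;
        g_use_native_leaky_relu = prev_leaky_;
    }
    op_dispatch_scope(const op_dispatch_scope &)             = delete;
    op_dispatch_scope & operator=(const op_dispatch_scope &) = delete;

private:
    bool prev_cpu_;
    bool prev_leaky_;
};

} // namespace sketch
```

Saving the previous values in the constructor, rather than resetting to hard-coded defaults in the destructor, is what lets an inner scope hand control back to an outer scope with different flags.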
diff --git a/tts-cpp/test/test_supertonic_vulkan_env_overrides.cpp b/tts-cpp/test/test_supertonic_vulkan_env_overrides.cpp
new file mode 100644
index 00000000000..a43e29f05a3
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_vulkan_env_overrides.cpp
@@ -0,0 +1,278 @@
+// QVAC-18605 round 7 — CPU-only TDD test for the Vulkan env-var
+// passthrough mechanism.
+//
+// Background
+// ----------
+// ggml-vulkan reads numerous `GGML_VK_*` env vars at backend init
+// time to configure adapter selection, coopmat / bf16 toggles, the
+// perf logger, etc.  Operators currently have to set these env
+// vars in the shell before invoking supertonic-cli / tts-cli /
+// supertonic-bench, which is awkward when the env is managed by a
+// service supervisor or when the operator wants to A/B-compare
+// settings without losing their shell state.
+//
+// Round 7 adds:
+//
+//   1. A new `EngineOptions::vulkan_env_overrides` field
+//      (std::map<std::string, std::string>) that the engine
+//      applies just before backend init.
+//
+//   2. A public helper `apply_vulkan_env_overrides(map)` declared
+//      in `supertonic_internal.h`, defined in `supertonic_gguf.cpp`,
+//      that:
+//        - validates each key starts with `GGML_VK_`
+//          (throws std::runtime_error on a bad key — guards
+//           against operator-config typos like
+//           `GMML_VK_PREFER_HOST_MEMORY`);
+//        - calls `set_env_if_unset(key, value)` so an
+//          operator-set env var still wins over the EngineOptions
+//          override (lets operators force a setting from the
+//          shell without recompiling).
+//
+//   3. CLI flags on supertonic-cli / tts-cli / supertonic-bench
+//      that map friendly names to `GGML_VK_*` env var keys:
+//
+//        --vulkan-prefer-host-memory  → GGML_VK_PREFER_HOST_MEMORY=1
+//        --vulkan-disable-coopmat2    → GGML_VK_DISABLE_COOPMAT2=1
+//        --vulkan-disable-bfloat16    → GGML_VK_DISABLE_BFLOAT16=1
+//        --vulkan-perf-logger         → GGML_VK_PERF_LOGGER=1
+//        --vulkan-async-transfer      → GGML_VK_ASYNC_USE_TRANSFER_QUEUE=1
+//
+//      Each flag inserts the corresponding entry into
+//      EngineOptions::vulkan_env_overrides; the engine then
+//      applies them via `apply_vulkan_env_overrides()` before
+//      `init_supertonic_backend()` runs.
+//
+// This test is the TDD gate for the EngineOptions field + the
+// public helper.  CLI parsing is exercised by separate smoke
+// tests on each binary's `--help` output (visual; no test gate
+// — same as every other CLI flag added in rounds 1-6).
+//
+// Whole TU MUST fail to compile before the symbols are added,
+// then pass after.
+
+#include "tts-cpp/supertonic/engine.h"
+#include "supertonic_internal.h"
+
+#include <cstdio>
+#include <cstdlib>
+#include <map>
+#include <stdexcept>
+#include <string>
+#include <type_traits>
+
+using tts_cpp::supertonic::detail::apply_vulkan_env_overrides;
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+template <typename F>
+bool throws_runtime_error(F && fn) {
+    try { fn(); return false; }
+    catch (const std::runtime_error &) { return true; }
+    catch (...) { return false; }
+}
+
+// SFINAE: assert the EngineOptions field exists.
+template <typename T>
+auto has_vulkan_env_overrides(int) -> decltype(
+    std::declval<T &>().vulkan_env_overrides, std::true_type{});
+template <typename T>
+auto has_vulkan_env_overrides(...) -> std::false_type;
+
+void unsetenv_safe(const char * name) {
+#if defined(_WIN32)
+    _putenv_s(name, "");  // an empty value removes the variable from the CRT environment
+#else
+    unsetenv(name);
+#endif
+}
+
+// Test 1 — `EngineOptions::vulkan_env_overrides` field exists and
+// has the expected type, default-constructs empty, accepts
+// assignment.
+void test_engine_options_field_exists() {
+    using namespace tts_cpp::supertonic;
+    static_assert(
+        decltype(has_vulkan_env_overrides<EngineOptions>(0))::value,
+        "EngineOptions must declare vulkan_env_overrides "
+        "(std::map<std::string, std::string>)");
+
+    EngineOptions opts;
+    CHECK(opts.vulkan_env_overrides.empty());
+
+    opts.vulkan_env_overrides["GGML_VK_PREFER_HOST_MEMORY"] = "1";
+    opts.vulkan_env_overrides["GGML_VK_DISABLE_COOPMAT2"]   = "1";
+    CHECK(opts.vulkan_env_overrides.size() == 2);
+    CHECK(opts.vulkan_env_overrides["GGML_VK_PREFER_HOST_MEMORY"] == "1");
+
+    // Round-3 + round-4 + round-6 baseline regression guard.
+    EngineOptions baseline;
+    CHECK(baseline.vulkan_env_overrides.empty());
+    CHECK(baseline.kv_attn_type == -1);
+    CHECK(baseline.f16_attn == -1);
+    CHECK(baseline.f16_weights == -1);
+    CHECK(baseline.f16_weights_deny_list.empty());
+    CHECK(baseline.vulkan_device == 0);
+    CHECK(baseline.prewarm_text.empty());
+}
+
+// Test 2 — `apply_vulkan_env_overrides({})` is a no-op (regression
+// guard against the helper accidentally touching the env on the
+// default empty path).
+void test_empty_map_is_noop() {
+    // Pre-condition: a unique, never-set env var must read back null.
+    const char * unique = "GGML_VK_TEST_R7_EMPTY_NOOP_KEY";
+    unsetenv_safe(unique);
+    CHECK(std::getenv(unique) == nullptr);
+
+    std::map<std::string, std::string> empty;
+    apply_vulkan_env_overrides(empty);
+
+    // Helper must NOT have invented a value for our unique key.
+    CHECK(std::getenv(unique) == nullptr);
+}
+
+// Test 3 — `apply_vulkan_env_overrides({{"GGML_VK_*", "v"}})` calls
+// `set_env_if_unset` so the env var becomes set on a clean env.
+void test_single_entry_sets_env() {
+    const char * key = "GGML_VK_TEST_R7_SETS_ENV";
+    unsetenv_safe(key);
+    CHECK(std::getenv(key) == nullptr);
+
+    apply_vulkan_env_overrides({{key, "value_a"}});
+
+    const char * actual = std::getenv(key);
+    CHECK(actual != nullptr);
+    if (actual) CHECK(std::string(actual) == "value_a");
+
+    unsetenv_safe(key);
+}
+
+// Test 4 — operator-set env wins over the EngineOptions override.
+//
+// This pins the `set_env_if_unset` semantics: an operator who
+// has already exported `GGML_VK_DISABLE_COOPMAT2=0` in their shell
+// must NOT have it overwritten by an EngineOptions override.
+// This lets a debugging operator force a setting from the
+// shell without recompiling.
+void test_operator_env_wins() {
+    const char * key = "GGML_VK_TEST_R7_OPERATOR_WINS";
+#if defined(_WIN32)
+    _putenv_s(key, "operator_set");
+#else
+    setenv(key, "operator_set", 1);
+#endif
+    CHECK(std::string(std::getenv(key) ? std::getenv(key) : "") == "operator_set");
+
+    apply_vulkan_env_overrides({{key, "engine_override"}});
+
+    const char * after = std::getenv(key);
+    CHECK(after != nullptr);
+    if (after) CHECK(std::string(after) == "operator_set");
+
+    unsetenv_safe(key);
+}
+
+// Test 5 — invalid key (no `GGML_VK_` prefix) throws.
+//
+// Loud-failure for operator-config typos — same convention as
+// `--kv-attn-type bogus` (round 4) and `--vulkan-device -2`
+// (round 3 reserved-negative throw).  An operator that types
+// `GMML_VK_PREFER_HOST_MEMORY` in their config gets a clean
+// error message instead of silently setting an env var that
+// ggml-vulkan won't read.
+void test_invalid_key_throws() {
+    CHECK(throws_runtime_error([] {
+        apply_vulkan_env_overrides({{"GMML_VK_PREFER_HOST_MEMORY", "1"}});
+    }));
+    CHECK(throws_runtime_error([] {
+        apply_vulkan_env_overrides({{"PATH", "1"}});
+    }));
+    CHECK(throws_runtime_error([] {
+        apply_vulkan_env_overrides({{"", "1"}});
+    }));
+    CHECK(throws_runtime_error([] {
+        apply_vulkan_env_overrides({{"GGML_", "1"}});  // close but missing _VK_
+    }));
+    CHECK(throws_runtime_error([] {
+        apply_vulkan_env_overrides({{"GGML_VK", "1"}}); // missing trailing underscore
+    }));
+}
+
+// Test 6 — when a single bad entry is in a map with several good
+// entries, the throw fires AT the bad entry; the helper must NOT
+// silently apply the good entries before the throw lands (ALL or
+// NOTHING semantics so a partial-success doesn't leave the env
+// in a half-applied state).
+void test_all_or_nothing_on_invalid_key() {
+    const char * good_a = "GGML_VK_TEST_R7_AON_A";
+    const char * good_b = "GGML_VK_TEST_R7_AON_B";
+    unsetenv_safe(good_a);
+    unsetenv_safe(good_b);
+
+    std::map<std::string, std::string> mixed = {
+        {good_a, "1"},
+        {"ZZZ_BAD_KEY", "should_throw"},  // sorts last in std::map, so both good keys are visited first
+        {good_b, "1"},
+    };
+    CHECK(throws_runtime_error([&] {
+        apply_vulkan_env_overrides(mixed);
+    }));
+
+    // Neither good key should have been applied.
+    CHECK(std::getenv(good_a) == nullptr);
+    CHECK(std::getenv(good_b) == nullptr);
+}
+
+// Test 7 — multi-entry happy path.
+void test_multi_entry_all_applied() {
+    const char * a = "GGML_VK_TEST_R7_MULTI_A";
+    const char * b = "GGML_VK_TEST_R7_MULTI_B";
+    const char * c = "GGML_VK_TEST_R7_MULTI_C";
+    unsetenv_safe(a);
+    unsetenv_safe(b);
+    unsetenv_safe(c);
+
+    apply_vulkan_env_overrides({
+        {a, "alpha"},
+        {b, "beta"},
+        {c, "gamma"},
+    });
+
+    CHECK(std::string(std::getenv(a) ? std::getenv(a) : "") == "alpha");
+    CHECK(std::string(std::getenv(b) ? std::getenv(b) : "") == "beta");
+    CHECK(std::string(std::getenv(c) ? std::getenv(c) : "") == "gamma");
+
+    unsetenv_safe(a);
+    unsetenv_safe(b);
+    unsetenv_safe(c);
+}
+
+} // namespace
+
+int main() {
+    test_engine_options_field_exists();
+    test_empty_map_is_noop();
+    test_single_entry_sets_env();
+    test_operator_env_wins();
+    test_invalid_key_throws();
+    test_all_or_nothing_on_invalid_key();
+    test_multi_entry_all_applied();
+
+    std::fprintf(stderr,
+                 "test_supertonic_vulkan_env_overrides: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}
diff --git a/tts-cpp/test/test_supertonic_warm_up_api.cpp b/tts-cpp/test/test_supertonic_warm_up_api.cpp
new file mode 100644
index 00000000000..4d5ecf00f28
--- /dev/null
+++ b/tts-cpp/test/test_supertonic_warm_up_api.cpp
@@ -0,0 +1,118 @@
+// QVAC-18605 follow-up — CPU-only API-surface test for the
+// first-synth pre-warm hook added alongside the Vulkan bring-up:
+//
+//   - `tts_cpp::supertonic::EngineOptions::prewarm_text` exists,
+//     defaults to empty, and accepts a std::string assignment.
+//
+//   - `tts_cpp::supertonic::Engine::warm_up(const std::string &)`
+//     exists in the public API and is callable.
+//
+// We intentionally don't construct a real `Engine` here — that
+// requires a GGUF fixture and the engine surface is exercised
+// end-to-end by `test-supertonic-pipeline` (LABEL "fixture").
+// This file's job is to lock in the *compile-time contract* of
+// the new fields / methods so a future refactor that renames or
+// removes them breaks this test before the downstream
+// integration / fixture tests have a chance to drift.
+//
+// Once built, the harness runs in <1 ms; on a fresh
+// checkout `ctest -L unit` exercises it without any model file.
+
+#include "tts-cpp/supertonic/engine.h"
+
+#include <cstdio>
+#include <string>
+#include <type_traits>
+
+namespace {
+
+int g_failures = 0;
+int g_checks   = 0;
+
+#define CHECK(cond) do {                                              \
+    ++g_checks;                                                       \
+    if (!(cond)) {                                                    \
+        ++g_failures;                                                 \
+        std::fprintf(stderr, "FAIL %s:%d  %s\n",                     \
+                     __FILE__, __LINE__, #cond);                      \
+    }                                                                 \
+} while (0)
+
+// Test 1 — `prewarm_text` exists, defaults to empty, accepts
+// std::string.
+//
+// Compile-time + runtime: a default-constructed EngineOptions
+// has an empty `prewarm_text`, and we can write a non-empty
+// string to it without surprises.  This locks in the field's
+// type (std::string, not const char*, not std::string_view) and
+// default state.
+void test_prewarm_text_default_empty() {
+    tts_cpp::supertonic::EngineOptions opts;
+    CHECK(opts.prewarm_text.empty());
+
+    opts.prewarm_text = "Hello world";
+    CHECK(opts.prewarm_text == "Hello world");
+
+    opts.prewarm_text.clear();
+    CHECK(opts.prewarm_text.empty());
+
+    static_assert(std::is_same<decltype(opts.prewarm_text), std::string>::value,
+                  "EngineOptions::prewarm_text must be std::string");
+}
+
+// Test 2 — `Engine::warm_up(const std::string &)` exists in the
+// public API.
+//
+// Asserts the method's existence and signature via SFINAE.  We
+// don't actually call it (would require a constructed Engine
+// which would need a GGUF fixture); the goal is just to fail
+// compilation if the public symbol disappears.
+template <typename E, typename = void>
+struct has_warm_up : std::false_type {};
+
+template <typename E>
+struct has_warm_up<E,
+                   std::void_t<decltype(std::declval<E &>().warm_up(std::declval<const std::string &>()))>>
+    : std::true_type {};
+
+void test_warm_up_method_exists() {
+    static_assert(has_warm_up<tts_cpp::supertonic::Engine>::value,
+                  "Engine::warm_up(const std::string &) must exist in the public API");
+    CHECK(true);  // tally one runtime check so the harness reports a count
+}
+
+// Test 3 — Field-by-field default state of EngineOptions.
+//
+// Documents the defaults the engine relies on, so that a
+// regression like "prewarm_text accidentally defaults to a
+// hard-coded sample text" is caught here.  That would silently
+// charge callers the prewarm cost; even on CPU, where warm_up
+// is a no-op, the OptionsCheck would surface it in a debug log.
+void test_engine_options_defaults() {
+    tts_cpp::supertonic::EngineOptions o;
+    CHECK(o.model_gguf_path.empty());
+    CHECK(o.prewarm_text.empty());
+    CHECK(o.vulkan_device == 0);
+    // QVAC-18605 follow-up — the default values for f16_attn /
+    // f16_weights are -1 (auto: gated on the new probe set).
+    // The probes themselves are exercised by
+    // test_supertonic_capability_cache.cpp; here we just lock
+    // in the auto-policy default so nobody accidentally flips
+    // the engine to "force on" or "force off" by changing the
+    // sentinel value.
+    CHECK(o.f16_attn    == -1);
+    CHECK(o.f16_weights == -1);
+}
+
+} // namespace
+
+int main() {
+    test_prewarm_text_default_empty();
+    test_warm_up_method_exists();
+    test_engine_options_defaults();
+
+    std::fprintf(stderr,
+                 "test_supertonic_warm_up_api: %d / %d checks passed\n",
+                 g_checks - g_failures, g_checks);
+    return g_failures == 0 ? 0 : 1;
+}