Skip to content

add_codeowners file#5

Merged
GustavoA1604 merged 1 commit into
tetherto:masterfrom
shikha-tether:add_codeowners_file
Nov 17, 2025
Merged

add_codeowners file#5
GustavoA1604 merged 1 commit into
tetherto:masterfrom
shikha-tether:add_codeowners_file

Conversation

@shikha-tether

Copy link
Copy Markdown

No description provided.

@GustavoA1604 GustavoA1604 merged commit 010d8f9 into tetherto:master Nov 17, 2025
GustavoA1604 added a commit that referenced this pull request May 7, 2026
The standalone setup-ggml.sh + patches/ tooling was dropped from
qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the
CMakeLists.txt still:
  * defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
  * unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-"
    on the bundled ggml target.
That combination quietly broke standalone bundled-ggml builds: the
filename-prefix patch was no longer applied, so libspeech-ggml-*.so
files existed on disk but ggml's runtime loader still searched for
libggml-*.so under GGML_BACKEND_DL=ON.  Vulkan / OpenCL / CUDA
backends silently failed to load on Android.

Fix per reviewer guidance: converge the speech stack on a single ggml
source-of-truth.  Standalone-bundled-ggml is no longer a supported
build mode out of this in-tree subtree; the canonical path is
`-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack
`ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships
the patches pre-applied.

Edits:

- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this
  tree.  Docstring spells out the rationale + points users at the
  standalone github.com/gianni-cor/chatterbox.cpp repo if they need
  a bundled-ggml dev build with patches/ present.

- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to
  configure when patches/ is absent: a FATAL_ERROR points at the
  right consumption path (vcpkg ggml-speech) and the standalone
  fallback.  Doesn't break in-tree-with-patches builds (parakeet-cpp
  in this same repo still ships patches/, so its bundled path is
  unaffected by this guard inside tts-cpp).

- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors
  out at find_package(ggml CONFIG REQUIRED) with our new message
  pointing at the ggml-speech port; `cmake -S tts-cpp -B build
  -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard
  with the no-patches message.

- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/
  that no longer exist; running it would have errored out anyway.
  The standalone repo keeps its own setup-ggml.sh; only the in-tree
  subtree drops it.

The standalone chatterbox.cpp repo (the one tts-cpp/ was copied
from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/
folder + scripts/setup-ggml.sh.  This commit is therefore an
integration-time delta against that source, not a change to the
standalone build flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request May 7, 2026
After commit fa0d490 (review #5+#6) made bundled-add_subdirectory(ggml)
hard-error in this in-tree subtree when patches/ is absent, the
TTS_CPP_GGML_LIB_PREFIX block became dead code:

  if (TTS_CPP_GGML_LIB_PREFIX AND NOT TTS_CPP_USE_SYSTEM_GGML)

NOT TTS_CPP_USE_SYSTEM_GGML can never reach this `if` here -
configure has already FATAL_ERROR'd at the patches/-absent guard.
The option, the helper function, the foreach loop, the
GGML_BACKEND_DL_PROJECT_PREFIX define, and the STATUS message were
all unreachable.  The next maintainer flipping
-DTTS_CPP_GGML_LIB_PREFIX=OFF to disable prefixing would have been
silently confused when nothing changed.

Edits:

tts-cpp/CMakeLists.txt:
  - The option() declaration at line 22 removed.  Replaced with a
    one-paragraph cross-reference to the standalone chatterbox.cpp
    repo for the locally-rename flow + the rationale (ggml-speech
    vcpkg port emits the libspeech-ggml-* filenames itself).
  - The 41-line block at lines 131-176 (tts_cpp_apply_ggml_prefix
    function + foreach + target_compile_definitions +
    STATUS message) replaced with a 9-line note telling future
    readers where the standalone counterpart lives.

tts-cpp/README.md:
  - Useful CMake options table row for TTS_CPP_GGML_LIB_PREFIX
    rewritten with a strikethrough + "n/a in this subtree" cell:
    explains the standalone option exists at chatterbox.cpp upstream,
    why it's unnecessary here (ggml-speech vcpkg port handles the
    rename at its own build time), and that the file-prefix surface
    is whatever vcpkg installs.

Doc-only behavior visible to consumers: the integrated subtree no
longer has a TTS_CPP_GGML_LIB_PREFIX option at all.  Build behaviour
unchanged - the vcpkg find_package path was already taking effect
and emitting libspeech-ggml-* as designed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 12, 2026
… integration (F20+F23)

Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.

Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.

Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries.  cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).

Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.

Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire.  Bit-exact (max_abs_err=0.0).  Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).

Full sweep verification:
  - 9 / 9 supertonic source files: clean syntax-check
  - 21 / 21 test files: clean syntax-check
  - 98 / 98 CPU-only unit-test checks pass across
    test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
    backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass tetherto#5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit that referenced this pull request May 13, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 14, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… integration (F20+F23)

Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.

Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.

Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries.  cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).

Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.

Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire.  Bit-exact (max_abs_err=0.0).  Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).

Full sweep verification:
  - 9 / 9 supertonic source files: clean syntax-check
  - 21 / 21 test files: clean syntax-check
  - 98 / 98 CPU-only unit-test checks pass across
    test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
    backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass tetherto#5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…lidation + Q8_0 K/V finding

Round 13 is a strict-improvement-only follow-up to round 12:
no code path is removed, no optimisation is rolled back, and
the end-to-end perf on every backend stays at the round-12
level.  Two deliverables, both no-regret:

== 1. New helper `alloc_input_scratchpad_or_throw` ==

Round 12 tetherto#5 inlined the "try pinned-host first, fall back to
default backend buffer, throw on both-fail" idiom at 4 cache
sites (front block + 3 group caches):

    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }

Round 13 factors it into one helper.  Each caller becomes:

    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");

Same correctness contract — CPU / Metal / OpenCL fall back to
default backend buffer; Vulkan tries pinned-host first.
Defensive failure modes consolidated: null model.backend, null
input_ctx, null cache_name all throw std::runtime_error with a
message that includes the cache name, instead of segfaulting in
an error-handler path.  Single point of maintenance for the
pattern; future cache builds that want pinned-host inputs use
the helper directly.

`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins
the contract via SFINAE on the symbol + CPU-fallback round-trip
through `ggml_backend_tensor_set` / `get` + null-arg throws +
empty-ctx error message includes the cache name.  CPU-only —
no GGUF fixture required.  CI test count goes from 24 / 24
(round 12) to 25 / 25 (round 13).

Perf impact: zero — same code path, same allocations, same data
movement, just one fewer level of nesting at each call site.

== 2. Q8_0 K/V no-win documented for RTX 5090 ==

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench
output advertises `q8_0_kv_attn=available`.  Round 13 measures
the trade-off on the test rig (RTX 5090, 1.79 TB/s memory
bandwidth, long prompt 206 chars / 18 s audio):

    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower

The F32->Q8_0 cast overhead exceeds the saved K/V upload
bandwidth on a high-bandwidth discrete GPU.  Operator guidance:
stick with the F16 default on RTX 5090 and similar high-
bandwidth discretes.  Q8_0 is shipped for adapters where the
K/V upload bottlenecks the synth (older PCIe 3.0, lower-end
discretes, iGPUs with slow BAR); cross-over point to be
measured per-adapter by operators using `--bench-per-step`
from round 7.

== Test plan ==

  ctest --test-dir tts-cpp/build         -L unit
  -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
  ctest --test-dir tts-cpp/build-vulkan  -L unit
  -> 25 / 25 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
writes a valid WAV.

Perf on RTX 5090 (10 runs + 3 warmup, long prompt):
  Round 12 baseline:  med= 31.11 ms  (588x realtime)
  Round 13:           med= 31.71 ms  (577x realtime)
  -> within run-to-run noise; no regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…lidation + Q8_0 K/V finding

Round 13 is a strict-improvement-only follow-up to round 12:
no code path is removed, no optimisation is rolled back, and
the end-to-end perf on every backend stays at the round-12
level.  Two deliverables, both no-regret:

== 1. New helper `alloc_input_scratchpad_or_throw` ==

Round 12 tetherto#5 inlined the "try pinned-host first, fall back to
default backend buffer, throw on both-fail" idiom at 4 cache
sites (front block + 3 group caches):

    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }

Round 13 factors it into one helper.  Each caller becomes:

    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");

Same correctness contract — CPU / Metal / OpenCL fall back to
default backend buffer; Vulkan tries pinned-host first.
Defensive failure modes consolidated: null model.backend, null
input_ctx, null cache_name all throw std::runtime_error with a
message that includes the cache name, instead of segfaulting in
an error-handler path.  Single point of maintenance for the
pattern; future cache builds that want pinned-host inputs use
the helper directly.

`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins
the contract via SFINAE on the symbol + CPU-fallback round-trip
through `ggml_backend_tensor_set` / `get` + null-arg throws +
empty-ctx error message includes the cache name.  CPU-only —
no GGUF fixture required.  CI test count goes from 24 / 24
(round 12) to 25 / 25 (round 13).

Perf impact: zero — same code path, same allocations, same data
movement, just one fewer level of nesting at each call site.

== 2. Q8_0 K/V no-win documented for RTX 5090 ==

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench
output advertises `q8_0_kv_attn=available`.  Round 13 measures
the trade-off on the test rig (RTX 5090, 1.79 TB/s memory
bandwidth, long prompt 206 chars / 18 s audio):

    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower

The F32->Q8_0 cast overhead exceeds the saved K/V upload
bandwidth on a high-bandwidth discrete GPU.  Operator guidance:
stick with the F16 default on RTX 5090 and similar high-
bandwidth discretes.  Q8_0 is shipped for adapters where the
K/V upload bottlenecks the synth (older PCIe 3.0, lower-end
discretes, iGPUs with slow BAR); cross-over point to be
measured per-adapter by operators using `--bench-per-step`
from round 7.

== Test plan ==

  ctest --test-dir tts-cpp/build         -L unit
  -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
  ctest --test-dir tts-cpp/build-vulkan  -L unit
  -> 25 / 25 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
writes a valid WAV.

Perf on RTX 5090 (10 runs + 3 warmup, long prompt):
  Round 12 baseline:  med= 31.11 ms  (588x realtime)
  Round 13:           med= 31.71 ms  (577x realtime)
  -> within run-to-run noise; no regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 pushed a commit that referenced this pull request May 15, 2026
`apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written
assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`.  In fact
the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)`
fallback) produces `ne=[L, HD]` with channel-major-flat memory
(`data[t + c*L]`) — the bit-exact transpose of the helper's
input contract.  Every CPU synth with `--n-gpu-layers 0` against
a GGUF carrying `vector_rope_theta` aborts at the helper's
defensive assertion on the first denoise step:

  supertonic_internal.h:742:
    GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
  apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml
  → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The CPU unit test that landed alongside the helper hand-built
Q under the wrong `[HD, L]` shape, so the failure mode was
invisible to CI.

Fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]`.  Reference built in
   scalar `apply_rope`'s native time-major-flat layout; test
   verifies the helper's output bytes match bit-for-bit AND
   pins `y->ne[0] = HD, y->ne[1] = L` so the downstream
   `q_tc_in` blit cannot regress on layout.  Committed RED
   first, observed to abort at the same assertion the
   production crash hits.

2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip
   from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]`
   time-major-flat (the layout `q_tc_in` expects).  Rest of
   the pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V has no RoPE to mask the layout flip — open-code the same
   `ggml_cont(ggml_transpose(...))` at the V matmul output in
   `build_group_graph_cache` and the front-block path in
   `supertonic_vector_trace_proj_ggml` so the GPU-bridge
   `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact
   bytes.  Style sq/sk/sv left untouched — this branch has no
   GPU bridge for style attention, so the host-vector path
   via `tensor_to_time_channel` is already correct.

4. Legacy host-bridge downloads of post-RoPE Q/K and
   post-transpose V switched from `tensor_to_time_channel` to
   `tensor_raw_f32`.  The new graph-side layout puts the bytes
   already in the time-major-flat shape scalar `apply_rope` /
   `flash_attention_qkv` host references read, so the raw
   download is the correct call; `tensor_to_time_channel`
   would apply the transpose-of-the-transpose and feed
   wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |

- `test-supertonic-rope-packed-qk`: 14 / 14 checks,
  `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.

Audio sanity on the exact QVAC-18966 reproduction command:
99.9% non-zero samples, rms=1406, abs_max=15984 — speech-like
dynamics, not silence / clipping / garbage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…lidation + Q8_0 K/V finding

Round 13 is a strict-improvement-only follow-up to round 12:
no code path is removed, no optimisation is rolled back, and
the end-to-end perf on every backend stays at the round-12
level.  Two deliverables, both no-regret:

== 1. New helper `alloc_input_scratchpad_or_throw` ==

Round 12 tetherto#5 inlined the "try pinned-host first, fall back to
default backend buffer, throw on both-fail" idiom at 4 cache
sites (front block + 3 group caches):

    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }

Round 13 factors it into one helper.  Each caller becomes:

    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");

Same correctness contract — CPU / Metal / OpenCL fall back to
default backend buffer; Vulkan tries pinned-host first.
Defensive failure modes consolidated: null model.backend, null
input_ctx, null cache_name all throw std::runtime_error with a
message that includes the cache name, instead of segfaulting in
an error-handler path.  Single point of maintenance for the
pattern; future cache builds that want pinned-host inputs use
the helper directly.

`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins
the contract via SFINAE on the symbol + CPU-fallback round-trip
through `ggml_backend_tensor_set` / `get` + null-arg throws +
empty-ctx error message includes the cache name.  CPU-only —
no GGUF fixture required.  CI test count goes from 24 / 24
(round 12) to 25 / 25 (round 13).

Perf impact: zero — same code path, same allocations, same data
movement, just one fewer level of nesting at each call site.

== 2. Q8_0 K/V no-win documented for RTX 5090 ==

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench
output advertises `q8_0_kv_attn=available`.  Round 13 measures
the trade-off on the test rig (RTX 5090, 1.79 TB/s memory
bandwidth, long prompt 206 chars / 18 s audio):

    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower

The F32->Q8_0 cast overhead exceeds the saved K/V upload
bandwidth on a high-bandwidth discrete GPU.  Operator guidance:
stick with the F16 default on RTX 5090 and similar high-
bandwidth discretes.  Q8_0 is shipped for adapters where the
K/V upload bottlenecks the synth (older PCIe 3.0, lower-end
discretes, iGPUs with slow BAR); cross-over point to be
measured per-adapter by operators using `--bench-per-step`
from round 7.

== Test plan ==

  ctest --test-dir tts-cpp/build         -L unit
  -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
  ctest --test-dir tts-cpp/build-vulkan  -L unit
  -> 25 / 25 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
writes a valid WAV.

Perf on RTX 5090 (10 runs + 3 warmup, long prompt):
  Round 12 baseline:  med= 31.11 ms  (588x realtime)
  Round 13:           med= 31.71 ms  (577x realtime)
  -> within run-to-run noise; no regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…sumption + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 tetherto#5.
   No code was dropped; round numbers stay contiguous so PR
   descriptions and CI logs match the round labels in this log
   without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.

Reviewer's fourth ask — "the round-11 fix is redone in PR
tetherto#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
…lidation + Q8_0 K/V finding

Round 13 is a strict-improvement-only follow-up to round 12:
no code path is removed, no optimisation is rolled back, and
the end-to-end perf on every backend stays at the round-12
level.  Two deliverables, both no-regret:

== 1. New helper `alloc_input_scratchpad_or_throw` ==

Round 12 tetherto#5 inlined the "try pinned-host first, fall back to
default backend buffer, throw on both-fail" idiom at 4 cache
sites (front block + 3 group caches):

    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }

Round 13 factors it into one helper.  Each caller becomes:

    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");

Same correctness contract — CPU / Metal / OpenCL fall back to
default backend buffer; Vulkan tries pinned-host first.
Defensive failure modes consolidated: null model.backend, null
input_ctx, null cache_name all throw std::runtime_error with a
message that includes the cache name, instead of segfaulting in
an error-handler path.  Single point of maintenance for the
pattern; future cache builds that want pinned-host inputs use
the helper directly.

`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins
the contract via SFINAE on the symbol + CPU-fallback round-trip
through `ggml_backend_tensor_set` / `get` + null-arg throws +
empty-ctx error message includes the cache name.  CPU-only —
no GGUF fixture required.  CI test count goes from 24 / 24
(round 12) to 25 / 25 (round 13).

Perf impact: zero — same code path, same allocations, same data
movement, just one fewer level of nesting at each call site.

== 2. Q8_0 K/V no-win documented for RTX 5090 ==

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench
output advertises `q8_0_kv_attn=available`.  Round 13 measures
the trade-off on the test rig (RTX 5090, 1.79 TB/s memory
bandwidth, long prompt 206 chars / 18 s audio):

    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower

The F32->Q8_0 cast overhead exceeds the saved K/V upload
bandwidth on a high-bandwidth discrete GPU.  Operator guidance:
stick with the F16 default on RTX 5090 and similar high-
bandwidth discretes.  Q8_0 is shipped for adapters where the
K/V upload bottlenecks the synth (older PCIe 3.0, lower-end
discretes, iGPUs with slow BAR); cross-over point to be
measured per-adapter by operators using `--bench-per-step`
from round 7.

== Test plan ==

  ctest --test-dir tts-cpp/build         -L unit
  -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
  ctest --test-dir tts-cpp/build-vulkan  -L unit
  -> 25 / 25 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
writes a valid WAV.

Perf on RTX 5090 (10 runs + 3 warmup, long prompt):
  Round 12 baseline:  med= 31.11 ms  (588x realtime)
  Round 13:           med= 31.71 ms  (577x realtime)
  -> within run-to-run noise; no regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
…sumption + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 tetherto#5.
   No code was dropped; round numbers stay contiguous so PR
   descriptions and CI logs match the round labels in this log
   without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.

Reviewer's fourth ask — "the round-11 fix is redone in PR
tetherto#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor pushed a commit that referenced this pull request May 28, 2026
gianni-cor pushed a commit that referenced this pull request May 28, 2026
The standalone setup-ggml.sh + patches/ tooling was dropped from
qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the
CMakeLists.txt still:
  * defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
  * unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-"
    on the bundled ggml target.
That combination quietly broke standalone bundled-ggml builds: the
filename-prefix patch was no longer applied, so libspeech-ggml-*.so
files existed on disk but ggml's runtime loader still searched for
libggml-*.so under GGML_BACKEND_DL=ON.  Vulkan / OpenCL / CUDA
backends silently failed to load on Android.

Fix per reviewer guidance: converge the speech stack on a single ggml
source-of-truth.  Standalone-bundled-ggml is no longer a supported
build mode out of this in-tree subtree; the canonical path is
`-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack
`ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships
the patches pre-applied.

Edits:

- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this
  tree.  Docstring spells out the rationale + points users at the
  standalone github.com/gianni-cor/chatterbox.cpp repo if they need
  a bundled-ggml dev build with patches/ present.

- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to
  configure when patches/ is absent: a FATAL_ERROR points at the
  right consumption path (vcpkg ggml-speech) and the standalone
  fallback.  Doesn't break in-tree-with-patches builds (parakeet-cpp
  in this same repo still ships patches/, so its bundled path is
  unaffected by this guard inside tts-cpp).

- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors
  out at find_package(ggml CONFIG REQUIRED) with our new message
  pointing at the ggml-speech port; `cmake -S tts-cpp -B build
  -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard
  with the no-patches message.

- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/
  that no longer exist; running it would have errored out anyway.
  The standalone repo keeps its own setup-ggml.sh; only the in-tree
  subtree drops it.

The standalone chatterbox.cpp repo (the one tts-cpp/ was copied
from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/
folder + scripts/setup-ggml.sh.  This commit is therefore an
integration-time delta against that source, not a change to the
standalone build flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor pushed a commit that referenced this pull request May 28, 2026
After commit fa0d490 (review #5+#6) made bundled-add_subdirectory(ggml)
hard-error in this in-tree subtree when patches/ is absent, the
TTS_CPP_GGML_LIB_PREFIX block became dead code:

  if (TTS_CPP_GGML_LIB_PREFIX AND NOT TTS_CPP_USE_SYSTEM_GGML)

NOT TTS_CPP_USE_SYSTEM_GGML can never reach this `if` here -
configure has already FATAL_ERROR'd at the patches/-absent guard.
The option, the helper function, the foreach loop, the
GGML_BACKEND_DL_PROJECT_PREFIX define, and the STATUS message were
all unreachable.  The next maintainer flipping
-DTTS_CPP_GGML_LIB_PREFIX=OFF to disable prefixing would have been
silently confused when nothing changed.

Edits:

tts-cpp/CMakeLists.txt:
  - The option() declaration at line 22 removed.  Replaced with a
    one-paragraph cross-reference to the standalone chatterbox.cpp
    repo for the locally-rename flow + the rationale (ggml-speech
    vcpkg port emits the libspeech-ggml-* filenames itself).
  - The 41-line block at lines 131-176 (tts_cpp_apply_ggml_prefix
    function + foreach + target_compile_definitions +
    STATUS message) replaced with a 9-line note telling future
    readers where the standalone counterpart lives.

tts-cpp/README.md:
  - Useful CMake options table row for TTS_CPP_GGML_LIB_PREFIX
    rewritten with a strikethrough + "n/a in this subtree" cell:
    explains the standalone option exists at chatterbox.cpp upstream,
    why it's unnecessary here (ggml-speech vcpkg port handles the
    rename at its own build time), and that the file-prefix surface
    is whatever vcpkg installs.

Doc-only behavior visible to consumers: the integrated subtree no
longer has a TTS_CPP_GGML_LIB_PREFIX option at all.  Build behaviour
unchanged - the vcpkg find_package path was already taking effect
and emitting libspeech-ggml-* as designed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants