tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage) #15

Merged
GustavoA1604 merged 2 commits into master from feat/metal-optimization-supertonic
May 13, 2026
@ogad-tether ogad-tether commented May 11, 2026

Summary

End-to-end Metal backend for the Supertonic TTS pipeline on Apple silicon. Starts as a correctness port (Phase B), evolves through Tier 1 graph consolidation, Tier 2 custom Metal kernels + load-time weight pretranspose, Phase A+B follow-up (multi-precision, on-GPU CFM loop), full Phase B2 (all ConvNeXt blocks on [C, T] activations), and finishes with B1 end-to-end f16 + a causal-pad mode in depthwise_1d_ct that lets the vocoder's 10-block chain run with a single entry/exit permute pair.

Cumulative result on Apple M2: Metal total 249.92 ms → 88.44 ms (-65%) with parity vs CPU q8_0 reference maintained throughout (corr ≥ 0.998 / L∞ ≤ 0.05).

Apple M2, q8_0 GGUF, 4 threads, 5-step CFM, 3.20 s of audio, 5 runs + 1 warmup, all four backends benched in sequence on the same machine state:

| Stage (ms median) | ggml Metal | ggml CPU | ONNX CPU | ONNX CoreML |
|---|---:|---:|---:|---:|
| preprocess | 0.01 | 0.01 | 0.05 | 0.05 |
| duration | 3.48 | 1.49 | 1.26 | 8.17 |
| text_encoder | 13.22 | 11.70 | 8.22 | 16.26 |
| vector_estimator (5 step) | 58.38 | 90.36 | 77.04 | 177.89 |
| vocoder | 13.62 | 39.38 | 49.55 | 50.29 |
| total | 88.44 | 142.92 | 136.32 | 255.90 |
| RTF (lower is faster) | 0.028 | 0.045 | 0.043 | 0.080 |
| real-time multiplier | 36.2× | 22.4× | 23.5× | 12.5× |

ggml Metal is fastest overall and on the two dominant stages, vector_estimator and vocoder. duration and text_encoder are still faster on the CPU backends, but on Metal they add up to under 17 ms. vector_estimator is −24% vs ONNX CPU and −67% vs ONNX CoreML; vocoder is now −73% vs ONNX CoreML and −65% vs ggml CPU. JSONs in artifacts/bench/supertonic-{cpp-metal-b2full-causal,cpp-b2full,onnx-cpu-b2full,onnx-coreml-b2full}.json.

Same ggml model file runs on all three ggml precisions with near-identical Metal perf (q8_0 GGUF is 4× smaller; M2 shapes are compute-bound so the bandwidth-saving precisions are a footprint win, not a perf win):

| Precision | Metal total | Metal vec_est | Metal vocoder | Metal RTM |
|---|---:|---:|---:|---:|
| f32 | 88.44 | 58.38 | 13.62 | 36.2× |
| f16 | 92.07 | 58.46 | 17.25 | 34.8× |
| q8_0 | 91.93 | 58.72 | 18.11 | 34.9× |

(f16/q8_0 numbers from the immediately prior commits — re-bench against the causal kernel would land them within ~1 ms of f32.)

What landed

Phase B (correctness port)

  • Backend resolution chain via model_prefers_cpu_kernels helper — gates the ggml_custom_4d CPU fast paths so Metal can take the stock-op graph fallback.
  • supertonic-bench gains --n-gpu-layers N so the same harness drives CPU and Metal runs.

Tier 1 (graph-shape reductions)

  • Per-step graph consolidation (49511b3a): one ggml_cgraph per CFM step instead of ~17 sub-graphs. Per-step node count 1886 → 1056.
  • repeat_like returns broadcast-compatible reshape (266e4466): drops 226 REPEAT ops/step.
  • Drop redundant ggml_cont in rope (be12a9f5): 8 fewer cpy dispatches per per-step graph.
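
The repeat_like change relies on element-wise ops broadcasting the smaller operand instead of reading a materialized copy. The sketch below shows the idea in plain C++ rather than ggml; the function name and the flat channel-contiguous layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Adds a per-channel vector to every time step of a flat buffer whose
// channels are contiguous (element (t, c) at index t * C + c).
// The bias is indexed modulo its length, so no repeated copy of it is
// ever built. Assumes bias is non-empty and x.size() is a multiple of C.
std::vector<float> add_broadcast(const std::vector<float>& x,
                                 const std::vector<float>& bias) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] + bias[i % bias.size()];
    }
    return y;
}
```

An explicit repeat would first expand the bias to the full size of `x` and then add; broadcasting reads the same values without that extra buffer or dispatch.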

Tier 2 (custom Metal kernels + load-time pretranspose)

Each new GGML_OP_SUPERTONIC_* op has a CPU forward (parity backstop) and a Metal kernel, gated individually by env vars.

  1. kernel_supertonic_depthwise_1d (aa4f65c3) — fuses edge-clamp pad + im2col + mul_mat + add for K ∈ {3, 5}.
  2. kernel_supertonic_layer_norm_channel (55adf87b) — fuses permute + cont + ggml_norm + mul + add + permute + cont.
  3. kernel_supertonic_pw2_residual (7a5c0393) — fuses add(bias) + mul(gamma) + add(residual).
  4. kernel_supertonic_bias_gelu (df20115d + 64efe99a) — fuses add(bias) + gelu_erf.
  5. kernel_supertonic_edge_pad_1d (a647ecfa) — fifth fused kernel.
  6. Load-time matmul weight pretranspose (e935ffb7, da9553e3) — materialize transposed copies of every :onnx::MatMul_* source weight on non-CPU backends.
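
As a scalar reference for what two of the element-wise fusions compute, here is a minimal sketch. It assumes the ops are applied in the order listed above and that gelu_erf means the exact erf-based GELU; it is not the Metal kernel source.

```cpp
#include <cmath>

// bias_gelu: exact (erf-based) GELU applied to x + bias in one pass.
float bias_gelu(float x, float bias) {
    const float v = x + bias;
    return 0.5f * v * (1.0f + std::erf(v * 0.70710678f));  // 1/sqrt(2)
}

// pw2_residual: add(bias), then mul(gamma), then add(residual) in one pass.
float pw2_residual(float x, float bias, float gamma, float residual) {
    return (x + bias) * gamma + residual;
}
```

Each fused op replaces two or three separate dispatches, each of which would otherwise read and write a full intermediate tensor.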

Phase A+B follow-up (multi-precision + single-graph CFM)

  • Phase 0 — multi-precision validation harness (bfb44092): --precision {f32,f16,q8_0} on CLI / bench, plumbed through EngineOptions.
  • A1+A2 — single command buffer per synth + on-GPU latent through CFM loop (8f0be955): all 5 CFM steps unroll into ONE ggml_cgraph. Latent flows step-to-step as a graph-internal node.
  • A3 step 1 — q8_0 storage on Metal (1b7496f6): --precision q8_0 loads instead of bailing.
  • A3 step 2 — kernel_mul_mm_q8_0_f32 dispatches (f95a09d9): the quantized matmul kernel finally fires end-to-end.

Phase B2 partial (Q/K/V projections → [A, T])

  • 70bd2ca6 + follow-ups: swap ggml_mul_mat argument order at Q/K/V sites so the weight is src0. Output lands directly in [A, T] — removes one cont(transpose) per projection × 4 groups × 5 steps.

Phase B2 full (ConvNeXt on [C, T])

  1. All five fused kernels parameterised on per-axis element strides (52430516, e2807f41) — same compiled Metal kernel handles both [T, C] and [C, T]. Layout flag in op_params. Overlay port-version 12 → 13.
  2. Prologue + group_prep × 3 + tail ConvNeXt chains on [C, T] (da3400d3, aa167cfa): vector_convnext_ggml_ct + pointwise_matmul_ct (K=1 Conv1d becomes a direct ggml_mul_mat, no im2col). All 16 ConvNeXt blocks in vector_estimator's per-step graph run on [C, T], with a single entry/exit permute pair wrapped around each chain.
  3. Vocoder ConvNeXt chain on [C, T] (61e9b419) — same pattern for the 10-block vocoder chain, with two intra-block permutes around the (then-still-symmetric-only) depthwise. Vocoder dropped 52 → 17 ms.
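
The stride parameterisation in item 1 can be illustrated with a CPU-style channel layer norm. This is a sketch under assumptions: the stride names sxt, sxc, syt, syc come from this PR, while the signature, the eps default and the mapping of the two layouts onto ggml's [T, C] and [C, T] notation are illustrative.

```cpp
#include <cmath>
#include <cstddef>

// Normalizes each time step over its C channels, then applies gamma/beta.
// Input element (t, c) lives at x[t * sxt + c * sxc], output at
// y[t * syt + c * syc], so one body serves both memory layouts:
//   channel-contiguous: sxt = C, sxc = 1
//   time-contiguous   : sxt = 1, sxc = T
void layer_norm_channel(const float* x, float* y,
                        const float* gamma, const float* beta,
                        std::size_t T, std::size_t C,
                        std::size_t sxt, std::size_t sxc,
                        std::size_t syt, std::size_t syc,
                        float eps = 1e-6f) {
    for (std::size_t t = 0; t < T; ++t) {
        double mean = 0.0;
        for (std::size_t c = 0; c < C; ++c) mean += x[t * sxt + c * sxc];
        mean /= static_cast<double>(C);

        double var = 0.0;
        for (std::size_t c = 0; c < C; ++c) {
            const double d = x[t * sxt + c * sxc] - mean;
            var += d * d;
        }
        var /= static_cast<double>(C);

        const double inv = 1.0 / std::sqrt(var + eps);
        for (std::size_t c = 0; c < C; ++c) {
            const double n = (x[t * sxt + c * sxc] - mean) * inv;
            y[t * syt + c * syc] = static_cast<float>(n * gamma[c] + beta[c]);
        }
    }
}
```

Because only the strides change between layouts, the same compiled body can be selected by a flag rather than shipping a second kernel.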

Phase B1 — end-to-end f16 (66ddafab)

Asymmetric load (same pattern as q8_0): only :onnx::MatMul_* weights stay f16 on Metal (dispatch kernel_mul_mm_f16_f32); other GGUF-f16 tensors expand to f32 so they don't trip ggml_metal_op_bin's f32-only assertion downstream. Pretranspose pass extended to cover f16 alongside f32/q8_0.
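
The asymmetric-load rule can be summarised as a small decision function. This is a hypothetical helper written for illustration: the enum, the function name and the substring match on the tensor name are assumptions, not the loader's actual code.

```cpp
#include <string>

enum class TensorType { F32, F16 };

// Hypothetical helper: picks the resident type for a tensor that is stored
// as f16 in the GGUF. Only matmul weights keep f16, and only on Metal;
// everything else is expanded to f32 so element-wise ops see f32 inputs.
TensorType resident_type_for_f16(const std::string& name, bool on_metal) {
    const bool is_matmul_weight =
        name.find("onnx::MatMul_") != std::string::npos;
    return (on_metal && is_matmul_weight) ? TensorType::F16 : TensorType::F32;
}
```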

Causal-pad mode in depthwise_1d_ct (312ea1ce)

Extends the fused depthwise kernel with a causal flag (last tap at t, earlier taps strictly left; right-clamp collapses to a no-op) and K=7 support. New _causal_ct ctor. Vocoder block now runs depthwise + layer_norm + pw1 + bias_gelu + pw2 + scalar-gamma + residual end-to-end on [C, T] — no intra-block permutes. Single entry permute + single exit permute span the 10-block chain. Overlay port-version 13 → 15. Vocoder dropped 17.11 → 13.62 ms (−20%).
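
A single-channel CPU reference makes the difference between the symmetric and causal modes concrete. This is a sketch, assuming edge-clamp padding on both sides and the tap offsets described in this PR; the function name and signature are illustrative, and `std::clamp` requires C++17.

```cpp
#include <algorithm>
#include <vector>

// Depthwise 1-D convolution for one channel with edge-clamp padding.
// Tap k reads x[t + k_offset + k], clamped into [0, T - 1].
//   symmetric: k_offset = -(K / 2)  (taps centred on t)
//   causal   : k_offset = -(K - 1)  (last tap lands on t, so the index
//                                    never exceeds T - 1)
std::vector<float> depthwise_1d(const std::vector<float>& x,
                                const std::vector<float>& w,
                                float bias, bool causal) {
    const int T = static_cast<int>(x.size());
    const int K = static_cast<int>(w.size());
    const int k_offset = causal ? -(K - 1) : -(K / 2);
    std::vector<float> y(x.size());
    for (int t = 0; t < T; ++t) {
        float acc = bias;
        for (int k = 0; k < K; ++k) {
            const int src = std::clamp(t + k_offset + k, 0, T - 1);
            acc += w[k] * x[src];
        }
        y[t] = acc;
    }
    return y;
}
```

In causal mode the largest index read is `t` itself, which is why the right-clamp branch can be skipped entirely.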

Quirk found along the way

The legacy pw2_residual_ggml wrapper had a gamma->ne[0] == x->ne[1] gate that was silently rejecting the fused path for ConvNeXt the whole time — GGUF ships .gamma as [1, C, 1, 1] not [C]. vector_convnext_ggml_ct flattens the per-channel params with a ggml_reshape_1d, so the _ct path is the first time the fused pw2_residual op actually runs on the ConvNeXt residual.
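
The shape mismatch behind this quirk is easy to reproduce with plain extent arrays. The sketch below mirrors the gate as described above; the `Shape` alias, the helper names and the example sizes are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>

using Shape = std::array<std::int64_t, 4>;  // ggml-style ne[0..3]

// The legacy gate as described: take the fused path only when gamma's
// first extent equals x's second extent.
bool legacy_gate(const Shape& gamma_ne, const Shape& x_ne) {
    return gamma_ne[0] == x_ne[1];
}

// Flattening to 1-D collapses every element into ne[0].
Shape reshape_1d(const Shape& ne) {
    return {ne[0] * ne[1] * ne[2] * ne[3], 1, 1, 1};
}
```

With a gamma of shape [1, C, 1, 1], `gamma_ne[0]` is 1, so the gate fails for any C greater than 1. After flattening, `ne[0]` is C and the gate passes.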

CPU q8_0 perf is unchanged — every fused-kernel, pretranspose, asymmetric-load, and _ct path is gated on !use_cpu_fastpath or roundtrips through the legacy [T, C] block on CPU so cblas/AMX still wins there.

What's deferred

| Phase | Status | Why deferred | Realistic ROI |
|---|---|---|---|
| B3: argument buffer reuse | deferred to upstream | ggml-metal backend internals (MTLIndirectCommandBuffer). Better as an upstream contribution. | -1 to -3 ms |
| text_encoder / duration on Metal | left as-is | 13.2 / 3.5 ms already small; further work is dominated by command-buffer encode overhead. Probably out of practical reach. | 0–2 ms |

Test plan

  • Phase B smoke: --n-gpu-layers 1 writes a valid WAV; sample count identical to CPU.
  • CPU regression: --n-gpu-layers 0 bench unchanged from pre-port baseline.
  • Metal bench (above): 88.44 ms median (5 runs + 1 warmup) on M2.
  • Multi-backend bench (above): ggml-Metal, ggml-CPU, ONNX-CPU, ONNX-CoreML all benched on same machine state.
  • Multi-precision: f32, f16, q8_0 all load and synthesize end-to-end on Metal.
  • Parity vs CPU reference: corr ≥ 0.998 / L∞ ≤ 0.05 throughout the branch.
  • Env-var A/B: every fused kernel + pretranspose + loop-graph + CT-convnext + CT-vocoder + causal path has an override.
  • Multilingual smoke: M1/F1 + EN/FR/PT samples generated.
  • Reviewer to run on M1 / M3 / M4 to confirm the wins generalize.
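
The parity gate in the plan above can be expressed as two metrics and a threshold check. This is a sketch: it assumes "corr" means the Pearson correlation and that samples are floats on the same scale as the L∞ bound; the function names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation over the common prefix of two signals.
// Returns NaN when either signal has zero variance or is empty.
double correlation(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    double mean_a = 0.0, mean_b = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mean_a += a[i]; mean_b += b[i]; }
    mean_a /= static_cast<double>(n);
    mean_b /= static_cast<double>(n);
    double cov = 0.0, var_a = 0.0, var_b = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double da = a[i] - mean_a;
        const double db = b[i] - mean_b;
        cov += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    return cov / std::sqrt(var_a * var_b);
}

// L-infinity distance: the largest absolute per-sample difference.
double linf(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(static_cast<double>(a[i]) - b[i]));
    }
    return worst;
}

// Passes only when both thresholds hold; a NaN correlation fails.
bool parity_ok(const std::vector<float>& test, const std::vector<float>& ref,
               double min_corr = 0.998, double max_linf = 0.05) {
    return correlation(test, ref) >= min_corr && linf(test, ref) <= max_linf;
}
```

Checking both metrics matters: correlation catches shape drift across the whole waveform, while L∞ catches a single bad sample that correlation would average away.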

Notable mechanical details

  • ggml-supertonic-ops.patch lives in tts-cpp/cmake/vcpkg-overlay-ports/ggml/, chains on top of the QVAC ggml port — no upstream ggml changes required. Overlay port-version now at 15.
  • Every fused kernel got a stride-parameterised body (sxt, sxc, syt, syc) so the same compiled Metal kernel handles both [T, C] and [C, T] activations — no separate _ct kernel binaries, just a layout flag in op_params.
  • The depthwise kernel's causal flag uses k_offset = -(K-1) instead of -K/2 and skips the right-clamp branch. Symmetric (K∈{3,5}) and causal (K∈{3,5,7}) share the same Metal source.
  • pointwise_matmul_ct (vector_estimator) and pointwise_matmul_ct_voc (vocoder) are the K=1 Conv1d shortcut for [C, T] activations: weight [1, IC, OC] reshapes to [IC, OC], mul_mat directly with [IC, T], output [OC, T]. No im2col, no transpose.
  • The harmless GGML_ASSERT([rsets->data count] == 0) at process exit is the same atexit-ordering quirk chatterbox handles via t3_stack_registry — fires after the WAV is written.
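
The K=1 shortcut described in the pointwise_matmul_ct bullet above amounts to a plain matrix product over the channel axis. Here is a minimal CPU sketch; the flat row-major layouts (weight as OC rows of IC values, activations as IC rows of T values) are an assumption for illustration and are not ggml's memory layout.

```cpp
#include <cstddef>
#include <vector>

// K=1 Conv1d as a matrix product: out[oc][t] = sum over ic of
// w[oc][ic] * x[ic][t]. No im2col buffer and no transpose is needed
// because each time step is an independent dot product over channels.
std::vector<float> pointwise_conv_ct(const std::vector<float>& w,
                                     const std::vector<float>& x,
                                     std::size_t OC, std::size_t IC,
                                     std::size_t T) {
    std::vector<float> out(OC * T, 0.0f);
    for (std::size_t oc = 0; oc < OC; ++oc) {
        for (std::size_t ic = 0; ic < IC; ++ic) {
            const float wv = w[oc * IC + ic];
            for (std::size_t t = 0; t < T; ++t) {
                out[oc * T + t] += wv * x[ic * T + t];
            }
        }
    }
    return out;
}
```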

Out of scope

  • CUDA / Vulkan / OpenCL paths — Metal is the priority for Apple-silicon hosts.

🤖 Generated with Claude Code

@ogad-tether ogad-tether requested review from a team as code owners May 11, 2026 11:52
ogad-tether added a commit that referenced this pull request May 11, 2026
Documents the Tier 2 op-level reduction work landed in PR #15 so
far:

- SUPERTONIC_DUMP_OP_HISTOGRAM env-var-gated dump of per-graph
  op-type histograms.  Confirmed RESHAPE/VIEW/PERMUTE/TRANSPOSE
  are no-ops on Metal (ggml-metal-ops.cpp:186-195) — 808 of 1660
  nodes in the consolidated per-step graph are metadata-only, so
  only ~852 ops actually dispatch.

- repeat_like across all four supertonic source files (vector,
  vocoder, text_encoder, duration) returns the broadcast-
  compatible reshape directly instead of inserting an explicit
  ggml_repeat node, since ggml_add / ggml_mul broadcast natively.
  -226 REPEAT ops per step graph.

- apply_supertonic_rope_ggml drops the defensive ggml_cont — the
  view onto a contiguous [H*D, q_len] tensor is itself
  contiguous, so ggml_rope_ext accepts it directly.

Cumulative Tier 1 + Tier 2 wins land Apple M2 Metal at 199.90 ms
total (RTF 0.062, 16× real-time) — a -20 % drop from the Phase B
baseline of 249.92 ms.  Parity preserved (correlation 0.9999 vs
CPU reference, max abs diff 249 LSB on peak amplitude 6639).

Remaining backlog covers the next op-level wins (K=1 conv1d_f32
→ direct mul_mat, cache transposed weights, custom pad kernel)
and the original Tier 2 custom Metal kernels
(kernel_convnext_block, kernel_attention_block,
kernel_depthwise_conv_1d, f16 activations) that need the QVAC
ggml-speech port patch flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether ogad-tether changed the title tts-cpp: Supertonic ggml Metal backend correctness port tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 optimization (174 ms, 18.4× real-time) May 11, 2026
@ogad-tether ogad-tether changed the title tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 optimization (174 ms, 18.4× real-time) tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 + Phase A+B follow-up (166 ms, 19.3× real-time) May 11, 2026
@ogad-tether ogad-tether changed the title tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 + Phase A+B follow-up (166 ms, 19.3× real-time) tts-cpp: Supertonic ggml Metal — correctness + Tier 2 + Phase A+B (160 ms, 19.9× real-time, vec_est beats CPU) May 11, 2026
@ogad-tether ogad-tether changed the title tts-cpp: Supertonic ggml Metal — correctness + Tier 2 + Phase A+B (160 ms, 19.9× real-time, vec_est beats CPU) tts-cpp: Supertonic ggml Metal — full B2 + custom kernels (128 ms, 25× real-time, beats ONNX-CPU & CoreML) May 11, 2026
@ogad-tether ogad-tether changed the title tts-cpp: Supertonic ggml Metal — full B2 + custom kernels (128 ms, 25× real-time, beats ONNX-CPU & CoreML) tts-cpp: Supertonic ggml Metal — full B2 vector+vocoder (91 ms, 35× real-time, fastest on every stage) May 11, 2026
@ogad-tether ogad-tether changed the title tts-cpp: Supertonic ggml Metal — full B2 vector+vocoder (91 ms, 35× real-time, fastest on every stage) tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage) May 12, 2026
Squash-rebase of feat/metal-optimization-supertonic onto master post-#16
(OpenCL Supertonic merge).  Combines:

  - Five custom fused Metal kernels (supertonic_depthwise_1d /
    layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with
    `_ct` and `_causal_ct` variants for [C, T] activation layout.
    Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our
    overlay-port redirects vcpkg to that branch.
  - Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks)
    and vocoder (10 blocks) runs end-to-end on [C, T] activations.
    K=1 pointwise becomes direct ggml_mul_mat (no im2col).  Single
    entry/exit permute spans each chain.
  - Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*`
    stays f16 on Metal, expands to f32 elsewhere).
  - Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent
    stays in GPU memory step-to-step.
  - Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
  - Tier 2 load-time matmul weight pretranspose.
  - Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.

Coexists with master's OpenCL Supertonic work:
  - `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d
    fast paths via thread-local; replaces our `use_cpu_fastpath`
    parameter plumbing.
  - F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
  - F7 vocoder convnext-block fusion (master) runs on the CPU path;
    Metal path runs our `_ct` chain.

Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:

  Metal       med  98.4 ms  vec_est  65.6  vocoder 13.1  RTM 32.6x
  CPU       (unchanged from master)
  ONNX CPU  (unchanged from master)

Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase),
~10 ms slip absorbed where master's front_cache refactor replaced
parts of our trace_proj step-builder per the agent's resolution rule
"prefer master's cache pattern when refactored."  Causal kernel intact;
vocoder at 13.1 ms vs master's CPU 39.4 ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ay-port

Replaces the local vcpkg overlay-port machinery with a simpler bundled-
ggml dev flow that clones tetherto/qvac-ext-ggml@speech directly into
`tts-cpp/ggml/` and lets CMake's `add_subdirectory(ggml)` consume it.

What's in / what's out:

  + tts-cpp/scripts/setup-ggml.sh — clones qvac-ext-ggml@speech at the
    pinned commit (currently 60a172e48f, the merge of #8) into
    tts-cpp/ggml/.  Idempotent; re-run to bump the pin via the script's
    GGML_REF variable.

  + tts-cpp/CMakeLists.txt — bundled path (`TTS_CPP_USE_SYSTEM_GGML=OFF`)
    no longer requires a `patches/` directory.  Speech branch is
    pre-patched at the commit level, so `add_subdirectory(ggml)`
    consumes the source directly.

  - tts-cpp/cmake/vcpkg-overlay-ports/ggml/  (all 4 files)
  - tts-cpp/vcpkg-configuration.json
  - tts-cpp/vcpkg.json

Net diff: −250 lines of bridge plumbing, +50 lines of clone-and-build
script.  The vcpkg overlay was always a stopgap until the registry
pin advanced past 60a172e (see qvac-registry-vcpkg#144); switching
to the bundled flow side-steps that wait entirely for dev builds.

Performance bonus: bundled `add_subdirectory(ggml)` defaults to
GGML_NATIVE=ON (native ARM dotprod / SVE / wider SIMD on M-series),
where the vcpkg port had GGML_NATIVE=OFF for portable redistributables.
On Apple M2, the dev flow benches ~9 ms faster total median and
narrows the run-to-run range from ~29 ms wide to ~4 ms wide, back
within 3 ms of the pre-rebase 88 ms peak:

  vcpkg-overlay (rebased):  total med 100.48  range 96-125 ms  31.9x
  bundled-ggml (this):      total med  91.15  range 88-92  ms  35.2x
                                                              ^ +3.3x

Downstream production builds still go through vcpkg via
`TTS_CPP_USE_SYSTEM_GGML=ON` and find_package(ggml) — those pull from
the `ggml` port in qvac-registry-vcpkg (which qvac-registry-vcpkg#144
bumps to the same speech commit).

README §1 updated with the new dev flow as the canonical recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether


Cross-ref: registry bump needed for production builds

This PR's dev flow works standalone — bash tts-cpp/scripts/setup-ggml.sh clones qvac-ext-ggml@speech directly into tts-cpp/ggml/ and add_subdirectory(ggml) consumes it. No vcpkg, no registry dependency.

But the production flow (downstream apps using find_package(ggml) via the system vcpkg) reads its ggml from the registry's ggml port — currently pinned at port-version 7 / commit 05afdc59 (pre-supertonic-ops). After this PR lands on master, anyone consuming tts-cpp through vcpkg will hit undeclared identifier 'ggml_supertonic_*' until the registry catches up.

Companion PR

qvac-registry-vcpkg#144 bumps the registry's ggml port to port-version 8 / commit 60a172e (post-supertonic-ops). Once merged, production builds work end-to-end.

Suggested merge order

  1. qvac-registry-vcpkg#144 → main (registry has the new ggml)
  2. This PR (#15) → master (production consumers can now build)

The reverse order works too but opens a window where vcpkg-based tts-cpp consumers see compile errors until qvac-registry-vcpkg#144 lands.

🤖 Generated with Claude Code

@GustavoA1604 GustavoA1604 merged commit 6c60e4c into master May 13, 2026
66 of 74 checks passed
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 14, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).
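
The selection rules in this section can be sketched as follows. This is not the code from the commit: the signature is copied from the message above, while the exception type, the handling of an empty device list (returns -1), the first-wins tie-break and the order of the checks are assumptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

int resolve_vulkan_device_index(int requested,
                                const std::vector<std::size_t>& free_vram_per_device,
                                const std::vector<bool>& is_uma_per_device = {}) {
    // An explicit index is passed through untouched, regardless of UMA flags.
    if (requested >= 0) return requested;

    // A non-empty flag list must describe every device.
    if (!is_uma_per_device.empty() &&
        is_uma_per_device.size() != free_vram_per_device.size()) {
        throw std::invalid_argument("UMA flag list length mismatch");
    }

    // Restrict the search to discrete adapters only if at least one exists.
    bool any_discrete = false;
    for (bool uma : is_uma_per_device) any_discrete = any_discrete || !uma;

    int best = -1;
    for (std::size_t i = 0; i < free_vram_per_device.size(); ++i) {
        if (any_discrete && is_uma_per_device[i]) continue;
        if (best < 0 ||
            free_vram_per_device[i] >
                free_vram_per_device[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With free VRAM of {120, 32} and UMA flags {true, false}, the sketch picks index 1, the discrete adapter. With no flags it falls back to plain argmax and picks index 0.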

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).
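
The selection rules in this section can be restated as a small standalone sketch. It is not the code from the commit: the name `resolve_vulkan_device_index_sketch` is hypothetical, and the `-1` return for an empty device list, the first-maximum tie-break, the `std::invalid_argument` exception type, and the placement of the length check after the explicit-index passthrough are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative restatement of the auto-pick rules; not the committed code.
// Assumptions: returns -1 when no device is listed, ties go to the lowest
// index, and a flag-list length mismatch raises std::invalid_argument.
inline int resolve_vulkan_device_index_sketch(
        int requested,
        const std::vector<std::size_t> & free_vram_per_device,
        const std::vector<bool>        & is_uma_per_device = {}) {
    // Explicit index: passed through untouched, UMA flags are not consulted.
    if (requested >= 0) {
        return requested;
    }
    const bool have_flags = !is_uma_per_device.empty();
    if (have_flags && is_uma_per_device.size() != free_vram_per_device.size()) {
        throw std::invalid_argument("is_uma_per_device length mismatch");
    }
    // Restrict the argmax to discrete devices only if at least one exists.
    bool any_discrete = false;
    if (have_flags) {
        for (std::size_t i = 0; i < is_uma_per_device.size(); ++i) {
            if (!is_uma_per_device[i]) { any_discrete = true; }
        }
    }
    int best = -1;
    for (std::size_t i = 0; i < free_vram_per_device.size(); ++i) {
        if (any_discrete && is_uma_per_device[i]) {
            continue;  // skip UMA adapters when a discrete one is available
        }
        if (best < 0 || free_vram_per_device[i] > free_vram_per_device[best]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With free VRAM of `{120, 32}` (in GB) and flags `{true, false}`, this sketch picks index 1, the discrete adapter; with no flags it picks index 0, matching the round-3 behaviour described above.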

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).
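
For readers unfamiliar with pinning a struct field contract via SFINAE, here is a minimal sketch of the detection idiom. It assumes C++17 (`std::void_t`); `merged_cache_standin`, `unrelated_struct`, the `SKETCH_DEFINE_HAS_FIELD` macro and the `has_merged_cache_fields` trait are hypothetical names, and the field types are placeholders. Only the field names (x_in, style_in, out, idx, L) come from the commit message.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the real cache struct. Field names follow the
// commit message; the field types here are placeholders.
struct merged_cache_standin {
    void * x_in     = nullptr;
    void * style_in = nullptr;
    void * out      = nullptr;
    int    idx      = 0;
    int    L        = 0;
};

// A type that lacks the contract, used to show the negative case.
struct unrelated_struct {
    int y = 0;
};

// Generates has_field_<name><T>: true only when T has a member <name>.
// The partial specialisation is discarded by SFINAE when T{}.<name> is
// ill-formed, leaving the false_type primary template.
#define SKETCH_DEFINE_HAS_FIELD(field)                                            \
    template <typename T, typename = void>                                        \
    struct has_field_##field : std::false_type {};                                \
    template <typename T>                                                         \
    struct has_field_##field<T, std::void_t<decltype(std::declval<T &>().field)>> \
        : std::true_type {};

SKETCH_DEFINE_HAS_FIELD(x_in)
SKETCH_DEFINE_HAS_FIELD(style_in)
SKETCH_DEFINE_HAS_FIELD(out)
SKETCH_DEFINE_HAS_FIELD(idx)
SKETCH_DEFINE_HAS_FIELD(L)

// True only when every field named in the contract is present.
template <typename T>
inline constexpr bool has_merged_cache_fields =
    has_field_x_in<T>::value && has_field_style_in<T>::value &&
    has_field_out<T>::value && has_field_idx<T>::value &&
    has_field_L<T>::value;

static_assert(has_merged_cache_fields<merged_cache_standin>,
              "stand-in struct satisfies the field contract");
```

A test built this way evaluates the trait at compile time, so a renamed or removed field shows up as a failed check without needing a GPU or a model file.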

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

@ogad-tether ogad-tether deleted the feat/metal-optimization-supertonic branch May 18, 2026 11:33
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

pratiknarola-t added a commit that referenced this pull request May 28, 2026
…in init_gpu_backend

On Adreno + PR #14/#15 the policy correctly picks OpenCL and Chatterbox
runs to completion. On Vulkan-on-Mali (Google Pixel 9 Pro XL / Tensor
G4) ggml_backend_dev_init throws an unhandled C++ exception during
pipeline init, which bubbles up to libc++abi::terminate(); the resulting
SIGABRT crashes the host process before the caller can react.

Wrap the call in try-catch inside try_init: on any exception, log
verbosely and 'continue' to the next candidate; if every candidate in
a bucket throws or returns null, the lambda returns nullptr and the
policy proceeds to the next bucket. After all buckets fail
init_gpu_backend returns nullptr and the caller falls back to CPU --
which is exactly what 'no usable GPU available' should mean.

Defensive layer that handles any future bad-GPU vendor (not
Mali-specific): SIGABRT during GPU init is never an acceptable failure
mode for a TTS engine that has a working CPU path. Validated against
Pixel 9 Pro XL on AWS Device Farm via the QVAC-19254 [DO NOT MERGE]
test PR (tetherto/qvac#2320).

QVAC-19254