Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic #18

Merged
GustavoA1604 merged 16 commits into tetherto:supertonic_optimizations from Zbig9000:supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic
May 19, 2026

Zbig9000 commented May 14, 2026

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds twelve rounds of Vulkan-specific deltas (rounds 1–13, round 5 skipped) — each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability + correctness contract for future regressions.

Scope vs. PR #16: this PR sits on top of the OpenCL branch (QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All 13 commits below are Vulkan-specific deltas; the OpenCL audit work is not restated here. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.

Net new surface (against the OpenCL branch):

| Category | Delta |
| --- | --- |
| New backend-capability probes | 6 (native_leaky_relu, f16_kv_flash_attn, f16_mul_mat, q8_0_kv_flash_attn, bf16_kv_flash_attn, pinned_host_buffer) |
| New thread-local dispatch flags | 2 (use_native_leaky_relu, kv_attn_type), joining the round-1 use_f16_attn |
| New EngineOptions knobs | 6 (vulkan_device, prewarm_text, f16_weights_deny_list, kv_attn_type, vulkan_env_overrides, bench_per_step) |
| New CLI flags (× 3 binaries) | --vulkan-device, --prewarm, --f16-weights-deny, --kv-attn-type, --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env, --bench-per-step, --no-bench-sync |
| New per-step / per-cache helpers | upload_skip_tracker, voice_host_cache, try_alloc_inputs_in_pinned_host_buffer, alloc_input_scratchpad_or_throw, apply_vulkan_env_overrides, run_speech_prompted_merged_cache, plus 5 GPU-bridge dispatch sites |
| New unit tests (ctest -L unit) | 12 (test-supertonic-vulkan-dispatch, -portable-ops updated, -capability-cache, -warm-up-api, -vulkan-device-select, -f16-deny-list-api, -kv-attn-type, -kv-attn-type-api, -vulkan-env-overrides, -voice-host-cache, -upload-skip-tracker, -text-encoder-gpu-bridge, -pinned-host-buffer, -input-scratchpad; plus -f16-attn-parity extended for BF16 and -graph-to-graph-blit extended for front-block + style shapes; plus -rope-packed-qk rewritten for the production [L, HD] layout) |
| Whole ctest -L unit | 25 / 25 PASS, 0 regressions, 0 flakes (CPU build + Vulkan build) |

Combined perf snapshot — RTX 5090, long prompt (173 chars / ~15 s audio):

| Stage | Round 11 baseline | Round 13 (final) | Speedup |
| --- | --- | --- | --- |
| Whole synth | 76.11 ms / 5 steps (123× realtime) | 27.99–31.71 ms (537–588× realtime) | 2.4–2.7× |
| Vector-estimator step | 12.7 ms | 3.28 ms | 3.9× |
| Prewarm cold-start | 330 ms | 21 ms | 15× |
| Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU) | picks RADV → 178 ms (7× rt) | picks RTX 5090 → 28 ms (537× rt) | 6.4× |

Investigation methodology (TDD throughout)

Every round followed the same workflow:

  1. Audit: identify a Vulkan-specific gap (capability probe, multi-GPU support, drift recovery, sync-point hotspot, etc.).
  2. Test first: write the CPU-only unit gate that pins the new contract (resolver behaviour matrix, API surface, parity bound, layout invariant). Commit + observe failure on the missing symbol (compile error or assertion).
  3. Implement: minimal-surgery production change. Pure-logic helpers split out so the policy is testable on CPU without a Vulkan device.
  4. Re-run: every new test + every existing test must pass before commit.
  5. End-to-end smoke on real hardware once round-11 unblocked the production path.
  6. Update PROGRESS_SUPERTONIC.md + commit.

The CPU-only test strategy is deliberate: a fresh checkout's ctest exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer. Real-Vulkan numerics are validated through the F16 / BF16 K/V parity harness running against the CPU flash_attn_ext reference, which lands the same ggml_cpy(K → typed) + ggml_flash_attn_ext graph the live Vulkan dispatch builds.
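
Several rounds below pin their API surface with a SFINAE compile-time gate. A minimal sketch of that idiom, assuming C++17 and using stand-in types: `has_warm_up`, `engine_with_api` and `engine_without_api` are hypothetical names for illustration, not the project's test code.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Primary template: assume the member is absent.
template <typename T, typename = void>
struct has_warm_up : std::false_type {};

// Chosen only when `t.warm_up(const std::string &)` is a valid expression.
template <typename T>
struct has_warm_up<
    T,
    std::void_t<decltype(std::declval<T &>().warm_up(
        std::declval<const std::string &>()))>> : std::true_type {};

// Hypothetical stand-ins for an engine before and after the API lands.
struct engine_with_api {
    void warm_up(const std::string &) {}
};
struct engine_without_api {};

// The gate itself: the first assertion fails to compile until the member
// exists, which is the RED phase of the test-first workflow.
static_assert(has_warm_up<engine_with_api>::value,
              "warm_up(const std::string &) must be part of the API");
static_assert(!has_warm_up<engine_without_api>::value,
              "detector must reject types without warm_up");
```

Because the check is resolved at compile time, a gate like this needs neither a GGUF fixture nor a GPU, which fits the CPU-only CI goal described above.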

TDD caught real bugs that would otherwise have shipped:

  • The env-var-passthrough validator (round 7) used a std::string() empty-as-success sentinel, which collided with the empty-string-as-key edge case; the test pinned the contract and forced a bool / out-param API fix BEFORE any production wiring went in.
  • The packed-QK RoPE helper (audit follow-up #5 from PR #16) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] tensor. In fact the matmul produces ne=[L, HD], the bit-exact transpose of the helper's input contract. The original CPU unit test hand-built Q under the wrong shape, so the failure mode was invisible to CI; round 11 rewrote the test under the production shape (RED), then fixed the helper (GREEN), unblocking end-to-end synth on every backend.
  • Round-10's pointer-compare upload-skip would have silently leaked prior synth's text-encoder embedding into the next synth on heap allocators that re-issue the same address (jemalloc / tcmalloc / glibc). An explicit cross-synth pointer-reuse hazard test forced the tracker.reset() API at every synth boundary.

Commit-by-commit walkthrough

787d966b — Round 1: Vulkan bring-up (initial commit)

Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used model.use_f16_attn = !backend_is_cpu because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the HSK % 8 == 0 supports_op gate has to be respected, so the auto-policy needs a probe.

  • Two new supertonic_model flags populated at GGUF load: backend_is_vk (informational; appended to the backend-description string) and use_native_leaky_relu (resolved via ggml_backend_supports_op(LEAKY_RELU) against a synthetic node — the dispatch helper short-circuits to the fused builtin on backends that ship GGML_OP_LEAKY_RELU natively, falls back to the conservative RELU + SCALE + ADD decomposition otherwise; no hard-coded backend table).
  • New backend-capability probe supertonic_backend_supports_f16_kv_flash_attn gates the use_f16_attn auto-policy. Builds a synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node and asks the backend whether it would accept it — load-time, zero hot-path cost, graceful auto-disable on a false answer.
  • EngineOptions::vulkan_device int + --vulkan-device N CLI flag plumbed through all three binaries. Replaces the historical hard-coded ggml_backend_vk_init(0); range-checked against ggml_backend_vk_get_device_count() at load (out-of-range = hard error, no silent CPU fallback that would hide CLI typos / wrong-machine config).
  • Verbose mode + bench output append ggml_backend_vk_get_device_description so multi-GPU / multi-ICD machines (NVIDIA + llvmpipe, AMD RADV + NVIDIA) unambiguously identify which adapter ran.
  • New CPU-only TDD harness test-supertonic-vulkan-dispatch covering the new flags through supertonic_op_dispatch_scope + a smoke test for the F16-K/V probe. Pre-existing test-supertonic-portable-ops updated to explicitly request the decomposed path on the GPU fixture model so its existing GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU.

d5518ee8 — Pre-existing missing-include fix

tts-cpp/src/chatterbox_tts.cpp used std::atomic<int> without #include <atomic>; pre-existed before this branch but blocked the Supertonic build under the cleaner cmake -S tts-cpp -B build-tts invocation used for round 2+ verification. One-line fix in a single TU. Kept as a separate commit so it's trivially revertable / cherry-pickable to other branches.

6ab085f6 — Round 2: capability-cache + 3 probes + prewarm

The round-1 probes were already cheap, but engine.cpp + bench.cpp + load_supertonic_gguf each ran them independently: three call sites × N capabilities = up to 9 redundant ggml_backend_supports_op calls per backend per process.

  • Process-wide cached_backend_capabilities map keyed by ggml_backend_t, guarded by a single std::mutex. Hot path is load-time only, so contention is negligible. Probe-call counter (capability_probe_call_counter) exposed for the regression test.
  • 3 new probes added to the cache + exposed as public forwarders:
    • supertonic_backend_supports_f16_mul_mat — gates the use_f16_weights auto-policy (Phase 2A made it !backend_is_cpu unconditionally; round 2 makes it probe-gated so a backend that ships F16 storage but rejects the hot mul_mat(F16, F32) shape doesn't crash at first synth call).
    • supertonic_backend_supports_q8_0_kv_flash_attn — forward-compat probe; primes the cache for round 4's live dispatch.
    • supertonic_backend_supports_native_leaky_relu — wraps round 1's inline probe so the auto-policy can use the cached path.
  • Engine::warm_up(text) API + EngineOptions::prewarm_text + --prewarm TEXT CLI flag. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines for every Supertonic stage compile up-front; the operator-visible first synthesize() call hits steady-state latency instead of paying the ~hundreds-of-ms cold-start hit chatterbox PROGRESS.md measured on Adreno + RADV. No-op on CPU backends.
  • New tests: test-supertonic-capability-cache (probe-counter regression — 1 cache miss + N hits) and test-supertonic-warm-up-api (SFINAE compile-time gate on the new API).
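
The cache-plus-counter contract (one miss, then N hits) can be sketched as below. This is an illustration under assumptions: `capability_cache`, `backend_caps` and the `const void *` key are hypothetical stand-ins for the real `cached_backend_capabilities` map keyed by `ggml_backend_t`.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical capability record; the real cache holds one flag per probe.
struct backend_caps {
    bool f16_kv_flash_attn = false;
    bool f16_mul_mat       = false;
    bool native_leaky_relu = false;
};

class capability_cache {
public:
    using probe_fn = std::function<backend_caps(const void * backend)>;

    explicit capability_cache(probe_fn probe) : probe_(std::move(probe)) {}

    // Returns the cached capabilities, running the probe only on the first
    // lookup for a given backend handle.
    backend_caps get(const void * backend) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(backend);
        if (it == cache_.end()) {
            ++probe_calls_;  // exposed so a regression test can count misses
            it = cache_.emplace(backend, probe_(backend)).first;
        }
        return it->second;
    }

    std::size_t probe_calls() const {
        std::lock_guard<std::mutex> lock(mu_);
        return probe_calls_;
    }

private:
    mutable std::mutex mu_;
    probe_fn probe_;
    std::map<const void *, backend_caps> cache_;
    std::size_t probe_calls_ = 0;
};
```

Holding the mutex while the probe runs keeps the sketch simple; since lookups happen only at load time, the serialisation should not matter in practice.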

36dc758c — Round 3: multi-device auto-pick + 2 forward-compat probes

The round-1 --vulkan-device N flag covered manual selection but every multi-GPU operator has to pin a specific index in their config; auto-pick across heterogeneous machines requires VRAM introspection.

  • --vulkan-device -1 auto-pick policy: resolve_vulkan_device_index pure-logic helper picks the device with the most free VRAM via ggml_backend_vk_get_device_memory(). Tie-break = lower index (deterministic). Reserved negatives < -1 throw to surface CLI typos. The pure-logic split makes the behaviour matrix testable on CPU with synthetic (index, [vram_per_device]) tuples — no real Vulkan device required for CI.
  • 2 new forward-compat probes added to the cache:
    • supertonic_backend_supports_bf16_kv_flash_attn — symmetric to F16-K/V, picks BF16 instead. Mostly relevant on Vulkan with cooperative_matrix2 (NVIDIA Ampere+ / RDNA3+).
    • supertonic_backend_supports_pinned_host_buffer: true iff the backend is Vulkan AND ggml_backend_vk_host_buffer_type() returns non-null. Primes the cache for round 12's per-engine input-scratchpad refactor.
  • New test test-supertonic-vulkan-device-select (8 functions, 23 checks — empty list, single device, auto-pick max VRAM, tie-breaking, explicit index passthrough, out-of-range, reserved negatives, zero-VRAM device).
  • test-supertonic-capability-cache extended with new-probe coverage.
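
The round-3 selection policy is small enough to sketch in full. The function below is a hypothetical reconstruction from the description above, not the real `resolve_vulkan_device_index`; in particular, throwing on an empty device list is an assumption, since the PR only says that case is tested.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// requested >= 0  : explicit index, range-checked
// requested == -1 : auto-pick the device with the most free VRAM
// requested <  -1 : reserved, rejected to surface CLI typos
inline int resolve_device_index_sketch(int requested,
                                       const std::vector<std::uint64_t> & free_vram) {
    if (requested < -1) {
        throw std::invalid_argument("reserved negative device index");
    }
    if (free_vram.empty()) {
        // Assumption: the real helper's empty-list behaviour is not stated.
        throw std::runtime_error("no Vulkan devices available");
    }
    if (requested >= 0) {
        if (static_cast<std::size_t>(requested) >= free_vram.size()) {
            throw std::out_of_range("device index out of range");
        }
        return requested;
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < free_vram.size(); ++i) {
        // Strict '>' keeps the lower index on ties (deterministic tie-break).
        if (free_vram[i] > free_vram[best]) {
            best = i;
        }
    }
    return static_cast<int>(best);
}
```

Because the function only sees plain integers, a test can feed it synthetic VRAM figures such as `{8, 32, 32}` and expect index 1 without any Vulkan device present.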

8087852b — Round 6: F16-weights operator deny-list

The Phase 2A F16-weights policy was all-or-nothing — operators couldn't keep one specific tensor at F32 if it caused drift on a particular adapter / driver combo without disabling F16 weights for the entire model.

  • 2-arg should_materialise_f16_weight(source_name, deny_list) overload layered on top of the curated allow-list. Each entry is a substring; if ANY non-empty entry is found inside a tensor's source name, that tensor stays at its native storage type. Empty entries are skipped defensively (config-typo guard so a stray empty entry doesn't silently disable F16 for the whole model).
  • EngineOptions::f16_weights_deny_list + --f16-weights-deny PAT1,PAT2,... CLI flag (comma-split parser shared between supertonic-cli / tts-cli / supertonic-bench). Default empty (zero behaviour change for every existing operator config).
  • supertonic_model::f16_weights_excluded_count counter surfaced in bench output (human + JSON) so operators can confirm their deny-list took effect. Silent on the default empty path.
  • New test test-supertonic-f16-deny-list-api (SFINAE + runtime defaults + assignability + regression guards). Existing test-supertonic-f16-weights extended with 7 new test functions / 29 new checks (empty-list passthrough, matching-deny-excludes, non-matching-no-op, cannot-promote-cold, multiple-patterns ANY-match, empty-string defensive skip, empty-name safety).
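
A minimal sketch of the deny-list semantics, assuming the allow-list decision is already available as a boolean. `is_denied`, `split_commas` and `should_materialise_f16_sketch` are hypothetical names, and any tensor names passed in are placeholders.

```cpp
#include <string>
#include <vector>

// True if any non-empty deny-list entry occurs as a substring of source_name.
inline bool is_denied(const std::string & source_name,
                      const std::vector<std::string> & deny_list) {
    for (const std::string & pattern : deny_list) {
        if (pattern.empty()) {
            continue;  // an empty pattern would otherwise match every tensor
        }
        if (source_name.find(pattern) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Splits "PAT1,PAT2,..." on commas. Empty pieces are kept here and ignored
// later by is_denied().
inline std::vector<std::string> split_commas(const std::string & csv) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type comma = csv.find(',', start);
        if (comma == std::string::npos) {
            out.push_back(csv.substr(start));
            return out;
        }
        out.push_back(csv.substr(start, comma - start));
        start = comma + 1;
    }
}

// The deny-list can only veto a tensor the allow-list already accepted;
// it never promotes a tensor that was not on the allow-list.
inline bool should_materialise_f16_sketch(bool allowed_by_allow_list,
                                          const std::string & source_name,
                                          const std::vector<std::string> & deny_list) {
    return allowed_by_allow_list && !is_denied(source_name, deny_list);
}
```

Skipping empty entries at match time means a trailing comma or a doubled comma in `--f16-weights-deny` stays harmless.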

60eed5e9 — Round 4: multi-dtype K/V flash-attention dispatch

The round-1 --f16-attn boolean only let operators pick between F32 and F16 K/V flash-attention. Round 4 generalises the dispatch into an enum of four K/V dtypes (plus an auto sentinel) + CLI flag so operators can opt into BF16 K/V (Vulkan coopmat2: same bandwidth as F16, no F16 underflow on small attention scores) or Q8_0 K/V (Vulkan + half the K/V upload bandwidth) on adapters that advertise the corresponding capability. This is the live wiring that turns the round-2 / round-3 probe results into actual GPU work.

  • New internal enum tts_cpp::supertonic::detail::kv_attn_dtype { autoselect=-1, f32=0, f16=1, bf16=2, q8_0=3 } + pure-logic resolver resolve_kv_attn_type(requested, legacy_use_f16_attn, supports_f16, supports_bf16, supports_q8_0). Same testable-policy split as round-3's resolve_vulkan_device_index.
  • EngineOptions::kv_attn_type int field (-1 = auto, 0..3 explicit) — same -1 = auto convention as f16_attn / f16_weights / vulkan_device, so operator configs are consistent. Default falls back to f16_attn's value, so every existing operator config sees zero behaviour change.
  • Probe-gated graceful fallback to F32 on adapters that don't support the requested dtype — an operator setting --kv-attn-type bf16 once in their production config works on both NVIDIA Ampere+ (BF16 effective via Vulkan coopmat2) and Intel ARC (no coopmat2 → silent F32 fallback) without crashing. Out-of-range --kv-attn-type N throws loudly to surface CLI typos.
  • Vector-estimator dispatch site rewrite (build_text_attention_cache): if (cache.f16_kv_attn) { cast→F16 } replaced with a switch on the enum; cast target picked from {F16, BF16, Q8_0} per cache.kv_attn_type. Cache invalidation key promoted from bool to enum (rebuilds the graph when the enum flips, same correctness contract as the rest of the cache key tuple).
  • --kv-attn-type {auto,f32,f16,bf16,q8_0} CLI on all three binaries. Bench surface adds (kv_attn_type=…) to the human-readable line and "kv_attn_type" + "kv_attn_type_requested" to the JSON output so log-grep / CI attribution works across machines.
  • Bonus: supertonic-cli arg-parse loop wrapped in try/catch so invalid values surface as a clean error: ... line + exit 2 (also fixes a pre-existing latent crash on --vulkan-device abc / --seed nonsense / etc).
  • Prereq B: test-supertonic-f16-attn-parity extended with 2 new BF16-vs-F32 parity checks (vector-estimator + style shapes; CPU max_abs_err = 5.263e-3 and 3.596e-3, both within the same 5e-3 tolerance band as the existing F16 baseline). Written BEFORE any production change — the parity gate was in place before the cast logic was touched.
  • 2 new tests: test-supertonic-kv-attn-type (106 checks across the full {requested × legacy × probe-mask} matrix, out-of-range throws, exhaustive resolver-never-leaks-autoselect sweep) and test-supertonic-kv-attn-type-api (18 checks — SFINAE compile-time gates, runtime defaults, RAII restoration, regression guards on every other documented EngineOptions default).
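
The resolver contract described in this round (auto falls back to the legacy flag, unsupported dtypes degrade to F32, out-of-range throws) could look roughly like this. The enum values come from the PR text; the body is a hypothetical sketch, and treating `legacy_use_f16_attn` as a plain bool is an assumption.

```cpp
#include <stdexcept>

enum class kv_attn_dtype : int { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

inline kv_attn_dtype resolve_kv_attn_type_sketch(int  requested,
                                                 bool legacy_use_f16_attn,
                                                 bool supports_f16,
                                                 bool supports_bf16,
                                                 bool supports_q8_0) {
    if (requested < -1 || requested > 3) {
        throw std::invalid_argument("kv_attn_type out of range");
    }
    kv_attn_dtype want = static_cast<kv_attn_dtype>(requested);
    if (want == kv_attn_dtype::autoselect) {
        // Auto defers to the legacy boolean so existing configs are unchanged.
        want = legacy_use_f16_attn ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
    }
    // Probe-gated fallback: an unsupported dtype degrades to F32. The result
    // is never autoselect, so callers always get a concrete dtype.
    switch (want) {
        case kv_attn_dtype::f16:  return supports_f16  ? want : kv_attn_dtype::f32;
        case kv_attn_dtype::bf16: return supports_bf16 ? want : kv_attn_dtype::f32;
        case kv_attn_dtype::q8_0: return supports_q8_0 ? want : kv_attn_dtype::f32;
        default:                  return kv_attn_dtype::f32;
    }
}
```

With the capability flags passed in as booleans, the full {requested × legacy × probe-mask} matrix can be swept in a loop on a CPU-only runner.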

3c59e523 — Round 7: bench observability + voice cache + Vulkan env-var passthrough

Lowest impact-÷-risk round of those planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup.

  1. Voice ttl/dp host cache (detail::voice_host_cache). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site.
  2. Vulkan env-var passthrough: apply_vulkan_env_overrides(map) public helper + EngineOptions::vulkan_env_overrides field + --vulkan-prefer-host-memory / --vulkan-disable-coopmat2 / --vulkan-disable-bfloat16 / --vulkan-perf-logger / --vulkan-async-transfer / --vulkan-env KEY=VALUE CLI flags on all three binaries. ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. set_env_if_unset semantics so an operator-set env var still WINS over the EngineOptions override.
  3. Bench ggml_backend_synchronize boundaries (--no-bench-sync opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware.
  4. Bench per-denoise-step breakdown (--bench-per-step). Times each supertonic_vector_step_ggml call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape.

Two new test executables (test-supertonic-voice-host-cache, test-supertonic-vulkan-env-overrides). TDD caught the env-key validator's empty-string-as-success bug BEFORE wiring went in.
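
The all-or-nothing and set-if-unset rules for the env-var passthrough can be sketched with a bool / out-param validator. This is a POSIX-only illustration (it relies on `setenv`); the function names are hypothetical, the "no `=` in the key" rule is an assumption, and the sketch reports failure through a return value where the PR says the real helper throws.

```cpp
#include <map>
#include <stdlib.h>  // POSIX setenv
#include <string>

// Returns true when the key is usable; on failure writes a reason to *error.
// A bool result avoids overloading an empty string as the "success" value.
inline bool validate_env_key(const std::string & key, std::string * error) {
    if (key.empty()) {
        if (error) *error = "empty environment variable name";
        return false;
    }
    if (key.find('=') != std::string::npos) {  // assumption: '=' is rejected
        if (error) *error = "environment variable name contains '=': " + key;
        return false;
    }
    return true;
}

inline bool apply_env_overrides_sketch(const std::map<std::string, std::string> & overrides,
                                       std::string * error) {
    // Pass 1: validate every key before touching the environment.
    for (const auto & kv : overrides) {
        if (!validate_env_key(kv.first, error)) {
            return false;
        }
    }
    // Pass 2: overwrite = 0 leaves variables that are already set untouched,
    // so a value exported by the operator wins over the override.
    for (const auto & kv : overrides) {
        if (setenv(kv.first.c_str(), kv.second.c_str(), /*overwrite=*/0) != 0) {
            if (error) *error = "setenv failed for " + kv.first;
            return false;
        }
    }
    return true;
}
```

Validating in a separate first pass is what keeps a typo in one entry from leaving the process with a half-applied set of overrides.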

5b166a79 — Round 8: front-block attn0 GPU bridge

Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site). Strict gating on front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0 — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without vector_rope_theta continue to take the host-rotate path.

The blit primitive parity gate already shipped with PR #16 (test-supertonic-graph-to-graph-blit); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact max_abs = 0.0).

0fa1593c — Round 9: style flash-attn GPU bridge

Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win).

  • vector_res_style_qkv_result extended with sq_gpu / sk_gpu / sv_gpu GPU handles, populated unconditionally by run_res_style_qkv_cache (cheap — no GPU sync; just ggml_graph_get_tensor lookups).
  • run_res_style_qkv_cache host-download gating: the 3 tensor_to_time_channel(...) downloads of sq / sk / sv are now gated on trace != nullptr. Production path skips them entirely. post stays unconditional — consumed by the next-stage run_style_residual_cache which still expects a host vector (cross-stage GPU bridge for post is deferred).
  • 4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: !include_ggml_trace && sq_gpu && sk_gpu && sv_gpu → GPU bridge; otherwise legacy host bridge.

Strict TDD: parity test (test-supertonic-graph-to-graph-blit) extended with explicit style-shape coverage BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact max_abs = 0.0.

38a67e45 — Round 10: per-step text-input upload-skip

After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is text_emb (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for style_v_in / kctx_in) into a reusable upload_skip_tracker helper and applies it to the front-block + 3 group caches.

CRITICAL CORRECTNESS HAZARD addressed: text_emb is a function-local std::vector<float> in Engine::Impl::synthesize() (and bench loops), so its heap buffer is freed at the end of every synth. Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next vector of the same size, so synth N+1 may have text_emb.data() == synth_N.text_emb.data() despite holding completely different data. A naive pointer-compare upload-skip would silently leak the prior synth's text-encoder embedding into the next synth's GPU buffer.

Mitigation: caller MUST invoke tracker.reset() at every synth boundary (current_step == 0). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it.

Per-synth wins: 16 fewer host→GPU uploads + ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length).

test-supertonic-upload-skip-tracker (NEW, 7 functions, 41 checks) committed first, observed to fail compile, then implementation added.

b54b7d43 — Round 11: packed-QK RoPE + GPU-bridge layout fix (CRITICAL CORRECTNESS)

Round 11 didn't add a new optimisation — it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying vector_rope_theta. The first end-to-end synth attempt (CPU OR Vulkan) aborted at GGML_ASSERT(HD == n_heads * head_dim) inside apply_rope_to_packed_qk, and even past that assertion every ggml_backend_tensor_copy(q_src, q_tc_in) on the GPU-bridge fast paths would have hit GGML_ASSERT(ggml_are_same_layout(src, dst)) because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's q_tc_in / k_tc_in / v_tc_in tensors expect.

Root cause: apply_rope_to_packed_qk (PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] channel-fastest-in-memory tensor. In fact the matmul (CPU cblas_sgemm and GPU conv1d_f32(K=1)) produces ne=[L, HD] with channel-major-flat memory — the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong [HD, L] shape, so the failure mode was invisible to CI.

The fix (strict TDD):

  1. test_supertonic_rope_packed_qk.cpp rewritten under the production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
  2. apply_rope_to_packed_qk (supertonic_internal.h): add a head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (which IS the layout q_tc_in expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar apply_rope's native layout AND q_tc_in's blit target bit-for-bit.
  3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip — open-code the same ggml_cont(ggml_transpose(...)) at the matmul output in build_group_graph_cache, ve_front_block_proj_cache, and build_res_style_qkv_cache so all four GPU-bridge attention sites get bit-for-bit matching layouts.
  4. Legacy host-bridge fallbacks switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references read, so the raw download is the correct call.
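
The layout flip at the heart of the fix is an ordinary 2-D transpose. The host-side sketch below shows the index mapping that `ggml_cont(ggml_transpose(q))` is described as performing in-graph; the function name is hypothetical and the code illustrates the byte layout only, it is not the production path.

```cpp
#include <cstddef>
#include <vector>

// Input:  ne = [L, HD], L is the fastest-varying axis, so element
//         (time t, channel c) lives at index c * L + t (channel-major-flat).
// Output: ne = [HD, L], HD is the fastest-varying axis, so element
//         (time t, channel c) lives at index t * HD + c (time-major-flat).
inline std::vector<float> to_time_major_flat(const std::vector<float> & src,
                                             std::size_t L, std::size_t HD) {
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

For L = 2 and HD = 3, an input of `{0, 1, 10, 11, 20, 21}` (channel 0 first) becomes `{0, 10, 20, 1, 11, 21}` (time step 0 first). A bit-exact comparison of such buffers is the kind of check the rewritten RoPE test performs on the helper's output.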

Verification:

| Backend | Pre-fix | Post-fix |
| --- | --- | --- |
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74× realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7× realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have — they just couldn't run before round 11 unblocked the production path.

bb99d3ce — Round 12: auto-pick UMA bias + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware.

#10 — Auto-pick UMA bias

Round 3's argmax(free_vram) picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40× realtime regression for any operator following the help text "auto-pick adapter with most free VRAM".

Extended resolve_vulkan_device_index with an optional third arg is_uma_per_device. Empty UMA list → round-3 behaviour preserved verbatim. Non-empty + at least one discrete → argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit requested >= 0 passthrough is UMA-agnostic.

Caller wiring (in init_supertonic_backend) collects UMA flags via the public ggml_backend_dev_get_props() API on ggml_backend_vk_reg() — sets is_uma = true for GGML_BACKEND_DEVICE_TYPE_IGPU / _CPU / _ACCEL.

test_supertonic_vulkan_device_select.cpp extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-index-ignores-UMA-bias, mismatched-length-throws).
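
The round-12 behaviour matrix for the auto-pick case can be sketched as follows. `auto_pick_device_sketch` is a hypothetical function covering only the `requested == -1` branch, and throwing on an empty device list is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

inline int auto_pick_device_sketch(const std::vector<std::uint64_t> & free_vram,
                                   const std::vector<bool> & is_uma = {}) {
    if (free_vram.empty()) {
        throw std::runtime_error("no Vulkan devices available");  // assumption
    }
    if (!is_uma.empty() && is_uma.size() != free_vram.size()) {
        throw std::invalid_argument("is_uma length does not match device count");
    }
    bool any_discrete = false;
    for (bool uma : is_uma) {
        if (!uma) any_discrete = true;
    }
    // Restrict the argmax to discrete devices only when UMA flags were given
    // and at least one device is discrete. An empty or all-UMA list keeps
    // the plain argmax over every device.
    const bool discrete_only = !is_uma.empty() && any_discrete;

    int best = -1;
    for (std::size_t i = 0; i < free_vram.size(); ++i) {
        if (discrete_only && is_uma[i]) {
            continue;
        }
        // Strict '>' keeps the lower index on ties.
        if (best < 0 || free_vram[i] > free_vram[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

On the hybrid rig described above, VRAM figures of `{120, 32}` with UMA flags `{true, false}` select index 1 (the discrete card), whereas the same figures without flags select index 0.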

#6 — Text-encoder speech-prompted-attention GPU bridge

Master's Metal-port branch (PR #15) built speech_prompted_merged_cache (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download → pack → re-upload between the QKV cache and the flash-attn cache.

Round 12 #6 adds run_speech_prompted_merged_cache and the dispatch in speech_prompted_attention_ggml. Eliminates per call: 2 GPU→host downloads + 3 host→GPU uploads + 1 graph dispatch + all host pack work = 5 sync points × 2 layers = 10 sync points / synth at the text encoder alone.

CPU stays on the legacy two-cache path: master's dense_matmul_time_ggml CPU fast path uses cblas + the host-side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU).

test_supertonic_text_encoder_gpu_bridge.cpp (NEW) pins the symbol via SFINAE + struct field contract + a free-default-cache trip-wire (catches a buggy free path that segfaults on never-built thread_local cache slots at process exit). 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests.

#5 — Pinned-host-buffer per-step input scratchpad

Round 3 shipped the capability probe; the actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 #5 lands the helper try_alloc_inputs_in_pinned_host_buffer.

Returns nullptr on null model.backend / null input_ctx / non-Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from ggml_backend_vk_host_buffer_type() via ggml_backend_alloc_ctx_tensors_from_buft. Caller owns the returned buffer; frees at cache destruction.

Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: vector_group_graph_cache (× 3 for g1/g2/g3) and ve_front_block_graph_cache. Total: 9 per-step input tensors moved to host-pinned memory. Each ggml_backend_tensor_set on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy).

CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains.

test_supertonic_pinned_host_buffer.cpp (NEW) — 11 / 11 CPU-only checks pass.

Combined perf snapshot on RTX 5090

Long-prompt bench (173 chars, ~15s of audio):

  • Round 11 baseline: 76.11 ms / 5 steps (123× realtime)
  • Round 12 (all three): 27.99 ms / 5 steps (537× realtime) — 2.7× faster
  • Vector-estimator step: 12.7 ms → 3.28 ms (3.9× faster)
  • Prewarm cold-start: 330 ms → 21 ms (15× faster)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):

  • Round 11 --vulkan-device -1: picks RADV → 178 ms (7× realtime)
  • Round 12 --vulkan-device -1: picks RTX 5090 → 28 ms (537× realtime) — 6.4× faster for users following help text

b9f95358 — Round 13: code-quality consolidation + Q8_0 K/V finding

Strict-improvement-only follow-up to round 12: no code path is removed, no optimisation is rolled back, end-to-end perf on every backend stays at the round-12 level. Two deliverables, both no-regret:

1. New helper alloc_input_scratchpad_or_throw

Round 12 #5 inlined the "try pinned-host first, fall back to default backend buffer, throw on both-fail" idiom at 4 cache sites (front block + 3 group caches). Round 13 factors it into one helper. Same correctness contract — CPU / Metal / OpenCL fall back to default backend buffer; Vulkan tries pinned-host first. Defensive failure modes consolidated: null model.backend, null input_ctx, null cache_name all throw std::runtime_error with a message that includes the cache name, instead of segfaulting in an error-handler path. Single point of maintenance for the pattern; future cache builds that want pinned-host inputs use the helper directly.

test_supertonic_input_scratchpad.cpp (NEW, 9 / 9 checks) pins the contract via SFINAE on the symbol + CPU-fallback round-trip through ggml_backend_tensor_set / get + null-arg throws + empty-ctx error message includes the cache name. CPU-only — no GGUF fixture required.

Perf impact: zero — same code path, same allocations, same data movement, just one fewer level of nesting at each call site.
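
The "try pinned-host first, fall back to the default buffer, throw if both fail" idiom can be shown without any ggml dependency by injecting the two allocators. Everything in this sketch is hypothetical: `buffer_handle`, `alloc_fn` and `alloc_or_throw_sketch` are stand-ins, not the signature of `alloc_input_scratchpad_or_throw`.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

using buffer_handle = void *;                     // stand-in for a backend buffer
using alloc_fn      = std::function<buffer_handle()>;  // returns nullptr on failure

inline buffer_handle alloc_or_throw_sketch(const alloc_fn & try_pinned_host,
                                           const alloc_fn & alloc_default,
                                           const char *     cache_name) {
    if (cache_name == nullptr) {
        throw std::runtime_error("alloc_or_throw_sketch: null cache_name");
    }
    if (try_pinned_host) {
        if (buffer_handle buf = try_pinned_host()) {
            return buf;  // fast path, expected to succeed only on Vulkan
        }
    }
    if (alloc_default) {
        if (buffer_handle buf = alloc_default()) {
            return buf;  // portable fallback for CPU / Metal / OpenCL
        }
    }
    // Name the cache in the message so the failing call site is identifiable.
    throw std::runtime_error(
        std::string("input scratchpad allocation failed for cache '") + cache_name + "'");
}
```

Passing the allocators in as callables is a choice made for this illustration so the fallback order can be tested with fakes; the real helper takes the model and input context directly, as described above.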

2. Q8_0 K/V no-win documented for RTX 5090

Round 4 shipped the --kv-attn-type q8_0 CLI option and bench output advertises q8_0_kv_attn=available. Round 13 measures the trade-off on the test rig (RTX 5090, 1.79 TB/s memory bandwidth, long prompt 206 chars / 18 s audio):

| --kv-attn-type | Total | Realtime ratio |
| --- | --- | --- |
| f16 (default) | 31.11 ms | 588× |
| q8_0 | 31.84 ms | 575× (2 % slower) |

The F32→Q8_0 cast overhead exceeds the saved K/V upload bandwidth on a high-bandwidth discrete GPU. Operator guidance: stick with the F16 default on RTX 5090 and similar high-bandwidth discretes. Q8_0 is shipped for adapters where the K/V upload bottlenecks the synth (older PCIe 3.0, lower-end discretes, iGPUs with slow BAR); cross-over point to be measured per-adapter by operators using --bench-per-step from round 7.

Backwards-compatibility contract

Every round preserves the existing operator-config baseline:

  • --f16-attn 0|1 semantics unchanged — round 4's --kv-attn-type auto (the default) falls back to --f16-attn via the resolver.
  • --vulkan-device 0 semantics unchanged — round 1 introduced the flag; round 3's -1 is opt-in only; round 12's UMA-bias only activates on hybrid rigs and never overrides an explicit index.
  • --f16-weights 0|1 semantics unchanged — round 6's --f16-weights-deny is opt-in only and has no effect when --f16-weights 0.
  • --prewarm defaults to empty (no-op).
  • --vulkan-env / --vulkan-prefer-host-memory / --vulkan-disable-coopmat2 etc. (round 7) all default off; an operator-set env var still wins over the EngineOptions override.
  • --bench-per-step / --no-bench-sync (round 7) default off; legacy JSON shape preserved on the default path.
  • model.use_f16_attn boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
  • All round-1 / round-3 probes throw on out-of-range CLI input (loud failure for actual config errors); all round-2 / round-3 / round-4 probe-gated dispatches fall back to F32 silently (advisory-probe contract — visible in bench output so operators can confirm a fallback).
  • All GPU-bridge fast paths (rounds 8 / 9 / 12 #6) gate on !include_ggml_trace, so the trace harness still captures pre-attention Q/K/V host vectors.
  • Round-10 upload-skip is gated on tracker.reset() at every synth boundary; without the reset, the tracker behaves identically to a no-op (each call uploads).
  • Round-11 layout-flip is universally applied, so the legacy host-bridge fallback continues to work bit-for-bit on backends that don't activate the GPU bridge.
  • Round-12 #5 pinned-host helper safely returns nullptr on non-Vulkan backends, and the round-13 helper then falls back to the default backend buffer; no allocator behaviour change for CPU / Metal / OpenCL.
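The resolver-related bullets above (auto deferring to --f16-attn, probe-gated fallback to F32, and the legacy boolean staying in sync) can be sketched as follows. This is an illustrative reconstruction, not the PR's resolve_kv_attn_type: the enumerator order, the `kv_probe_mask` struct, and the function names ending in `_sketch` are assumptions, and treating "auto + legacy F16 + missing probe" as F32 is inferred from the advisory-probe contract rather than confirmed by the source.

```cpp
// Enumerator names follow the PR text; their order and values are assumed.
enum class kv_attn_dtype { autoselect, f32, f16, bf16, q8_0 };

// Advisory probe results for the active backend (hypothetical struct).
struct kv_probe_mask {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

// Pure-logic resolver: always returns a concrete dtype, never autoselect.
kv_attn_dtype resolve_kv_attn_type_sketch(kv_attn_dtype requested,
                                          bool          legacy_f16_attn,
                                          kv_probe_mask probes) {
    // "auto" defers to the legacy --f16-attn boolean.
    if (requested == kv_attn_dtype::autoselect) {
        requested = legacy_f16_attn ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
    }
    // A missing probe silently falls back to F32.
    switch (requested) {
        case kv_attn_dtype::f16:
            return probes.f16 ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        case kv_attn_dtype::bf16:
            return probes.bf16 ? kv_attn_dtype::bf16 : kv_attn_dtype::f32;
        case kv_attn_dtype::q8_0:
            return probes.q8_0 ? kv_attn_dtype::q8_0 : kv_attn_dtype::f32;
        default:
            return kv_attn_dtype::f32;
    }
}

// Keeps the legacy boolean consistent with the resolved enum.
bool use_f16_attn_from_sketch(kv_attn_dtype resolved) {
    return resolved == kv_attn_dtype::f16;
}
```

Because the function takes only plain values, the whole requested × legacy × probe-mask matrix can be swept on a CPU-only build without a Vulkan adapter.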

Test plan

CPU-only — a fresh checkout's ctest -L unit exercises every new contract without needing a Vulkan adapter.

cmake -S tts-cpp -B build-tts -DTTS_CPP_USE_SYSTEM_GGML=OFF
cmake --build build-tts --parallel
ctest --test-dir build-tts -L unit --output-on-failure

Expected: 25 / 25 tests, 0 failures, 0 regressions.

Vulkan build (same expectations):

cmake -S tts-cpp -B build-tts-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
cmake --build build-tts-vulkan --parallel
ctest --test-dir build-tts-vulkan -L unit --output-on-failure

| Test | Purpose | Round |
|---|---|---|
| test-supertonic-vulkan-dispatch | Backend-flag dispatch through supertonic_op_dispatch_scope + F16-K/V probe smoke | 1 |
| test-supertonic-portable-ops (UPDATED) | LEAKY_RELU decomposition path stays exercised when the helper short-circuits to the native fused op | 1 |
| test-supertonic-capability-cache | Probe-counter regression (1 cache miss + N hits per backend) + new-probe coverage | 2 + 3 |
| test-supertonic-warm-up-api | SFINAE compile-time gate for Engine::warm_up + EngineOptions::prewarm_text | 2 |
| test-supertonic-vulkan-device-select | resolve_vulkan_device_index behaviour matrix (extended in r12 with UMA-bias coverage) | 3 + 12 |
| test-supertonic-f16-weights (UPDATED) | Round 6 deny-list overload: 7 new functions / 29 new checks | 6 |
| test-supertonic-f16-deny-list-api | SFINAE compile-time gate for EngineOptions::f16_weights_deny_list | 6 |
| test-supertonic-kv-attn-type | resolve_kv_attn_type behaviour matrix (full {requested × legacy × probe-mask} sweep, 106 checks) | 4 |
| test-supertonic-kv-attn-type-api | SFINAE compile-time gates for the round-4 enum + EngineOptions field | 4 |
| test-supertonic-f16-attn-parity (UPDATED) | F16 + BF16 K/V parity vs F32 reference on both hot shapes | 4 |
| test-supertonic-voice-host-cache | Voice ttl/dp host cache lookup-or-load semantics + reference stability | 7 |
| test-supertonic-vulkan-env-overrides | All-or-nothing env-var validator + set-if-unset semantics | 7 |
| test-supertonic-graph-to-graph-blit (UPDATED) | Front-block + style + group attention shapes, bit-exact max_abs = 0.0 | 8 + 9 |
| test-supertonic-upload-skip-tracker | Pointer-compare upload-skip + cross-synth pointer-reuse hazard test (41 checks) | 10 |
| test-supertonic-rope-packed-qk (REWRITTEN) | RoPE helper under production [L, HD] matmul layout, bit-exact vs scalar apply_rope | 11 |
| test-supertonic-text-encoder-gpu-bridge | run_speech_prompted_merged_cache SFINAE + struct contract + free-default trip-wire | 12 |
| test-supertonic-pinned-host-buffer | try_alloc_inputs_in_pinned_host_buffer nullptr safety + non-Vulkan fallback | 12 |
| test-supertonic-input-scratchpad | alloc_input_scratchpad_or_throw SFINAE + CPU-fallback round-trip + null-arg throws | 13 |
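The "all-or-nothing validator + set-if-unset" contract exercised by test-supertonic-vulkan-env-overrides can be illustrated with the sketch below. It is an assumption-laden example: `EnvOverrides` and `apply_env_overrides_sketch` are hypothetical names, the validation rule (non-empty name without '=') is a placeholder for whatever the real validator checks, and POSIX `setenv` is assumed to be available.

```cpp
#include <stdlib.h>  // POSIX setenv

#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using EnvOverrides = std::vector<std::pair<std::string, std::string>>;

// Applies overrides without clobbering values the operator already exported.
void apply_env_overrides_sketch(const EnvOverrides & overrides) {
    // Pass 1: validate every entry before touching the environment, so a
    // single bad entry leaves the process environment unchanged.
    for (const auto & kv : overrides) {
        if (kv.first.empty() || kv.first.find('=') != std::string::npos) {
            throw std::invalid_argument("invalid env var name: '" + kv.first + "'");
        }
    }
    // Pass 2: overwrite = 0 means an existing value wins over the override.
    for (const auto & kv : overrides) {
        if (setenv(kv.first.c_str(), kv.second.c_str(), 0) != 0) {
            throw std::runtime_error("setenv failed for " + kv.first);
        }
    }
}
```

Splitting validation from application is what makes the operation all-or-nothing, and passing `overwrite = 0` is what lets an operator-set variable take precedence.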

Smoke testing the CLIs

# Help text on all three binaries (round-4 + round-7 flags visible)
./build-tts/supertonic-cli --help 2>&1 | grep -A 6 kv-attn-type
./build-tts/tts-cli        --help 2>&1 | grep -B1 -A 6 vulkan-env
./build-tts/supertonic-bench       2>&1 | grep -A 5 bench-per-step

# Invalid value surfaces cleanly (no backtrace)
./build-tts/supertonic-cli --model /tmp/x.gguf --text x --out x.wav --kv-attn-type bogus
# -> "error: --kv-attn-type expects one of: auto, f32, f16, bf16, q8_0 (got: bogus)"
# -> exit 2

# Full round-1..13 surface
./build-tts/supertonic-cli --model models/supertonic2.gguf --text "Hello" --out /tmp/out.wav \
  --vulkan-device -1 --kv-attn-type bf16 --f16-weights 1 \
  --f16-weights-deny vector_estimator.attention.W_v --prewarm "Warm up text." \
  --vulkan-prefer-host-memory --vulkan-disable-coopmat2

End-to-end real-Vulkan validation

Verified on 4 backends after round 11 unblocked the production path:

| Backend | Result | Latency / 5 steps |
|---|---|---|
| CPU | writes 3.89 s WAV | (reference) |
| Vulkan RTX 5090 | writes 6.53 s WAV | 28 ms / 537–588× realtime (round 12+) |
| Vulkan AMD RADV iGPU | writes 3.64 s WAV | 178 ms / 7× realtime |
| Vulkan Mesa lavapipe | writes 1.21 s WAV | (CPU-emulated) |

Bench JSON includes "kv_attn_type" (resolved) + "kv_attn_type_requested" (raw int) + "prewarm_ms" + per-step timings (--bench-per-step) so a probe miss / cold-start cost / per-step regression is visible in the output and CI scripts can attribute drift / perf differences to the right cause.

File-by-file change summary

30 files changed, 8950 insertions(+), 331 deletions(-)

| File | Δ | Notes |
|---|---|---|
| tts-cpp/CMakeLists.txt | +184 | Wire 12 new test executables + Vulkan link option |
| tts-cpp/PROGRESS_SUPERTONIC.md | +1377 | Per-round audit + measurement log |
| tts-cpp/include/tts-cpp/supertonic/engine.h | +137 | New EngineOptions fields: vulkan_device, prewarm_text, f16_weights_deny_list, kv_attn_type, vulkan_env_overrides, bench_per_step + Engine::warm_up() |
| tts-cpp/src/chatterbox_cli.cpp | +118 | All round flags mirrored on the tts-cli alias |
| tts-cpp/src/chatterbox_tts.cpp | +1 | #include <atomic> (pre-existing missing-include fix) |
| tts-cpp/src/supertonic_bench.cpp | +397 | All round flags + bench-output surface (human + JSON) + per-step + sync-boundary + voice-cache-stats |
| tts-cpp/src/supertonic_cli.cpp | +73 | All round flags + try/catch arg-parse hardening |
| tts-cpp/src/supertonic_engine.cpp | +145 | Probe-gated use_f16_weights auto-policy, multi-device auto-pick wiring (with UMA bias), warm_up impl, round-4 K/V dispatch resolution, voice-cache integration, env-var passthrough |
| tts-cpp/src/supertonic_gguf.cpp | +1151 | Capability-cache implementation, 6 new probes, resolve_vulkan_device_index (with UMA bias), resolve_kv_attn_type, multi-device auto-pick, dispatch-scope rounds 1–13 plumbing, deny-list integration, pinned-host-buffer helper, alloc_input_scratchpad_or_throw |
| tts-cpp/src/supertonic_internal.h | +866 | New kv_attn_dtype enum, model fields, probe forwarders, resolvers, dispatch-scope extension, voice_host_cache, upload_skip_tracker, GPU-bridge tensor handles, packed-QK RoPE layout fix |
| tts-cpp/src/supertonic_text_encoder.cpp | +152 | run_speech_prompted_merged_cache + dispatch in speech_prompted_attention_ggml (round-12 #6) |
| tts-cpp/src/supertonic_vector_estimator.cpp | +718 | Round-4 enum-switch dispatch site, cache-key promotion, GPU-bridge front-block + style + group rewires (rounds 8 / 9), upload-skip tracker integration (round 10), pinned-host-buffer per-step inputs (round 12 #5), layout fixes for round-11 GPU-bridge blits |
| tts-cpp/test/test_supertonic_capability_cache.cpp | NEW (+424) | Round 2 + extended in round 3 |
| tts-cpp/test/test_supertonic_f16_attn_parity.cpp | +162 | Prereq B BF16 extension |
| tts-cpp/test/test_supertonic_f16_deny_list_api.cpp | NEW (+134) | Round 6 |
| tts-cpp/test/test_supertonic_f16_weights.cpp | +147 | Round 6 deny-list extension |
| tts-cpp/test/test_supertonic_graph_to_graph_blit.cpp | +28 | Round 8 + 9 front-block + style shape coverage |
| tts-cpp/test/test_supertonic_input_scratchpad.cpp | NEW (+296) | Round 13 |
| tts-cpp/test/test_supertonic_kv_attn_type.cpp | NEW (+256) | Round 4 (106 checks) |
| tts-cpp/test/test_supertonic_kv_attn_type_api.cpp | NEW (+157) | Round 4 |
| tts-cpp/test/test_supertonic_pinned_host_buffer.cpp | NEW (+236) | Round 12 #5 |
| tts-cpp/test/test_supertonic_portable_ops.cpp | +10 | Round 1: explicit use_native_leaky_relu = false on the GPU fixture |
| tts-cpp/test/test_supertonic_rope_packed_qk.cpp | REWRITTEN (+244 / -93) | Round 11: production [L, HD] matmul layout |
| tts-cpp/test/test_supertonic_text_encoder_gpu_bridge.cpp | NEW (+216) | Round 12 #6 |
| tts-cpp/test/test_supertonic_upload_skip_tracker.cpp | NEW (+300) | Round 10 (41 checks) |
| tts-cpp/test/test_supertonic_voice_host_cache.cpp | NEW (+285) | Round 7 |
| tts-cpp/test/test_supertonic_vulkan_device_select.cpp | NEW (+403) | Round 3 + extended in round 12 (UMA-bias coverage) |
| tts-cpp/test/test_supertonic_vulkan_dispatch.cpp | NEW (+268) | Round 1 |
| tts-cpp/test/test_supertonic_vulkan_env_overrides.cpp | NEW (+278) | Round 7 |
| tts-cpp/test/test_supertonic_warm_up_api.cpp | NEW (+118) | Round 2 |

Deferred follow-ups (intentionally out of scope)

Tracked in tts-cpp/PROGRESS_SUPERTONIC.md "Deferred work" section:

  • Persistent VkPipelineCache: recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by <vendorID>-<deviceID>-<driverVersion> and rooted at $XDG_CACHE_HOME/ggml/vulkan. This is a ggml-vulkan internal patch (~199 lines) that benefits all Vulkan workloads, not just Supertonic; tracked separately so the supertonic-specific PR stays reviewable. Round-2's --prewarm is an in-process workaround; the persistent on-disk cache extends the win across process restarts.
  • Cross-stage GPU bridge for post (round 9 follow-up): the post output of run_res_style_qkv_cache is still downloaded to host and re-uploaded into run_style_residual_cache. Would eliminate ~20 more sync points / synth. Deferred until measured impact justifies the dual-graph refactor.
  • Q8_0 K/V cross-over measurement: round 13 documents Q8_0 is a 2 % regression on RTX 5090; cross-over point to be measured per-adapter (older PCIe 3.0, low-end discretes, iGPUs with slow BAR) by operators using --bench-per-step from round 7.
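For the deferred persistent pipeline cache, the keying scheme described above might translate into a path builder along these lines. This is purely illustrative: `pipeline_cache_path_sketch` is a hypothetical name, the hexadecimal formatting and the `.bin` suffix are assumptions, and the fallback to `$HOME/.cache` when `XDG_CACHE_HOME` is unset follows the XDG Base Directory convention rather than anything stated in the PR.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Builds <cache-root>/ggml/vulkan/<vendorID>-<deviceID>-<driverVersion>.bin.
// The three IDs correspond to fields of VkPhysicalDeviceProperties.
std::string pipeline_cache_path_sketch(std::uint32_t vendor_id,
                                       std::uint32_t device_id,
                                       std::uint32_t driver_version) {
    std::string root;
    const char * xdg = std::getenv("XDG_CACHE_HOME");
    if (xdg != nullptr && *xdg != '\0') {
        root = xdg;
    } else {
        // XDG default when the variable is unset or empty.
        const char * home = std::getenv("HOME");
        root = std::string(home != nullptr ? home : ".") + "/.cache";
    }
    char key[64];
    std::snprintf(key, sizeof(key), "%04x-%04x-%08x",
                  static_cast<unsigned>(vendor_id),
                  static_cast<unsigned>(device_id),
                  static_cast<unsigned>(driver_version));
    return root + "/ggml/vulkan/" + key + ".bin";
}
```

Including the driver version in the key means a driver upgrade naturally misses the old cache file instead of loading pipeline data compiled by a different driver.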

Linked

@Zbig9000 Zbig9000 requested review from a team as code owners May 14, 2026 09:37
@Zbig9000 Zbig9000 force-pushed the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch 2 times, most recently from b9f9535 to 51a17d9 Compare May 15, 2026 14:25

@GustavoA1604 GustavoA1604 left a comment

Please help address/clarify the following:

  1. Round 5 is skipped — no explanation

The summary says "twelve rounds of Vulkan-specific deltas (rounds 1–13, round 5 skipped)" but nowhere in the PR is there an explanation of what round 5 was or why it was skipped. Was it superseded by another round? Rolled into a different commit? Abandoned after testing? This leaves a gap in the audit log that makes it harder to assess whether the omission is safe or whether something was quietly dropped.

  2. The round-11 fix is redone in PR #21

PR #21 is a standalone fix for the same apply_rope_to_packed_qk layout bug fixed in round 11 here, but targeting supertonic_optimizations (without Vulkan). The PR description acknowledges the bug came from PR #16. What's unclear is the merge strategy: does PR #18 subsume PR #21 when it lands, or will both be merged separately and cause a double-application of the fix? The V-transpose fix in PR #21 also says it only touches 2 GPU-bridge call sites, while round 11 here touches 4 (build_group_graph_cache, ve_front_block_proj_cache, build_res_style_qkv_cache, and style sq/sk/sv). The difference needs to be reconciled before either merges.

  3. UMA bias heuristic is fragile on some device topologies

The round-12 fix (resolve_vulkan_device_index with is_uma_per_device) picks the discrete adapter by excluding GGML_BACKEND_DEVICE_TYPE_IGPU / _CPU / _ACCEL. This works for the RTX 5090 + AMD RADV iGPU test case. However, on machines where the discrete GPU is the only device and reports GGML_BACKEND_DEVICE_TYPE_IGPU (some Thunderbolt eGPUs, some ARM SoC configurations), the "all-UMA fallback" path would fire and argmax(free_vram) would still pick the right device. That's correct by the test matrix. But if someone has two UMA iGPUs and one discrete that also happens to report IGPU type due to a driver quirk, they'd silently get the wrong device with no warning. The existing test cases don't cover this; it might be worth a code comment documenting the assumption.

  4. Voice host cache reference stability — documented but not enforced

Round 7 introduces voice_host_cache and documents that "reference-stability contract [is] documented for the synthesis-pipeline call site." The test pins the contract via CPU-only checks. However, if a synthesizer call happens concurrently (e.g., from a thread pool or the iOS scenario described in the iOS concurrency fix commit), and the cache is evicted or a new voice is loaded mid-synthesis, the reference would dangle. The PR doesn't show any locking on the cache access path. Given that the iOS race fixes landed in the same PR history (the 36a2c56 commit fixing the gguf_init_from_file race), this deserves explicit scrutiny: is voice_host_cache accessed under any lock, or is it the caller's responsibility to ensure single-threaded access?

@Zbig9000 Zbig9000 force-pushed the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 51a17d9 to 1632e45 Compare May 18, 2026 10:23
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…sumption + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 #5.
   No code was dropped; round numbers stay contiguous so PR
   descriptions and CI logs match the round labels in this log
   without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.

Reviewer's fourth ask — "the round-11 fix is redone in PR
tetherto#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000

Zbig9000 commented May 18, 2026

Author

Reply 1 — "Round 5 is skipped — no explanation"
Good catch — fixed. Round 5 was a planning placeholder, not abandoned code. The round-4 plan in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md reserved the name "Round 5 = pinned-host-buffer per-step uploads" as the next deliverable. We deferred it because the plan itself called out a hard prerequisite: round 7's bench observability was needed to measure the win and verify no regression on adapters where pinned-host turns out slower. After landing rounds 6, 7, 8, 9, 10, 11 we came back to the pinned-host-buffer work and shipped it as round 12 #5 (bundled with two other items: the auto-pick UMA bias fix and the text-encoder GPU-bridge wiring — see the round-12 commit message and the #5 sub-section in PROGRESS_SUPERTONIC.md round-12 entry).

The contiguous round-12 / round-13 numbering (instead of retroactively renaming round 12 to "round 5 (delayed)") is deliberate: the commit hashes referenced in PR descriptions and CI logs match the round labels in PROGRESS_SUPERTONIC.md without rebase churn.

Added an explicit "Note on the round 5 gap" section in PROGRESS_SUPERTONIC.md between round 4 and round 7 so the audit log makes this unambiguous.

Reply 2 — "The round-11 fix is redone in PR #21"
Resolved by today's rebase. PR #21 was the canonical fix for QVAC-18966 (cherry-picked from this branch's round 11 and retargeted at supertonic_optimizations without the Vulkan rounds). PR #21 covers the 2 GPU-bridge call sites that exist on the Vulkan-free branch (build_group_graph_cache + the front-block path in supertonic_vector_trace_proj_ggml).

This PR's round 11 originally covered 4 sites: the same 2 sites PR #21 covers + 2 more (ve_front_block_proj_cache's V transpose for round 8's front-block GPU bridge + build_res_style_qkv_cache's sq/sk/sv transposes for round 9's style GPU bridge). Those 2 extras only matter when the Vulkan-only round-8/9 GPU bridges are wired — which is why PR #21's narrower scope was correct for the non-Vulkan branch.

Merge strategy after rebase: PR #21 is already in supertonic_optimizations. I just rebased this branch onto the new base, and the round-11 commit (ef266e4) is now a delta of only the 2 Vulkan-only V-transpose sites + comment merges. No double-application: the QVAC-18966 fix is applied exactly once via PR #21 in the new base. Verified: CPU + Vulkan ctest -L unit 25/25 PASS post-rebase; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples).

Reply 3 — "UMA bias heuristic is fragile on some device topologies"
Agreed, and the failure mode is real. Mitigations actually in code today:

  • Empty UMA-flag list → falls back to round-3 argmax(free_vram) (unchanged behaviour for callers that haven't wired the UMA flags).
  • All-UMA list → also falls back to round-3 argmax over all devices (preserves backward-compat).
  • Explicit --vulkan-device N → UMA-agnostic passthrough; operator-pinned index always wins.
  • --vulkan-perf-logger → exposes the chosen device in the bench JSON for post-mortem.

The edge cases you flagged, broken down:

  • Single discrete reporting _IGPU due to driver quirk: discrete is flagged UMA → excluded from the discrete-subset argmax → any_discrete == false → falls through to round-3 all-device argmax → discrete still picked by free-VRAM (correct outcome by coincidence on a single-discrete rig).
  • Mixed true UMA iGPU + mis-classified discrete: round-12 bias prefers the iGPU over the mis-classified discrete (silent regression vs. round 3). Operator escape hatch is --vulkan-device N + the perf-logger device dump for diagnosis.

Added a long comment in resolve_vulkan_device_index (in the requested == -1 branch) documenting all three sub-cases plus a future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be heuristically re-classified). Tracked for a follow-up round in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Commit: 1632e45.
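The fallback ladder described in this reply can be sketched as a pure function. This is a simplified illustration, not the code in supertonic_gguf.cpp: the function name carries a `_sketch` suffix, the parameter shapes (a free-VRAM vector plus an optional per-device UMA flag vector) are assumptions, and a flag vector whose size does not match the device list is treated here as "flags not wired".

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// requested >= 0 : explicit index, passed through unchanged (UMA-agnostic).
// requested == -1: auto-pick by free VRAM, preferring non-UMA devices.
int resolve_vulkan_device_index_sketch(int                                requested,
                                       const std::vector<std::uint64_t> & free_vram,
                                       const std::vector<bool> &          is_uma_per_device = {}) {
    const int n = static_cast<int>(free_vram.size());
    if (n == 0) {
        throw std::runtime_error("no Vulkan devices available");
    }
    if (requested >= 0) {
        if (requested >= n) {
            throw std::out_of_range("--vulkan-device index out of range");
        }
        return requested;
    }
    if (requested != -1) {
        throw std::out_of_range("--vulkan-device expects -1 or a device index");
    }
    // The UMA bias only activates when flags are wired and at least one
    // device is reported as non-UMA; otherwise argmax runs over all devices.
    bool any_discrete = false;
    if (is_uma_per_device.size() == free_vram.size()) {
        for (bool uma : is_uma_per_device) {
            if (!uma) {
                any_discrete = true;
            }
        }
    }
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (any_discrete && is_uma_per_device[static_cast<std::size_t>(i)]) {
            continue;  // skip UMA devices while a discrete candidate exists
        }
        // Strict '>' keeps the lowest index on equal free VRAM.
        if (best < 0 || free_vram[static_cast<std::size_t>(i)] >
                            free_vram[static_cast<std::size_t>(best)]) {
            best = i;
        }
    }
    return best;
}
```

In this sketch a mis-classified single discrete adapter lands in the all-UMA branch and is still selected, which matches the first edge case above; the mixed case depends entirely on the flags the driver reports.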

Reply 4 — "Voice host cache reference stability — documented but not enforced"
Right call — the threading expectation needed to be spelled out. Today voice_host_cache is single-threaded by contract, not by lock. The Engine's documented threading model is single-threaded synthesis per Engine instance; concurrent synthesis requires one Engine per thread (each Engine carries its own voice_host_cache). This is the same model the iOS load/unload race fix 36a2c56 enforces for the s3gen preload path — they're consistent.

Why no internal lock today: the cache exists to eliminate per-call GPU downloads of ttl / dp (~2 sync points per synthesize() on Vulkan / OpenCL). Adding an internal mutex would give back a measurable chunk of that saving (an uncontended std::mutex lock+unlock pair is small but not free on the hot path of every synth). Since the existing iOS fix already mandates one-Engine-per-thread for concurrent synthesis, the cache inherits the same constraint at zero extra cost.

Standard unordered_map guarantee re: rehash: element references are NOT invalidated by insert (only iterators are). So even if a second voice loads mid-call on the same thread (impossible today, but allowed for completeness), a held entry & from a prior get_or_load survives. The only operations that can invalidate are clear() / erase() on that entry — and clear() is only reachable on Engine destruction.

Strengthened the docstring in supertonic_internal.h with an explicit THREAD-SAFETY section documenting all of the above, including what a future thread-pool refactor would need (external mutex around get_or_load + the downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Commit: 1632e45.
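The reference-stability argument can be demonstrated with a small stand-alone sketch. The class below is a hypothetical simplification (`voice_host_cache_sketch`, `voice_entry`, and the loader signature are assumptions, not the types in supertonic_internal.h); it relies only on the standard guarantee that std::unordered_map insertion may invalidate iterators on rehash but never references or pointers to existing elements. Like the cache described above, it takes no lock and assumes one Engine per thread.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Host-side copies of the per-voice tensors (hypothetical layout).
struct voice_entry {
    std::vector<float> ttl;
    std::vector<float> dp;
};

// Single-threaded by contract: no internal lock.
class voice_host_cache_sketch {
public:
    using loader_fn = std::function<voice_entry(const std::string &)>;

    // Returns a reference that stays valid across later insertions,
    // and is invalidated only by clear().
    const voice_entry & get_or_load(const std::string & voice, const loader_fn & load) {
        auto it = entries_.find(voice);
        if (it == entries_.end()) {
            ++misses_;
            it = entries_.emplace(voice, load(voice)).first;
        } else {
            ++hits_;
        }
        return it->second;
    }

    void clear() { entries_.clear(); }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::unordered_map<std::string, voice_entry> entries_;
    std::size_t hits_   = 0;
    std::size_t misses_ = 0;
};
```

Loading many additional voices forces the table to rehash, yet a reference obtained earlier still points at the same entry and the same underlying vector storage; only `clear()` ends its lifetime.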

@ogad-tether ogad-tether left a comment

Review — Vulkan backend for Supertonic (PR #18)

Thorough review of the 8753-line addition across 30 files. The overall engineering quality is high — TDD discipline is genuine, the commit-per-round structure makes the evolution auditable, and the backwards-compatibility contract is well-documented. The PR is in good shape for merge with a few items to consider.

Findings

1. test_resolver_returns_concrete_only asserts too weakly (test_supertonic_kv_attn_type.cpp)

The exhaustive 5×2×8 resolver sweep only checks dt != kv_attn_dtype::autoselect. A typo in the resolver (e.g., returning f16 when bf16 was requested + supported) would pass this test silently. Consider spot-checking the "happy path" cases with exact enum comparisons — e.g., requested=2, supports_bf16=true → bf16.

2. test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors (test_supertonic_input_scratchpad.cpp)

The test allocates x_in (512B) and temb_in (256B) but only does a tensor_set/tensor_get round-trip on x_in. If the buffer allocation failed to bind temb_in, this test wouldn't catch it.

3. Probe-gated silent fallback vs explicit operator request (resolve_kv_attn_type, supertonic_gguf.cpp:1473-1478)

When an operator explicitly requests --kv-attn-type bf16 but the backend doesn't support it, the resolver silently falls back to F32. This is documented as intentional (advisory-probe contract), but a fprintf(stderr, "warning: ...") on the explicit-request + unsupported path would save operators from silently getting F32 when they thought they had BF16. The auto path (-1) correctly stays silent. The bench JSON does surface the resolved type, so it's partially observable already.

4. Minor: resolve_vulkan_device_index UMA-bias tiebreak within discrete subset (test_supertonic_vulkan_device_select.cpp)

The test for test_hybrid_prefer_discrete_over_uma uses devices with distinct VRAM sizes (32GB vs 120GB). The tiebreak case of two discrete cards with equal VRAM (should pick lower index) is not tested. Covered by the non-UMA auto-pick tests, but worth adding one UMA-specific tiebreak case for completeness.

5. cached_backend_capabilities returns const& through a lock boundary (supertonic_gguf.cpp:779)

The returned reference outlives the lock_guard. This is safe in production because unordered_map references aren't invalidated by insert, and clear() is test-only. But supertonic_clear_capability_cache() could create a dangling reference in multi-threaded test scenarios. If test code ever calls clear() while another thread holds a reference from cached_backend_capabilities, that's UaF. Low risk given single-threaded test execution today, but worth a comment.

Positive observations

  • The TDD caught real bugs (V layout transpose, env-var empty-string sentinel, pointer-compare upload-skip). The commit messages document the red→green cycle with specific failure modes — this is exactly how TDD should be practiced on low-level GPU code.
  • The pure-logic resolver split (resolve_vulkan_device_index, resolve_kv_attn_type) makes the policy layer fully testable on CPU without a Vulkan adapter. Smart design.
  • Backwards-compatibility is meticulously maintained — every existing flag/default preserves its semantics.
  • The 25/25 CPU-only ctest suite catches regressions in the dispatch/capability/resolver contracts without needing GPU hardware in CI.
  • Performance results are impressive (2.4–2.7× end-to-end speedup, 15× prewarm improvement on RTX 5090).

None of the findings are merge-blockers. Items 1–2 are low-effort test improvements; items 3–5 are suggestions for consideration.

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
… tests + surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to
nullptr; two test files extended; two doc-comment blocks added).
No production-logic surface modified for existing callers.

Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks
  (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV
  (rms=285.6, abs_max=4703 on both backends, same seed +
  text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR tetherto#18:

1. test_resolver_returns_concrete_only (kv_attn_type).  The
   original exhaustive 5 x 2 x 8 sweep only asserted dt !=
   autoselect, so a typo returning f16 when bf16 was
   requested+supported would pass silently.  Rewritten with a
   second pure-function `expected()` mirror of the resolver's
   matrix; every one of the 80 grid points now CHECKs the
   resolver's return value against the expected concrete
   dtype.  Added cross-contamination spot checks (requesting
   bf16 with f16+q8_0 supported but bf16 NOT supported must
   fall to f32, not silently to f16 or q8_0).  Now 205 checks
   passed in test-supertonic-kv-attn-type.

2. test_cpu_fallback_returns_valid_buffer (input_scratchpad).
   Original only round-tripped x_in (one of two allocated
   tensors).  Now round-trips BOTH x_in and temb_in with
   distinct payload patterns (1.0f vs 2.5f), plus a
   cross-aliasing recheck (after writing temb_in, x_in must
   still read back its original 1.0f) — a binding-collision
   bug where both tensors share memory would now fail this
   check.

3. resolve_kv_attn_type silent fallback on explicit operator
   request.  Added optional `bool * out_was_downgraded` output
   parameter to the resolver — set to true IFF the operator
   explicitly requested f16/bf16/q8_0 AND the corresponding
   backend probe returned false AND we therefore returned f32.
   The auto path (-1) leaves the flag false (no operator
   surprise — auto-policy is doing its job).  Engine ctor +
   supertonic-bench wired to emit a one-line
   `fprintf(stderr, "warning: requested --kv-attn-type %s but
   the resolved backend's flash-attn probe rejected it;
   falling back to f32 (set --kv-attn-type auto to silence)")`
   on a downgrade.  Defaulted nullptr keeps the pure-logic
   unit tests stderr-clean.  New test_downgrade_flag_signal
   pins the contract on every relevant path (auto + missing
   probe -> flag false; explicit + matching probe -> flag
   false; explicit + missing probe -> flag true; nullptr out-
   ptr safe).

4. test_uma_aware_tiebreak_equal_vram_discretes
   (vulkan_device_select).  Added a dedicated UMA-bias-active
   test case: two discrete cards with EQUAL VRAM (32 GB each)
   alongside a UMA iGPU.  Pins three sub-cases: interleaved
   UMA in the middle, adjacent discretes with no UMA, three-
   way all-discrete tie.  Lower index wins in every case.
   The existing test 11's second CHECK already covered the
   interleaved-UMA case; this hoists the contract into its
   own named test so a future refactor reading the test
   names knows the tiebreak case is pinned.

5. cached_backend_capabilities UaF risk under test-only
   clear().  Added a long comment on the function documenting
   the four invariants:
   (a) production callers may hold the returned ref across
       subsequent calls for OTHER backends (unordered_map's
       insert-doesn't-invalidate-references guarantee);
   (b) production callers MUST NOT keep the ref alive across
       a clear() call (test code's responsibility);
   (c) multi-threaded callers must externally synchronise
       deref vs. clear (the cache's lock protects map
       structure, NOT element lifetime);
   (d) if a future refactor adds a production-reachable
       erase / clear path, this function must switch to
       return-by-value or std::shared_ptr<const T>.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000

Author

Review — Vulkan backend for Supertonic (PR #18)

Thorough review of the 8753-line addition across 30 files. The overall engineering quality is high — TDD discipline is genuine, the commit-per-round structure makes the evolution auditable, and the backwards-compatibility contract is well-documented. The PR is in good shape for merge with a few items to consider.

Findings

1. test_resolver_returns_concrete_only asserts too weakly (test_supertonic_kv_attn_type.cpp)

The exhaustive 5×2×8 resolver sweep only checks dt != kv_attn_dtype::autoselect. A typo in the resolver (e.g., returning f16 when bf16 was requested + supported) would pass this test silently. Consider spot-checking the "happy path" cases with exact enum comparisons — e.g., requested=2, supports_bf16=true → bf16.

2. test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors (test_supertonic_input_scratchpad.cpp)

The test allocates x_in (512B) and temb_in (256B) but only does a tensor_set/tensor_get round-trip on x_in. If the buffer allocation failed to bind temb_in, this test wouldn't catch it.

3. Probe-gated silent fallback vs explicit operator request (resolve_kv_attn_type, supertonic_gguf.cpp:1473-1478)

When an operator explicitly requests --kv-attn-type bf16 but the backend doesn't support it, the resolver silently falls back to F32. This is documented as intentional (advisory-probe contract), but an fprintf(stderr, "warning: ...") on the explicit-request + unsupported path would save operators from silently getting F32 when they thought they had BF16. The auto path (-1) correctly stays silent. The bench JSON does surface the resolved type, so it's partially observable already.

4. Minor: resolve_vulkan_device_index UMA-bias tiebreak within discrete subset (test_supertonic_vulkan_device_select.cpp)

test_hybrid_prefer_discrete_over_uma uses devices with distinct VRAM sizes (32GB vs 120GB). The tiebreak case of two discrete cards with equal VRAM (which should pick the lower index) is not tested. It is covered by the non-UMA auto-pick tests, but one UMA-specific tiebreak case is worth adding for completeness.

5. cached_backend_capabilities returns const& through a lock boundary (supertonic_gguf.cpp:779)

The returned reference outlives the lock_guard. This is safe in production because unordered_map references aren't invalidated by insert, and clear() is test-only. But supertonic_clear_capability_cache() could create a dangling reference in multi-threaded test scenarios. If test code ever calls clear() while another thread holds a reference from cached_backend_capabilities, that's UaF. Low risk given single-threaded test execution today, but worth a comment.

Positive observations

* The TDD caught real bugs (V layout transpose, env-var empty-string sentinel, pointer-compare upload-skip). The commit messages document the red→green cycle with specific failure modes — this is exactly how TDD should be practiced on low-level GPU code.

* The pure-logic resolver split (`resolve_vulkan_device_index`, `resolve_kv_attn_type`) makes the policy layer fully testable on CPU without a Vulkan adapter. Smart design.

* Backwards-compatibility is meticulously maintained — every existing flag/default preserves its semantics.

* The 25/25 CPU-only `ctest` suite catches regressions in the dispatch/capability/resolver contracts without needing GPU hardware in CI.

* Performance results are impressive (2.4–2.7× end-to-end speedup, 15× prewarm improvement on RTX 5090).

None of the findings are merge-blockers. Items 1–2 are low-effort test improvements; items 3–5 are suggestions for consideration.

Reply 1 — test_resolver_returns_concrete_only asserts too weakly
Fixed in 903c312. The original sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested + supported would have slipped through. Now the test computes the expected concrete dtype as a separately-implemented pure function of the inputs (a hand-rolled mirror of the resolver's behaviour matrix — typo on one side won't cancel a typo on the other) and CHECKs each of the 80 grid points against the expected dtype. Added explicit happy-path spot checks for your example (requested=2, supports_bf16=true → bf16, requested=3, supports_q8_0=true → q8_0) plus cross-contamination guards: requesting bf16 with f16 and q8_0 supported but bf16 NOT supported MUST fall to f32, not silently to one of the other supported dtypes. Total test-supertonic-kv-attn-type count went from ~80 checks to 205 / 205.
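The mirror-function idea can be sketched as below. This is a minimal illustration, not the code in test_supertonic_kv_attn_type.cpp: the `kv_probes` struct, the function bodies, and the rule that `auto` follows the legacy flag only when the F16 probe passes are all assumptions made for the example. What it shows is the shape of the check: two independently written implementations compared at every one of the 80 grid points.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins: the enum and function names follow the PR text,
// but the bodies below are assumed, not copied from supertonic_gguf.cpp.
enum class kv_attn_dtype { autoselect, f32, f16, bf16, q8_0 };

struct kv_probes {
    bool f16;
    bool bf16;
    bool q8_0;
};

// Resolver under test. Assumed mapping: -1 = auto, 0 = f32, 1 = f16,
// 2 = bf16, 3 = q8_0; auto follows the legacy --f16-attn flag, probe-gated.
kv_attn_dtype resolve_kv_attn_type(int requested, bool legacy_f16_attn, const kv_probes & p) {
    switch (requested) {
        case -1: return (legacy_f16_attn && p.f16) ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        case 0:  return kv_attn_dtype::f32;
        case 1:  return p.f16  ? kv_attn_dtype::f16  : kv_attn_dtype::f32;
        case 2:  return p.bf16 ? kv_attn_dtype::bf16 : kv_attn_dtype::f32;
        case 3:  return p.q8_0 ? kv_attn_dtype::q8_0 : kv_attn_dtype::f32;
        default: throw std::invalid_argument("kv_attn_type out of range");
    }
}

// Independently written mirror: a table lookup rather than a switch, so a
// typo in one implementation is unlikely to be repeated in the other.
kv_attn_dtype expected(int requested, bool legacy_f16_attn, const kv_probes & p) {
    const kv_attn_dtype wanted[4] = { kv_attn_dtype::f32, kv_attn_dtype::f16,
                                      kv_attn_dtype::bf16, kv_attn_dtype::q8_0 };
    const bool supported[4] = { true, p.f16, p.bf16, p.q8_0 };
    const int idx = (requested == -1) ? (legacy_f16_attn ? 1 : 0) : requested;
    return supported[idx] ? wanted[idx] : kv_attn_dtype::f32;
}

// Walks the 5 x 2 x 8 grid and returns how many points were compared.
int sweep_resolver_grid() {
    int checked = 0;
    for (int requested = -1; requested <= 3; ++requested) {
        for (int legacy = 0; legacy <= 1; ++legacy) {
            for (int bits = 0; bits < 8; ++bits) {
                const kv_probes p = { (bits & 1) != 0, (bits & 2) != 0, (bits & 4) != 0 };
                const kv_attn_dtype got = resolve_kv_attn_type(requested, legacy != 0, p);
                assert(got != kv_attn_dtype::autoselect);            // the original weak check
                assert(got == expected(requested, legacy != 0, p));  // the exact-dtype check
                ++checked;
            }
        }
    }
    return checked;
}
```

With only the first assertion, swapping `bf16` for `f16` in case 2 would still pass; the second assertion is what catches it.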

Reply 2 — test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors
Fixed in 903c312. The test now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f) so a binding failure on the second tensor fails the test, plus a cross-aliasing recheck: after writing 2.5f to temb_in, x_in must still read back 1.0f — a buffer-overlap bug where the helper bound both tensors to the same memory range would now fail this check too. test-supertonic-input-scratchpad is now 11 / 11 checks (was 9).
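A backend-free sketch of why the distinct payloads and the recheck matter follows. The `scratchpad` struct and its helpers are hypothetical stand-ins for the ggml buffer and tensor_set/tensor_get calls; only the sizes (512 B and 256 B) and the 1.0f / 2.5f payloads come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a backend buffer with two tensors bound into it.
struct scratchpad {
    std::vector<float> storage;
    std::size_t x_off, x_len;        // x_in: 512 B = 128 floats
    std::size_t temb_off, temb_len;  // temb_in: 256 B = 64 floats
};

// collide = true simulates a binding bug where both tensors share offset 0.
scratchpad make_scratchpad(bool collide) {
    scratchpad s;
    s.storage.assign(192, 0.0f);
    s.x_off = 0;
    s.x_len = 128;
    s.temb_off = collide ? 0 : 128;
    s.temb_len = 64;
    return s;
}

void tensor_fill(scratchpad & s, std::size_t off, std::size_t len, float value) {
    for (std::size_t i = 0; i < len; ++i) s.storage[off + i] = value;
}

bool tensor_reads_back(const scratchpad & s, std::size_t off, std::size_t len, float value) {
    for (std::size_t i = 0; i < len; ++i) {
        if (s.storage[off + i] != value) return false;
    }
    return true;
}

// Round-trips both tensors, then rechecks x_in after temb_in was written.
bool round_trip_both(scratchpad & s) {
    tensor_fill(s, s.x_off, s.x_len, 1.0f);
    if (!tensor_reads_back(s, s.x_off, s.x_len, 1.0f)) return false;
    tensor_fill(s, s.temb_off, s.temb_len, 2.5f);
    if (!tensor_reads_back(s, s.temb_off, s.temb_len, 2.5f)) return false;
    // Cross-aliasing recheck: fails if temb_in overwrote any part of x_in.
    return tensor_reads_back(s, s.x_off, s.x_len, 1.0f);
}
```

In the colliding layout each individual round-trip succeeds, so only the final recheck reports the overlap.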

Reply 3 — Probe-gated silent fallback vs explicit operator request
Agreed, and fixed in 903c312. Added an optional bool * out_was_downgraded output parameter to resolve_kv_attn_type (defaulting to nullptr so the pure-logic unit tests stay stderr-clean). The resolver sets the flag iff the operator explicitly requested f16 / bf16 / q8_0 AND the corresponding backend probe returned false AND the resolver therefore returned f32. The auto path (-1) leaves the flag false — the operator didn't ask for a specific dtype, so there's nothing to surprise them with.

On a downgrade, the Engine ctor and supertonic-bench emit:

supertonic: warning: requested --kv-attn-type bf16 but the resolved backend's
flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)

The auto path correctly stays silent, verified on both CPU and Vulkan (no warning when --kv-attn-type auto runs on either backend). New test_downgrade_flag_signal pins the contract: 8 scenarios covering every relevant path including the nullptr default-argument safety check.

One observation worth noting: on this dev rig the CPU backend's ggml_backend_supports_op(FLASH_ATTN_EXT(F32, BF16, BF16)) actually returns true (the CPU flash_attn_ext is generic), so the warning doesn't fire on CPU + --kv-attn-type bf16. That's correct probe behaviour, not a wiring bug. The warning will fire on adapters that genuinely reject the op (e.g., Vulkan without cooperative_matrix2 for BF16, or future backends that selectively reject Q8_0 K/V).
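The out-parameter contract from this reply might look roughly like the following. It is a sketch under assumptions: the `kv_probes` struct, the integer mapping 0..3, the auto rule, and the `resolve_with_warning` caller are illustrative, and only the parameter name `out_was_downgraded` and the warning text are taken from the discussion above.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>

enum class kv_attn_dtype { autoselect, f32, f16, bf16, q8_0 };

// Hypothetical probe bundle; the real code reads a cached capability struct.
struct kv_probes { bool f16; bool bf16; bool q8_0; };

// Sets *out_was_downgraded to true only for an explicit f16/bf16/q8_0
// request whose probe failed. A nullptr out-pointer is always safe.
kv_attn_dtype resolve_kv_attn_type(int requested, bool legacy_f16_attn, const kv_probes & p,
                                   bool * out_was_downgraded = nullptr) {
    if (out_was_downgraded) *out_was_downgraded = false;
    if (requested < -1 || requested > 3) throw std::invalid_argument("kv_attn_type out of range");
    if (requested == -1) {
        // Auto policy (assumed rule): never reported as a downgrade.
        return (legacy_f16_attn && p.f16) ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
    }
    if (requested == 0) return kv_attn_dtype::f32;
    const kv_attn_dtype wanted = requested == 1 ? kv_attn_dtype::f16
                               : requested == 2 ? kv_attn_dtype::bf16
                                                : kv_attn_dtype::q8_0;
    const bool supported = requested == 1 ? p.f16 : requested == 2 ? p.bf16 : p.q8_0;
    if (supported) return wanted;
    if (out_was_downgraded) *out_was_downgraded = true;
    return kv_attn_dtype::f32;
}

// Hypothetical caller-side wiring: the resolver stays silent, the caller warns.
kv_attn_dtype resolve_with_warning(int requested, const char * requested_name,
                                   bool legacy_f16_attn, const kv_probes & p) {
    bool downgraded = false;
    const kv_attn_dtype dt = resolve_kv_attn_type(requested, legacy_f16_attn, p, &downgraded);
    if (downgraded) {
        std::fprintf(stderr,
                     "supertonic: warning: requested --kv-attn-type %s but the resolved backend's "
                     "flash-attn probe rejected it; falling back to f32 "
                     "(set --kv-attn-type auto to silence)\n",
                     requested_name);
    }
    return dt;
}
```

Keeping the fprintf in the caller rather than the resolver is what lets the pure-logic tests run without writing to stderr.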

Reply 4 — resolve_vulkan_device_index UMA-bias tiebreak
Fixed in 903c312. Added a dedicated test_uma_aware_tiebreak_equal_vram_discretes that pins three sub-cases of the equal-VRAM-discretes tiebreak with the UMA bias active:

* Interleaved UMA: [32GB discrete, 32GB discrete, 120GB UMA] → picks index 0 (lower discrete).
* Adjacent discretes (no UMA in the middle): [32GB discrete, 32GB discrete] → picks index 0.
* Three-way all-discrete tie: [32GB, 32GB, 32GB] → picks index 0.

Test 11's second CHECK already covered the interleaved case implicitly, but hoisting it into its own named test makes the tiebreak contract greppable, so a future refactor reading the test names can see the case is pinned. test-supertonic-vulkan-device-select is now 40 / 40 checks (was 37).

Reply 5 — cached_backend_capabilities returns const & through a lock boundary
Fixed in 903c312. Added a long comment on the function documenting the four invariants the contract relies on:

1. Production callers may hold the returned ref across subsequent cached_backend_capabilities calls for OTHER backends: std::unordered_map's reference-stability guarantee survives insert/emplace rehash; only iterators are invalidated.
2. Production callers MUST NOT keep the ref alive across a supertonic_clear_capability_cache call. That helper is test-only and exported with no header declaration; the contract is "callers don't reach this; tests do, single-threaded".
3. Multi-threaded callers must externally synchronise deref vs. clear (the lock here protects the map's structural invariants during insert/find, NOT the lifetime of returned elements).
4. If a future refactor adds a production-reachable erase/clear path, this function should switch to return-by-value or std::shared_ptr ownership; otherwise the UaF you flagged becomes reachable from production.

Spelled out explicitly above the function body so the next maintainer doesn't have to derive the constraint from scattered context.
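The first invariant can be demonstrated with a small stand-in cache. The class below is hypothetical (the real function is a free function over a process-wide map, and its record has six flags); it only illustrates that a reference returned through the lock stays valid while other keys are inserted, and that a repeat lookup does not re-run the probe.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical capability record; the real struct carries six probe flags.
struct backend_caps {
    bool native_leaky_relu;
    bool f16_kv_flash_attn;
};

class capability_cache {
public:
    // Returns a reference that outlives the lock_guard below.
    const backend_caps & get(const void * backend) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(backend);
        if (it == map_.end()) {
            ++probe_calls_;  // stands in for the supports_op probes
            it = map_.emplace(backend, backend_caps{ true, true }).first;
        }
        return it->second;
    }

    // Test-only: invalidates every reference previously returned by get().
    void clear_for_tests() {
        std::lock_guard<std::mutex> lock(mu_);
        map_.clear();
    }

    int probe_calls() const { return probe_calls_; }

private:
    std::mutex mu_;
    std::unordered_map<const void *, backend_caps> map_;
    int probe_calls_ = 0;
};

// Holds one reference while 1000 other keys are inserted (forcing rehashes),
// then checks the element did not move and was not probed a second time.
bool reference_survives_inserts() {
    capability_cache cache;
    static int backends[1001];
    const backend_caps * first = &cache.get(&backends[0]);
    for (int i = 1; i <= 1000; ++i) cache.get(&backends[i]);
    const bool same_address = (first == &cache.get(&backends[0]));
    return same_address && cache.probe_calls() == 1001;
}
```

Calling `clear_for_tests()` while `first` is still in use would be exactly the dangling-reference case that invariants 2 and 3 rule out.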

GustavoA1604 and others added 13 commits May 19, 2026 10:41
Two interleaved chatterbox concurrency fixes for iOS, collapsed into
one history-preserving commit on top of the upstream merge so the
qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604
SHA (no chained port-version bumps).

1) gguf_init_from_file race (the SIGABRT seen before this commit):
   bake_voice_conditioning() must run BEFORE we spawn the s3gen preload
   thread.  Both paths funnel into gguf_init_from_file() (voice_encoder
   opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the
   ggml_init / gguf_init_from_file pair underneath is not safe to
   invoke concurrently from two threads against ggml's process-global
   state.  Empirically races on Apple Silicon with a fast SIGABRT
   inside ggml_abort coming from the preload thread's ggml_init while
   the main thread is still executing voice_encoder_load.

2) Metal shared-buffer-type init race (the SIGSEGV in
   ggml_metal_buffer_is_shared, surfaced once item 1 was fixed): after the
   preload thread spawns we now block on wait_for_preload() before
   the constructor returns, so the SDK e2e bootstrap's
   load -> immediate unload pattern ("preLoadUnload") can no longer
   tear down the engine while s3gen_preload is still inside
   ggml_backend_metal_buffer_type_shared_alloc_buffer ->
   ggml_metal_buffer_is_shared.  Defeats the parallel-preload
   optimisation (s3gen_preload no longer overlaps with first T3
   inference inside synthesize()); revisit once ggml-metal's shared
   buffer-type init is safe to use from a preload thread concurrent
   with construction-time teardown.

Together these two changes unblock chatterbox load on iPhone 16e
(iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal
enabled — qvac/pull/1992.

Co-authored-by: Cursor <cursoragent@cursor.com>
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups
(PR tetherto#16): the audit-driven optimisations there are backend-portable by
construction (every host-sync / bandwidth / fusion win uses the same
GPU dispatch path Vulkan walks), so this PR only adds the
Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both
  resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine
    backend_name() can annotate the device with
    ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against
    a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml
    to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched
    OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for
    plain upstream OpenCL.  Dynamic probe self-adapts to whichever
    ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml
    ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic
  Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the
  use_f16_attn auto-policy so a backend that ships flash_attn_ext but
  rejects the F16-K/V variant for Supertonic shapes keeps the F32 path
  instead of crashing at first synth call.  Manual --f16-attn 1 still
  forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded
  ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed
  through EngineOptions::vulkan_device, range-checked against
  ggml_backend_vk_get_device_count() at load (out-of-range index is a
  hard error that surfaces operator typos / wrong-machine config loudly
  rather than silently falling back to CPU).  Verbose mode + bench
  output append the Vulkan device description so multi-GPU / multi-ICD
  machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu
  slot so the scope correctly mirrors the new model field through
  thread-local dispatch.
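
As a scalar illustration of the LEAKY_RELU dispatch above: one way to express leaky ReLU with only RELU, SCALE and ADD is the identity leaky(x) = (1 - slope) * relu(x) + slope * x. The sketch below checks that identity on plain floats; it is an assumption that leaky_relu_portable_ggml uses this particular decomposition, and the function names here are illustrative rather than the tensor-level helpers in the tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Reference definition, standing in for the fused builtin.
float leaky_relu_native(float x, float slope) {
    return x > 0.0f ? x : slope * x;
}

// Decomposition built from RELU, SCALE and ADD only.
float leaky_relu_decomposed(float x, float slope) {
    const float r = std::max(x, 0.0f);      // RELU
    return (1.0f - slope) * r + slope * x;  // SCALE, SCALE, ADD
}

// Hypothetical scalar analogue of the probe-driven dispatch flag.
float leaky_relu_portable(float x, float slope, bool use_native_leaky_relu) {
    return use_native_leaky_relu ? leaky_relu_native(x, slope)
                                 : leaky_relu_decomposed(x, slope);
}
```

For x > 0 the two scaled terms sum back to x, and for x <= 0 the RELU term vanishes, so both paths agree up to float rounding.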

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness
  covering the new flags through supertonic_op_dispatch_scope plus a
  smoke test for the F16-K/V flash-attn probe.  29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests
  use_native_leaky_relu = false explicitly so the GPU-decomposition
  correctness gate stays green now that the helper short-circuits on
  backends with native LEAKY_RELU.  10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN
  defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0
  (the historical hard-coded value), load_supertonic_gguf gains a new
  optional last argument with the same default; existing callers are
  source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"):
persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all
Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device
load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but
`<atomic>` was never included; the file relied on a transitive
include chain that broke once any consumer rearranged includes.
Surfaces as `error: variable 'std::atomic<int> ... has initializer
but incomplete type'` on a clean build.

Pre-existing bug, unrelated to QVAC-18605 itself but blocked
local CTest runs against the Vulkan-optimisation work.  Trivial
additive include with no behaviour change.

Co-authored-by: Cursor <cursoragent@cursor.com>
…s + prewarm

Layered on top of the QVAC-18605 Vulkan bring-up commit; the
round-2 changes generalise the bring-up's "load-time backend
probe" pattern into a process-wide capability cache and add
three more probes / dispatch hooks that fit the same shape.
Net effect on Vulkan: redundant supports_op traffic eliminated,
defensive auto-policy gating extended to F16 weights, forward-
compat Q8_0 K/V probe primed for a follow-up dispatch flip,
and an opt-in --prewarm hook that lets operators amortise the
~hundreds-of-ms cold-start shader-compile cost outside the
operator-visible first synth call.

1) Process-wide capability-probe cache keyed by ggml_backend_t

   The bring-up's three load sites (load_supertonic_gguf,
   Engine::Engine, supertonic_bench's main) each ran the
   LEAKY_RELU + F16-K/V flash-attn supports_op queries
   independently — 2-3x redundant probe traffic per backend.
   On Vulkan, supports_op may inspect the device's pipeline
   state (~50-200 us per query on Adreno / llvmpipe / RADV in
   microbenchmarks); the cache short-circuits 100 % of the
   duplicates.  Test seam (supertonic_clear_capability_cache +
   supertonic_capability_probe_call_count) lets the unit test
   verify the cache is hit on the second call by comparing the
   counter before / after.  Per-backend independence verified
   against two distinct CPU backend handles.

2) F16 mul_mat backend-capability probe

   Symmetric to the F16-K/V flash-attn probe.  The bring-up
   auto-enabled use_f16_weights on `!backend_is_cpu` blindly;
   a partial-port backend that ships F16 storage but rejects
   the hot vector-estimator W_query mul_mat shape would crash
   at first synth call.  Probe builds the live shape ([256,256]
   F16 weight x [256,16] F32 activation) and asks the backend;
   auto-policy refuses materialisation on a `false` answer
   (slower F32 path stays correct).  Manual --f16-weights 1
   still forces materialisation (debug-shim escape hatch).
   Probe cached; test verifies CPU returns true.

3) Q8_0 K/V flash-attn forward-compat probe

   Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0
   (and Q4_0) K/V types in scalar + coopmat2 paths.  Switching
   K/V from F16 to Q8_0 would halve the per-step upload
   bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape;
   ~1 MB / synth on the default 5-step x 4-site schedule) in
   exchange for a small (~0.5 %) drift on the attention output.
   This commit adds the probe + caches the result; live
   dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift
   measurement against the parity harness on a real Vulkan
   adapter.  Bench output annotates `(q8_0_kv_attn=available)`
   when the probe says yes so operators can confirm their
   hardware is ready for the follow-up.

4) Engine::warm_up(text) + EngineOptions::prewarm_text +
   --prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench)

   First-synth-latency reduction on Vulkan / OpenCL.  In-tree
   thread_local graph caches handle every subsequent call but
   can't avoid the first pipeline-compile cost (~hundreds of
   ms on Adreno / RADV per chatterbox PROGRESS.md).  warm_up
   runs one throwaway synth at construction time on a caller-
   supplied sample text so the operator-visible first synth
   sees steady-state latency.  Auto-no-op on CPU (no shader-
   compile cost).  Bench's --prewarm runs the cold-start synth
   BEFORE the timed loop (independent of --warmup N which only
   discards N timed runs from the median); cold-start latency
   logged as `[prewarm] cold-start synth on '...' took N.Nms`
   and emitted to --json-out as "prewarm_ms".

5) Bench output extended

   Backend log line surfaces every dispatch flag plus the
   cold-start prewarm latency:
     Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on)
       (native_leaky_relu=on) (q8_0_kv_attn=available)
   --json-out gains "f16_attn", "f16_weights",
   "native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms"
   keys for downstream analysis tooling.

Tests
- test-supertonic-capability-cache (NEW, LABEL "unit"): probe
  cache short-circuit + clear seam + per-backend independence
  + idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke.
  18 / 18 checks pass.
- test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface
  contract for EngineOptions::prewarm_text + Engine::warm_up
  via SFINAE.  9 / 9 checks pass.
- All existing CPU-only unit tests (test-supertonic-vulkan-
  dispatch, -portable-ops, -backend-dispatch, -rope-in-graph,
  -rope-packed-qk, -in-graph-transpose, -convnext-block-fused,
  -graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus
  resample / cpu-caches / t3-caches): all 13 pass unchanged.
- ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ /
  184+ individual checks).

Build
- All changed source files compile clean with both
  -DGGML_USE_VULKAN defined and undefined.
- No public-API break: EngineOptions::prewarm_text is a new
  optional field defaulting to empty (no-op), Engine::warm_up
  is a new method (existing callers don't have to invoke it).

Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"):
persistent VkPipelineCache (cross-process), BF16 K/V flash-attn,
Q8_0 K/V live dispatch wiring, multi-device load-balancing.

Co-authored-by: Cursor <cursoragent@cursor.com>
…vice auto-pick + 2 forward-compat probes

Three more Vulkan-specific deltas, all developed test-first.  New
tests were committed first, observed to fail on the missing
symbol, and only then was the implementation written and the
tests re-run to verify green.

1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities
   flag).  Symmetric to the round-2 Q8_0 K/V probe.  Vulkan's
   FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2-
   only path; BF16 has the same 2-byte per-element footprint as
   F16 (so identical upload bandwidth) but the wider 8-bit
   exponent range avoids the F16 underflow on small attention
   scores.  Forward-compat — the live --kv-attn-type bf16 dispatch
   wiring is deferred to a follow-up that measures drift against
   the parity harness on a real Vulkan adapter.

2. Multi-device auto-pick for --vulkan-device -1.  Wires the
   previously-reserved auto-pick API: walks every visible adapter,
   queries ggml_backend_vk_get_device_memory() to read free VRAM,
   and dispatches into a pure-logic helper
   resolve_vulkan_device_index(requested, free_vram_per_device)
   that picks argmax(free_vram); ties → lower index for stable
   per-run assignment on identical-spec multi-GPU machines.  The
   pure-logic helper is testable on CPU with synthetic inputs (8
   test functions, 23 checks).  Reserved-future negative values
   (-2, -100, ...) now throw instead of silently falling through
   to device 0.  Verbose mode logs the per-device VRAM table so
   operators can confirm the auto-pick chose the expected adapter.

3. Pinned-host-buffer-type capability probe (6th cache flag) +
   bench surface.  Probes whether ggml_backend_vk_host_buffer_type()
   is callable on the resolved backend (Vulkan + non-null buffer-
   type).  Forward-compat — primes the capability cache for a
   follow-up per-engine input-scratchpad refactor that skips
   ggml-vulkan's internal staging-buffer hop on per-step uploads.
   Bench output now shows bf16_kv_attn_available +
   pinned_host_buffer_available in both the human-readable backend
   tag and the JSON output so operators can pre-flight whether a
   future opt-in will be effective on their machine.
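
The auto-pick rule in item 2 can be sketched as a pure function. The signature follows the commit text; the exception types and the choice to throw on an empty device list are assumptions for this illustration, since the message does not say how those cases are reported.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// requested >= 0 : explicit index, range-checked.
// requested == -1: auto-pick the device with the most free VRAM.
// requested < -1 : reserved, rejected.
int resolve_vulkan_device_index(int requested,
                                const std::vector<std::size_t> & free_vram_per_device) {
    const int n = static_cast<int>(free_vram_per_device.size());
    if (requested < -1) {
        throw std::invalid_argument("reserved negative --vulkan-device value");
    }
    if (n == 0) {
        throw std::runtime_error("no Vulkan devices visible");  // assumed behaviour
    }
    if (requested >= 0) {
        if (requested >= n) throw std::out_of_range("--vulkan-device index out of range");
        return requested;
    }
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict '>' keeps the lower index when free VRAM ties.
        if (free_vram_per_device[i] > free_vram_per_device[best]) best = i;
    }
    return best;
}
```

Because the helper takes the VRAM table as an argument instead of querying the driver, it can be exercised on a machine with no Vulkan adapter at all.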

Test plan (TDD round 3):
- test-supertonic-capability-cache: 27 / 27 checks pass (was 18,
  +9 checks for round-3: BF16 K/V smoke + cache-slot share,
  pinned-host-buffer smoke + cache-slot share, null-backend
  defensive checks for both new probes).
- test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass
  (8 test functions: empty-list, single-device, argmax-VRAM, tie-
  break, explicit-index passthrough, out-of-range, reserved-
  negative, zero-VRAM handling).
- Whole CPU-only ctest -L unit reports 16 / 16 tests passing,
  zero regressions on round-1 / round-2 / audit-follow-up tests.

CLI surface:
- supertonic CLI + chatterbox CLI usage strings updated to
  document --vulkan-device -1 = auto-pick adapter with most free
  VRAM.
- supertonic-bench usage string updated likewise.

Co-authored-by: Cursor <cursoragent@cursor.com>
…hts operator deny-list

Round 6 layers a user-overridable extra deny-list on top of the
existing hand-curated should_materialise_f16_weight() allow-list.
The curated allow-list (Phase 2A) already excludes biases, norms,
embeddings, depthwise convs, and pre-transposed companions; the
round-6 deny-list lets operators force-keep specific additional
tensors as F32 even when --f16-weights is on.  Use cases:

- A/B testing: researcher excludes a specific tensor pattern
  temporarily without recompiling.
- Hardware-specific drift mitigation: operator pins a problematic
  tensor to F32 via config rather than disabling F16 weights
  wholesale.
- Future-GGUF safety net: new tensor patterns added in future
  GGUFs that the curated allow-list inadvertently scoops in can
  be excluded via config without a code change.

Smallest blast radius of the four follow-up rounds — load-time
policy only, runtime dispatch unaffected, zero behaviour change
on the empty-deny-list default path.

Strict TDD discipline (per the user's "double check, don't break
anything" constraint):
- Both new tests committed FIRST.
- Both confirmed to fail to compile on the missing symbols
  (predicate test: 'too many arguments to should_materialise_f16_weight';
  API test: 'EngineOptions has no member f16_weights_deny_list').
- Implementation written.
- Both tests + every existing unit test re-run; all green.

What changed:

1. 2-arg overload should_materialise_f16_weight(name,
   extra_deny_substrings) added alongside the existing 1-arg
   version (existing test + call sites unchanged).  Substring
   matching matches the curated predicate's audit-friendly style;
   no regex compile cost or invalid-pattern surface.  The deny-
   list can only flip true → false, never false → true.  Empty
   strings inside the deny-list are SKIPPED defensively, not
   treated as universal matches (config-typo guard).

2. EngineOptions::f16_weights_deny_list (vector<string>, default
   empty) — public API surface.  Wired through Engine::Impl →
   load_supertonic_gguf → the per-tensor allocation loop.

3. load_supertonic_gguf 7th parameter added at the end of the
   signature with a {} default — every existing call site keeps
   compiling without modification.

4. supertonic_model::f16_weights_excluded_count counter bumped at
   load time when a curated-hot tensor is excluded by the user's
   deny-list.  Surfaced in bench's human + JSON output so
   operators can confirm their config took effect.

5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on
   supertonic-cli, tts-cli (chatterbox), and supertonic-bench
   (comma-separated substring patterns).

6. Verbose-log line in load_supertonic_gguf when the deny-list is
   non-empty (silent on the default path — no visual noise on
   existing operator workflows).
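
The deny-list semantics in item 1 can be sketched as follows. The 1-arg predicate here is a placeholder (the real curated allow-list is not reproduced in this message), and the tensor names used with it are hypothetical; the 2-arg overload shows the three stated rules: substring match, demote-only, and empty patterns skipped.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Placeholder for the curated predicate: assumed here to accept names that
// end in ".weight" and contain neither "norm" nor "embed".
bool should_materialise_f16_weight(const std::string & name) {
    const std::string suffix = ".weight";
    const bool is_weight = name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    return is_weight &&
           name.find("norm") == std::string::npos &&
           name.find("embed") == std::string::npos;
}

// 2-arg overload: the deny-list can only flip true -> false.
bool should_materialise_f16_weight(const std::string & name,
                                   const std::vector<std::string> & extra_deny_substrings) {
    if (!should_materialise_f16_weight(name)) return false;  // never promotes a cold tensor
    for (const std::string & pattern : extra_deny_substrings) {
        if (pattern.empty()) continue;  // config-typo guard, not a universal match
        if (name.find(pattern) != std::string::npos) return false;
    }
    return true;
}
```

Skipping empty patterns matters because `std::string::find("")` matches every name, so a stray trailing comma in the CLI value would otherwise disable F16 weights wholesale.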

Test plan (TDD round 6):

- test-supertonic-f16-weights (UPDATED): existing 36 checks
  (positives, negatives, edges) + 29 new round-6 checks across 7
  new test functions (empty-list passthrough, matching-deny-
  excludes, non-matching-no-op, cannot-promote-cold, multiple-
  patterns ANY-match, empty-string defensive skip, empty-name
  safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time
  gate for EngineOptions::f16_weights_deny_list +
  load_supertonic_gguf 7th param; runtime defaults check +
  assignability + regression guards on every other documented
  EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0
  failures, 0 regressions on round-1/2/3 + audit follow-up + the
  baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench
  binaries: --f16-weights-deny flag parses correctly, surfaces in
  --help output, and threads through to the load layer.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ype K/V flash-attention dispatch

Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into an
enum of four concrete dtypes plus `auto`, with a matching `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag
so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth
as F16, no F16 underflow on small attention scores) or Q8_0 K/V
(Vulkan + half the K/V upload bandwidth) on adapters that advertise
the corresponding capability.  Default `auto` falls back to
`--f16-attn` so every existing operator config sees zero behaviour
change.

Strict TDD throughout: Prereq B extends the F16 parity harness to
cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both
hot shapes) BEFORE touching any production code; new pure-logic
resolver test (`test-supertonic-kv-attn-type`, 106 checks across the
full {-1, 0..3} × legacy × probe-mask matrix); new API-surface
SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks).
Tests committed first, observed to fail on missing symbols, then
implementation added.

Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch
site (same pattern as round-3's `resolve_vulkan_device_index`).
Probe-rejected explicit requests fall back to F32 silently
(advisory-probe contract); out-of-range int throws to surface CLI
typos loudly.  Vector-estimator dispatch site
(`build_text_attention_cache`) replaces the F16-only cast with a
switch on the enum; cache key promoted from `bool f16_kv_attn` to
`kv_attn_dtype kv_attn_type`.  Bench surface adds `(kv_attn_type=…)`
to the human-readable backend line and `"kv_attn_type"` +
`"kv_attn_type_requested"` to the JSON output so log-grep / CI
attribution works across machines.

Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch
so invalid values surface as a clean `error: ...` line + exit 2
(also fixes the pre-existing latent crash on `--vulkan-device abc` /
`--seed nonsense`).
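
The arg-parse hardening can be illustrated with a small sketch. The `parse_args` and `parse_int_strict` helpers are hypothetical and handle a single flag; they only show the pattern of letting `std::stoi` throw and converting any exception into an `error: ...` line plus exit code 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Parses a whole-string integer; throws on "abc", "12x", or overflow.
int parse_int_strict(const std::string & text) {
    std::size_t consumed = 0;
    const int value = std::stoi(text, &consumed);
    if (consumed != text.size()) {
        throw std::invalid_argument("trailing characters in '" + text + "'");
    }
    return value;
}

// Returns 0 on success, 2 on any parse error (reported on stderr).
int parse_args(const std::vector<std::string> & args, int & vulkan_device) {
    try {
        for (std::size_t i = 0; i < args.size(); ++i) {
            if (args[i] == "--vulkan-device") {
                if (i + 1 >= args.size()) {
                    throw std::invalid_argument("--vulkan-device needs a value");
                }
                vulkan_device = parse_int_strict(args[++i]);
            } else {
                throw std::invalid_argument("unknown argument '" + args[i] + "'");
            }
        }
    } catch (const std::exception & e) {
        std::fprintf(stderr, "error: %s\n", e.what());
        return 2;
    }
    return 0;
}
```

Without the try/catch, the exception from `std::stoi` would propagate out of `main` and terminate the process, which matches the latent crash described above.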

Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0
regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…servability + voice cache + Vulkan env-var passthrough

Lowest impact-÷-risk round of the four planned in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.  Four sub-features, none
touching the per-synth hot path beyond a single voice-cache
lookup.

1. Voice ttl/dp host cache (`detail::voice_host_cache`).  Eliminates
   2 sync points / synthesize() after the first per-voice call on
   Vulkan / OpenCL.  Extracted to a standalone helper so the
   lookup-or-load semantics are testable on CPU without
   instantiating a full Engine; reference-stability contract
   documented for the synthesis-pipeline call site.

2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)`
   public helper + `EngineOptions::vulkan_env_overrides` field +
   `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` /
   `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` /
   `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags
   on all three binaries).  ALL-OR-NOTHING validation: an
   operator-config typo throws cleanly BEFORE any env var is
   touched.  `set_env_if_unset` semantics so an operator-set env
   var still WINS over the EngineOptions override.

3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync`
   opt-out).  Inserts an explicit backend sync at every per-stage
   timing boundary so wall-clock attributes to the right stage on
   async backends.  Cheap on CPU; prerequisite for measuring
   round-5 / 8 / 9 wins on real hardware.

4. Bench per-denoise-step breakdown (`--bench-per-step`).  Times
   each `supertonic_vector_step_ggml` call individually so the
   first-step (cold pipeline) cost is distinguished from
   steady-state.  Empty array on the default-off path = identical
   legacy JSON shape.
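
The lookup-or-load semantics of item 1 can be sketched as follows. This is a minimal illustration, not the shipped `detail::voice_host_cache`: the names `voice_host_entry`, `voice_host_cache_sketch`, and the loader callback are assumptions made for the example. It relies on the `std::unordered_map` guarantee that references to elements stay valid when later inserts rehash the table.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the per-voice host data (ttl / dp vectors).
struct voice_host_entry {
    std::vector<float> ttl;
    std::vector<float> dp;
};

// Illustrative lookup-or-load cache; the real helper may differ.
class voice_host_cache_sketch {
public:
    using loader_fn = std::function<voice_host_entry(const std::string &)>;

    // Returns a reference to the cached entry; `load` runs only on a miss.
    // The reference stays valid across later inserts, but not across clear().
    const voice_host_entry & get_or_load(const std::string & voice,
                                         const loader_fn & load) {
        auto it = entries_.find(voice);
        if (it == entries_.end()) {
            it = entries_.emplace(voice, load(voice)).first;
        }
        return it->second;
    }

    void clear() { entries_.clear(); }
    std::size_t size() const { return entries_.size(); }

private:
    std::unordered_map<std::string, voice_host_entry> entries_;
};
```

In this sketch the expensive load (the GPU download in the real engine) happens once per voice; every later `get_or_load` for the same voice is a hash lookup.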

Strict TDD throughout.  Two new test executables committed
first, observed to fail on missing symbols, then implementation
written.  TDD also caught a real bug: the original env-key
validator used `std::string()` empty-as-success sentinel which
collided with the empty-string-as-key edge case; the test pinned
the contract and forced a `bool / out-param` API fix BEFORE any
production wiring went in.
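
A minimal sketch of the all-or-nothing validation and set-if-unset behaviour described above, assuming a POSIX platform (`setenv`). The function names and the key rule (non-empty, `[A-Z0-9_]` only) are assumptions for illustration; the shipped `apply_vulkan_env_overrides` may validate differently.

```cpp
#include <cstdlib>
#include <map>
#include <stdexcept>
#include <string>

// Success is the bool return; the reason goes through an out-param, so an
// empty string is never overloaded as a "no error" sentinel.
// Assumed rule for this sketch: non-empty, characters in [A-Z0-9_] only.
bool validate_env_key_sketch(const std::string & key, std::string * out_error) {
    if (key.empty()) {
        if (out_error) *out_error = "empty env key";
        return false;
    }
    for (char c : key) {
        const bool ok = (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '_';
        if (!ok) {
            if (out_error) *out_error = "invalid character in env key '" + key + "'";
            return false;
        }
    }
    return true;
}

// Pass 1 validates every key; pass 2 applies. A typo anywhere in the map
// throws before any variable is touched.
void apply_env_overrides_sketch(const std::map<std::string, std::string> & overrides) {
    for (const auto & kv : overrides) {
        std::string error;
        if (!validate_env_key_sketch(kv.first, &error)) {
            throw std::invalid_argument(error);
        }
    }
    for (const auto & kv : overrides) {
        // overwrite = 0: a value already set by the operator wins.
        ::setenv(kv.first.c_str(), kv.second.c_str(), 0);
    }
}
```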

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0
failures, 0 regressions (was 19; the 2 new tests add 54 checks).

Co-authored-by: Cursor <cursoragent@cursor.com>
…PU bridge

Single largest remaining per-step sync hotspot identified in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.  PR tetherto#16's audit follow-up #6
(2C-lite) shipped the GPU device→device blit infrastructure
(`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group
attentions to use it; the front-block attn0 site was deferred
because of cache-lifetime concerns at the time.  Round 8 picks
it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one
function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points /
synth on the production path (3 GPU→host downloads + 3 host→GPU
uploads of post-RoPE Q / K / raw V at the front-block attn0
site).  Strict gating on `front_in_graph_rope &&
!include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0` — trace
mode falls back to the legacy host bridge so the trace harness
still captures pre-attention Q/K/V host vectors, and legacy
GGUFs without `vector_rope_theta` continue to take the host-
rotate path.

The blit primitive parity gate already shipped with PR tetherto#16
(`test-supertonic-graph-to-graph-blit`); round 8 extends it
with explicit coverage of the front-block K / V shapes
(text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…U bridge

Extends the round-8 GPU bridge pattern to the 4 style flash-attn
sites (style0 + g1_style + g2_style + g3_style).  Largest
bandwidth-style optimisation that ships from pure-Supertonic-side
code: 120 sync points / synth eliminated on the production
Vulkan / OpenCL path (4× the round-8 win).

- vector_res_style_qkv_result extended with `sq_gpu / sk_gpu /
  sv_gpu` GPU handles, populated unconditionally by
  `run_res_style_qkv_cache` (cheap — no GPU sync; just
  `ggml_graph_get_tensor` lookups).  Same shape as
  `vector_group_graph_result::q_rope_gpu` etc from the round-1
  2C-lite work.

- `run_res_style_qkv_cache` host-download gating: the 3
  `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv`
  are now gated on `trace != nullptr`.  Production path skips
  them entirely.  Mirrors the round-1 2C-lite
  `need_host_qkv = (trace != nullptr)` gate.  `post` stays
  unconditional — consumed by the next-stage
  `run_style_residual_cache` which still expects a host vector
  (cross-stage GPU bridge for `post` is deferred).

- 4 dispatch sites rewired with the same gating pattern as the
  round-8 front-block bridge: `!include_ggml_trace && sq_gpu &&
  sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge.
  Trace mode falls back to the legacy host bridge so the trace
  harness still gets all the host vectors.

Strict TDD: parity test
(`test-supertonic-graph-to-graph-blit`) extended with explicit
style-shape coverage (`style_sq_L1` trip-wire + clarified
`style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any
production wiring.  All 24 / 24 parity checks pass at bit-exact
`max_abs = 0.0`.

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…t upload-skip

After rounds 8 + 9 wired the GPU bridge for the 5 attention
sites, the largest remaining per-step host upload is `text_emb`
(uploaded to 4 caches × 5 denoise steps = 20 times / synth, but
constant data within one synth).  Round 10 generalises the F4
pointer-compare upload-skip pattern (already used for
`style_v_in` / `kctx_in`) into a reusable
`upload_skip_tracker` helper and applies it to the front-block
+ 3 group caches.

CRITICAL CORRECTNESS HAZARD addressed:

`text_emb` is a stack-local `std::vector<float>` in
`Engine::Impl::synthesize()` (and bench loops).  Modern heap
allocators (jemalloc / tcmalloc / glibc) very often re-issue
the SAME address for the next stack-local vector of the same
size — so synth N+1 may have `text_emb.data() ==
synth_N.text_emb.data()` despite holding completely different
data.  A naive pointer-compare upload-skip would silently leak
prior synth's text-encoder embedding into the next synth's GPU
buffer.

Mitigation: caller MUST invoke `tracker.reset()` at every
synth boundary (`current_step == 0`).  The CPU-only TDD test
includes an explicit cross-synth pointer-reuse hazard
simulation that documents the bug and verifies the reset
prevents it.
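
The hazard and its mitigation can be reproduced with a small pointer-compare tracker. This is a sketch under assumptions: the struct name, the `needs_upload` method, and the (pointer, size) key are illustrative and may not match the shipped `upload_skip_tracker`.

```cpp
#include <cstddef>

// Illustrative pointer-compare tracker; the real helper may differ.
struct upload_skip_tracker_sketch {
    const void * last_ptr  = nullptr;
    std::size_t  last_size = 0;

    // Returns true when the caller must upload, and records (ptr, size)
    // as the most recent upload. Contents are never inspected.
    bool needs_upload(const void * ptr, std::size_t size) {
        if (ptr != nullptr && ptr == last_ptr && size == last_size) {
            return false;  // same host buffer as last time: skip
        }
        last_ptr  = ptr;
        last_size = size;
        return true;
    }

    // Must be called at every synth boundary (current_step == 0), because
    // a new synth can reuse the same address for different data.
    void reset() {
        last_ptr  = nullptr;
        last_size = 0;
    }
};
```

Because the tracker compares addresses only, a buffer that is refilled in place looks unchanged. Calling `reset()` at step 0 forces the first upload of each synth regardless of the address.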

Per-synth wins:
- 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth
- ~512 KB / synth bandwidth saved at text_len=32 (linear in
  prompt length)

Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7
functions, 41 checks) committed first, observed to fail compile
(`upload_skip_tracker was not declared`), then implementation
added.

Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up #5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.
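
The two layouts differ only in index arithmetic. The host-side sketch below (plain C++, no ggml; the function name is hypothetical) shows the mapping that `ggml_cont(ggml_transpose(...))` performs on the graph side, and why applying the same conversion twice, as `tensor_to_time_channel` would after the fix, does not round-trip.

```cpp
#include <cstddef>
#include <vector>

// Channel-major-flat (ggml ne=[L, HD]): element (t, c) lives at c * L + t.
// Time-major-flat    (ggml ne=[HD, L]): element (t, c) lives at t * HD + c.
std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                               std::size_t L,
                                               std::size_t HD) {
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

For `L = 3`, `HD = 2`, the channel-major input `{0, 1, 2, 10, 11, 12}` becomes `{0, 10, 1, 11, 2, 12}`. Feeding that result through the same conversion again yields `{0, 11, 10, 2, 1, 12}`, which matches neither layout.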

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== #10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).
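
The behaviour matrix above can be sketched as a pure function. This is an illustration under assumptions, not the shipped resolver: the `_sketch` name, the exception types, throwing on an empty device list, and checking the length mismatch only on the auto-pick path are all choices made for the example. The lower-index tie-break follows the tiebreak contract pinned by the tests.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of the selection rules; not the production implementation.
int resolve_vulkan_device_index_sketch(int requested,
                                       const std::vector<std::size_t> & free_vram,
                                       const std::vector<bool> & is_uma = {}) {
    if (requested >= 0) {
        return requested;  // explicit index: UMA-agnostic passthrough
    }
    if (free_vram.empty()) {
        throw std::invalid_argument("no devices to choose from");  // assumption
    }
    if (!is_uma.empty() && is_uma.size() != free_vram.size()) {
        throw std::invalid_argument("is_uma_per_device length mismatch");
    }
    // The bias applies only when at least one discrete adapter exists.
    bool any_discrete = false;
    for (bool uma : is_uma) {
        any_discrete = any_discrete || !uma;
    }
    int best = -1;
    for (std::size_t i = 0; i < free_vram.size(); ++i) {
        if (any_discrete && is_uma[i]) {
            continue;  // skip UMA adapters when a discrete one is present
        }
        // Strict '>' keeps the lower index on equal free VRAM.
        if (best < 0 || free_vram[i] > free_vram[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```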

== #6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 #6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== #5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 #5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 and others added 3 commits May 19, 2026 10:42
…lidation + Q8_0 K/V finding

Round 13 is a strict-improvement-only follow-up to round 12:
no code path is removed, no optimisation is rolled back, and
the end-to-end perf on every backend stays at the round-12
level.  Two deliverables, both no-regret:

== 1. New helper `alloc_input_scratchpad_or_throw` ==

Round 12 #5 inlined the "try pinned-host first, fall back to
default backend buffer, throw on both-fail" idiom at 4 cache
sites (front block + 3 group caches):

    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }

Round 13 factors it into one helper.  Each caller becomes:

    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");

Same correctness contract — CPU / Metal / OpenCL fall back to
default backend buffer; Vulkan tries pinned-host first.
Defensive failure modes consolidated: null model.backend, null
input_ctx, null cache_name all throw std::runtime_error with a
message that includes the cache name, instead of segfaulting in
an error-handler path.  Single point of maintenance for the
pattern; future cache builds that want pinned-host inputs use
the helper directly.
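
The control flow of the helper can be sketched without ggml by substituting callables for the two allocators. Everything here is a stand-in: `buffer_handle` replaces `ggml_backend_buffer_t`, the two `alloc_fn` arguments replace `try_alloc_inputs_in_pinned_host_buffer` and `ggml_backend_alloc_ctx_tensors`, and the error strings are invented for the example.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins so the sketch compiles without ggml.
using buffer_handle = void *;
using alloc_fn      = std::function<buffer_handle()>;

// Try the pinned-host allocator, fall back to the default backend buffer,
// and throw with the cache name if both fail or an argument is missing.
buffer_handle alloc_input_scratchpad_or_throw_sketch(const alloc_fn & try_pinned,
                                                     const alloc_fn & alloc_default,
                                                     const char * cache_name) {
    if (cache_name == nullptr) {
        throw std::runtime_error("input scratchpad: null cache_name");
    }
    if (!try_pinned || !alloc_default) {
        throw std::runtime_error(std::string(cache_name) + ": null allocator");
    }
    if (buffer_handle buf = try_pinned()) {
        return buf;  // Vulkan fast path
    }
    if (buffer_handle buf = alloc_default()) {
        return buf;  // CPU / Metal / OpenCL, or Vulkan without a pinned buffer
    }
    throw std::runtime_error(std::string(cache_name) +
                             ": input scratchpad allocation failed");
}
```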

`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins
the contract via SFINAE on the symbol + CPU-fallback round-trip
through `ggml_backend_tensor_set` / `get` + null-arg throws +
empty-ctx error message includes the cache name.  CPU-only —
no GGUF fixture required.  CI test count goes from 24 / 24
(round 12) to 25 / 25 (round 13).

Perf impact: zero — same code path, same allocations, same data
movement, just one fewer level of nesting at each call site.

== 2. Q8_0 K/V no-win documented for RTX 5090 ==

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench
output advertises `q8_0_kv_attn=available`.  Round 13 measures
the trade-off on the test rig (RTX 5090, 1.79 TB/s memory
bandwidth, long prompt 206 chars / 18 s audio):

    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower

The F32->Q8_0 cast overhead exceeds the saved K/V upload
bandwidth on a high-bandwidth discrete GPU.  Operator guidance:
stick with the F16 default on RTX 5090 and similar high-
bandwidth discretes.  Q8_0 is shipped for adapters where the
K/V upload bottlenecks the synth (older PCIe 3.0, lower-end
discretes, iGPUs with slow BAR); cross-over point to be
measured per-adapter by operators using `--bench-per-step`
from round 7.

== Test plan ==

  ctest --test-dir tts-cpp/build         -L unit
  -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
  ctest --test-dir tts-cpp/build-vulkan  -L unit
  -> 25 / 25 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) — every adapter
writes a valid WAV.

Perf on RTX 5090 (10 runs + 3 warmup, long prompt):
  Round 12 baseline:  med= 31.11 ms  (588x realtime)
  Round 13:           med= 31.71 ms  (577x realtime)
  -> within run-to-run noise; no regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
…sumption + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 #5.
   No code was dropped; the existing round numbers are kept
   as-is (not renumbered to close the gap) so PR descriptions
   and CI logs match the round labels in this log without
   rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.

Reviewer's fourth ask — "the round-11 fix is redone in PR
tetherto#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
… tests + surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to
nullptr; two test files extended; two doc-comment blocks added).
No production-logic surface modified for existing callers.

Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks
  (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV
  (rms=285.6, abs_max=4703 on both backends, same seed +
  text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR tetherto#18:

1. test_resolver_returns_concrete_only (kv_attn_type).  The
   original exhaustive 5 x 2 x 8 sweep only asserted dt !=
   autoselect, so a typo returning f16 when bf16 was
   requested+supported would pass silently.  Rewritten with a
   second pure-function `expected()` mirror of the resolver's
   matrix; every one of the 80 grid points now CHECKs the
   resolver's return value against the expected concrete
   dtype.  Added cross-contamination spot checks (requesting
   bf16 with f16+q8_0 supported but bf16 NOT supported must
   fall to f32, not silently to f16 or q8_0).  Now 205 checks
   passed in test-supertonic-kv-attn-type.

2. test_cpu_fallback_returns_valid_buffer (input_scratchpad).
   Original only round-tripped x_in (one of two allocated
   tensors).  Now round-trips BOTH x_in and temb_in with
   distinct payload patterns (1.0f vs 2.5f), plus a
   cross-aliasing recheck (after writing temb_in, x_in must
   still read back its original 1.0f) — a binding-collision
   bug where both tensors share memory would now fail this
   check.

3. resolve_kv_attn_type silent fallback on explicit operator
   request.  Added optional `bool * out_was_downgraded` output
   parameter to the resolver — set to true IFF the operator
   explicitly requested f16/bf16/q8_0 AND the corresponding
   backend probe returned false AND we therefore returned f32.
   The auto path (-1) leaves the flag false (no operator
   surprise — auto-policy is doing its job).  Engine ctor +
   supertonic-bench wired to emit a one-line
   `fprintf(stderr, "warning: requested --kv-attn-type %s but
   the resolved backend's flash-attn probe rejected it;
   falling back to f32 (set --kv-attn-type auto to silence)")`
   on a downgrade.  Defaulted nullptr keeps the pure-logic
   unit tests stderr-clean.  New test_downgrade_flag_signal
   pins the contract on every relevant path (auto + missing
   probe -> flag false; explicit + matching probe -> flag
   false; explicit + missing probe -> flag true; nullptr out-
   ptr safe).

4. test_uma_aware_tiebreak_equal_vram_discretes
   (vulkan_device_select).  Added a dedicated UMA-bias-active
   test case: two discrete cards with EQUAL VRAM (32 GB each)
   alongside a UMA iGPU.  Pins three sub-cases: interleaved
   UMA in the middle, adjacent discretes with no UMA, three-
   way all-discrete tie.  Lower index wins in every case.
   The existing test 11's second CHECK already covered the
   interleaved-UMA case; this hoists the contract into its
   own named test so a future refactor reading the test
   names knows the tiebreak case is pinned.

5. cached_backend_capabilities UaF risk under test-only
   clear().  Added a long comment on the function documenting
   the four invariants:
   (a) production callers may hold the returned ref across
       subsequent calls for OTHER backends (unordered_map's
       insert-doesn't-invalidate-references guarantee);
   (b) production callers MUST NOT keep the ref alive across
       a clear() call (test code's responsibility);
   (c) multi-threaded callers must externally synchronise
       deref vs. clear (the cache's lock protects map
       structure, NOT element lifetime);
   (d) if a future refactor adds a production-reachable
       erase / clear path, this function must switch to
       return-by-value or std::shared_ptr<const T>.
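
The downgrade-flag contract from item 3 can be sketched as below. The enum, the probe struct, and the auto policy (F16 when supported, else F32) are assumptions for illustration; only the flag semantics are taken from the description above: true if and only if an explicit f16 / bf16 / q8_0 request was rejected by its probe and f32 was returned.

```cpp
// Hypothetical dtype enum and probe bundle for this sketch.
enum class kv_dtype { autoselect = -1, f32, f16, bf16, q8_0 };

struct flash_attn_probes {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

// Returns whether the backend accepts the given concrete dtype.
inline bool probe_for(kv_dtype t, const flash_attn_probes & caps) {
    switch (t) {
        case kv_dtype::f16:  return caps.f16;
        case kv_dtype::bf16: return caps.bf16;
        case kv_dtype::q8_0: return caps.q8_0;
        default:             return true;  // f32 needs no probe
    }
}

// Always returns a concrete dtype, never autoselect.
inline kv_dtype resolve_kv_attn_type_sketch(kv_dtype requested,
                                            const flash_attn_probes & caps,
                                            bool * out_was_downgraded = nullptr) {
    if (out_was_downgraded) *out_was_downgraded = false;
    if (requested == kv_dtype::autoselect) {
        // Assumed auto policy; a fallback here is not a downgrade.
        return caps.f16 ? kv_dtype::f16 : kv_dtype::f32;
    }
    if (probe_for(requested, caps)) {
        return requested;
    }
    // Explicit request rejected: fall to f32, never to another dtype.
    if (out_was_downgraded) *out_was_downgraded = true;
    return kv_dtype::f32;
}
```

A caller that wants the warning passes a flag and prints to stderr when it comes back true; pure-logic tests omit the argument and stay quiet.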

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000 Zbig9000 force-pushed the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 903c312 to bf0ce3b Compare May 19, 2026 09:08

@ogad-tether ogad-tether left a comment

All five findings from the previous review have been addressed in commits 16b9b90 and bf0ce3bb:

  1. kv_attn_type resolver test — Rewritten with a separate expected() mirror function that verifies the exact concrete dtype on all 80 grid points + cross-contamination spot checks. Solid.

  2. Input scratchpad tensor coverage — Now round-trips both x_in and temb_in with distinct payload patterns (1.0f vs 2.5f) plus a cross-aliasing recheck. Would catch binding-collision bugs.

  3. Silent fallback warning - resolve_kv_attn_type now takes an optional bool * out_was_downgraded out-param. Engine + bench emit a stderr warning on explicit-request downgrade. Auto path stays quiet. Clean API design with nullptr default.

  4. UMA-bias tiebreak — New test_uma_aware_tiebreak_equal_vram_discretes covers the equal-VRAM discrete case with three sub-cases (interleaved UMA, adjacent discretes, three-way all-discrete tie).

  5. Capability cache UaF docs — Thorough 4-point invariant comment on cached_backend_capabilities documenting the reference-lifetime contract and the conditions under which it would need to change.

The doc commit also adds a clear explanation for the round-5 gap and documents the UMA-bias driver-misreport failure modes.

25/25 tests, 256 individual checks. LGTM.

@GustavoA1604 GustavoA1604 merged commit 184c641 into tetherto:supertonic_optimizations May 19, 2026
59 of 66 checks passed
@Zbig9000 Zbig9000 deleted the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch May 19, 2026 12:18