Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic by Zbig9000 · Pull Request #18 · tetherto/qvac-ext-lib-whisper.cpp

Zbig9000 · 2026-05-14T09:37:53Z

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds twelve rounds of Vulkan-specific deltas (rounds 1–13, round 5 skipped) — each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability + correctness contract for future regressions.

Scope vs. PR #16: this PR sits on top of the OpenCL branch (QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All 13 commits below are Vulkan-specific deltas; the OpenCL audit work is not restated here. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.

Net new surface (against the OpenCL branch):

| Category | Delta |
| --- | --- |
| New backend-capability probes | 6 (`native_leaky_relu`, `f16_kv_flash_attn`, `f16_mul_mat`, `q8_0_kv_flash_attn`, `bf16_kv_flash_attn`, `pinned_host_buffer`) |
| New thread-local dispatch flags | 2 (`use_native_leaky_relu`, `kv_attn_type`); joins the round-1 `use_f16_attn` |
| New `EngineOptions` knobs | 6 (`vulkan_device`, `prewarm_text`, `f16_weights_deny_list`, `kv_attn_type`, `vulkan_env_overrides`, `bench_per_step`) |
| New CLI flags (× 3 binaries) | `--vulkan-device`, `--prewarm`, `--f16-weights-deny`, `--kv-attn-type`, `--vulkan-prefer-host-memory`, `--vulkan-disable-coopmat2`, `--vulkan-disable-bfloat16`, `--vulkan-perf-logger`, `--vulkan-async-transfer`, `--vulkan-env`, `--bench-per-step`, `--no-bench-sync` |
| New per-step / per-cache helpers | `upload_skip_tracker`, `voice_host_cache`, `try_alloc_inputs_in_pinned_host_buffer`, `alloc_input_scratchpad_or_throw`, `apply_vulkan_env_overrides`, `run_speech_prompted_merged_cache`, plus 5 GPU-bridge dispatch sites |
| New unit tests (`ctest -L unit`) | 12 (`test-supertonic-vulkan-dispatch`, `-portable-ops` updated, `-capability-cache`, `-warm-up-api`, `-vulkan-device-select`, `-f16-deny-list-api`, `-kv-attn-type`, `-kv-attn-type-api`, `-vulkan-env-overrides`, `-voice-host-cache`, `-upload-skip-tracker`, `-text-encoder-gpu-bridge`, `-pinned-host-buffer`, `-input-scratchpad`; plus `-f16-attn-parity` extended for BF16 and `-graph-to-graph-blit` extended for front-block + style shapes; plus `-rope-packed-qk` rewritten for the production `[L, HD]` layout) |
| Whole `ctest -L unit` | 25 / 25 PASS, 0 regressions, 0 flakes (CPU build + Vulkan build) |

Combined perf snapshot — RTX 5090, long prompt (173 chars / ~15 s audio):

| Stage | Round 11 baseline | Round 13 (final) | Speedup |
| --- | --- | --- | --- |
| Whole synth | 76.11 ms / 5 steps (123× realtime) | 27.99–31.71 ms (537–588× realtime) | 2.4–2.7× |
| Vector-estimator step | 12.7 ms | 3.28 ms | 3.9× |
| Prewarm cold-start | 330 ms | 21 ms | 15× |
| Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU) | picks RADV → 178 ms (7× rt) | picks RTX 5090 → 28 ms (537× rt) | 6.4× |

Investigation methodology (TDD throughout)

Every round followed the same workflow:

Audit: identify a Vulkan-specific gap (capability probe, multi-GPU support, drift recovery, sync-point hotspot, etc.).
Test first: write the CPU-only unit gate that pins the new contract (resolver behaviour matrix, API surface, parity bound, layout invariant). Commit + observe failure on the missing symbol (compile error or assertion).
Implement: minimal-surgery production change. Pure-logic helpers split out so the policy is testable on CPU without a Vulkan device.
Re-run: every new test + every existing test must pass before commit.
End-to-end smoke on real hardware once round-11 unblocked the production path.
Update PROGRESS_SUPERTONIC.md + commit.

The CPU-only test strategy is deliberate: a fresh checkout's ctest exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer. Real-Vulkan numerics are validated through the F16 / BF16 K/V parity harness running against the CPU flash_attn_ext reference, which lands the same ggml_cpy(K → typed) + ggml_flash_attn_ext graph the live Vulkan dispatch builds.

TDD caught real bugs that would otherwise have shipped:

The env-var-passthrough validator (round 7) used std::string() empty-as-success sentinel which collided with the empty-string-as-key edge case; the test pinned the contract and forced a bool / out-param API fix BEFORE any production wiring went in.
The packed-QK RoPE helper (audit follow-up #5 from PR #16) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] tensor. In fact the matmul produces ne=[L, HD], the bit-exact transpose of the helper's input contract. The original CPU unit test hand-built Q under the wrong shape, so the failure mode was invisible to CI; round 11 rewrote the test under the production shape (RED), then fixed the helper (GREEN), unblocking end-to-end synth on every backend.
Round-10's pointer-compare upload-skip would have silently leaked prior synth's text-encoder embedding into the next synth on heap allocators that re-issue the same address (jemalloc / tcmalloc / glibc). An explicit cross-synth pointer-reuse hazard test forced the tracker.reset() API at every synth boundary.

Commit-by-commit walkthrough

`787d966b` — Round 1: Vulkan bring-up (initial commit)

Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used model.use_f16_attn = !backend_is_cpu because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the HSK % 8 == 0 supports_op gate has to be respected, so the auto-policy needs a probe.

Two new supertonic_model flags populated at GGUF load: backend_is_vk (informational; appended to the backend-description string) and use_native_leaky_relu (resolved via ggml_backend_supports_op(LEAKY_RELU) against a synthetic node — the dispatch helper short-circuits to the fused builtin on backends that ship GGML_OP_LEAKY_RELU natively, falls back to the conservative RELU + SCALE + ADD decomposition otherwise; no hard-coded backend table).
New backend-capability probe supertonic_backend_supports_f16_kv_flash_attn gates the use_f16_attn auto-policy. Builds a synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node and asks the backend whether it would accept it — load-time, zero hot-path cost, graceful auto-disable on a false answer.
EngineOptions::vulkan_device int + --vulkan-device N CLI flag plumbed through all three binaries. Replaces the historical hard-coded ggml_backend_vk_init(0); range-checked against ggml_backend_vk_get_device_count() at load (out-of-range = hard error, no silent CPU fallback that would hide CLI typos / wrong-machine config).
Verbose mode + bench output append ggml_backend_vk_get_device_description so multi-GPU / multi-ICD machines (NVIDIA + llvmpipe, AMD RADV + NVIDIA) unambiguously identify which adapter ran.
New CPU-only TDD harness test-supertonic-vulkan-dispatch covering the new flags through supertonic_op_dispatch_scope + a smoke test for the F16-K/V probe. Pre-existing test-supertonic-portable-ops updated to explicitly request the decomposed path on the GPU fixture model so its existing GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU.
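The native-versus-decomposed LEAKY_RELU dispatch can be illustrated on scalars. The sketch below assumes the identity leaky_relu(x) = slope·x + (1 − slope)·relu(x), which is one way to build the op from RELU, SCALE and ADD only; the PR does not spell out the exact graph it emits, and all function names here are illustrative rather than taken from the codebase.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Reference: what a native fused LEAKY_RELU op computes per element.
inline float leaky_relu_native(float x, float slope) {
    return x >= 0.0f ? x : slope * x;
}

// Decomposed form using only RELU, SCALE and ADD (assumed identity):
//   leaky_relu(x) = slope * x + (1 - slope) * relu(x)
inline float leaky_relu_decomposed(float x, float slope) {
    const float relu_x  = std::max(x, 0.0f);        // RELU
    const float scaled  = slope * x;                // SCALE
    const float boosted = (1.0f - slope) * relu_x;  // SCALE
    return scaled + boosted;                        // ADD
}

// Mirrors the dispatch described above: take the fused op when the backend
// probe reported support, otherwise fall back to the decomposition.
inline float leaky_relu_dispatch(float x, float slope, bool use_native) {
    return use_native ? leaky_relu_native(x, slope)
                      : leaky_relu_decomposed(x, slope);
}
```

Both branches agree up to floating-point rounding, which is why a correctness gate on the decomposed path can stay in place even when production takes the fused op.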

`d5518ee8` — Pre-existing missing-include fix

tts-cpp/src/chatterbox_tts.cpp used std::atomic<int> without #include <atomic>; pre-existed before this branch but blocked the Supertonic build under the cleaner cmake -S tts-cpp -B build-tts invocation used for round 2+ verification. One-line fix in a single TU. Kept as a separate commit so it's trivially revertable / cherry-pickable to other branches.

`6ab085f6` — Round 2: capability-cache + 3 probes + prewarm

The round-1 probes were already cheap, but engine.cpp + bench.cpp + load_supertonic_gguf each ran them independently — three probes × N capabilities = up to 9 redundant ggml_backend_supports_op calls per backend per process.

Process-wide cached_backend_capabilities map keyed by ggml_backend_t, guarded by a single std::mutex. Hot path is load-time only, so contention is negligible. Probe-call counter (capability_probe_call_counter) exposed for the regression test.
3 new probes added to the cache + exposed as public forwarders:
- supertonic_backend_supports_f16_mul_mat — gates the use_f16_weights auto-policy (Phase 2A made it !backend_is_cpu unconditionally; round 2 makes it probe-gated so a backend that ships F16 storage but rejects the hot mul_mat(F16, F32) shape doesn't crash at first synth call).
- supertonic_backend_supports_q8_0_kv_flash_attn — forward-compat probe; primes the cache for round 4's live dispatch.
- supertonic_backend_supports_native_leaky_relu — wraps round 1's inline probe so the auto-policy can use the cached path.
Engine::warm_up(text) API + EngineOptions::prewarm_text + --prewarm TEXT CLI flag. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines for every Supertonic stage compile up-front; the operator-visible first synthesize() call hits steady-state latency instead of paying the ~hundreds-of-ms cold-start hit chatterbox PROGRESS.md measured on Adreno + RADV. No-op on CPU backends.
New tests: test-supertonic-capability-cache (probe-counter regression — 1 cache miss + N hits) and test-supertonic-warm-up-api (SFINAE compile-time gate on the new API).
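A minimal sketch of the cache-plus-counter pattern described above, assuming an opaque pointer as the backend key and a caller-supplied probe callback. The type and member names (`capability_cache`, `backend_capabilities`, `probe_calls`) are hypothetical stand-ins, not the PR's symbols.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical stand-in for ggml_backend_t (an opaque pointer in ggml).
using backend_handle = const void*;

struct backend_capabilities {
    bool f16_kv_flash_attn = false;
    bool f16_mul_mat       = false;
    bool native_leaky_relu = false;
};

class capability_cache {
public:
    using probe_fn = std::function<backend_capabilities(backend_handle)>;

    explicit capability_cache(probe_fn probe) : probe_(std::move(probe)) {}

    // Runs the probe at most once per backend; later calls are cache hits.
    // The lock is held across the probe, which is acceptable because this
    // only runs at load time.
    backend_capabilities get(backend_handle backend) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(backend);
        if (it == cache_.end()) {
            ++probe_calls_;
            it = cache_.emplace(backend, probe_(backend)).first;
        }
        return it->second;
    }

    // Exposed so a regression test can assert "1 miss + N hits".
    int probe_calls() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return probe_calls_;
    }

private:
    probe_fn probe_;
    mutable std::mutex mutex_;
    std::map<backend_handle, backend_capabilities> cache_;
    int probe_calls_ = 0;
};
```

Querying the same backend from several call sites then costs one probe in total, and the counter gives the test a direct way to detect a regression back to per-site probing.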

`36dc758c` — Round 3: multi-device auto-pick + 2 forward-compat probes

The round-1 --vulkan-device N flag covered manual selection but every multi-GPU operator has to pin a specific index in their config; auto-pick across heterogeneous machines requires VRAM introspection.

--vulkan-device -1 auto-pick policy: resolve_vulkan_device_index pure-logic helper picks the device with the most free VRAM via ggml_backend_vk_get_device_memory(). Tie-break = lower index (deterministic). Reserved negatives < -1 throw to surface CLI typos. The pure-logic split makes the behaviour matrix testable on CPU with synthetic (index, [vram_per_device]) tuples — no real Vulkan device required for CI.
2 new forward-compat probes added to the cache:
- supertonic_backend_supports_bf16_kv_flash_attn — symmetric to F16-K/V, picks BF16 instead. Mostly relevant on Vulkan with cooperative_matrix2 (NVIDIA Ampere+ / RDNA3+).
- supertonic_backend_supports_pinned_host_buffer — true iff the backend is Vulkan AND ggml_backend_vk_host_buffer_type() returns non-null. Primes the cache for round 12's per-engine input-scratchpad refactor.
New test test-supertonic-vulkan-device-select (8 functions, 23 checks — empty list, single device, auto-pick max VRAM, tie-breaking, explicit index passthrough, out-of-range, reserved negatives, zero-VRAM device).
test-supertonic-capability-cache extended with new-probe coverage.

`8087852b` — Round 6: F16-weights operator deny-list

The Phase 2A F16-weights policy was all-or-nothing — operators couldn't keep one specific tensor at F32 if it caused drift on a particular adapter / driver combo without disabling F16 weights for the entire model.

2-arg should_materialise_f16_weight(source_name, deny_list) overload layered on top of the curated allow-list. Each entry is a substring; if ANY non-empty entry is found inside a tensor's source name, that tensor stays at its native storage type. Empty entries are skipped defensively (config-typo guard so a stray empty entry doesn't silently disable F16 for the whole model).
EngineOptions::f16_weights_deny_list + --f16-weights-deny PAT1,PAT2,... CLI flag (comma-split parser shared between supertonic-cli / tts-cli / supertonic-bench). Default empty (zero behaviour change for every existing operator config).
supertonic_model::f16_weights_excluded_count counter surfaced in bench output (human + JSON) so operators can confirm their deny-list took effect. Silent on the default empty path.
New test test-supertonic-f16-deny-list-api (SFINAE + runtime defaults + assignability + regression guards). Existing test-supertonic-f16-weights extended with 7 new test functions / 29 new checks (empty-list passthrough, matching-deny-excludes, non-matching-no-op, cannot-promote-cold, multiple-patterns ANY-match, empty-string defensive skip, empty-name safety).
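The deny-list matching rule (substring match, any entry wins, empty entries skipped, cannot promote a tensor the allow-list rejects) can be sketched as follows. The allow-list predicate is a toy stand-in, since the curated list itself is not shown in this PR, and the `_sketch` suffix marks the function as illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the curated allow-list (assumption: the real list differs).
inline bool on_allow_list(const std::string& source_name) {
    return source_name.find("weight") != std::string::npos;
}

// True if any NON-EMPTY deny pattern occurs as a substring of the name.
inline bool matches_deny_list(const std::string& source_name,
                              const std::vector<std::string>& deny_list) {
    for (const std::string& pattern : deny_list) {
        if (pattern.empty()) continue;  // config-typo guard: "" would match everything
        if (source_name.find(pattern) != std::string::npos) return true;
    }
    return false;
}

// The deny-list can only exclude tensors; it never promotes one that the
// allow-list already rejected.
inline bool should_materialise_f16_weight_sketch(const std::string& source_name,
                                                 const std::vector<std::string>& deny_list) {
    return on_allow_list(source_name) && !matches_deny_list(source_name, deny_list);
}
```

Skipping empty patterns matters because `std::string::find("")` succeeds at position 0 for every name, so a stray trailing comma in the CLI value would otherwise disable F16 weights for the whole model.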

`60eed5e9` — Round 4: multi-dtype K/V flash-attention dispatch

The round-1 --f16-attn boolean only let operators pick between F32 and F16 K/V flash-attention. Round 4 generalises the dispatch into an enum of four K/V dtypes (plus an auto sentinel) and a matching CLI flag, so operators can opt into BF16 K/V (Vulkan coopmat2: same bandwidth as F16, no F16 underflow on small attention scores) or Q8_0 K/V (Vulkan, half the K/V upload bandwidth) on adapters that advertise the corresponding capability. This is the live wiring that turns the round-2 / round-3 probe results into actual GPU work.

New internal enum tts_cpp::supertonic::detail::kv_attn_dtype { autoselect=-1, f32=0, f16=1, bf16=2, q8_0=3 } + pure-logic resolver resolve_kv_attn_type(requested, legacy_use_f16_attn, supports_f16, supports_bf16, supports_q8_0). Same testable-policy split as round-3's resolve_vulkan_device_index.
EngineOptions::kv_attn_type int field (-1 = auto, 0..3 explicit) — same -1 = auto convention as f16_attn / f16_weights / vulkan_device, so operator configs are consistent. Default falls back to f16_attn's value, so every existing operator config sees zero behaviour change.
Probe-gated graceful fallback to F32 on adapters that don't support the requested dtype — an operator setting --kv-attn-type bf16 once in their production config works on both NVIDIA Ampere+ (BF16 effective via Vulkan coopmat2) and Intel ARC (no coopmat2 → silent F32 fallback) without crashing. Out-of-range --kv-attn-type N throws loudly to surface CLI typos.
Vector-estimator dispatch site rewrite (build_text_attention_cache): if (cache.f16_kv_attn) { cast→F16 } replaced with a switch on the enum; cast target picked from {F16, BF16, Q8_0} per cache.kv_attn_type. Cache invalidation key promoted from bool to enum (rebuilds the graph when the enum flips, same correctness contract as the rest of the cache key tuple).
--kv-attn-type {auto,f32,f16,bf16,q8_0} CLI on all three binaries. Bench surface adds (kv_attn_type=…) to the human-readable line and "kv_attn_type" + "kv_attn_type_requested" to the JSON output so log-grep / CI attribution works across machines.
Bonus: supertonic-cli arg-parse loop wrapped in try/catch so invalid values surface as a clean error: ... line + exit 2 (also fixes a pre-existing latent crash on --vulkan-device abc / --seed nonsense / etc).
Prereq B: test-supertonic-f16-attn-parity extended with 2 new BF16-vs-F32 parity checks (vector-estimator + style shapes; CPU max_abs_err = 5.263e-3 and 3.596e-3, both within the same 5e-3 tolerance band as the existing F16 baseline). Written BEFORE any production change — the parity gate was in place before the cast logic was touched.
2 new tests: test-supertonic-kv-attn-type (106 checks across the full {requested × legacy × probe-mask} matrix, out-of-range throws, exhaustive resolver-never-leaks-autoselect sweep) and test-supertonic-kv-attn-type-api (18 checks — SFINAE compile-time gates, runtime defaults, RAII restoration, regression guards on every other documented EngineOptions default).
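The resolver policy above can be sketched as a pure function. The enum values and the argument order follow the description in this round; the exact fallback rules (auto maps to the legacy boolean, any unsupported dtype degrades to F32, out-of-range throws) are this sketch's reading of the text, and the `enum class` spelling and `_sketch` name are assumptions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class kv_attn_dtype : int { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

inline kv_attn_dtype resolve_kv_attn_type_sketch(int requested,
                                                 bool legacy_use_f16_attn,
                                                 bool supports_f16,
                                                 bool supports_bf16,
                                                 bool supports_q8_0) {
    // Loud failure for CLI typos.
    if (requested < -1 || requested > 3) {
        throw std::invalid_argument("kv_attn_type out of range: " + std::to_string(requested));
    }
    kv_attn_dtype want = static_cast<kv_attn_dtype>(requested);
    // Auto falls back to the legacy --f16-attn boolean.
    if (want == kv_attn_dtype::autoselect) {
        want = legacy_use_f16_attn ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
    }
    // Probe-gated: an unsupported dtype degrades to F32 instead of crashing.
    switch (want) {
        case kv_attn_dtype::f16:  return supports_f16  ? want : kv_attn_dtype::f32;
        case kv_attn_dtype::bf16: return supports_bf16 ? want : kv_attn_dtype::f32;
        case kv_attn_dtype::q8_0: return supports_q8_0 ? want : kv_attn_dtype::f32;
        default:                  return kv_attn_dtype::f32;
    }
}
```

Because the function takes the probe results as plain booleans, the whole {requested × legacy × probe-mask} matrix can be swept on a CPU-only runner, and the result is never `autoselect`.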

`3c59e523` — Round 7: bench observability + voice cache + Vulkan env-var passthrough

Lowest impact-÷-risk round of those planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup.

Voice ttl/dp host cache (detail::voice_host_cache). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site.
Vulkan env-var passthrough: apply_vulkan_env_overrides(map) public helper + EngineOptions::vulkan_env_overrides field + --vulkan-prefer-host-memory / --vulkan-disable-coopmat2 / --vulkan-disable-bfloat16 / --vulkan-perf-logger / --vulkan-async-transfer / --vulkan-env KEY=VALUE CLI flags on all three binaries. ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. set_env_if_unset semantics so an operator-set env var still WINS over the EngineOptions override.
Bench ggml_backend_synchronize boundaries (--no-bench-sync opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware.
Bench per-denoise-step breakdown (--bench-per-step). Times each supertonic_vector_step_ggml call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape.

Two new test executables (test-supertonic-voice-host-cache, test-supertonic-vulkan-env-overrides). TDD caught the env-key validator's empty-string-as-success bug BEFORE wiring went in.
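A sketch of the all-or-nothing, set-if-unset behaviour described for the env-var passthrough, assuming a POSIX platform (`setenv` with `overwrite = 0`). The validator uses a `bool` return plus an out-parameter so that an empty string can never double as the success value; the function names and the specific key rules (non-empty, no `=`) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <stdexcept>
#include <string>

// Returns true on success; on failure returns false and fills `error`.
// A bool result avoids the "empty string means OK" sentinel collision.
inline bool validate_env_key(const std::string& key, std::string& error) {
    if (key.empty()) {
        error = "environment variable name is empty";
        return false;
    }
    if (key.find('=') != std::string::npos) {
        error = "environment variable name contains '=': " + key;
        return false;
    }
    return true;
}

inline void apply_env_overrides_sketch(const std::map<std::string, std::string>& overrides) {
    // Pass 1: validate every entry before touching the environment.
    std::string error;
    for (const auto& entry : overrides) {
        if (!validate_env_key(entry.first, error)) {
            throw std::invalid_argument(error);
        }
    }
    // Pass 2: set only variables that are not already present, so a value
    // exported by the operator wins over the override.
    for (const auto& entry : overrides) {
        ::setenv(entry.first.c_str(), entry.second.c_str(), /*overwrite=*/0);
    }
}
```

Splitting validation and application into two passes is what makes the operation all-or-nothing: a typo in the last entry cannot leave the first few variables already set.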

`5b166a79` — Round 8: front-block attn0 GPU bridge

Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site). Strict gating on front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0 — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without vector_rope_theta continue to take the host-rotate path.

The blit primitive parity gate already shipped with PR #16 (test-supertonic-graph-to-graph-blit); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact max_abs = 0.0).

`0fa1593c` — Round 9: style flash-attn GPU bridge

Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win).

vector_res_style_qkv_result extended with sq_gpu / sk_gpu / sv_gpu GPU handles, populated unconditionally by run_res_style_qkv_cache (cheap — no GPU sync; just ggml_graph_get_tensor lookups).
run_res_style_qkv_cache host-download gating: the 3 tensor_to_time_channel(...) downloads of sq / sk / sv are now gated on trace != nullptr. Production path skips them entirely. post stays unconditional — consumed by the next-stage run_style_residual_cache which still expects a host vector (cross-stage GPU bridge for post is deferred).
4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: !include_ggml_trace && sq_gpu && sk_gpu && sv_gpu → GPU bridge; otherwise legacy host bridge.

Strict TDD: parity test (test-supertonic-graph-to-graph-blit) extended with explicit style-shape coverage BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact max_abs = 0.0.

`38a67e45` — Round 10: per-step text-input upload-skip

After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is text_emb (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for style_v_in / kctx_in) into a reusable upload_skip_tracker helper and applies it to the front-block + 3 group caches.

CRITICAL CORRECTNESS HAZARD addressed: text_emb is a stack-local std::vector<float> in Engine::Impl::synthesize() (and bench loops), so its heap buffer is freed at the end of each synth. Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next allocation of the same size, so synth N+1 may have text_emb.data() == synth_N.text_emb.data() despite holding completely different data. A naive pointer-compare upload-skip would silently leak the prior synth's text-encoder embedding into the next synth's GPU buffer.

Mitigation: caller MUST invoke tracker.reset() at every synth boundary (current_step == 0). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it.
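The hazard and its mitigation can be reproduced with a small pointer-compare tracker. This is a sketch of the pattern only; the class name and the `needs_upload` method are hypothetical, and the real `upload_skip_tracker` API may differ.

```cpp
#include <cassert>
#include <cstddef>

class upload_skip_tracker_sketch {
public:
    // Returns true when the caller must upload, false when the same
    // (pointer, size) pair was already uploaded since the last reset().
    bool needs_upload(const void* data, std::size_t bytes) {
        if (valid_ && data == last_data_ && bytes == last_bytes_) {
            return false;
        }
        last_data_  = data;
        last_bytes_ = bytes;
        valid_      = true;
        return true;
    }

    // Must be called at every synth boundary: a recycled heap address
    // would otherwise look like "same data" and skip a required upload.
    void reset() {
        valid_      = false;
        last_data_  = nullptr;
        last_bytes_ = 0;
    }

private:
    const void* last_data_  = nullptr;
    std::size_t last_bytes_ = 0;
    bool        valid_      = false;
};
```

Pointer equality is only a proxy for content equality within one synth, where the buffer is known to be constant; across synths the proxy breaks, which is why the reset is part of the contract rather than an optional optimisation.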

Per-synth wins: 16 fewer host→GPU uploads + ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length).

test-supertonic-upload-skip-tracker (NEW, 7 functions, 41 checks) committed first, observed to fail compile, then implementation added.

`b54b7d43` — Round 11: packed-QK RoPE + GPU-bridge layout fix (CRITICAL CORRECTNESS)

Round 11 didn't add a new optimisation — it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying vector_rope_theta. The first end-to-end synth attempt (CPU OR Vulkan) aborted at GGML_ASSERT(HD == n_heads * head_dim) inside apply_rope_to_packed_qk, and even past that assertion every ggml_backend_tensor_copy(q_src, q_tc_in) on the GPU-bridge fast paths would have hit GGML_ASSERT(ggml_are_same_layout(src, dst)) because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's q_tc_in / k_tc_in / v_tc_in tensors expect.

Root cause: apply_rope_to_packed_qk (PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] channel-fastest-in-memory tensor. In fact the matmul (CPU cblas_sgemm and GPU conv1d_f32(K=1)) produces ne=[L, HD] with channel-major-flat memory — the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong [HD, L] shape, so the failure mode was invisible to CI.

The fix (strict TDD):

test_supertonic_rope_packed_qk.cpp rewritten under the production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
apply_rope_to_packed_qk (supertonic_internal.h): add a head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (which IS the layout q_tc_in expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar apply_rope's native layout AND q_tc_in's blit target bit-for-bit.
V (and the style sq/sk/sv) have no RoPE to mask the layout flip — open-code the same ggml_cont(ggml_transpose(...)) at the matmul output in build_group_graph_cache, ve_front_block_proj_cache, and build_res_style_qkv_cache so all four GPU-bridge attention sites get bit-for-bit matching layouts.
Legacy host-bridge fallbacks switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references read, so the raw download is the correct call.
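The two layouts involved in the fix can be made concrete with a host-side equivalent of `ggml_cont(ggml_transpose(q))`. The sketch relies on ggml's convention that `ne[0]` is the contiguous dimension; the function name is illustrative and this is not the graph-side code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// src: ne=[L, HD], channel-major-flat. Element (t, c) lives at src[c * L + t],
//      i.e. all L time steps of channel 0, then all of channel 1, and so on.
// dst: ne=[HD, L], time-major-flat.    Element (t, c) lives at dst[t * HD + c],
//      i.e. all HD channels of time step 0, then all of time step 1, and so on.
inline std::vector<float> channel_major_to_time_major(const std::vector<float>& src,
                                                      std::size_t L,
                                                      std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

With L = 2 and HD = 3, the channel-major input `{0, 1, 10, 11, 20, 21}` becomes `{0, 10, 20, 1, 11, 21}`: the same six values, but a blit between the two buffers without the transpose would scramble channels and time steps.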

Verification:

| Backend | Pre-fix | Post-fix |
| --- | --- | --- |
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74× realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7× realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have — they just couldn't run before round 11 unblocked the production path.

`bb99d3ce` — Round 12: auto-pick UMA bias + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware.

#10 — Auto-pick UMA bias

Round 3's argmax(free_vram) picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40× realtime regression for any operator following the help text "auto-pick adapter with most free VRAM".

Extended resolve_vulkan_device_index with an optional third arg is_uma_per_device. Empty UMA list → round-3 behaviour preserved verbatim. Non-empty + at least one discrete → argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit requested >= 0 passthrough is UMA-agnostic.
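A pure-logic sketch of the auto-pick policy with the UMA bias, following the behaviour matrix described above. Throwing on an empty device list and checking the UMA-list length only on the auto-pick path are assumptions of this sketch, and the `_sketch` name marks it as illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

inline int resolve_vulkan_device_index_sketch(int requested,
                                              const std::vector<std::uint64_t>& free_vram,
                                              const std::vector<bool>& is_uma = {}) {
    if (requested < -1) throw std::invalid_argument("reserved negative device index");
    if (free_vram.empty()) throw std::runtime_error("no Vulkan devices");  // assumption

    // Explicit index: pass through after a range check, ignoring UMA flags.
    if (requested >= 0) {
        if (static_cast<std::size_t>(requested) >= free_vram.size()) {
            throw std::out_of_range("device index out of range");
        }
        return requested;
    }

    // Auto-pick (-1). An empty UMA list preserves the plain argmax.
    if (!is_uma.empty() && is_uma.size() != free_vram.size()) {
        throw std::invalid_argument("is_uma length does not match device count");
    }
    bool any_discrete = false;
    for (bool uma : is_uma) any_discrete = any_discrete || !uma;
    const bool discrete_only = !is_uma.empty() && any_discrete;

    int best = -1;
    for (std::size_t i = 0; i < free_vram.size(); ++i) {
        if (discrete_only && is_uma[i]) continue;
        // Strict '>' keeps the lower index on ties (deterministic).
        if (best < 0 || free_vram[i] > free_vram[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

On a hybrid rig where device 0 is a UMA iGPU reporting 120 GB and device 1 is a discrete card reporting 32 GB, the call without UMA flags picks index 0, while the call with flags `{true, false}` picks index 1.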

Caller wiring (in init_supertonic_backend) collects UMA flags via the public ggml_backend_dev_get_props() API on ggml_backend_vk_reg() — sets is_uma = true for GGML_BACKEND_DEVICE_TYPE_IGPU / _CPU / _ACCEL.

test_supertonic_vulkan_device_select.cpp extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-index-ignores-UMA-bias, mismatched-length-throws).

#6 — Text-encoder speech-prompted-attention GPU bridge

Master's Metal-port branch (PR #15) built speech_prompted_merged_cache (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download → pack → re-upload between the QKV cache and the flash-attn cache.

Round 12 #6 adds run_speech_prompted_merged_cache and the dispatch in speech_prompted_attention_ggml. Eliminates per call: 2 GPU→host downloads + 3 host→GPU uploads + 1 graph dispatch + all host pack work = 5 sync points × 2 layers = 10 sync points / synth at the text encoder alone.

CPU stays on the legacy two-cache path: master's dense_matmul_time_ggml CPU fast path uses cblas + the host-side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU).

test_supertonic_text_encoder_gpu_bridge.cpp (NEW) pins the symbol via SFINAE + struct field contract + a free-default-cache trip-wire (catches a buggy free path that segfaults on never-built thread_local cache slots at process exit). 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests.

#5 — Pinned-host-buffer per-step input scratchpad

Round 3 shipped the capability probe; the actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 #5 lands the helper try_alloc_inputs_in_pinned_host_buffer.

Returns nullptr on null model.backend / null input_ctx / non-Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from ggml_backend_vk_host_buffer_type() via ggml_backend_alloc_ctx_tensors_from_buft. Caller owns the returned buffer; frees at cache destruction.

Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: vector_group_graph_cache (× 3 for g1/g2/g3) and ve_front_block_graph_cache. Total: 9 per-step input tensors moved to host-pinned memory. Each ggml_backend_tensor_set on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy).

CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains.

test_supertonic_pinned_host_buffer.cpp (NEW) — 11 / 11 CPU-only checks pass.

Combined perf snapshot on RTX 5090

Long-prompt bench (173 chars, ~15s of audio):

Round 11 baseline: 76.11 ms / 5 steps (123× realtime)
Round 12 (all three): 27.99 ms / 5 steps (537× realtime) — 2.7× faster
Vector-estimator step: 12.7 ms → 3.28 ms (3.9× faster)
Prewarm cold-start: 330 ms → 21 ms (15× faster)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):

Round 11 --vulkan-device -1: picks RADV → 178 ms (7× realtime)
Round 12 --vulkan-device -1: picks RTX 5090 → 28 ms (537× realtime) — 6.4× faster for users following help text

`b9f95358` — Round 13: code-quality consolidation + Q8_0 K/V finding

Strict-improvement-only follow-up to round 12: no code path is removed, no optimisation is rolled back, end-to-end perf on every backend stays at the round-12 level. Two deliverables, both no-regret:

1. New helper `alloc_input_scratchpad_or_throw`

Round 12 #5 inlined the "try pinned-host first, fall back to default backend buffer, throw on both-fail" idiom at 4 cache sites (front block + 3 group caches). Round 13 factors it into one helper. Same correctness contract — CPU / Metal / OpenCL fall back to default backend buffer; Vulkan tries pinned-host first. Defensive failure modes consolidated: null model.backend, null input_ctx, null cache_name all throw std::runtime_error with a message that includes the cache name, instead of segfaulting in an error-handler path. Single point of maintenance for the pattern; future cache builds that want pinned-host inputs use the helper directly.
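The "try pinned-host first, fall back to the default buffer, throw if both fail" idiom can be sketched independently of ggml by passing the two allocation attempts as callables. Everything here is a hypothetical stand-in: `buffer_handle` replaces the real backend buffer type, and the signature is not that of `alloc_input_scratchpad_or_throw`.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a backend buffer handle.
using buffer_handle = void*;
using alloc_fn = std::function<buffer_handle()>;

inline buffer_handle alloc_scratchpad_or_throw_sketch(const char* cache_name,
                                                      const alloc_fn& try_pinned_host,
                                                      const alloc_fn& alloc_default) {
    // Defensive checks throw with context instead of crashing later.
    if (cache_name == nullptr) {
        throw std::runtime_error("alloc_scratchpad: null cache_name");
    }
    const std::string prefix = std::string("alloc_scratchpad[") + cache_name + "]: ";
    if (!alloc_default) {
        throw std::runtime_error(prefix + "no default allocator");
    }
    // Preferred path: returns nullptr on backends without pinned host memory.
    if (try_pinned_host) {
        if (buffer_handle pinned = try_pinned_host()) return pinned;
    }
    // Fallback path: the default backend buffer.
    if (buffer_handle fallback = alloc_default()) return fallback;
    throw std::runtime_error(prefix + "pinned-host and default allocation both failed");
}
```

Including the cache name in every error message is what lets an operator tell which of the several cache sites failed without attaching a debugger.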

test_supertonic_input_scratchpad.cpp (NEW, 9 / 9 checks) pins the contract via SFINAE on the symbol + CPU-fallback round-trip through ggml_backend_tensor_set / get + null-arg throws + empty-ctx error message includes the cache name. CPU-only — no GGUF fixture required.

Perf impact: zero — same code path, same allocations, same data movement, just one fewer level of nesting at each call site.

2. Q8_0 K/V no-win documented for RTX 5090

Round 4 shipped the --kv-attn-type q8_0 CLI option and bench output advertises q8_0_kv_attn=available. Round 13 measures the trade-off on the test rig (RTX 5090, 1.79 TB/s memory bandwidth, long prompt 206 chars / 18 s audio):

| `--kv-attn-type` | Total | Realtime ratio |
| --- | --- | --- |
| `f16` (default) | 31.11 ms | 588× |
| `q8_0` | 31.84 ms | 575× (2 % slower) |

The F32→Q8_0 cast overhead exceeds the saved K/V upload bandwidth on a high-bandwidth discrete GPU. Operator guidance: stick with the F16 default on RTX 5090 and similar high-bandwidth discretes. Q8_0 is shipped for adapters where the K/V upload bottlenecks the synth (older PCIe 3.0, lower-end discretes, iGPUs with slow BAR); cross-over point to be measured per-adapter by operators using --bench-per-step from round 7.
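For a rough sense of the bandwidth side of this trade-off, the storage cost per K/V element can be worked out directly. The sketch assumes ggml's standard Q8_0 block layout (32 int8 values plus one F16 scale, 34 bytes per block); the element count in the usage note is an illustrative figure, not a measured Supertonic shape.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kQ8BlockElems = 32;  // elements per Q8_0 block
constexpr std::size_t kQ8BlockBytes = 34;  // 32 x int8 + one 2-byte F16 scale

constexpr std::size_t bytes_f32(std::size_t n) { return n * 4; }
constexpr std::size_t bytes_f16(std::size_t n) { return n * 2; }

// Rounds up to whole blocks; quantised rows are stored in full blocks.
constexpr std::size_t bytes_q8_0(std::size_t n) {
    return ((n + kQ8BlockElems - 1) / kQ8BlockElems) * kQ8BlockBytes;
}
```

For 8192 elements this gives 32768 bytes in F32, 16384 bytes in F16 and 8704 bytes in Q8_0, so Q8_0 moves about 53 % of the F16 payload. Whether that saving outweighs the extra cast is exactly what the table above measures, and on a high-bandwidth discrete GPU it did not.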

Backwards-compatibility contract

Every round preserves the existing operator-config baseline:

--f16-attn 0|1 semantics unchanged — round 4's --kv-attn-type auto (the default) falls back to --f16-attn via the resolver.
--vulkan-device 0 semantics unchanged — round 1 introduced the flag; round 3's -1 is opt-in only; round 12's UMA-bias only activates on hybrid rigs and never overrides an explicit index.
--f16-weights 0|1 semantics unchanged — round 6's --f16-weights-deny is opt-in only and has no effect when --f16-weights 0.
--prewarm defaults to empty (no-op).
--vulkan-env / --vulkan-prefer-host-memory / --vulkan-disable-coopmat2 etc. (round 7) all default off; an operator-set env var still wins over the EngineOptions override.
--bench-per-step / --no-bench-sync (round 7) default off; legacy JSON shape preserved on the default path.
model.use_f16_attn boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
All round-1 / round-3 probes throw on out-of-range CLI input (loud failure for actual config errors); all round-2 / round-3 / round-4 probe-gated dispatches fall back to F32 silently (advisory-probe contract — visible in bench output so operators can confirm a fallback).
All GPU-bridge fast paths (rounds 8 / 9 / 12 #6) gate on !include_ggml_trace, so the trace harness still captures pre-attention Q/K/V host vectors.
Round-10 upload-skip is gated on tracker.reset() at every synth boundary; without the reset, the tracker behaves identically to a no-op (each call uploads).
Round-11 layout-flip is universally applied, so the legacy host-bridge fallback continues to work bit-for-bit on backends that don't activate the GPU bridge.
Round-12 #5 / round-13 helper safely return nullptr on non-Vulkan backends; no allocator behaviour change for CPU / Metal / OpenCL.
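The "operator-set env var still wins" rule for the round-7 overrides can be sketched with POSIX `setenv` and `overwrite = 0`. This is an illustration under assumptions, not the PR's `apply_vulkan_env_overrides`: the function name, the validation rule, and the variable names in the usage note are hypothetical, and the sketch is POSIX-only.

```cpp
#include <stdlib.h>  // setenv (POSIX)
#include <string>
#include <utility>
#include <vector>

using EnvOverrides = std::vector<std::pair<std::string, std::string>>;

// Validate every entry before touching the environment (all-or-nothing),
// then set each variable only if the operator has not already exported it.
inline bool apply_env_overrides_sketch(const EnvOverrides& overrides) {
    for (const auto& kv : overrides) {
        // Assumed validation rule: names must be non-empty and contain no '='.
        if (kv.first.empty() || kv.first.find('=') != std::string::npos) {
            return false;  // nothing has been set at this point
        }
    }
    for (const auto& kv : overrides) {
        // overwrite = 0 keeps any value already present in the environment.
        if (setenv(kv.first.c_str(), kv.second.c_str(), 0) != 0) {
            return false;  // setenv failure is reported, not rolled back
        }
    }
    return true;
}
```

With an operator-exported `SKETCH_VK_A=operator`, calling `apply_env_overrides_sketch({{"SKETCH_VK_A", "engine"}})` leaves the operator's value in place.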

Test plan

CPU-only — a fresh checkout's ctest -L unit exercises every new contract without needing a Vulkan adapter.

cmake -S tts-cpp -B build-tts -DTTS_CPP_USE_SYSTEM_GGML=OFF
cmake --build build-tts --parallel
ctest --test-dir build-tts -L unit --output-on-failure

Expected: 25 / 25 tests, 0 failures, 0 regressions.

Vulkan build (same expectations):

cmake -S tts-cpp -B build-tts-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
cmake --build build-tts-vulkan --parallel
ctest --test-dir build-tts-vulkan -L unit --output-on-failure

Test	Purpose	Round
`test-supertonic-vulkan-dispatch`	Backend-flag dispatch through `supertonic_op_dispatch_scope` + F16-K/V probe smoke	1
`test-supertonic-portable-ops` (UPDATED)	LEAKY_RELU decomposition path stays exercised when the helper short-circuits to the native fused op	1
`test-supertonic-capability-cache`	Probe-counter regression (1 cache miss + N hits per backend) + new-probe coverage	2 + 3
`test-supertonic-warm-up-api`	SFINAE compile-time gate for `Engine::warm_up` + `EngineOptions::prewarm_text`	2
`test-supertonic-vulkan-device-select`	`resolve_vulkan_device_index` behaviour matrix (extended in r12 with UMA-bias coverage)	3 + 12
`test-supertonic-f16-weights` (UPDATED)	Round 6 deny-list overload — 7 new functions / 29 new checks	6
`test-supertonic-f16-deny-list-api`	SFINAE compile-time gate for `EngineOptions::f16_weights_deny_list`	6
`test-supertonic-kv-attn-type`	`resolve_kv_attn_type` behaviour matrix (full {requested × legacy × probe-mask} sweep, 106 checks)	4
`test-supertonic-kv-attn-type-api`	SFINAE compile-time gates for the round-4 enum + EngineOptions field	4
`test-supertonic-f16-attn-parity` (UPDATED)	F16 + BF16 K/V parity vs F32 reference on both hot shapes	4
`test-supertonic-voice-host-cache`	Voice ttl/dp host cache lookup-or-load semantics + reference stability	7
`test-supertonic-vulkan-env-overrides`	All-or-nothing env-var validator + set-if-unset semantics	7
`test-supertonic-graph-to-graph-blit` (UPDATED)	Front-block + style + group attention shapes, bit-exact `max_abs = 0.0`	8 + 9
`test-supertonic-upload-skip-tracker`	Pointer-compare upload-skip + cross-synth pointer-reuse hazard test (41 checks)	10
`test-supertonic-rope-packed-qk` (REWRITTEN)	RoPE helper under production `[L, HD]` matmul layout, bit-exact vs scalar `apply_rope`	11
`test-supertonic-text-encoder-gpu-bridge`	`run_speech_prompted_merged_cache` SFINAE + struct contract + free-default trip-wire	12
`test-supertonic-pinned-host-buffer`	`try_alloc_inputs_in_pinned_host_buffer` nullptr safety + non-Vulkan fallback	12
`test-supertonic-input-scratchpad`	`alloc_input_scratchpad_or_throw` SFINAE + CPU-fallback round-trip + null-arg throws	13
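Several rows above mention an "SFINAE compile-time gate". A minimal sketch of that technique, assuming C++17 and using hypothetical option structs rather than the real `EngineOptions`:

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Primary template: the field is assumed absent.
template <typename T, typename = void>
struct has_prewarm_text : std::false_type {};

// Chosen only when `T{}.prewarm_text` is a well-formed expression.
template <typename T>
struct has_prewarm_text<
    T, std::void_t<decltype(std::declval<T&>().prewarm_text)>>
    : std::true_type {};

// Hypothetical stand-ins for the options struct before and after the field.
struct OptionsWithField    { std::string prewarm_text; };
struct OptionsWithoutField { int unrelated = 0; };

// The gate itself: compilation fails if the field is ever removed or renamed.
static_assert(has_prewarm_text<OptionsWithField>::value,
              "options must expose prewarm_text");
static_assert(!has_prewarm_text<OptionsWithoutField>::value,
              "detector must reject types without the field");
```

Because the check runs at compile time, a test built this way needs neither a GGUF fixture nor a GPU.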

Smoke testing the CLIs

# Help text on all three binaries (round-4 + round-7 flags visible)
./build-tts/supertonic-cli --help 2>&1 | grep -A 6 kv-attn-type
./build-tts/tts-cli        --help 2>&1 | grep -B1 -A 6 vulkan-env
./build-tts/supertonic-bench       2>&1 | grep -A 5 bench-per-step

# Invalid value surfaces cleanly (no backtrace)
./build-tts/supertonic-cli --model /tmp/x.gguf --text x --out x.wav --kv-attn-type bogus
# -> "error: --kv-attn-type expects one of: auto, f32, f16, bf16, q8_0 (got: bogus)"
# -> exit 2

# Full round-1..13 surface
./build-tts/supertonic-cli --model models/supertonic2.gguf --text "Hello" --out /tmp/out.wav \
  --vulkan-device -1 --kv-attn-type bf16 --f16-weights 1 \
  --f16-weights-deny vector_estimator.attention.W_v --prewarm "Warm up text." \
  --vulkan-prefer-host-memory --vulkan-disable-coopmat2

End-to-end real-Vulkan validation

Verified on 4 backends after round 11 unblocked the production path:

Backend	Result	Latency / 5 steps
CPU	writes 3.89 s WAV	(reference)
Vulkan RTX 5090	writes 6.53 s WAV	28 ms / 537–588× realtime (round 12+)
Vulkan AMD RADV iGPU	writes 3.64 s WAV	178 ms / 7× realtime
Vulkan Mesa lavapipe	writes 1.21 s WAV	(CPU-emulated)

Bench JSON includes "kv_attn_type" (resolved) + "kv_attn_type_requested" (raw int) + "prewarm_ms" + per-step timings (--bench-per-step) so a probe miss / cold-start cost / per-step regression is visible in the output and CI scripts can attribute drift / perf differences to the right cause.

File-by-file change summary

30 files changed, 8950 insertions(+), 331 deletions(-)

File	Δ	Notes
`tts-cpp/CMakeLists.txt`	+184	Wire 12 new test executables + Vulkan link option
`tts-cpp/PROGRESS_SUPERTONIC.md`	+1377	Per-round audit + measurement log
`tts-cpp/include/tts-cpp/supertonic/engine.h`	+137	New `EngineOptions` fields: `vulkan_device`, `prewarm_text`, `f16_weights_deny_list`, `kv_attn_type`, `vulkan_env_overrides`, `bench_per_step` + `Engine::warm_up()`
`tts-cpp/src/chatterbox_cli.cpp`	+118	All round flags mirrored on the `tts-cli` alias
`tts-cpp/src/chatterbox_tts.cpp`	+1	`#include <atomic>` (pre-existing missing-include fix)
`tts-cpp/src/supertonic_bench.cpp`	+397	All round flags + bench-output surface (human + JSON) + per-step + sync-boundary + voice-cache-stats
`tts-cpp/src/supertonic_cli.cpp`	+73	All round flags + try/catch arg-parse hardening
`tts-cpp/src/supertonic_engine.cpp`	+145	Probe-gated `use_f16_weights` auto-policy, multi-device auto-pick wiring (with UMA bias), `warm_up` impl, round-4 K/V dispatch resolution, voice-cache integration, env-var passthrough
`tts-cpp/src/supertonic_gguf.cpp`	+1151	Capability-cache implementation, 6 new probes, `resolve_vulkan_device_index` (with UMA bias), `resolve_kv_attn_type`, multi-device auto-pick, dispatch-scope rounds 1–13 plumbing, deny-list integration, pinned-host-buffer helper, `alloc_input_scratchpad_or_throw`
`tts-cpp/src/supertonic_internal.h`	+866	New `kv_attn_dtype` enum, model fields, probe forwarders, resolvers, dispatch-scope extension, `voice_host_cache`, `upload_skip_tracker`, GPU-bridge tensor handles, packed-QK RoPE layout fix
`tts-cpp/src/supertonic_text_encoder.cpp`	+152	`run_speech_prompted_merged_cache` + dispatch in `speech_prompted_attention_ggml` (round-12 #6)
`tts-cpp/src/supertonic_vector_estimator.cpp`	+718	Round-4 enum-switch dispatch site, cache-key promotion, GPU-bridge front-block + style + group rewires (rounds 8 / 9), upload-skip tracker integration (round 10), pinned-host-buffer per-step inputs (round 12 #5), layout fixes for round-11 GPU-bridge blits
`tts-cpp/test/test_supertonic_capability_cache.cpp`	NEW (+424)	Round 2 + extended in round 3
`tts-cpp/test/test_supertonic_f16_attn_parity.cpp`	+162	Prereq B BF16 extension
`tts-cpp/test/test_supertonic_f16_deny_list_api.cpp`	NEW (+134)	Round 6
`tts-cpp/test/test_supertonic_f16_weights.cpp`	+147	Round 6 deny-list extension
`tts-cpp/test/test_supertonic_graph_to_graph_blit.cpp`	+28	Round 8 + 9 front-block + style shape coverage
`tts-cpp/test/test_supertonic_input_scratchpad.cpp`	NEW (+296)	Round 13
`tts-cpp/test/test_supertonic_kv_attn_type.cpp`	NEW (+256)	Round 4 (106 checks)
`tts-cpp/test/test_supertonic_kv_attn_type_api.cpp`	NEW (+157)	Round 4
`tts-cpp/test/test_supertonic_pinned_host_buffer.cpp`	NEW (+236)	Round 12 #5
`tts-cpp/test/test_supertonic_portable_ops.cpp`	+10	Round 1 — explicit `use_native_leaky_relu = false` on the GPU fixture
`tts-cpp/test/test_supertonic_rope_packed_qk.cpp`	REWRITTEN (+244 / -93)	Round 11 — production `[L, HD]` matmul layout
`tts-cpp/test/test_supertonic_text_encoder_gpu_bridge.cpp`	NEW (+216)	Round 12 #6
`tts-cpp/test/test_supertonic_upload_skip_tracker.cpp`	NEW (+300)	Round 10 (41 checks)
`tts-cpp/test/test_supertonic_voice_host_cache.cpp`	NEW (+285)	Round 7
`tts-cpp/test/test_supertonic_vulkan_device_select.cpp`	NEW (+403)	Round 3 + extended in round 12 (UMA-bias coverage)
`tts-cpp/test/test_supertonic_vulkan_dispatch.cpp`	NEW (+268)	Round 1
`tts-cpp/test/test_supertonic_vulkan_env_overrides.cpp`	NEW (+278)	Round 7
`tts-cpp/test/test_supertonic_warm_up_api.cpp`	NEW (+118)	Round 2

Deferred follow-ups (intentionally out of scope)

Tracked in tts-cpp/PROGRESS_SUPERTONIC.md "Deferred work" section:

Persistent VkPipelineCache: recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by <vendorID>-<deviceID>-<driverVersion> and rooted at $XDG_CACHE_HOME/ggml/vulkan. This is a ggml-vulkan internal patch (~199 lines) that benefits all Vulkan workloads, not just Supertonic; tracked separately so the supertonic-specific PR stays reviewable. Round-2's --prewarm is an in-process workaround; the persistent on-disk cache extends the win across process restarts.
Cross-stage GPU bridge for post (round 9 follow-up): the post output of run_res_style_qkv_cache is still downloaded to host and re-uploaded into run_style_residual_cache. Would eliminate ~20 more sync points / synth. Deferred until measured impact justifies the dual-graph refactor.
Q8_0 K/V cross-over measurement: round 13 documents Q8_0 is a 2 % regression on RTX 5090; cross-over point to be measured per-adapter (older PCIe 3.0, low-end discretes, iGPUs with slow BAR) by operators using --bench-per-step from round 7.
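For the deferred persistent pipeline cache, the key and root directory described above might be combined as in the following sketch. The decimal formatting, the `.bin` suffix, the `~/.cache` fallback (the XDG default when `XDG_CACHE_HOME` is unset), and the function name are all assumptions; the actual ggml-vulkan patch may differ.

```cpp
#include <cstdint>
#include <cstdlib>
#include <filesystem>
#include <string>

// Build <root>/ggml/vulkan/<vendorID>-<deviceID>-<driverVersion>.bin, where
// <root> is $XDG_CACHE_HOME, else $HOME/.cache, else the system temp dir.
inline std::filesystem::path pipeline_cache_path_sketch(std::uint32_t vendor_id,
                                                        std::uint32_t device_id,
                                                        std::uint32_t driver_version) {
    namespace fs = std::filesystem;
    const char* xdg  = std::getenv("XDG_CACHE_HOME");
    const char* home = std::getenv("HOME");
    fs::path root;
    if (xdg && *xdg) {
        root = xdg;
    } else if (home && *home) {
        root = fs::path(home) / ".cache";
    } else {
        root = fs::temp_directory_path();
    }
    // Any of the three IDs changing (e.g. a driver update) yields a new file.
    const std::string key = std::to_string(vendor_id) + "-" +
                            std::to_string(device_id) + "-" +
                            std::to_string(driver_version);
    return root / "ggml" / "vulkan" / (key + ".bin");
}
```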

Linked

Asana: QVAC-18605 [TTS GGML] Add and optimize Vulkan for supertonic
Stacks on: PR #16 (QVAC-18607 OpenCL bring-up + audit follow-ups)
Reference: chatterbox.cpp's PROGRESS.md OpenCL / Vulkan optimization log

GustavoA1604

Please help address/clarify the following:

Round 5 is skipped — no explanation

The summary says "twelve rounds of Vulkan-specific deltas (rounds 1–13, round 5 skipped)" but nowhere in the PR is there an explanation of what round 5 was or why it was skipped. Was it superseded by another round? Rolled into a different commit? Abandoned after testing? This leaves a gap in the audit log that makes it harder to assess whether the omission is safe or whether something was quietly dropped.

The round-11 fix is redone in PR #21

PR #21 is a standalone fix for the same apply_rope_to_packed_qk layout bug fixed in round 11 here, but targeting supertonic_optimizations (without Vulkan). The PR description acknowledges the bug came from PR #16. What's unclear is the merge strategy: does PR #18 subsume PR #21 when it lands, or will both be merged separately and cause a double-application of the fix? The V-transpose fix in PR #21 also says it only touches 2 GPU-bridge call sites, while round 11 here touches 4 (build_group_graph_cache, ve_front_block_proj_cache, build_res_style_qkv_cache, and style sq/sk/sv). The difference needs to be reconciled before either merges.

UMA bias heuristic is fragile on some device topologies

The round-12 fix (resolve_vulkan_device_index with is_uma_per_device) picks the discrete adapter by excluding GGML_BACKEND_DEVICE_TYPE_IGPU / _CPU / _ACCEL. This works for the RTX 5090 + AMD RADV iGPU test case. However, on machines where the discrete GPU is the only device and reports GGML_BACKEND_DEVICE_TYPE_IGPU (some Thunderbolt eGPUs, some ARM SoC configurations), the "all-UMA fallback" path would fire and argmax(free_vram) would still pick the right device. That's correct by the test matrix. But if someone has two UMA iGPUs and one discrete that also happens to report IGPU type due to a driver quirk, they'd silently get the wrong device with no warning. The existing test cases don't cover this; it might be worth a code comment documenting the assumption.

Voice host cache reference stability — documented but not enforced

Round 7 introduces voice_host_cache and documents that "reference-stability contract [is] documented for the synthesis-pipeline call site." The test pins the contract via CPU-only checks. However, if a synthesizer call happens concurrently (e.g., from a thread pool or the iOS scenario described in the iOS concurrency fix commit), and the cache is evicted or a new voice is loaded mid-synthesis, the reference would dangle. The PR doesn't show any locking on the cache access path. Given that the iOS race fixes landed in the same PR history (the 36a2c56 commit fixing the gguf_init_from_file race), this deserves explicit scrutiny: is voice_host_cache accessed under any lock, or is it the caller's responsibility to ensure single-threaded access?

…sumption + voice cache threading + round-5 gap

Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR #18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 #5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp, resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3; operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded-per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per-call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated.

Reviewer's fourth ask ("the round-11 fix is redone in PR #21") was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR #21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR #21 in the new base.

Co-authored-by: Cursor <cursoragent@cursor.com>

Zbig9000 · 2026-05-18T10:32:21Z

Reply 1 — "Round 5 is skipped — no explanation"
Good catch — fixed. Round 5 was a planning placeholder, not abandoned code. The round-4 plan in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md reserved the name "Round 5 = pinned-host-buffer per-step uploads" as the next deliverable. We deferred it because the plan itself called out a hard prerequisite: round 7's bench observability was needed to measure the win and verify no regression on adapters where pinned-host turns out slower. After landing rounds 6, 7, 8, 9, 10, 11 we came back to the pinned-host-buffer work and shipped it as round 12 #5 (bundled with two other items: the auto-pick UMA bias fix and the text-encoder GPU-bridge wiring — see the round-12 commit message and the #5 sub-section in PROGRESS_SUPERTONIC.md round-12 entry).

The contiguous round-12 / round-13 numbering (instead of retroactively renaming round 12 to "round 5 (delayed)") is deliberate: the commit hashes referenced in PR descriptions and CI logs match the round labels in PROGRESS_SUPERTONIC.md without rebase churn.

Added an explicit "Note on the round 5 gap" section in PROGRESS_SUPERTONIC.md between round 4 and round 7 so the audit log makes this unambiguous.

Reply 2 — "The round-11 fix is redone in PR #21"
Resolved by today's rebase. PR #21 was the canonical fix for QVAC-18966 (cherry-picked from this branch's round 11 and retargeted at supertonic_optimizations without the Vulkan rounds). PR #21 covers the 2 GPU-bridge call sites that exist on the Vulkan-free branch (build_group_graph_cache + the front-block path in supertonic_vector_trace_proj_ggml).

This PR's round 11 originally covered 4 sites: the same 2 sites PR #21 covers + 2 more (ve_front_block_proj_cache's V transpose for round 8's front-block GPU bridge + build_res_style_qkv_cache's sq/sk/sv transposes for round 9's style GPU bridge). Those 2 extras only matter when the Vulkan-only round-8/9 GPU bridges are wired — which is why PR #21's narrower scope was correct for the non-Vulkan branch.

Merge strategy after rebase: PR #21 is already in supertonic_optimizations. I just rebased this branch onto the new base, and the round-11 commit (ef266e4) is now a delta of only the 2 Vulkan-only V-transpose sites + comment merges. No double-application: the QVAC-18966 fix is applied exactly once via PR #21 in the new base. Verified: CPU + Vulkan ctest -L unit 25/25 PASS post-rebase; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples).

Reply 3 — "UMA bias heuristic is fragile on some device topologies"
Agreed, and the failure mode is real. Mitigations actually in code today:

Empty UMA-flag list → falls back to round-3 argmax(free_vram) (unchanged behaviour for callers that haven't wired the UMA flags).
All-UMA list → also falls back to round-3 argmax over all devices (preserves backward-compat).
Explicit --vulkan-device N → UMA-agnostic passthrough; operator-pinned index always wins.
--vulkan-perf-logger → exposes the chosen device in the bench JSON for post-mortem.
The edge cases you flagged broken down:

Single discrete reporting _IGPU due to driver quirk: discrete is flagged UMA → excluded from the discrete-subset argmax → any_discrete == false → falls through to round-3 all-device argmax → discrete still picked by free-VRAM (correct outcome by coincidence on a single-discrete rig).
Mixed true UMA iGPU + mis-classified discrete: round-12 bias prefers the iGPU over the mis-classified discrete (silent regression vs. round 3). Operator escape hatch is --vulkan-device N + the perf-logger device dump for diagnosis.
Added a long comment in resolve_vulkan_device_index (in the requested == -1 branch) documenting all three sub-cases plus a future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be heuristically re-classified). Tracked for a follow-up round in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Commit: 1632e45.
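The fallback rules listed above can be sketched as a pure function. This is an approximation under assumptions, not the PR's `resolve_vulkan_device_index`: the name, the parameter types, the mismatched-length handling, and the exception types are illustrative.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// requested >= 0 : explicit index, passed through (UMA-agnostic).
// requested == -1: auto-pick by free VRAM, preferring non-UMA devices
//                  when the UMA flags are usable and any device is non-UMA.
inline int resolve_device_index_sketch(int requested,
                                       const std::vector<std::uint64_t>& free_vram,
                                       const std::vector<bool>& is_uma) {
    const int n = static_cast<int>(free_vram.size());
    if (n == 0) throw std::runtime_error("no devices");
    if (requested >= 0) {
        if (requested >= n) throw std::out_of_range("device index out of range");
        return requested;  // operator-pinned index always wins
    }
    if (requested != -1) throw std::out_of_range("invalid device index");

    // Empty (or mismatched) flag list: behave like the plain argmax.
    const bool flags_usable = is_uma.size() == free_vram.size();
    bool any_discrete = false;
    if (flags_usable) {
        for (bool uma : is_uma) any_discrete = any_discrete || !uma;
    }
    // All-UMA also falls back to the argmax over every device.
    const bool discrete_only = flags_usable && any_discrete;

    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (discrete_only && is_uma[i]) continue;
        // Strict '>' keeps the lower index when free VRAM ties.
        if (best < 0 || free_vram[i] > free_vram[best]) best = i;
    }
    return best;
}
```

Note that the function trusts `is_uma` as given, so a discrete adapter whose driver mis-reports its type is indistinguishable from a real iGPU at this layer; that is the assumption the code comment documents.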

Reply 4 — "Voice host cache reference stability — documented but not enforced"
Right call — the threading expectation needed to be spelled out. Today voice_host_cache is single-threaded by contract, not by lock. The Engine's documented threading model is single-threaded synthesis per Engine instance; concurrent synthesis requires one Engine per thread (each Engine carries its own voice_host_cache). This is the same model the iOS load/unload race fix 36a2c56 enforces for the s3gen preload path — they're consistent.

Why no internal lock today: the cache exists to eliminate per-call GPU downloads of ttl / dp (~2 sync points per synthesize() on Vulkan / OpenCL). Adding an internal mutex would give back a measurable chunk of that saving (an uncontended std::mutex lock+unlock pair is small but not free on the hot path of every synth). Since the existing iOS fix already mandates one-Engine-per-thread for concurrent synthesis, the cache inherits the same constraint at zero extra cost.

Standard unordered_map guarantee re: rehash: element references are NOT invalidated by insert (only iterators are). So even if a second voice loads mid-call on the same thread (impossible today, but allowed for completeness), a held entry& reference from a prior get_or_load survives. The only operations that can invalidate it are clear() / erase() on that entry, and clear() is only reachable on Engine destruction.

Strengthened the docstring in supertonic_internal.h with an explicit THREAD-SAFETY section documenting all of the above, including what a future thread-pool refactor would need (external mutex around get_or_load + the downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Commit: 1632e45.
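The reference-stability guarantee can be demonstrated with a small lookup-or-load cache. The class and field names below are hypothetical stand-ins for `voice_host_cache`; only the `std::unordered_map` behaviour is standard.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical host-side copy of a voice's ttl / dp tensors.
struct VoiceEntrySketch {
    std::vector<float> ttl;
    std::vector<float> dp;
};

// Single-threaded by contract: no internal lock, one instance per engine.
class VoiceHostCacheSketch {
public:
    // Returns a reference to the cached entry, calling `load` on a miss.
    // The reference stays valid across later inserts (even if the table
    // rehashes); only clear() or erasing this entry invalidates it.
    template <typename Loader>
    const VoiceEntrySketch& get_or_load(const std::string& voice, Loader&& load) {
        auto it = entries_.find(voice);
        if (it == entries_.end()) {
            it = entries_.emplace(voice, load(voice)).first;
            ++misses_;
        }
        return it->second;
    }

    std::size_t misses() const { return misses_; }
    void clear() { entries_.clear(); }  // invalidates all held references

private:
    std::unordered_map<std::string, VoiceEntrySketch> entries_;
    std::size_t misses_ = 0;
};
```

Holding the returned reference while loading many other voices is safe in this sketch; calling `clear()` while holding it is not, which is why the contract restricts `clear()` to teardown.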

ogad-tether

Review — Vulkan backend for Supertonic (PR #18)

Thorough review of the 8753-line addition across 30 files. The overall engineering quality is high — TDD discipline is genuine, the commit-per-round structure makes the evolution auditable, and the backwards-compatibility contract is well-documented. The PR is in good shape for merge with a few items to consider.

Findings

1. test_resolver_returns_concrete_only asserts too weakly (test_supertonic_kv_attn_type.cpp)

The exhaustive 5×2×8 resolver sweep only checks dt != kv_attn_dtype::autoselect. A typo in the resolver (e.g., returning f16 when bf16 was requested + supported) would pass this test silently. Consider spot-checking the "happy path" cases with exact enum comparisons — e.g., requested=2, supports_bf16=true → bf16.

2. test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors (test_supertonic_input_scratchpad.cpp)

The test allocates x_in (512B) and temb_in (256B) but only does a tensor_set/tensor_get round-trip on x_in. If the buffer allocation failed to bind temb_in, this test wouldn't catch it.

3. Probe-gated silent fallback vs explicit operator request (resolve_kv_attn_type, supertonic_gguf.cpp:1473-1478)

When an operator explicitly requests --kv-attn-type bf16 but the backend doesn't support it, the resolver silently falls back to F32. This is documented as intentional (advisory-probe contract), but a fprintf(stderr, "warning: ...") on the explicit-request + unsupported path would save operators from silently getting F32 when they thought they had BF16. The auto path (-1) correctly stays silent. The bench JSON does surface the resolved type, so it's partially observable already.

4. Minor: resolve_vulkan_device_index UMA-bias tiebreak within discrete subset (test_supertonic_vulkan_device_select.cpp)

The test for test_hybrid_prefer_discrete_over_uma uses devices with distinct VRAM sizes (32GB vs 120GB). The tiebreak case of two discrete cards with equal VRAM (should pick lower index) is not tested. Covered by the non-UMA auto-pick tests, but worth adding one UMA-specific tiebreak case for completeness.

5. cached_backend_capabilities returns const& through a lock boundary (supertonic_gguf.cpp:779)

The returned reference outlives the lock_guard. This is safe in production because unordered_map references aren't invalidated by insert, and clear() is test-only. But supertonic_clear_capability_cache() could create a dangling reference in multi-threaded test scenarios. If test code ever calls clear() while another thread holds a reference from cached_backend_capabilities, that's UaF. Low risk given single-threaded test execution today, but worth a comment.
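One way to remove the lifetime hazard in finding 5, should a production-reachable clear path ever appear, is to hand out shared ownership instead of a bare reference. The sketch below is an assumption about how that could look, not code from the PR; the struct fields borrow two probe names from the summary, everything else is hypothetical.

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical capability record.
struct CapabilitiesSketch {
    bool f16_mul_mat = false;
    bool pinned_host_buffer = false;
};

class CapabilityCacheSketch {
public:
    // Returns shared ownership, so the record outlives a concurrent clear().
    template <typename Probe>
    std::shared_ptr<const CapabilitiesSketch> get(int backend_id, Probe&& probe) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(backend_id);
        if (it == map_.end()) {
            // Probe runs once per backend (under the lock, for simplicity).
            it = map_.emplace(backend_id,
                              std::make_shared<const CapabilitiesSketch>(
                                  probe(backend_id))).first;
        }
        return it->second;
    }

    // Drops the map's ownership; callers holding a shared_ptr are unaffected.
    void clear() {
        std::lock_guard<std::mutex> lock(mu_);
        map_.clear();
    }

private:
    std::mutex mu_;
    std::unordered_map<int, std::shared_ptr<const CapabilitiesSketch>> map_;
};
```

The trade-off is an atomic reference-count update per lookup, which a `const&` return avoids; that cost is presumably why the reference-returning form is acceptable while `clear()` stays test-only.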

Positive observations

The TDD caught real bugs (V layout transpose, env-var empty-string sentinel, pointer-compare upload-skip). The commit messages document the red→green cycle with specific failure modes — this is exactly how TDD should be practiced on low-level GPU code.
The pure-logic resolver split (resolve_vulkan_device_index, resolve_kv_attn_type) makes the policy layer fully testable on CPU without a Vulkan adapter. Smart design.
Backwards-compatibility is meticulously maintained — every existing flag/default preserves its semantics.
The 25/25 CPU-only ctest suite catches regressions in the dispatch/capability/resolver contracts without needing GPU hardware in CI.
Performance results are impressive (2.4–2.7× end-to-end speedup, 15× prewarm improvement on RTX 5090).

None of the findings are merge-blockers. Items 1–2 are low-effort test improvements; items 3–5 are suggestions for consideration.

… tests + surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to nullptr; two test files extended; two doc-comment blocks added). No production-logic surface modified for existing callers.

Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV (rms=285.6, abs_max=4703 on both backends, same seed + text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR #18:

1. test_resolver_returns_concrete_only (kv_attn_type). The original exhaustive 5 x 2 x 8 sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested+supported would pass silently. Rewritten with a second pure-function `expected()` mirror of the resolver's matrix; every one of the 80 grid points now CHECKs the resolver's return value against the expected concrete dtype. Added cross-contamination spot checks (requesting bf16 with f16+q8_0 supported but bf16 NOT supported must fall to f32, not silently to f16 or q8_0). Now 205 checks passed in test-supertonic-kv-attn-type.

2. test_cpu_fallback_returns_valid_buffer (input_scratchpad). Original only round-tripped x_in (one of two allocated tensors). Now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f), plus a cross-aliasing recheck (after writing temb_in, x_in must still read back its original 1.0f); a binding-collision bug where both tensors share memory would now fail this check.

3. resolve_kv_attn_type silent fallback on explicit operator request. Added optional `bool * out_was_downgraded` output parameter to the resolver, set to true IFF the operator explicitly requested f16/bf16/q8_0 AND the corresponding backend probe returned false AND we therefore returned f32. The auto path (-1) leaves the flag false (no operator surprise: auto-policy is doing its job). Engine ctor + supertonic-bench wired to emit a one-line `fprintf(stderr, "warning: requested --kv-attn-type %s but the resolved backend's flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)")` on a downgrade. Defaulted nullptr keeps the pure-logic unit tests stderr-clean. New test_downgrade_flag_signal pins the contract on every relevant path (auto + missing probe -> flag false; explicit + matching probe -> flag false; explicit + missing probe -> flag true; nullptr out-ptr safe).

4. test_uma_aware_tiebreak_equal_vram_discretes (vulkan_device_select). Added a dedicated UMA-bias-active test case: two discrete cards with EQUAL VRAM (32 GB each) alongside a UMA iGPU. Pins three sub-cases: interleaved UMA in the middle, adjacent discretes with no UMA, three-way all-discrete tie. Lower index wins in every case. The existing test 11's second CHECK already covered the interleaved-UMA case; this hoists the contract into its own named test so a future refactor reading the test names knows the tiebreak case is pinned.

5. cached_backend_capabilities UaF risk under test-only clear(). Added a long comment on the function documenting the four invariants: (a) production callers may hold the returned ref across subsequent calls for OTHER backends (unordered_map's insert-doesn't-invalidate-references guarantee); (b) production callers MUST NOT keep the ref alive across a clear() call (test code's responsibility); (c) multi-threaded callers must externally synchronise deref vs. clear (the cache's lock protects map structure, NOT element lifetime); (d) if a future refactor adds a production-reachable erase / clear path, this function must switch to return-by-value or std::shared_ptr<const T>.

Co-authored-by: Cursor <cursoragent@cursor.com>
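The resolver contract in item 3 can be sketched as a pure function with an optional out-parameter. Names and most enum values are assumptions: the source only implies that auto is -1 and bf16 is 2, and the auto-path policy shown (F16 when the legacy flag is set and the probe passes) is a guess at the `--f16-attn` fallback.

```cpp
// Assumed numeric values; only autoselect = -1 and bf16 = 2 appear in the text.
enum class KvDtypeSketch : int { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// One bit per backend probe.
struct ProbeMaskSketch {
    bool f16 = false;
    bool bf16 = false;
    bool q8_0 = false;
};

// Always returns a concrete dtype. *out_was_downgraded is set to true only
// when an explicit non-f32 request was rejected by its probe.
inline KvDtypeSketch resolve_kv_sketch(KvDtypeSketch requested,
                                       bool legacy_f16_attn,
                                       ProbeMaskSketch probes,
                                       bool* out_was_downgraded = nullptr) {
    if (out_was_downgraded) *out_was_downgraded = false;

    bool supported = false;
    switch (requested) {
        case KvDtypeSketch::autoselect:
            // Auto path: follow the legacy flag, silently gated on the probe.
            return (legacy_f16_attn && probes.f16) ? KvDtypeSketch::f16
                                                   : KvDtypeSketch::f32;
        case KvDtypeSketch::f32:  return KvDtypeSketch::f32;
        case KvDtypeSketch::f16:  supported = probes.f16;  break;
        case KvDtypeSketch::bf16: supported = probes.bf16; break;
        case KvDtypeSketch::q8_0: supported = probes.q8_0; break;
    }
    if (supported) return requested;

    // Explicit request the backend cannot honour: fall back to f32 and
    // report it so the caller can print a warning.
    if (out_was_downgraded) *out_was_downgraded = true;
    return KvDtypeSketch::f32;
}
```

A caller would check the flag after resolving and emit the stderr warning itself, which keeps the resolver free of I/O and testable on CPU.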

Zbig9000 · 2026-05-19T08:36:43Z

Review — Vulkan backend for Supertonic (PR #18)

Thorough review of the 8753-line addition across 30 files. The overall engineering quality is high — TDD discipline is genuine, the commit-per-round structure makes the evolution auditable, and the backwards-compatibility contract is well-documented. The PR is in good shape for merge with a few items to consider.

Findings

1. test_resolver_returns_concrete_only asserts too weakly (test_supertonic_kv_attn_type.cpp)

The exhaustive 5×2×8 resolver sweep only checks dt != kv_attn_dtype::autoselect. A typo in the resolver (e.g., returning f16 when bf16 was requested + supported) would pass this test silently. Consider spot-checking the "happy path" cases with exact enum comparisons — e.g., requested=2, supports_bf16=true → bf16.

2. test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors (test_supertonic_input_scratchpad.cpp)

The test allocates x_in (512B) and temb_in (256B) but only does a tensor_set/tensor_get round-trip on x_in. If the buffer allocation failed to bind temb_in, this test wouldn't catch it.

3. Probe-gated silent fallback vs explicit operator request (resolve_kv_attn_type, supertonic_gguf.cpp:1473-1478)

When an operator explicitly requests --kv-attn-type bf16 but the backend doesn't support it, the resolver silently falls back to F32. This is documented as intentional (advisory-probe contract), but a fprintf(stderr, "warning: ...") on the explicit-request + unsupported path would save operators from silently getting F32 when they thought they had BF16. The auto path (-1) correctly stays silent. The bench JSON does surface the resolved type, so it's partially observable already.

4. Minor: resolve_vulkan_device_index UMA-bias tiebreak within discrete subset (test_supertonic_vulkan_device_select.cpp)

test_hybrid_prefer_discrete_over_uma uses devices with distinct VRAM sizes (32GB vs 120GB). The tiebreak case of two discrete cards with equal VRAM (which should pick the lower index) is not tested with the UMA bias active. The non-UMA auto-pick tests cover the tiebreak itself, but one UMA-specific tiebreak case is worth adding for completeness.

5. cached_backend_capabilities returns const& through a lock boundary (supertonic_gguf.cpp:779)

The returned reference outlives the lock_guard. This is safe in production because unordered_map references aren't invalidated by insert, and clear() is test-only. But supertonic_clear_capability_cache() could create a dangling reference in multi-threaded test scenarios. If test code ever calls clear() while another thread holds a reference from cached_backend_capabilities, that's UaF. Low risk given single-threaded test execution today, but worth a comment.

Positive observations
* The TDD caught real bugs (V layout transpose, env-var empty-string sentinel, pointer-compare upload-skip). The commit messages document the red→green cycle with specific failure modes — this is exactly how TDD should be practiced on low-level GPU code.

* The pure-logic resolver split (`resolve_vulkan_device_index`, `resolve_kv_attn_type`) makes the policy layer fully testable on CPU without a Vulkan adapter. Smart design.

* Backwards-compatibility is meticulously maintained — every existing flag/default preserves its semantics.

* The 25/25 CPU-only `ctest` suite catches regressions in the dispatch/capability/resolver contracts without needing GPU hardware in CI.

* Performance results are impressive (2.4–2.7× end-to-end speedup, 15× prewarm improvement on RTX 5090).

None of the findings are merge-blockers. Items 1–2 are low-effort test improvements; items 3–5 are suggestions for consideration.

Reply 1 — test_resolver_returns_concrete_only asserts too weakly
Fixed in 903c312. The original sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested + supported would have slipped through. Now the test computes the expected concrete dtype as a separately-implemented pure function of the inputs (a hand-rolled mirror of the resolver's behaviour matrix — typo on one side won't cancel a typo on the other) and CHECKs each of the 80 grid points against the expected dtype. Added explicit happy-path spot checks for your example (requested=2, supports_bf16=true → bf16, requested=3, supports_q8_0=true → q8_0) plus cross-contamination guards: requesting bf16 with f16 and q8_0 supported but bf16 NOT supported MUST fall to f32, not silently to one of the other supported dtypes. Total test-supertonic-kv-attn-type count went from ~80 checks to 205 / 205.

Reply 2 — test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors
Fixed in 903c312. The test now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f) so a binding failure on the second tensor fails the test, plus a cross-aliasing recheck: after writing 2.5f to temb_in, x_in must still read back 1.0f — a buffer-overlap bug where the helper bound both tensors to the same memory range would now fail this check too. test-supertonic-input-scratchpad is now 11 / 11 checks (was 9).

Reply 3 — Probe-gated silent fallback vs explicit operator request
Agreed, and fixed in 903c312. Added an optional bool * out_was_downgraded output parameter to resolve_kv_attn_type (defaulting to nullptr so the pure-logic unit tests stay stderr-clean). The resolver sets the flag iff the operator explicitly requested f16 / bf16 / q8_0 AND the corresponding backend probe returned false AND the resolver therefore returned f32. The auto path (-1) leaves the flag false — the operator didn't ask for a specific dtype, so there's nothing to surprise them with.

On a downgrade, the Engine ctor and supertonic-bench emit this one-line warning:

supertonic: warning: requested --kv-attn-type bf16 but the resolved backend's flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)

The auto path correctly stays silent, verified on both CPU and Vulkan (no warning when --kv-attn-type auto runs on either backend). New test_downgrade_flag_signal pins the contract: 8 scenarios covering every relevant path including the nullptr default-argument safety check.

One observation worth noting: on this dev rig the CPU backend's ggml_backend_supports_op(FLASH_ATTN_EXT(F32, BF16, BF16)) actually returns true (the CPU flash_attn_ext is generic), so the warning doesn't fire on CPU + --kv-attn-type bf16. That's correct probe behaviour, not a wiring bug. The warning will fire on adapters that genuinely reject the op (e.g., Vulkan without cooperative_matrix2 for BF16, or future backends that selectively reject Q8_0 K/V).
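The downgrade contract described in this reply can be sketched as a pure function. This is a minimal sketch, not the PR's code: the enum layout (0 = f32, 1 = f16 is assumed; 2 = bf16 and 3 = q8_0 follow the examples above), the `kv_probe_mask` struct, the exception type, and the probe-gated auto branch are all assumptions made for illustration.

```cpp
#include <stdexcept>

// Hypothetical enum; only autoselect / f16 / bf16 / q8_0 / f32 are named in the thread.
enum class kv_attn_dtype { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// Hypothetical bundle of the three flash-attn probe results.
struct kv_probe_mask {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

// Returns a concrete dtype, never autoselect. Sets *out_was_downgraded to true
// only when an explicit f16 / bf16 / q8_0 request was rejected by its probe.
kv_attn_dtype resolve_kv_attn_type(int requested, bool legacy_f16_attn,
                                   const kv_probe_mask & probes,
                                   bool * out_was_downgraded = nullptr) {
    if (out_was_downgraded) *out_was_downgraded = false;
    if (requested < -1 || requested > 3) {
        throw std::invalid_argument("kv_attn_type out of range");
    }
    if (requested == -1) {
        // Auto path: follow the legacy flag (assumed probe-gated); never a downgrade.
        return (legacy_f16_attn && probes.f16) ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
    }
    if (requested == 0) return kv_attn_dtype::f32;

    const bool supported = requested == 1 ? probes.f16
                         : requested == 2 ? probes.bf16
                                          : probes.q8_0;
    if (supported) return static_cast<kv_attn_dtype>(requested);

    // Explicit request, probe said no: fall to f32 (never to another dtype) and flag it.
    if (out_was_downgraded) *out_was_downgraded = true;
    return kv_attn_dtype::f32;
}
```

A caller that wants the stderr warning passes a `bool` and prints when it comes back true; unit tests can omit the last argument and stay stderr-clean.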

Reply 4 — resolve_vulkan_device_index UMA-bias tiebreak
Fixed in 903c312. Added a dedicated test_uma_aware_tiebreak_equal_vram_discretes that pins three sub-cases of the equal-VRAM-discretes tiebreak with the UMA bias active:

* Interleaved UMA: [32GB discrete, 32GB discrete, 120GB UMA] → picks index 0 (lower discrete).
* Adjacent discretes (no UMA in the middle): [32GB discrete, 32GB discrete] → picks index 0.
* Three-way all-discrete tie: [32GB, 32GB, 32GB] → picks index 0.

Test 11's second CHECK already covered the interleaved case implicitly, but hoisting it into its own named test makes the tiebreak contract greppable, and a future refactor reading the test names can see that the case is pinned. test-supertonic-vulkan-device-select is now 40 / 40 checks (was 37).

Reply 5 — cached_backend_capabilities returns const & through a lock boundary
Fixed in 903c312. Added a long comment on the function documenting the four invariants the contract relies on:

1. Production callers may hold the returned ref across subsequent cached_backend_capabilities calls for OTHER backends: std::unordered_map's reference-stability guarantee survives insert/emplace rehash; only iterators are invalidated.
2. Production callers MUST NOT keep the ref alive across a supertonic_clear_capability_cache call. That helper is test-only and exported with no header declaration; the contract is "callers don't reach this; tests do, single-threaded".
3. Multi-threaded callers must externally synchronise deref vs. clear (the lock here protects the map's structural invariants during insert/find, NOT the lifetime of returned elements).
4. If a future refactor adds a production-reachable erase/clear path, this function should switch to return-by-value or std::shared_ptr ownership; otherwise the UaF you flagged becomes reachable from production.

These are spelled out explicitly above the function body so the next maintainer doesn't have to derive the constraint from scattered context.
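The first two invariants can be illustrated with a small stand-alone cache. This is a sketch under assumptions: `capability_cache`, `backend_caps`, and the `const void *` key are hypothetical stand-ins for the PR's types, and the probe is a stub.

```cpp
#include <mutex>
#include <unordered_map>

// Hypothetical capability record; the real struct carries more flags.
struct backend_caps {
    bool native_leaky_relu = false;
    bool f16_kv_flash_attn = false;
};

class capability_cache {
public:
    // The returned reference outlives the lock. It stays valid across later
    // inserts (including rehash) but dangles after clear().
    const backend_caps & get(const void * backend) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(backend);
        if (it == map_.end()) {
            it = map_.emplace(backend, probe(backend)).first;
        }
        return it->second;
    }

    // Test-only in the design discussed above: destroys every element, so any
    // reference handed out earlier must no longer be used.
    void clear() {
        std::lock_guard<std::mutex> lock(mu_);
        map_.clear();
    }

private:
    // Stub probe; real code would query the backend.
    static backend_caps probe(const void *) {
        backend_caps caps;
        caps.native_leaky_relu = true;
        caps.f16_kv_flash_attn = true;
        return caps;
    }

    std::mutex mu_;
    std::unordered_map<const void *, backend_caps> map_;
};
```

Holding the result of `get(a)` while calling `get(b)` for many other keys is safe because `std::unordered_map` never moves its elements on insert; only `clear()` or `erase()` ends an element's lifetime.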

Two interleaved chatterbox concurrency fixes for iOS, collapsed into one history-preserving commit on top of the upstream merge so the qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604 SHA (no chained port-version bumps). 1) gguf_init_from_file race (the SIGABRT seen before this commit): bake_voice_conditioning() must run BEFORE we spawn the s3gen preload thread. Both paths funnel into gguf_init_from_file() (voice_encoder opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the ggml_init / gguf_init_from_file pair underneath is not safe to invoke concurrently from two threads against ggml's process-global state. Empirically, this races on Apple Silicon with a fast SIGABRT inside ggml_abort coming from the preload thread's ggml_init while the main thread is still executing voice_encoder_load. 2) Metal shared-buffer-type init race (the SIGSEGV in ggml_metal_buffer_is_shared, surfaced once fix 1) was in place): after the preload thread spawns we now block on wait_for_preload() before the constructor returns, so the SDK e2e bootstrap's load -> immediate unload pattern ("preLoadUnload") can no longer tear down the engine while s3gen_preload is still inside ggml_backend_metal_buffer_type_shared_alloc_buffer -> ggml_metal_buffer_is_shared. This defeats the parallel-preload optimisation (s3gen_preload no longer overlaps with first T3 inference inside synthesize()); revisit once ggml-metal's shared buffer-type init is safe to use from a preload thread concurrent with construction-time teardown. Together these two changes unblock chatterbox load on iPhone 16e (iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal enabled; see qvac/pull/1992. Co-authored-by: Cursor <cursoragent@cursor.com>

Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need. Vulkan-specific deltas - supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time: - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description(). - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in. - supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob). - Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error — surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran. 
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch. Tests - test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass. - test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass. - test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass. - All audit follow-up tests from tetherto#16 unchanged, all PASS. Build - All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean. - No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible. Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick). Co-authored-by: Cursor <cursoragent@cursor.com>

`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but `<atomic>` was never included; the file relied on a transitive include chain that broke once any consumer rearranged includes. Surfaces as `error: variable 'std::atomic<int> ... has initializer but incomplete type'` on a clean build. Pre-existing bug, unrelated to QVAC-18605 itself but blocked local CTest runs against the Vulkan-optimisation work. Trivial additive include with no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>

…s + prewarm Layered on top of the QVAC-18605 Vulkan bring-up commit; the round-2 changes generalise the bring-up's "load-time backend probe" pattern into a process-wide capability cache and add three more probes / dispatch hooks that fit the same shape. Net effect on Vulkan: redundant supports_op traffic eliminated, defensive auto-policy gating extended to F16 weights, forward- compat Q8_0 K/V probe primed for a follow-up dispatch flip, and an opt-in --prewarm hook that lets operators amortise the ~hundreds-of-ms cold-start shader-compile cost outside the operator-visible first synth call. 1) Process-wide capability-probe cache keyed by ggml_backend_t The bring-up's three load sites (load_supertonic_gguf, Engine::Engine, supertonic_bench's main) each ran the LEAKY_RELU + F16-K/V flash-attn supports_op queries independently — 2-3x redundant probe traffic per backend. On Vulkan, supports_op may inspect the device's pipeline state (~50-200 us per query on Adreno / llvmpipe / RADV in microbenchmarks); the cache short-circuits 100 % of the duplicates. Test seam (supertonic_clear_capability_cache + supertonic_capability_probe_call_count) lets the unit test verify the cache is hit on the second call by comparing the counter before / after. Per-backend independence verified against two distinct CPU backend handles. 2) F16 mul_mat backend-capability probe Symmetric to the F16-K/V flash-attn probe. The bring-up auto-enabled use_f16_weights on `!backend_is_cpu` blindly; a partial-port backend that ships F16 storage but rejects the hot vector-estimator W_query mul_mat shape would crash at first synth call. Probe builds the live shape ([256,256] F16 weight x [256,16] F32 activation) and asks the backend; auto-policy refuses materialisation on a `false` answer (slower F32 path stays correct). Manual --f16-weights 1 still forces materialisation (debug-shim escape hatch). Probe cached; test verifies CPU returns true. 
3) Q8_0 K/V flash-attn forward-compat probe Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0 (and Q4_0) K/V types in scalar + coopmat2 paths. Switching K/V from F16 to Q8_0 would halve the per-step upload bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape; ~1 MB / synth on the default 5-step x 4-site schedule) in exchange for a small (~0.5 %) drift on the attention output. This commit adds the probe + caches the result; live dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift measurement against the parity harness on a real Vulkan adapter. Bench output annotates `(q8_0_kv_attn=available)` when the probe says yes so operators can confirm their hardware is ready for the follow-up. 4) Engine::warm_up(text) + EngineOptions::prewarm_text + --prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench) First-synth-latency reduction on Vulkan / OpenCL. In-tree thread_local graph caches handle every subsequent call but can't avoid the first pipeline-compile cost (~hundreds of ms on Adreno / RADV per chatterbox PROGRESS.md). warm_up runs one throwaway synth at construction time on a caller- supplied sample text so the operator-visible first synth sees steady-state latency. Auto-no-op on CPU (no shader- compile cost). Bench's --prewarm runs the cold-start synth BEFORE the timed loop (independent of --warmup N which only discards N timed runs from the median); cold-start latency logged as `[prewarm] cold-start synth on '...' took N.Nms` and emitted to --json-out as "prewarm_ms". 5) Bench output extended Backend log line surfaces every dispatch flag plus the cold-start prewarm latency: Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on) (native_leaky_relu=on) (q8_0_kv_attn=available) --json-out gains "f16_attn", "f16_weights", "native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms" keys for downstream analysis tooling. 
Tests - test-supertonic-capability-cache (NEW, LABEL "unit"): probe cache short-circuit + clear seam + per-backend independence + idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke. 18 / 18 checks pass. - test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface contract for EngineOptions::prewarm_text + Engine::warm_up via SFINAE. 9 / 9 checks pass. - All existing CPU-only unit tests (test-supertonic-vulkan- dispatch, -portable-ops, -backend-dispatch, -rope-in-graph, -rope-packed-qk, -in-graph-transpose, -convnext-block-fused, -graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus resample / cpu-caches / t3-caches): all 13 pass unchanged. - ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ / 184+ individual checks). Build - All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined. - No public-API break: EngineOptions::prewarm_text is a new optional field defaulting to empty (no-op), Engine::warm_up is a new method (existing callers don't have to invoke it). Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"): persistent VkPipelineCache (cross-process), BF16 K/V flash-attn, Q8_0 K/V live dispatch wiring, multi-device load-balancing. Co-authored-by: Cursor <cursoragent@cursor.com>
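The cache-plus-counter test seam from item 1 of this commit can be sketched as follows. This is an illustration only: `backend_handle`, `cached_caps`, `probe_call_count`, and `clear_capability_cache` are hypothetical names, the probe is a stub, and the sketch returns by value (the alternative mentioned in the review thread) rather than by reference.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Stand-in for ggml_backend_t; the real handle type comes from ggml.
using backend_handle = const void *;

// Hypothetical subset of the cached capability flags.
struct backend_caps {
    bool f16_mul_mat        = false;
    bool q8_0_kv_flash_attn = false;
};

namespace {
std::mutex g_mu;
std::unordered_map<backend_handle, backend_caps> g_cache;
std::size_t g_probe_calls = 0;

// Stub probe: real code would build synthetic nodes and call supports_op.
backend_caps run_probes(backend_handle) {
    ++g_probe_calls;
    backend_caps caps;
    caps.f16_mul_mat = true;
    return caps;
}
}  // namespace

// First call per handle runs the probes; later calls are served from the map.
backend_caps cached_caps(backend_handle backend) {
    std::lock_guard<std::mutex> lock(g_mu);
    auto it = g_cache.find(backend);
    if (it == g_cache.end()) {
        it = g_cache.emplace(backend, run_probes(backend)).first;
    }
    return it->second;
}

// Test seam: how many times the probes actually ran.
std::size_t probe_call_count() {
    std::lock_guard<std::mutex> lock(g_mu);
    return g_probe_calls;
}

// Test seam: drop every cached entry.
void clear_capability_cache() {
    std::lock_guard<std::mutex> lock(g_mu);
    g_cache.clear();
}
```

A unit test reads the counter, calls `cached_caps` twice for the same handle, and checks the counter moved by exactly one; a second distinct handle moves it by one more, which is the per-backend independence property.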

…vice auto-pick + 2 forward-compat probes Three more Vulkan-specific deltas, all developed test-first. New tests were committed first, observed to fail on the missing symbol, and only then was the implementation written and the tests re-run to verify green. 1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities flag). Symmetric to the round-2 Q8_0 K/V probe. Vulkan's FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2- only path; BF16 has the same 2-byte per-element footprint as F16 (so identical upload bandwidth) but the wider 8-bit exponent range avoids the F16 underflow on small attention scores. Forward-compat — the live --kv-attn-type bf16 dispatch wiring is deferred to a follow-up that measures drift against the parity harness on a real Vulkan adapter. 2. Multi-device auto-pick for --vulkan-device -1. Wires the previously-reserved auto-pick API: walks every visible adapter, queries ggml_backend_vk_get_device_memory() to read free VRAM, and dispatches into a pure-logic helper resolve_vulkan_device_index(requested, free_vram_per_device) that picks argmax(free_vram); ties → lower index for stable per-run assignment on identical-spec multi-GPU machines. The pure-logic helper is testable on CPU with synthetic inputs (8 test functions, 23 checks). Reserved-future negative values (-2, -100, ...) now throw instead of silently falling through to device 0. Verbose mode logs the per-device VRAM table so operators can confirm the auto-pick chose the expected adapter. 3. Pinned-host-buffer-type capability probe (6th cache flag) + bench surface. Probes whether ggml_backend_vk_host_buffer_type() is callable on the resolved backend (Vulkan + non-null buffer- type). Forward-compat — primes the capability cache for a follow-up per-engine input-scratchpad refactor that skips ggml-vulkan's internal staging-buffer hop on per-step uploads. 
Bench output now shows bf16_kv_attn_available + pinned_host_buffer_available in both the human-readable backend tag and the JSON output so operators can pre-flight whether a future opt-in will be effective on their machine. Test plan (TDD round 3): - test-supertonic-capability-cache: 27 / 27 checks pass (was 18, +9 checks for round-3: BF16 K/V smoke + cache-slot share, pinned-host-buffer smoke + cache-slot share, null-backend defensive checks for both new probes). - test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass (8 test functions: empty-list, single-device, argmax-VRAM, tie- break, explicit-index passthrough, out-of-range, reserved- negative, zero-VRAM handling). - Whole CPU-only ctest -L unit reports 16 / 16 tests passing, zero regressions on round-1 / round-2 / audit-follow-up tests. CLI surface: - supertonic CLI + chatterbox CLI usage strings updated to document --vulkan-device -1 = auto-pick adapter with most free VRAM. - supertonic-bench usage string updated likewise. Co-authored-by: Cursor <cursoragent@cursor.com>
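The auto-pick policy from item 2 (argmax of free VRAM, ties to the lower index, reserved negatives throw) fits in a few lines. The sketch below follows the signature quoted in the commit message, but the exception types and the behaviour on an empty device list are assumptions, and it omits the UMA bias added in a later round.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// requested >= 0 : explicit index, range-checked against the device list.
// requested == -1: auto-pick the device with the most free VRAM.
// other negatives: reserved, rejected loudly.
int resolve_vulkan_device_index(int requested,
                                const std::vector<std::uint64_t> & free_vram_per_device) {
    const int n = static_cast<int>(free_vram_per_device.size());
    if (n == 0) {
        // Assumption: an empty list is an error in this sketch.
        throw std::runtime_error("no Vulkan devices visible");
    }
    if (requested >= 0) {
        if (requested >= n) throw std::out_of_range("vulkan device index out of range");
        return requested;
    }
    if (requested != -1) {
        throw std::invalid_argument("reserved negative vulkan device value");
    }
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict '>' keeps the lower index when two devices tie.
        if (free_vram_per_device[i] > free_vram_per_device[best]) best = i;
    }
    return best;
}
```

Because the function takes plain integers, it can be exercised on a CPU-only CI machine with synthetic VRAM tables.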

…hts operator deny-list Round 6 layers a user-overridable extra deny-list on top of the existing hand-curated should_materialise_f16_weight() allow-list. The curated allow-list (Phase 2A) already excludes biases, norms, embeddings, depthwise convs, and pre-transposed companions; the round-6 deny-list lets operators force-keep specific additional tensors as F32 even when --f16-weights is on. Use cases: - A/B testing: researcher excludes a specific tensor pattern temporarily without recompiling. - Hardware-specific drift mitigation: operator pins a problematic tensor to F32 via config rather than disabling F16 weights wholesale. - Future-GGUF safety net: new tensor patterns added in future GGUFs that the curated allow-list inadvertently scoops in can be excluded via config without a code change. Smallest blast radius of the four follow-up rounds — load-time policy only, runtime dispatch unaffected, zero behaviour change on the empty-deny-list default path. Strict TDD discipline (per the user's "double check, don't break anything" constraint): - Both new tests committed FIRST. - Both confirmed to fail to compile on the missing symbols (predicate test: 'too many arguments to should_materialise_f16_weight'; API test: 'EngineOptions has no member f16_weights_deny_list'). - Implementation written. - Both tests + every existing unit test re-run; all green. What changed: 1. 2-arg overload should_materialise_f16_weight(name, extra_deny_substrings) added alongside the existing 1-arg version (existing test + call sites unchanged). Substring matching matches the curated predicate's audit-friendly style; no regex compile cost or invalid-pattern surface. The deny- list can only flip true → false, never false → true. Empty strings inside the deny-list are SKIPPED defensively, not treated as universal matches (config-typo guard). 2. EngineOptions::f16_weights_deny_list (vector<string>, default empty) — public API surface. 
Wired through Engine::Impl → load_supertonic_gguf → the per-tensor allocation loop. 3. load_supertonic_gguf 7th parameter added at the end of the signature with a {} default — every existing call site keeps compiling without modification. 4. supertonic_model::f16_weights_excluded_count counter bumped at load time when a curated-hot tensor is excluded by the user's deny-list. Surfaced in bench's human + JSON output so operators can confirm their config took effect. 5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on supertonic-cli, tts-cli (chatterbox), and supertonic-bench (comma-separated substring patterns). 6. Verbose-log line in load_supertonic_gguf when the deny-list is non-empty (silent on the default path — no visual noise on existing operator workflows). Test plan (TDD round 6): - test-supertonic-f16-weights (UPDATED): existing 36 checks (positives, negatives, edges) + 29 new round-6 checks across 7 new test functions (empty-list passthrough, matching-deny- excludes, non-matching-no-op, cannot-promote-cold, multiple- patterns ANY-match, empty-string defensive skip, empty-name safety) → 65 / 65 PASS. - test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time gate for EngineOptions::f16_weights_deny_list + load_supertonic_gguf 7th param; runtime defaults check + assignability + regression guards on every other documented EngineOptions default → 9 / 9 PASS. - Whole CPU-only ctest -L unit reports 17 / 17 tests, 0 failures, 0 regressions on round-1/2/3 + audit follow-up + the baseline tests. - Smoke-tested supertonic-cli + tts-cli + supertonic-bench binaries: --f16-weights-deny flag parses correctly, surfaces in --help output, and threads through to the load layer. Co-authored-by: Cursor <cursoragent@cursor.com>
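The deny-list rules from "What changed" item 1 (substring match, empty patterns skipped, true → false only) can be sketched as a 2-arg overload layered on a 1-arg predicate. The 1-arg body below is a hypothetical stand-in; the PR's curated allow-list is more detailed and is not reproduced here.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the curated predicate: keep biases and norms F32,
// materialise other ".weight" tensors as F16.
bool should_materialise_f16_weight(const std::string & name) {
    if (name.empty()) return false;
    if (name.find("bias") != std::string::npos) return false;
    if (name.find("norm") != std::string::npos) return false;
    return name.find(".weight") != std::string::npos;
}

// Operator deny-list layered on top of the curated predicate.
bool should_materialise_f16_weight(const std::string & name,
                                   const std::vector<std::string> & extra_deny_substrings) {
    // The deny-list can only flip true to false, so a curated "no" is final.
    if (!should_materialise_f16_weight(name)) return false;
    for (const std::string & pattern : extra_deny_substrings) {
        // Empty patterns are skipped; otherwise "" would match every tensor.
        if (pattern.empty()) continue;
        if (name.find(pattern) != std::string::npos) return false;
    }
    return true;
}
```

With an empty deny-list the 2-arg overload returns exactly what the 1-arg version returns, which matches the zero-behaviour-change default described above.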

…ype K/V flash-attention dispatch Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into an enum with four concrete dtypes (F32, F16, BF16, Q8_0) plus an `auto` selector + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag so operators can opt into BF16 K/V (Vulkan coopmat2: same bandwidth as F16, no F16 underflow on small attention scores) or Q8_0 K/V (Vulkan + half the K/V upload bandwidth) on adapters that advertise the corresponding capability. Default `auto` falls back to `--f16-attn` so every existing operator config sees zero behaviour change. Strict TDD throughout: Prereq B extends the F16 parity harness to cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both hot shapes) BEFORE touching any production code; new pure-logic resolver test (`test-supertonic-kv-attn-type`, 106 checks across the full {-1, 0..3} × legacy × probe-mask matrix); new API-surface SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks). Tests committed first, observed to fail on missing symbols, then implementation added. Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch site (same pattern as round-3's `resolve_vulkan_device_index`). Probe-rejected explicit requests fall back to F32 silently (advisory-probe contract); out-of-range int throws to surface CLI typos loudly. Vector-estimator dispatch site (`build_text_attention_cache`) replaces the F16-only cast with a switch on the enum; cache key promoted from `bool f16_kv_attn` to `kv_attn_dtype kv_attn_type`. Bench surface adds `(kv_attn_type=…)` to the human-readable backend line and `"kv_attn_type"` + `"kv_attn_type_requested"` to the JSON output so log-grep / CI attribution works across machines. Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch so invalid values surface as a clean `error: ...` line + exit 2 (also fixes the pre-existing latent crash on `--vulkan-device abc` / `--seed nonsense`). Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>

…servability + voice cache + Vulkan env-var passthrough Lowest impact-÷-risk round of the four planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup. 1. Voice ttl/dp host cache (`detail::voice_host_cache`). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site. 2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)` public helper + `EngineOptions::vulkan_env_overrides` field + `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` / `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` / `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags on all three binaries). ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. `set_env_if_unset` semantics so an operator-set env var still WINS over the EngineOptions override. 3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync` opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware. 4. Bench per-denoise-step breakdown (`--bench-per-step`). Times each `supertonic_vector_step_ggml` call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape. Strict TDD throughout. Two new test executables committed first, observed to fail on missing symbols, then implementation written. 
TDD also caught a real bug: the original env-key validator used `std::string()` empty-as-success sentinel which collided with the empty-string-as-key edge case; the test pinned the contract and forced a `bool / out-param` API fix BEFORE any production wiring went in. Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions (was 19; +2 new tests = 54 new checks). Co-authored-by: Cursor <cursoragent@cursor.com>
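The two properties called out for the env-var passthrough (all-or-nothing validation, and an operator-set variable winning over the override) can be sketched with POSIX `setenv`. This is a sketch under assumptions: the `GGML_VK_` prefix rule and the `validate_vulkan_env_key` helper are hypothetical, and only the `bool` plus out-param shape is taken from the commit message.

```cpp
#include <cstdlib>
#include <map>
#include <stdexcept>
#include <string>
#include <stdlib.h>  // POSIX setenv

// bool + out-param validator: no empty-string sentinel, so an empty key
// cannot be mistaken for "success".
bool validate_vulkan_env_key(const std::string & key, std::string * out_error) {
    if (key.empty()) {
        if (out_error) *out_error = "empty environment key";
        return false;
    }
    // Assumed rule for this sketch: only GGML_VK_* keys are accepted.
    if (key.compare(0, 8, "GGML_VK_") != 0) {
        if (out_error) *out_error = "key '" + key + "' lacks the GGML_VK_ prefix";
        return false;
    }
    return true;
}

void apply_vulkan_env_overrides(const std::map<std::string, std::string> & overrides) {
    // Pass 1: validate every key before touching the environment.
    for (const auto & kv : overrides) {
        std::string error;
        if (!validate_vulkan_env_key(kv.first, &error)) {
            throw std::invalid_argument("vulkan_env_overrides: " + error);
        }
    }
    // Pass 2: overwrite = 0 leaves any variable the operator already exported.
    for (const auto & kv : overrides) {
        if (::setenv(kv.first.c_str(), kv.second.c_str(), 0) != 0) {
            throw std::runtime_error("setenv failed for " + kv.first);
        }
    }
}
```

Splitting validation and application into two passes is what makes a typo in the last entry leave the first entry unset as well.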

…PU bridge Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR tetherto#16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function. Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site). Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host-rotate path. The blit primitive parity gate already shipped with PR tetherto#16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`). Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>

…U bridge Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win). - vector_res_style_qkv_result extended with `sq_gpu / sk_gpu / sv_gpu` GPU handles, populated unconditionally by `run_res_style_qkv_cache` (cheap — no GPU sync; just `ggml_graph_get_tensor` lookups). Same shape as `vector_group_graph_result::q_rope_gpu` etc from the round-1 2C-lite work. - `run_res_style_qkv_cache` host-download gating: the 3 `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv` are now gated on `trace != nullptr`. Production path skips them entirely. Mirrors the round-1 2C-lite `need_host_qkv = (trace != nullptr)` gate. `post` stays unconditional — consumed by the next-stage `run_style_residual_cache` which still expects a host vector (cross-stage GPU bridge for `post` is deferred). - 4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: `!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge. Trace mode falls back to the legacy host bridge so the trace harness still gets all the host vectors. Strict TDD: parity test (`test-supertonic-graph-to-graph-blit`) extended with explicit style-shape coverage (`style_sq_L1` trip-wire + clarified `style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact `max_abs = 0.0`. Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>

…t upload-skip

After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is `text_emb` (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for `style_v_in` / `kctx_in`) into a reusable `upload_skip_tracker` helper and applies it to the front-block + 3 group caches.

CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local `std::vector<float>` in `Engine::Impl::synthesize()` (and bench loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next stack-local vector of the same size, so synth N+1 may have `text_emb.data() == synth_N.text_emb.data()` despite holding completely different data. A naive pointer-compare upload-skip would silently leak the prior synth's text-encoder embedding into the next synth's GPU buffer.

Mitigation: the caller MUST invoke `tracker.reset()` at every synth boundary (`current_step == 0`). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it.

Per-synth wins:
- 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth
- ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length)

Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7 functions, 41 checks) committed first, observed to fail compile (`upload_skip_tracker was not declared`), then implementation added.

Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
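The pointer-compare skip and the reset at the synth boundary can be illustrated with a minimal sketch. Only the name `upload_skip_tracker` and the `reset()` call come from the commit message; the `needs_upload` method, the member names, and the `count_uploads` driver are hypothetical stand-ins, and the real helper in tts-cpp may be shaped differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: remembers the last host pointer and byte count
// that was uploaded into one GPU input tensor.
struct upload_skip_tracker {
    const void * last_ptr   = nullptr;
    std::size_t  last_bytes = 0;

    // Returns true when the caller must upload, and records the new source.
    bool needs_upload(const void * ptr, std::size_t bytes) {
        if (ptr != nullptr && ptr == last_ptr && bytes == last_bytes) {
            return false;  // same host buffer as the previous step: skip
        }
        last_ptr   = ptr;
        last_bytes = bytes;
        return true;
    }

    // Must run at every synth boundary (current_step == 0). Without it a
    // recycled heap address would be mistaken for unchanged data.
    void reset() {
        last_ptr   = nullptr;
        last_bytes = 0;
    }
};

// Hypothetical driver: counts the uploads one cache performs for one synth.
inline int count_uploads(upload_skip_tracker & tracker,
                         const std::vector<float> & text_emb,
                         int n_steps) {
    int uploads = 0;
    for (int step = 0; step < n_steps; ++step) {
        if (step == 0) {
            tracker.reset();
        }
        if (tracker.needs_upload(text_emb.data(), text_emb.size() * sizeof(float))) {
            ++uploads;  // the real code would call ggml_backend_tensor_set here
        }
    }
    return uploads;
}
```

With one tracker per cache, each cache uploads `text_emb` once per synth instead of five times, which matches the 20 → 4 uploads (16 saved) quoted above.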

…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware.

Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up #5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; the test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip, so the same `ggml_cont(ggml_transpose(...))` is open-coded at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape that the scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and silently feed wrong-orientation Q/K/V into the attention.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
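The two layouts named in this fix can be made concrete with a host-side sketch in plain C++ (no ggml). In ggml, `ne[0]` is the fastest-varying dimension, so `ne=[L, HD]` stores element (t, c) at `c * L + t` and `ne=[HD, L]` stores it at `t * HD + c`. The function name `channel_major_to_time_major` is hypothetical; it only mirrors, on a flat float buffer, the byte movement that `ggml_cont(ggml_transpose(q))` performs in the graph.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// src: ne=[L, HD], element (t, c) at index c * L + t   (channel-major-flat)
// dst: ne=[HD, L], element (t, c) at index t * HD + c  (time-major-flat)
inline std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                                      std::size_t L,
                                                      std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(L * HD);
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

Applying the function twice with swapped dimensions returns the original bytes, which is why item 4 above has to drop `tensor_to_time_channel` once the graph already emits the time-major layout.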

… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on each: a new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware.

== #10: Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third arg:

    int resolve_vulkan_device_index(int requested,
                                    const std::vector<size_t> & free_vram_per_device,
                                    const std::vector<bool> & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` and sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-index-ignores-UMA-bias, mismatched-length-throws).

== #6: Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache.

Round 12 #6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`:

    if (!model_prefers_cpu_kernels(m)) {
        thread_local speech_prompted_merged_cache merged_caches[2];
        // rebuild on key change, then:
        run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
        return;
    }
    // ... legacy two-cache CPU path unchanged

Eliminates per call:
- 2 GPU->host downloads (q_out, v_out)
- 3 host->GPU uploads (q_pack, k_pack, v_pack)
- 1 graph dispatch
- All host pack work (q_pack / k_pack / v_pack head-split)

= 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone.

CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host-side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
- run_speech_prompted_merged_cache symbol via SFINAE
- speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE
- free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit)

6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`).

== #5: Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred.

Round 12 #5 lands the helper:

    ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
        const supertonic_model & model, ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites:
- vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
- ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy).

Dual-context pattern:
1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
2. Create x_in / temb_in / etc. in input_ctx
3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check

Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata).

CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
- Helper symbol existence (SFINAE)
- nullptr return on CPU backend (idempotent across repeats)
- Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
    Round 11 baseline:     76.11 ms / 5 steps (123x realtime)
    Round 12 (all three):  27.99 ms / 5 steps (537x realtime)  ^ 2.7x faster
    Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster)
    Prewarm cold-start:    330 ms -> 21 ms (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
    Round 11 baseline: 44.08 ms (74x realtime)
    Round 12:          23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
    Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime)
    Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime)  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
    cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
    cmake --build tts-cpp/build -j
    ctest --test-dir tts-cpp/build -L unit
    -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
    cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
    cmake --build tts-cpp/build-vulkan -j
    ctest --test-dir tts-cpp/build-vulkan -L unit
    -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe): every adapter writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
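The UMA-aware auto-pick rules from #10 can be sketched as a pure function. This is a minimal reconstruction from the behaviour matrix in the commit message, not the code in `supertonic_gguf.cpp`: the `_sketch` suffix, the exception types, the ordering of the length check relative to the explicit-index passthrough, and the empty-device-list handling are all assumptions. The lower-index-wins tiebreak follows the later review note on equal-VRAM discretes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch only: same parameter list as the signature quoted in the commit.
inline int resolve_vulkan_device_index_sketch(
        int requested,
        const std::vector<std::size_t> & free_vram_per_device,
        const std::vector<bool> & is_uma_per_device = {}) {
    const std::size_t n = free_vram_per_device.size();
    const bool have_uma_info = !is_uma_per_device.empty();
    if (have_uma_info && is_uma_per_device.size() != n) {
        // Assumed exception type; the source only says "mismatched-length-throws".
        throw std::invalid_argument("is_uma_per_device length != device count");
    }
    if (requested >= 0) {
        return requested;  // explicit index: UMA-agnostic passthrough
    }
    if (n == 0) {
        throw std::runtime_error("no Vulkan devices");  // assumed behaviour
    }
    bool any_discrete = false;
    if (have_uma_info) {
        for (std::size_t i = 0; i < n; ++i) {
            if (!is_uma_per_device[i]) { any_discrete = true; }
        }
    }
    int best = -1;
    for (std::size_t i = 0; i < n; ++i) {
        // Skip UMA adapters only when at least one discrete adapter exists;
        // with no UMA info or all-UMA this degrades to the round-3 argmax.
        if (any_discrete && is_uma_per_device[i]) { continue; }
        if (best < 0 || free_vram_per_device[i] > free_vram_per_device[best]) {
            best = static_cast<int>(i);  // strict '>' keeps the lower index on ties
        }
    }
    return best;
}
```

On the hybrid rig described above (iGPU reporting 120+ GB, discrete card reporting 32 GB), the two-argument call picks the iGPU while the three-argument call picks the discrete card.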

…lidation + Q8_0 K/V finding

Round 13 is a strict-improvement-only follow-up to round 12: no code path is removed, no optimisation is rolled back, and the end-to-end perf on every backend stays at the round-12 level. Two deliverables, both no-regret:

== 1. New helper `alloc_input_scratchpad_or_throw` ==

Round 12 #5 inlined the "try pinned-host first, fall back to default backend buffer, throw on both-fail" idiom at 4 cache sites (front block + 3 group caches):

    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }

Round 13 factors it into one helper. Each caller becomes:

    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");

Same correctness contract: CPU / Metal / OpenCL fall back to default backend buffer; Vulkan tries pinned-host first. Defensive failure modes consolidated: null model.backend, null input_ctx, null cache_name all throw std::runtime_error with a message that includes the cache name, instead of segfaulting in an error-handler path. Single point of maintenance for the pattern; future cache builds that want pinned-host inputs use the helper directly.

`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins the contract via SFINAE on the symbol + CPU-fallback round-trip through `ggml_backend_tensor_set` / `get` + null-arg throws + empty-ctx error message includes the cache name. CPU-only, no GGUF fixture required. CI test count goes from 24 / 24 (round 12) to 25 / 25 (round 13).

Perf impact: zero. Same code path, same allocations, same data movement, just one fewer level of nesting at each call site.

== 2. Q8_0 K/V no-win documented for RTX 5090 ==

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench output advertises `q8_0_kv_attn=available`. Round 13 measures the trade-off on the test rig (RTX 5090, 1.79 TB/s memory bandwidth, long prompt 206 chars / 18 s audio):

    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower

The F32->Q8_0 cast overhead exceeds the saved K/V upload bandwidth on a high-bandwidth discrete GPU. Operator guidance: stick with the F16 default on RTX 5090 and similar high-bandwidth discretes. Q8_0 is shipped for adapters where the K/V upload bottlenecks the synth (older PCIe 3.0, lower-end discretes, iGPUs with slow BAR); the cross-over point is to be measured per-adapter by operators using `--bench-per-step` from round 7.

== Test plan ==

    ctest --test-dir tts-cpp/build -L unit
    -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
    ctest --test-dir tts-cpp/build-vulkan -L unit
    -> 25 / 25 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe): every adapter writes a valid WAV.

Perf on RTX 5090 (10 runs + 3 warmup, long prompt):

    Round 12 baseline: med= 31.11 ms (588x realtime)
    Round 13:          med= 31.71 ms (577x realtime)
    -> within run-to-run noise; no regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
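The "try pinned-host first, fall back to the default buffer, throw if both fail" contract can be sketched without ggml by passing the two allocators in as callables. Everything here is a stand-in: `buffer_handle` replaces `ggml_backend_buffer_t`, the function name carries a `_sketch` suffix, and the real helper takes `(model, input_ctx, cache_name)` rather than callables.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

using buffer_handle = void *;  // hypothetical stand-in for ggml_backend_buffer_t

// Sketch only: tries the preferred allocator, then the fallback, and throws
// std::runtime_error naming the cache when neither yields a buffer.
inline buffer_handle alloc_input_scratchpad_or_throw_sketch(
        const std::function<buffer_handle()> & try_pinned,
        const std::function<buffer_handle()> & try_default,
        const char * cache_name) {
    if (cache_name == nullptr) {
        throw std::runtime_error("alloc_input_scratchpad: null cache_name");
    }
    if (!try_pinned || !try_default) {
        throw std::runtime_error(std::string(cache_name) + ": missing allocator");
    }
    if (buffer_handle buf = try_pinned()) {
        return buf;  // Vulkan path: host-pinned buffer available
    }
    if (buffer_handle buf = try_default()) {
        return buf;  // CPU / Metal / OpenCL path: default backend buffer
    }
    throw std::runtime_error(std::string(cache_name) +
                             ": failed to allocate input scratchpad");
}
```

Putting the cache name in every error message is what lets the four call sites collapse to one line each while keeping cache-specific diagnostics.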

…sumption + voice cache threading + round-5 gap

Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 #5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp, resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3; operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded-per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why there is no internal lock today (the cache exists to eliminate per-call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated.

Reviewer's fourth ask ("the round-11 fix is redone in PR tetherto#21") was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
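The `unordered_map` guarantee cited in item 3 can be checked with a small single-threaded sketch. Only the names `voice_host_cache`, `get_or_load`, and `clear()` appear in the commit message; the struct layout, the string key, the loader callback, and the `loads` counter are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: a host-side cache keyed by voice name, single-threaded per Engine.
struct voice_host_cache_sketch {
    std::unordered_map<std::string, std::vector<float>> entries;
    int loads = 0;  // counts how often the (expensive) loader actually ran

    // Returns a reference that stays valid across later inserts and rehashes,
    // because std::unordered_map is node-based. It does NOT survive clear().
    template <typename Loader>
    const std::vector<float> & get_or_load(const std::string & key, Loader && load) {
        auto it = entries.find(key);
        if (it == entries.end()) {
            it = entries.emplace(key, load()).first;
            ++loads;
        }
        return it->second;
    }

    // Invalidates every reference handed out so far.
    void clear() { entries.clear(); }
};
```

A caller may therefore hold the returned reference (and its `.data()` pointer) while other voices are loaded, but must drop it before any `clear()`; a multi-threaded caller would need the external mutex the docstring describes.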

… tests + surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to nullptr; two test files extended; two doc-comment blocks added). No production-logic surface modified for existing callers.

Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV (rms=285.6, abs_max=4703 on both backends, same seed + text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR tetherto#18:

1. test_resolver_returns_concrete_only (kv_attn_type). The original exhaustive 5 x 2 x 8 sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested+supported would pass silently. Rewritten with a second pure-function `expected()` mirror of the resolver's matrix; every one of the 80 grid points now CHECKs the resolver's return value against the expected concrete dtype. Added cross-contamination spot checks (requesting bf16 with f16+q8_0 supported but bf16 NOT supported must fall to f32, not silently to f16 or q8_0). Now 205 checks passed in test-supertonic-kv-attn-type.

2. test_cpu_fallback_returns_valid_buffer (input_scratchpad). Original only round-tripped x_in (one of two allocated tensors). Now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f), plus a cross-aliasing recheck (after writing temb_in, x_in must still read back its original 1.0f); a binding-collision bug where both tensors share memory would now fail this check.

3. resolve_kv_attn_type silent fallback on explicit operator request. Added an optional `bool * out_was_downgraded` output parameter to the resolver, set to true IFF the operator explicitly requested f16/bf16/q8_0 AND the corresponding backend probe returned false AND we therefore returned f32. The auto path (-1) leaves the flag false (no operator surprise: auto-policy is doing its job). Engine ctor + supertonic-bench wired to emit a one-line `fprintf(stderr, "warning: requested --kv-attn-type %s but the resolved backend's flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)")` on a downgrade. Defaulted nullptr keeps the pure-logic unit tests stderr-clean. New test_downgrade_flag_signal pins the contract on every relevant path (auto + missing probe -> flag false; explicit + matching probe -> flag false; explicit + missing probe -> flag true; nullptr out-ptr safe).

4. test_uma_aware_tiebreak_equal_vram_discretes (vulkan_device_select). Added a dedicated UMA-bias-active test case: two discrete cards with EQUAL VRAM (32 GB each) alongside a UMA iGPU. Pins three sub-cases: interleaved UMA in the middle, adjacent discretes with no UMA, three-way all-discrete tie. Lower index wins in every case. The existing test 11's second CHECK already covered the interleaved-UMA case; this hoists the contract into its own named test so a future refactor reading the test names knows the tiebreak case is pinned.

5. cached_backend_capabilities UaF risk under test-only clear(). Added a long comment on the function documenting the four invariants: (a) production callers may hold the returned ref across subsequent calls for OTHER backends (unordered_map's insert-doesn't-invalidate-references guarantee); (b) production callers MUST NOT keep the ref alive across a clear() call (test code's responsibility); (c) multi-threaded callers must externally synchronise deref vs. clear (the cache's lock protects map structure, NOT element lifetime); (d) if a future refactor adds a production-reachable erase / clear path, this function must switch to return-by-value or std::shared_ptr<const T>.

Co-authored-by: Cursor <cursoragent@cursor.com>
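The downgrade-flag contract from item 3 can be sketched as a pure function. The enum, the probe struct, and the `_sketch` name are hypothetical, and the auto policy shown (prefer f16 when its probe passes, else f32) is an assumption based only on the earlier remark that F16 is the default; the real resolver's auto matrix may differ.

```cpp
#include <cassert>

// Hypothetical dtype enum; autoselect corresponds to the CLI's auto (-1) value.
enum class kv_dtype { autoselect = -1, f32 = 0, f16, bf16, q8_0 };

// Hypothetical bundle of the backend flash-attn capability probes.
struct kv_probes {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

// Sketch only: always returns a concrete dtype. *out_was_downgraded becomes
// true only when an explicit f16/bf16/q8_0 request fell back to f32.
inline kv_dtype resolve_kv_attn_type_sketch(kv_dtype requested,
                                            const kv_probes & probes,
                                            bool * out_was_downgraded = nullptr) {
    if (out_was_downgraded) { *out_was_downgraded = false; }
    switch (requested) {
        case kv_dtype::autoselect:
            // Assumed auto policy; never reported as a downgrade.
            return probes.f16 ? kv_dtype::f16 : kv_dtype::f32;
        case kv_dtype::f32:
            return kv_dtype::f32;
        case kv_dtype::f16:
            if (probes.f16) { return kv_dtype::f16; }
            break;
        case kv_dtype::bf16:
            if (probes.bf16) { return kv_dtype::bf16; }
            break;
        case kv_dtype::q8_0:
            if (probes.q8_0) { return kv_dtype::q8_0; }
            break;
    }
    // Explicit request whose probe failed: fall to f32, never to a sibling dtype.
    if (out_was_downgraded) { *out_was_downgraded = true; }
    return kv_dtype::f32;
}
```

A caller such as the engine constructor would pass a local `bool` and print the one-line stderr warning when it comes back true; unit tests pass `nullptr` and stay quiet.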

ogad-tether

All five findings from the previous review have been addressed in commits 16b9b90 and bf0ce3bb:

1. kv_attn_type resolver test: Rewritten with a separate expected() mirror function that verifies the exact concrete dtype on all 80 grid points + cross-contamination spot checks. Solid.
2. Input scratchpad tensor coverage: Now round-trips both x_in and temb_in with distinct payload patterns (1.0f vs 2.5f) plus a cross-aliasing recheck. Would catch binding-collision bugs.
3. Silent fallback warning: resolve_kv_attn_type now takes an optional bool * out_was_downgraded out-param. Engine + bench emit a stderr warning on explicit-request downgrade. Auto path stays quiet. Clean API design with nullptr default.
4. UMA-bias tiebreak: New test_uma_aware_tiebreak_equal_vram_discretes covers the equal-VRAM discrete case with three sub-cases (interleaved UMA, adjacent discretes, three-way all-discrete tie).
5. Capability cache UaF docs: Thorough 4-point invariant comment on cached_backend_capabilities documenting the reference-lifetime contract and the conditions under which it would need to change.

The doc commit also adds a clear explanation for the round-5 gap and documents the UMA-bias driver-misreport failure modes.

25/25 tests, 256 individual checks. LGTM.

Zbig9000 requested review from GustavoA1604, freddy311082, ishanvohra2 and ogad-tether May 14, 2026 09:37

Zbig9000 requested review from a team as code owners May 14, 2026 09:37

Zbig9000 force-pushed the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch 2 times, most recently from b9f9535 to 51a17d9 Compare May 15, 2026 14:25

GustavoA1604 requested changes May 15, 2026

View reviewed changes

Zbig9000 force-pushed the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 51a17d9 to 1632e45 Compare May 18, 2026 10:23

ogad-tether reviewed May 18, 2026

View reviewed changes

GustavoA1604 and others added 13 commits May 19, 2026 10:41

Zbig9000 and others added 3 commits May 19, 2026 10:42

Zbig9000 force-pushed the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 903c312 to bf0ce3b Compare May 19, 2026 09:08

Zbig9000 requested review from GustavoA1604 and ogad-tether May 19, 2026 09:09

ogad-tether approved these changes May 19, 2026

View reviewed changes

GustavoA1604 approved these changes May 19, 2026

View reviewed changes

GustavoA1604 merged commit 184c641 into tetherto:supertonic_optimizations May 19, 2026
59 of 66 checks passed

Zbig9000 deleted the supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch May 19, 2026 12:18

ishanvohra2 mentioned this pull request Jun 5, 2026

Qvac 18605 tts ggml add and optimize vulkan for supertonic #17

Closed

GustavoA1604 merged 16 commits into tetherto:supertonic_optimizations from Zbig9000:supertonic_optimizations-QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic

Zbig9000 commented May 14, 2026 (edited)

GustavoA1604 left a comment

Zbig9000 commented May 18, 2026 (edited)

ogad-tether left a comment

Zbig9000 commented May 19, 2026

Review — Vulkan backend for Supertonic (PR #18)

Findings

Positive observations

ogad-tether left a comment

3 participants

Summary

Investigation methodology (TDD throughout)

Commit-by-commit walkthrough

787d966b — Round 1: Vulkan bring-up (initial commit)

d5518ee8 — Pre-existing missing-include fix

6ab085f6 — Round 2: capability-cache + 3 probes + prewarm

36dc758c — Round 3: multi-device auto-pick + 2 forward-compat probes

8087852b — Round 6: F16-weights operator deny-list

60eed5e9 — Round 4: multi-dtype K/V flash-attention dispatch

3c59e523 — Round 7: bench observability + voice cache + Vulkan env-var passthrough

5b166a79 — Round 8: front-block attn0 GPU bridge

0fa1593c — Round 9: style flash-attn GPU bridge

38a67e45 — Round 10: per-step text-input upload-skip

b54b7d43 — Round 11: packed-QK RoPE + GPU-bridge layout fix (CRITICAL CORRECTNESS)

bb99d3ce — Round 12: auto-pick UMA bias + text-encoder GPU bridge + pinned-host-buffer per-step inputs

#10 — Auto-pick UMA bias

#6 — Text-encoder speech-prompted-attention GPU bridge

#5 — Pinned-host-buffer per-step input scratchpad

Combined perf snapshot on RTX 5090

b9f95358 — Round 13: code-quality consolidation + Q8_0 K/V finding

1. New helper alloc_input_scratchpad_or_throw

2. Q8_0 K/V no-win documented for RTX 5090

Backwards-compatibility contract

Test plan

Smoke testing the CLIs

End-to-end real-Vulkan validation

File-by-file change summary

Deferred follow-ups (intentionally out of scope)

Linked
