Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic#18
GustavoA1604
left a comment
Please help address/clarify the following:
- Round 5 is skipped — no explanation
The summary says "twelve rounds of Vulkan-specific deltas (rounds 1–13, round 5 skipped)" but nowhere in the PR is there an explanation of what round 5 was or why it was skipped. Was it superseded by another round? Rolled into a different commit? Abandoned after testing? This leaves a gap in the audit log that makes it harder to assess whether the omission is safe or whether something was quietly dropped.
- The round-11 fix is redone in PR #21
PR #21 is a standalone fix for the same apply_rope_to_packed_qk layout bug fixed in round 11 here, but targeting supertonic_optimizations (without Vulkan). The PR description acknowledges the bug came from PR #16. What's unclear is the merge strategy: does PR #18 subsume PR #21 when it lands, or will both be merged separately and cause a double-application of the fix? The V-transpose fix in PR #21 also says it only touches 2 GPU-bridge call sites, while round 11 here touches 4 (build_group_graph_cache, ve_front_block_proj_cache, build_res_style_qkv_cache, and style sq/sk/sv). The difference needs to be reconciled before either merges.
- UMA bias heuristic is fragile on some device topologies
The round-12 fix (resolve_vulkan_device_index with is_uma_per_device) picks the discrete adapter by excluding GGML_BACKEND_DEVICE_TYPE_IGPU / _CPU / _ACCEL. This works for the RTX 5090 + AMD RADV iGPU test case. However, on machines where the discrete GPU is the only device and reports GGML_BACKEND_DEVICE_TYPE_IGPU (some Thunderbolt eGPUs, some ARM SoC configurations), the "all-UMA fallback" path would fire and argmax(free_vram) would still pick the right device. That's correct by the test matrix. But if someone has two UMA iGPUs and one discrete that also happens to report IGPU type due to a driver quirk, they'd silently get the wrong device with no warning. The existing test cases don't cover this; it might be worth a code comment documenting the assumption.
- Voice host cache reference stability — documented but not enforced
Round 7 introduces voice_host_cache and documents that "reference-stability contract [is] documented for the synthesis-pipeline call site." The test pins the contract via CPU-only checks. However, if a synthesizer call happens concurrently (e.g., from a thread pool or the iOS scenario described in the iOS concurrency fix commit), and the cache is evicted or a new voice is loaded mid-synthesis, the reference would dangle. The PR doesn't show any locking on the cache access path. Given that the iOS race fixes landed in the same PR history (the 36a2c56 commit fixing the gguf_init_from_file race), this deserves explicit scrutiny: is voice_host_cache accessed under any lock, or is it the caller's responsibility to ensure single-threaded access?
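The device-selection policy discussed in the UMA item above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the PR's resolve_vulkan_device_index: the name pick_vulkan_device, the exception types, and the choice to throw on an empty device list are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical mirror of the policy described in the review (not the PR's code).
//   requested >= 0  : explicit index, range-checked
//   requested == -1 : auto-pick by free VRAM, preferring non-UMA devices
//   requested <  -1 : reserved, rejected
inline int pick_vulkan_device(int requested,
                              const std::vector<std::uint64_t>& free_vram,
                              const std::vector<bool>& is_uma = {}) {
    const int n = static_cast<int>(free_vram.size());
    if (n == 0) throw std::runtime_error("no Vulkan devices visible");  // assumption
    if (requested < -1) throw std::invalid_argument("reserved negative device index");
    if (requested >= 0) {
        if (requested >= n) throw std::out_of_range("Vulkan device index out of range");
        return requested;
    }
    // UMA flags are optional: a missing or mis-sized list disables the bias.
    const bool have_flags = is_uma.size() == free_vram.size();
    bool any_discrete = false;
    if (have_flags) {
        for (int i = 0; i < n; ++i) {
            if (!is_uma[i]) any_discrete = true;
        }
    }
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (any_discrete && is_uma[i]) continue;  // restrict argmax to the discrete subset
        // Strict '>' keeps the lower index when two devices tie.
        if (best < 0 || free_vram[i] > free_vram[best]) best = i;
    }
    assert(best >= 0);
    return best;
}
```

With these rules, a device list where every adapter is flagged UMA falls back to a plain argmax over free VRAM, which is exactly the path where a mis-classified discrete card can lose to an iGPU that reports more memory.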
…sumption + voice cache threading + round-5 gap

Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR #18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 #5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp, resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3; operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded-per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per-call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated.

Reviewer's fourth ask ("the round-11 fix is redone in PR #21") was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR #21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR #21 in the new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reply 1: "Round 5 is skipped — no explanation"

The contiguous round-12 / round-13 numbering (instead of retroactively renaming round 12 to "round 5 (delayed)") is deliberate: the commit hashes referenced in PR descriptions and CI logs match the round labels in PROGRESS_SUPERTONIC.md without rebase churn. Added an explicit "Note on the round 5 gap" section in PROGRESS_SUPERTONIC.md between round 4 and round 7 so the audit log makes this unambiguous.

Reply 2: "The round-11 fix is redone in PR #21"

This PR's round 11 originally covered 4 sites: the same 2 sites PR #21 covers + 2 more (ve_front_block_proj_cache's V transpose for round 8's front-block GPU bridge + build_res_style_qkv_cache's sq/sk/sv transposes for round 9's style GPU bridge). Those 2 extras only matter when the Vulkan-only round-8/9 GPU bridges are wired, which is why PR #21's narrower scope was correct for the non-Vulkan branch.

Merge strategy after rebase: PR #21 is already in supertonic_optimizations. I just rebased this branch onto the new base, and the round-11 commit (ef266e4) is now a delta of only the 2 Vulkan-only V-transpose sites + comment merges. No double-application: the QVAC-18966 fix is applied exactly once via PR #21 in the new base.

Verified: CPU + Vulkan ctest -L unit 25/25 PASS post-rebase; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples).

Reply 3: "UMA bias heuristic is fragile on some device topologies"

- Empty UMA-flag list → falls back to round-3 argmax(free_vram) (unchanged behaviour for callers that haven't wired the UMA flags).
- Single discrete reporting _IGPU due to driver quirk: discrete is flagged UMA → excluded from the discrete-subset argmax → any_discrete == false → falls through to round-3 all-device argmax → discrete still picked by free-VRAM (correct outcome by coincidence on a single-discrete rig).

Reply 4: "Voice host cache reference stability — documented but not enforced"

Why no internal lock today: the cache exists to eliminate per-call GPU downloads of ttl / dp (~2 sync points per synthesize() on Vulkan / OpenCL). Adding an internal mutex would give back a measurable chunk of that saving (an uncontended std::mutex lock+unlock pair is small but not free on the hot path of every synth). Since the existing iOS fix already mandates one-Engine-per-thread for concurrent synthesis, the cache inherits the same constraint at zero extra cost.

Standard unordered_map guarantee re: rehash: element references are NOT invalidated by insert (only iterators are). So even if a second voice loads mid-call on the same thread (impossible today, but allowed for completeness), a held entry & from a prior get_or_load survives. The only operations that can invalidate are clear() / erase() on that entry, and clear() is only reachable on Engine destruction.

Strengthened the docstring in supertonic_internal.h with an explicit THREAD-SAFETY section documenting all of the above, including what a future thread-pool refactor would need (external mutex around get_or_load + the downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock).

Commit: 1632e45.
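The unordered_map guarantee that Reply 4 relies on can be checked in isolation. The sketch below is a stand-in, not the PR's voice_host_cache: voice_entry, voice_cache_sketch and reference_survives_rehash are hypothetical names, and the cache is deliberately unlocked to match the single-threaded-per-Engine model described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a cached voice payload.
struct voice_entry {
    std::vector<float> ttl;
};

// Hypothetical, unlocked cache: callers are assumed to be single-threaded.
class voice_cache_sketch {
public:
    const voice_entry& get_or_load(const std::string& name) {
        auto it = map_.find(name);
        if (it == map_.end()) {
            it = map_.emplace(name, voice_entry{std::vector<float>(4, 1.0f)}).first;
        }
        return it->second;
    }
    std::size_t bucket_count() const { return map_.bucket_count(); }

private:
    std::unordered_map<std::string, voice_entry> map_;
};

// Holds a reference across enough inserts to force at least one rehash and
// reports whether the referenced element still owns the same storage.
inline bool reference_survives_rehash() {
    voice_cache_sketch cache;
    const voice_entry& first = cache.get_or_load("voice-0");
    const float* data_before = first.ttl.data();
    const std::size_t buckets_before = cache.bucket_count();
    for (int i = 1; i < 1000; ++i) {
        cache.get_or_load("voice-" + std::to_string(i));
    }
    const bool rehashed = cache.bucket_count() != buckets_before;
    return rehashed && first.ttl.data() == data_before && first.ttl.size() == 4;
}
```

The check says nothing about clear() or erase(): those destroy the element itself, so a held reference would dangle regardless of rehash behaviour.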
ogad-tether
left a comment
Review — Vulkan backend for Supertonic (PR #18)
Thorough review of the 8753-line addition across 30 files. The overall engineering quality is high — TDD discipline is genuine, the commit-per-round structure makes the evolution auditable, and the backwards-compatibility contract is well-documented. The PR is in good shape for merge with a few items to consider.
Findings
1. test_resolver_returns_concrete_only asserts too weakly (test_supertonic_kv_attn_type.cpp)
The exhaustive 5×2×8 resolver sweep only checks dt != kv_attn_dtype::autoselect. A typo in the resolver (e.g., returning f16 when bf16 was requested + supported) would pass this test silently. Consider spot-checking the "happy path" cases with exact enum comparisons — e.g., requested=2, supports_bf16=true → bf16.
2. test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors (test_supertonic_input_scratchpad.cpp)
The test allocates x_in (512B) and temb_in (256B) but only does a tensor_set/tensor_get round-trip on x_in. If the buffer allocation failed to bind temb_in, this test wouldn't catch it.
3. Probe-gated silent fallback vs explicit operator request (resolve_kv_attn_type, supertonic_gguf.cpp:1473-1478)
When an operator explicitly requests --kv-attn-type bf16 but the backend doesn't support it, the resolver silently falls back to F32. This is documented as intentional (advisory-probe contract), but a fprintf(stderr, "warning: ...") on the explicit-request + unsupported path would save operators from silently getting F32 when they thought they had BF16. The auto path (-1) correctly stays silent. The bench JSON does surface the resolved type, so it's partially observable already.
4. Minor: resolve_vulkan_device_index UMA-bias tiebreak within discrete subset (test_supertonic_vulkan_device_select.cpp)
The test for test_hybrid_prefer_discrete_over_uma uses devices with distinct VRAM sizes (32GB vs 120GB). The tiebreak case of two discrete cards with equal VRAM (should pick lower index) is not tested. Covered by the non-UMA auto-pick tests, but worth adding one UMA-specific tiebreak case for completeness.
5. cached_backend_capabilities returns const& through a lock boundary (supertonic_gguf.cpp:779)
The returned reference outlives the lock_guard. This is safe in production because unordered_map references aren't invalidated by insert, and clear() is test-only. But supertonic_clear_capability_cache() could create a dangling reference in multi-threaded test scenarios. If test code ever calls clear() while another thread holds a reference from cached_backend_capabilities, that's UaF. Low risk given single-threaded test execution today, but worth a comment.
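For finding 5, one way to remove the lifetime question altogether is to copy the cached entry out while the lock is held. The sketch below is only an illustration of that alternative: capability_cache_sketch, backend_caps and the placeholder probe are hypothetical, and a `const void*` stands in for ggml_backend_t.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical subset of the cached capability flags.
struct backend_caps {
    bool native_leaky_relu;
    bool f16_kv_flash_attn;
};

using backend_handle = const void*;  // stand-in for ggml_backend_t

class capability_cache_sketch {
public:
    // Returns a copy made under the lock, so a later clear() cannot leave
    // the caller with a dangling reference.
    backend_caps get(backend_handle backend) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(backend);
        if (it == cache_.end()) {
            ++probe_calls_;  // test seam: counts real probes, not cache hits
            it = cache_.emplace(backend, probe(backend)).first;
        }
        return it->second;
    }
    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        cache_.clear();
    }
    int probe_calls() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return probe_calls_;
    }

private:
    // Placeholder: a real probe would query the backend for op support.
    static backend_caps probe(backend_handle) { return backend_caps{true, true}; }

    mutable std::mutex mutex_;
    std::unordered_map<backend_handle, backend_caps> cache_;
    int probe_calls_ = 0;
};
```

The struct here is a couple of bools, so the copy is cheap; whether that holds for the real capability record is an assumption.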
Positive observations
- The TDD caught real bugs (V layout transpose, env-var empty-string sentinel, pointer-compare upload-skip). The commit messages document the red→green cycle with specific failure modes, which is exactly how TDD should be practiced on low-level GPU code.
- The pure-logic resolver split (resolve_vulkan_device_index, resolve_kv_attn_type) makes the policy layer fully testable on CPU without a Vulkan adapter. Smart design.
- Backwards-compatibility is meticulously maintained: every existing flag/default preserves its semantics.
- The 25/25 CPU-only ctest suite catches regressions in the dispatch/capability/resolver contracts without needing GPU hardware in CI.
- Performance results are impressive (2.4–2.7× end-to-end speedup, 15× prewarm improvement on RTX 5090).
None of the findings are merge-blockers. Items 1–2 are low-effort test improvements; items 3–5 are suggestions for consideration.
… tests + surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to nullptr; two test files extended; two doc-comment blocks added). No production-logic surface modified for existing callers.

Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV (rms=285.6, abs_max=4703 on both backends, same seed + text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR #18:

1. test_resolver_returns_concrete_only (kv_attn_type). The original exhaustive 5 x 2 x 8 sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested+supported would pass silently. Rewritten with a second pure-function `expected()` mirror of the resolver's matrix; every one of the 80 grid points now CHECKs the resolver's return value against the expected concrete dtype. Added cross-contamination spot checks (requesting bf16 with f16+q8_0 supported but bf16 NOT supported must fall to f32, not silently to f16 or q8_0). Now 205 checks passed in test-supertonic-kv-attn-type.

2. test_cpu_fallback_returns_valid_buffer (input_scratchpad). Original only round-tripped x_in (one of two allocated tensors). Now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f), plus a cross-aliasing recheck (after writing temb_in, x_in must still read back its original 1.0f); a binding-collision bug where both tensors share memory would now fail this check.

3. resolve_kv_attn_type silent fallback on explicit operator request. Added optional `bool * out_was_downgraded` output parameter to the resolver, set to true IFF the operator explicitly requested f16/bf16/q8_0 AND the corresponding backend probe returned false AND we therefore returned f32. The auto path (-1) leaves the flag false (no operator surprise: auto-policy is doing its job). Engine ctor + supertonic-bench wired to emit a one-line `fprintf(stderr, "warning: requested --kv-attn-type %s but the resolved backend's flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)")` on a downgrade. Defaulted nullptr keeps the pure-logic unit tests stderr-clean. New test_downgrade_flag_signal pins the contract on every relevant path (auto + missing probe -> flag false; explicit + matching probe -> flag false; explicit + missing probe -> flag true; nullptr out-ptr safe).

4. test_uma_aware_tiebreak_equal_vram_discretes (vulkan_device_select). Added a dedicated UMA-bias-active test case: two discrete cards with EQUAL VRAM (32 GB each) alongside a UMA iGPU. Pins three sub-cases: interleaved UMA in the middle, adjacent discretes with no UMA, three-way all-discrete tie. Lower index wins in every case. The existing test 11's second CHECK already covered the interleaved-UMA case; this hoists the contract into its own named test so a future refactor reading the test names knows the tiebreak case is pinned.

5. cached_backend_capabilities UaF risk under test-only clear(). Added a long comment on the function documenting the four invariants: (a) production callers may hold the returned ref across subsequent calls for OTHER backends (unordered_map's insert-doesn't-invalidate-references guarantee); (b) production callers MUST NOT keep the ref alive across a clear() call (test code's responsibility); (c) multi-threaded callers must externally synchronise deref vs. clear (the cache's lock protects map structure, NOT element lifetime); (d) if a future refactor adds a production-reachable erase / clear path, this function must switch to return-by-value or std::shared_ptr<const T>.

Co-authored-by: Cursor <cursoragent@cursor.com>
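The downgrade-flag contract from item 3 can be pinned with exact enum comparisons, as item 1 asks. The following is a sketch under assumptions, not the PR's resolve_kv_attn_type: the enum layout, the kv_probe_results struct, and the auto policy (F16 when the probe allows it, otherwise F32) are illustrative guesses.

```cpp
#include <cassert>

// Hypothetical dtype enum; the real enumerator order is not shown in this log.
enum class kv_attn_dtype { autoselect, f32, f16, bf16, q8_0 };

// Hypothetical bundle of the per-backend flash-attn probe results.
struct kv_probe_results {
    bool f16;
    bool bf16;
    bool q8_0;
};

// Sketch of a probe-gated resolver. out_was_downgraded is set to true only
// when an explicit f16/bf16/q8_0 request was rejected and f32 was returned.
inline kv_attn_dtype resolve_kv_attn_sketch(kv_attn_dtype requested,
                                            const kv_probe_results& probes,
                                            bool* out_was_downgraded = nullptr) {
    if (out_was_downgraded) *out_was_downgraded = false;
    bool supported = false;
    switch (requested) {
        case kv_attn_dtype::autoselect:
            // Assumed auto policy: prefer F16 when supported, never warn.
            return probes.f16 ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        case kv_attn_dtype::f32:
            return kv_attn_dtype::f32;
        case kv_attn_dtype::f16:  supported = probes.f16;  break;
        case kv_attn_dtype::bf16: supported = probes.bf16; break;
        case kv_attn_dtype::q8_0: supported = probes.q8_0; break;
    }
    if (supported) return requested;
    if (out_was_downgraded) *out_was_downgraded = true;
    return kv_attn_dtype::f32;  // never falls sideways to another reduced type
}
```

A caller that wants the stderr warning passes a `bool` and checks it after the call; unit tests that only care about the dtype can leave the pointer at its nullptr default.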
Reply 1: test_resolver_returns_concrete_only asserts too weakly

Reply 2: test_cpu_fallback_returns_valid_buffer only round-trips one of two tensors

Reply 3: Probe-gated silent fallback vs explicit operator request

Engine ctor and supertonic-bench are wired to emit:

supertonic: warning: requested --kv-attn-type bf16 but the resolved backend's

One observation worth noting: on this dev rig the CPU backend's ggml_backend_supports_op(FLASH_ATTN_EXT(F32, BF16, BF16)) actually returns true (the CPU flash_attn_ext is generic), so the warning doesn't fire on CPU + --kv-attn-type bf16. That's correct probe behaviour, not a wiring bug. The warning will fire on adapters that genuinely reject the op (e.g., Vulkan without cooperative_matrix2 for BF16, or future backends that selectively reject Q8_0 K/V).

Reply 4: resolve_vulkan_device_index UMA-bias tiebreak

Interleaved UMA: [32GB discrete, 32GB discrete, 120GB UMA] → picks index 0 (lower discrete).

Reply 5: cached_backend_capabilities returns const & through a lock boundary

Production callers may hold the returned ref across subsequent cached_backend_capabilities calls for OTHER backends; std::unordered_map's reference-stability guarantee survives insert/emplace rehash, and only iterators are invalidated.
Two interleaved chatterbox concurrency fixes for iOS, collapsed into one history-preserving commit on top of the upstream merge so the qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604 SHA (no chained port-version bumps).

1) gguf_init_from_file race (the SIGABRT seen before this commit): bake_voice_conditioning() must run BEFORE we spawn the s3gen preload thread. Both paths funnel into gguf_init_from_file() (voice_encoder opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the ggml_init / gguf_init_from_file pair underneath is not safe to invoke concurrently from two threads against ggml's process-global state. Empirically races on Apple Silicon with a fast SIGABRT inside ggml_abort coming from the preload thread's ggml_init while the main thread is still executing voice_encoder_load.

2) Metal shared-buffer-type init race (the SIGSEGV in ggml_metal_buffer_is_shared, surfaced once #1 was fixed): after the preload thread spawns we now block on wait_for_preload() before the constructor returns, so the SDK e2e bootstrap's load -> immediate unload pattern ("preLoadUnload") can no longer tear down the engine while s3gen_preload is still inside ggml_backend_metal_buffer_type_shared_alloc_buffer -> ggml_metal_buffer_is_shared. Defeats the parallel-preload optimisation (s3gen_preload no longer overlaps with first T3 inference inside synthesize()); revisit once ggml-metal's shared buffer-type init is safe to use from a preload thread concurrent with construction-time teardown.

Together these two changes unblock chatterbox load on iPhone 16e (iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal enabled (qvac/pull/1992).

Co-authored-by: Cursor <cursoragent@cursor.com>
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node: short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error; surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch.

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from #16 unchanged, all PASS.

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
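The RELU+SCALE+ADD fallback mentioned above can be sanity-checked on plain vectors. The commit message does not show the exact graph, so the identity used here, leaky_relu(x) = (1 - slope) * relu(x) + slope * x, is an assumption about one decomposition that needs only those three ops; all function names are hypothetical and no ggml calls are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference "fused" leaky ReLU.
inline std::vector<float> leaky_relu_native(const std::vector<float>& x, float slope) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] > 0.0f ? x[i] : slope * x[i];
    }
    return out;
}

// Assumed portable decomposition: RELU, two SCALEs, one ADD.
inline std::vector<float> leaky_relu_decomposed(const std::vector<float>& x, float slope) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float relu = std::max(x[i], 0.0f);        // RELU
        const float scaled_relu = relu * (1.0f - slope); // SCALE
        const float scaled_x = x[i] * slope;             // SCALE
        out[i] = scaled_relu + scaled_x;                 // ADD
    }
    return out;
}

// Picks the path the way a load-time capability flag would.
inline std::vector<float> leaky_relu_dispatch(const std::vector<float>& x, float slope,
                                              bool use_native) {
    return use_native ? leaky_relu_native(x, slope) : leaky_relu_decomposed(x, slope);
}

inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

The two paths agree only up to floating-point rounding on positive inputs, which is why a parity test for this kind of dispatch compares with a tolerance rather than bit-exactly.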
`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but `<atomic>` was never included; the file relied on a transitive include chain that broke once any consumer rearranged includes. Surfaces as `error: variable 'std::atomic<int> ... has initializer but incomplete type'` on a clean build. Pre-existing bug, unrelated to QVAC-18605 itself but blocked local CTest runs against the Vulkan-optimisation work. Trivial additive include with no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>
…s + prewarm
Layered on top of the QVAC-18605 Vulkan bring-up commit; the
round-2 changes generalise the bring-up's "load-time backend
probe" pattern into a process-wide capability cache and add
three more probes / dispatch hooks that fit the same shape.
Net effect on Vulkan: redundant supports_op traffic eliminated,
defensive auto-policy gating extended to F16 weights, forward-
compat Q8_0 K/V probe primed for a follow-up dispatch flip,
and an opt-in --prewarm hook that lets operators amortise the
~hundreds-of-ms cold-start shader-compile cost outside the
operator-visible first synth call.
1) Process-wide capability-probe cache keyed by ggml_backend_t
The bring-up's three load sites (load_supertonic_gguf,
Engine::Engine, supertonic_bench's main) each ran the
LEAKY_RELU + F16-K/V flash-attn supports_op queries
independently — 2-3x redundant probe traffic per backend.
On Vulkan, supports_op may inspect the device's pipeline
state (~50-200 us per query on Adreno / llvmpipe / RADV in
microbenchmarks); the cache short-circuits 100 % of the
duplicates. Test seam (supertonic_clear_capability_cache +
supertonic_capability_probe_call_count) lets the unit test
verify the cache is hit on the second call by comparing the
counter before / after. Per-backend independence verified
against two distinct CPU backend handles.
2) F16 mul_mat backend-capability probe
Symmetric to the F16-K/V flash-attn probe. The bring-up
auto-enabled use_f16_weights on `!backend_is_cpu` blindly;
a partial-port backend that ships F16 storage but rejects
the hot vector-estimator W_query mul_mat shape would crash
at first synth call. Probe builds the live shape ([256,256]
F16 weight x [256,16] F32 activation) and asks the backend;
auto-policy refuses materialisation on a `false` answer
(slower F32 path stays correct). Manual --f16-weights 1
still forces materialisation (debug-shim escape hatch).
Probe cached; test verifies CPU returns true.
3) Q8_0 K/V flash-attn forward-compat probe
Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0
(and Q4_0) K/V types in scalar + coopmat2 paths. Switching
K/V from F16 to Q8_0 would halve the per-step upload
bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape;
~1 MB / synth on the default 5-step x 4-site schedule) in
exchange for a small (~0.5 %) drift on the attention output.
This commit adds the probe + caches the result; live
dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift
measurement against the parity harness on a real Vulkan
adapter. Bench output annotates `(q8_0_kv_attn=available)`
when the probe says yes so operators can confirm their
hardware is ready for the follow-up.
4) Engine::warm_up(text) + EngineOptions::prewarm_text +
--prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench)
First-synth-latency reduction on Vulkan / OpenCL. In-tree
thread_local graph caches handle every subsequent call but
can't avoid the first pipeline-compile cost (~hundreds of
ms on Adreno / RADV per chatterbox PROGRESS.md). warm_up
runs one throwaway synth at construction time on a caller-
supplied sample text so the operator-visible first synth
sees steady-state latency. Auto-no-op on CPU (no shader-
compile cost). Bench's --prewarm runs the cold-start synth
BEFORE the timed loop (independent of --warmup N which only
discards N timed runs from the median); cold-start latency
logged as `[prewarm] cold-start synth on '...' took N.Nms`
and emitted to --json-out as "prewarm_ms".
5) Bench output extended
Backend log line surfaces every dispatch flag plus the
cold-start prewarm latency:
Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on)
(native_leaky_relu=on) (q8_0_kv_attn=available)
--json-out gains "f16_attn", "f16_weights",
"native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms"
keys for downstream analysis tooling.
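The "~1 MB / synth" figure in item 3 follows from the quoted per-tensor sizes if "per K/V" is read as "for each of K and V", which is an interpretation rather than something the commit states. The constants below restate the commit's round numbers; the 34-byte block size is ggml's Q8_0 layout (32 int8 values plus one F16 scale), which is why the halving is approximate.

```cpp
#include <cassert>

// Round numbers quoted in the commit message.
constexpr int kb_per_tensor_f16 = 50;
constexpr int kb_per_tensor_q8_0 = 25;
constexpr int tensors_per_site = 2;  // K and V (assumed reading of "per K/V")
constexpr int sites_per_step = 4;
constexpr int steps_per_synth = 5;

// Total K/V upload per synth, in KB, for a given per-tensor size.
constexpr int upload_kb_per_synth(int kb_per_tensor) {
    return kb_per_tensor * tensors_per_site * sites_per_step * steps_per_synth;
}

// Exact storage ratio of Q8_0 to F16: 34 bytes per 32 elements vs 2 bytes each.
constexpr double q8_0_over_f16_ratio() {
    return 34.0 / (32.0 * 2.0);
}

static_assert(upload_kb_per_synth(kb_per_tensor_f16) == 2000, "F16: 2000 KB per synth");
static_assert(upload_kb_per_synth(kb_per_tensor_q8_0) == 1000, "Q8_0: 1000 KB per synth");
```

Under these assumptions the saving is 1000 KB per synth, and the exact storage ratio of 0.53125 means the real saving would be slightly smaller than a clean half.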
Tests
- test-supertonic-capability-cache (NEW, LABEL "unit"): probe
cache short-circuit + clear seam + per-backend independence
+ idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke.
18 / 18 checks pass.
- test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface
contract for EngineOptions::prewarm_text + Engine::warm_up
via SFINAE. 9 / 9 checks pass.
- All existing CPU-only unit tests (test-supertonic-vulkan-
dispatch, -portable-ops, -backend-dispatch, -rope-in-graph,
-rope-packed-qk, -in-graph-transpose, -convnext-block-fused,
-graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus
resample / cpu-caches / t3-caches): all 13 pass unchanged.
- ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ /
184+ individual checks).
Build
- All changed source files compile clean with both
-DGGML_USE_VULKAN defined and undefined.
- No public-API break: EngineOptions::prewarm_text is a new
optional field defaulting to empty (no-op), Engine::warm_up
is a new method (existing callers don't have to invoke it).
Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"):
persistent VkPipelineCache (cross-process), BF16 K/V flash-attn,
Q8_0 K/V live dispatch wiring, multi-device load-balancing.
Co-authored-by: Cursor <cursoragent@cursor.com>
…vice auto-pick + 2 forward-compat probes

Three more Vulkan-specific deltas, all developed test-first. New tests were committed first, observed to fail on the missing symbol, and only then was the implementation written and the tests re-run to verify green.

1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities flag). Symmetric to the round-2 Q8_0 K/V probe. Vulkan's FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2-only path; BF16 has the same 2-byte per-element footprint as F16 (so identical upload bandwidth) but the wider 8-bit exponent range avoids the F16 underflow on small attention scores. Forward-compat: the live --kv-attn-type bf16 dispatch wiring is deferred to a follow-up that measures drift against the parity harness on a real Vulkan adapter.

2. Multi-device auto-pick for --vulkan-device -1. Wires the previously-reserved auto-pick API: walks every visible adapter, queries ggml_backend_vk_get_device_memory() to read free VRAM, and dispatches into a pure-logic helper resolve_vulkan_device_index(requested, free_vram_per_device) that picks argmax(free_vram); ties → lower index for stable per-run assignment on identical-spec multi-GPU machines. The pure-logic helper is testable on CPU with synthetic inputs (8 test functions, 23 checks). Reserved-future negative values (-2, -100, ...) now throw instead of silently falling through to device 0. Verbose mode logs the per-device VRAM table so operators can confirm the auto-pick chose the expected adapter.

3. Pinned-host-buffer-type capability probe (6th cache flag) + bench surface. Probes whether ggml_backend_vk_host_buffer_type() is callable on the resolved backend (Vulkan + non-null buffer-type). Forward-compat: primes the capability cache for a follow-up per-engine input-scratchpad refactor that skips ggml-vulkan's internal staging-buffer hop on per-step uploads. Bench output now shows bf16_kv_attn_available + pinned_host_buffer_available in both the human-readable backend tag and the JSON output so operators can pre-flight whether a future opt-in will be effective on their machine.

Test plan (TDD round 3):
- test-supertonic-capability-cache: 27 / 27 checks pass (was 18, +9 checks for round-3: BF16 K/V smoke + cache-slot share, pinned-host-buffer smoke + cache-slot share, null-backend defensive checks for both new probes).
- test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass (8 test functions: empty-list, single-device, argmax-VRAM, tie-break, explicit-index passthrough, out-of-range, reserved-negative, zero-VRAM handling).
- Whole CPU-only ctest -L unit reports 16 / 16 tests passing, zero regressions on round-1 / round-2 / audit-follow-up tests.

CLI surface:
- supertonic CLI + chatterbox CLI usage strings updated to document --vulkan-device -1 = auto-pick adapter with most free VRAM.
- supertonic-bench usage string updated likewise.

Co-authored-by: Cursor <cursoragent@cursor.com>
…hts operator deny-list
Round 6 layers a user-overridable extra deny-list on top of the
existing hand-curated should_materialise_f16_weight() allow-list.
The curated allow-list (Phase 2A) already excludes biases, norms,
embeddings, depthwise convs, and pre-transposed companions; the
round-6 deny-list lets operators force-keep specific additional
tensors as F32 even when --f16-weights is on. Use cases:
- A/B testing: researcher excludes a specific tensor pattern
temporarily without recompiling.
- Hardware-specific drift mitigation: operator pins a problematic
tensor to F32 via config rather than disabling F16 weights
wholesale.
- Future-GGUF safety net: new tensor patterns added in future
GGUFs that the curated allow-list inadvertently scoops in can
be excluded via config without a code change.
Smallest blast radius of the four follow-up rounds — load-time
policy only, runtime dispatch unaffected, zero behaviour change
on the empty-deny-list default path.
Strict TDD discipline (per the user's "double check, don't break
anything" constraint):
- Both new tests committed FIRST.
- Both confirmed to fail to compile on the missing symbols
(predicate test: 'too many arguments to should_materialise_f16_weight';
API test: 'EngineOptions has no member f16_weights_deny_list').
- Implementation written.
- Both tests + every existing unit test re-run; all green.
What changed:
1. 2-arg overload should_materialise_f16_weight(name,
extra_deny_substrings) added alongside the existing 1-arg
version (existing test + call sites unchanged). Substring
matching matches the curated predicate's audit-friendly style;
no regex compile cost or invalid-pattern surface. The deny-
list can only flip true → false, never false → true. Empty
strings inside the deny-list are SKIPPED defensively, not
treated as universal matches (config-typo guard).
2. EngineOptions::f16_weights_deny_list (vector<string>, default
empty) — public API surface. Wired through Engine::Impl →
load_supertonic_gguf → the per-tensor allocation loop.
3. load_supertonic_gguf 7th parameter added at the end of the
signature with a {} default — every existing call site keeps
compiling without modification.
4. supertonic_model::f16_weights_excluded_count counter bumped at
load time when a curated-hot tensor is excluded by the user's
deny-list. Surfaced in bench's human + JSON output so
operators can confirm their config took effect.
5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on
supertonic-cli, tts-cli (chatterbox), and supertonic-bench
(comma-separated substring patterns).
6. Verbose-log line in load_supertonic_gguf when the deny-list is
non-empty (silent on the default path — no visual noise on
existing operator workflows).
Test plan (TDD round 6):
- test-supertonic-f16-weights (UPDATED): existing 36 checks
(positives, negatives, edges) + 29 new round-6 checks across 7
new test functions (empty-list passthrough, matching-deny-
excludes, non-matching-no-op, cannot-promote-cold, multiple-
patterns ANY-match, empty-string defensive skip, empty-name
safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time
gate for EngineOptions::f16_weights_deny_list +
load_supertonic_gguf 7th param; runtime defaults check +
assignability + regression guards on every other documented
EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0
failures, 0 regressions on round-1/2/3 + audit follow-up + the
baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench
binaries: --f16-weights-deny flag parses correctly, surfaces in
--help output, and threads through to the load layer.
Co-authored-by: Cursor <cursoragent@cursor.com>
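The deny-list semantics described in this commit (substring match, true → false only, empty patterns skipped) can be sketched as below. This is a minimal illustration, not the PR's code: `curated_is_hot_sketch` is a hypothetical stand-in for the curated 1-arg predicate, whose real rules are not shown here, and the `_sketch` suffixes mark every name as illustrative.

```cpp
#include <string>
#include <vector>

// Placeholder for the curated 1-arg predicate. The real allow-list is not
// shown in this PR text; this rule is an assumption for illustration only.
static bool curated_is_hot_sketch(const std::string & name) {
    const std::string suffix = ".weight";
    const bool has_suffix = name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    return has_suffix && name.find("norm") == std::string::npos;
}

// Two-argument form: the deny-list can only turn a "true" into a "false".
bool should_materialise_f16_weight_sketch(
        const std::string & name,
        const std::vector<std::string> & extra_deny_substrings) {
    if (!curated_is_hot_sketch(name)) {
        return false;  // cold tensors can never be promoted by the deny-list
    }
    for (const std::string & pat : extra_deny_substrings) {
        if (pat.empty()) {
            continue;  // an empty pattern is skipped, not a universal match
        }
        if (name.find(pat) != std::string::npos) {
            return false;  // any matching pattern excludes the tensor
        }
    }
    return true;
}
```

Checking the curated predicate first is what makes the "cannot promote cold tensors" property hold by construction rather than by convention.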
…ype K/V flash-attention dispatch
Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into a
enum of four concrete dtypes (plus an `auto` sentinel) + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag
so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth
as F16, no F16 underflow on small attention scores) or Q8_0 K/V
(Vulkan + half the K/V upload bandwidth) on adapters that advertise
the corresponding capability. Default `auto` falls back to
`--f16-attn` so every existing operator config sees zero behaviour
change.
Strict TDD throughout: Prereq B extends the F16 parity harness to
cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both
hot shapes) BEFORE touching any production code; new pure-logic
resolver test (`test-supertonic-kv-attn-type`, 106 checks across the
full {-1, 0..3} × legacy × probe-mask matrix); new API-surface
SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks).
Tests committed first, observed to fail on missing symbols, then
implementation added.
Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch
site (same pattern as round-3's `resolve_vulkan_device_index`).
Probe-rejected explicit requests fall back to F32 silently
(advisory-probe contract); out-of-range int throws to surface CLI
typos loudly. Vector-estimator dispatch site
(`build_text_attention_cache`) replaces the F16-only cast with a
switch on the enum; cache key promoted from `bool f16_kv_attn` to
`kv_attn_dtype kv_attn_type`. Bench surface adds `(kv_attn_type=…)`
to the human-readable backend line and `"kv_attn_type"` +
`"kv_attn_type_requested"` to the JSON output so log-grep / CI
attribution works across machines.
Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch
so invalid values surface as a clean `error: ...` line + exit 2
(also fixes the pre-existing latent crash on `--vulkan-device abc` /
`--seed nonsense`).
Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0
regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…servability + voice cache + Vulkan env-var passthrough
Lowest impact-÷-risk round of the four planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup.
1. Voice ttl/dp host cache (`detail::voice_host_cache`). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site.
2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)` public helper + `EngineOptions::vulkan_env_overrides` field + `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` / `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` / `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags on all three binaries). ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. `set_env_if_unset` semantics so an operator-set env var still WINS over the EngineOptions override.
3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync` opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware.
4. Bench per-denoise-step breakdown (`--bench-per-step`). Times each `supertonic_vector_step_ggml` call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape.
Strict TDD throughout. Two new test executables committed first, observed to fail on missing symbols, then implementation written.
TDD also caught a real bug: the original env-key validator used a `std::string()` empty-as-success sentinel, which collided with the empty-string-as-key edge case; the test pinned the contract and forced a `bool / out-param` API fix BEFORE any production wiring went in.
Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions (was 19; +2 new tests = 54 new checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
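The all-or-nothing validation, the set-if-unset precedence, and the `bool / out-param` validator shape described in this commit can be sketched as follows. This is a hedged illustration only: the function names carry a `_sketch` suffix because they are not the PR's symbols, the `GGML_VK_` prefix rule is an assumed validation policy (the commit text does not state the actual rule), and `setenv` is POSIX, so a Windows build would need a different call.

```cpp
#include <map>
#include <stdexcept>
#include <stdlib.h>   // setenv is POSIX, not ISO C++
#include <string>

// Returns true when the key is acceptable; on failure writes a reason to
// *out_error. A bool result (rather than "empty string == success") keeps
// the empty-key case unambiguous. The prefix rule is an assumption.
bool validate_env_key_sketch(const std::string & key, std::string * out_error) {
    const std::string prefix = "GGML_VK_";
    std::string why;
    if (key.empty()) {
        why = "empty env key";
    } else if (key.find('=') != std::string::npos) {
        why = "env key contains '=': " + key;
    } else if (key.compare(0, prefix.size(), prefix) != 0) {
        why = "env key lacks GGML_VK_ prefix: " + key;
    }
    if (!why.empty() && out_error != nullptr) {
        *out_error = why;
    }
    return why.empty();
}

// Pass 1 validates every key; pass 2 applies. A bad key therefore throws
// before any environment variable has been touched.
void apply_vulkan_env_overrides_sketch(
        const std::map<std::string, std::string> & overrides) {
    std::string error;
    for (const auto & kv : overrides) {
        if (!validate_env_key_sketch(kv.first, &error)) {
            throw std::invalid_argument(error);
        }
    }
    for (const auto & kv : overrides) {
        // overwrite=0: a value already set by the operator wins.
        setenv(kv.first.c_str(), kv.second.c_str(), /*overwrite=*/0);
    }
}
```

Splitting validation and application into two passes is the simplest way to get the "no partial application" guarantee without having to roll back.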
…PU bridge
Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.
Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site).
Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host-rotate path.
The blit primitive parity gate already shipped with PR #16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).
Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…U bridge
Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win).
- vector_res_style_qkv_result extended with `sq_gpu / sk_gpu / sv_gpu` GPU handles, populated unconditionally by `run_res_style_qkv_cache` (cheap: no GPU sync, just `ggml_graph_get_tensor` lookups). Same shape as `vector_group_graph_result::q_rope_gpu` etc from the round-1 2C-lite work.
- `run_res_style_qkv_cache` host-download gating: the 3 `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv` are now gated on `trace != nullptr`. Production path skips them entirely. Mirrors the round-1 2C-lite `need_host_qkv = (trace != nullptr)` gate. `post` stays unconditional: it is consumed by the next-stage `run_style_residual_cache`, which still expects a host vector (cross-stage GPU bridge for `post` is deferred).
- 4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: `!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge. Trace mode falls back to the legacy host bridge so the trace harness still gets all the host vectors.
Strict TDD: parity test (`test-supertonic-graph-to-graph-blit`) extended with explicit style-shape coverage (`style_sq_L1` trip-wire + clarified `style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact `max_abs = 0.0`.
Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…t upload-skip
After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is `text_emb` (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for `style_v_in` / `kctx_in`) into a reusable `upload_skip_tracker` helper and applies it to the front-block + 3 group caches.
CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local `std::vector<float>` in `Engine::Impl::synthesize()` (and bench loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next stack-local vector of the same size, so synth N+1 may have `text_emb.data() == synth_N.text_emb.data()` despite holding completely different data. A naive pointer-compare upload-skip would silently leak prior synth's text-encoder embedding into the next synth's GPU buffer.
Mitigation: caller MUST invoke `tracker.reset()` at every synth boundary (`current_step == 0`). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it.
Per-synth wins:
- 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth
- ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length)
Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7 functions, 41 checks) committed first, observed to fail compile (`upload_skip_tracker was not declared`), then implementation added.
Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0 failures, 0 regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
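The pointer-compare skip and the mandatory per-synth reset can be sketched in a few lines. This is an assumed shape, not the PR's `upload_skip_tracker`: the member names and the `needs_upload` method are hypothetical, chosen only to show why a reset at the synth boundary is required.

```cpp
#include <cstddef>

// Hypothetical tracker: remembers the last host pointer + size uploaded.
struct upload_skip_tracker_sketch {
    const void * last_ptr   = nullptr;
    std::size_t  last_bytes = 0;

    // Returns true when the caller must upload. A repeat of the same
    // (pointer, size) pair is treated as "already on the device".
    bool needs_upload(const void * ptr, std::size_t bytes) {
        if (ptr != nullptr && ptr == last_ptr && bytes == last_bytes) {
            return false;
        }
        last_ptr   = ptr;
        last_bytes = bytes;
        return true;
    }

    // Must be called at every synth boundary: the allocator may hand the
    // next synth a buffer at the same address holding different data.
    void reset() {
        last_ptr   = nullptr;
        last_bytes = 0;
    }
};
```

Within one synth, steps 2 to 5 see the same pointer and skip the upload; after `reset()` the first step of the next synth uploads again even if the address happens to be reused.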
…PU-bridge layout fix
Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.
Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.
The fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip, so the same `ggml_cont(ggml_transpose(...))` is open-coded at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.
Verification:
| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |
CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.
The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.
Co-authored-by: Cursor <cursoragent@cursor.com>
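To make the two layouts in this fix concrete, here is a small host-side sketch with flat arrays and no ggml dependency. It relies on ggml's convention that `ne[0]` is the contiguous dimension, so `ne=[L, HD]` stores element (t, c) at `c*L + t` and `ne=[HD, L]` stores it at `t*HD + c`. The function name is hypothetical; in the PR the flip is done in-graph with `ggml_cont(ggml_transpose(...))`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// src: L*HD floats, time contiguous (ggml ne=[L, HD]), index c*L + t.
// Returns the same values with channels contiguous (ne=[HD, L]),
// index t*HD + c, i.e. the time-major-flat layout the blit target expects.
std::vector<float> to_time_major_flat_sketch(const std::vector<float> & src,
                                             std::size_t L, std::size_t HD) {
    if (src.size() != L * HD) {
        throw std::invalid_argument("src size does not match L * HD");
    }
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

Both buffers hold the same `L * HD` floats, which is why a shape-only check cannot catch the mismatch and a byte-level comparison against a reference is needed.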
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs
Three independent wins bundled into one round, strict TDD on each - new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware.
== #10 - Auto-pick UMA bias ==
Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM".
Extended `resolve_vulkan_device_index` with an optional third arg:
    int resolve_vulkan_device_index(
        int requested,
        const std::vector<size_t> & free_vram_per_device,
        const std::vector<bool> & is_uma_per_device = {});
Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic.
Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` - sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.
`test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-index-ignores-UMA-bias, mismatched-length-throws).
== #6 - Text-encoder speech-prompted-attention GPU bridge ==
Master's Metal-port branch (PR #15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache.
Round 12 #6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`:
    if (!model_prefers_cpu_kernels(m)) {
        thread_local speech_prompted_merged_cache merged_caches[2];
        // rebuild on key change, then:
        run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
        return;
    }
    // ... legacy two-cache CPU path unchanged
Eliminates per call:
- 2 GPU->host downloads (q_out, v_out)
- 3 host->GPU uploads (q_pack, k_pack, v_pack)
- 1 graph dispatch
- All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone.
CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host-side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU).
`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
- run_speech_prompted_merged_cache symbol via SFINAE
- speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE
- free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit)
6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`).
== #5 - Pinned-host-buffer per-step input scratchpad ==
Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred.
Round 12 #5 lands the helper:
    ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
        const supertonic_model & model,
        ggml_context * input_ctx);
Returns nullptr on null model.backend / null input_ctx / non-Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`.
Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites:
- vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
- ve_front_block_graph_cache: x_in + mask_in + t_emb_in
Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy).
Dual-context pattern:
1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
2. Create x_in / temb_in / etc. in input_ctx
3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check
Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata)
CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains.
`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
- Helper symbol existence (SFINAE)
- nullptr return on CPU backend (idempotent across repeats)
- Null-pointer safety on null model.backend / null input_ctx
11 / 11 CPU-only checks pass.
== Combined perf snapshot on RTX 5090 ==
Long-prompt bench (173 chars, ~15s of audio):
    Round 11 baseline:    76.11 ms / 5 steps (123x realtime)
    Round 12 (all three): 27.99 ms / 5 steps (537x realtime)
                          ^ 2.7x faster
    Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster)
    Prewarm cold-start:    330 ms -> 21 ms (15x faster)
Short-prompt bench (Hello-world class, ~3s audio):
    Round 11 baseline: 44.08 ms (74x realtime)
    Round 12:          23.31 ms (394x realtime)
Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
    Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime)
    Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime)
                                   ^ 6.4x faster for users following help text
== Test plan ==
CPU build:
    cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
    cmake --build tts-cpp/build -j
    ctest --test-dir tts-cpp/build -L unit
    -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-encoder-gpu-bridge, +1 pinned-host-buffer)
Vulkan build:
    cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
    cmake --build tts-cpp/build-vulkan -j
    ctest --test-dir tts-cpp/build-vulkan -L unit
    -> 24 / 24 PASS
End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter writes a valid WAV.
Co-authored-by: Cursor <cursoragent@cursor.com>
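The UMA-aware auto-pick rules listed in this commit (empty UMA list keeps round-3 behaviour, prefer the discrete subset, all-UMA falls back, explicit index passes through, mismatched lengths throw) can be sketched as a pure function. This is an illustration under stated assumptions rather than the PR's implementation: the `_sketch` name is hypothetical, the exception types are guesses, and throwing on an empty device list in auto mode is assumed because the commit text does not say what that case does.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

int resolve_vulkan_device_index_sketch(
        int requested,
        const std::vector<std::size_t> & free_vram_per_device,
        const std::vector<bool> & is_uma_per_device = {}) {
    const int n = static_cast<int>(free_vram_per_device.size());
    if (requested < -1) {
        throw std::invalid_argument("reserved negative device index");
    }
    if (requested >= 0) {
        if (requested >= n) {
            throw std::out_of_range("device index out of range");
        }
        return requested;  // explicit choice ignores the UMA bias
    }
    if (n == 0) {
        throw std::runtime_error("no devices to pick from");  // assumption
    }
    if (!is_uma_per_device.empty() &&
        is_uma_per_device.size() != free_vram_per_device.size()) {
        throw std::invalid_argument("UMA flag count != device count");
    }
    bool any_discrete = false;
    for (bool uma : is_uma_per_device) {
        if (!uma) any_discrete = true;
    }
    int best = -1;
    for (int i = 0; i < n; ++i) {
        // Skip UMA adapters only when a discrete alternative exists.
        if (any_discrete && is_uma_per_device[i]) continue;
        // Strict '>' keeps the lower index on ties.
        if (best < 0 || free_vram_per_device[i] > free_vram_per_device[best]) {
            best = i;
        }
    }
    return best;
}
```

Note that the function trusts the flags it is given: a discrete adapter mis-reported as UMA alongside a genuine iGPU would still lose the pick, which is the driver-quirk case raised in review.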
…lidation + Q8_0 K/V finding
Round 13 is a strict-improvement-only follow-up to round 12: no code path is removed, no optimisation is rolled back, and the end-to-end perf on every backend stays at the round-12 level. Two deliverables, both no-regret:
== 1. New helper `alloc_input_scratchpad_or_throw` ==
Round 12 #5 inlined the "try pinned-host first, fall back to default backend buffer, throw on both-fail" idiom at 4 cache sites (front block + 3 group caches):
    cache.input_buf = try_alloc_inputs_in_pinned_host_buffer(model, cache.input_ctx);
    if (!cache.input_buf) {
        cache.input_buf = ggml_backend_alloc_ctx_tensors(cache.input_ctx, model.backend);
        if (!cache.input_buf) {
            // per-cache teardown + throw with cache-specific message
        }
    }
Round 13 factors it into one helper. Each caller becomes:
    cache.input_buf = alloc_input_scratchpad_or_throw(
        model, cache.input_ctx, "vector_group_graph_cache");
Same correctness contract: CPU / Metal / OpenCL fall back to default backend buffer; Vulkan tries pinned-host first. Defensive failure modes consolidated: null model.backend, null input_ctx, null cache_name all throw std::runtime_error with a message that includes the cache name, instead of segfaulting in an error-handler path. Single point of maintenance for the pattern; future cache builds that want pinned-host inputs use the helper directly.
`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins the contract via SFINAE on the symbol + CPU-fallback round-trip through `ggml_backend_tensor_set` / `get` + null-arg throws + empty-ctx error message includes the cache name. CPU-only, no GGUF fixture required. CI test count goes from 24 / 24 (round 12) to 25 / 25 (round 13).
Perf impact: zero. Same code path, same allocations, same data movement, just one fewer level of nesting at each call site.
== 2. Q8_0 K/V no-win documented for RTX 5090 ==
Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench output advertises `q8_0_kv_attn=available`. Round 13 measures the trade-off on the test rig (RTX 5090, 1.79 TB/s memory bandwidth, long prompt 206 chars / 18 s audio):
    --kv-attn-type f16:  total=31.11 ms (588x realtime)  <- default
    --kv-attn-type q8_0: total=31.84 ms (575x realtime)  <- 2 % slower
The F32->Q8_0 cast overhead exceeds the saved K/V upload bandwidth on a high-bandwidth discrete GPU. Operator guidance: stick with the F16 default on RTX 5090 and similar high-bandwidth discretes. Q8_0 is shipped for adapters where the K/V upload bottlenecks the synth (older PCIe 3.0, lower-end discretes, iGPUs with slow BAR); cross-over point to be measured per-adapter by operators using `--bench-per-step` from round 7.
== Test plan ==
    ctest --test-dir tts-cpp/build -L unit
    -> 25 / 25 PASS (was 24 / 24 in round 12; +1 input-scratchpad)
    ctest --test-dir tts-cpp/build-vulkan -L unit
    -> 25 / 25 PASS
End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe); every adapter writes a valid WAV.
Perf on RTX 5090 (10 runs + 3 warmup, long prompt):
    Round 12 baseline: med= 31.11 ms (588x realtime)
    Round 13:          med= 31.71 ms (577x realtime)
    -> within run-to-run noise; no regression.
Co-authored-by: Cursor <cursoragent@cursor.com>
…sumption + voice cache threading + round-5 gap
Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms).
Addresses three reviewer asks on PR #18:
1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 #5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn.
2. UMA-bias assumption (supertonic_gguf.cpp, resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3; operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.
3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded-per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per-call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated.
Reviewer's fourth ask, "the round-11 fix is redone in PR #21", was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR #21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR #21 in the new base.
Co-authored-by: Cursor <cursoragent@cursor.com>
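The `unordered_map` guarantee the docstring leans on (references to elements stay valid across inserts and rehashes; only iterators are invalidated) can be seen in a small lookup-or-load sketch. The class below is a hypothetical stand-in for `detail::voice_host_cache`: its member names, the `get_or_load` signature, and the loader callback are assumptions, and it is deliberately single-threaded, matching the supported model described above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-voice payload (the real fields are not shown in the PR).
struct voice_entry_sketch {
    std::vector<float> ttl;
    std::vector<float> dp;
};

// Single-threaded lookup-or-load cache; no internal locking by design.
class voice_host_cache_sketch {
public:
    // Loader is any callable taking the voice name and returning an entry.
    template <class Loader>
    const voice_entry_sketch & get_or_load(const std::string & voice, Loader && load) {
        auto it = map_.find(voice);
        if (it == map_.end()) {
            it = map_.emplace(voice, load(voice)).first;
            ++loads_;
        }
        // Reference stays valid across later inserts / rehashes, but NOT
        // across erase() or clear().
        return it->second;
    }
    std::size_t loads() const { return loads_; }

private:
    std::unordered_map<std::string, voice_entry_sketch> map_;
    std::size_t loads_ = 0;
};
```

A caller may therefore keep the returned reference (and a `.data()` pointer taken from it) while other voices are loaded, as long as nothing erases or clears the map in the meantime.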
… tests + surface explicit-dtype downgrades
Pure additive change (one new resolver out-param defaulting to nullptr; two test files extended; two doc-comment blocks added). No production-logic surface modified for existing callers.
Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV (rms=285.6, abs_max=4703 on both backends, same seed + text), confirming no rounds-1..13 optimisation regressed.
Addresses Omar's five non-blocker findings on PR #18:
1. test_resolver_returns_concrete_only (kv_attn_type). The original exhaustive 5 x 2 x 8 sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested+supported would pass silently. Rewritten with a second pure-function `expected()` mirror of the resolver's matrix; every one of the 80 grid points now CHECKs the resolver's return value against the expected concrete dtype. Added cross-contamination spot checks (requesting bf16 with f16+q8_0 supported but bf16 NOT supported must fall to f32, not silently to f16 or q8_0). Now 205 checks passed in test-supertonic-kv-attn-type.
2. test_cpu_fallback_returns_valid_buffer (input_scratchpad). Original only round-tripped x_in (one of two allocated tensors). Now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f), plus a cross-aliasing recheck (after writing temb_in, x_in must still read back its original 1.0f); a binding-collision bug where both tensors share memory would now fail this check.
3. resolve_kv_attn_type silent fallback on explicit operator request. Added optional `bool * out_was_downgraded` output parameter to the resolver, set to true IFF the operator explicitly requested f16/bf16/q8_0 AND the corresponding backend probe returned false AND we therefore returned f32. The auto path (-1) leaves the flag false (no operator surprise; auto-policy is doing its job). Engine ctor + supertonic-bench wired to emit a one-line `fprintf(stderr, "warning: requested --kv-attn-type %s but the resolved backend's flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)")` on a downgrade. Defaulted nullptr keeps the pure-logic unit tests stderr-clean. New test_downgrade_flag_signal pins the contract on every relevant path (auto + missing probe -> flag false; explicit + matching probe -> flag false; explicit + missing probe -> flag true; nullptr out-ptr safe).
4. test_uma_aware_tiebreak_equal_vram_discretes (vulkan_device_select). Added a dedicated UMA-bias-active test case: two discrete cards with EQUAL VRAM (32 GB each) alongside a UMA iGPU. Pins three sub-cases: interleaved UMA in the middle, adjacent discretes with no UMA, three-way all-discrete tie. Lower index wins in every case. The existing test 11's second CHECK already covered the interleaved-UMA case; this hoists the contract into its own named test so a future refactor reading the test names knows the tiebreak case is pinned.
5. cached_backend_capabilities UaF risk under test-only clear(). Added a long comment on the function documenting the four invariants: (a) production callers may hold the returned ref across subsequent calls for OTHER backends (unordered_map's insert-doesn't-invalidate-references guarantee); (b) production callers MUST NOT keep the ref alive across a clear() call (test code's responsibility); (c) multi-threaded callers must externally synchronise deref vs. clear (the cache's lock protects map structure, NOT element lifetime); (d) if a future refactor adds a production-reachable erase / clear path, this function must switch to return-by-value or std::shared_ptr<const T>.
Co-authored-by: Cursor <cursoragent@cursor.com>
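The resolver contract described across the K/V dtype commits (auto falls back to the legacy flag, probe-rejected explicit requests become f32 and set the downgrade flag, out-of-range throws) can be sketched as a pure function. The names below are illustrative and several details are assumptions: the integer-to-dtype mapping, the `kv_probe_mask_sketch` struct, and the choice to consult the F16 probe on the auto path are not confirmed by the commit text.

```cpp
#include <stdexcept>

// Assumed numbering: -1 = auto, 0..3 = concrete dtypes.
enum class kv_attn_dtype_sketch { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// Hypothetical bundle of the three backend capability probes.
struct kv_probe_mask_sketch {
    bool f16;
    bool bf16;
    bool q8_0;
};

kv_attn_dtype_sketch resolve_kv_attn_type_sketch(
        int requested, bool legacy_f16_attn, kv_probe_mask_sketch probes,
        bool * out_was_downgraded = nullptr) {
    if (out_was_downgraded != nullptr) *out_was_downgraded = false;
    if (requested < -1 || requested > 3) {
        throw std::invalid_argument("kv-attn-type out of range");
    }
    if (requested == -1) {
        // Auto: follow the legacy --f16-attn flag; never reports a downgrade.
        return (legacy_f16_attn && probes.f16) ? kv_attn_dtype_sketch::f16
                                               : kv_attn_dtype_sketch::f32;
    }
    const auto want = static_cast<kv_attn_dtype_sketch>(requested);
    bool supported = true;  // f32 is always available
    switch (want) {
        case kv_attn_dtype_sketch::f16:  supported = probes.f16;  break;
        case kv_attn_dtype_sketch::bf16: supported = probes.bf16; break;
        case kv_attn_dtype_sketch::q8_0: supported = probes.q8_0; break;
        default: break;
    }
    if (!supported) {
        // Explicit request rejected by the probe: fall to f32, never to
        // another reduced-precision dtype, and tell the caller.
        if (out_was_downgraded != nullptr) *out_was_downgraded = true;
        return kv_attn_dtype_sketch::f32;
    }
    return want;
}
```

Keeping the warning out of the resolver and returning a flag instead lets the caller decide whether to print, which is what keeps pure-logic tests free of stderr noise.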
ogad-tether
left a comment
All five findings from the previous review have been addressed in commits 16b9b90 and bf0ce3bb:

- kv_attn_type resolver test: Rewritten with a separate `expected()` mirror function that verifies the exact concrete dtype on all 80 grid points + cross-contamination spot checks. Solid.
- Input scratchpad tensor coverage: Now round-trips both `x_in` and `temb_in` with distinct payload patterns (1.0f vs 2.5f) plus a cross-aliasing recheck. Would catch binding-collision bugs.
- Silent fallback warning: `resolve_kv_attn_type` now takes an optional `bool * out_was_downgraded` out-param. Engine + bench emit a stderr warning on explicit-request downgrade. Auto path stays quiet. Clean API design with nullptr default.
- UMA-bias tiebreak: New `test_uma_aware_tiebreak_equal_vram_discretes` covers the equal-VRAM discrete case with three sub-cases (interleaved UMA, adjacent discretes, three-way all-discrete tie).
- Capability cache UaF docs: Thorough 4-point invariant comment on `cached_backend_capabilities` documenting the reference-lifetime contract and the conditions under which it would need to change.

The doc commit also adds a clear explanation for the round-5 gap and documents the UMA-bias driver-misreport failure modes.
25/25 tests, 256 individual checks. LGTM.
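For readers following the UMA-bias tiebreak finding, the selection rules described in this PR (argmax over free VRAM, lower index wins ties, discrete subset preferred when UMA flags are supplied) can be illustrated with a small pure-logic sketch. The function name, the exception types, and the choice to throw on an empty device list are assumptions made for illustration; they are not confirmed by the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical mirror of resolve_vulkan_device_index.
// requested >= 0 : explicit index, range-checked, UMA-agnostic.
// requested == -1: auto-pick. requested < -1: reserved, throws.
inline int resolve_vulkan_device_index_sketch(int requested,
                                              const std::vector<std::uint64_t> & free_vram,
                                              const std::vector<bool> & is_uma = {}) {
    if (requested < -1) throw std::invalid_argument("reserved negative device index");
    if (free_vram.empty()) throw std::runtime_error("no Vulkan devices");  // assumed behaviour
    if (requested >= 0) {
        if (static_cast<std::size_t>(requested) >= free_vram.size()) {
            throw std::out_of_range("device index out of range");
        }
        return requested;
    }
    if (!is_uma.empty() && is_uma.size() != free_vram.size()) {
        throw std::invalid_argument("is_uma length does not match device count");
    }
    bool any_discrete = false;
    for (bool uma : is_uma) {
        if (!uma) any_discrete = true;
    }
    // Skip UMA devices only when flags were supplied and a discrete device exists;
    // an empty flag list or an all-UMA machine falls back to plain argmax.
    const bool skip_uma = !is_uma.empty() && any_discrete;
    int best = -1;
    for (std::size_t i = 0; i < free_vram.size(); ++i) {
        if (skip_uma && is_uma[i]) continue;
        // Strict '>' keeps the lower index on equal VRAM.
        if (best < 0 || free_vram[i] > free_vram[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With two equal-VRAM discrete cards on either side of a larger UMA device, this sketch returns index 0, which matches the "lower index wins" contract the test names describe.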
184c641
into
tetherto:supertonic_optimizations
Summary
Brings the Supertonic TTS stage of
tts-cppto functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds twelve rounds of Vulkan-specific deltas (rounds 1–13, round 5 skipped) — each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability + correctness contract for future regressions.Scope vs. PR #16: this PR sits on top of the OpenCL branch (
QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All 13 commits below are Vulkan-specific deltas; the OpenCL audit work is not restated here. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.Net new surface (against the OpenCL branch):
native_leaky_relu,f16_kv_flash_attn,f16_mul_mat,q8_0_kv_flash_attn,bf16_kv_flash_attn,pinned_host_buffer)use_native_leaky_relu,kv_attn_type) — joins the round-1use_f16_attnEngineOptionsknobsvulkan_device,prewarm_text,f16_weights_deny_list,kv_attn_type,vulkan_env_overrides,bench_per_step)--vulkan-device,--prewarm,--f16-weights-deny,--kv-attn-type,--vulkan-prefer-host-memory,--vulkan-disable-coopmat2,--vulkan-disable-bfloat16,--vulkan-perf-logger,--vulkan-async-transfer,--vulkan-env,--bench-per-step,--no-bench-syncupload_skip_tracker,voice_host_cache,try_alloc_inputs_in_pinned_host_buffer,alloc_input_scratchpad_or_throw,apply_vulkan_env_overrides,run_speech_prompted_merged_cache, plus 5 GPU-bridge dispatch sitesctest -L unit)test-supertonic-vulkan-dispatch,-portable-opsupdated,-capability-cache,-warm-up-api,-vulkan-device-select,-f16-deny-list-api,-kv-attn-type,-kv-attn-type-api,-vulkan-env-overrides,-voice-host-cache,-upload-skip-tracker,-text-encoder-gpu-bridge,-pinned-host-buffer,-input-scratchpad; plus-f16-attn-parityextended for BF16 and-graph-to-graph-blitextended for front-block + style shapes; plus-rope-packed-qkrewritten for the production[L, HD]layout)ctest -L unitCombined perf snapshot — RTX 5090, long prompt (173 chars / ~15 s audio):
Investigation methodology (TDD throughout)
Every round followed the same workflow:
The CPU-only test strategy is deliberate: a fresh checkout's
ctestexercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer. Real-Vulkan numerics are validated through the F16 / BF16 K/V parity harness running against the CPUflash_attn_extreference, which lands the sameggml_cpy(K → typed) + ggml_flash_attn_extgraph the live Vulkan dispatch builds.TDD caught real bugs that would otherwise have shipped:
std::string()empty-as-success sentinel which collided with the empty-string-as-key edge case; the test pinned the contract and forced abool / out-paramAPI fix BEFORE any production wiring went in.dense_matmul_time_ggmlreturns ane=[HD, L]tensor. In fact the matmul producesne=[L, HD]— the bit-exact transpose of the helper's input contract. The original CPU unit test hand-built Q under the wrong shape, so the failure mode was invisible to CI; round 11 rewrote the test under the production shape (RED), then fixed the helper (GREEN), unblocking end-to-end synth on every backend.tracker.reset()API at every synth boundary.Commit-by-commit walkthrough
787d966b— Round 1: Vulkan bring-up (initial commit)Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used
model.use_f16_attn = !backend_is_cpubecause the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan theHSK % 8 == 0supports_opgate has to be respected, so the auto-policy needs a probe.supertonic_modelflags populated at GGUF load:backend_is_vk(informational; appended to the backend-description string) anduse_native_leaky_relu(resolved viaggml_backend_supports_op(LEAKY_RELU)against a synthetic node — the dispatch helper short-circuits to the fused builtin on backends that shipGGML_OP_LEAKY_RELUnatively, falls back to the conservativeRELU + SCALE + ADDdecomposition otherwise; no hard-coded backend table).supertonic_backend_supports_f16_kv_flash_attngates theuse_f16_attnauto-policy. Builds a synthetic Supertonic-shapedggml_flash_attn_ext(Q=F32, K/V=F16)node and asks the backend whether it would accept it — load-time, zero hot-path cost, graceful auto-disable on afalseanswer.EngineOptions::vulkan_deviceint +--vulkan-device NCLI flag plumbed through all three binaries. Replaces the historical hard-codedggml_backend_vk_init(0); range-checked againstggml_backend_vk_get_device_count()at load (out-of-range = hard error, no silent CPU fallback that would hide CLI typos / wrong-machine config).ggml_backend_vk_get_device_descriptionso multi-GPU / multi-ICD machines (NVIDIA + llvmpipe, AMD RADV + NVIDIA) unambiguously identify which adapter ran.test-supertonic-vulkan-dispatchcovering the new flags throughsupertonic_op_dispatch_scope+ a smoke test for the F16-K/V probe. 
Pre-existingtest-supertonic-portable-opsupdated to explicitly request the decomposed path on the GPU fixture model so its existing GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU.d5518ee8— Pre-existing missing-include fixtts-cpp/src/chatterbox_tts.cppusedstd::atomic<int>without#include <atomic>; pre-existed before this branch but blocked the Supertonic build under the cleanercmake -S tts-cpp -B build-ttsinvocation used for round 2+ verification. One-line fix in a single TU. Kept as a separate commit so it's trivially revertable / cherry-pickable to other branches.6ab085f6— Round 2: capability-cache + 3 probes + prewarmThe round-1 probes were already cheap, but
engine.cpp+bench.cpp+load_supertonic_ggufeach ran them independently — three probes × N capabilities = up to 9 redundantggml_backend_supports_opcalls per backend per process.cached_backend_capabilitiesmap keyed byggml_backend_t, guarded by a singlestd::mutex. Hot path is load-time only, so contention is negligible. Probe-call counter (capability_probe_call_counter) exposed for the regression test.supertonic_backend_supports_f16_mul_mat— gates theuse_f16_weightsauto-policy (Phase 2A made it!backend_is_cpuunconditionally; round 2 makes it probe-gated so a backend that ships F16 storage but rejects the hotmul_mat(F16, F32)shape doesn't crash at first synth call).supertonic_backend_supports_q8_0_kv_flash_attn— forward-compat probe; primes the cache for round 4's live dispatch.supertonic_backend_supports_native_leaky_relu— wraps round 1's inline probe so the auto-policy can use the cached path.Engine::warm_up(text)API +EngineOptions::prewarm_text+--prewarm TEXTCLI flag. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines for every Supertonic stage compile up-front; the operator-visible firstsynthesize()call hits steady-state latency instead of paying the ~hundreds-of-ms cold-start hit chatterbox PROGRESS.md measured on Adreno + RADV. No-op on CPU backends.test-supertonic-capability-cache(probe-counter regression — 1 cache miss + N hits) andtest-supertonic-warm-up-api(SFINAE compile-time gate on the new API).36dc758c— Round 3: multi-device auto-pick + 2 forward-compat probesThe round-1
--vulkan-device Nflag covered manual selection but every multi-GPU operator has to pin a specific index in their config; auto-pick across heterogeneous machines requires VRAM introspection.--vulkan-device -1auto-pick policy:resolve_vulkan_device_indexpure-logic helper picks the device with the most free VRAM viaggml_backend_vk_get_device_memory(). Tie-break = lower index (deterministic). Reserved negatives < -1 throw to surface CLI typos. The pure-logic split makes the behaviour matrix testable on CPU with synthetic(index, [vram_per_device])tuples — no real Vulkan device required for CI.supertonic_backend_supports_bf16_kv_flash_attn— symmetric to F16-K/V, picks BF16 instead. Mostly relevant on Vulkan with cooperative_matrix2 (NVIDIA Ampere+ / RDNA3+).supertonic_backend_supports_pinned_host_buffer—trueiff the backend is Vulkan ANDggml_backend_vk_host_buffer_type()returns non-null. Primes the cache for round 12's per-engine input-scratchpad refactor.test-supertonic-vulkan-device-select(8 functions, 23 checks — empty list, single device, auto-pick max VRAM, tie-breaking, explicit index passthrough, out-of-range, reserved negatives, zero-VRAM device).test-supertonic-capability-cacheextended with new-probe coverage.8087852b— Round 6: F16-weights operator deny-listThe Phase 2A F16-weights policy was all-or-nothing — operators couldn't keep one specific tensor at F32 if it caused drift on a particular adapter / driver combo without disabling F16 weights for the entire model.
should_materialise_f16_weight(source_name, deny_list)overload layered on top of the curated allow-list. Each entry is a substring; if ANY non-empty entry is found inside a tensor's source name, that tensor stays at its native storage type. Empty entries are skipped defensively (config-typo guard so a stray empty entry doesn't silently disable F16 for the whole model).EngineOptions::f16_weights_deny_list+--f16-weights-deny PAT1,PAT2,...CLI flag (comma-split parser shared betweensupertonic-cli/tts-cli/supertonic-bench). Default empty (zero behaviour change for every existing operator config).supertonic_model::f16_weights_excluded_countcounter surfaced in bench output (human + JSON) so operators can confirm their deny-list took effect. Silent on the default empty path.test-supertonic-f16-deny-list-api(SFINAE + runtime defaults + assignability + regression guards). Existingtest-supertonic-f16-weightsextended with 7 new test functions / 29 new checks (empty-list passthrough, matching-deny-excludes, non-matching-no-op, cannot-promote-cold, multiple-patterns ANY-match, empty-string defensive skip, empty-name safety).60eed5e9— Round 4: multi-dtype K/V flash-attention dispatchThe round-1
--f16-attnboolean only let operators pick between F32 and F16 K/V flash-attention. Round 4 generalises the dispatch into a four-valued enum + CLI flag so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth as F16, no F16 underflow on small attention scores) or Q8_0 K/V (Vulkan + half the K/V upload bandwidth) on adapters that advertise the corresponding capability. Live wiring that turns the round-2 / round-3 probe results into actual GPU work.tts_cpp::supertonic::detail::kv_attn_dtype { autoselect=-1, f32=0, f16=1, bf16=2, q8_0=3 }+ pure-logic resolverresolve_kv_attn_type(requested, legacy_use_f16_attn, supports_f16, supports_bf16, supports_q8_0). Same testable-policy split as round-3'sresolve_vulkan_device_index.EngineOptions::kv_attn_typeint field (-1= auto,0..3explicit) — same-1= auto convention asf16_attn/f16_weights/vulkan_device, so operator configs are consistent. Default falls back tof16_attn's value, so every existing operator config sees zero behaviour change.--kv-attn-type bf16once in their production config works on both NVIDIA Ampere+ (BF16 effective via Vulkan coopmat2) and Intel ARC (no coopmat2 → silent F32 fallback) without crashing. Out-of-range--kv-attn-type Nthrows loudly to surface CLI typos.build_text_attention_cache):if (cache.f16_kv_attn) { cast→F16 }replaced with a switch on the enum; cast target picked from{F16, BF16, Q8_0}percache.kv_attn_type. Cache invalidation key promoted fromboolto enum (rebuilds the graph when the enum flips, same correctness contract as the rest of the cache key tuple).--kv-attn-type {auto,f32,f16,bf16,q8_0}CLI on all three binaries. 
Bench surface adds(kv_attn_type=…)to the human-readable line and"kv_attn_type"+"kv_attn_type_requested"to the JSON output so log-grep / CI attribution works across machines.supertonic-cliarg-parse loop wrapped intry/catchso invalid values surface as a cleanerror: ...line + exit 2 (also fixes a pre-existing latent crash on--vulkan-device abc/--seed nonsense/ etc).test-supertonic-f16-attn-parityextended with 2 new BF16-vs-F32 parity checks (vector-estimator + style shapes; CPUmax_abs_err = 5.263e-3and3.596e-3, both within the same 5e-3 tolerance band as the existing F16 baseline). Written BEFORE any production change — the parity gate was in place before the cast logic was touched.test-supertonic-kv-attn-type(106 checks across the full {requested × legacy × probe-mask} matrix, out-of-range throws, exhaustive resolver-never-leaks-autoselectsweep) andtest-supertonic-kv-attn-type-api(18 checks — SFINAE compile-time gates, runtime defaults, RAII restoration, regression guards on every other documentedEngineOptionsdefault).3c59e523— Round 7: bench observability + voice cache + Vulkan env-var passthroughLowest impact-÷-risk round of those planned in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup.detail::voice_host_cache). Eliminates 2 sync points /synthesize()after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a fullEngine; reference-stability contract documented for the synthesis-pipeline call site.apply_vulkan_env_overrides(map)public helper +EngineOptions::vulkan_env_overridesfield +--vulkan-prefer-host-memory/--vulkan-disable-coopmat2/--vulkan-disable-bfloat16/--vulkan-perf-logger/--vulkan-async-transfer/--vulkan-env KEY=VALUECLI flags on all three binaries. ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched.set_env_if_unsetsemantics so an operator-set env var still WINS over the EngineOptions override.ggml_backend_synchronizeboundaries (--no-bench-syncopt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware.--bench-per-step). Times eachsupertonic_vector_step_ggmlcall individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape.Two new test executables (
test-supertonic-voice-host-cache,test-supertonic-vulkan-env-overrides). TDD caught the env-key validator's empty-string-as-success bug BEFORE wiring went in.5b166a79— Round 8: front-block attn0 GPU bridgeSingle largest remaining per-step sync hotspot identified in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function.Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site). Strict gating on
front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0— trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs withoutvector_rope_thetacontinue to take the host-rotate path.The blit primitive parity gate already shipped with PR #16 (
test-supertonic-graph-to-graph-blit); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exactmax_abs = 0.0).0fa1593c— Round 9: style flash-attn GPU bridgeExtends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win).
vector_res_style_qkv_resultextended withsq_gpu / sk_gpu / sv_gpuGPU handles, populated unconditionally byrun_res_style_qkv_cache(cheap — no GPU sync; justggml_graph_get_tensorlookups).run_res_style_qkv_cachehost-download gating: the 3tensor_to_time_channel(...)downloads ofsq/sk/svare now gated ontrace != nullptr. Production path skips them entirely.poststays unconditional — consumed by the next-stagerun_style_residual_cachewhich still expects a host vector (cross-stage GPU bridge forpostis deferred).!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu→ GPU bridge; otherwise legacy host bridge.Strict TDD: parity test (
test-supertonic-graph-to-graph-blit) extended with explicit style-shape coverage BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exactmax_abs = 0.0.38a67e45— Round 10: per-step text-input upload-skipAfter rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is
`text_emb` (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for `style_v_in` / `kctx_in`) into a reusable `upload_skip_tracker` helper and applies it to the front-block + 3 group caches.
CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local `std::vector<float>` in `Engine::Impl::synthesize()` (and bench loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next stack-local vector of the same size, so synth N+1 may have `text_emb.data() == synth_N.text_emb.data()` despite holding completely different data. A naive pointer-compare upload-skip would silently leak prior synth's text-encoder embedding into the next synth's GPU buffer.
Mitigation: caller MUST invoke `tracker.reset()` at every synth boundary (`current_step == 0`). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it.
Per-synth wins: 16 fewer host→GPU uploads + ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length).
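The pointer-compare pattern and the reset requirement can be shown with a minimal sketch. The struct below is hypothetical (the real `upload_skip_tracker` interface is not shown in this PR text); it assumes one tracker per destination tensor and a `needs_upload` query, both invented for illustration.

```cpp
#include <cstddef>

// Hypothetical single-slot tracker: remembers the last host pointer and size
// that were uploaded to one destination tensor.
struct upload_skip_tracker_sketch {
    const void * last_ptr   = nullptr;
    std::size_t  last_bytes = 0;

    // Returns true when the caller must upload; false when the same host
    // buffer was already uploaded since the last reset().
    bool needs_upload(const void * ptr, std::size_t bytes) {
        if (ptr != nullptr && ptr == last_ptr && bytes == last_bytes) {
            return false;
        }
        last_ptr   = ptr;
        last_bytes = bytes;
        return true;
    }

    // Must be called at every synth boundary: the allocator may hand the next
    // synth a buffer at the same address that holds different data.
    void reset() {
        last_ptr   = nullptr;
        last_bytes = 0;
    }
};
```

Within one synth the second and later denoise steps skip the upload. After `reset()` the first step uploads again even if the address is unchanged, which is the behaviour the cross-synth hazard test is described as pinning.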
test-supertonic-upload-skip-tracker(NEW, 7 functions, 41 checks) committed first, observed to fail compile, then implementation added.b54b7d43— Round 11: packed-QK RoPE + GPU-bridge layout fix (CRITICAL CORRECTNESS)Round 11 didn't add a new optimisation — it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying
vector_rope_theta. The first end-to-end synth attempt (CPU OR Vulkan) aborted atGGML_ASSERT(HD == n_heads * head_dim)insideapply_rope_to_packed_qk, and even past that assertion everyggml_backend_tensor_copy(q_src, q_tc_in)on the GPU-bridge fast paths would have hitGGML_ASSERT(ggml_are_same_layout(src, dst))because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache'sq_tc_in/k_tc_in/v_tc_intensors expect.Root cause:
apply_rope_to_packed_qk(PR #16 audit follow-up #5) was written under the assumption thatdense_matmul_time_ggmlreturns ane=[HD, L]channel-fastest-in-memory tensor. In fact the matmul (CPUcblas_sgemmand GPUconv1d_f32(K=1)) producesne=[L, HD]with channel-major-flat memory — the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong[HD, L]shape, so the failure mode was invisible to CI.The fix (strict TDD):
test_supertonic_rope_packed_qk.cpprewritten under the production matmul shapene=[L, HD](channel-major-flat memory). Reference built in scalarapply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pinsy->ne[0] = HD, y->ne[1] = Lso the downstreamq_tc_inblit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).apply_rope_to_packed_qk(supertonic_internal.h): add a head-of-pipelineggml_cont(ggml_transpose(q))to flip fromne=[L, HD]channel-major-flat tone=[HD, L]time-major-flat (which IS the layoutq_tc_inexpects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalarapply_rope's native layout ANDq_tc_in's blit target bit-for-bit.ggml_cont(ggml_transpose(...))at the matmul output inbuild_group_graph_cache,ve_front_block_proj_cache, andbuild_res_style_qkv_cacheso all four GPU-bridge attention sites get bit-for-bit matching layouts.tensor_to_time_channel(<post-rope-or-v>)totensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalarapply_rope/flash_attention_qkvhost references read, so the raw download is the correct call.Verification:
The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have — they just couldn't run before round 11 unblocked the production path.
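The layout mismatch behind round 11 comes down to index arithmetic. In ggml, `ne[0]` is the fastest-varying dimension, so `ne=[L, HD]` stores element (time `l`, channel `c`) at `c * L + l`, while `ne=[HD, L]` stores it at `l * HD + c`. The host-side sketch below performs the same reordering that `ggml_cont(ggml_transpose(q))` is described as doing in the graph; the function name is hypothetical and the code is an illustration, not the PR's helper.

```cpp
#include <cstddef>
#include <vector>

// Converts a channel-major-flat buffer (ne=[L, HD], index c * L + l) into a
// time-major-flat buffer (ne=[HD, L], index l * HD + c).
inline std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                                      std::size_t L,
                                                      std::size_t HD) {
    std::vector<float> dst(src.size());
    for (std::size_t l = 0; l < L; ++l) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[l * HD + c] = src[c * L + l];
        }
    }
    return dst;
}
```

A test that hand-builds its input directly in the destination layout never exercises this reordering, which is how the original unit test could pass while the production path aborted.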
bb99d3ce— Round 12: auto-pick UMA bias + text-encoder GPU bridge + pinned-host-buffer per-step inputsThree independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware.
#10: Auto-pick UMA bias
Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40× realtime regression for any operator following the help text "auto-pick adapter with most free VRAM".
Extended `resolve_vulkan_device_index` with an optional third arg `is_uma_per_device`. Empty UMA list → round-3 behaviour preserved verbatim. Non-empty + at least one discrete → argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic.
Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()`: sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.
`test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-index-ignores-UMA-bias, mismatched-length-throws).
#6: Text-encoder speech-prompted-attention GPU bridge
Master's Metal-port branch (PR #15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download → pack → re-upload between the QKV cache and the flash-attn cache.
Round 12 #6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`. Eliminates per call: 2 GPU→host downloads + 3 host→GPU uploads + 1 graph dispatch + all host pack work = 5 sync points × 2 layers = 10 sync points / synth at the text encoder alone.
CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host-side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU).
`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins the symbol via SFINAE + struct field contract + a free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit). 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests.
#5: Pinned-host-buffer per-step input scratchpad
Round 3 shipped the capability probe; the actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 #5 lands the helper `try_alloc_inputs_in_pinned_host_buffer`.
Returns nullptr on null `model.backend` / null `input_ctx` / non-Vulkan backend / API miss. Otherwise allocates the entire `input_ctx` tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction.
Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: `vector_group_graph_cache` (× 3 for g1/g2/g3) and `ve_front_block_graph_cache`. Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy).
CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains.
`test_supertonic_pinned_host_buffer.cpp` (NEW): 11 / 11 CPU-only checks pass.
Combined perf snapshot on RTX 5090
Long-prompt bench (173 chars, ~15s of audio):
Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
- `--vulkan-device -1`: picks RADV → 178 ms (7× realtime)
- `--vulkan-device -1`: picks RTX 5090 → 28 ms (537× realtime), 6.4× faster for users following help text

`b9f95358` - Round 13: code-quality consolidation + Q8_0 K/V finding
Strict-improvement-only follow-up to round 12: no code path is removed, no optimisation is rolled back, end-to-end perf on every backend stays at the round-12 level. Two deliverables, both no-regret:
1. New helper `alloc_input_scratchpad_or_throw`

Round 12 #5 inlined the "try pinned-host first, fall back to default backend buffer, throw on both-fail" idiom at 4 cache sites (front block + 3 group caches). Round 13 factors it into one helper. Same correctness contract: CPU / Metal / OpenCL fall back to default backend buffer; Vulkan tries pinned-host first. Defensive failure modes consolidated: null `model.backend`, null `input_ctx`, null `cache_name` all throw `std::runtime_error` with a message that includes the cache name, instead of segfaulting in an error-handler path. Single point of maintenance for the pattern; future cache builds that want pinned-host inputs use the helper directly.
`test_supertonic_input_scratchpad.cpp` (NEW, 9 / 9 checks) pins the contract via SFINAE on the symbol + CPU-fallback round-trip through `ggml_backend_tensor_set` / `get` + null-arg throws + empty-ctx error message includes the cache name. CPU-only, no GGUF fixture required.
Perf impact: zero. Same code path, same allocations, same data movement, just one fewer level of nesting at each call site.
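The "try pinned-host first, fall back to default, throw on both-fail" idiom can be sketched without any ggml dependency. Everything below is hypothetical: `buffer_handle` stands in for a backend buffer, the two allocators are passed as callables rather than derived from a model and context, and the error text is invented. It illustrates the control flow only, not the PR's helper.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

using buffer_handle = void *;                         // stand-in for a backend buffer
using alloc_fn      = std::function<buffer_handle()>; // returns nullptr on failure

// Hypothetical sketch of the fallback chain described for
// alloc_input_scratchpad_or_throw.
inline buffer_handle alloc_input_scratchpad_or_throw_sketch(const alloc_fn & try_pinned_host,
                                                            const alloc_fn & alloc_default,
                                                            const char * cache_name) {
    if (cache_name == nullptr) {
        throw std::runtime_error("input scratchpad: null cache_name");
    }
    // Preferred path: host-pinned memory (expected to succeed only on Vulkan).
    if (try_pinned_host) {
        if (buffer_handle buf = try_pinned_host()) return buf;
    }
    // Fallback path: the backend's default buffer type.
    if (alloc_default) {
        if (buffer_handle buf = alloc_default()) return buf;
    }
    // Both failed: name the cache so the error is attributable.
    throw std::runtime_error(std::string("input scratchpad: allocation failed for cache '") +
                             cache_name + "'");
}
```

Keeping the chain in one function means each cache build site reduces to a single call, and the failure message format is defined in one place.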
2. Q8_0 K/V no-win documented for RTX 5090

Round 4 shipped the `--kv-attn-type q8_0` CLI option and bench output advertises `q8_0_kv_attn=available`. Round 13 measures the trade-off on the test rig (RTX 5090, 1.79 TB/s memory bandwidth, long prompt 206 chars / 18 s audio):

[Table: `--kv-attn-type` `f16` (default) vs `q8_0`; measured values not recoverable from this copy]

The F32→Q8_0 cast overhead exceeds the saved K/V upload bandwidth on a high-bandwidth discrete GPU. Operator guidance: stick with the F16 default on RTX 5090 and similar high-bandwidth discretes. Q8_0 is shipped for adapters where the K/V upload bottlenecks the synth (older PCIe 3.0, lower-end discretes, iGPUs with slow BAR); cross-over point to be measured per-adapter by operators using `--bench-per-step` from round 7.

Backwards-compatibility contract
Every round preserves the existing operator-config baseline:

- `--f16-attn 0|1` semantics unchanged: round 4's `--kv-attn-type auto` (the default) falls back to `--f16-attn` via the resolver.
- `--vulkan-device 0` semantics unchanged: round 1 introduced the flag; round 3's `-1` is opt-in only; round 12's UMA-bias only activates on hybrid rigs and never overrides an explicit index.
- `--f16-weights 0|1` semantics unchanged: round 6's `--f16-weights-deny` is opt-in only and has no effect when `--f16-weights 0`.
- `--prewarm` defaults to empty (no-op).
- `--vulkan-env` / `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` etc. (round 7) all default off; an operator-set env var still wins over the EngineOptions override.
- `--bench-per-step` / `--no-bench-sync` (round 7) default off; legacy JSON shape preserved on the default path.
- `model.use_f16_attn` boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
- […] `!include_ggml_trace`, so the trace harness still captures pre-attention Q/K/V host vectors.
- […] `tracker.reset()` at every synth boundary; without the reset, the tracker behaves identically to a no-op (each call uploads).

Test plan
CPU-only — a fresh checkout's
ctest -L unitexercises every new contract without needing a Vulkan adapter.Expected: 25 / 25 tests, 0 failures, 0 regressions.
Vulkan build (same expectations):
test-supertonic-vulkan-dispatchsupertonic_op_dispatch_scope+ F16-K/V probe smoketest-supertonic-portable-ops(UPDATED)test-supertonic-capability-cachetest-supertonic-warm-up-apiEngine::warm_up+EngineOptions::prewarm_texttest-supertonic-vulkan-device-selectresolve_vulkan_device_indexbehaviour matrix (extended in r12 with UMA-bias coverage)test-supertonic-f16-weights(UPDATED)test-supertonic-f16-deny-list-apiEngineOptions::f16_weights_deny_listtest-supertonic-kv-attn-typeresolve_kv_attn_typebehaviour matrix (full {requested × legacy × probe-mask} sweep, 106 checks)test-supertonic-kv-attn-type-apitest-supertonic-f16-attn-parity(UPDATED)test-supertonic-voice-host-cachetest-supertonic-vulkan-env-overridestest-supertonic-graph-to-graph-blit(UPDATED)max_abs = 0.0test-supertonic-upload-skip-trackertest-supertonic-rope-packed-qk(REWRITTEN)[L, HD]matmul layout, bit-exact vs scalarapply_ropetest-supertonic-text-encoder-gpu-bridgerun_speech_prompted_merged_cacheSFINAE + struct contract + free-default trip-wiretest-supertonic-pinned-host-buffertry_alloc_inputs_in_pinned_host_buffernullptr safety + non-Vulkan fallbacktest-supertonic-input-scratchpadalloc_input_scratchpad_or_throwSFINAE + CPU-fallback round-trip + null-arg throwsSmoke testing the CLIs
End-to-end real-Vulkan validation
Verified on 4 backends after round 11 unblocked the production path:
Bench JSON includes
"kv_attn_type"(resolved) +"kv_attn_type_requested"(raw int) +"prewarm_ms"+ per-step timings (--bench-per-step) so a probe miss / cold-start cost / per-step regression is visible in the output and CI scripts can attribute drift / perf differences to the right cause.File-by-file change summary
tts-cpp/CMakeLists.txttts-cpp/PROGRESS_SUPERTONIC.mdtts-cpp/include/tts-cpp/supertonic/engine.hEngineOptionsfields:vulkan_device,prewarm_text,f16_weights_deny_list,kv_attn_type,vulkan_env_overrides,bench_per_step+Engine::warm_up()tts-cpp/src/chatterbox_cli.cpptts-clialiastts-cpp/src/chatterbox_tts.cpp#include <atomic>(pre-existing missing-include fix)tts-cpp/src/supertonic_bench.cpptts-cpp/src/supertonic_cli.cpptts-cpp/src/supertonic_engine.cppuse_f16_weightsauto-policy, multi-device auto-pick wiring (with UMA bias),warm_upimpl, round-4 K/V dispatch resolution, voice-cache integration, env-var passthroughtts-cpp/src/supertonic_gguf.cppresolve_vulkan_device_index(with UMA bias),resolve_kv_attn_type, multi-device auto-pick, dispatch-scope rounds 1–13 plumbing, deny-list integration, pinned-host-buffer helper,alloc_input_scratchpad_or_throwtts-cpp/src/supertonic_internal.hkv_attn_dtypeenum, model fields, probe forwarders, resolvers, dispatch-scope extension,voice_host_cache,upload_skip_tracker, GPU-bridge tensor handles, packed-QK RoPE layout fixtts-cpp/src/supertonic_text_encoder.cpprun_speech_prompted_merged_cache+ dispatch inspeech_prompted_attention_ggml(round-12 #6)tts-cpp/src/supertonic_vector_estimator.cpptts-cpp/test/test_supertonic_capability_cache.cpptts-cpp/test/test_supertonic_f16_attn_parity.cpptts-cpp/test/test_supertonic_f16_deny_list_api.cpptts-cpp/test/test_supertonic_f16_weights.cpptts-cpp/test/test_supertonic_graph_to_graph_blit.cpptts-cpp/test/test_supertonic_input_scratchpad.cpptts-cpp/test/test_supertonic_kv_attn_type.cpptts-cpp/test/test_supertonic_kv_attn_type_api.cpptts-cpp/test/test_supertonic_pinned_host_buffer.cpptts-cpp/test/test_supertonic_portable_ops.cppuse_native_leaky_relu = falseon the GPU fixturetts-cpp/test/test_supertonic_rope_packed_qk.cpp[L, HD]matmul 
layouttts-cpp/test/test_supertonic_text_encoder_gpu_bridge.cpptts-cpp/test/test_supertonic_upload_skip_tracker.cpptts-cpp/test/test_supertonic_voice_host_cache.cpptts-cpp/test/test_supertonic_vulkan_device_select.cpptts-cpp/test/test_supertonic_vulkan_dispatch.cpptts-cpp/test/test_supertonic_vulkan_env_overrides.cpptts-cpp/test/test_supertonic_warm_up_api.cppDeferred follow-ups (intentionally out of scope)
Tracked in
tts-cpp/PROGRESS_SUPERTONIC.md"Deferred work" section:VkPipelineCache: recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by<vendorID>-<deviceID>-<driverVersion>and rooted at$XDG_CACHE_HOME/ggml/vulkan. This is aggml-vulkaninternal patch (~199 lines) that benefits all Vulkan workloads, not just Supertonic; tracked separately so the supertonic-specific PR stays reviewable. Round-2's--prewarmis an in-process workaround; the persistent on-disk cache extends the win across process restarts.post(round 9 follow-up): thepostoutput ofrun_res_style_qkv_cacheis still downloaded to host and re-uploaded intorun_style_residual_cache. Would eliminate ~20 more sync points / synth. Deferred until measured impact justifies the dual-graph refactor.--bench-per-stepfrom round 7.Linked