tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage) by ogad-tether · Pull Request #15 · tetherto/qvac-ext-lib-whisper.cpp

ogad-tether · 2026-05-11T11:52:21Z

Summary

End-to-end Metal backend for the Supertonic TTS pipeline on Apple silicon. Starts as a correctness port (Phase B), evolves through Tier 1 graph consolidation, Tier 2 custom Metal kernels + load-time weight pretranspose, Phase A+B follow-up (multi-precision, on-GPU CFM loop), full Phase B2 (all ConvNeXt blocks on [C, T] activations), and finishes with B1 end-to-end f16 + a causal-pad mode in depthwise_1d_ct that lets the vocoder's 10-block chain run with a single entry/exit permute pair.

Cumulative result on Apple M2: Metal total 249.92 ms → 88.44 ms (-65%) with parity vs CPU q8_0 reference maintained throughout (corr ≥ 0.998 / L∞ ≤ 0.05).

Apple M2, q8_0 GGUF, 4 threads, 5-step CFM, 3.20 s of audio, 5 runs + 1 warmup, all four backends benched in sequence on the same machine state:

Stage (ms median)	ggml Metal	ggml CPU	ONNX CPU	ONNX CoreML
preprocess	0.01	0.01	0.05	0.05
duration	3.48	1.49	1.26	8.17
text_encoder	13.22	11.70	8.22	16.26
vector_estimator (5 step)	58.38	90.36	77.04	177.89
vocoder	13.62	39.38	49.55	50.29
total	88.44	142.92	136.32	255.90
RTF (lower is faster)	0.028	0.045	0.043	0.080
real-time multiplier	36.2×	22.4×	23.5×	12.5×

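ggml Metal is fastest overall and on the two stages that dominate runtime (vector_estimator and vocoder). duration and text_encoder are still faster on the CPU backends, but together they cost under 17 ms on Metal. vector_estimator is −24% vs ONNX CPU and −67% vs ONNX CoreML; vocoder is now −73% vs ONNX CoreML and −65% vs ggml CPU. JSONs in artifacts/bench/supertonic-{cpp-metal-b2full-causal,cpp-b2full,onnx-cpu-b2full,onnx-coreml-b2full}.json.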
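For readers checking the last two rows of the table, the arithmetic can be sketched as below. The function names are hypothetical and not part of the bench harness; the only inputs are the total synthesis time in milliseconds and the audio duration in seconds.

```cpp
#include <cassert>

// Real-time factor: synthesis time divided by audio duration (lower is faster).
// Hypothetical helper, not taken from supertonic-bench.
double real_time_factor(double synth_ms, double audio_s) {
    assert(audio_s > 0.0);
    return synth_ms / (audio_s * 1000.0);
}

// Real-time multiplier: seconds of audio produced per second of compute.
double real_time_multiplier(double synth_ms, double audio_s) {
    assert(synth_ms > 0.0);
    return (audio_s * 1000.0) / synth_ms;
}
```

With the Metal row (88.44 ms for 3.20 s of audio) this gives an RTF of about 0.028 and a multiplier of about 36.2, matching the table.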

Same ggml model file runs on all three ggml precisions with near-identical Metal perf (q8_0 GGUF is 4× smaller; M2 shapes are compute-bound so the bandwidth-saving precisions are a footprint win, not a perf win):

Precision	Metal total	Metal vec_est	Metal vocoder	Metal RTM
f32	88.44	58.38	13.62	36.2×
f16	92.07	58.46	17.25	34.8×
q8_0	91.93	58.72	18.11	34.9×

(f16/q8_0 numbers from the immediately prior commits — re-bench against the causal kernel would land them within ~1 ms of f32.)

What landed

Phase B (correctness port)

Backend resolution chain via model_prefers_cpu_kernels helper — gates the ggml_custom_4d CPU fast paths so Metal can take the stock-op graph fallback.
supertonic-bench gains --n-gpu-layers N so the same harness drives CPU and Metal runs.

Tier 1 (graph-shape reductions)

Per-step graph consolidation (49511b3a): one ggml_cgraph per CFM step instead of ~17 sub-graphs. Per-step node count 1886 → 1056.
repeat_like returns broadcast-compatible reshape (266e4466): drops 226 REPEAT ops/step.
Drop redundant ggml_cont in rope (be12a9f5): 8 fewer cpy dispatches per per-step graph.

Tier 2 (custom Metal kernels + load-time pretranspose)

Each new GGML_OP_SUPERTONIC_* op has a CPU forward (parity backstop) and a Metal kernel, gated individually by env vars.

kernel_supertonic_depthwise_1d (aa4f65c3) — fuses edge-clamp pad + im2col + mul_mat + add for K ∈ {3, 5}.
kernel_supertonic_layer_norm_channel (55adf87b) — fuses permute + cont + ggml_norm + mul + add + permute + cont.
kernel_supertonic_pw2_residual (7a5c0393) — fuses add(bias) + mul(gamma) + add(residual).
kernel_supertonic_bias_gelu (df20115d + 64efe99a) — fuses add(bias) + gelu_erf.
kernel_supertonic_edge_pad_1d (a647ecfa) — fifth fused kernel.
Load-time matmul weight pretranspose (e935ffb7, da9553e3) — materialize transposed copies of every :onnx::MatMul_* source weight on non-CPU backends.
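As an illustration of what such a fused op computes, here is a minimal CPU reference for the pw2_residual fusion. It is a sketch under two assumptions that the PR text does not spell out: the three steps compose as residual + gamma * (x + bias), and activations are stored [C, T] in ggml's sense (channel index contiguous). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for add(bias) + mul(gamma) + add(residual) in one pass.
// Assumed layout: element (c, t) lives at index c + t * C.
std::vector<float> pw2_residual_ref(const std::vector<float>& x,
                                    const std::vector<float>& residual,
                                    const std::vector<float>& bias,
                                    const std::vector<float>& gamma,
                                    std::size_t C, std::size_t T) {
    assert(x.size() == C * T && residual.size() == C * T);
    assert(bias.size() == C && gamma.size() == C);
    std::vector<float> y(C * T);
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t c = 0; c < C; ++c) {
            const std::size_t i = c + t * C;
            // One read of x and residual, one write of y: no intermediates.
            y[i] = residual[i] + gamma[c] * (x[i] + bias[c]);
        }
    }
    return y;
}
```

The stock-op version would materialise two intermediate tensors and dispatch three kernels; the fused form touches each element once.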

Phase A+B follow-up (multi-precision + single-graph CFM)

Phase 0 — multi-precision validation harness (bfb44092): --precision {f32,f16,q8_0} on CLI / bench, plumbed through EngineOptions.
A1+A2 — single command buffer per synth + on-GPU latent through CFM loop (8f0be955): all 5 CFM steps unroll into ONE ggml_cgraph. Latent flows step-to-step as a graph-internal node.
A3 step 1 — q8_0 storage on Metal (1b7496f6): --precision q8_0 loads instead of bailing.
A3 step 2 — kernel_mul_mm_q8_0_f32 dispatches (f95a09d9): the quantized matmul kernel finally fires end-to-end.

Phase B2 partial (Q/K/V projections → [A, T])

70bd2ca6 + follow-ups: swap ggml_mul_mat argument order at Q/K/V sites so the weight is src0. Output lands directly in [A, T] — removes one cont(transpose) per projection × 4 groups × 5 steps.

Phase B2 full (ConvNeXt on [C, T])

All five fused kernels parameterised on per-axis element strides (52430516, e2807f41) — same compiled Metal kernel handles both [T, C] and [C, T]. Layout flag in op_params. Overlay port-version 12 → 13.
Prologue + group_prep × 3 + tail ConvNeXt chains on [C, T] (da3400d3, aa167cfa): vector_convnext_ggml_ct + pointwise_matmul_ct (K=1 Conv1d becomes direct ggml_mul_mat, no im2col). Each chain is wrapped in a single entry/exit permute pair, which covers all 16 ConvNeXt blocks in vector_estimator's per-step graph.
Vocoder ConvNeXt chain on [C, T] (61e9b419) — same pattern for the 10-block vocoder chain, with two intra-block permutes around the (then-still-symmetric-only) depthwise. Vocoder dropped 52 → 17 ms.
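The K=1 shortcut mentioned above can be sketched as a plain CPU reference. This is not the PR's code; it assumes ggml's convention that the first axis is contiguous and that the matmul contracts over it, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// K=1 Conv1d expressed as a matrix multiply on [C, T] activations.
// Assumed layouts (first axis contiguous):
//   w   is [IC, OC]: w[ic + oc * IC]
//   x   is [IC, T] : x[ic + t * IC]
//   out is [OC, T] : out[oc + t * OC]
std::vector<float> pointwise_matmul_ref(const std::vector<float>& w,
                                        const std::vector<float>& x,
                                        std::size_t IC, std::size_t OC,
                                        std::size_t T) {
    assert(w.size() == IC * OC);
    assert(x.size() == IC * T);
    std::vector<float> out(OC * T, 0.0f);
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t oc = 0; oc < OC; ++oc) {
            float acc = 0.0f;
            // With a single tap there is no window to gather, so no im2col.
            for (std::size_t ic = 0; ic < IC; ++ic) {
                acc += w[ic + oc * IC] * x[ic + t * IC];
            }
            out[oc + t * OC] = acc;
        }
    }
    return out;
}
```

Because both operands already share the contraction axis as their contiguous axis, neither needs a transpose before the multiply.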

Phase B1 — end-to-end f16 (`66ddafab`)

Asymmetric load (same pattern as q8_0): only :onnx::MatMul_* weights stay f16 on Metal (dispatch kernel_mul_mm_f16_f32); other GGUF-f16 tensors expand to f32 so they don't trip ggml_metal_op_bin's f32-only assertion downstream. Pretranspose pass extended to cover f16 alongside f32/q8_0.
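The load rule described here can be summarised in a few lines. The enum, the function name, and the substring match on the tensor name are all assumptions made for illustration; the PR only states which tensors keep their stored precision.

```cpp
#include <string>

// Hypothetical stand-in for the ggml tensor types involved.
enum class TensorType { F32, F16, Q8_0 };

// Sketch of the asymmetric load rule on Metal: matmul weights keep their
// GGUF storage type (f16 or q8_0), everything else is expanded to f32 so
// that f32-only binary ops downstream never see a narrower type.
TensorType choose_metal_storage(const std::string& tensor_name,
                                TensorType gguf_type) {
    // Assumption: matmul weights are identified by this name fragment.
    const bool is_matmul_weight =
        tensor_name.find(":onnx::MatMul_") != std::string::npos;
    return is_matmul_weight ? gguf_type : TensorType::F32;
}
```

The tensor names used to exercise this sketch would be made up; only the `:onnx::MatMul_` fragment comes from the PR text.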

Causal-pad mode in `depthwise_1d_ct` (`312ea1ce`)

Extends the fused depthwise kernel with a causal flag (last tap at t, earlier taps strictly left; right-clamp collapses to a no-op) and K=7 support. New _causal_ct ctor. Vocoder block now runs depthwise + layer_norm + pw1 + bias_gelu + pw2 + scalar-gamma + residual end-to-end on [C, T] — no intra-block permutes. Single entry permute + single exit permute span the 10-block chain. Overlay port-version 13 → 15. Vocoder dropped 17.11 → 13.62 ms (−20%).
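A single-channel CPU reference makes the difference between the two padding modes concrete. This is a sketch, not the Metal kernel: the function name is hypothetical, and the tap offsets follow the k_offset rule quoted under "Notable mechanical details" (-(K-1) for causal, -K/2 for symmetric) with edge-clamp padding.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Depthwise 1-D convolution over one channel with edge-clamp padding.
// causal == false: taps are centred on t (offset -K/2).
// causal == true : last tap sits on t, earlier taps lie to the left.
std::vector<float> depthwise_1d_ref(const std::vector<float>& x,
                                    const std::vector<float>& w,
                                    float bias, bool causal) {
    assert(!x.empty() && !w.empty());
    const int T = static_cast<int>(x.size());
    const int K = static_cast<int>(w.size());
    const int k_offset = causal ? -(K - 1) : -(K / 2);
    std::vector<float> y(x.size(), bias);
    for (int t = 0; t < T; ++t) {
        for (int k = 0; k < K; ++k) {
            // Clamp to the valid range. In causal mode the source index
            // never exceeds t, so the upper clamp never triggers.
            const int src = std::min(std::max(t + k_offset + k, 0), T - 1);
            y[t] += w[k] * x[src];
        }
    }
    return y;
}
```

With x = {1, 2, 3, 4} and three unit taps, the symmetric mode reads one sample to the right of t while the causal mode reads only t and the two samples before it.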

Quirk found along the way

The legacy pw2_residual_ggml wrapper had a gamma->ne[0] == x->ne[1] gate that was silently rejecting the fused path for ConvNeXt the whole time — GGUF ships .gamma as [1, C, 1, 1] not [C]. vector_convnext_ggml_ct flattens the per-channel params with a ggml_reshape_1d, so the _ct path is the first time the fused pw2_residual op actually runs on the ConvNeXt residual.
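The shape mismatch is easy to reproduce outside ggml. The sketch below models only the ne[] arrays; the type alias and helper names are hypothetical, and the channel and frame counts used to exercise it are arbitrary.

```cpp
#include <array>
#include <cstdint>

// ggml-style shape: ne[0] is the contiguous axis.
using Shape = std::array<std::int64_t, 4>;

// The legacy gate: gamma must be a flat per-channel vector matching x's
// channel axis, which is ne[1] for [T, C] activations.
bool legacy_gate(const Shape& gamma_ne, const Shape& x_ne) {
    return gamma_ne[0] == x_ne[1];
}

// Shape after flattening to 1-D, as a 1-D reshape would produce.
Shape flatten_1d(const Shape& ne) {
    return Shape{{ne[0] * ne[1] * ne[2] * ne[3], 1, 1, 1}};
}
```

For a [1, C, 1, 1] gamma, ne[0] is 1, so the gate compares 1 against C and fails; after flattening, ne[0] is C and the gate passes.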

CPU q8_0 perf is unchanged — every fused-kernel, pretranspose, asymmetric-load, and _ct path is gated on !use_cpu_fastpath or roundtrips through the legacy [T, C] block on CPU so cblas/AMX still wins there.

What's deferred

Phase	Status	Why deferred	Realistic ROI
B3 — argument buffer reuse	deferred to upstream	ggml-metal backend internals (`MTLIndirectCommandBuffer`). Better as an upstream contribution.	-1 to -3 ms
text_encoder / duration on Metal	left as-is	13.2 / 3.5 ms already small; further work is dominated by command-buffer encode overhead. Probably out of practical reach.	0–2 ms

Test plan

Phase B smoke: --n-gpu-layers 1 writes a valid WAV; sample count identical to CPU.
CPU regression: --n-gpu-layers 0 bench unchanged from pre-port baseline.
Metal bench (above): 88.44 ms median (5 runs + 1 warmup) on M2.
Multi-backend bench (above): ggml-Metal, ggml-CPU, ONNX-CPU, ONNX-CoreML all benched on same machine state.
Multi-precision: f32, f16, q8_0 all load and synthesize end-to-end on Metal.
Parity vs CPU reference: corr ≥ 0.998 / L∞ ≤ 0.05 throughout the branch.
Env-var A/B: every fused kernel + pretranspose + loop-graph + CT-convnext + CT-vocoder + causal path has an override.
Multilingual smoke: M1/F1 + EN/FR/PT samples generated.
Reviewer to run on M1 / M3 / M4 to confirm the wins generalize.
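The parity criterion in the test plan can be expressed as a small check. This is a sketch that assumes "corr" means the Pearson correlation between the two waveforms and "L∞" the maximum absolute sample difference; the helper names are hypothetical and not taken from the test harness.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equal-length signals.
double pearson_corr(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());
    double mean_a = 0.0, mean_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { mean_a += a[i]; mean_b += b[i]; }
    mean_a /= n;
    mean_b /= n;
    double cov = 0.0, var_a = 0.0, var_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - mean_a, db = b[i] - mean_b;
        cov += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    const double denom = std::sqrt(var_a * var_b);
    return denom > 0.0 ? cov / denom : 0.0;  // constant signal: report 0
}

// Maximum absolute difference between two equal-length signals.
double linf_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double m = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        m = std::max(m, std::fabs(static_cast<double>(a[i]) - b[i]));
    }
    return m;
}

// Thresholds default to the ones quoted in this PR.
bool parity_ok(const std::vector<float>& test, const std::vector<float>& ref,
               double min_corr = 0.998, double max_linf = 0.05) {
    return pearson_corr(test, ref) >= min_corr && linf_diff(test, ref) <= max_linf;
}
```

Using both metrics together matters: correlation alone is blind to a gain error, and L∞ alone is blind to a small phase drift that stays within the amplitude bound.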

Notable mechanical details

ggml-supertonic-ops.patch lives in tts-cpp/cmake/vcpkg-overlay-ports/ggml/, chains on top of the QVAC ggml port — no upstream ggml changes required. Overlay port-version now at 15.
Every fused kernel got a stride-parameterised body (sxt, sxc, syt, syc) so the same compiled Metal kernel handles both [T, C] and [C, T] activations — no separate _ct kernel binaries, just a layout flag in op_params.
The depthwise kernel's causal flag uses k_offset = -(K-1) instead of -K/2 and skips the right-clamp branch. Symmetric (K∈{3,5}) and causal (K∈{3,5,7}) share the same Metal source.
pointwise_matmul_ct (vector_estimator) and pointwise_matmul_ct_voc (vocoder) are the K=1 Conv1d shortcut for [C, T] activations: weight [1, IC, OC] reshapes to [IC, OC], mul_mat directly with [IC, T], output [OC, T]. No im2col, no transpose.
The harmless GGML_ASSERT([rsets->data count] == 0) at process exit is the same atexit-ordering quirk chatterbox handles via t3_stack_registry — fires after the WAV is written.
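The stride-parameterised kernel body described in this list can be illustrated on the CPU. The sketch below is an assumption-laden analogue rather than the Metal source: it uses two strides instead of the four named above (sxt, sxc, syt, syc) because it works in place, and the enum and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Layout { TC, CT };  // ggml-style: first named axis is contiguous

struct Strides { std::size_t st, sc; };  // element strides along t and c

// Map the layout flag to per-axis element strides.
Strides strides_for(Layout layout, std::size_t T, std::size_t C) {
    return layout == Layout::TC ? Strides{1, T} : Strides{C, 1};
}

// One loop body serves both layouts: only the strides differ.
void add_channel_bias(std::vector<float>& x, const std::vector<float>& bias,
                      std::size_t T, std::size_t C, Layout layout) {
    assert(x.size() == T * C && bias.size() == C);
    const Strides s = strides_for(layout, T, C);
    for (std::size_t c = 0; c < C; ++c) {
        for (std::size_t t = 0; t < T; ++t) {
            x[t * s.st + c * s.sc] += bias[c];
        }
    }
}
```

The design point is that the layout decision is made once, when the strides are computed, so the inner loop carries no branch on the layout flag.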

Out of scope

CUDA / Vulkan / OpenCL paths — Metal is the priority for Apple-silicon hosts.

🤖 Generated with Claude Code

Documents the Tier 2 op-level reduction work landed in PR #15 so far:

- SUPERTONIC_DUMP_OP_HISTOGRAM env-var-gated dump of per-graph op-type histograms. Confirmed RESHAPE/VIEW/PERMUTE/TRANSPOSE are no-ops on Metal (ggml-metal-ops.cpp:186-195): 808 of 1660 nodes in the consolidated per-step graph are metadata-only, so only ~852 ops actually dispatch.
- repeat_like across all four supertonic source files (vector, vocoder, text_encoder, duration) returns the broadcast-compatible reshape directly instead of inserting an explicit ggml_repeat node, since ggml_add / ggml_mul broadcast natively. -226 REPEAT ops per step graph.
- apply_supertonic_rope_ggml drops the defensive ggml_cont: the view onto a contiguous [H*D, q_len] tensor is itself contiguous, so ggml_rope_ext accepts it directly.

Cumulative Tier 1 + Tier 2 wins land Apple M2 Metal at 199.90 ms total (RTF 0.062, 16× real-time), a -20 % drop from the Phase B baseline of 249.92 ms. Parity preserved (correlation 0.9999 vs CPU reference, max abs diff 249 LSB on peak amplitude 6639).

Remaining backlog covers the next op-level wins (K=1 conv1d_f32 → direct mul_mat, cache transposed weights, custom pad kernel) and the original Tier 2 custom Metal kernels (kernel_convnext_block, kernel_attention_block, kernel_depthwise_conv_1d, f16 activations) that need the QVAC ggml-speech port patch flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Squash-rebase of feat/metal-optimization-supertonic onto master post-#16 (OpenCL Supertonic merge).

Combines:

- Five custom fused Metal kernels (supertonic_depthwise_1d / layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with `_ct` and `_causal_ct` variants for [C, T] activation layout. Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our overlay-port redirects vcpkg to that branch.
- Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks) and vocoder (10 blocks) runs end-to-end on [C, T] activations. K=1 pointwise becomes direct ggml_mul_mat (no im2col). Single entry/exit permute spans each chain.
- Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*` stays f16 on Metal, expands to f32 elsewhere).
- Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent stays in GPU memory step-to-step.
- Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
- Tier 2 load-time matmul weight pretranspose.
- Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.

Coexists with master's OpenCL Supertonic work:

- `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d fast paths via thread-local; replaces our `use_cpu_fastpath` parameter plumbing.
- F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
- F7 vocoder convnext-block fusion (master) runs on the CPU path; Metal path runs our `_ct` chain.

Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:

    Metal     med 98.4 ms   vec_est 65.6   vocoder 13.1   RTM 32.6x
    CPU       (unchanged from master)
    ONNX CPU  (unchanged from master)

Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase), ~10 ms slip absorbed where master's front_cache refactor replaced parts of our trace_proj step-builder per the agent's resolution rule "prefer master's cache pattern when refactored." Causal kernel intact; vocoder at 13.1 ms vs master's CPU 39.4 ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ay-port

Replaces the local vcpkg overlay-port machinery with a simpler bundled-ggml dev flow that clones tetherto/qvac-ext-ggml@speech directly into `tts-cpp/ggml/` and lets CMake's `add_subdirectory(ggml)` consume it.

What's in / what's out:

+ tts-cpp/scripts/setup-ggml.sh: clones qvac-ext-ggml@speech at the pinned commit (currently 60a172e48f, the merge of #8) into tts-cpp/ggml/. Idempotent; re-run to bump the pin via the script's GGML_REF variable.
+ tts-cpp/CMakeLists.txt: bundled path (`TTS_CPP_USE_SYSTEM_GGML=OFF`) no longer requires a `patches/` directory. Speech branch is pre-patched at the commit level, so `add_subdirectory(ggml)` consumes the source directly.
- tts-cpp/cmake/vcpkg-overlay-ports/ggml/ (all 4 files)
- tts-cpp/vcpkg-configuration.json
- tts-cpp/vcpkg.json

Net diff: −250 lines of bridge plumbing, +50 lines of clone-and-build script. The vcpkg overlay was always a stopgap until the registry pin advanced past 60a172e (see qvac-registry-vcpkg#144); switching to the bundled flow side-steps that wait entirely for dev builds.

Performance bonus: bundled `add_subdirectory(ggml)` defaults to GGML_NATIVE=ON (native ARM dotprod / SVE / wider SIMD on M-series), where the vcpkg port had GGML_NATIVE=OFF for portable redistributables. On Apple M2, the dev flow benches ~9 ms faster total median and ~30 ms tighter variance, back within 3 ms of the pre-rebase 88 ms peak:

    vcpkg-overlay (rebased):  total med 100.48   range 96-125 ms   31.9x
    bundled-ggml (this):      total med  91.15   range 88-92 ms    35.2x
                                                                   ^ +3.3x

Downstream production builds still go through vcpkg via `TTS_CPP_USE_SYSTEM_GGML=ON` and find_package(ggml); those pull from the `ggml` port in qvac-registry-vcpkg (which qvac-registry-vcpkg#144 bumps to the same speech commit).

README §1 updated with the new dev flow as the canonical recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ogad-tether · 2026-05-13T11:29:54Z

Cross-ref: registry bump needed for production builds

This PR's dev flow works standalone — bash tts-cpp/scripts/setup-ggml.sh clones qvac-ext-ggml@speech directly into tts-cpp/ggml/ and add_subdirectory(ggml) consumes it. No vcpkg, no registry dependency.

But the production flow (downstream apps using find_package(ggml) via the system vcpkg) reads its ggml from the registry's ggml port — currently pinned at port-version 7 / commit 05afdc59 (pre-supertonic-ops). After this PR lands on master, anyone consuming tts-cpp through vcpkg will hit undeclared identifier 'ggml_supertonic_*' until the registry catches up.

Companion PR

qvac-registry-vcpkg#144 bumps the registry's ggml port to port-version 8 / commit 60a172e (post-supertonic-ops). Once merged, production builds work end-to-end.

Suggested merge order

qvac-registry-vcpkg#144 → main (registry has the new ggml)
This PR (tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage) #15) → master (production consumers can now build)

The reverse order works too, but it opens a window in which vcpkg-based tts-cpp consumers see compile errors until qvac-registry-vcpkg#144 lands.

🤖 Generated with Claude Code

… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on each: new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware.

== tetherto#10: Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third arg:

    int resolve_vulkan_device_index(int requested,
                                    const std::vector<size_t> & free_vram_per_device,
                                    const std::vector<bool> & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` and sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6: Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`:

    if (!model_prefers_cpu_kernels(m)) {
        thread_local speech_prompted_merged_cache merged_caches[2];
        // rebuild on key change, then:
        run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
        return;
    }
    // ... legacy two-cache CPU path unchanged

Eliminates per call:

- 2 GPU->host downloads (q_out, v_out)
- 3 host->GPU uploads (q_pack, k_pack, v_pack)
- 1 graph dispatch
- All host pack work (q_pack / k_pack / v_pack head-split)

= 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone.

CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host-side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:

- run_speech_prompted_merged_cache symbol via SFINAE
- speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE
- free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit)

6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`).

== tetherto#5: Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

    ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
        const supertonic_model & model, ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites:

- vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
- ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy).

Dual-context pattern:

1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
2. Create x_in / temb_in / etc. in input_ctx
3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check

Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata).

CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:

- Helper symbol existence (SFINAE)
- nullptr return on CPU backend (idempotent across repeats)
- Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):

    Round 11 baseline:     76.11 ms / 5 steps (123x realtime)
    Round 12 (all three):  27.99 ms / 5 steps (537x realtime)
                           ^ 2.7x faster
    Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster)
    Prewarm cold-start:    330 ms -> 21 ms (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):

    Round 11 baseline:  44.08 ms (74x realtime)
    Round 12:           23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):

    Round 11 `--vulkan-device -1`: picks RADV     -> 178 ms (7x realtime)
    Round 12 `--vulkan-device -1`: picks RTX 5090 ->  28 ms (537x realtime)
                                   ^ 6.4x faster for users following help text

== Test plan ==

CPU build:

    cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
    cmake --build tts-cpp/build -j
    ctest --test-dir tts-cpp/build -L unit
    -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:

    cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
    cmake --build tts-cpp/build-vulkan -j
    ctest --test-dir tts-cpp/build-vulkan -L unit
    -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe); every adapter writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
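The device-selection rules in the round-12 commit message can be sketched as follows. The signature is the one quoted in the commit message; the body is a reconstruction from the prose, not the shipped implementation. Two behaviours are assumptions of this sketch: an empty device list returns -1, and the length-mismatch check runs only on the auto-pick path.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

int resolve_vulkan_device_index(int requested,
                                const std::vector<std::size_t>& free_vram_per_device,
                                const std::vector<bool>& is_uma_per_device = {}) {
    // Explicit index: passed through untouched, UMA flags are ignored.
    if (requested >= 0) return requested;

    const std::size_t n = free_vram_per_device.size();
    if (!is_uma_per_device.empty() && is_uma_per_device.size() != n) {
        throw std::invalid_argument("UMA flag list length != device count");
    }
    if (n == 0) return -1;  // assumption: nothing to pick from

    // Restrict the search to discrete adapters when at least one exists.
    bool any_discrete = false;
    for (std::size_t i = 0; i < is_uma_per_device.size(); ++i) {
        if (!is_uma_per_device[i]) any_discrete = true;
    }

    // Argmax of free VRAM over the eligible devices; first device wins ties.
    std::size_t best = n;  // n means "none chosen yet"
    for (std::size_t i = 0; i < n; ++i) {
        if (any_discrete && is_uma_per_device[i]) continue;
        if (best == n || free_vram_per_device[i] > free_vram_per_device[best]) {
            best = i;
        }
    }
    return static_cast<int>(best);
}
```

On the hybrid rig from the commit message (an iGPU reporting 120+ GB next to a discrete card reporting 32 GB), the plain argmax picks the iGPU while the UMA-aware path picks the discrete card.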

…in init_gpu_backend

On Adreno + PR #14/#15 the policy correctly picks OpenCL and Chatterbox runs to completion. On Vulkan-on-Mali (Google Pixel 9 Pro XL / Tensor G4) ggml_backend_dev_init throws an unhandled C++ exception during pipeline init, which bubbles up to libc++abi::terminate() and SIGABRT crashes the host process before the caller can react.

Wrap the call in try-catch inside try_init: on any exception, log verbosely and 'continue' to the next candidate; if every candidate in a bucket throws or returns null, the lambda returns nullptr and the policy proceeds to the next bucket. After all buckets fail, init_gpu_backend returns nullptr and the caller falls back to CPU, which is exactly what 'no usable GPU available' should mean.

Defensive layer that handles any future bad-GPU vendor (not Mali specific): SIGABRT during GPU init is never an acceptable failure mode for a TTS engine that has a working CPU path.

Validated against Pixel 9 Pro XL on AWS Device Farm via the QVAC-19254 [DO NOT MERGE] test PR (tetherto/qvac#2320).

QVAC-19254
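The try-catch policy described in this commit message can be sketched generically. Everything named here is hypothetical: `Backend` stands in for the real backend handle, each candidate is modelled as a callable that may return null or throw, and the log line is illustrative only.

```cpp
#include <cstdio>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

struct Backend { int id; };               // stand-in for a real backend handle
using InitFn = std::function<Backend*()>; // one candidate device's init call

// Try each candidate in order. A throwing or null-returning candidate is
// skipped, so a bad driver can no longer terminate the process.
Backend* try_init_first_usable(const std::vector<InitFn>& candidates) {
    for (const InitFn& init : candidates) {
        try {
            if (Backend* backend = init()) return backend;
        } catch (const std::exception& e) {
            std::fprintf(stderr, "GPU init threw: %s; trying next\n", e.what());
            continue;
        } catch (...) {
            std::fprintf(stderr, "GPU init threw unknown exception; trying next\n");
            continue;
        }
    }
    return nullptr;  // caller falls back to CPU
}
```

Catching `...` in addition to `std::exception` matters here, because a driver layer is free to throw types that do not derive from the standard hierarchy.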

ogad-tether requested review from a team as code owners May 11, 2026 11:52

ishanvohra2 approved these changes May 11, 2026

github-advanced-security AI found potential problems May 11, 2026

Comment thread tts-cpp/src/supertonic_vector_estimator.cpp Fixed

Comment thread tts-cpp/src/supertonic_vector_estimator.cpp Fixed

ogad-tether changed the title ~~tts-cpp: Supertonic ggml Metal backend correctness port~~ tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 optimization (174 ms, 18.4× real-time) May 11, 2026

ogad-tether changed the title ~~tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 optimization (174 ms, 18.4× real-time)~~ tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 + Phase A+B follow-up (166 ms, 19.3× real-time) May 11, 2026

ogad-tether changed the title ~~tts-cpp: Supertonic ggml Metal backend — correctness + Tier 2 + Phase A+B follow-up (166 ms, 19.3× real-time)~~ tts-cpp: Supertonic ggml Metal — correctness + Tier 2 + Phase A+B (160 ms, 19.9× real-time, vec_est beats CPU) May 11, 2026

ogad-tether changed the title ~~tts-cpp: Supertonic ggml Metal — correctness + Tier 2 + Phase A+B (160 ms, 19.9× real-time, vec_est beats CPU)~~ tts-cpp: Supertonic ggml Metal — full B2 + custom kernels (128 ms, 25× real-time, beats ONNX-CPU & CoreML) May 11, 2026

ogad-tether changed the title ~~tts-cpp: Supertonic ggml Metal — full B2 + custom kernels (128 ms, 25× real-time, beats ONNX-CPU & CoreML)~~ tts-cpp: Supertonic ggml Metal — full B2 vector+vocoder (91 ms, 35× real-time, fastest on every stage) May 11, 2026

ogad-tether changed the title ~~tts-cpp: Supertonic ggml Metal — full B2 vector+vocoder (91 ms, 35× real-time, fastest on every stage)~~ tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage) May 12, 2026

This was referenced May 12, 2026

supertonic: fused Metal kernels + layout-flexible activations tetherto/qvac-ext-ggml#8

Merged

ggml: bump to qvac-ext-ggml#8 (Supertonic ops + Vulkan/Metal fixes) tetherto/qvac-registry-vcpkg#143

Closed

ogad-tether force-pushed the feat/metal-optimization-supertonic branch from e3f4f61 to 5403d10 on May 13, 2026 10:34

ogad-tether self-assigned this May 13, 2026

ogad-tether mentioned this pull request May 13, 2026

ggml: bump to qvac-ext-ggml#8 (speech HEAD 60a172e) — port-version 8 tetherto/qvac-registry-vcpkg#144

Closed

4 tasks

GustavoA1604 merged commit 6c60e4c into master May 13, 2026
66 of 74 checks passed

Zbig9000 mentioned this pull request May 14, 2026

Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic #18

Merged

ogad-tether mentioned this pull request May 15, 2026

tts-cpp: supertonic Engine streaming via multilingual chunker + callback #20

Merged

10 tasks

ogad-tether deleted the feat/metal-optimization-supertonic branch May 18, 2026 11:33

GustavoA1604 merged 2 commits into master from feat/metal-optimization-supertonic