Add optimized Supertonic GGML CPU path by ogad-tether · Pull Request #7 · GustavoA1604/chatterbox.cpp

ogad-tether · 2026-05-05T13:13:01Z

Summary

Add a Supertonic-specific ONNX/assets to GGUF converter, ONNX reference dumper, CPU runtime, CLI, benchmark tooling, and staged parity harnesses.
Route the production Supertonic 2 path through GGML-backed duration, text encoder, vector estimator, and vocoder stages.
Harden performance with cached GGML graphs, strided Q/K/V views, fused vector group/tail graph boundaries, portable custom CPU kernels, and controlled GGML/BLAS threading.
Support both upstream bundles:
- Supertone/supertonic for stable English, no language wrapping.
- Supertone/supertonic-2 for multilingual, using clean <lang>text</lang> open/close wrapping.

Audio quality finding

The English stutter was not caused by the GGUF/C++ port.
The bad path was the old supertonic-2 prefix-only wrapper, <en>text, which has no closing tag.
Stable English uses no wrapping with Supertone/supertonic and sounds clean.
supertonic-2 also sounds clean when using open/close tags: <en>text</en>.
Local listening validation passed for the generated English, French, and Portuguese sample sets.

Current GGML status

All four Supertonic stages are GGML-backed in the production path.
Text encoder FFN blocks and relative-position attention use cached GGML graphs; speech-prompted text attention uses ggml_flash_attn_ext.
Vector attention uses strided Q/K/V views and persistent graph/allocr caches for attention, ConvNeXt group, and tail islands.
Vector runtime includes fused ConvNeXt group/tail boundaries, gated production trace outputs, BLAS-backed pointwise Conv1D, custom depthwise Conv1D, direct row-wise layer norm, direct dense time matmul, and fused bias/GELU/residual elementwise ops.
Vocoder uses a persistent GGML graph cache plus BLAS/Accelerate-backed causal Conv1D custom ops for hot projection paths.
--threads controls GGML CPU threading; BLAS worker threads are capped by default to avoid nested oversubscription.

Validation

cmake --build build --target supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
- PASS, waveform max abs 3.431e-05, RMS 2.086e-06.
./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick
- PASS, BLAS/custom pointwise rows match reference at float tolerance.
Final benchmark matrix: artifacts/supertonic-thread-matrix/, runs=3, warmup=1, F1 voice, 5 steps, speed 1.05, ONNX Runtime CPUExecutionProvider only.
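The parity figures quoted in the validation results above (waveform max abs and RMS) can be reproduced with a comparison like the following. This is a minimal sketch, not the harness's actual code: `ParityStats` and `compare_waveforms` are hypothetical names, and the real test may accumulate or threshold the error differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise error summary between a generated waveform and a reference.
struct ParityStats {
    double max_abs;  // largest absolute sample difference
    double rms;      // root-mean-square of the sample differences
};

// Hypothetical helper: both buffers must have the same, non-zero length.
inline ParityStats compare_waveforms(const std::vector<float>& got,
                                     const std::vector<float>& ref) {
    assert(got.size() == ref.size() && !ref.empty());
    double max_abs = 0.0;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(got[i]) - static_cast<double>(ref[i]);
        max_abs = std::max(max_abs, std::fabs(d));
        sum_sq += d * d;
    }
    return {max_abs, std::sqrt(sum_sq / static_cast<double>(ref.size()))};
}
```

A test would then compare `max_abs` against a chosen tolerance; the tolerance itself is a policy decision that this sketch does not assume.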

Final benchmark findings

GGML now wins 10 of 12 matched thread/prompt comparisons. The only losses in the final run are quick English 4t and longer English 4t, both close.

Prompt	GGML 1t	GGML 2t	GGML 3t	GGML 4t	ONNX 1t	ONNX 2t	ONNX 3t	ONNX 4t
quick English	298.0 ms	189.4 ms	157.7 ms	157.7 ms	373.8 ms	218.5 ms	168.3 ms	148.8 ms
longer English	757.5 ms	491.2 ms	390.3 ms	361.2 ms	1103.0 ms	580.6 ms	555.7 ms	351.5 ms
Portuguese smoke	457.2 ms	292.9 ms	251.0 ms	234.3 ms	610.6 ms	344.6 ms	268.3 ms	250.8 ms
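The "10 of 12" count can be checked directly from the matrix above. The sketch below copies the table values into a small array and counts the cells where the GGML time is lower; `Row`, `benchmark_rows`, and `count_ggml_wins` are illustrative names, not part of the repository.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Total times in ms for 1..4 threads, copied from the matrix above.
struct Row {
    const char* prompt;
    std::array<double, 4> ggml_ms;
    std::array<double, 4> onnx_ms;
};

inline const std::array<Row, 3>& benchmark_rows() {
    static const std::array<Row, 3> rows = {{
        {"quick English",    {298.0, 189.4, 157.7, 157.7}, {373.8, 218.5, 168.3, 148.8}},
        {"longer English",   {757.5, 491.2, 390.3, 361.2}, {1103.0, 580.6, 555.7, 351.5}},
        {"Portuguese smoke", {457.2, 292.9, 251.0, 234.3}, {610.6, 344.6, 268.3, 250.8}},
    }};
    return rows;
}

// Count the thread/prompt cells where GGML is strictly faster than ONNX.
inline int count_ggml_wins() {
    int wins = 0;
    for (const Row& r : benchmark_rows()) {
        for (std::size_t t = 0; t < r.ggml_ms.size(); ++t) {
            if (r.ggml_ms[t] < r.onnx_ms[t]) ++wins;
        }
    }
    return wins;
}
```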

4-thread stage medians show the remaining gap is now narrow and stage-specific:

Prompt	Runtime	Duration	Text	Vector	Vocoder	Total
quick English	GGML	3.9 ms	13.5 ms	96.3 ms	43.6 ms	157.7 ms
quick English	ONNX	1.5 ms	11.5 ms	85.9 ms	49.8 ms	148.8 ms
longer English	GGML	11.9 ms	33.3 ms	201.2 ms	115.1 ms	361.2 ms
longer English	ONNX	2.4 ms	13.1 ms	198.3 ms	138.8 ms	351.5 ms
Portuguese smoke	GGML	6.5 ms	20.8 ms	137.6 ms	68.9 ms	234.3 ms
Portuguese smoke	ONNX	1.7 ms	11.6 ms	141.7 ms	95.6 ms	250.8 ms
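To see which stage drives a 4-thread gap, it helps to subtract the two rows for a prompt stage by stage. The sketch below does that for one GGML/ONNX pair; `StageMs` and `stage_delta` are illustrative names only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Per-stage medians in ms, in table order: duration, text, vector, vocoder.
using StageMs = std::array<double, 4>;

// GGML minus ONNX per stage; a positive entry means GGML is slower there.
inline StageMs stage_delta(const StageMs& ggml, const StageMs& onnx) {
    StageMs delta{};
    for (std::size_t i = 0; i < delta.size(); ++i) {
        delta[i] = ggml[i] - onnx[i];
    }
    return delta;
}
```

For the longer English rows this gives roughly +9.5, +20.2, +2.9, and -23.7 ms: duration and text cost more under GGML, while the vocoder wins back most of that time.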

Remaining performance notes

Single-thread GGML now beats ONNX across all final prompts.
GGML vocoder wins the final 4-thread stage comparison on all three prompts.
Vector is close but still the main swing stage at higher thread counts, with some 3/4-thread variance remaining.
Text/duration explain most of the remaining 4-thread English losses; text is especially visible on longer prompts.

Made with Cursor

Made-with: Cursor

…rity

The vector_estimator step-0 was failing parity (max_abs=2.31e-1) because apply_rope() recomputed the RoPE frequency table from the standard formula theta[d] = 10000^(-d/(D/2)). The actual ONNX-baked values are 10x larger (theta[d] = 10000^((8-d)/32)), so my analytic theta over-rotated keys/queries in every text cross-attention block. Read theta directly from the GGUF tensor main_blocks.3.attn.theta (it is shared across all four RoPE blocks).

After the fix all five stages pass:

- preprocess: 49 tokens, exact
- duration: abs=0
- text_encoder: max_abs=4.8e-6
- vector step0: max_abs=1.4e-6 (was 2.3e-1)
- vocoder: max_abs=1.1e-5

Add test-supertonic-pipeline that chains text_encoder -> 5-step denoise -> vocoder against wav_full.npy. End-to-end parity is max_abs=6.5e-5 in float, ~7.4e-5 after 16-bit PCM round-trip.

Add EngineOptions.noise_npy_path / supertonic-cli --noise-npy so users can reproduce the ONNX reference run bit-exactly without depending on NumPy's RNG sequence.

Made-with: Cursor
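The factor of 10 between the two theta tables in the commit message above follows from 10000^(8/32) = 10. The sketch below evaluates both formulas side by side. It assumes D/2 = 32, which is inferred from the exponents in the message rather than stated there, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Analytic table: theta[d] = 10000^(-d / half_dim).
inline std::vector<double> analytic_theta(int half_dim) {
    assert(half_dim > 0);
    std::vector<double> theta(half_dim);
    for (int d = 0; d < half_dim; ++d) {
        theta[d] = std::pow(10000.0, -static_cast<double>(d) / half_dim);
    }
    return theta;
}

// Table as described for the ONNX export: theta[d] = 10000^((8 - d) / 32).
inline std::vector<double> baked_theta(int count) {
    assert(count > 0);
    std::vector<double> theta(count);
    for (int d = 0; d < count; ++d) {
        theta[d] = std::pow(10000.0, (8.0 - d) / 32.0);
    }
    return theta;
}
```

Because the offset is a constant multiplier, no analytic formula with the standard base reproduces it without an extra scale, which is one reason reading the tensor from the GGUF is the more robust choice.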

Adds two matched benchmark harnesses that report per-stage wall time (preprocess / duration / text_encoder / N denoise steps / vocoder) plus end-to-end RTF, with min/median/mean/p95/max across N runs after a configurable warmup:

- build/supertonic-bench - times the C++ GGML CPU path
- scripts/bench-supertonic-onnx.py - times the ONNX Runtime path

Both accept --noise-npy so the runs are deterministic and produce matching audio for direct comparison.

Headline numbers on Apple M2 (8 cores), 4.11s of audio:

- ONNX (CPUExecutionProvider): 180 ms total, RTF 0.044, 22.8x realtime
- C++ GGML (single thread): 14451 ms total, RTF 3.52, 0.28x realtime

Output matches the ONNX reference within float tolerance rather than bit-for-bit (max_abs 6.5e-5 in float, 7.4e-5 after PCM round-trip). The 80x perf gap is entirely from the C++ port being single-threaded scalar today (no SIMD, no BLAS, no quant) - it is designed for correctness, not throughput.

See artifacts/supertonic-bench.md for the full breakdown and proposed follow-ups.

Made-with: Cursor
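The RTF and summary statistics named in the commit message above can be computed as sketched here. This is an illustration, not the bench tool's code: the names are hypothetical, and the nearest-rank p95 is an assumption, since the message does not say which percentile method the harness uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Real-time factor: processing time divided by audio duration (lower is faster).
inline double real_time_factor(double wall_ms, double audio_seconds) {
    assert(audio_seconds > 0.0);
    return (wall_ms / 1000.0) / audio_seconds;
}

struct RunStats {
    double min_ms, median_ms, mean_ms, p95_ms, max_ms;
};

// Summary over per-run wall times (warmup runs should already be excluded).
// p95 uses the nearest-rank method, which is an assumption of this sketch.
inline RunStats summarize(std::vector<double> ms) {
    assert(!ms.empty());
    std::sort(ms.begin(), ms.end());
    const std::size_t n = ms.size();
    const double median = (n % 2 == 1) ? ms[n / 2] : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);
    double sum = 0.0;
    for (double v : ms) sum += v;
    std::size_t rank = static_cast<std::size_t>(std::ceil(0.95 * static_cast<double>(n)));
    rank = std::min(std::max<std::size_t>(rank, 1), n);  // 1-based rank, clamped
    return {ms.front(), median, sum / static_cast<double>(n), ms[rank - 1], ms.back()};
}
```

With the headline numbers, 180 ms over 4.11 s of audio gives an RTF of about 0.044 and 14451 ms gives about 3.52, matching the figures quoted in the message.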

Support the English-only Supertone/supertonic bundle alongside the multilingual supertonic-2 bundle by storing model-family metadata, the default voice, and whether text should be wrapped in language tags. English now uses the stable no-wrap path, while the existing multilingual fixtures continue to use <lang> wrapping. This matches the latest QVAC behavior that avoids English stuttering on the quick-brown-fox prompt, while preserving parity for the multilingual supertonic-2 flow. Co-authored-by: Cursor <cursoragent@cursor.com>

Latest QVAC Supertonic behavior shows that supertonic-2 English stutters with the old prefix-only wrapper (<en>text ) but is clean with open/close tags (<en>text</en>). Add explicit language_wrap_mode metadata with none, prefix, and open_close modes so stable English supertonic keeps no wrapping while supertonic-2 defaults to the clean open/close path. Regenerated local supertonic-2 references with open_close wrapping and validated preprocessing, duration, text encoder, vector step, vocoder, and full pipeline parity. Co-authored-by: Cursor <cursoragent@cursor.com>
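The three language_wrap_mode values described above (none, prefix, open_close) map to a small text transform. The sketch below is illustrative only: the enum and function names are hypothetical, and any whitespace the real preprocessor adds around the tags is not modeled.

```cpp
#include <cassert>
#include <string>

// Mirrors the three metadata modes named in the commit message.
enum class LanguageWrapMode { None, Prefix, OpenClose };

// Hypothetical helper: wrap text in language tags according to the mode.
inline std::string wrap_text(const std::string& text, const std::string& lang,
                             LanguageWrapMode mode) {
    assert(mode == LanguageWrapMode::None || !lang.empty());
    switch (mode) {
        case LanguageWrapMode::None:
            return text;                                          // stable English bundle
        case LanguageWrapMode::Prefix:
            return "<" + lang + ">" + text;                       // old path, stutters
        case LanguageWrapMode::OpenClose:
            return "<" + lang + ">" + text + "</" + lang + ">";   // clean supertonic-2 path
    }
    return text;  // not reached; keeps compilers quiet about missing returns
}
```

Storing the mode in GGUF metadata rather than inferring it from the model name means a converted bundle carries its own preprocessing contract.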

Align the Supertonic runtime with the existing Chatterbox GGML/GGUF patterns: model metadata now carries defaults and ftype, backend initialization supports the same optional GGML backends, and the converter can emit f32/f16/q8_0 GGUFs for future graph backends. Port the vocoder to a real GGML graph path and validate it against the scalar reference. Add trace harnesses for vocoder and vector-estimator graph boundaries so remaining stages can be ported incrementally without losing parity. The vocoder graph now matches the ONNX reference at max_abs ~1.6e-6, and the vector trace is green through projection, mask, first ConvNeXt group, time add, and the following ConvNeXt block. Add a portable Supertonic CPU build workflow for Linux, macOS, and Windows using the existing CMake/GGML switches. Co-authored-by: Cursor <cursoragent@cursor.com>

Extend the GGML vector-estimator trace beyond projection, mask, ConvNeXt, and time-add into the first text-attention block. The trace now validates Q/K/V projections, CPU-applied RoPE tensors, flash-attention context, and the attention output projection against the scalar reference. This establishes a clean parity boundary before the next issue: the residual add after attention currently needs separate layout debugging even though both operands compare correctly on their own. Co-authored-by: Cursor <cursoragent@cursor.com>

Remove the unused residual/norm graph nodes from the vector trace harness. The current green boundary is attention output projection; residual add remains the next focused layout issue. Co-authored-by: Cursor <cursoragent@cursor.com>

Extend the vector-estimator GGML trace beyond text-attention output projection by isolating the residual add, post-attention norm, and following ConvNeXt block in a small continuation graph. This avoids buffer/lifetime ambiguity from the multi-pass attention trace and keeps each boundary parity-checkable. The new trace checkpoints pass through attention residual, norm, and main_blocks.4.convnext.0 with max_abs around 1e-6. Co-authored-by: Cursor <cursoragent@cursor.com>

Route the Supertonic engine through the GGML vector estimator and keep the scalar implementation as a parity baseline. The GGML vector step now covers the full estimator: all repeated groups, text/style attention, final ConvNeXt stack, proj_out, and Euler update. Keep the detailed vector trace harness for diagnostics while making the production vector path skip scalar/intermediate trace emission and retain only the final next-latent output. Also switch Supertonic latent seeding to a NumPy-compatible RandomState sequence so default --seed output matches the clean ONNX reference noise. Co-authored-by: Cursor <cursoragent@cursor.com>

Route the remaining Supertonic stages through GGML-backed execution while keeping parity trace harnesses available for debugging. Co-authored-by: Cursor <cursoragent@cursor.com>

Wire thread-aware graph execution and trim trace overhead so benchmarks exercise the production GGML path more accurately. Co-authored-by: Cursor <cursoragent@cursor.com>

Return duration and vector production outputs directly so trace tensors remain a debug-only transport. Co-authored-by: Cursor <cursoragent@cursor.com>

Use GGML flash attention for speech-prompted text attention and extend the text trace to cover final encoder output. Co-authored-by: Cursor <cursoragent@cursor.com>

Emit structured stage and RTF metrics so GGML and ONNX benchmark runs can be compared consistently. Co-authored-by: Cursor <cursoragent@cursor.com>

Support Supertonic 2 language wrapping modes and JSON metrics for matched GGML comparisons. Co-authored-by: Cursor <cursoragent@cursor.com>

Expand portable build coverage and document benchmark commands, thread policy, and remaining relative-attention work. Co-authored-by: Cursor <cursoragent@cursor.com>

Express learned relative key and value attention terms with stock GGML ops so the text encoder no longer falls back to scalar attention. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse text and style host layout conversions across denoising steps while preserving vector trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse GGML graph allocations and static band masks for text relative-position attention by layer and sequence length. Co-authored-by: Cursor <cursoragent@cursor.com>

Add opt-in vector island profiling and avoid recomputing the front graph for the first attention pass. Co-authored-by: Cursor <cursoragent@cursor.com>

Avoid recomputing QKV graphs during text-attention flash passes and keep opt-in vector island profiling. Co-authored-by: Cursor <cursoragent@cursor.com>

Avoid rerunning style QKV projections during flash attention passes while preserving residual continuations for parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Use shared runtime buffers for packed QKV attention layouts across vector islands. Co-authored-by: Cursor <cursoragent@cursor.com>

Document the fully GGML-backed text path, vector profiling flag, and matched GGML versus ONNX benchmark baseline. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse split text-attention graph/allocation state across vector steps while preserving trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Let duration inference return its projection directly without allocating trace vectors in the hot path. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse generic attention-only graph state for text and style vector islands while preserving trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse speech-prompted attention graph/allocation state across text encoder calls while preserving parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Harden the Supertonic production GGML path with cached graphs, portable custom CPU kernels, and benchmark documentation so the branch reflects the current ONNX-comparable performance work. Co-authored-by: Cursor <cursoragent@cursor.com>

Let the Supertonic converter download official Hugging Face bundles when local ONNX assets are not provided, and surface clear setup guidance when the local GGUF is missing. Co-authored-by: Cursor <cursoragent@cursor.com>

Capture the Supertonic port history, parity findings, optimization wins and failures, final benchmark matrix, and remaining production work in a dedicated progress journal. Co-authored-by: Cursor <cursoragent@cursor.com>

Autodetect Supertonic models from GGUF metadata and dispatch them through the Supertonic engine while preserving the existing Chatterbox routing. Co-authored-by: Cursor <cursoragent@cursor.com>

GustavoA1604

Also could you help address these?

Could make it work for windows with this. But could you pull the latest from main because maybe it will already fix it? 005779a
Stale tensor pointers in thread_local graph caches
chatterbox.cpp/src/supertonic_text_encoder.cpp:508–512, 588–597, 797–809, and the parallel structures in supertonic_vector_estimator.cpp (vector_text_attention_cache, vector_group_graph_cache, vector_res_style_qkv_cache, vector_tail_graph_cache) and supertonic_vocoder.cpp:339–356 all keep thread_local caches that are invalidated only by the check cache.model != &model. The graphs they hold contain require_source_tensor(model, …) pointers that live in the model's ctx_w.
synthesize in chatterbox.cpp/src/supertonic_engine.cpp:117 stack-allocates supertonic_model model; and calls free_supertonic_model(model) before returning. If a host calls synthesize() again from the same thread, the new stack-frame model is very likely to land at the same address, the cache key check passes, and the cached graph runs against ctx_w tensors that have been freed and replaced. Visible only when integrators call synthesize() more than once per thread (i.e. any server use); the bench tool itself loads once and is unaffected.

Suggested fix (any one of):

Key caches by model.ctx_w plus a monotonically increasing model.generation_id rather than &model.
Expose a supertonic_invalidate_thread_caches() and call it from free_supertonic_model.
Keep the caches as members of supertonic_model so they get destroyed with the model.
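The first suggested fix can be sketched as follows. `Model` and `GraphCache` are stand-ins for the real structs, and `mark_loaded` is a hypothetical helper. The point is only that an address comparison alone cannot tell two models apart when the second reuses the first one's stack slot, whereas a monotonically increasing generation can.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for the real model struct; only the cache-relevant field is shown.
struct Model {
    std::uint64_t generation_id = 0;
};

// Assign a fresh, never-reused generation on each successful load.
inline void mark_loaded(Model& m) {
    static std::atomic<std::uint64_t> next{1};
    m.generation_id = next.fetch_add(1, std::memory_order_relaxed);
}

// Stand-in for one thread_local graph cache entry.
struct GraphCache {
    const Model* model = nullptr;
    std::uint64_t generation_id = 0;
    bool built = false;

    // Valid only if both the address and the generation match.
    bool matches(const Model& m) const {
        return built && model == &m && generation_id == m.generation_id;
    }

    // Real code would free the old graph and build a new one here.
    void rebuild_for(const Model& m) {
        model = &m;
        generation_id = m.generation_id;
        built = true;
    }
};
```

A caller checks `matches()` before reusing a cached graph and calls `rebuild_for()` on a miss, so reloading a model at the same address forces a rebuild.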

3. ggml_context leak per call
chatterbox.cpp/src/supertonic_text_encoder.cpp:901, :1021; supertonic_duration.cpp:521; many sites in supertonic_vector_estimator.cpp (g1_style_res_buf/g2_style_res_buf/g3_style_res_buf blocks, srgf block ~2189, etc.) all do:

ggml_init_params gp = { buf_size, buf.data(), true };
ggml_context * ctx = ggml_init(gp);
ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
…
ggml_gallocr_free(allocr);
return …;
with no ggml_free(ctx). With current ggml (chatterbox.cpp/ggml/src/ggml.c:1571) the context struct itself is GGML_MALLOC'd on every ggml_init call. Because mem_buffer_owned=false, the caller-provided buffer is not ggml's to free, but the context struct is, and without ggml_free(ctx) it is never released. Each step therefore leaks a context struct. For per-CLI-invocation use this is negligible (a few KB), but the bench harness loops total_runs × per-run stage compositions and accumulates many leaks. This is the easy fix: add ggml_free(ctx) at the end of every site that calls ggml_init with the local-buffer pattern.
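One way to make the cleanup hard to forget, including on early-return failure paths, is to tie the context to a scope guard. The sketch below uses a stand-in context type so it compiles without ggml headers. With ggml itself the same pattern would wrap the pointer returned by ggml_init and call ggml_free in the deleter. All `Fake*` names are illustrative.

```cpp
#include <cassert>
#include <memory>

// Stand-in for ggml_context so the sketch compiles without ggml headers.
struct FakeContext {
    int* live_count;  // tracks how many contexts are currently alive
};

inline FakeContext* fake_init(int* live_count) {
    ++*live_count;
    return new FakeContext{live_count};
}

inline void fake_free(FakeContext* ctx) {
    if (ctx == nullptr) return;
    --*ctx->live_count;
    delete ctx;
}

// Deleter that releases the context when the owning pointer leaves scope.
struct FakeContextDeleter {
    void operator()(FakeContext* ctx) const { fake_free(ctx); }
};
using ContextPtr = std::unique_ptr<FakeContext, FakeContextDeleter>;

// Every return path, including the early failure, releases the context.
inline bool run_stage(int* live_count, bool fail_early) {
    assert(live_count != nullptr);
    ContextPtr ctx(fake_init(live_count));
    if (fail_early) return false;  // ctx is still released here
    // ... build and compute the graph using ctx.get() ...
    return true;
}
```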

chatterbox.cpp/CMakeLists.txt:202–298 rebuilds every supertonic_*.cpp once per test executable, even though tts-cpp already statically links them. Long compile times. Consider linking the tests against tts-cpp (or a small shared tts-cpp::supertonic_test object library) instead.
chatterbox.cpp/src/supertonic_engine.cpp:122–126 blocks non-f32 GGUFs even though the converter accepts --ftype f16/q8_0. The error message correctly says "use f16/q8_0 only with the GGML graph backend once enabled", but the production path is now the GGML graph backend.

Please help add support for other quantization formats as well.

chatterbox.cpp/scripts/dump-supertonic-reference.py:42–46 defaults --steps 5 while the converter at convert-supertonic2-to-gguf.py:308 writes default_steps = 10 for supertonic2. The bench/CLI examples in the README all pass --steps 5 explicitly, which is fine, but a user who only follows bash scripts/setup-supertonic2.sh and then supertonic-cli … --steps 0 (i.e. defaults from GGUF) will run the 10-step path and not match the 5-step reference dumps. Document this, or align the defaults.
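The mismatch comes from how a zero step count falls back to the GGUF default. A minimal sketch of that resolution rule, with a hypothetical function name:

```cpp
#include <cassert>

// --steps 0 means "use the default stored in the GGUF"; any positive value wins.
inline int resolve_steps(int cli_steps, int gguf_default_steps) {
    assert(gguf_default_steps > 0);
    return cli_steps > 0 ? cli_steps : gguf_default_steps;
}
```

With a GGUF default of 10, `--steps 0` resolves to 10 while the reference dumps were produced with 5, so the two outputs would not be comparable.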

Merge latest main, remove the extra workflow, harden Supertonic caches, clean up local GGML contexts, reduce Supertonic test rebuilds, and allow f16/q8_0 GGUF storage via load-time F32 expansion. Co-authored-by: Cursor <cursoragent@cursor.com>

…etry/scratch)

Five targeted fixes surfaced by review of the multilingual_merged tip after the origin/main merge. Three are real bugs (CFG, top_k, engine crash on MTL GGUFs); one is a perf regression with audible behaviour on MTL (spurious T3 retries); one is a defensive cleanup.

1. src/chatterbox_tts.cpp (CFM step loop): the use_b2 branch correctly computes (1+cfg)*cond - cfg*uncond, but the else branch only computed the conditional pass and silently dropped CFG on every non-Metal backend. Restores the §3.19 (3f0a8da) behaviour: when !meanflow && cfg_rate != 0 and use_b2 is false (CPU and any GPU backend where the b2 path was disabled), run cond + uncond back-to-back on the same B=1 graph (cfm_estimator_cache key (T, b2=false) reuses the cached graph across both calls) and combine via the standard CFG mix. Smoke-tested on CPU (--n-gpu-layers 0): runs cleanly, S3Gen wall-clock doubles vs meanflow as expected (12 CFM steps × 2 forward calls).

2. src/t3_mtl.cpp::sample_next_token_mtl top-k filter: after nth_element(begin, begin+k, end, greater) the (k+1)-th largest sits at idx[k] and positions [0, k) hold the top-k UNORDERED. The previous code took cut = l[idx[k-1]] which is some arbitrary top-k element (often not the smallest), making cut too large and the `x < cut` filter then erased legitimate top-k logits. Fix: partition to begin+(k-1) so idx[k-1] is the k-th largest exactly. Mostly masked by the default top_k=1000 vs an 8194-vocab where the threshold falls into the noise floor; the bug bites at small top_k (e.g. greedy --top-k 1 where the wrong cut could pessimise tie handling). The Turbo sample_next_token_ex in src/main.cpp uses a different (correct) approach via tmp[k] + per-element rescan for ties; left untouched.

3. src/chatterbox_engine.cpp: load_model_gguf dispatches MTL GGUFs into load_model_gguf_mtl (populates layers_mtl, leaves layers empty), but synthesize() unconditionally calls eval_prompt -> build_prompt_graph -> build_transformer_core, which iterates model.layers[il] -- empty std::vector, UB or crash. Add a clean rejection guard right after the load: if model.hparams.variant != CHBX_VARIANT_TURBO, free_model() and throw a clear error pointing the user at the CLI / internal eval_*_mtl helpers. Wiring MTL through the public Engine API (extend EngineOptions with language / cfg_weight / min_p / exaggeration, branch synthesize() on variant) is left as a follow-up; this just stops the crash on the public surface.

4. src/chatterbox_cli.cpp::run_t3_for_segment retry trigger: the 0cad44d merge commit said the 5x speech-tokens-per-BPE-token floor (calibrated for English Turbo / GPT-2 BPE) should be gated to non-MTL because MTL's Llama tokenizer has a ~1.7x ratio. The gating wasn't actually in the code -- a clean stop-token termination on a short MTL segment looked "implausible" and triggered up to 3 spurious retries. `plausible = is_mtl || (int)generated.size() >= min_tokens;` restores the intent. The 3x-repeated-token early-stop above still guards MTL's catastrophic case. Measured on M4 Metal with the ES reference prompt + jfk/gianni voice: T3 wall time drops from ~3.9 s (4 attempts) to ~0.93 s (1 attempt) -- ~4x speedup just from removing the wasted retries. WAV md5 stays byte-exact at 57cc80f27a122f03435fd05f47d1b3d2.

5. src/t3_mtl.cpp stacked-QKV loader scratch sizing: the early type-equality guard implies wq/wk/wv have identical sizes today, but max over all three so a future shape divergence (e.g. an MTL variant with non-square Q/K/V) can't silently truncate a per-layer copy via undersized scratch. No behaviour change today; defensive only.

Validation (Apple M4, Metal, Release):
- cmake --build: clean, no warnings, all targets link.
- test-metal-ops: 14/14 PASS, 0 FAIL.
- End-to-end synthesis (ES prompt, gianni.wav, --seed 42, greedy): md5 57cc80f27a122f03435fd05f47d1b3d2 -- byte-exact vs the pre-fix baseline. T3 wall time ~3.9s -> ~0.9s (fix GustavoA1604#4).
- CPU CFG smoke test (--n-gpu-layers 0, --text "Hola.", es): completes cleanly, S3Gen ~12s for 12 CFM steps × 2 forward calls (cond + uncond), produces valid 1.1s WAV.

Issues GustavoA1604#5 (redundant peek+open in load_model_gguf), GustavoA1604#7 (/g deny-list breadth in requantize-gguf.py), GustavoA1604#8 (forward-hook idiom in dump-t3-mtl-reference.py), and #9 (CMake duplicate cli_main.cpp build) are tracked but intentionally not folded in here -- the reviewer flagged them as cosmetic / trivial / fine.

Co-authored-by: Cursor <cursoragent@cursor.com>
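The top-k fix described in item 2 of the commit message above comes down to which position std::nth_element is asked to place. The sketch below is an illustration of that technique, not the repository's sampler: `top_k_filter` is a hypothetical name, and masking with negative infinity is an assumed convention.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <numeric>
#include <vector>

// Keep the k largest logits (ties with the k-th largest survive) and mask the rest.
inline void top_k_filter(std::vector<float>& logits, int k) {
    const int n = static_cast<int>(logits.size());
    if (k <= 0 || k >= n) return;  // nothing to filter

    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);

    // Partition on position k-1 so idx[k-1] is exactly the k-th largest logit.
    // Partitioning on position k instead leaves idx[k-1] as an arbitrary
    // member of the top k, which makes the cut too high.
    std::nth_element(idx.begin(), idx.begin() + (k - 1), idx.end(),
                     [&logits](int a, int b) { return logits[a] > logits[b]; });

    const float cut = logits[idx[k - 1]];
    for (float& x : logits) {
        if (x < cut) x = -std::numeric_limits<float>::infinity();
    }
}
```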

ogad-tether · 2026-05-06T10:36:15Z

Addressed the review feedback in 06da42a:

Merged latest main; PR is now clean/mergeable.
Removed .github/workflows/supertonic-portable-build.yml.
Added supertonic_model::generation_id and included it in thread-local graph/layout cache keys to avoid stale ctx_w tensor reuse across repeated synthesize() calls on the same thread.
Added missing ggml_free(ctx) calls for local-buffer graph contexts in text, duration, vector residual islands, and vocoder trace paths.
Refactored Supertonic CMake harnesses to link against tts-cpp instead of recompiling supertonic_*.cpp per test target.
Removed the f32-only engine guard and expanded f16/q8_0 GGUF storage tensors to F32 runtime tensors at load time so those GGUFs can run on the existing graph/custom-kernel path.
Aligned converter default steps to 5, matching the reference dumps and README examples.

Validation run locally:

cmake -S . -B build
cmake --build build --target tts-cli supertonic-cli supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick

ogad-tether · 2026-05-06T10:51:34Z

Detailed follow-up on each requested item, all addressed in 06da42a:

Windows/latest main: pulled and merged latest main. The conflicting CLI parser area now keeps upstream's Windows/main changes and the Supertonic autodispatch options. PR merge state is now clean.
Stale thread-local graph caches: added supertonic_model::generation_id, assigned a monotonic generation on each successful load, and included that generation in every persistent Supertonic cache key (text_relpos_graph_cache, text_ffn_graph_cache, speech_attention_cache, vector attention/group/res-style/tail caches, vocoder graph cache, and the static layout caches). This prevents a new stack-allocated supertonic_model at the same address from reusing graphs that captured tensors from a freed ctx_w.
ggml_context leaks: added explicit ggml_free(ctx) cleanup for the local-buffer ggml_init sites in text encoder, duration, vector estimator residual/local graph blocks, and vocoder trace, including reserve/allocr failure paths.
Supertonic test compile duplication: refactored the Supertonic harness targets in CMakeLists.txt to link against the existing tts-cpp static library instead of recompiling supertonic_*.cpp into each test executable.
Non-f32 GGUF support: removed the f32-only guard from supertonic_engine.cpp. The loader now accepts f16 and q8_0 GGUF storage by expanding those tensors to F32 runtime tensors during load, so the current GGML/custom-kernel path works for those model files. I also generated and smoke-tested local f16/q8_0 models with tts-cli.
Default-step mismatch: aligned the converter default to 5 steps so locally generated GGUF metadata matches the reference dumps and README examples unless the user explicitly passes --default-steps.
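The load-time expansion mentioned in the non-f32 item above can be illustrated for q8_0, which in GGML groups values into blocks of 32 signed 8-bit integers sharing one scale. The sketch simplifies the block layout by holding the scale as a float (GGML stores it as fp16), and `BlockQ8` and `expand_to_f32` are hypothetical names, not the loader's.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 32;  // elements per q8_0 block in GGML

// Simplified block: the scale is a float here to avoid a half-float conversion.
struct BlockQ8 {
    float scale;
    std::array<std::int8_t, kBlockSize> q;
};

// Expand quantized blocks to a flat F32 buffer: value = scale * q.
inline std::vector<float> expand_to_f32(const std::vector<BlockQ8>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8& block : blocks) {
        for (std::int8_t v : block.q) {
            out.push_back(block.scale * static_cast<float>(v));
        }
    }
    assert(out.size() == blocks.size() * kBlockSize);
    return out;
}
```

Expanding at load time keeps the file small on disk while the runtime kernels continue to see F32 tensors, at the cost of the F32 memory footprint once loaded.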

Local validation:

cmake -S . -B build
cmake --build build --target tts-cli supertonic-cli supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick
Generated smoke WAVs for f32/f16/q8_0 through tts-cli under artifacts/supertonic-quant-smoke/.

… + scaffolding caches (multilingual Vulkan)

Targets the per-synth host-CPU overhead that round 1 / round-HIFT didn't address, on top of upstream/multilingual_merged (now in main via PR GustavoA1604#7). Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo locks the pre-change MD5 baseline, then re-verifies after every cache. All 3 invariants (multilingual single-shot, multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact.

Seven new caches
----------------

All host-side, model-agnostic, no GGUF-format change, no public-API change. Same teardown discipline as the existing g_cfm_estimator_cache (destroy() before ggml_backend_free). Sit alongside the existing round-1 caches.

- g_encoder_graph_cache (keyed on T): full run_encoder graph + gallocator. Streaming chunks of varying length still produce correct output (rebuilds on key change).
- g_hift_graph_cache (keyed on pack(T_mel, T_stft)) + g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator. Parallel (graph-input-name, source-tensor-ptr) metadata lets cache hits re-feed each alpha-input slot from g_inv_alpha_results without rebuilding the graph.
- g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph + gallocator.
- cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)): compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired twice per encoder run (T and 2T). Multilingual T~350+ at D=512 is a real wedge of per-synth host time.
- cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*): HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6 alpha tensors); each is a tensor_get + per-element reciprocal. Alpha tensors are constant for the model lifetime.
- cached_hann_window / cached_istft_kernel (g_hann_window_cache / g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft (constant 16 in the chatterbox HiFT path).
- cached_window_sum (g_window_sum_cache, keyed on pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across same-shape synth calls.

A new graph_cache struct (used by encoder / HiFT / F0) and a pack_hift_key helper centralise the explicit destroy()-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition. The destroy path is unified into a renamed s3gen_release_synth_caches() (replaces the old g_cfm_estimator_cache_destroy()), called from s3gen_model_cache_release, the cache-miss backend-swap path, and the explicit s3gen_unload().

Negative result documented (bug caught and fixed during dev)
------------------------------------------------------------

First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held g_synth_caches_mu while calling cached_inv_alpha, which itself takes the same mutex internally - classic re-entrant deadlock. Fix: snapshot g_hift_inv_alpha_entries under the mutex into a local vector, then iterate without the lock (cached_inv_alpha re-acquires the mutex per call but with no nesting). General rule kept as an inline comment: never hold a cache-state mutex while calling any other cached_* helper.

Performance - RTX 5090, multilingual auto-split, warm-state seg 2..6
-------------------------------------------------------------------

Within-process win on top of round 1 + round-HIFT:

metric | pre-round-2 | post-round-2 | Δ
S3GEN_INFER | 159.8 ms | 140.8 ms | -19.0 ms (-11.9 %)
cfm_total | 122.2 ms | 118.7 ms | -3.5 ms (-2.9 %)
cfm_step0 | 13.24 ms | 13.18 ms | noise (already cached round 1)
hift_total | 17.96 ms | 16.30 ms | -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline (round 1 + round-HIFT + round 2):

metric | upstream/mtl_merged | this PR (full) | Δ
S3GEN_INFER | 169.9 ms | 140.8 ms | -29.1 ms (-17.1 %)
cfm_total | 132.5 ms | 118.7 ms | -13.8 ms (-10.4 %)
cfm_step0 | 24.1 ms | 13.2 ms | -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is the actual GPU CFM compute, which is not host-cacheable; it would need shader-side optimisation (e.g. tensor-core engagement via cooperative_matrix2; deferred, see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------

Locked invariants pass byte-for-byte vs the pre-change baseline:

Multilingual single-shot c65d98f15a59b8fe9cad98e46eb3fb30 ✓
Multilingual 6-segment multi 0b374c7474895a3387b9f1df10b3c1b8 ✓
Turbo single-shot 6219f4338b1b4fb9dc60481216153b49 ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo is the test-first harness.

Files
-----

src/chatterbox_tts.cpp +373 / -79 (net diff vs round-1 head)
PROGRESS.md §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache infrastructure: graph_cache struct, seven new globals, the s3gen_release_synth_caches lifecycle hook, the five cached_* scaffolding helpers, and the build_graph / cache-hit branches in run_encoder / run_hift_decode / run_f0_predictor.

Co-authored-by: Cursor <cursoragent@cursor.com>
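The deadlock fix described in the commit message above (copy the entry list under the mutex, then iterate with the lock released) can be sketched as follows. `SynthCaches` and its members are simplified stand-ins for the globals named in the message. Tensor pointers are replaced by integer ids and the cached value is a plain reciprocal, both purely for illustration.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the shared cache state guarded by one mutex.
struct SynthCaches {
    std::mutex mu;
    std::vector<int> alpha_entries;          // ids of inputs to refresh
    std::map<int, float> inv_alpha_results;  // memoised result per id

    // Takes mu internally, so callers must not already hold it:
    // std::mutex is not re-entrant and a nested lock would deadlock.
    float cached_inv_alpha(int id) {
        assert(id != 0);
        std::lock_guard<std::mutex> lock(mu);
        auto it = inv_alpha_results.find(id);
        if (it == inv_alpha_results.end()) {
            it = inv_alpha_results.emplace(id, 1.0f / static_cast<float>(id)).first;
        }
        return it->second;
    }

    // Snapshot the entry list under the lock, then iterate without it.
    std::vector<float> refresh_alpha_inputs() {
        std::vector<int> snapshot;
        {
            std::lock_guard<std::mutex> lock(mu);
            snapshot = alpha_entries;
        }  // lock released before any cached_* helper is called
        std::vector<float> out;
        out.reserve(snapshot.size());
        for (int id : snapshot) {
            out.push_back(cached_inv_alpha(id));
        }
        return out;
    }
};
```

Switching to std::recursive_mutex would also avoid the hang, but the snapshot approach keeps lock hold times short and matches the rule stated in the message.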

Mirrors the parakeet-cpp port README layout so a downstream consumer can answer 'what does this library do, how do I link it, and which CMake knobs do I need to know about?' from the top of the README without scrolling through the 1300-line standalone development walk-through. No content removed; existing standalone material stays verbatim, just shifted down by ~80 lines.

Adds three new blocks near the top:

- ## API overview (between the benchmark tables and 'Pipeline at a glance'). Two-row table for the high-level entry points exported through TTS_CPP_API:
  * tts_cpp::chatterbox::Engine::synthesize - Chatterbox T3+S3Gen+HiFT
  * tts_cpp::supertonic::synthesize - Supertonic CPU TTS
  Trailing paragraph mentions the lower-level helpers (s3gen_synthesize_to_wav / s3gen_preload / s3gen_unload / tts_cpp_cli_main), points at <tts-cpp/export.h>, and explicitly flags that detail-namespaced symbols (used by the supertonic / chatterbox test harnesses) are not part of the public API and are hidden in SHARED builds.
- ### Consumer integration (subsection of API overview). Calls out that the qvac speech-stack qvac-ext-lib-whisper.cpp wrapper port consumes ggml from the qvac-ext-ggml/speech branch directly (Metal / OpenCL / Vulkan patches included) and does NOT ship scripts/setup-ggml.sh or patches/ - those are standalone-dev tools maintained in this repo only. Provides the find_package(tts-cpp CONFIG REQUIRED) + target_link_libraries(... tts-cpp::tts-cpp) + 8-line Engine::synthesize C++ snippet that's the entire consumer-side integration.
- ### Useful CMake options (inside section 1, between the GPU backend paragraph and the binaries table). Full table of the project-namespaced flags: TTS_CPP_BUILD_LIBRARY, TTS_CPP_BUILD_SHARED (new from items 7+8), TTS_CPP_BUILD_EXECUTABLES, TTS_CPP_BUILD_TESTS, TTS_CPP_INSTALL, TTS_CPP_USE_SYSTEM_GGML, TTS_CPP_GGML_LIB_PREFIX, TTS_CPP_CCACHE (new from items 7+8). Plus a secondary table for the ctest-fixture cache paths (TTS_CPP_TEST_{MODEL,AUDIO,REF}_DIR) and a one-liner explaining the REQUIRES auto-disable behaviour from item 7.

Touches existing prose in two places:

- The setup-ggml.sh paragraph in section 1 gets a one-paragraph follow-up clarifying it (and patches/) are standalone-development tools only, with a back-link to the Consumer integration section (item 9: 'document setup-ggml.sh inertness' folded into this framing rather than landed as a separate doc-only commit). Also strengthens the existing 'Re-running is safe' line to 'idempotent and destructive' so a dev hacking on ./ggml is warned before losing local edits.
- The ### Alternative: consume ggml from vcpkg subsection now opens with one sentence positioning it as the CMake-mechanic detail behind the Consumer integration story, with a forward link to the qvac-ext-ggml/speech branch.

Also updates the binaries table in section 1 to list the missing PR #6 + PR #7 binaries that landed since the README was last refreshed: supertonic-cli, supertonic-bench, test-cpu-caches, test-t3-caches, and the test-supertonic-* family. Trailing paragraph notes that test-* binaries register with CTest so `ctest -C Release -L unit` / `ctest -C Release -L fixture` works out of the build directory.

No code changes, no CMake changes, no install behaviour changes. README.md +128 / -10 lines.

Co-authored-by: Cursor <cursoragent@cursor.com>

ogad-tether and others added 30 commits May 1, 2026 12:39

Add experimental Supertonic GGUF CPU path

ae62b43

Made-with: Cursor

Keep vector trace boundary at attention output

6c0ab2f

Remove the unused residual/norm graph nodes from the vector trace harness. The current green boundary is attention output projection; residual add remains the next focused layout issue. Co-authored-by: Cursor <cursoragent@cursor.com>

Promote Supertonic duration and text paths to GGML

dd7d8f4

Route the remaining Supertonic stages through GGML-backed execution while keeping parity trace harnesses available for debugging. Co-authored-by: Cursor <cursoragent@cursor.com>

Add Supertonic GGML production controls

e4c934f

Wire thread-aware graph execution and trim trace overhead so benchmarks exercise the production GGML path more accurately. Co-authored-by: Cursor <cursoragent@cursor.com>

Split Supertonic production outputs from traces

6a6b688

Return duration and vector production outputs directly so trace tensors remain a debug-only transport. Co-authored-by: Cursor <cursoragent@cursor.com>

Move Supertonic speech text attention to GGML

8c756eb

Use GGML flash attention for speech-prompted text attention and extend the text trace to cover final encoder output. Co-authored-by: Cursor <cursoragent@cursor.com>

Add JSON output to Supertonic benchmark

fb1347b

Emit structured stage and RTF metrics so GGML and ONNX benchmark runs can be compared consistently. Co-authored-by: Cursor <cursoragent@cursor.com>
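
To make the "stage and RTF metrics" idea concrete, here is a minimal sketch of how per-stage timings could be turned into a real-time factor and a JSON line. It assumes the common definition RTF = synthesis time / generated audio duration; the struct, function names, and JSON field names are hypothetical and are not taken from supertonic-bench.

```cpp
#include <cstdio>
#include <string>

// Hypothetical per-stage wall-clock timings in milliseconds.
// The four stages follow the PR text: duration, text encoder,
// vector estimator, vocoder.
struct StageTimesMs {
    double duration = 0.0;
    double text_encoder = 0.0;
    double vector_estimator = 0.0;
    double vocoder = 0.0;

    double total() const {
        return duration + text_encoder + vector_estimator + vocoder;
    }
};

// Assumed definition: RTF = synthesis time / audio duration.
// Values below 1.0 mean faster than real time.
inline double real_time_factor(double synth_ms, double audio_ms) {
    return audio_ms > 0.0 ? synth_ms / audio_ms : 0.0;
}

// Serialise the metrics as one JSON object (field names are illustrative).
inline std::string metrics_to_json(const StageTimesMs& t, double audio_ms) {
    char buf[512];
    std::snprintf(buf, sizeof(buf),
                  "{\"duration_ms\":%.3f,\"text_encoder_ms\":%.3f,"
                  "\"vector_estimator_ms\":%.3f,\"vocoder_ms\":%.3f,"
                  "\"total_ms\":%.3f,\"audio_ms\":%.3f,\"rtf\":%.4f}",
                  t.duration, t.text_encoder, t.vector_estimator, t.vocoder,
                  t.total(), audio_ms, real_time_factor(t.total(), audio_ms));
    return std::string(buf);
}
```

Emitting the same keys from both the GGML and ONNX runs is what would let the two be diffed mechanically.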

Align Supertonic ONNX benchmark reporting

14cd878

Support Supertonic 2 language wrapping modes and JSON metrics for matched GGML comparisons. Co-authored-by: Cursor <cursoragent@cursor.com>
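
The three wrapping behaviours described in this PR (no wrapping for Supertone/supertonic, the old prefix-only form, and the clean open/close form for supertonic-2) can be sketched as below. The enum and function names are hypothetical; only the resulting string shapes come from the PR description.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical names; the PR text only describes the resulting strings.
enum class LangWrap {
    None,        // stable English bundle: text is passed through unchanged
    PrefixOnly,  // old supertonic-2 wrapper, reported to cause the stutter
    OpenClose    // clean supertonic-2 wrapper: <lang>text</lang>
};

// Wrap `text` with language tags according to `mode`.
inline std::string wrap_text(const std::string& text,
                             const std::string& lang,
                             LangWrap mode) {
    switch (mode) {
        case LangWrap::None:
            return text;
        case LangWrap::PrefixOnly:
            return "<" + lang + ">" + text;
        case LangWrap::OpenClose:
            return "<" + lang + ">" + text + "</" + lang + ">";
    }
    throw std::invalid_argument("unknown LangWrap mode");
}
```

For example, `wrap_text("hello", "en", LangWrap::OpenClose)` yields `<en>hello</en>`, the form the PR reports as sounding clean with supertonic-2.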

Document Supertonic production gates

8e099a3

Expand portable build coverage and document benchmark commands, thread policy, and remaining relative-attention work. Co-authored-by: Cursor <cursoragent@cursor.com>

Port Supertonic text relpos attention to GGML

19a16df

Express learned relative key and value attention terms with stock GGML ops so the text encoder no longer falls back to scalar attention. Co-authored-by: Cursor <cursoragent@cursor.com>

Cache Supertonic vector static layouts

14ea376

Reuse text and style host layout conversions across denoising steps while preserving vector trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Cache Supertonic text relpos graphs

63cbd5f

Reuse GGML graph allocations and static band masks for text relative-position attention by layer and sequence length. Co-authored-by: Cursor <cursoragent@cursor.com>
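
Caching "by layer and sequence length" amounts to a lookup keyed on that pair, building the graph only on a miss. The sketch below shows the pattern with a generic `Graph` type standing in for whatever graph/allocator state the runtime keeps; the class and its members are hypothetical and do not use any real GGML calls.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Generic build-once cache keyed by (layer index, sequence length).
// `Graph` is a placeholder for the cached graph/allocation state.
template <typename Graph>
class GraphCache {
public:
    using Builder = std::function<std::unique_ptr<Graph>(int layer, int seq_len)>;

    explicit GraphCache(Builder build) : build_(std::move(build)) {}

    // Return the cached graph for this key, building it on first use.
    Graph& get(int layer, int seq_len) {
        const std::pair<int, int> key(layer, seq_len);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(key, build_(layer, seq_len)).first;
        } else {
            ++hits_;
        }
        return *it->second;
    }

    // Drop all cached graphs, e.g. when the model is unloaded.
    void clear() { cache_.clear(); }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    Builder build_;
    std::map<std::pair<int, int>, std::unique_ptr<Graph>> cache_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

A new sequence length produces a new entry rather than invalidating existing ones, so repeated calls with the same shapes pay the graph-construction cost once.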

Profile and split Supertonic vector attention

15b9e60

Add opt-in vector island profiling and avoid recomputing the front graph for the first attention pass. Co-authored-by: Cursor <cursoragent@cursor.com>

Split Supertonic vector text attention islands

f6db0eb

Avoid recomputing QKV graphs during text-attention flash passes and keep opt-in vector island profiling. Co-authored-by: Cursor <cursoragent@cursor.com>

Split Supertonic vector style attention islands

b733206

Avoid rerunning style QKV projections during flash attention passes while preserving residual continuations for parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse Supertonic vector attention packing

556337f

Use shared runtime buffers for packed QKV attention layouts across vector islands. Co-authored-by: Cursor <cursoragent@cursor.com>

Update Supertonic production status

0f468ef

Document the fully GGML-backed text path, vector profiling flag, and matched GGML versus ONNX benchmark baseline. Co-authored-by: Cursor <cursoragent@cursor.com>

Cache Supertonic vector text attention graphs

6ef6642

Reuse split text-attention graph/allocation state across vector steps while preserving trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Clean Supertonic duration production path

8e06762

Let duration inference return its projection directly without allocating trace vectors in the hot path. Co-authored-by: Cursor <cursoragent@cursor.com>

Cache Supertonic vector attention graph islands

2de4174

Reuse generic attention-only graph state for text and style vector islands while preserving trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Cache Supertonic speech text attention graphs

cf7d7ba

Reuse speech-prompted attention graph/allocation state across text encoder calls while preserving parity. Co-authored-by: Cursor <cursoragent@cursor.com>

Optimize Supertonic GGML CPU runtime

13e3a2a

Harden the Supertonic production GGML path with cached graphs, portable custom CPU kernels, and benchmark documentation so the branch reflects the current ONNX-comparable performance work. Co-authored-by: Cursor <cursoragent@cursor.com>

ogad-tether and others added 3 commits May 5, 2026 17:37

Add Supertonic GGUF setup flow

da20099

Let the Supertonic converter download official Hugging Face bundles when local ONNX assets are not provided, and surface clear setup guidance when the local GGUF is missing. Co-authored-by: Cursor <cursoragent@cursor.com>

Document Supertonic GGML progress

97a1812

Capture the Supertonic port history, parity findings, optimization wins and failures, final benchmark matrix, and remaining production work in a dedicated progress journal. Co-authored-by: Cursor <cursoragent@cursor.com>

Route Supertonic GGUFs through tts-cli

a88d1d4

Autodetect Supertonic models from GGUF metadata and dispatch them through the Supertonic engine while preserving the existing Chatterbox routing. Co-authored-by: Cursor <cursoragent@cursor.com>
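
The dispatch described here can be pictured as a lookup on a metadata key with Chatterbox as the fallback. In this sketch the metadata is a plain string map standing in for a parsed GGUF header; `general.architecture` is the standard GGUF architecture key, but the value `"supertonic"` and all names below are assumptions, since the PR text does not say which key or value tts-cli checks.

```cpp
#include <map>
#include <string>

// Hypothetical engine selector; not the actual tts-cli code.
enum class EngineKind { Chatterbox, Supertonic };

// `metadata` stands in for key/value pairs already read from a GGUF header.
inline EngineKind select_engine(const std::map<std::string, std::string>& metadata) {
    const auto it = metadata.find("general.architecture");
    // Assumed marker value; anything else keeps the existing Chatterbox routing.
    if (it != metadata.end() && it->second == "supertonic") {
        return EngineKind::Supertonic;
    }
    return EngineKind::Chatterbox;
}
```

Defaulting to Chatterbox when the key is missing or unrecognised matches the stated goal of preserving the existing routing.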

GustavoA1604 requested changes May 5, 2026


Comment thread .github/workflows/supertonic-portable-build.yml Outdated

Address Supertonic PR review feedback.

06da42a

Merge latest main, remove the extra workflow, harden Supertonic caches, clean up local GGML contexts, reduce Supertonic test rebuilds, and allow f16/q8_0 GGUF storage via load-time F32 expansion. Co-authored-by: Cursor <cursoragent@cursor.com>
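
"Load-time F32 expansion" means the weights stay compact on disk and are converted to 32-bit floats once when the model is loaded. The sketch below illustrates the arithmetic for f16 and for a q8_0-style layout (blocks of 32 signed 8-bit values sharing one f16 scale, which is how I understand ggml's q8_0 format). It is a standalone illustration with hypothetical type and function names; the actual loader presumably relies on ggml's own conversion routines.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Decode an IEEE 754 binary16 value stored as raw bits.
inline float half_to_float(std::uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exp  = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float v;
    if (exp == 0) {
        v = std::ldexp(static_cast<float>(mant), -24);             // zero / subnormal
    } else if (exp == 31) {
        v = mant ? std::numeric_limits<float>::quiet_NaN()
                 : std::numeric_limits<float>::infinity();         // NaN / infinity
    } else {
        v = std::ldexp(static_cast<float>(mant + 1024), exp - 25); // normal
    }
    return sign ? -v : v;
}

constexpr int kQ8BlockSize = 32;

// Illustrative q8_0-style block: one f16 scale plus 32 int8 quants.
struct BlockQ8 {
    std::uint16_t d;               // scale, as f16 bits
    std::int8_t qs[kQ8BlockSize];  // quantised values
};

// Expand f16 storage to F32.
inline std::vector<float> expand_f16(const std::vector<std::uint16_t>& src) {
    std::vector<float> out;
    out.reserve(src.size());
    for (std::uint16_t h : src) out.push_back(half_to_float(h));
    return out;
}

// Expand q8_0-style blocks to F32: value = scale * quant.
inline std::vector<float> expand_q8_0(const std::vector<BlockQ8>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kQ8BlockSize);
    for (const BlockQ8& b : blocks) {
        const float d = half_to_float(b.d);
        for (int i = 0; i < kQ8BlockSize; ++i) {
            out.push_back(d * static_cast<float>(b.qs[i]));
        }
    }
    return out;
}
```

Expanding at load time trades memory for simplicity: the compute kernels only ever see F32 tensors, so smaller GGUF files do not require quantised kernels.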

ogad-tether requested a review from GustavoA1604 May 6, 2026 10:54

GustavoA1604 merged commit f807035 into GustavoA1604:main May 6, 2026


GustavoA1604 merged 34 commits into GustavoA1604:main from ogad-tether:multilingual_merged
