Add optimized Supertonic GGML CPU path #7
…rity

The vector_estimator step-0 was failing parity (max_abs=2.31e-1) because apply_rope() recomputed the RoPE frequency table from the standard formula theta[d] = 10000^(-d/(D/2)). The actual ONNX-baked values are 10x larger (theta[d] = 10000^((8-d)/32)), so my analytic theta mis-rotated keys/queries in every text cross-attention block. Read theta directly from the GGUF tensor main_blocks.3.attn.theta (it is shared across all four RoPE blocks).

After the fix all five stages pass:

  preprocess: 49 tokens, exact
  duration: abs=0
  text_encoder: max_abs=4.8e-6
  vector step0: max_abs=1.4e-6 (was 2.3e-1)
  vocoder: max_abs=1.1e-5

Add test-supertonic-pipeline that chains text_encoder -> 5-step denoise -> vocoder against wav_full.npy. End-to-end parity is max_abs=6.5e-5 in float, ~7.4e-5 after 16-bit PCM round-trip.

Add EngineOptions.noise_npy_path / supertonic-cli --noise-npy so users can reproduce the ONNX reference run bit-exactly without depending on NumPy's RNG sequence.

Made-with: Cursor
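The factor of 10 quoted above follows directly from the two formulas, since 10000^(8/32) = 10. A minimal sketch that checks this numerically, assuming D/2 = 32 as the exponent denominators suggest; the function names are illustrative and are not taken from the repository:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Standard analytic RoPE table: theta[d] = base^(-d / half_dim).
std::vector<double> analytic_theta(int half_dim, double base = 10000.0) {
    assert(half_dim > 0);
    std::vector<double> theta(half_dim);
    for (int d = 0; d < half_dim; ++d) {
        theta[d] = std::pow(base, -static_cast<double>(d) / half_dim);
    }
    return theta;
}

// Table as described in the commit message: theta[d] = base^((8 - d) / 32).
// The constants 8 and 32 are copied from the message; half_dim only sets the length.
std::vector<double> baked_theta(int half_dim, double base = 10000.0) {
    assert(half_dim > 0);
    std::vector<double> theta(half_dim);
    for (int d = 0; d < half_dim; ++d) {
        theta[d] = std::pow(base, (8.0 - d) / 32.0);
    }
    return theta;
}

// Largest deviation of baked[d] / analytic[d] from a constant expected ratio.
double max_ratio_error(int half_dim, double expected_ratio) {
    const std::vector<double> a = analytic_theta(half_dim);
    const std::vector<double> b = baked_theta(half_dim);
    double worst = 0.0;
    for (int d = 0; d < half_dim; ++d) {
        worst = std::max(worst, std::fabs(b[d] / a[d] - expected_ratio));
    }
    return worst;
}
```

Calling max_ratio_error(32, 10.0) should return a value near zero, which is why loading the stored tensor is safer than assuming the textbook formula.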
Adds two matched benchmark harnesses that report per-stage wall time (preprocess / duration / text_encoder / N denoise steps / vocoder) plus end-to-end RTF, with min/median/mean/p95/max across N runs after a configurable warmup:

  build/supertonic-bench - times the C++ GGML CPU path
  scripts/bench-supertonic-onnx.py - times the ONNX Runtime path

Both accept --noise-npy so the runs are deterministic and produce directly comparable audio.

Headline numbers on Apple M2 (8 cores), 4.11s of audio:

  ONNX (CPUExecutionProvider): 180 ms total, RTF 0.044, 22.8x realtime
  C++ GGML (single thread): 14451 ms total, RTF 3.52, 0.28x realtime

Output matches within parity tolerance (max_abs 6.5e-5 in float, 7.4e-5 after PCM round-trip), not bit for bit. The 80x perf gap is entirely from the C++ port being single-threaded scalar today (no SIMD, no BLAS, no quant) - it is designed for correctness, not throughput. See artifacts/supertonic-bench.md for the full breakdown and proposed follow-ups.

Made-with: Cursor
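For readers comparing the two harnesses, the reported statistics and the RTF can be reproduced with a few lines. This is a minimal sketch, not the harness code: the names RunStats, summarize and real_time_factor are hypothetical, and the nearest-rank p95 definition is an assumption, since the commit does not say which percentile method is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

struct RunStats {
    double min, median, mean, p95, max;
};

// Summarise per-run wall times in milliseconds.
// p95 uses the nearest-rank method (an assumption, see lead-in).
RunStats summarize(std::vector<double> ms) {
    assert(!ms.empty());
    std::sort(ms.begin(), ms.end());
    const std::size_t n = ms.size();
    RunStats s{};
    s.min = ms.front();
    s.max = ms.back();
    s.mean = std::accumulate(ms.begin(), ms.end(), 0.0) / static_cast<double>(n);
    s.median = (n % 2 == 1) ? ms[n / 2] : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);
    const std::size_t rank = static_cast<std::size_t>(std::ceil(0.95 * static_cast<double>(n)));
    s.p95 = ms[rank - 1];  // rank is 1-based and lies in [1, n]
    return s;
}

// Real-time factor: synthesis time divided by audio duration (lower is faster).
double real_time_factor(double synth_ms, double audio_seconds) {
    assert(audio_seconds > 0.0);
    return (synth_ms / 1000.0) / audio_seconds;
}
```

With the headline numbers, real_time_factor(180.0, 4.11) is about 0.044 and real_time_factor(14451.0, 4.11) is about 3.52, matching the figures quoted in the commit message.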
Support the English-only Supertone/supertonic bundle alongside the multilingual supertonic-2 bundle by storing model-family metadata, the default voice, and whether text should be wrapped in language tags. English now uses the stable no-wrap path, while the existing multilingual fixtures continue to use <lang> wrapping. This matches the latest QVAC behavior that avoids English stuttering on the quick-brown-fox prompt, while preserving parity for the multilingual supertonic-2 flow. Co-authored-by: Cursor <cursoragent@cursor.com>
Latest QVAC Supertonic behavior shows that supertonic-2 English stutters with the old prefix-only wrapper (<en>text ) but is clean with open/close tags (<en>text</en>). Add explicit language_wrap_mode metadata with none, prefix, and open_close modes so stable English supertonic keeps no wrapping while supertonic-2 defaults to the clean open/close path. Regenerated local supertonic-2 references with open_close wrapping and validated preprocessing, duration, text encoder, vector step, vocoder, and full pipeline parity. Co-authored-by: Cursor <cursoragent@cursor.com>
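The three wrapping modes named in this commit can be pictured as a small string transform. The sketch below is illustrative only: the enum and function names are hypothetical, and the exact prefix format (with or without trailing whitespace) is an assumption based on the examples in the message.

```cpp
#include <cassert>
#include <string>

// Mirrors the three language_wrap_mode values named in the commit message.
enum class LanguageWrapMode { None, Prefix, OpenClose };

// Wrap input text with language tags according to the selected mode.
std::string wrap_text(const std::string & text, const std::string & lang, LanguageWrapMode mode) {
    // A language code is only required when a tag is actually emitted.
    assert(mode == LanguageWrapMode::None || !lang.empty());
    switch (mode) {
        case LanguageWrapMode::None:
            return text;                                          // stable English path
        case LanguageWrapMode::Prefix:
            return "<" + lang + ">" + text;                       // old prefix-only wrapper
        case LanguageWrapMode::OpenClose:
            return "<" + lang + ">" + text + "</" + lang + ">";   // open/close path
    }
    return text;  // unreachable for valid enum values
}
```

For example, wrap_text("text", "en", LanguageWrapMode::OpenClose) yields `<en>text</en>`, the form reported as clean for supertonic-2.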
Align the Supertonic runtime with the existing Chatterbox GGML/GGUF patterns: model metadata now carries defaults and ftype, backend initialization supports the same optional GGML backends, and the converter can emit f32/f16/q8_0 GGUFs for future graph backends. Port the vocoder to a real GGML graph path and validate it against the scalar reference. Add trace harnesses for vocoder and vector-estimator graph boundaries so remaining stages can be ported incrementally without losing parity. The vocoder graph now matches the ONNX reference at max_abs ~1.6e-6, and the vector trace is green through projection, mask, first ConvNeXt group, time add, and the following ConvNeXt block. Add a portable Supertonic CPU build workflow for Linux, macOS, and Windows using the existing CMake/GGML switches. Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the GGML vector-estimator trace beyond projection, mask, ConvNeXt, and time-add into the first text-attention block. The trace now validates Q/K/V projections, CPU-applied RoPE tensors, flash-attention context, and the attention output projection against the scalar reference. This establishes a clean parity boundary before the next issue: the residual add after attention currently needs separate layout debugging even though both operands compare correctly on their own. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the unused residual/norm graph nodes from the vector trace harness. The current green boundary is attention output projection; residual add remains the next focused layout issue. Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the vector-estimator GGML trace beyond text-attention output projection by isolating the residual add, post-attention norm, and following ConvNeXt block in a small continuation graph. This avoids buffer/lifetime ambiguity from the multi-pass attention trace and keeps each boundary parity-checkable. The new trace checkpoints pass through attention residual, norm, and main_blocks.4.convnext.0 with max_abs around 1e-6. Co-authored-by: Cursor <cursoragent@cursor.com>
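The max_abs figures quoted throughout these commits are the largest element-wise absolute difference between a stage output and its reference. A minimal sketch of such a checkpoint comparison follows; ParityReport and compare_tensors are hypothetical names, and the real trace harness may compute or report these differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ParityReport {
    double max_abs;  // largest |got - ref| over all elements
    double rms;      // root-mean-square of the element-wise differences
};

// Compare a stage output against its reference dump, element by element.
ParityReport compare_tensors(const std::vector<float> & got, const std::vector<float> & ref) {
    assert(got.size() == ref.size() && !got.empty());
    double max_abs = 0.0;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const double diff = static_cast<double>(got[i]) - static_cast<double>(ref[i]);
        max_abs = std::max(max_abs, std::fabs(diff));
        sum_sq += diff * diff;
    }
    return {max_abs, std::sqrt(sum_sq / static_cast<double>(got.size()))};
}

// A checkpoint passes when its worst element stays under the tolerance.
bool within_tolerance(const ParityReport & report, double tol) {
    return report.max_abs <= tol;
}
```

Checking each boundary separately, as the commit describes, means a failure points at one graph segment and not at the whole estimator.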
Route the Supertonic engine through the GGML vector estimator and keep the scalar implementation as a parity baseline. The GGML vector step now covers the full estimator: all repeated groups, text/style attention, final ConvNeXt stack, proj_out, and Euler update. Keep the detailed vector trace harness for diagnostics while making the production vector path skip scalar/intermediate trace emission and retain only the final next-latent output. Also switch Supertonic latent seeding to a NumPy-compatible RandomState sequence so default --seed output matches the clean ONNX reference noise. Co-authored-by: Cursor <cursoragent@cursor.com>
Route the remaining Supertonic stages through GGML-backed execution while keeping parity trace harnesses available for debugging. Co-authored-by: Cursor <cursoragent@cursor.com>
Wire thread-aware graph execution and trim trace overhead so benchmarks exercise the production GGML path more accurately. Co-authored-by: Cursor <cursoragent@cursor.com>
Return duration and vector production outputs directly so trace tensors remain a debug-only transport. Co-authored-by: Cursor <cursoragent@cursor.com>
Use GGML flash attention for speech-prompted text attention and extend the text trace to cover final encoder output. Co-authored-by: Cursor <cursoragent@cursor.com>
Emit structured stage and RTF metrics so GGML and ONNX benchmark runs can be compared consistently. Co-authored-by: Cursor <cursoragent@cursor.com>
Support Supertonic 2 language wrapping modes and JSON metrics for matched GGML comparisons. Co-authored-by: Cursor <cursoragent@cursor.com>
Expand portable build coverage and document benchmark commands, thread policy, and remaining relative-attention work. Co-authored-by: Cursor <cursoragent@cursor.com>
Express learned relative key and value attention terms with stock GGML ops so the text encoder no longer falls back to scalar attention. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse text and style host layout conversions across denoising steps while preserving vector trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse GGML graph allocations and static band masks for text relative-position attention by layer and sequence length. Co-authored-by: Cursor <cursoragent@cursor.com>
Add opt-in vector island profiling and avoid recomputing the front graph for the first attention pass. Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid recomputing QKV graphs during text-attention flash passes and keep opt-in vector island profiling. Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid rerunning style QKV projections during flash attention passes while preserving residual continuations for parity. Co-authored-by: Cursor <cursoragent@cursor.com>
Use shared runtime buffers for packed QKV attention layouts across vector islands. Co-authored-by: Cursor <cursoragent@cursor.com>
Document the fully GGML-backed text path, vector profiling flag, and matched GGML versus ONNX benchmark baseline. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse split text-attention graph/allocation state across vector steps while preserving trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>
Let duration inference return its projection directly without allocating trace vectors in the hot path. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse generic attention-only graph state for text and style vector islands while preserving trace parity. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse speech-prompted attention graph/allocation state across text encoder calls while preserving parity. Co-authored-by: Cursor <cursoragent@cursor.com>
Harden the Supertonic production GGML path with cached graphs, portable custom CPU kernels, and benchmark documentation so the branch reflects the current ONNX-comparable performance work. Co-authored-by: Cursor <cursoragent@cursor.com>
Let the Supertonic converter download official Hugging Face bundles when local ONNX assets are not provided, and surface clear setup guidance when the local GGUF is missing. Co-authored-by: Cursor <cursoragent@cursor.com>
Capture the Supertonic port history, parity findings, optimization wins and failures, final benchmark matrix, and remaining production work in a dedicated progress journal. Co-authored-by: Cursor <cursoragent@cursor.com>
Autodetect Supertonic models from GGUF metadata and dispatch them through the Supertonic engine while preserving the existing Chatterbox routing. Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604
left a comment
Also could you help address these?
- Could make it work for Windows with this. But could you pull the latest from main, since that may already fix it? 005779a
- Stale tensor pointers in thread_local graph caches
chatterbox.cpp/src/supertonic_text_encoder.cpp:508–512, 588–597, 797–809, and the parallel structures in supertonic_vector_estimator.cpp (vector_text_attention_cache, vector_group_graph_cache, vector_res_style_qkv_cache, vector_tail_graph_cache) and supertonic_vocoder.cpp:339–356 all keep thread_local caches keyed by the model address (rebuilt only when cache.model != &model). The graphs they hold contain require_source_tensor(model, …) pointers that live in the model's ctx_w.
synthesize in chatterbox.cpp/src/supertonic_engine.cpp:117 stack-allocates supertonic_model model; and calls free_supertonic_model(model) before returning. If a host calls synthesize() again from the same thread, the new stack-frame model is very likely to land at the same address, the cache key check passes, and the cached graph runs against ctx_w tensors that have been freed and replaced. Visible only when integrators call synthesize() more than once per thread (i.e. any server use); the bench tool itself loads once and is unaffected.
Suggested fix (any one of):
Key caches by model.ctx_w plus a monotonically increasing model.generation_id rather than &model.
Expose a supertonic_invalidate_thread_caches() and call it from free_supertonic_model.
Keep the caches as members of supertonic_model so they get destroyed with the model.
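The first suggested fix can be sketched as follows. This is a minimal illustration, not the repository's code: Model and GraphCache are hypothetical stand-ins for supertonic_model and the thread_local cache structs, and the cache is keyed on the generation id alone (the suggestion also mentions model.ctx_w, omitted here for brevity).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for supertonic_model: every instance gets a fresh id,
// so two models that happen to occupy the same stack address still differ.
// Copying a Model copies its id; a real implementation would need to decide
// whether copies are allowed.
struct Model {
    std::uint64_t generation_id;
    Model() : generation_id(next_generation()) {}
    static std::uint64_t next_generation() {
        static std::atomic<std::uint64_t> counter{1};
        return counter.fetch_add(1, std::memory_order_relaxed);
    }
};

// Hypothetical stand-in for one cached graph.
struct GraphCache {
    std::uint64_t generation_id = 0;  // 0 means "empty"
    int builds = 0;                   // rebuild counter, for illustration only
};

// Rebuilds the cached graph when the model generation changed.
// Returns true if a rebuild happened.
bool ensure_graph(GraphCache & cache, const Model & model) {
    assert(model.generation_id != 0);
    if (cache.generation_id == model.generation_id) {
        return false;  // same model instance: cached tensor pointers are still valid
    }
    cache.generation_id = model.generation_id;
    ++cache.builds;    // the real code would rebuild the ggml graph here
    return true;
}

// Per-thread cache accessor, matching the thread_local pattern in the review.
GraphCache & thread_cache() {
    thread_local GraphCache cache;
    return cache;
}
```

Because the id is never reused, a second synthesize() call on the same thread sees a cache miss even if the new model lands at the old address.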
- ggml_context leak per call
chatterbox.cpp/src/supertonic_text_encoder.cpp:901, :1021; supertonic_duration.cpp:521; many sites in supertonic_vector_estimator.cpp (g1_style_res_buf/g2_style_res_buf/g3_style_res_buf blocks, srgf block ~2189, etc.) all do:
ggml_init_params gp = { buf_size, buf.data(), true };
ggml_context * ctx = ggml_init(gp);
ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);
…
ggml_gallocr_free(allocr);
return …;
with no ggml_free(ctx). With current ggml (chatterbox.cpp/ggml/src/ggml.c:1571) the context struct itself is GGML_MALLOC'd on every call. Because mem_buffer_owned=false, ggml does not own (and would not free) the caller's buffer, but the context struct is still heap-allocated, so each step leaks one context struct. For per-CLI-invocation use this is negligible (a few KB), but the bench harness loops total_runs × per-run stage compositions and accumulates many leaks. This is the easy fix: add ggml_free(ctx) at the end of every site that calls ggml_init with the local-buffer pattern.
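The review asks for an explicit ggml_free(ctx) at each site. One alternative that also covers early returns is a scope guard built on std::unique_ptr. The sketch below demonstrates only the pattern: FakeContext, fake_init and fake_free are hypothetical stand-ins so the example compiles without ggml, and in the real code the deleter would call ggml_free on a ggml_context pointer.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for ggml_context / ggml_init / ggml_free.
struct FakeContext { int id; };
static int g_live_contexts = 0;  // counts contexts that have not been freed

FakeContext * fake_init() {
    ++g_live_contexts;
    return new FakeContext{g_live_contexts};
}

void fake_free(FakeContext * ctx) {
    if (ctx != nullptr) {
        --g_live_contexts;
        delete ctx;
    }
}

// Deleter that releases the context whenever the owning pointer goes out of scope.
struct ContextDeleter {
    void operator()(FakeContext * ctx) const { fake_free(ctx); }
};
using ContextPtr = std::unique_ptr<FakeContext, ContextDeleter>;

// A stage that may return early; the context is released on every path.
int run_stage(bool fail_early) {
    ContextPtr ctx(fake_init());
    assert(ctx != nullptr);
    if (fail_early) {
        return -1;   // ctx is freed here as well
    }
    return ctx->id;  // ctx is freed after the return value is computed
}
```

After any number of run_stage calls, g_live_contexts returns to zero, which is the property the bench loop needs.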
- chatterbox.cpp/CMakeLists.txt:202–298 rebuilds every supertonic_*.cpp once per test executable, even though tts-cpp already statically links them, which makes compile times long. Consider linking the tests against tts-cpp (or a small shared tts-cpp::supertonic_test object library) instead.
- chatterbox.cpp/src/supertonic_engine.cpp:122–126 blocks non-f32 GGUFs even though the converter accepts --ftype f16/q8_0. The error message correctly says "use f16/q8_0 only with the GGML graph backend once enabled", but the production path is now the GGML graph backend. Please help add support for other quantization formats as well.
- chatterbox.cpp/scripts/dump-supertonic-reference.py:42–46 defaults --steps 5 while the converter at convert-supertonic2-to-gguf.py:308 writes default_steps = 10 for supertonic2. The bench/CLI examples in the README all pass --steps 5 explicitly, which is fine, but a user who only follows bash scripts/setup-supertonic2.sh and then supertonic-cli … --steps 0 (i.e. defaults from GGUF) will run the 10-step path and not match the 5-step reference dumps. Document this, or align the defaults.
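The mismatch described in this item comes down to how a step count of 0 is resolved. A minimal sketch of that rule, assuming (as the comment states) that 0 means "use the default stored in the GGUF"; resolve_steps is a hypothetical helper, not a function from the repository:

```cpp
#include <cassert>

// Resolve the number of denoise steps.
// cli_steps == 0 means "use the default stored in the GGUF metadata".
int resolve_steps(int cli_steps, int gguf_default_steps) {
    assert(cli_steps >= 0 && gguf_default_steps > 0);
    return cli_steps == 0 ? gguf_default_steps : cli_steps;
}
```

With the values quoted above, resolve_steps(0, 10) gives 10 while the reference dumps were produced with 5, so parity checks only line up when --steps 5 is passed explicitly.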
Merge latest main, remove the extra workflow, harden Supertonic caches, clean up local GGML contexts, reduce Supertonic test rebuilds, and allow f16/q8_0 GGUF storage via load-time F32 expansion. Co-authored-by: Cursor <cursoragent@cursor.com>
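"Load-time F32 expansion" means the quantized weights are converted back to float once, when the model is loaded. The sketch below illustrates the idea for a q8_0-style layout (blocks of 32 signed 8-bit values sharing one scale). It is a simplified illustration: BlockQ8 is a hypothetical struct that stores the scale as a float, whereas ggml stores it as fp16, and the loader in this PR may work differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // q8_0 groups 32 values per block

// Hypothetical block layout; the scale is kept as float for simplicity.
struct BlockQ8 {
    float scale;
    std::int8_t q[kBlockSize];
};

// Quantize floats into blocks: scale = max|x| / 127, q = round(x / scale).
std::vector<BlockQ8> quantize_q8(const std::vector<float> & values) {
    assert(values.size() % kBlockSize == 0);
    std::vector<BlockQ8> blocks(values.size() / kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const float * x = values.data() + b * kBlockSize;
        float amax = 0.0f;
        for (int i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(x[i]));
        }
        const float scale = amax / 127.0f;
        blocks[b].scale = scale;
        for (int i = 0; i < kBlockSize; ++i) {
            blocks[b].q[i] = scale > 0.0f
                ? static_cast<std::int8_t>(std::lround(x[i] / scale))
                : static_cast<std::int8_t>(0);
        }
    }
    return blocks;
}

// Load-time expansion: every stored value becomes scale * q as a plain float.
std::vector<float> expand_to_f32(const std::vector<BlockQ8> & blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8 & block : blocks) {
        for (int i = 0; i < kBlockSize; ++i) {
            out.push_back(block.scale * static_cast<float>(block.q[i]));
        }
    }
    return out;
}
```

Expanding at load time keeps the compute graphs in F32, so it saves disk space but not runtime memory or compute.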
…etry/scratch)

Five targeted fixes surfaced by review of the multilingual_merged tip after the origin/main merge. Three are real bugs (CFG, top_k, engine crash on MTL GGUFs); one is a perf regression with audible behaviour on MTL (spurious T3 retries); one is a defensive cleanup.

1. src/chatterbox_tts.cpp (CFM step loop): the use_b2 branch correctly computes (1+cfg)*cond - cfg*uncond, but the else branch only computed the conditional pass and silently dropped CFG on every non-Metal backend. Restores the §3.19 (3f0a8da) behaviour: when !meanflow && cfg_rate != 0 and use_b2 is false (CPU and any GPU backend where the b2 path was disabled), run cond + uncond back-to-back on the same B=1 graph (cfm_estimator_cache key (T, b2=false) reuses the cached graph across both calls) and combine via the standard CFG mix. Smoke-tested on CPU (--n-gpu-layers 0): runs cleanly, S3Gen wall-clock doubles vs meanflow as expected (12 CFM steps × 2 forward calls).

2. src/t3_mtl.cpp::sample_next_token_mtl top-k filter: after nth_element(begin, begin+k, end, greater) the (k+1)-th largest sits at idx[k] and positions [0, k) hold the top-k UNORDERED. The previous code took cut = l[idx[k-1]] which is some arbitrary top-k element (often not the smallest), making cut too large and the `x < cut` filter then erased legitimate top-k logits. Fix: partition to begin+(k-1) so idx[k-1] is the k-th largest exactly. Mostly masked by the default top_k=1000 vs an 8194-vocab where the threshold falls into the noise floor; the bug bites at small top_k (e.g. greedy --top-k 1 where the wrong cut could pessimise tie handling). The Turbo sample_next_token_ex in src/main.cpp uses a different (correct) approach via tmp[k] + per-element rescan for ties; left untouched.

3. src/chatterbox_engine.cpp: load_model_gguf dispatches MTL GGUFs into load_model_gguf_mtl (populates layers_mtl, leaves layers empty), but synthesize() unconditionally calls eval_prompt -> build_prompt_graph -> build_transformer_core, which iterates model.layers[il] -- empty std::vector, UB or crash. Add a clean rejection guard right after the load: if model.hparams.variant != CHBX_VARIANT_TURBO, free_model() and throw a clear error pointing the user at the CLI / internal eval_*_mtl helpers. Wiring MTL through the public Engine API (extend EngineOptions with language / cfg_weight / min_p / exaggeration, branch synthesize() on variant) is left as a follow-up; this just stops the crash on the public surface.

4. src/chatterbox_cli.cpp::run_t3_for_segment retry trigger: the 0cad44d merge commit said the 5x speech-tokens-per-BPE-token floor (calibrated for English Turbo / GPT-2 BPE) should be gated to non-MTL because MTL's Llama tokenizer has a ~1.7x ratio. The gating wasn't actually in the code -- a clean stop-token termination on a short MTL segment looked "implausible" and triggered up to 3 spurious retries. `plausible = is_mtl || (int)generated.size() >= min_tokens;` restores the intent. The 3x-repeated-token early-stop above still guards MTL's catastrophic case. Measured on M4 Metal with the ES reference prompt + jfk/gianni voice: T3 wall time drops from ~3.9 s (4 attempts) to ~0.93 s (1 attempt) -- ~4x speedup just from removing the wasted retries. WAV md5 stays byte-exact at 57cc80f27a122f03435fd05f47d1b3d2.

5. src/t3_mtl.cpp stacked-QKV loader scratch sizing: the early type-equality guard implies wq/wk/wv have identical sizes today, but max over all three so a future shape divergence (e.g. an MTL variant with non-square Q/K/V) can't silently truncate a per-layer copy via undersized scratch. No behaviour change today; defensive only.

Validation (Apple M4, Metal, Release):
- cmake --build: clean, no warnings, all targets link.
- test-metal-ops: 14/14 PASS, 0 FAIL.
- End-to-end synthesis (ES prompt, gianni.wav, --seed 42, greedy): md5 57cc80f27a122f03435fd05f47d1b3d2 -- byte-exact vs the pre-fix baseline. T3 wall time ~3.9s -> ~0.9s (fix #4).
- CPU CFG smoke test (--n-gpu-layers 0, --text "Hola.", es): completes cleanly, S3Gen ~12s for 12 CFM steps × 2 forward calls (cond + uncond), produces valid 1.1s WAV.

Issues #5 (redundant peek+open in load_model_gguf), #7 (/g deny-list breadth in requantize-gguf.py), #8 (forward-hook idiom in dump-t3-mtl-reference.py), and #9 (CMake duplicate cli_main.cpp build) are tracked but intentionally not folded in here -- the reviewer flagged them as cosmetic / trivial / fine.

Co-authored-by: Cursor <cursoragent@cursor.com>
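The top-k fix in item 2 of this commit message hinges on where std::nth_element places its pivot: partitioning at begin + (k - 1) guarantees that position holds exactly the k-th largest element. A minimal sketch of that idea, with illustrative function names that are not the ones in src/t3_mtl.cpp:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Returns the k-th largest logit (k is 1-based).
float kth_largest(const std::vector<float> & logits, int k) {
    assert(k >= 1 && static_cast<std::size_t>(k) <= logits.size());
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Partition at position k-1 in descending order: idx[k-1] then refers to
    // exactly the k-th largest value; the positions before it hold the larger
    // ones in unspecified order.
    std::nth_element(idx.begin(), idx.begin() + (k - 1), idx.end(),
                     [&](int a, int b) { return logits[a] > logits[b]; });
    return logits[idx[k - 1]];
}

// Mask everything strictly below the k-th largest logit.
// Ties with the cut value are kept, so more than k entries can survive.
void apply_top_k(std::vector<float> & logits, int k) {
    const float cut = kth_largest(logits, k);
    for (float & x : logits) {
        if (x < cut) {
            x = -std::numeric_limits<float>::infinity();
        }
    }
}
```

With logits {0.1, 2.0, -1.0, 3.0, 0.5} and k = 2, the cut is 2.0, so only 2.0 and 3.0 survive. Partitioning at begin + k and then reading idx[k-1] would pick an arbitrary member of the top k, which is the bug described above.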
Addressed the review feedback in
Validation run locally:
Detailed follow-up on each requested item, all addressed in
Local validation:
… + scaffolding caches (multilingual Vulkan)

Targets the per-synth host-CPU overhead that round 1 / round-HIFT didn't address, on top of upstream/multilingual_merged (now in main via PR #7).

Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo locks the pre-change MD5 baseline, then re-verifies after every cache. All 3 invariants (multilingual single-shot, multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact.

Seven new caches
----------------

All host-side, model-agnostic, no GGUF-format change, no public-API change. Same teardown discipline as the existing g_cfm_estimator_cache (destroy() before ggml_backend_free). Sit alongside the existing round-1 caches.

- g_encoder_graph_cache (keyed on T): full run_encoder graph + gallocator. Streaming chunks of varying length still produce correct output (rebuilds on key change).
- g_hift_graph_cache (keyed on pack(T_mel, T_stft)) + g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator. Parallel (graph-input-name, source-tensor-ptr) metadata lets cache hits re-feed each alpha-input slot from g_inv_alpha_results without rebuilding the graph.
- g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph + gallocator.
- cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)): compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired twice per encoder run (T and 2T). Multilingual T~350+ at D=512 is a real wedge of per-synth host time.
- cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*): HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6 alpha tensors); each is a tensor_get + per-element reciprocal. Alpha tensors are constant for the model lifetime.
- cached_hann_window / cached_istft_kernel (g_hann_window_cache / g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft (constant 16 in the chatterbox HiFT path).
- cached_window_sum (g_window_sum_cache, keyed on pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across same-shape synth calls.

A new graph_cache struct (used by encoder / HiFT / F0) and a pack_hift_key helper centralise the explicit destroy()-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition. The destroy path is unified into a renamed s3gen_release_synth_caches() (replaces the old g_cfm_estimator_cache_destroy()), called from s3gen_model_cache_release, the cache-miss backend-swap path, and the explicit s3gen_unload().

Negative result documented (bug caught and fixed during dev)
------------------------------------------------------------

First implementation of the HiFT cache hung indefinitely on the very first synth call. Root cause: the alpha-input refresh loop held g_synth_caches_mu while calling cached_inv_alpha, which itself takes the same mutex internally: a classic re-entrant deadlock. Fix: snapshot g_hift_inv_alpha_entries under the mutex into a local vector, then iterate without the lock (cached_inv_alpha re-acquires the mutex per call but with no nesting). General rule kept as an inline comment: never hold a cache-state mutex while calling any other cached_* helper.
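The deadlock and its fix can be shown in miniature. The following sketch uses hypothetical stand-ins (g_mu, g_results, g_entries, cached_inverse, refresh_all) for the globals and helpers named in the commit message; it illustrates the snapshot-then-iterate rule and is not the code from chatterbox_tts.cpp.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <vector>

static std::mutex g_mu;                      // stand-in for g_synth_caches_mu
static std::map<int, float> g_results;       // stand-in for g_inv_alpha_results
static std::vector<int> g_entries = {2, 4};  // stand-in for g_hift_inv_alpha_entries

// Cached helper that takes the mutex internally, like cached_inv_alpha.
float cached_inverse(int alpha) {
    assert(alpha != 0);
    std::lock_guard<std::mutex> lock(g_mu);
    auto it = g_results.find(alpha);
    if (it == g_results.end()) {
        it = g_results.emplace(alpha, 1.0f / static_cast<float>(alpha)).first;
    }
    return it->second;
}

// Refresh loop: copy the entry list under the lock, release the lock, then
// call the cached helper. Calling cached_inverse while still holding g_mu
// would lock a non-recursive std::mutex twice from the same thread, which
// is undefined behaviour and in practice hangs.
std::vector<float> refresh_all() {
    std::vector<int> snapshot;
    {
        std::lock_guard<std::mutex> lock(g_mu);
        snapshot = g_entries;
    }  // lock released before any cached_* call
    std::vector<float> out;
    out.reserve(snapshot.size());
    for (int alpha : snapshot) {
        out.push_back(cached_inverse(alpha));
    }
    return out;
}
```

The snapshot costs one vector copy per refresh, which is small next to the graph work it guards. A std::recursive_mutex would also avoid the hang, but it would hide the nesting instead of removing it.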
Performance - RTX 5090, multilingual auto-split, warm-state seg 2..6
--------------------------------------------------------------------

Within-process win on top of round 1 + round-HIFT:

  metric      | pre-round-2 | post-round-2 | Δ
  S3GEN_INFER | 159.8 ms    | 140.8 ms     | -19.0 ms (-11.9 %)
  cfm_total   | 122.2 ms    | 118.7 ms     | -3.5 ms (-2.9 %)
  cfm_step0   | 13.24 ms    | 13.18 ms     | noise (already cached round 1)
  hift_total  | 17.96 ms    | 16.30 ms     | -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline (round 1 + round-HIFT + round 2):

  metric      | upstream/mtl_merged | this PR (full) | Δ
  S3GEN_INFER | 169.9 ms            | 140.8 ms       | -29.1 ms (-17.1 %)
  cfm_total   | 132.5 ms            | 118.7 ms       | -13.8 ms (-10.4 %)
  cfm_step0   | 24.1 ms             | 13.2 ms        | -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is the actual GPU CFM compute, which is not host-cacheable; it would need shader-side optimisation (e.g. tensor-core engagement via cooperative_matrix2; deferred, see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------

Locked invariants pass byte-for-byte vs the pre-change baseline:

  Multilingual single-shot     c65d98f15a59b8fe9cad98e46eb3fb30 ✓
  Multilingual 6-segment multi 0b374c7474895a3387b9f1df10b3c1b8 ✓
  Turbo single-shot            6219f4338b1b4fb9dc60481216153b49 ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48 + Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac monorepo is the test-first harness.

Files
-----

  src/chatterbox_tts.cpp  +373 / -79 (net diff vs round-1 head)
  PROGRESS.md             §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache infrastructure: graph_cache struct, seven new globals, the s3gen_release_synth_caches lifecycle hook, the five cached_* scaffolding helpers, and the build_graph / cache-hit branches in run_encoder / run_hift_decode / run_f0_predictor.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mirrors the parakeet-cpp port README layout so a downstream consumer
can answer 'what does this library do, how do I link it, and which
CMake knobs do I need to know about?' from the top of the README
without scrolling through the 1300-line standalone development walk-
through. No content removed; existing standalone material stays
verbatim, just shifted down by ~80 lines.
Adds three new blocks near the top:
- ## API overview (between the benchmark tables and 'Pipeline at a
glance'). Two-row table for the high-level entry points exported
through TTS_CPP_API:
* tts_cpp::chatterbox::Engine::synthesize - Chatterbox T3+S3Gen+HiFT
* tts_cpp::supertonic::synthesize - Supertonic CPU TTS
Trailing paragraph mentions the lower-level helpers
(s3gen_synthesize_to_wav / s3gen_preload / s3gen_unload /
tts_cpp_cli_main), points at <tts-cpp/export.h>, and explicitly
flags that detail-namespaced symbols (used by the supertonic /
chatterbox test harnesses) are not part of the public API and are
hidden in SHARED builds.
- ### Consumer integration (subsection of API overview). Calls out
that the qvac speech-stack qvac-ext-lib-whisper.cpp wrapper port
consumes ggml from the qvac-ext-ggml/speech branch directly
(Metal / OpenCL / Vulkan patches included) and does NOT ship
scripts/setup-ggml.sh or patches/ - those are standalone-dev tools
maintained in this repo only. Provides the
find_package(tts-cpp CONFIG REQUIRED) +
target_link_libraries(... tts-cpp::tts-cpp) + 8-line
Engine::synthesize C++ snippet that's the entire consumer-side
integration.
- ### Useful CMake options (inside section 1, between the GPU backend
paragraph and the binaries table). Full table of the project-
namespaced flags:
TTS_CPP_BUILD_LIBRARY, TTS_CPP_BUILD_SHARED (new from items 7+8),
TTS_CPP_BUILD_EXECUTABLES, TTS_CPP_BUILD_TESTS, TTS_CPP_INSTALL,
TTS_CPP_USE_SYSTEM_GGML, TTS_CPP_GGML_LIB_PREFIX, TTS_CPP_CCACHE
(new from items 7+8).
Plus a secondary table for the ctest-fixture cache paths
(TTS_CPP_TEST_{MODEL,AUDIO,REF}_DIR) and a one-liner explaining the
REQUIRES auto-disable behaviour from item 7.
Touches existing prose in two places:
- The setup-ggml.sh paragraph in section 1 gets a one-paragraph
follow-up clarifying it (and patches/) are standalone-development
tools only, with a back-link to the Consumer integration section
(item 9: 'document setup-ggml.sh inertness' folded into this
framing rather than landed as a separate doc-only commit). Also
strengthens the existing 'Re-running is safe' line to 'idempotent
and destructive' so a dev hacking on ./ggml is warned before
losing local edits.
- The ### Alternative: consume ggml from vcpkg subsection now opens
with one sentence positioning it as the CMake-mechanic detail
behind the Consumer integration story, with a forward link to the
qvac-ext-ggml/speech branch.
Also updates the binaries table in section 1 to list the missing
PR #6 + PR #7 binaries that landed since the README was last
refreshed: supertonic-cli, supertonic-bench, test-cpu-caches,
test-t3-caches, and the test-supertonic-* family. Trailing paragraph
notes that test-* binaries register with CTest so
`ctest -C Release -L unit` / `ctest -C Release -L fixture` works
out of the build directory.
No code changes, no CMake changes, no install behaviour changes.
README.md +128 / -10 lines.
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
- Supertone/supertonic for stable English, no language wrapping.
- Supertone/supertonic-2 for multilingual, using clean <lang>text</lang> open/close wrapping.
Audio quality finding
- supertonic-2 prefix-only wrapper: <en>text.
- Supertone/supertonic and sounds clean.
- supertonic-2 also sounds clean when using open/close tags: <en>text</en>.
Current GGML status
- ggml_flash_attn_ext.
- --threads controls GGML CPU threading; BLAS worker threads are capped by default to avoid nested oversubscription.
Validation
- cmake --build build --target supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
- ./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
- 3.431e-05, RMS 2.086e-06.
- ./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick
- artifacts/supertonic-thread-matrix/, runs=3, warmup=1, F1 voice, 5 steps, speed 1.05, ONNX Runtime CPUExecutionProvider only.
Final benchmark findings
GGML now wins 10 of 12 matched thread/prompt comparisons. The only losses in the final run are quick English 4t and long English 4t, both close.
4-thread stage medians show the remaining gap is now narrow and stage-specific:
Remaining performance notes
Made with Cursor