Add optimized Supertonic GGML CPU path #7

Merged
GustavoA1604 merged 34 commits into GustavoA1604:main from ogad-tether:multilingual_merged
May 6, 2026

@ogad-tether

Summary

  • Add a Supertonic-specific ONNX/assets-to-GGUF converter, ONNX reference dumper, CPU runtime, CLI, benchmark tooling, and staged parity harnesses.
  • Route the production Supertonic 2 path through GGML-backed duration, text encoder, vector estimator, and vocoder stages.
  • Harden performance with cached GGML graphs, strided Q/K/V views, fused vector group/tail graph boundaries, portable custom CPU kernels, and controlled GGML/BLAS threading.
  • Support both upstream bundles:
    • Supertone/supertonic for stable English, no language wrapping.
    • Supertone/supertonic-2 for multilingual, using clean <lang>text</lang> open/close wrapping.

Audio quality finding

  • The English stutter was not caused by the GGUF/C++ port.
  • The bad path was the old supertonic-2 prefix-only wrapper: <en>text .
  • Stable English uses no wrapping with Supertone/supertonic and sounds clean.
  • supertonic-2 also sounds clean when using open/close tags: <en>text</en>.
  • Local listening validation passed for the generated English, French, and Portuguese sample sets.
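
The three wrapping behaviours compared above can be sketched as a small helper. This is an illustrative sketch, not the code in this PR: the `LanguageWrapMode` enum and `wrap_text` function are hypothetical names, and the exact whitespace emitted by the old prefix-only wrapper is not reproduced.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum mirroring the three behaviours discussed in this PR.
enum class LanguageWrapMode { None, Prefix, OpenClose };

// Wrap `text` for the model according to `mode`; `lang` is a tag such as "en".
std::string wrap_text(const std::string& text, const std::string& lang,
                      LanguageWrapMode mode) {
    if (mode == LanguageWrapMode::None) {
        return text;                       // stable English bundle: no tags
    }
    assert(!lang.empty());                 // tagged modes need a language code
    const std::string open = "<" + lang + ">";
    if (mode == LanguageWrapMode::Prefix) {
        return open + text;                // old supertonic-2 path (stuttered)
    }
    return open + text + "</" + lang + ">"; // clean open/close path
}
```

With this sketch, `wrap_text("Hello.", "en", LanguageWrapMode::OpenClose)` yields `<en>Hello.</en>`, the form that sounded clean with supertonic-2.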

Current GGML status

  • All four Supertonic stages are GGML-backed in the production path.
  • Text encoder FFN blocks and relative-position attention use cached GGML graphs; speech-prompted text attention uses ggml_flash_attn_ext.
  • Vector attention uses strided Q/K/V views and persistent graph/allocr caches for attention, ConvNeXt group, and tail islands.
  • Vector runtime includes fused ConvNeXt group/tail boundaries, gated production trace outputs, BLAS-backed pointwise Conv1D, custom depthwise Conv1D, direct row-wise layer norm, direct dense time matmul, and fused bias/GELU/residual elementwise ops.
  • Vocoder uses a persistent GGML graph cache plus BLAS/Accelerate-backed causal Conv1D custom ops for hot projection paths.
  • --threads controls GGML CPU threading; BLAS worker threads are capped by default to avoid nested oversubscription.

Validation

  • cmake --build build --target supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
  • ./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
    • PASS, waveform max abs 3.431e-05, RMS 2.086e-06.
  • ./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick
    • PASS, BLAS/custom pointwise rows match reference at float tolerance.
  • Final benchmark matrix: artifacts/supertonic-thread-matrix/, runs=3, warmup=1, F1 voice, 5 steps, speed 1.05, ONNX Runtime CPUExecutionProvider only.
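
For readers unfamiliar with the two parity numbers reported above, here is a minimal sketch of how a max-abs and RMS comparison between a reference waveform and a generated one could be computed. It assumes the RMS is taken over the sample-wise difference; the struct and function names are hypothetical and are not taken from the test harness.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ParityStats {
    double max_abs;  // largest absolute sample difference
    double rms;      // root-mean-square of the sample differences
};

// Compare two equally sized waveforms sample by sample.
ParityStats compare_waveforms(const std::vector<float>& ref,
                              const std::vector<float>& got) {
    assert(!ref.empty() && ref.size() == got.size());
    double max_abs = 0.0;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double diff = static_cast<double>(got[i]) - static_cast<double>(ref[i]);
        max_abs = std::max(max_abs, std::fabs(diff));
        sum_sq += diff * diff;
    }
    return {max_abs, std::sqrt(sum_sq / static_cast<double>(ref.size()))};
}
```

A harness built this way would pass when both values stay below a chosen tolerance; the tolerance used by test-supertonic-pipeline is not stated here.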

Final benchmark findings

GGML now wins 10 of 12 matched thread/prompt comparisons. The only losses in the final run are quick English 4t and longer English 4t, both close.

Prompt           | GGML 1t  | GGML 2t  | GGML 3t  | GGML 4t  | ONNX 1t   | ONNX 2t  | ONNX 3t  | ONNX 4t
quick English    | 298.0 ms | 189.4 ms | 157.7 ms | 157.7 ms | 373.8 ms  | 218.5 ms | 168.3 ms | 148.8 ms
longer English   | 757.5 ms | 491.2 ms | 390.3 ms | 361.2 ms | 1103.0 ms | 580.6 ms | 555.7 ms | 351.5 ms
Portuguese smoke | 457.2 ms | 292.9 ms | 251.0 ms | 234.3 ms | 610.6 ms  | 344.6 ms | 268.3 ms | 250.8 ms

4-thread stage medians show the remaining gap is now narrow and stage-specific:

Prompt           | Runtime | Duration | Text    | Vector   | Vocoder  | Total
quick English    | GGML    | 3.9 ms   | 13.5 ms | 96.3 ms  | 43.6 ms  | 157.7 ms
quick English    | ONNX    | 1.5 ms   | 11.5 ms | 85.9 ms  | 49.8 ms  | 148.8 ms
longer English   | GGML    | 11.9 ms  | 33.3 ms | 201.2 ms | 115.1 ms | 361.2 ms
longer English   | ONNX    | 2.4 ms   | 13.1 ms | 198.3 ms | 138.8 ms | 351.5 ms
Portuguese smoke | GGML    | 6.5 ms   | 20.8 ms | 137.6 ms | 68.9 ms  | 234.3 ms
Portuguese smoke | ONNX    | 1.7 ms   | 11.6 ms | 141.7 ms | 95.6 ms  | 250.8 ms

Remaining performance notes

  • Single-thread GGML now beats ONNX across all final prompts.
  • GGML vocoder wins the final 4-thread stage comparison on all three prompts.
  • Vector is close but still the main swing stage at higher thread counts, with some 3/4-thread variance remaining.
  • Text/duration explain most of the remaining 4-thread English losses; text is especially visible on longer prompts.

Made with Cursor

ogad-tether and others added 30 commits May 1, 2026 12:39
…rity

The vector_estimator step-0 was failing parity (max_abs=2.31e-1) because
apply_rope() recomputed the RoPE frequency table from the standard formula
theta[d] = 10000^(-d/(D/2)).  The actual ONNX-baked values are 10x larger
(theta[d] = 10000^((8-d)/32)), so my analytic theta over-rotated keys/queries
in every text cross-attention block.  Read theta directly from the GGUF
tensor main_blocks.3.attn.theta (it is shared across all four RoPE blocks).
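
The factor of 10 follows from the exponents: 10000^((8-d)/32) = 10000^(8/32) * 10000^(-d/32), and 10000^(1/4) = 10. The sketch below checks that numerically. It assumes D/2 = 32, which is inferred from the denominator in the baked formula rather than stated outright, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Standard analytic table: theta[d] = 10000^(-d / half_dim).
std::vector<double> analytic_theta(int half_dim) {
    assert(half_dim > 0);
    std::vector<double> theta(half_dim);
    for (int d = 0; d < half_dim; ++d) {
        theta[d] = std::pow(10000.0, -static_cast<double>(d) / half_dim);
    }
    return theta;
}

// Table described in the commit message: theta[d] = 10000^((8 - d) / 32).
std::vector<double> baked_theta(int count) {
    assert(count > 0);
    std::vector<double> theta(count);
    for (int d = 0; d < count; ++d) {
        theta[d] = std::pow(10000.0, (8.0 - d) / 32.0);
    }
    return theta;
}
```

Dividing `baked_theta(32)[d]` by `analytic_theta(32)[d]` gives approximately 10 for every `d`, which is why reading the table from the GGUF tensor restores parity where the analytic formula could not.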

After the fix all five stages pass:
  preprocess:    49 tokens, exact
  duration:      abs=0
  text_encoder:  max_abs=4.8e-6
  vector step0:  max_abs=1.4e-6  (was 2.3e-1)
  vocoder:       max_abs=1.1e-5

Add test-supertonic-pipeline that chains text_encoder -> 5-step denoise ->
vocoder against wav_full.npy.  End-to-end parity is max_abs=6.5e-5 in
float, ~7.4e-5 after 16-bit PCM round-trip.

Add EngineOptions.noise_npy_path / supertonic-cli --noise-npy so users
can reproduce the ONNX reference run bit-exactly without depending on
NumPy's RNG sequence.

Made-with: Cursor
Adds two matched benchmark harnesses that report per-stage wall time
(preprocess / duration / text_encoder / N denoise steps / vocoder) plus
end-to-end RTF, with min/median/mean/p95/max across N runs after a
configurable warmup:

  build/supertonic-bench           - times the C++ GGML CPU path
  scripts/bench-supertonic-onnx.py - times the ONNX Runtime path

Both accept --noise-npy so the runs are deterministic and produce
identical audio for direct comparison.
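
A minimal sketch of the per-stage summary described above (min/median/mean/p95/max over the timed runs) follows. The names are hypothetical, and the p95 uses the nearest-rank definition as an assumption; the harness may define the percentile differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Summary {
    double min, median, mean, p95, max;
};

// Summarize wall times (in ms) collected after warmup runs were discarded.
Summary summarize(std::vector<double> ms) {
    assert(!ms.empty());
    std::sort(ms.begin(), ms.end());
    const std::size_t n = ms.size();
    const double median =
        (n % 2 == 1) ? ms[n / 2] : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);
    const double mean = std::accumulate(ms.begin(), ms.end(), 0.0) / static_cast<double>(n);
    // Nearest-rank percentile: rank = ceil(0.95 * n), computed in integers.
    const std::size_t rank = (95 * n + 99) / 100;
    return {ms.front(), median, mean, ms[rank - 1], ms.back()};
}
```

With only a handful of runs (for example runs=3), the p95 from this definition coincides with the maximum, so the median is the more informative figure for small samples.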

Headline numbers on Apple M2 (8 cores), 4.11s of audio:
  ONNX (CPUExecutionProvider): 180 ms total, RTF 0.044, 22.8x realtime
  C++ GGML (single thread):    14451 ms total, RTF 3.52, 0.28x realtime

Output matches the reference within float tolerance (max_abs 6.5e-5 in float, 7.4e-5 after PCM
round-trip).  The 80x perf gap is entirely from the C++ port being
single-threaded scalar today (no SIMD, no BLAS, no quant) - it is
designed for correctness, not throughput.  See
artifacts/supertonic-bench.md for the full breakdown and proposed
follow-ups.
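
The RTF figures above are the synthesis wall time divided by the duration of the generated audio, and the "x realtime" figure is the reciprocal. A small sketch of that arithmetic, with hypothetical function names:

```cpp
#include <cassert>

// Real-time factor: wall time spent per unit of audio produced.
double real_time_factor(double total_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return total_ms / audio_ms;
}

// How many times faster than realtime the synthesis ran.
double realtime_multiple(double total_ms, double audio_ms) {
    assert(total_ms > 0.0);
    return audio_ms / total_ms;
}
```

For the 4.11 s clip, 180 ms gives an RTF of about 0.044 (about 22.8x realtime) and 14451 ms gives about 3.52 (about 0.28x realtime), matching the headline numbers.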

Made-with: Cursor
Support the English-only Supertone/supertonic bundle alongside the multilingual supertonic-2 bundle by storing model-family metadata, the default voice, and whether text should be wrapped in language tags. English now uses the stable no-wrap path, while the existing multilingual fixtures continue to use <lang> wrapping.

This matches the latest QVAC behavior that avoids English stuttering on the quick-brown-fox prompt, while preserving parity for the multilingual supertonic-2 flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
Latest QVAC Supertonic behavior shows that supertonic-2 English stutters with the old prefix-only wrapper (<en>text ) but is clean with open/close tags (<en>text</en>). Add explicit language_wrap_mode metadata with none, prefix, and open_close modes so stable English supertonic keeps no wrapping while supertonic-2 defaults to the clean open/close path.

Regenerated local supertonic-2 references with open_close wrapping and validated preprocessing, duration, text encoder, vector step, vocoder, and full pipeline parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Align the Supertonic runtime with the existing Chatterbox GGML/GGUF patterns: model metadata now carries defaults and ftype, backend initialization supports the same optional GGML backends, and the converter can emit f32/f16/q8_0 GGUFs for future graph backends.

Port the vocoder to a real GGML graph path and validate it against the scalar reference. Add trace harnesses for vocoder and vector-estimator graph boundaries so remaining stages can be ported incrementally without losing parity. The vocoder graph now matches the ONNX reference at max_abs ~1.6e-6, and the vector trace is green through projection, mask, first ConvNeXt group, time add, and the following ConvNeXt block.

Add a portable Supertonic CPU build workflow for Linux, macOS, and Windows using the existing CMake/GGML switches.

Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the GGML vector-estimator trace beyond projection, mask, ConvNeXt, and time-add into the first text-attention block. The trace now validates Q/K/V projections, CPU-applied RoPE tensors, flash-attention context, and the attention output projection against the scalar reference.

This establishes a clean parity boundary before the next issue: the residual add after attention currently needs separate layout debugging even though both operands compare correctly on their own.

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the unused residual/norm graph nodes from the vector trace harness. The current green boundary is attention output projection; residual add remains the next focused layout issue.

Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the vector-estimator GGML trace beyond text-attention output projection by isolating the residual add, post-attention norm, and following ConvNeXt block in a small continuation graph. This avoids buffer/lifetime ambiguity from the multi-pass attention trace and keeps each boundary parity-checkable.

The new trace checkpoints pass through attention residual, norm, and main_blocks.4.convnext.0 with max_abs around 1e-6.

Co-authored-by: Cursor <cursoragent@cursor.com>
Route the Supertonic engine through the GGML vector estimator and keep the scalar implementation as a parity baseline. The GGML vector step now covers the full estimator: all repeated groups, text/style attention, final ConvNeXt stack, proj_out, and Euler update.

Keep the detailed vector trace harness for diagnostics while making the production vector path skip scalar/intermediate trace emission and retain only the final next-latent output. Also switch Supertonic latent seeding to a NumPy-compatible RandomState sequence so default --seed output matches the clean ONNX reference noise.

Co-authored-by: Cursor <cursoragent@cursor.com>
Route the remaining Supertonic stages through GGML-backed execution while keeping parity trace harnesses available for debugging.

Co-authored-by: Cursor <cursoragent@cursor.com>
Wire thread-aware graph execution and trim trace overhead so benchmarks exercise the production GGML path more accurately.

Co-authored-by: Cursor <cursoragent@cursor.com>
Return duration and vector production outputs directly so trace tensors remain a debug-only transport.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use GGML flash attention for speech-prompted text attention and extend the text trace to cover final encoder output.

Co-authored-by: Cursor <cursoragent@cursor.com>
Emit structured stage and RTF metrics so GGML and ONNX benchmark runs can be compared consistently.

Co-authored-by: Cursor <cursoragent@cursor.com>
Support Supertonic 2 language wrapping modes and JSON metrics for matched GGML comparisons.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expand portable build coverage and document benchmark commands, thread policy, and remaining relative-attention work.

Co-authored-by: Cursor <cursoragent@cursor.com>
Express learned relative key and value attention terms with stock GGML ops so the text encoder no longer falls back to scalar attention.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse text and style host layout conversions across denoising steps while preserving vector trace parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse GGML graph allocations and static band masks for text relative-position attention by layer and sequence length.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add opt-in vector island profiling and avoid recomputing the front graph for the first attention pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid recomputing QKV graphs during text-attention flash passes and keep opt-in vector island profiling.

Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid rerunning style QKV projections during flash attention passes while preserving residual continuations for parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use shared runtime buffers for packed QKV attention layouts across vector islands.

Co-authored-by: Cursor <cursoragent@cursor.com>
Document the fully GGML-backed text path, vector profiling flag, and matched GGML versus ONNX benchmark baseline.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse split text-attention graph/allocation state across vector steps while preserving trace parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Let duration inference return its projection directly without allocating trace vectors in the hot path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse generic attention-only graph state for text and style vector islands while preserving trace parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse speech-prompted attention graph/allocation state across text encoder calls while preserving parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Harden the Supertonic production GGML path with cached graphs, portable custom CPU kernels, and benchmark documentation so the branch reflects the current ONNX-comparable performance work.

Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether and others added 3 commits May 5, 2026 17:37
Let the Supertonic converter download official Hugging Face bundles when local ONNX assets are not provided, and surface clear setup guidance when the local GGUF is missing.

Co-authored-by: Cursor <cursoragent@cursor.com>
Capture the Supertonic port history, parity findings, optimization wins and failures, final benchmark matrix, and remaining production work in a dedicated progress journal.

Co-authored-by: Cursor <cursoragent@cursor.com>
Autodetect Supertonic models from GGUF metadata and dispatch them through the Supertonic engine while preserving the existing Chatterbox routing.

Co-authored-by: Cursor <cursoragent@cursor.com>

@GustavoA1604 GustavoA1604 left a comment

Owner

Also could you help address these?

  1. I could make it work on Windows with this: 005779a. Could you pull the latest from main first, though, since that may already fix it?

  2. Stale tensor pointers in thread_local graph caches
    chatterbox.cpp/src/supertonic_text_encoder.cpp:508–512, 588–597, 797–809, and the parallel structures in supertonic_vector_estimator.cpp (vector_text_attention_cache, vector_group_graph_cache, vector_res_style_qkv_cache, vector_tail_graph_cache) and supertonic_vocoder.cpp:339–356 all keep thread_local caches keyed by &model (they rebuild only when cache.model != &model). The graphs they hold contain require_source_tensor(model, …) pointers that live in the model's ctx_w.
    synthesize in chatterbox.cpp/src/supertonic_engine.cpp:117 stack-allocates supertonic_model model; and calls free_supertonic_model(model) before returning. If a host calls synthesize() again from the same thread, the new stack-frame model is very likely to land at the same address, the cache key check passes, and the cached graph runs against ctx_w tensors that have been freed and replaced. This is visible only when integrators call synthesize() more than once per thread (i.e. any server use); the bench tool itself loads once and is unaffected.

Suggested fix (any one of):

  • Key caches by model.ctx_w plus a monotonically increasing model.generation_id rather than &model.
  • Expose a supertonic_invalidate_thread_caches() and call it from free_supertonic_model.
  • Keep the caches as members of supertonic_model so they get destroyed with the model.
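
The first option could look roughly like the sketch below. It is a simplified illustration under stated assumptions: `Model`, `GraphCache`, `next_generation`, and `ensure_graph` are hypothetical stand-ins, the real caches also hold GGML graphs and allocators, and here the key pairs the model address with the generation id rather than ctx_w.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for the model; only the field relevant to cache keying is shown.
struct Model {
    std::uint64_t generation_id = 0;
};

// Monotonic counter: every successful load gets a fresh, never-reused id.
inline std::uint64_t next_generation() {
    static std::atomic<std::uint64_t> counter{0};
    return ++counter;
}

struct GraphCache {
    const Model* model = nullptr;
    std::uint64_t generation = 0;
    int builds = 0;  // how many times the graph was (re)built
};

// Returns true when the cached graph had to be rebuilt for this model.
bool ensure_graph(GraphCache& cache, const Model& m) {
    assert(m.generation_id != 0);  // model must have been loaded
    if (cache.model == &m && cache.generation == m.generation_id) {
        return false;  // same address AND same load: cached graph is valid
    }
    cache.model = &m;
    cache.generation = m.generation_id;
    ++cache.builds;    // a real cache would free and rebuild the graph here
    return true;
}
```

Because the generation changes on every load, a new stack-allocated model that happens to reuse the old address no longer matches the cached key, so the stale graph is rebuilt instead of reused.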

  3. ggml_context leak per call
chatterbox.cpp/src/supertonic_text_encoder.cpp:901, :1021; supertonic_duration.cpp:521; many sites in supertonic_vector_estimator.cpp (g1_style_res_buf/g2_style_res_buf/g3_style_res_buf blocks, srgf block ~2189, etc.) all do:

ggml_init_params gp = { buf_size, buf.data(), true };
ggml_context * ctx = ggml_init(gp);
ggml_cgraph * gf = ggml_new_graph_custom(ctx, MAX_NODES, false);

ggml_gallocr_free(allocr);
return …;
with no ggml_free(ctx). With current ggml (chatterbox.cpp/ggml/src/ggml.c:1571) the context struct itself is GGML_MALLOC'd on every call; mem_buffer_owned=false so the buffer isn't freed, but the context struct is. Each step leaks a context struct. For per-CLI-invocation use it's negligible (few KB), but the bench harness loops total_runs × per-run stage compositions and accumulates many leaks. This is the easy fix: add ggml_free(ctx) at the end of every site that calls ggml_init with the local-buffer pattern.
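
One way to make the cleanup hard to forget, including on early-return failure paths, is to tie the context to a scope with a `std::unique_ptr` and a custom deleter. The sketch below uses stand-in `fake_ctx` / `fake_init` / `fake_free` functions so that it compiles without GGML; these are hypothetical, and in the real code the deleter would call `ggml_free(ctx)` on a context obtained from `ggml_init`.

```cpp
#include <cassert>
#include <memory>

// Stand-ins for ggml_context / ggml_init / ggml_free (hypothetical).
struct fake_ctx { int unused; };
static int g_live_contexts = 0;

fake_ctx* fake_init() {
    ++g_live_contexts;
    return new fake_ctx{0};
}

void fake_free(fake_ctx* ctx) {
    --g_live_contexts;
    delete ctx;
}

// Deleter so the context is released whenever the owner goes out of scope.
struct ctx_deleter {
    void operator()(fake_ctx* ctx) const { fake_free(ctx); }
};
using ctx_ptr = std::unique_ptr<fake_ctx, ctx_deleter>;

// A stage that may bail out early; the context is freed on every path.
bool run_stage(bool fail_early) {
    ctx_ptr ctx(fake_init());
    if (fail_early) {
        return false;  // e.g. reserve/alloc failure: ctx is still released
    }
    // ... build the graph in ctx.get(), compute, copy results out ...
    return true;
}
```

Adding explicit free calls at each site works equally well; the scoped owner mainly helps when a function has several exit points.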

  4. chatterbox.cpp/CMakeLists.txt:202–298 rebuilds every supertonic_*.cpp once per test executable, even though tts-cpp already statically links them, which lengthens compile times. Consider linking the tests against tts-cpp (or a small shared tts-cpp::supertonic_test object library) instead.

  5. chatterbox.cpp/src/supertonic_engine.cpp:122–126 blocks non-f32 GGUFs even though the converter accepts --ftype f16/q8_0. The error message correctly says "use f16/q8_0 only with the GGML graph backend once enabled", but the production path is now the GGML graph backend.

    Please help add support for other quantization formats as well.

  6. chatterbox.cpp/scripts/dump-supertonic-reference.py:42–46 defaults --steps 5 while the converter at convert-supertonic2-to-gguf.py:308 writes default_steps = 10 for supertonic2. The bench/CLI examples in the README all pass --steps 5 explicitly, which is fine, but a user who only follows bash scripts/setup-supertonic2.sh and then supertonic-cli … --steps 0 (i.e. defaults from GGUF) will run the 10-step path and not match the 5-step reference dumps. Document this, or align the defaults.

Comment thread .github/workflows/supertonic-portable-build.yml Outdated
Merge latest main, remove the extra workflow, harden Supertonic caches, clean up local GGML contexts, reduce Supertonic test rebuilds, and allow f16/q8_0 GGUF storage via load-time F32 expansion.
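
To illustrate what load-time F32 expansion means for q8_0 storage: GGML's q8_0 format groups weights into blocks of 32 signed 8-bit values that share one scale, and each weight is recovered as scale times quantized value. The sketch below is a simplified illustration, not the loader from this PR: the struct and function names are hypothetical, and the scale is held as a `float` here, whereas GGML stores it as fp16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // q8_0 groups 32 weights per block

// Simplified block layout; GGML keeps the scale as fp16 rather than float.
struct BlockQ8 {
    float scale;
    std::int8_t qs[kBlockSize];
};

// Expand quantized blocks into a flat F32 buffer: w = scale * q.
std::vector<float> expand_to_f32(const std::vector<BlockQ8>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8& block : blocks) {
        for (int i = 0; i < kBlockSize; ++i) {
            out.push_back(block.scale * static_cast<float>(block.qs[i]));
        }
    }
    assert(out.size() == blocks.size() * static_cast<std::size_t>(kBlockSize));
    return out;
}
```

Expanding once at load keeps the file small on disk while letting the existing F32 graph and custom-kernel path run unchanged; the trade-off is that runtime memory use stays at F32 size.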

Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether pushed a commit to ogad-tether/chatterbox.cpp that referenced this pull request May 6, 2026
…etry/scratch)

Five targeted fixes surfaced by review of the multilingual_merged tip
after the origin/main merge.  Three are real bugs (CFG, top_k, engine
crash on MTL GGUFs); one is a perf regression with audible behaviour
on MTL (spurious T3 retries); one is a defensive cleanup.

1. src/chatterbox_tts.cpp (CFM step loop): the use_b2 branch correctly
   computes (1+cfg)*cond - cfg*uncond, but the else branch only
   computed the conditional pass and silently dropped CFG on every
   non-Metal backend.  Restores the §3.19 (3f0a8da) behaviour: when
   !meanflow && cfg_rate != 0 and use_b2 is false (CPU and any GPU
   backend where the b2 path was disabled), run cond + uncond
   back-to-back on the same B=1 graph (cfm_estimator_cache key
   (T, b2=false) reuses the cached graph across both calls) and
   combine via the standard CFG mix.  Smoke-tested on CPU
   (--n-gpu-layers 0): runs cleanly, S3Gen wall-clock doubles vs
   meanflow as expected (12 CFM steps × 2 forward calls).

2. src/t3_mtl.cpp::sample_next_token_mtl top-k filter: after
   nth_element(begin, begin+k, end, greater) the (k+1)-th largest
   sits at idx[k] and positions [0, k) hold the top-k UNORDERED.
   The previous code took cut = l[idx[k-1]] which is some
   arbitrary top-k element (often not the smallest), making cut
   too large and the `x < cut` filter then erased legitimate
   top-k logits.  Fix: partition to begin+(k-1) so idx[k-1] is
   the k-th largest exactly.  Mostly masked by the default
   top_k=1000 vs an 8194-vocab where the threshold falls into
   the noise floor; the bug bites at small top_k (e.g. greedy
   --top-k 1 where the wrong cut could pessimise tie handling).
   The Turbo sample_next_token_ex in src/main.cpp uses a
   different (correct) approach via tmp[k] + per-element rescan
   for ties; left untouched.

3. src/chatterbox_engine.cpp: load_model_gguf dispatches MTL
   GGUFs into load_model_gguf_mtl (populates layers_mtl, leaves
   layers empty), but synthesize() unconditionally calls
   eval_prompt -> build_prompt_graph -> build_transformer_core,
   which iterates model.layers[il] -- empty std::vector, UB or
   crash.  Add a clean rejection guard right after the load: if
   model.hparams.variant != CHBX_VARIANT_TURBO, free_model() and
   throw a clear error pointing the user at the CLI / internal
   eval_*_mtl helpers.  Wiring MTL through the public Engine API
   (extend EngineOptions with language / cfg_weight / min_p /
   exaggeration, branch synthesize() on variant) is left as a
   follow-up; this just stops the crash on the public surface.

4. src/chatterbox_cli.cpp::run_t3_for_segment retry trigger: the
   0cad44d merge commit said the 5x speech-tokens-per-BPE-token
   floor (calibrated for English Turbo / GPT-2 BPE) should be
   gated to non-MTL because MTL's Llama tokenizer has a ~1.7x
   ratio.  The gating wasn't actually in the code -- a clean
   stop-token termination on a short MTL segment looked
   "implausible" and triggered up to 3 spurious retries.
   `plausible = is_mtl || (int)generated.size() >= min_tokens;`
   restores the intent.  The 3x-repeated-token early-stop above
   still guards MTL's catastrophic case.  Measured on M4 Metal
   with the ES reference prompt + jfk/gianni voice: T3 wall
   time drops from ~3.9 s (4 attempts) to ~0.93 s (1 attempt) --
   ~4x speedup just from removing the wasted retries.  WAV md5
   stays byte-exact at 57cc80f27a122f03435fd05f47d1b3d2.

5. src/t3_mtl.cpp stacked-QKV loader scratch sizing: the early
   type-equality guard implies wq/wk/wv have identical sizes
   today, but max over all three so a future shape divergence
   (e.g. an MTL variant with non-square Q/K/V) can't silently
   truncate a per-layer copy via undersized scratch.  No
   behaviour change today; defensive only.
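
The top-k fix in item 2 can be illustrated in isolation. This is a minimal sketch with a hypothetical `top_k_filter` function, not the code in src/t3_mtl.cpp: partitioning at position k-1 guarantees that `idx[k-1]` is exactly the k-th largest logit, so using it as the cut keeps all of the top k.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <numeric>
#include <vector>

// Keep the k largest logits and mask the rest to -infinity.
void top_k_filter(std::vector<float>& logits, int k) {
    assert(k >= 0);
    const int n = static_cast<int>(logits.size());
    if (k == 0 || k >= n) {
        return;  // nothing to filter
    }
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    auto by_logit_desc = [&logits](int a, int b) { return logits[a] > logits[b]; };
    // After this call idx[k-1] holds the k-th largest logit exactly, and
    // positions [0, k-1) hold larger-or-equal logits in unspecified order.
    std::nth_element(idx.begin(), idx.begin() + (k - 1), idx.end(), by_logit_desc);
    const float cut = logits[idx[k - 1]];
    for (float& x : logits) {
        if (x < cut) {
            x = -std::numeric_limits<float>::infinity();
        }
    }
}
```

Partitioning at position k instead and then reading `idx[k-1]` would pick an arbitrary member of the top k as the cut, which is the failure mode described above. Note that logits tied with the cut survive the `x < cut` test, so more than k entries can remain when there are ties.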

Validation (Apple M4, Metal, Release):
  - cmake --build: clean, no warnings, all targets link.
  - test-metal-ops: 14/14 PASS, 0 FAIL.
  - End-to-end synthesis (ES prompt, gianni.wav, --seed 42, greedy):
    md5 57cc80f27a122f03435fd05f47d1b3d2 -- byte-exact vs the
    pre-fix baseline.  T3 wall time ~3.9s -> ~0.9s (fix #4).
  - CPU CFG smoke test (--n-gpu-layers 0, --text "Hola.", es):
    completes cleanly, S3Gen ~12s for 12 CFM steps × 2 forward
    calls (cond + uncond), produces valid 1.1s WAV.
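
The CFG combination restored in item 1, (1+cfg)*cond - cfg*uncond, is applied element-wise to the outputs of the two forward passes. A minimal sketch, with a hypothetical `cfg_mix` helper operating on plain vectors rather than GGML tensors:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Classifier-free guidance mix: out = (1 + cfg) * cond - cfg * uncond.
std::vector<float> cfg_mix(const std::vector<float>& cond,
                           const std::vector<float>& uncond,
                           float cfg_rate) {
    assert(cond.size() == uncond.size());
    std::vector<float> out(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) {
        out[i] = (1.0f + cfg_rate) * cond[i] - cfg_rate * uncond[i];
    }
    return out;
}
```

With `cfg_rate` equal to 0 the mix reduces to the conditional output alone, which is effectively what the pre-fix else branch computed regardless of the configured rate.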

Issues #5 (redundant peek+open in load_model_gguf), #7
(/g deny-list breadth in requantize-gguf.py), #8 (forward-hook
idiom in dump-t3-mtl-reference.py), and #9 (CMake duplicate
cli_main.cpp build) are tracked but intentionally not folded in
here -- the reviewer flagged them as cosmetic / trivial / fine.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ogad-tether

Author

Addressed the review feedback in 06da42a:

  • Merged latest main; PR is now clean/mergeable.
  • Removed .github/workflows/supertonic-portable-build.yml.
  • Added supertonic_model::generation_id and included it in thread-local graph/layout cache keys to avoid stale ctx_w tensor reuse across repeated synthesize() calls on the same thread.
  • Added missing ggml_free(ctx) calls for local-buffer graph contexts in text, duration, vector residual islands, and vocoder trace paths.
  • Refactored Supertonic CMake harnesses to link against tts-cpp instead of recompiling supertonic_*.cpp per test target.
  • Removed the f32-only engine guard and expanded f16/q8_0 GGUF storage tensors to F32 runtime tensors at load time so those GGUFs can run on the existing graph/custom-kernel path.
  • Aligned converter default steps to 5, matching the reference dumps and README examples.

Validation run locally:

  • cmake -S . -B build
  • cmake --build build --target tts-cli supertonic-cli supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
  • ./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
  • ./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick

@ogad-tether

Author

Detailed follow-up on each requested item, all addressed in 06da42a:

  1. Windows/latest main: pulled and merged latest main. The conflicting CLI parser area now keeps upstream's Windows/main changes and the Supertonic autodispatch options. PR merge state is now clean.

  2. Stale thread-local graph caches: added supertonic_model::generation_id, assigned a monotonic generation on each successful load, and included that generation in every persistent Supertonic cache key (text_relpos_graph_cache, text_ffn_graph_cache, speech_attention_cache, vector attention/group/res-style/tail caches, vocoder graph cache, and the static layout caches). This prevents a new stack-allocated supertonic_model at the same address from reusing graphs that captured tensors from a freed ctx_w.

  3. ggml_context leaks: added explicit ggml_free(ctx) cleanup for the local-buffer ggml_init sites in text encoder, duration, vector estimator residual/local graph blocks, and vocoder trace, including reserve/allocr failure paths.

  4. Supertonic test compile duplication: refactored the Supertonic harness targets in CMakeLists.txt to link against the existing tts-cpp static library instead of recompiling supertonic_*.cpp into each test executable.

  5. Non-f32 GGUF support: removed the f32-only guard from supertonic_engine.cpp. The loader now accepts f16 and q8_0 GGUF storage by expanding those tensors to F32 runtime tensors during load, so the current GGML/custom-kernel path works for those model files. I also generated and smoke-tested local f16/q8_0 models with tts-cli.

  6. Default-step mismatch: aligned the converter default to 5 steps so locally generated GGUF metadata matches the reference dumps and README examples unless the user explicitly passes --default-steps.

Local validation:

  • cmake -S . -B build
  • cmake --build build --target tts-cli supertonic-cli supertonic-bench test-supertonic-pipeline test-supertonic-vocoder-pointwise
  • ./build/test-supertonic-pipeline models/supertonic2.gguf artifacts/supertonic-ref-quick
  • ./build/test-supertonic-vocoder-pointwise models/supertonic2.gguf artifacts/supertonic-ref-quick
  • Generated smoke WAVs for f32/f16/q8_0 through tts-cli under artifacts/supertonic-quant-smoke/.

@ogad-tether ogad-tether requested a review from GustavoA1604 May 6, 2026 10:54
@GustavoA1604 GustavoA1604 merged commit f807035 into GustavoA1604:main May 6, 2026
Zbig9000 added a commit to Zbig9000/chatterbox.cpp that referenced this pull request May 6, 2026
… + scaffolding caches (multilingual Vulkan)

Targets the per-synth host-CPU overhead that round 1 / round-HIFT
didn't address, on top of upstream/multilingual_merged (now in main
via PR GustavoA1604#7).  Test-first: bench-logs-vk-mtl/regress-mtl-vk.sh in the
qvac monorepo locks the pre-change MD5 baseline, then re-verifies
after every cache.  All 3 invariants (multilingual single-shot,
multilingual 6-segment multi-synth, Turbo single-shot) PASS bit-exact.

Seven new caches
----------------

All host-side, model-agnostic, no GGUF-format change, no public-API
change.  Same teardown discipline as the existing g_cfm_estimator_cache
(destroy() before ggml_backend_free).  Sit alongside the existing
round-1 caches.

  - g_encoder_graph_cache (keyed on T): full run_encoder graph +
    gallocator.  Streaming chunks of varying length still produce
    correct output (rebuilds on key change).

  - g_hift_graph_cache (keyed on pack(T_mel, T_stft)) +
    g_hift_inv_alpha_entries: full run_hift_decode graph + gallocator.
    Parallel (graph-input-name, source-tensor-ptr) metadata lets
    cache hits re-feed each alpha-input slot from g_inv_alpha_results
    without rebuilding the graph.

  - g_f0_graph_cache (keyed on T_mel): full run_f0_predictor graph +
    gallocator.

  - cached_pos_emb (g_pos_emb_results, keyed on pack(T, D)):
    compute_pos_emb is pure CPU compute (~T * D * 5 trig ops); fired
    twice per encoder run (T and 2T).  Multilingual T~350+ at D=512
    is a real wedge of per-synth host time.

  - cached_inv_alpha (g_inv_alpha_results, keyed on ggml_tensor*):
    HiFT calls invert_alpha_cpu ~72x per synth (12 ResBlocks × 6
    alpha tensors); each is a tensor_get + per-element reciprocal.
    Alpha tensors are constant for the model lifetime.

  - cached_hann_window / cached_istft_kernel (g_hann_window_cache /
    g_istft_kernel_cache, keyed on n_fft): pure functions of n_fft
    (constant 16 in the chatterbox HiFT path).

  - cached_window_sum (g_window_sum_cache, keyed on
    pack(n_fft, hop, T_stft)): T_stft × n_fft adds; stable across
    same-shape synth calls.
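
Several of the caches above are keyed on a packed pair such as pack(T, D). A minimal sketch of that pattern follows, assuming both dimensions fit in 32 bits. `pack_key` and `TableCache` are hypothetical names, and the sketch returns copies for simplicity where a real cache would likely hand out references or shared pointers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Pack two 32-bit dimensions into one 64-bit map key.
inline std::uint64_t pack_key(std::uint32_t a, std::uint32_t b) {
    return (static_cast<std::uint64_t>(a) << 32) | b;
}

// Memoizes a pure function of (t, d), e.g. a positional-embedding table.
class TableCache {
public:
    using Compute = std::function<std::vector<float>(std::uint32_t, std::uint32_t)>;

    std::vector<float> get(std::uint32_t t, std::uint32_t d, const Compute& compute) {
        const std::uint64_t key = pack_key(t, d);
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto it = map_.find(key);
            if (it != map_.end()) {
                return it->second;  // cache hit
            }
        }
        // Compute without holding the lock, then publish the result.
        std::vector<float> value = compute(t, d);
        std::lock_guard<std::mutex> lock(mu_);
        return map_.emplace(key, std::move(value)).first->second;
    }

private:
    std::mutex mu_;
    std::unordered_map<std::uint64_t, std::vector<float>> map_;
};
```

Shifting the first dimension into the high 32 bits keeps (T, D) and (D, T) distinct, which a simple sum or XOR of the two values would not.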

A new graph_cache struct (used by encoder / HiFT / F0) and a
pack_hift_key helper centralise the explicit destroy()-on-teardown
pattern so future per-stage caches can plug in with one struct + one
mutex acquisition.  The destroy path is unified into a renamed
s3gen_release_synth_caches() (replaces the old
g_cfm_estimator_cache_destroy()), called from
s3gen_model_cache_release, the cache-miss backend-swap path, and the
explicit s3gen_unload().

Negative result documented (bug caught and fixed during dev)
------------------------------------------------------------

First implementation of the HiFT cache hung indefinitely on the very
first synth call.  Root cause: the alpha-input refresh loop held
g_synth_caches_mu while calling cached_inv_alpha, which itself takes
the same mutex internally — classic re-entrant deadlock.  Fix:
snapshot g_hift_inv_alpha_entries under the mutex into a local vector,
then iterate without the lock (cached_inv_alpha re-acquires the mutex
per call but with no nesting).  General rule kept as an inline comment:
never hold a cache-state mutex while calling any other cached_* helper.
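
The snapshot-then-iterate fix can be shown with a non-recursive `std::mutex`, which would deadlock (formally, undefined behaviour) if locked twice by the same thread. The globals and function names below are hypothetical stand-ins for g_synth_caches_mu, g_hift_inv_alpha_entries, and cached_inv_alpha.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>
#include <vector>

static std::mutex g_mu;                           // guards both containers below
static std::vector<int> g_entries = {1, 2, 4};    // keys to refresh
static std::unordered_map<int, float> g_results;  // memoized reciprocals

// Helper that takes the mutex internally, like the cached_* helpers.
float cached_inverse(int key) {
    assert(key != 0);
    std::lock_guard<std::mutex> lock(g_mu);
    auto it = g_results.find(key);
    if (it == g_results.end()) {
        it = g_results.emplace(key, 1.0f / static_cast<float>(key)).first;
    }
    return it->second;
}

// Safe refresh loop: copy the entry list under the lock, then release it
// before calling any helper that locks the same mutex.
float refresh_all() {
    std::vector<int> snapshot;
    {
        std::lock_guard<std::mutex> lock(g_mu);
        snapshot = g_entries;
    }
    float sum = 0.0f;
    for (int key : snapshot) {
        sum += cached_inverse(key);  // g_mu is NOT held here, so no nesting
    }
    return sum;
}
```

Switching to `std::recursive_mutex` would also avoid the hang, but the snapshot approach keeps lock hold times short and makes the "no nested cache locks" rule explicit.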

Performance — RTX 5090, multilingual auto-split, warm-state seg 2..6
-------------------------------------------------------------------

Within-process win on top of round 1 + round-HIFT:

  metric        | pre-round-2 |  post-round-2  |          Δ
  S3GEN_INFER   |    159.8 ms |    140.8 ms    |  -19.0 ms (-11.9 %)
  cfm_total     |    122.2 ms |    118.7 ms    |   -3.5 ms (-2.9 %)
  cfm_step0     |     13.24 ms|     13.18 ms   |   noise (already cached round 1)
  hift_total    |     17.96 ms|     16.30 ms   |   -1.7 ms (-9.4 %)

Combined cumulative win vs upstream/multilingual_merged baseline
(round 1 + round-HIFT + round 2):

  metric        | upstream/mtl_merged |  this PR (full) |          Δ
  S3GEN_INFER   |          169.9 ms   |     140.8 ms    |  -29.1 ms (-17.1 %)
  cfm_total     |          132.5 ms   |     118.7 ms    |  -13.8 ms (-10.4 %)
  cfm_step0     |           24.1 ms   |      13.2 ms    |  -10.9 ms (-45.2 %)

The biggest remaining single piece of S3GEN_INFER (~120 ms cfm) is
the actual GPU CFM compute — not host-cacheable; would need
shader-side optimisation (e.g. tensor-core engagement via
cooperative_matrix2; deferred — see "Next" in PROGRESS.md §3.32).

Bit-exactness
-------------

Locked invariants pass byte-for-byte vs the pre-change baseline:

  Multilingual single-shot      c65d98f15a59b8fe9cad98e46eb3fb30  ✓
  Multilingual 6-segment multi  0b374c7474895a3387b9f1df10b3c1b8  ✓
  Turbo single-shot             6219f4338b1b4fb9dc60481216153b49  ✓

Verified across 4 successive iterations on RTX 5090 + NVIDIA 590.48
+ Vulkan 1.3.275; bench-logs-vk-mtl/regress-mtl-vk.sh in the qvac
monorepo is the test-first harness.

Files
-----

  src/chatterbox_tts.cpp         +373 / -79 (net diff vs round-1 head)
  PROGRESS.md                    §3.32 round-2 subsection (~+200 lines)

The +373 lines in chatterbox_tts.cpp are entirely the new cache
infrastructure: graph_cache struct, seven new globals, the
s3gen_release_synth_caches lifecycle hook, the five cached_*
scaffolding helpers, and the build_graph / cache-hit branches in
run_encoder / run_hift_decode / run_f0_predictor.
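
The build_graph / cache-hit split described above follows a common
shape, sketched below under stated assumptions. GraphCacheSketch and
BuiltGraph are hypothetical names; the real graph_cache struct and its
ggml state are not reproduced here, and keying on sequence length is an
assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Stand-in for a built compute graph; a real one would own graph/allocator state.
struct BuiltGraph {
    std::size_t seq_len;
};

struct GraphCacheSketch {
    std::unordered_map<std::size_t, std::unique_ptr<BuiltGraph>> by_len;
    std::size_t builds = 0;  // how many times the builder actually ran

    using Builder = std::function<std::unique_ptr<BuiltGraph>(std::size_t)>;

    // Return the cached graph for this shape, building it only on a miss.
    BuiltGraph & get_or_build(std::size_t seq_len, const Builder & build_graph) {
        auto it = by_len.find(seq_len);
        if (it == by_len.end()) {  // cache miss: build once and remember
            std::unique_ptr<BuiltGraph> g = build_graph(seq_len);
            assert(g != nullptr);
            it = by_len.emplace(seq_len, std::move(g)).first;
            ++builds;
        }
        return *it->second;  // cache hit (or freshly built)
    }

    // Lifecycle hook: drop all cached graphs, e.g. at unload time.
    void release() { by_len.clear(); }
};
```

Calling get_or_build twice with the same length runs the builder once;
release() plays the role a hook like s3gen_release_synth_caches would
play for the real caches.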

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request May 7, 2026
Mirrors the parakeet-cpp port README layout so a downstream consumer
can answer 'what does this library do, how do I link it, and which
CMake knobs do I need to know about?' from the top of the README
without scrolling through the 1300-line standalone development walk-
through.  No content removed; existing standalone material stays
verbatim, just shifted down by ~80 lines.

Adds three new blocks near the top:

- ## API overview (between the benchmark tables and 'Pipeline at a
  glance').  Two-row table for the high-level entry points exported
  through TTS_CPP_API:
    * tts_cpp::chatterbox::Engine::synthesize  - Chatterbox T3+S3Gen+HiFT
    * tts_cpp::supertonic::synthesize          - Supertonic CPU TTS
  Trailing paragraph mentions the lower-level helpers
  (s3gen_synthesize_to_wav / s3gen_preload / s3gen_unload /
  tts_cpp_cli_main), points at <tts-cpp/export.h>, and explicitly
  flags that detail-namespaced symbols (used by the supertonic /
  chatterbox test harnesses) are not part of the public API and are
  hidden in SHARED builds.

- ### Consumer integration (subsection of API overview).  Calls out
  that the qvac speech-stack qvac-ext-lib-whisper.cpp wrapper port
  consumes ggml from the qvac-ext-ggml/speech branch directly
  (Metal / OpenCL / Vulkan patches included) and does NOT ship
  scripts/setup-ggml.sh or patches/ - those are standalone-dev tools
  maintained in this repo only.  Provides the
  find_package(tts-cpp CONFIG REQUIRED) +
  target_link_libraries(... tts-cpp::tts-cpp) + 8-line
  Engine::synthesize C++ snippet that's the entire consumer-side
  integration.

- ### Useful CMake options (inside section 1, between the GPU backend
  paragraph and the binaries table).  Full table of the project-
  namespaced flags:
    TTS_CPP_BUILD_LIBRARY, TTS_CPP_BUILD_SHARED (new from items 7+8),
    TTS_CPP_BUILD_EXECUTABLES, TTS_CPP_BUILD_TESTS, TTS_CPP_INSTALL,
    TTS_CPP_USE_SYSTEM_GGML, TTS_CPP_GGML_LIB_PREFIX, TTS_CPP_CCACHE
    (new from items 7+8).
  Plus a secondary table for the ctest-fixture cache paths
  (TTS_CPP_TEST_{MODEL,AUDIO,REF}_DIR) and a one-liner explaining the
  REQUIRES auto-disable behaviour from item 7.
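
The public/detail split mentioned in the API overview bullet relies on
symbol visibility. The sketch below shows the usual pattern under
stated assumptions: DEMO_API, DEMO_SHARED, DEMO_BUILDING and the demo
namespace are hypothetical, and the actual contents of
<tts-cpp/export.h> are not reproduced here.

```cpp
// Hypothetical export macro in the style of TTS_CPP_API.
#if defined(DEMO_SHARED) && defined(_WIN32)
#  if defined(DEMO_BUILDING)
#    define DEMO_API __declspec(dllexport)
#  else
#    define DEMO_API __declspec(dllimport)
#  endif
#elif defined(DEMO_SHARED) && (defined(__GNUC__) || defined(__clang__))
#  define DEMO_API __attribute__((visibility("default")))
#else
#  define DEMO_API  // static build: nothing to export
#endif

namespace demo {

// Public entry point: exported from a shared build.
DEMO_API int public_answer();

namespace detail {
// No DEMO_API: when the library is compiled with -fvisibility=hidden,
// this symbol stays internal to the shared object.
int helper();
}  // namespace detail

}  // namespace demo

int demo::detail::helper() { return 41; }
int demo::public_answer() { return detail::helper() + 1; }
```

With this layout a test harness that needs detail symbols has to link
the static library, since a shared build exposes only the annotated
entry points.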

Touches existing prose in two places:

- The setup-ggml.sh paragraph in section 1 gets a one-paragraph
  follow-up clarifying it (and patches/) are standalone-development
  tools only, with a back-link to the Consumer integration section
  (item 9: 'document setup-ggml.sh inertness' folded into this
  framing rather than landed as a separate doc-only commit).  Also
  strengthens the existing 'Re-running is safe' line to 'idempotent
  and destructive' so a dev hacking on ./ggml is warned before
  losing local edits.

- The ### Alternative: consume ggml from vcpkg subsection now opens
  with one sentence positioning it as the CMake-mechanic detail
  behind the Consumer integration story, with a forward link to the
  qvac-ext-ggml/speech branch.

Also updates the binaries table in section 1 to list the missing
PR #6 + PR #7 binaries that landed since the README was last
refreshed: supertonic-cli, supertonic-bench, test-cpu-caches,
test-t3-caches, and the test-supertonic-* family.  Trailing paragraph
notes that test-* binaries register with CTest, so
`ctest -C Release -L unit` / `ctest -C Release -L fixture` work
from the build directory.

No code changes, no CMake changes, no install behaviour changes.
README.md +128 / -10 lines.

Co-authored-by: Cursor <cursoragent@cursor.com>