tts-cpp: supertonic Engine streaming via multilingual chunker + callback #20
GustavoA1604
left a comment
Cursor flagged these two things, could you check if they make sense?
Duplicated sentence-terminator code point table (supertonic_engine.cpp + supertonic_chunker.cpp) — chunk_ends_with_sentence_term() in the engine and is_sentence_end_cp() in the chunker are identical switch statements. If the set grows (e.g. Ethiopic ።, Tibetan །) only one copy will be updated. The fix is to un-anonymize is_sentence_end_cp in the chunker and have the engine call it.
stream_emit_pcm_stdout writes one sample at a time (supertonic_cli.cpp) — It calls fwrite(&v, 2, 1, stdout) in a per-sample loop, meaning ~44k–132k syscall-adjacent calls per chunk. Should build a std::vector<int16_t> and write the whole buffer in one fwrite.
…ered stdout)
Two review-comment fixes from PR #20:
1. De-duplicated the sentence-terminator code-point table between supertonic_chunker.cpp's is_sentence_end_cp() and the engine's chunk_ends_with_sentence_term(). is_sentence_end_cp() is now declared in supertonic_chunker.h and called from the engine's per-chunk continuation detector. The engine still owns the UTF-8 trim/decode logic, but the predicate (and its multilingual table) lives in one place. Adding Ethiopic ።, Tibetan ། or any other terminator now needs one edit, not two.
2. stream_emit_pcm_stdout was doing a per-sample fwrite(&v, 2, 1, stdout) loop: ~44k-132k syscall-adjacent calls per chunk. It now builds the chunk's int16 buffer once, writes it in a single fwrite, and flushes after. No semantic change to the bytes on stdout; just throughput.
Verified: multi-sentence chunker still produces 3 sentence-aligned chunks (unchanged); stdout streaming byte count still equals samples * 2 exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both fixes are in. 1. De-duplicated terminator table. 2. Buffered stdout write. Both verified on smoke: multi-sentence chunker still gives 3 sentence-aligned chunks (no regression in the predicate), stdout streaming byte count is exact.
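The buffered write from fix 2 can be sketched roughly as below. This is a minimal illustration, not the actual stream_emit_pcm_stdout body (which is not shown in this thread): the helper names to_s16 and write_pcm_chunk are hypothetical, and it assumes float samples in [-1, 1] and C++17.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: convert float PCM (assumed range [-1, 1]) to
// 16-bit samples in one pass, clamping anything outside that range.
std::vector<int16_t> to_s16(const std::vector<float>& pcm) {
    std::vector<int16_t> out;
    out.reserve(pcm.size());
    for (float s : pcm) {
        const float c = std::clamp(s, -1.0f, 1.0f);  // std::clamp needs C++17
        out.push_back(static_cast<int16_t>(std::lrint(c * 32767.0f)));
    }
    return out;
}

// Hypothetical helper: write a whole chunk with a single fwrite, then flush
// so a downstream player sees it immediately. Returns bytes written.
// Samples go out in host byte order, so this is s16le only on
// little-endian machines.
std::size_t write_pcm_chunk(const std::vector<float>& pcm, std::FILE* fp) {
    const std::vector<int16_t> buf = to_s16(pcm);
    if (buf.empty()) return 0;
    const std::size_t n = std::fwrite(buf.data(), sizeof(int16_t), buf.size(), fp);
    std::fflush(fp);
    return n * sizeof(int16_t);
}
```

The return value makes the "byte count equals samples * 2" check from the verification note easy to assert on the caller side.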
Mirrors the chatterbox StreamCallback API: a second synthesize() overload
takes an on_chunk callback that receives PCM chunk-by-chunk while the
returned SynthesisResult still accumulates the full audio (callback is
an addition, not a replacement).
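The "addition, not a replacement" contract can be sketched as follows. The type and function names here (SynthesisResult with a float pcm vector, the StreamCallback signature, ChunkSynth, synthesize_streaming_sketch) are hypothetical stand-ins for illustration; the real declarations live in the tts-cpp headers and may differ.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the engine types.
struct SynthesisResult {
    std::vector<float> pcm;  // full concatenated audio
};
using StreamCallback = std::function<void(const float* pcm, std::size_t n)>;
using ChunkSynth = std::function<std::vector<float>(const std::string&)>;

// Runs synth_one on every text chunk, hands each chunk's PCM to on_chunk
// synchronously, and still accumulates everything in the returned result.
SynthesisResult synthesize_streaming_sketch(const std::vector<std::string>& chunks,
                                            const ChunkSynth& synth_one,
                                            const StreamCallback& on_chunk) {
    SynthesisResult result;
    for (const std::string& text : chunks) {
        const std::vector<float> pcm = synth_one(text);
        if (on_chunk) on_chunk(pcm.data(), pcm.size());  // callback is optional
        result.pcm.insert(result.pcm.end(), pcm.begin(), pcm.end());
    }
    return result;
}
```

A caller that passes an empty callback gets plain batch behavior, which is why existing batch callers would be unaffected under this shape.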
Supertonic's vector estimator is non-autoregressive (5-step CFM denoise
over the full duration-predicted latent), so the chatterbox token-level
streaming pattern doesn't transfer. Instead this splits text into
sentence-aligned chunks and runs the full pipeline per chunk:
- New src/supertonic_chunker.{h,cpp}: Unicode-aware splitter. Sentence-
end gets a wide implicit search window (target/2..3*target) because
sentence prosody dominates audio quality on this model — chunks cut
mid-clause receive an artificial trailing period from preprocess and
the model emits muddled / dropped words in response. Clause and
whitespace fallbacks use the user-supplied tolerance.
- Multilingual punctuation tables: ASCII .?! plus CJK fullwidth, double
exclamation/question, Devanagari danda, Urdu full stop for sentences;
ASCII / fullwidth / Arabic comma, semicolon, colon and closing
brackets for clauses. Whitespace fallback handles CJK / Thai / Lao /
Khmer where punctuation may be absent.
- Engine streaming path runs the full pipeline per chunk with opts.seed
(no per-chunk perturbation; different chunks have different latent_len
so noise tensors differ even with the same seed, and an earlier
per-chunk seed bump occasionally landed chunks on nearby seeds where
the model produces phantom-phoneme tail artifacts).
- 10 ms raised-cosine anti-click fade on inter-chunk seams only. First
chunk start and last chunk end stay untouched so streamed output is
acoustically equivalent to batch at the endpoints.
- CLI gains --stream-chunk-tokens / --stream-first-chunk-tokens /
--stream-chunk-tolerance-pct flags. --out - streams raw s16le PCM on
stdout for incremental playback (pipe into ffplay / sox -d).
SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries;
SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for
debugging.
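The seam-only anti-click fade from the list above might look like the sketch below. It assumes a fade-out on the tail of every chunk except the last and a fade-in on the head of every chunk except the first; whether the engine fades both sides of a seam is an assumption, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Raised-cosine gain: 0 at i == 0, rising smoothly toward 1 as i approaches n.
inline float raised_cosine_gain(std::size_t i, std::size_t n) {
    const double pi = 3.14159265358979323846;
    return static_cast<float>(
        0.5 * (1.0 - std::cos(pi * static_cast<double>(i) / static_cast<double>(n))));
}

// Fades only the edges that touch another chunk. The first chunk's start
// and the last chunk's end are left untouched.
void fade_chunk_seams(std::vector<float>& pcm, int sample_rate,
                      bool is_first, bool is_last, double fade_ms = 10.0) {
    if (sample_rate <= 0 || fade_ms <= 0.0) return;
    const std::size_t want = static_cast<std::size_t>(sample_rate * fade_ms / 1000.0);
    const std::size_t n = std::min(want, pcm.size() / 2);  // never overlap the two fades
    if (n == 0) return;
    if (!is_first) {
        for (std::size_t i = 0; i < n; ++i) pcm[i] *= raised_cosine_gain(i, n);
    }
    if (!is_last) {
        for (std::size_t i = 0; i < n; ++i) pcm[pcm.size() - 1 - i] *= raised_cosine_gain(i, n);
    }
}
```

At a 44100 Hz sample rate the default 10 ms fade covers 441 samples per edge.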
Validated end-to-end at ~35x realtime on M2 Metal: streamed output is
acoustically equivalent to batch on the same seed; first audio drops in
~1 s for an 18 s utterance instead of waiting the full ~4-5 s for batch
synth to complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two empirically-driven additions on top of the sentence-aligned
chunker:
1. is_continuation flag through supertonic_preprocess_text +
supertonic_text_to_ids. When the engine produces a mid-clause /
mid-word chunk during streaming, the preprocess skips its
auto-appended terminal period. Without the flag the model spoke
stub chunks as complete sentences with falling intonation and
trailing-phoneme artifacts (the original "park.K" tail bug). The
engine detects per-chunk whether the chunk ends on a natural
sentence terminator (ASCII .?! plus CJK / Devanagari / Urdu
equivalents) and passes through the flag accordingly.
2. stream_min_chunk_tokens (default 30) on EngineOptions. Below ~30
tokens the model emits dropped / muddled phonemes on stub input
regardless of the continuation flag (verified on multiple seeds
and texts — short text is a model-level failure mode, not a
preprocess one). The chunker treats min_chunk_tokens as a hard
floor: effective target = max(target, min), the sentence/clause/
whitespace search lower bound is clamped to start + min, and any
trailing chunk below the floor is merged into its predecessor.
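The hard-floor rule in item 2 can be illustrated with a small window computation. This is a sketch under assumptions: the SearchWindow struct and fallback_window function are hypothetical, and the tolerance is modeled as a symmetric percentage of the effective target.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical search window, as absolute token offsets [lo, hi].
struct SearchWindow {
    std::size_t lo;
    std::size_t hi;
};

// Clause/whitespace fallback window with the min-chunk floor applied:
// effective target = max(target, min), and lo is clamped to start + min.
SearchWindow fallback_window(std::size_t start, std::size_t target,
                             std::size_t min_tokens, std::size_t tolerance_pct,
                             std::size_t text_len) {
    const std::size_t eff = std::max(target, min_tokens);
    const std::size_t slack = std::min(eff * tolerance_pct / 100, eff);
    std::size_t lo = start + (eff - slack);
    std::size_t hi = start + eff + slack;
    lo = std::max(lo, start + min_tokens);  // never aim below the floor
    hi = std::min(hi, text_len);            // never run past the text
    lo = std::min(lo, hi);
    return {lo, hi};
}
```

With target=20, min=30 and 50% tolerance, the unclamped lower bound would be 15 tokens past start; the clamp moves it to 30, so no candidate boundary below the floor is ever considered.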
The min floor is the practical ceiling on what Option A streaming
can achieve. True seam-free streaming inside one utterance would
require model retraining (causal attention, per-token duration,
mel-frame cache continuity — the bits chatterbox has by design but
supertonic was not trained for). Documenting that as the trade-off
honestly rather than papering over it.
Behavior:
- Multi-sentence input → sentence-aligned chunks (the v1 behavior).
Acoustically equivalent to batch on the same seed.
- Long single-sentence input → multi-chunk output at the min floor,
each chunk passed to the model without an artificial terminal
period. Inter-chunk pauses and rate shifts are inherent to
per-chunk synthesis on a non-streaming-trained model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
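The effect of is_continuation on the terminal period can be sketched as below. The function name append_terminal_period is hypothetical, only ASCII terminators are checked for brevity, and it assumes the period is appended only when the text does not already end in a terminator; the real supertonic_preprocess_text may behave differently.

```cpp
#include <string>

// Hypothetical sketch of the terminal-period rule with a continuation flag.
std::string append_terminal_period(std::string text, bool is_continuation) {
    // Trim trailing ASCII whitespace before inspecting the last character.
    while (!text.empty() &&
           (text.back() == ' ' || text.back() == '\t' || text.back() == '\n')) {
        text.pop_back();
    }
    if (text.empty() || is_continuation) return text;  // mid-clause chunk: no period
    const char last = text.back();
    if (last != '.' && last != '?' && last != '!') text.push_back('.');
    return text;
}
```

A mid-clause chunk such as "walk in the park" is passed through unchanged when the flag is set, so the model is not told that the stub is a complete sentence.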
…reshold
Tail-merge was using min_chunk_tokens (30) as its threshold, which on languages denser than English (CJK in particular) merged the last chunk into the previous one even when that last chunk was a complete sentence.
Concrete: Korean "공원에서 산책하기 좋은 날이다." is 18 code points, below the 30-cp floor, so the merger folded it into the previous chunk, which contained TWO sentences, producing a single 172-byte chunk for the whole utterance and zero streaming benefit.
Switch to chatterbox_engine.cpp:608's heuristic: tail_thresh = max(6, target_tokens/3) (16 for target=50). Genuinely tiny stubs (<16 cps) still merge; real sentence chunks stay independent. The min_chunk_tokens floor governs what the chunker proactively *aims for* during iteration, not what it does with whatever's left after the last natural boundary.
Verified: Korean 3-sentence text now chunks into 2 (first chunk spans 2 sentences due to first-sentence-below-min-floor, last sentence stays separate at 18 cps). English 3-sentence test stays at 3 sentence-aligned chunks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
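The tail-merge heuristic described in this commit can be sketched as follows. The helper names are hypothetical and chunks are modeled as code-point strings so that sizes are in cps; integer division gives 16 for target=50.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// tail_thresh = max(6, target_tokens / 3), with integer division.
std::size_t tail_merge_threshold(std::size_t target_tokens) {
    return std::max<std::size_t>(6, target_tokens / 3);
}

// Folds the final chunk into its predecessor only when it is shorter than
// the tail threshold. A complete short sentence above the threshold stays
// independent even if it is below the min-chunk floor.
void merge_short_tail(std::vector<std::u32string>& chunks, std::size_t target_tokens) {
    if (chunks.size() < 2) return;
    if (chunks.back().size() < tail_merge_threshold(target_tokens)) {
        chunks[chunks.size() - 2] += chunks.back();
        chunks.pop_back();
    }
}
```

An 18-cp final sentence therefore survives at target=50, while a 5-cp stub is merged.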
The 3x sentence-search window slurped runaway-sentence tails as one huge "sentence-aligned" chunk: a 245-char single sentence with the final period 109 chars past start was found by the wide window, so the chunker took the whole remainder as chunk[3] instead of falling through to whitespace and producing multiple sub-sentence chunks.
2x is still wide enough to catch a long-but-reasonable first sentence in multi-sentence input (covers up to ~90 chars at target=50, ample for typical English / French / Portuguese sentences) but narrow enough that genuinely runaway sentences (>2x target with no internal periods) fall through to whitespace and stream.
Empirical: same 245-char English run-on now produces 5 evenly-sized chunks (30, 52, 54, 52, 56) instead of 4 with the tail-blob (30, 52, 54, 109). Multi-sentence test unchanged (still 3 sentence-aligned chunks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
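The difference between the 3x and 2x windows can be reproduced with a toy search. This sketch is an assumption about how the boundary is chosen (last terminator inside the window); the function names are hypothetical and the predicate is a reduced stand-in for the shared table.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

constexpr std::size_t kNoCut = static_cast<std::size_t>(-1);

// Reduced stand-in predicate: ASCII terminators plus the CJK full stop.
inline bool is_sentence_end(char32_t cp) {
    return cp == U'.' || cp == U'?' || cp == U'!' || cp == U'\u3002';
}

// Looks for a sentence end such that the chunk length falls in
// [target/2, hi_mult*target]. Returns the cut offset (one past the
// terminator) or kNoCut, in which case the caller falls through to the
// clause/whitespace search.
std::size_t find_sentence_cut(const std::u32string& cps, std::size_t start,
                              std::size_t target, std::size_t hi_mult = 2) {
    const std::size_t lo = start + target / 2;
    const std::size_t hi = std::min(start + hi_mult * target, cps.size());
    for (std::size_t end = hi; end >= lo && end > start; --end) {
        if (is_sentence_end(cps[end - 1])) return end;
    }
    return kNoCut;
}
```

With a period 109 code points past start and target=50, a 3x window finds it and takes everything up to it as one chunk, while the 2x window returns kNoCut and lets the whitespace fallback split the run-on.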
…UPERTONIC_LOG_CHUNKS
Adds one line per chunk to the existing SUPERTONIC_LOG_CHUNKS env-var trace, showing the is_continuation flag the engine resolved before handing the chunk to run_single_chunk:
  chunk[0] (44 bytes): The quick brown fox jumps over the lazy dog.
  chunk[0] is_continuation=0
  chunk[1] (64 bytes): Then she said hello to the world, ...
  chunk[1] is_continuation=0
Useful for validating that the engine's per-chunk continuation detector and the chunker's boundary search agree on what counts as a sentence terminator across UTF-8. They share the same detail::is_sentence_end_cp table, but the engine reaches it via a UTF-8-decode of the final code point in the chunk string, so the two paths can in principle disagree on a malformed input. The log makes that observable in one place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
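The engine-side path (trim, decode the final UTF-8 code point, consult the shared predicate) could be sketched like this. The names is_sentence_end_cp_sketch, decode_last_cp and chunk_is_continuation are hypothetical; the table is limited to the terminators named in this thread, the decoder does not reject overlong encodings, and treating malformed input as a continuation is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the shared predicate, using terminators named in this PR.
bool is_sentence_end_cp_sketch(uint32_t cp) {
    switch (cp) {
        case '.': case '?': case '!':
        case 0x3002: case 0xFF1F: case 0xFF01:  // CJK full stop, fullwidth ? and !
        case 0x0964: case 0x0965:               // Devanagari danda, double danda
        case 0x06D4:                            // Urdu full stop
        case 0x203C: case 0x2047: case 0x2048: case 0x2049:  // double ! and ? forms
            return true;
        default:
            return false;
    }
}

// Decodes the last code point of s. Returns false on empty or malformed tails.
bool decode_last_cp(const std::string& s, uint32_t& cp) {
    if (s.empty()) return false;
    std::size_t i = s.size() - 1;
    int cont = 0;  // continuation bytes (10xxxxxx) walked over
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80 && cont < 3) {
        --i;
        ++cont;
    }
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    int need = 0;
    if (lead < 0x80) { need = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { need = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { need = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { need = 3; cp = lead & 0x07; }
    else return false;               // stray continuation or invalid lead byte
    if (need != cont) return false;  // truncated or over-long tail
    for (std::size_t k = i + 1; k < s.size(); ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[k]) & 0x3F);
    }
    return true;
}

// True when the chunk does not end on a sentence terminator.
bool chunk_is_continuation(std::string chunk) {
    while (!chunk.empty() && (chunk.back() == ' ' || chunk.back() == '\t' ||
                              chunk.back() == '\n' || chunk.back() == '\r')) {
        chunk.pop_back();
    }
    uint32_t cp = 0;
    if (!decode_last_cp(chunk, cp)) return true;  // empty or malformed: no terminator seen
    return !is_sentence_end_cp_sketch(cp);
}
```

The malformed-tail branch is the spot where the engine and the chunker could disagree, which is what the per-chunk log line is meant to expose.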
6582b0c to 16c2cd2
b220514 into supertonic_optimizations
Summary
Adds native streaming synthesis to the Supertonic Engine mirroring the chatterbox StreamCallback API. Splits text into chunks via a multilingual splitter, runs the full per-chunk pipeline on the resident model, and invokes a user callback synchronously with each chunk's PCM as it's produced. The returned SynthesisResult.pcm still contains the concatenated audio so existing batch callers are unaffected: the callback is an addition, not a replacement.
Based on supertonic_optimizations (which now includes #15's Metal port and #21's CPU regression fix).
Why this is chunked-pipeline streaming, not token-streamed-inside-one-utterance
Chatterbox achieves true token-level streaming because its model and training were designed for it: T3 emits audio-rate speech tokens, S3Gen uses causal convs, HiFT carries mel-frame cache + F0 phase across chunks, and the encoder lookahead trim cleans boundary effects.
Supertonic has none of that infrastructure. Single-stage pipeline, bidirectional attention over the full latent, per-utterance duration prediction, no cache continuity across forward passes, and trained on full sentences only. Forcing causal attention or chunked input at inference time produces audio the model never learned to generate (verified — sub-30-token stubs glitch on dropped/muddled phonemes regardless of preprocess tweaks).
So this PR ships what's actually achievable: sentence-aligned chunks for multi-sentence input (acoustically equivalent to batch), mid-clause/whitespace chunks for long single-sentence input where there's no other choice, and an is_continuation preprocess flag so the model isn't told "this is a complete sentence" when it isn't. Inter-chunk pauses and rate shifts at non-sentence seams are inherent to per-chunk synthesis on a non-streaming-trained model and can't be fixed at this layer.
What ships
src/supertonic_chunker.{h,cpp}: Unicode-aware multilingual splitter
- Sentence-end search window [target/2, 2*target] (wide). Catches long-but-reasonable first sentences in multi-sentence input but narrow enough that genuinely runaway sentences (>2× target without internal periods) fall through to whitespace so they still stream rather than dumping the whole tail as one chunk.
- Clause / whitespace fallback window [target ± tolerance_pct]. User-controlled.
- Punctuation tables: ASCII .?!, CJK 。？！, Devanagari । ॥, Urdu ۔, double ‼ ⁇ ⁈ ⁉ for sentences; ASCII / fullwidth / Arabic comma, semicolon, colon, closing brackets for clauses.
- stream_min_chunk_tokens (default 30) is a hard floor. Below that the model emits dropped/muddled phonemes on stub input. Effective targets are max(target, min). The sentence/clause/whitespace search lower bound is clamped to start + min so the chunker never proactively aims for a sub-minimum chunk.
- Tail-merge threshold is max(6, target/3) (16 for target=50), NOT the min_chunk floor: using min_chunk would swallow a complete final sentence on info-dense languages (e.g. Korean "공원에서 산책하기 좋은 날이다." is 18 cps, below the 30-cp floor, but is a perfectly valid sentence-aligned chunk).
- is_sentence_end_cp(uint32_t) predicate is declared in supertonic_chunker.h and reused by the engine's per-chunk continuation detector, so additions to the terminator set (Ethiopic ።, Tibetan ། …) live in exactly one place.
is_continuation flag through preprocess
- supertonic_preprocess_text and supertonic_text_to_ids accept an is_continuation bool. When true, the auto-appended terminal period is skipped; used by streaming for chunks that don't end on a natural sentence terminator. Avoids the original "park.K" trailing-phoneme bug where the model spoke a stub chunk as a complete sentence with falling intonation + tail artifacts.
Engine streaming path
- opts.seed for every chunk (no per-chunk perturbation; different chunks have different latent_len so noise tensors differ even with the same seed). Earlier opts.seed + k perturbation occasionally landed chunks on glitchy nearby seeds.
- is_continuation derived automatically by checking the chunk's trailing code point against the shared sentence-terminator set (ASCII + CJK + Devanagari + Urdu).
Engine API additions
supertonic-cli flags
- --stream-chunk-tokens N: target chunk size in text tokens (0 disables; 50 ≈ 1-3 s English audio)
- --stream-first-chunk-tokens N: smaller first-chunk override for first-audio latency
- --stream-chunk-tolerance-pct N: clause/whitespace boundary-snap window
- --stream-min-chunk-tokens N: hard floor on chunk size (default 30)
- --out - streams raw s16le PCM on stdout (one buffered fwrite per chunk). Pipe into ffplay -f s16le -ar 44100 -ch_layout mono -i - or sox -t raw -b 16 -e signed-integer -r 44100 -c 1 - -d.
- SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries AND per-chunk is_continuation; SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for debugging.
Test plan
- cmake --build tts-cpp/build --target supertonic-cli -j 8
- … .?!): chunker produces 5 evenly-sized chunks (30, 52, 54, 52, 55 cps) via whitespace fallback. All chunks above the min floor.
- … . for en/fr/pt, . and CJK 。 recognized for ko).
- … ffplay: chunk-by-chunk timing in stderr proves later chunks synthesize during earlier chunks' playback.
- … --n-gpu-layers 0 writes 10.07 s WAV in 1.09 s wall-time (~9× realtime), exit code 0, no abort. Stdout streaming on CPU produces byte-exact output (samples × 2). Multilingual sentence detection on CPU produces the same is_continuation=0/1 flags as Metal (en/fr/pt/ko verified; engine and chunker agree on the Unicode terminator predicate).
- /tmp/supertonic-validate-review-fixes.sh covers the de-dup correctness (incl. CJK 。 decode through engine → shared is_sentence_end_cp) and stdout-byte-exactness: 4/4 PASS.
- … opts.seed + k) occasionally landed on glitchy nearby seeds → now uses opts.seed everywhere.
- … min_chunk_tokens swallowed valid CJK trailing sentences → relaxed to max(6, target/3).
- … is_sentence_end_cp via header.
- … fwrite in stream_emit_pcm_stdout → buffered single fwrite per chunk.
Known limitations
- ggml-metal residency-set assertion at process exit (ggml-metal-device.m:612) fires after every Metal-backed run on this branch and on master. Unrelated to streaming; audio is written before exit.
- … --language zh rejected at preprocess; supported set: en, ko, es, pt, fr). Chunker handles CJK punctuation correctly; no model support yet.
- … synthesize_streaming's internals would change.
Files
- tts-cpp/include/tts-cpp/supertonic/engine.h
- tts-cpp/src/supertonic_engine.cpp
- tts-cpp/src/supertonic_cli.cpp
- tts-cpp/src/supertonic_preprocess.cpp
- tts-cpp/src/supertonic_internal.h
- tts-cpp/CMakeLists.txt
- tts-cpp/src/supertonic_chunker.h
- tts-cpp/src/supertonic_chunker.cpp
🤖 Generated with Claude Code