tts-cpp: supertonic Engine streaming via multilingual chunker + callback by ogad-tether · Pull Request #20 · tetherto/qvac-ext-lib-whisper.cpp

ogad-tether · 2026-05-15T11:09:13Z

Summary

Adds native streaming synthesis to the Supertonic Engine mirroring the chatterbox StreamCallback API. Splits text into chunks via a multilingual splitter, runs the full per-chunk pipeline on the resident model, and invokes a user callback synchronously with each chunk's PCM as it's produced. The returned SynthesisResult.pcm still contains the concatenated audio so existing batch callers are unaffected — the callback is an addition, not a replacement.

Based on supertonic_optimizations (which now includes #15's Metal port and #21's CPU regression fix).

Why this is chunked-pipeline streaming, not token-streamed-inside-one-utterance

Chatterbox achieves true token-level streaming because its model and training were designed for it: T3 emits audio-rate speech tokens, S3Gen uses causal convs, HiFT carries mel-frame cache + F0 phase across chunks, and the encoder lookahead trim cleans boundary effects.

Supertonic has none of that infrastructure. Single-stage pipeline, bidirectional attention over the full latent, per-utterance duration prediction, no cache continuity across forward passes, and trained on full sentences only. Forcing causal attention or chunked input at inference time produces audio the model never learned to generate (verified — sub-30-token stubs glitch on dropped/muddled phonemes regardless of preprocess tweaks).

So this PR ships what's actually achievable: sentence-aligned chunks for multi-sentence input (acoustically equivalent to batch), mid-clause/whitespace chunks for long single-sentence input where there's no other choice, and an is_continuation preprocess flag so the model isn't told "this is a complete sentence" when it isn't. Inter-chunk pauses and rate shifts at non-sentence seams are inherent to per-chunk synthesis on a non-streaming-trained model and can't be fixed at this layer.

What ships

src/supertonic_chunker.{h,cpp} — Unicode-aware multilingual splitter

Two-window boundary search:
- Sentence-end search: [target/2, 2*target] (wide). Wide enough to catch long-but-reasonable first sentences in multi-sentence input, but narrow enough that genuinely runaway sentences (>2× target without internal periods) fall through to whitespace, so they still stream rather than dumping the whole tail as one chunk.
- Clause / whitespace fallback: [target ± tolerance_pct]. User-controlled.
Punctuation tables: ASCII .?!, CJK 。？！, Devanagari ।॥, Urdu ۔, double ‼⁇⁈⁉ for sentences; ASCII / fullwidth / Arabic comma, semicolon, colon, closing brackets for clauses.
Whitespace fallback covers CJK / Thai / Lao / Khmer where punctuation may be absent or sparse.
stream_min_chunk_tokens (default 30) is a hard floor. Below that the model emits dropped/muddled phonemes on stub input. Effective targets are max(target, min). The sentence/clause/whitespace search lower bound is clamped to start + min so the chunker never proactively aims for a sub-minimum chunk.
Tail-merge uses the chatterbox heuristic max(6, target/3) (16 for target=50), NOT the min_chunk floor. Using min_chunk would swallow a complete final sentence on info-dense languages (e.g. Korean "공원에서 산책하기 좋은 날이다." is 18 code points, below the 30-code-point floor, but is a perfectly valid sentence-aligned chunk).
Shared terminator table. The is_sentence_end_cp(uint32_t) predicate is declared in supertonic_chunker.h and reused by the engine's per-chunk continuation detector — additions to the terminator set (Ethiopic ።, Tibetan ། …) live in exactly one place.
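A minimal sketch of how the shared predicate and the engine-side continuation check could fit together. The two function names come from this PR; the bodies are an illustration only, assuming ASCII-whitespace trimming and a hand-rolled UTF-8 tail decode, and are not the merged implementation (which, per the commit log, lives under a detail namespace).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Shared predicate: true for the sentence terminators this PR lists.
inline bool is_sentence_end_cp(std::uint32_t cp) {
    switch (cp) {
        case 0x0021: case 0x002E: case 0x003F:               // ASCII ! . ?
        case 0x3002: case 0xFF01: case 0xFF1F:               // CJK 。 ！ ？
        case 0x0964: case 0x0965:                            // Devanagari । ॥
        case 0x06D4:                                         // Urdu ۔
        case 0x203C: case 0x2047: case 0x2048: case 0x2049:  // ‼ ⁇ ⁈ ⁉
            return true;
        default:
            return false;
    }
}

// Engine side (illustrative body): trim trailing ASCII whitespace, decode the
// final UTF-8 code point, then ask the shared predicate. Malformed tails
// return false, i.e. the chunk is treated as a continuation.
inline bool chunk_ends_with_sentence_term(const std::string & s) {
    std::size_t end = s.size();
    while (end > 0 && (s[end - 1] == ' ' || s[end - 1] == '\t' ||
                       s[end - 1] == '\n' || s[end - 1] == '\r')) {
        --end;
    }
    if (end == 0) return false;

    // Step back over at most three continuation bytes (10xxxxxx).
    std::size_t i = end - 1;
    while (i > 0 && end - i < 4 &&
           (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
        --i;
    }
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::uint32_t cp = 0;
    std::size_t len = 0;
    if (lead < 0x80)                { cp = lead;        len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else return false;                 // stray continuation or invalid lead byte
    if (i + len != end) return false;  // sequence length does not match the tail
    for (std::size_t k = i + 1; k < end; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[k]) & 0x3F);
    }
    return is_sentence_end_cp(cp);
}

// Usage: the bytes E3 80 82 encode U+3002 (。) in UTF-8.
inline void demo_continuation_detection() {
    assert(chunk_ends_with_sentence_term("It rained all day. "));
    assert(!chunk_ends_with_sentence_term("a walk in the park"));
    assert(chunk_ends_with_sentence_term("\xE3\x81\x82\xE3\x80\x82"));
}
```

Keeping the table in one switch means a new terminator is a one-line change that both the chunker and the engine pick up.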

is_continuation flag through preprocess

supertonic_preprocess_text and supertonic_text_to_ids accept an is_continuation bool. When true, the auto-appended terminal period is skipped — used by streaming for chunks that don't end on a natural sentence terminator. Avoids the original "park.K" trailing-phoneme bug where the model spoke a stub chunk as a complete sentence with falling intonation + tail artifacts.
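The effect of the flag can be sketched in isolation. finalize_chunk_text below is a hypothetical helper, not a function from this PR; it assumes the only step of interest is the terminal-period append and, for brevity, checks ASCII terminators only. The real supertonic_preprocess_text does more than this.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the "auto-append terminal period" step.
// When is_continuation is true the text is passed through without a period,
// so a mid-clause chunk is not presented to the model as a full sentence.
inline std::string finalize_chunk_text(std::string text, bool is_continuation) {
    // Assumption: trailing ASCII whitespace is dropped before the check.
    while (!text.empty() &&
           (text.back() == ' ' || text.back() == '\t' || text.back() == '\n')) {
        text.pop_back();
    }
    if (text.empty() || is_continuation) return text;
    const char last = text.back();
    if (last != '.' && last != '?' && last != '!') text.push_back('.');
    return text;
}

// Usage: the same stub chunk with and without the flag.
inline void demo_is_continuation() {
    assert(finalize_chunk_text("We walked in the park", false) == "We walked in the park.");
    assert(finalize_chunk_text("We walked in the park", true)  == "We walked in the park");
    assert(finalize_chunk_text("Was it far?", false) == "Was it far?");
}
```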

Engine streaming path

Single opts.seed for every chunk (no per-chunk perturbation; different chunks have different latent_len so noise tensors differ even with the same seed). Earlier opts.seed + k perturbation occasionally landed chunks on glitchy nearby seeds.
Per-chunk is_continuation derived automatically by checking the chunk's trailing code point against the shared sentence-terminator set (ASCII + CJK + Devanagari + Urdu).
10 ms raised-cosine anti-click fade on inter-chunk seams only. First chunk start and last chunk end stay untouched so streamed output is acoustically equivalent to batch at the endpoints.
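A sketch of what a seam-only raised-cosine fade could look like. apply_seam_fades is a hypothetical name, the 44100 Hz default is taken from the playback commands in this description, and applying a fade-out to the end of one chunk plus a fade-in to the start of the next (rather than a crossfade) is an assumption about the shape of the fix.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fades the start of every chunk except the first and the end of every chunk
// except the last, so utterance endpoints match batch output.
inline void apply_seam_fades(std::vector<float> & pcm, bool is_first, bool is_last,
                             int sample_rate = 44100, float fade_ms = 10.0f) {
    assert(sample_rate > 0 && fade_ms >= 0.0f);
    const std::size_t want = static_cast<std::size_t>(sample_rate * fade_ms / 1000.0f);
    const std::size_t n = std::min(want, pcm.size() / 2);  // ramps never overlap
    const float pi = 3.14159265358979f;
    for (std::size_t i = 0; i < n; ++i) {
        // Raised-cosine gain: 0 at i = 0, approaching 1 as i approaches n.
        const float g = 0.5f * (1.0f - std::cos(pi * static_cast<float>(i) /
                                                 static_cast<float>(n)));
        if (!is_first) pcm[i] *= g;                   // fade-in at the seam
        if (!is_last)  pcm[pcm.size() - 1 - i] *= g;  // fade-out at the seam
    }
}
```

At 44.1 kHz a 10 ms ramp is 441 samples, short enough to be inaudible as a level change but long enough to remove the step discontinuity that causes a click.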

Engine API additions

using StreamCallback = std::function<void(
    const float * pcm, std::size_t samples, int chunk_index, bool is_last)>;

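struct EngineOptions {
    // ...
    int stream_chunk_tokens        = 0;   // target chunk size in text tokens; 0 disables
    int stream_first_chunk_tokens  = 0;   // smaller first-chunk override for first-audio latency
    int stream_chunk_tolerance_pct = 20;  // clause/whitespace boundary-snap window (target ± pct)
    int stream_min_chunk_tokens    = 30;  // hard floor on chunk size
};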

SynthesisResult synthesize(const std::string & text,
                           const StreamCallback & on_chunk);
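To illustrate the callback contract (synchronous per-chunk delivery, with the return value still holding the concatenated audio), here is a self-contained sketch. MockEngine and PcmSink are hypothetical stand-ins, the SynthesisResult layout is assumed to be a plain float vector, and the fake "synthesis" just emits one constant-valued buffer per period-terminated sentence. Only the StreamCallback signature is taken from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using StreamCallback = std::function<void(
    const float * pcm, std::size_t samples, int chunk_index, bool is_last)>;

// Assumed shape: the PR only states that SynthesisResult has a pcm member.
struct SynthesisResult { std::vector<float> pcm; };

// Stand-in engine, NOT the real one: one chunk per '.'-terminated sentence,
// with chunk k rendered as chunk-length samples of value k.
struct MockEngine {
    SynthesisResult synthesize(const std::string & text, const StreamCallback & on_chunk) {
        assert(static_cast<bool>(on_chunk));
        std::vector<std::string> chunks;
        std::string cur;
        for (char c : text) {
            cur.push_back(c);
            if (c == '.') { chunks.push_back(cur); cur.clear(); }
        }
        if (!cur.empty()) chunks.push_back(cur);

        SynthesisResult result;
        for (std::size_t k = 0; k < chunks.size(); ++k) {
            const std::vector<float> pcm(chunks[k].size(), static_cast<float>(k));
            // Callback fires synchronously, before the next chunk is produced.
            on_chunk(pcm.data(), pcm.size(), static_cast<int>(k), k + 1 == chunks.size());
            result.pcm.insert(result.pcm.end(), pcm.begin(), pcm.end());
        }
        return result;
    }
};

// Consumer: copies each chunk, assuming the pointer is only valid during the call.
struct PcmSink {
    std::vector<float> pcm;
    int chunks_seen = 0;
    bool saw_last = false;
    StreamCallback callback() {
        return [this](const float * p, std::size_t n, int idx, bool is_last) {
            pcm.insert(pcm.end(), p, p + n);
            chunks_seen = idx + 1;
            saw_last = is_last;
        };
    }
};
```

A caller that only wants the final buffer can ignore the callback arguments; a caller that plays audio incrementally can ignore the return value.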

supertonic-cli flags

--stream-chunk-tokens N — target chunk size in text tokens (0 disables; 50 ≈ 1-3 s English audio)
--stream-first-chunk-tokens N — smaller first-chunk override for first-audio latency
--stream-chunk-tolerance-pct N — clause/whitespace boundary-snap window
--stream-min-chunk-tokens N — hard floor on chunk size (default 30)
--out - streams raw s16le PCM on stdout (one buffered fwrite per chunk). Pipe into ffplay -f s16le -ar 44100 -ch_layout mono -i - or sox -t raw -b 16 -e signed-integer -r 44100 -c 1 - -d.
SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries AND per-chunk is_continuation; SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for debugging.

Test plan

Known limitations

The pre-existing ggml-metal residency-set assertion at process exit (ggml-metal-device.m:612) fires after every Metal-backed run on this branch and on master. Unrelated to streaming; audio is written before exit.
CJK languages other than Korean aren't accepted by the current Supertonic model GGUF (--language zh rejected at preprocess; supported set: en, ko, es, pt, fr). Chunker handles CJK punctuation correctly; no model support yet.
Mid-sentence streaming has audible seam artifacts (small pauses + rate shifts) in all languages. This is architectural — the duration predictor and attention run per-chunk on a non-streaming-trained model. The continuation flag avoids the artificial-period failure mode but can't share prosody across chunks. The API we ship carries forward unchanged if/when a streaming-trained Supertonic appears — only synthesize_streaming's internals would change.

Files

Type	Path
Edit	`tts-cpp/include/tts-cpp/supertonic/engine.h`
Edit	`tts-cpp/src/supertonic_engine.cpp`
Edit	`tts-cpp/src/supertonic_cli.cpp`
Edit	`tts-cpp/src/supertonic_preprocess.cpp`
Edit	`tts-cpp/src/supertonic_internal.h`
Edit	`tts-cpp/CMakeLists.txt`
New	`tts-cpp/src/supertonic_chunker.h`
New	`tts-cpp/src/supertonic_chunker.cpp`

🤖 Generated with Claude Code

GustavoA1604

Cursor flagged these two things, could you check if they make sense?

Duplicated sentence-terminator code point table (supertonic_engine.cpp + supertonic_chunker.cpp) — chunk_ends_with_sentence_term() in the engine and is_sentence_end_cp() in the chunker are identical switch statements. If the set grows (e.g. Ethiopic ።, Tibetan །) only one copy will be updated. The fix is to un-anonymize is_sentence_end_cp in the chunker and have the engine call it.

stream_emit_pcm_stdout writes one sample at a time (supertonic_cli.cpp) — It calls fwrite(&v, 2, 1, stdout) in a per-sample loop, meaning ~44k–132k syscall-adjacent calls per chunk. Should build a std::vector<int16_t> and write the whole buffer in one fwrite.

ogad-tether · 2026-05-18T11:38:25Z

Both fixes are in 72b9d561. Thanks for catching these — both were real issues.

1. De-duplicated terminator table. is_sentence_end_cp() is now declared in supertonic_chunker.h and called from the engine's chunk_ends_with_sentence_term(). The engine still owns the UTF-8 trim/decode (its input is a std::string, not a raw code point), but the predicate + multilingual table live in exactly one place. Future additions (Ethiopic ።, Tibetan །, etc.) now need one edit instead of two, and the older Cursor-flagged code path can't drift out of sync.

2. Buffered stdout write. stream_emit_pcm_stdout now builds the chunk's int16_t buffer once and writes it with a single fwrite, then flushes. No semantic change to the bytes on stdout — verified post-fix: 59487 samples produces exactly 118974 stdout bytes (samples × 2), same as before. Just removes the ~44k–132k tiny fwrite calls per chunk.

Both verified on smoke: multi-sentence chunker still gives 3 sentence-aligned chunks (no regression in the predicate), stdout streaming byte count is exact.
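For reference, a sketch of the buffered write pattern described in fix 2. The helper names are hypothetical, and the float-to-int16 conversion (clamp to [-1, 1], scale by 32767, round) is an assumption, since the thread does not spell out the conversion used by stream_emit_pcm_stdout. The sketch writes host-endian samples, which is s16le only on little-endian hosts.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Convert one chunk of float PCM to 16-bit samples in a single pass.
inline std::vector<std::int16_t> pcm_float_to_s16(const float * pcm, std::size_t samples) {
    std::vector<std::int16_t> out(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        const float v = std::max(-1.0f, std::min(1.0f, pcm[i]));  // assumed clamp
        out[i] = static_cast<std::int16_t>(std::lrint(v * 32767.0f));
    }
    return out;
}

// One fwrite per chunk instead of one per sample, then flush so a downstream
// player sees the chunk immediately. Returns the number of bytes written.
inline std::size_t emit_pcm_s16(std::FILE * fp, const float * pcm, std::size_t samples) {
    assert(fp != nullptr);
    const std::vector<std::int16_t> buf = pcm_float_to_s16(pcm, samples);
    const std::size_t written = std::fwrite(buf.data(), sizeof(std::int16_t), buf.size(), fp);
    std::fflush(fp);
    return written * sizeof(std::int16_t);
}
```

The byte count is samples × 2 on a successful write, matching the 59487 samples to 118974 bytes check quoted above.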

Mirrors the chatterbox StreamCallback API: a second synthesize() overload takes an on_chunk callback that receives PCM chunk-by-chunk while the returned SynthesisResult still accumulates the full audio (callback is an addition, not a replacement).

Supertonic's vector estimator is non-autoregressive (5-step CFM denoise over the full duration-predicted latent), so the chatterbox token-level streaming pattern doesn't transfer. Instead this splits text into sentence-aligned chunks and runs the full pipeline per chunk:

- New src/supertonic_chunker.{h,cpp}: Unicode-aware splitter. Sentence-end gets a wide implicit search window (target/2..3*target) because sentence prosody dominates audio quality on this model: chunks cut mid-clause receive an artificial trailing period from preprocess and the model emits muddled / dropped words in response. Clause and whitespace fallbacks use the user-supplied tolerance.
- Multilingual punctuation tables: ASCII .?! plus CJK fullwidth, double exclamation/question, Devanagari danda, Urdu full stop for sentences; ASCII / fullwidth / Arabic comma, semicolon, colon and closing brackets for clauses. Whitespace fallback handles CJK / Thai / Lao / Khmer where punctuation may be absent.
- Engine streaming path runs the full pipeline per chunk with opts.seed (no per-chunk perturbation; different chunks have different latent_len so noise tensors differ even with the same seed, and an earlier per-chunk seed bump occasionally landed chunks on nearby seeds where the model produces phantom-phoneme tail artifacts).
- 10 ms raised-cosine anti-click fade on inter-chunk seams only. First chunk start and last chunk end stay untouched so streamed output is acoustically equivalent to batch at the endpoints.
- CLI gains --stream-chunk-tokens / --stream-first-chunk-tokens / --stream-chunk-tolerance-pct flags. --out - streams raw s16le PCM on stdout for incremental playback (pipe into ffplay / sox -d). SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries; SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for debugging.

Validated end-to-end at ~35x realtime on M2 Metal: streamed output is acoustically equivalent to batch on the same seed; first audio drops in ~1 s for an 18 s utterance instead of waiting the full ~4-5 s for batch synth to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two empirically-driven additions on top of the sentence-aligned chunker:

1. is_continuation flag through supertonic_preprocess_text + supertonic_text_to_ids. When the engine produces a mid-clause / mid-word chunk during streaming, the preprocess skips its auto-appended terminal period. Without the flag the model spoke stub chunks as complete sentences with falling intonation and trailing-phoneme artifacts (the original "park.K" tail bug). The engine detects per-chunk whether the chunk ends on a natural sentence terminator (ASCII .?! plus CJK / Devanagari / Urdu equivalents) and passes through the flag accordingly.
2. stream_min_chunk_tokens (default 30) on EngineOptions. Below ~30 tokens the model emits dropped / muddled phonemes on stub input regardless of the continuation flag (verified on multiple seeds and texts; short text is a model-level failure mode, not a preprocess one). The chunker treats min_chunk_tokens as a hard floor: effective target = max(target, min), the sentence/clause/whitespace search lower bound is clamped to start + min, and any trailing chunk below the floor is merged into its predecessor.

The min floor is the practical ceiling on what Option A streaming can achieve. True seam-free streaming inside one utterance would require model retraining (causal attention, per-token duration, mel-frame cache continuity: the bits chatterbox has by design but supertonic was not trained for). Documenting that as the trade-off honestly rather than papering over it.

Behavior:
- Multi-sentence input → sentence-aligned chunks (the v1 behavior). Acoustically equivalent to batch on the same seed.
- Long single-sentence input → multi-chunk output at the min floor, each chunk passed to the model without an artificial terminal period. Inter-chunk pauses and rate shifts are inherent to per-chunk synthesis on a non-streaming-trained model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reshold

Tail-merge was using min_chunk_tokens (30) as its threshold, which on languages denser than English (CJK in particular) merged the last chunk into the previous one even when that last chunk was a complete sentence. Concrete: Korean "공원에서 산책하기 좋은 날이다." is 18 code points, below the 30-cp floor, so the merger folded it into the previous chunk, which contained TWO sentences, producing a single 172-byte chunk for the whole utterance and zero streaming benefit.

Switch to chatterbox_engine.cpp:608's heuristic: tail_thresh = max(6, target_tokens/3) (16 for target=50). Genuinely tiny stubs (<16 cps) still merge; real sentence chunks stay independent. The min_chunk_tokens floor governs what the chunker proactively *aims for* during iteration, not what it does with whatever's left after the last natural boundary.

Verified: Korean 3-sentence text now chunks into 2 (first chunk spans 2 sentences due to first-sentence-below-min-floor, last sentence stays separate at 18 cps). English 3-sentence test stays at 3 sentence-aligned chunks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
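The tail-merge rule can be sketched as follows. merge_short_tail and count_code_points are hypothetical names; only the threshold formula max(6, target_tokens/3) comes from the commit message, and counting code points by skipping UTF-8 continuation bytes is an assumption about how chunk length is measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts Unicode code points in a UTF-8 string by skipping continuation bytes.
inline std::size_t count_code_points(const std::string & s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Folds the final chunk into its predecessor only when it is a tiny stub,
// i.e. shorter than max(6, target_tokens / 3) code points.
inline void merge_short_tail(std::vector<std::string> & chunks, int target_tokens) {
    assert(target_tokens > 0);
    if (chunks.size() < 2) return;
    const std::size_t thresh = static_cast<std::size_t>(std::max(6, target_tokens / 3));
    if (count_code_points(chunks.back()) < thresh) {
        chunks[chunks.size() - 2] += chunks.back();
        chunks.pop_back();
    }
}
```

With target_tokens = 50 the threshold is 16, so an 18-code-point closing sentence stays its own chunk, whereas a 30-code-point threshold would have merged it.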

The 3x sentence-search window slurped runaway-sentence tails as one huge "sentence-aligned" chunk: a 245-char single sentence with the final period 109 chars past start was found by the wide window, so chunker took the whole remainder as chunk[3] instead of falling through to whitespace and producing multiple sub-sentence chunks.

2x is still wide enough to catch a long-but-reasonable first sentence in multi-sentence input (covers up to ~90 chars at target=50, ample for typical English / French / Portuguese sentences) but narrow enough that genuinely runaway sentences (>2x target with no internal periods) fall through to whitespace and stream.

Empirical: same 245-char English run-on now produces 5 evenly-sized chunks (30, 52, 54, 52, 56) instead of 4 with the tail-blob (30, 52, 54, 109). Multi-sentence test unchanged (still 3 sentence-aligned chunks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
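A sketch of the window arithmetic after this change, read literally from the description: sentence-end window [target/2, 2*target], fallback window target ± tolerance_pct, effective target max(target, min), and lower bounds clamped to start + min. ChunkWindows and compute_windows are hypothetical names, and measuring every bound as an offset from the chunk start is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Absolute search bounds for the next chunk boundary, in the same unit the
// chunker counts in (hypothetical struct, for illustration only).
struct ChunkWindows {
    std::size_t sentence_lo, sentence_hi;  // sentence-end search window
    std::size_t fallback_lo, fallback_hi;  // clause / whitespace fallback window
};

inline ChunkWindows compute_windows(std::size_t start, int target_tokens,
                                    int tolerance_pct, int min_chunk_tokens) {
    assert(target_tokens > 0 && min_chunk_tokens >= 0);
    assert(tolerance_pct >= 0 && tolerance_pct <= 100);
    // Effective target is never below the hard floor.
    const std::size_t target =
        static_cast<std::size_t>(std::max(target_tokens, min_chunk_tokens));
    const std::size_t floor_pos = start + static_cast<std::size_t>(min_chunk_tokens);
    const std::size_t delta = target * static_cast<std::size_t>(tolerance_pct) / 100;

    ChunkWindows w{};
    w.sentence_lo = std::max(start + target / 2, floor_pos);      // clamp to floor
    w.sentence_hi = start + 2 * target;                           // 2x, not 3x
    w.fallback_lo = std::max(start + target - delta, floor_pos);  // clamp to floor
    w.fallback_hi = start + target + delta;
    return w;
}
```

For target 50, tolerance 20 and floor 30 starting at offset 0, this gives a sentence window of [30, 100] and a fallback window of [40, 60]; a period 109 positions in falls outside the sentence window, so the search drops to the fallback.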

…ered stdout)

Two review-comment fixes from PR #20:

1. De-duplicated the sentence-terminator code-point table between supertonic_chunker.cpp's is_sentence_end_cp() and the engine's chunk_ends_with_sentence_term(). is_sentence_end_cp() is now declared in supertonic_chunker.h and called from the engine's per-chunk continuation detector; the engine still owns the UTF-8 trim/decode logic, but the predicate (and its multilingual table) live in one place. Adding Ethiopic ።, Tibetan ། or any other terminator now needs one edit, not two.
2. stream_emit_pcm_stdout was doing a per-sample fwrite(&v, 2, 1, stdout) loop: ~44k-132k syscall-adjacent calls per chunk. Build the chunk's int16 buffer once and write it in a single fwrite; flush after. No semantic change to the bytes on stdout; just throughput.

Verified: multi-sentence chunker still produces 3 sentence-aligned chunks (unchanged); stdout streaming byte count still equals samples * 2 exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…UPERTONIC_LOG_CHUNKS

Adds one line per chunk to the existing SUPERTONIC_LOG_CHUNKS env-var trace, showing the is_continuation flag the engine resolved before handing the chunk to run_single_chunk:

    chunk[0] (44 bytes): The quick brown fox jumps over the lazy dog.
    chunk[0] is_continuation=0
    chunk[1] (64 bytes): Then she said hello to the world, ...
    chunk[1] is_continuation=0

Useful for validating that the engine's per-chunk continuation detector and the chunker's boundary search agree on what counts as a sentence terminator across UTF-8: they share the same detail::is_sentence_end_cp table, but the engine reaches it via a UTF-8-decode of the final code point in the chunk string, so the two paths can in principle disagree on a malformed input. The log makes that observable in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ogad-tether mentioned this pull request May 15, 2026

QVAC-18966 [TTS GGML] Fix CPU regression #21

Merged

GustavoA1604 requested changes May 15, 2026

ogad-tether and others added 6 commits May 18, 2026 13:00

ogad-tether force-pushed the feat/supertonic-streaming branch from 6582b0c to 16c2cd2 Compare May 18, 2026 14:03

ogad-tether marked this pull request as ready for review May 18, 2026 14:04

ogad-tether requested review from a team as code owners May 18, 2026 14:04

GustavoA1604 approved these changes May 18, 2026

GustavoA1604 merged commit b220514 into supertonic_optimizations May 18, 2026
54 of 60 checks passed

gianni-cor deleted the feat/supertonic-streaming branch May 28, 2026 13:44
