
tts-cpp: supertonic Engine streaming via multilingual chunker + callback #20

Merged
GustavoA1604 merged 6 commits into supertonic_optimizations from feat/supertonic-streaming
May 18, 2026
ogad-tether commented May 15, 2026

Summary

Adds native streaming synthesis to the Supertonic Engine mirroring the chatterbox StreamCallback API. Splits text into chunks via a multilingual splitter, runs the full per-chunk pipeline on the resident model, and invokes a user callback synchronously with each chunk's PCM as it's produced. The returned SynthesisResult.pcm still contains the concatenated audio so existing batch callers are unaffected — the callback is an addition, not a replacement.

Based on supertonic_optimizations (which now includes #15's Metal port and #21's CPU regression fix).

Why this is chunked-pipeline streaming, not token-streamed-inside-one-utterance

Chatterbox achieves true token-level streaming because its model and training were designed for it: T3 emits audio-rate speech tokens, S3Gen uses causal convs, HiFT carries mel-frame cache + F0 phase across chunks, and the encoder lookahead trim cleans boundary effects.

Supertonic has none of that infrastructure. Single-stage pipeline, bidirectional attention over the full latent, per-utterance duration prediction, no cache continuity across forward passes, and trained on full sentences only. Forcing causal attention or chunked input at inference time produces audio the model never learned to generate (verified — sub-30-token stubs glitch on dropped/muddled phonemes regardless of preprocess tweaks).

So this PR ships what's actually achievable: sentence-aligned chunks for multi-sentence input (acoustically equivalent to batch), mid-clause/whitespace chunks for long single-sentence input where there's no other choice, and an is_continuation preprocess flag so the model isn't told "this is a complete sentence" when it isn't. Inter-chunk pauses and rate shifts at non-sentence seams are inherent to per-chunk synthesis on a non-streaming-trained model and can't be fixed at this layer.

What ships

src/supertonic_chunker.{h,cpp} — Unicode-aware multilingual splitter

  • Two-window boundary search:
    • Sentence-end search: [target/2, 2*target] (wide). Wide enough to catch long-but-reasonable first sentences in multi-sentence input, but narrow enough that genuinely runaway sentences (>2× target without internal periods) fall through to whitespace, so they still stream rather than dumping the whole tail as one chunk.
    • Clause / whitespace fallback: [target ± tolerance_pct]. User-controlled.
  • Punctuation tables: ASCII .?!, CJK 。？！, Devanagari ।॥, Urdu ۔, double ‼⁇⁈⁉ for sentences; ASCII / fullwidth / Arabic comma, semicolon, colon, closing brackets for clauses.
  • Whitespace fallback covers CJK / Thai / Lao / Khmer where punctuation may be absent or sparse.
  • stream_min_chunk_tokens (default 30) is a hard floor. Below that the model emits dropped/muddled phonemes on stub input. Effective targets are max(target, min). The sentence/clause/whitespace search lower bound is clamped to start + min so the chunker never proactively aims for a sub-minimum chunk.
  • Tail-merge uses the chatterbox heuristic max(6, target/3) (16 for target=50), NOT the min_chunk floor: using min_chunk would swallow a complete final sentence on info-dense languages (e.g. Korean "공원에서 산책하기 좋은 날이다." is 18 code points (cps), below the 30-cp floor, but is a perfectly valid sentence-aligned chunk).
  • Shared terminator table. The is_sentence_end_cp(uint32_t) predicate is declared in supertonic_chunker.h and reused by the engine's per-chunk continuation detector — additions to the terminator set (Ethiopic ።, Tibetan ། …) live in exactly one place.
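
The window arithmetic in the bullets above can be sketched as follows. This is a minimal illustration, not the code in supertonic_chunker.cpp: `ChunkWindows`, `compute_windows` and `tail_merge_threshold` are hypothetical names, and applying the 2× bound to the effective target (rather than the raw target) is an assumption. Only the formulas themselves come from the description.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical container for the two search windows, as absolute offsets
// (in code points) from the beginning of the text.
struct ChunkWindows {
    std::size_t sentence_lo, sentence_hi;  // wide sentence-end window
    std::size_t fallback_lo, fallback_hi;  // clause / whitespace window
};

// Assumes target, min_tokens and tolerance_pct are non-negative.
inline ChunkWindows compute_windows(std::size_t start, int target,
                                    int min_tokens, int tolerance_pct) {
    const int eff  = std::max(target, min_tokens);            // effective target = max(target, min)
    const int s_lo = std::max(eff / 2, min_tokens);           // lower bound clamped to start + min
    const int s_hi = 2 * eff;                                 // sentence window upper bound
    const int f_lo = std::max(eff * (100 - tolerance_pct) / 100, min_tokens);
    const int f_hi = eff * (100 + tolerance_pct) / 100;       // target + tolerance
    return { start + static_cast<std::size_t>(s_lo),
             start + static_cast<std::size_t>(s_hi),
             start + static_cast<std::size_t>(f_lo),
             start + static_cast<std::size_t>(f_hi) };
}

// Tail-merge threshold: a trailing chunk shorter than this is merged into
// its predecessor. Deliberately independent of the min-chunk floor.
inline int tail_merge_threshold(int target) {
    return std::max(6, target / 3);
}
```

With target=50, min=30 and tolerance 20 %, the sentence window starts 30 code points past `start` (the floor wins over target/2 = 25) and ends at 100, while the fallback window is [40, 60]. An 18-code-point final sentence stays above the tail threshold of 16 and is therefore kept as its own chunk.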

is_continuation flag through preprocess

  • supertonic_preprocess_text and supertonic_text_to_ids accept an is_continuation bool. When true, the auto-appended terminal period is skipped — used by streaming for chunks that don't end on a natural sentence terminator. Avoids the original "park.K" trailing-phoneme bug where the model spoke a stub chunk as a complete sentence with falling intonation + tail artifacts.
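
To illustrate the flag's effect, here is a minimal sketch of the period-skipping rule. `finalize_chunk_text` is a hypothetical helper, not the real supertonic_preprocess_text (which does more normalization), and it only checks ASCII terminators for brevity.

```cpp
#include <string>

// Hypothetical helper: append a terminal period only when the chunk is a
// complete utterance (is_continuation == false) and does not already end
// on an ASCII sentence terminator.
inline std::string finalize_chunk_text(std::string text, bool is_continuation) {
    // Trim trailing ASCII whitespace so the last character is meaningful.
    while (!text.empty() &&
           (text.back() == ' ' || text.back() == '\t' || text.back() == '\n')) {
        text.pop_back();
    }
    if (text.empty() || is_continuation) {
        return text;  // mid-sentence chunk: leave the ending untouched
    }
    const char last = text.back();
    const bool has_terminator = (last == '.' || last == '?' || last == '!');
    if (!has_terminator) {
        text.push_back('.');  // batch behaviour: auto-append terminal period
    }
    return text;
}
```

A mid-clause chunk such as "a walk in the park" passed with is_continuation=true is returned unchanged, so the model is not prompted to close the sentence with falling intonation.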

Engine streaming path

  • Single opts.seed for every chunk (no per-chunk perturbation; different chunks have different latent_len so noise tensors differ even with the same seed). Earlier opts.seed + k perturbation occasionally landed chunks on glitchy nearby seeds.
  • Per-chunk is_continuation derived automatically by checking the chunk's trailing code point against the shared sentence-terminator set (ASCII + CJK + Devanagari + Urdu).
  • 10 ms raised-cosine anti-click fade on inter-chunk seams only. First chunk start and last chunk end stay untouched so streamed output is acoustically equivalent to batch at the endpoints.
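
The seam fade can be sketched like this. It is an illustration under assumptions, not the engine's code: `fade_length` and `apply_seam_fades` are hypothetical names, the gain curve 0.5·(1 − cos(πi/n)) is one common raised-cosine form, and fading both sides of each seam is inferred from "inter-chunk seams only".

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Samples covered by a fade of fade_ms, capped at half the chunk so the
// fade-in and fade-out regions never overlap.
inline std::size_t fade_length(std::size_t chunk_samples, int sample_rate, float fade_ms) {
    const std::size_t n =
        static_cast<std::size_t>(std::lround(sample_rate * fade_ms / 1000.0f));
    return std::min(n, chunk_samples / 2);
}

// Fade in the head unless this is the first chunk, and fade out the tail
// unless this is the last chunk, so the utterance endpoints stay untouched.
inline void apply_seam_fades(std::vector<float> & pcm, int sample_rate,
                             bool is_first, bool is_last, float fade_ms = 10.0f) {
    const std::size_t n = fade_length(pcm.size(), sample_rate, fade_ms);
    if (n == 0) return;
    const double kPi = 3.14159265358979323846;
    for (std::size_t i = 0; i < n; ++i) {
        // Gain rises smoothly from 0 at i == 0 towards 1 at i == n.
        const float g = static_cast<float>(
            0.5 * (1.0 - std::cos(kPi * static_cast<double>(i) / static_cast<double>(n))));
        if (!is_first) pcm[i] *= g;
        if (!is_last)  pcm[pcm.size() - 1 - i] *= g;
    }
}
```

At 44100 Hz a 10 ms fade covers 441 samples per side, which is short enough to leave words intact while removing the step discontinuity at a chunk boundary.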

Engine API additions

using StreamCallback = std::function<void(
    const float * pcm, std::size_t samples, int chunk_index, bool is_last)>;

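struct EngineOptions {
    // ...
    int stream_chunk_tokens        = 0;   // target chunk size in text tokens; 0 disables streaming
    int stream_first_chunk_tokens  = 0;   // smaller first-chunk override for first-audio latency
    int stream_chunk_tolerance_pct = 20;  // clause/whitespace boundary-snap window around the target
    int stream_min_chunk_tokens    = 30;  // hard floor on chunk size
};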

SynthesisResult synthesize(const std::string & text,
                           const StreamCallback & on_chunk);
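
A caller-side sketch of the callback contract, under stated assumptions: the `StreamCallback` alias is copied from above, while `PcmCollector` and `emit_fake_chunks` are hypothetical. `emit_fake_chunks` only stands in for the engine so the example is self-contained. The sketch copies each chunk because it assumes the `pcm` pointer is valid only for the duration of the synchronous call.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using StreamCallback = std::function<void(
    const float * pcm, std::size_t samples, int chunk_index, bool is_last)>;

// Hypothetical consumer: accumulates streamed chunks and tracks progress.
struct PcmCollector {
    std::vector<float> pcm;
    int  chunks   = 0;
    bool finished = false;

    StreamCallback callback() {
        return [this](const float * data, std::size_t samples,
                      int chunk_index, bool is_last) {
            pcm.insert(pcm.end(), data, data + samples);  // copy before returning
            chunks   = chunk_index + 1;
            finished = is_last;
        };
    }
};

// Stand-in for the engine: invokes the callback once per chunk, in order,
// flagging the final chunk with is_last.
inline void emit_fake_chunks(const std::vector<std::vector<float>> & chunks,
                             const StreamCallback & on_chunk) {
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        on_chunk(chunks[i].data(), chunks[i].size(),
                 static_cast<int>(i), i + 1 == chunks.size());
    }
}
```

With the real engine, the collector's callback would be passed as the second argument to synthesize(); a playback client would push each chunk to an audio device instead of accumulating it.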

supertonic-cli flags

  • --stream-chunk-tokens N — target chunk size in text tokens (0 disables; 50 ≈ 1-3 s English audio)
  • --stream-first-chunk-tokens N — smaller first-chunk override for first-audio latency
  • --stream-chunk-tolerance-pct N — clause/whitespace boundary-snap window
  • --stream-min-chunk-tokens N — hard floor on chunk size (default 30)
  • --out - streams raw s16le PCM on stdout (one buffered fwrite per chunk). Pipe into ffplay -f s16le -ar 44100 -ch_layout mono -i - or sox -t raw -b 16 -e signed-integer -r 44100 -c 1 - -d.
  • SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries AND per-chunk is_continuation; SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for debugging.

Test plan

  • Build: cmake --build tts-cpp/build --target supertonic-cli -j 8
  • Batch vs streaming acoustic equivalence on 3-sentence text: streamed 10.07 s vs batch 10.11 s, no audible difference at chunk seams or endpoints
  • First-audio latency on 6-sentence 18 s paragraph: first audio in ~1 s vs batch's ~4–5 s synth-then-play
  • Long run-on single sentence (245 chars, no internal .?!): chunker produces 5 evenly-sized chunks (30, 52, 54, 52, 55 cps) via whitespace fallback. All chunks above the min floor.
  • Multilingual run-on stress (same 245-ish char run-on in en/fr/pt/ko):
    • EN: 5 chunks, FR: 6 chunks, PT: 5 chunks, KO: 3 chunks (info-dense Hangul → fewer chunks at same cp target).
  • Multilingual standard 3-sentence: all four languages chunk at sentence boundaries (. for en/fr/pt, . and CJK recognized for ko).
  • Stdout streaming via ffplay: chunk-by-chunk timing in stderr proves later chunks synthesize during earlier chunks' playback.
  • CPU streaming (after rebasing onto #21, "QVAC-18966 [TTS GGML] Fix CPU regression"): 3-sentence English at --n-gpu-layers 0 writes a 10.07 s WAV in 1.09 s wall-time (~9× realtime), exit code 0, no abort. Stdout streaming on CPU produces a byte-exact output size (samples × 2). Multilingual sentence detection on CPU produces the same is_continuation=0/1 flags as Metal (en/fr/pt/ko verified; engine and chunker agree on the Unicode terminator predicate).
  • Validation script at /tmp/supertonic-validate-review-fixes.sh covers the de-dup correctness (incl. CJK decode through engine → shared is_sentence_end_cp) and stdout-byte-exactness — 4/4 PASS.
  • Empirical regressions caught during iteration:
    • Per-chunk seed perturbation (opts.seed + k) occasionally landed on glitchy nearby seeds → now uses opts.seed everywhere.
    • 100 ms tail fade on the last chunk faded legitimate final words → reverted to uniform 10 ms after the seed fix.
    • Tail-merge at min_chunk_tokens swallowed valid CJK trailing sentences → relaxed to max(6, target/3).
    • 3× sentence-search window slurped runaway-sentence tails as one chunk → tightened to 2×.
    • Duplicate sentence-terminator table between engine + chunker → shared is_sentence_end_cp via header.
    • Per-sample fwrite in stream_emit_pcm_stdout → buffered single fwrite per chunk.

Known limitations

  • The pre-existing ggml-metal residency-set assertion at process exit (ggml-metal-device.m:612) fires after every Metal-backed run on this branch and on master. Unrelated to streaming; audio is written before exit.
  • CJK languages other than Korean aren't accepted by the current Supertonic model GGUF (--language zh rejected at preprocess; supported set: en, ko, es, pt, fr). Chunker handles CJK punctuation correctly; no model support yet.
  • Mid-sentence streaming has audible seam artifacts (small pauses + rate shifts) in all languages. This is architectural — the duration predictor and attention run per-chunk on a non-streaming-trained model. The continuation flag avoids the artificial-period failure mode but can't share prosody across chunks. The API we ship carries forward unchanged if/when a streaming-trained Supertonic appears — only synthesize_streaming's internals would change.

Files

Type Path
Edit tts-cpp/include/tts-cpp/supertonic/engine.h
Edit tts-cpp/src/supertonic_engine.cpp
Edit tts-cpp/src/supertonic_cli.cpp
Edit tts-cpp/src/supertonic_preprocess.cpp
Edit tts-cpp/src/supertonic_internal.h
Edit tts-cpp/CMakeLists.txt
New tts-cpp/src/supertonic_chunker.h
New tts-cpp/src/supertonic_chunker.cpp

🤖 Generated with Claude Code

GustavoA1604 left a comment

Cursor flagged these two things, could you check if they make sense?

Duplicated sentence-terminator code point table (supertonic_engine.cpp + supertonic_chunker.cpp) — chunk_ends_with_sentence_term() in the engine and is_sentence_end_cp() in the chunker are identical switch statements. If the set grows (e.g. Ethiopic ።, Tibetan །) only one copy will be updated. The fix is to un-anonymize is_sentence_end_cp in the chunker and have the engine call it.

stream_emit_pcm_stdout writes one sample at a time (supertonic_cli.cpp) — It calls fwrite(&v, 2, 1, stdout) in a per-sample loop, meaning ~44k–132k syscall-adjacent calls per chunk. Should build a std::vector<int16_t> and write the whole buffer in one fwrite.

ogad-tether added a commit that referenced this pull request May 18, 2026
…ered stdout)

Two review-comment fixes from PR #20:

1. De-duplicated the sentence-terminator code-point table between
   supertonic_chunker.cpp's is_sentence_end_cp() and the engine's
   chunk_ends_with_sentence_term().  is_sentence_end_cp() is now
   declared in supertonic_chunker.h and called from the engine's
   per-chunk continuation detector — the engine still owns the
   UTF-8 trim/decode logic, but the predicate (and its multilingual
   table) live in one place.  Adding Ethiopic ።, Tibetan ། or any
   other terminator now needs one edit, not two.

2. stream_emit_pcm_stdout was doing a per-sample
   fwrite(&v, 2, 1, stdout) loop — ~44k-132k syscall-adjacent calls
   per chunk.  Build the chunk's int16 buffer once and write it in
   a single fwrite; flush after.  No semantic change to the bytes
   on stdout; just throughput.

Verified: multi-sentence chunker still produces 3 sentence-aligned
chunks (unchanged); stdout streaming byte count still equals
samples * 2 exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether

Both fixes are in 72b9d561. Thanks for catching these — both were real issues.

1. De-duplicated terminator table. is_sentence_end_cp() is now declared in supertonic_chunker.h and called from the engine's chunk_ends_with_sentence_term(). The engine still owns the UTF-8 trim/decode (its input is a std::string, not a raw code point), but the predicate + multilingual table live in exactly one place. Future additions (Ethiopic ።, Tibetan །, etc.) now need one edit instead of two, and the older Cursor-flagged code path can't drift out of sync.

2. Buffered stdout write. stream_emit_pcm_stdout now builds the chunk's int16_t buffer once and writes it with a single fwrite, then flushes. No semantic change to the bytes on stdout — verified post-fix: 59487 samples produces exactly 118974 stdout bytes (samples × 2), same as before. Just removes the ~44k–132k tiny fwrite calls per chunk.

Both verified on smoke: multi-sentence chunker still gives 3 sentence-aligned chunks (no regression in the predicate), stdout streaming byte count is exact.
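
For reference, the buffered write pattern looks roughly like this. It is a sketch, not the code in supertonic_cli.cpp: `float_to_s16` and `write_chunk_s16le` are hypothetical names, the clamp-and-scale-by-32767 conversion is an assumption, and the output is only little-endian on little-endian hosts because the buffer is written in native byte order.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Convert float PCM in [-1, 1] to 16-bit signed samples, clamping overshoot.
inline std::vector<std::int16_t> float_to_s16(const float * pcm, std::size_t samples) {
    std::vector<std::int16_t> out(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        const float clamped = std::max(-1.0f, std::min(1.0f, pcm[i]));
        out[i] = static_cast<std::int16_t>(std::lrint(clamped * 32767.0f));
    }
    return out;
}

// Build the whole chunk buffer first, then issue one fwrite and one flush
// instead of one fwrite per sample. Returns false on a short write.
inline bool write_chunk_s16le(std::FILE * fp, const float * pcm, std::size_t samples) {
    const std::vector<std::int16_t> buf = float_to_s16(pcm, samples);
    const std::size_t written =
        std::fwrite(buf.data(), sizeof(std::int16_t), buf.size(), fp);
    std::fflush(fp);  // hand the chunk to the consumer before synthesizing the next one
    return written == buf.size();
}
```

The byte count on the stream is samples × 2 either way; the difference is one library call per chunk rather than one per sample.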

ogad-tether and others added 6 commits May 18, 2026 13:00
Mirrors the chatterbox StreamCallback API: a second synthesize() overload
takes an on_chunk callback that receives PCM chunk-by-chunk while the
returned SynthesisResult still accumulates the full audio (callback is
an addition, not a replacement).

Supertonic's vector estimator is non-autoregressive (5-step CFM denoise
over the full duration-predicted latent), so the chatterbox token-level
streaming pattern doesn't transfer.  Instead this splits text into
sentence-aligned chunks and runs the full pipeline per chunk:

- New src/supertonic_chunker.{h,cpp}: Unicode-aware splitter.  Sentence-
  end gets a wide implicit search window (target/2..3*target) because
  sentence prosody dominates audio quality on this model — chunks cut
  mid-clause receive an artificial trailing period from preprocess and
  the model emits muddled / dropped words in response.  Clause and
  whitespace fallbacks use the user-supplied tolerance.

- Multilingual punctuation tables: ASCII .?! plus CJK fullwidth, double
  exclamation/question, Devanagari danda, Urdu full stop for sentences;
  ASCII / fullwidth / Arabic comma, semicolon, colon and closing
  brackets for clauses.  Whitespace fallback handles CJK / Thai / Lao /
  Khmer where punctuation may be absent.

- Engine streaming path runs the full pipeline per chunk with opts.seed
  (no per-chunk perturbation; different chunks have different latent_len
  so noise tensors differ even with the same seed, and an earlier
  per-chunk seed bump occasionally landed chunks on nearby seeds where
  the model produces phantom-phoneme tail artifacts).

- 10 ms raised-cosine anti-click fade on inter-chunk seams only.  First
  chunk start and last chunk end stay untouched so streamed output is
  acoustically equivalent to batch at the endpoints.

- CLI gains --stream-chunk-tokens / --stream-first-chunk-tokens /
  --stream-chunk-tolerance-pct flags.  --out - streams raw s16le PCM on
  stdout for incremental playback (pipe into ffplay / sox -d).
  SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries;
  SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for
  debugging.

Validated end-to-end at ~35x realtime on M2 Metal: streamed output is
acoustically equivalent to batch on the same seed; first audio drops in
~1 s for an 18 s utterance instead of waiting the full ~4-5 s for batch
synth to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two empirically-driven additions on top of the sentence-aligned
chunker:

1. is_continuation flag through supertonic_preprocess_text +
   supertonic_text_to_ids.  When the engine produces a mid-clause /
   mid-word chunk during streaming, the preprocess skips its
   auto-appended terminal period.  Without the flag the model spoke
   stub chunks as complete sentences with falling intonation and
   trailing-phoneme artifacts (the original "park.K" tail bug).  The
   engine detects per-chunk whether the chunk ends on a natural
   sentence terminator (ASCII .?! plus CJK / Devanagari / Urdu
   equivalents) and passes through the flag accordingly.

2. stream_min_chunk_tokens (default 30) on EngineOptions.  Below ~30
   tokens the model emits dropped / muddled phonemes on stub input
   regardless of the continuation flag (verified on multiple seeds
   and texts — short text is a model-level failure mode, not a
   preprocess one).  The chunker treats min_chunk_tokens as a hard
   floor: effective target = max(target, min), the sentence/clause/
   whitespace search lower bound is clamped to start + min, and any
   trailing chunk below the floor is merged into its predecessor.

   The min floor is the practical ceiling on what Option A streaming
   can achieve.  True seam-free streaming inside one utterance would
   require model retraining (causal attention, per-token duration,
   mel-frame cache continuity — the bits chatterbox has by design but
   supertonic was not trained for).  Documenting that as the trade-off
   honestly rather than papering over it.

Behavior:

  - Multi-sentence input → sentence-aligned chunks (the v1 behavior).
    Acoustically equivalent to batch on the same seed.
  - Long single-sentence input → multi-chunk output at the min floor,
    each chunk passed to the model without an artificial terminal
    period.  Inter-chunk pauses and rate shifts are inherent to
    per-chunk synthesis on a non-streaming-trained model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reshold

Tail-merge was using min_chunk_tokens (30) as its threshold, which on
languages denser than English (CJK in particular) merged the last
chunk into the previous one even when that last chunk was a complete
sentence.  Concrete: Korean "공원에서 산책하기 좋은 날이다." is 18
code points — below the 30-cp floor — so the merger folded it into the
previous chunk, which contained TWO sentences, producing a single
172-byte chunk for the whole utterance and zero streaming benefit.

Switch to chatterbox_engine.cpp:608's heuristic: tail_thresh =
max(6, target_tokens/3) (16 for target=50).  Genuinely tiny stubs
(<16 cps) still merge; real sentence chunks stay independent.  The
min_chunk_tokens floor governs what the chunker proactively *aims for*
during iteration, not what it does with whatever's left after the
last natural boundary.

Verified: Korean 3-sentence text now chunks into 2 (first chunk spans
2 sentences due to first-sentence-below-min-floor, last sentence
stays separate at 18 cps).  English 3-sentence test stays at 3
sentence-aligned chunks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3x sentence-search window slurped runaway-sentence tails as one
huge "sentence-aligned" chunk: a 245-char single sentence with the
final period 109 chars past start was found by the wide window, so
chunker took the whole remainder as chunk[3] instead of falling
through to whitespace and producing multiple sub-sentence chunks.

2x is still wide enough to catch a long-but-reasonable first sentence
in multi-sentence input (covers up to ~90 chars at target=50, ample
for typical English / French / Portuguese sentences) but narrow
enough that genuinely runaway sentences (>2x target with no internal
periods) fall through to whitespace and stream.

Empirical: same 245-char English run-on now produces 5 evenly-sized
chunks (30, 52, 54, 52, 56) instead of 4 with the tail-blob
(30, 52, 54, 109).  Multi-sentence test unchanged (still 3 sentence-
aligned chunks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…UPERTONIC_LOG_CHUNKS

Adds one line per chunk to the existing SUPERTONIC_LOG_CHUNKS env-var
trace, showing the is_continuation flag the engine resolved before
handing the chunk to run_single_chunk:

  chunk[0] (44 bytes): The quick brown fox jumps over the lazy dog.
  chunk[0] is_continuation=0
  chunk[1] (64 bytes): Then she said hello to the world, ...
  chunk[1] is_continuation=0

Useful for validating that the engine's per-chunk continuation
detector and the chunker's boundary search agree on what counts as
a sentence terminator across UTF-8 — they share the same
detail::is_sentence_end_cp table, but the engine reaches it via a
UTF-8-decode of the final code point in the chunk string, so the
two paths can in principle disagree on a malformed input.  The log
makes that observable in one place.
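
The engine-side path described here (trim, decode the final code point, consult the shared predicate) can be sketched as below. All three functions are hypothetical: `is_sentence_end_cp_sketch` is a stand-in covering only part of the terminator set listed in this PR, the decoder does not reject overlong encodings, and treating an empty or undecodable chunk as a continuation is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the shared predicate; a subset of the terminator table.
inline bool is_sentence_end_cp_sketch(std::uint32_t cp) {
    switch (cp) {
        case 0x002E: case 0x003F: case 0x0021:   // . ? !
        case 0x3002: case 0xFF1F: case 0xFF01:   // 。 ？ ！
        case 0x0964: case 0x0965:                // । ॥
        case 0x06D4:                             // ۔
            return true;
        default:
            return false;
    }
}

// Decode the last UTF-8 code point of s. Returns false on empty or
// structurally malformed input (stray continuation byte, wrong length).
inline bool decode_last_code_point(const std::string & s, std::uint32_t & cp) {
    if (s.empty()) return false;
    std::size_t i = s.size() - 1;
    int cont = 0;
    // Walk back over at most three continuation bytes (10xxxxxx).
    while (i > 0 && cont < 3 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    int expected = 0;
    if (lead < 0x80)                { expected = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { expected = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { expected = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { expected = 3; cp = lead & 0x07; }
    else return false;
    if (expected != cont) return false;
    for (std::size_t k = i + 1; k < s.size(); ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[k]) & 0x3F);
    }
    return true;
}

// A chunk is a continuation when it does not end on a sentence terminator.
inline bool chunk_is_continuation(std::string chunk) {
    while (!chunk.empty() &&
           (chunk.back() == ' ' || chunk.back() == '\t' || chunk.back() == '\n')) {
        chunk.pop_back();
    }
    std::uint32_t cp = 0;
    if (!decode_last_code_point(chunk, cp)) return true;  // assumption: fail open
    return !is_sentence_end_cp_sketch(cp);
}
```

The malformed-input case is where the two paths could diverge: a chunker working on already-decoded code points never sees a truncated byte sequence, whereas this decode step has to pick a fallback.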

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ogad-tether force-pushed the feat/supertonic-streaming branch from 6582b0c to 16c2cd2 May 18, 2026 14:03
ogad-tether marked this pull request as ready for review May 18, 2026 14:04
ogad-tether requested review from a team as code owners May 18, 2026 14:04
GustavoA1604 merged commit b220514 into supertonic_optimizations May 18, 2026
54 of 60 checks passed
gianni-cor deleted the feat/supertonic-streaming branch May 28, 2026 13:44