Merged
19 commits
65ad1d0
perf(t3-mtl): batch CFG cond+uncond into a single Metal forward (B=2)
reichert-dev Apr 28, 2026
c98a451
feat(cli): --cfm-steps N for non-streaming multilingual
reichert-dev Apr 28, 2026
0fd7b28
perf(t3-mtl): use ggml_swiglu_split for fused SiLU*Mul on Metal
reichert-dev Apr 28, 2026
153d27b
docs(README,PROGRESS): MTL Metal optimisation pass §3.21
reichert-dev Apr 28, 2026
6141cf2
perf(mtl): drop redundant gallocr_reserve + thread_local HiFT/time_ml…
reichert-dev Apr 28, 2026
a549e89
docs(PROGRESS): §3.22 — MTL allocator-overhead clean-up
reichert-dev Apr 28, 2026
1f43ecc
perf(t3-mtl): stack W_q+W_k+W_v into one mat-mul on Metal
reichert-dev Apr 28, 2026
9ea2875
docs(PROGRESS): §3.23 — T3-MTL fused Q/K/V mat-mul
reichert-dev Apr 28, 2026
9623466
perf(s3gen-mtl): F16 HiFT conv-kernel quantisation via requantize-ggu…
reichert-dev Apr 29, 2026
e7f1107
docs(PROGRESS): §3.24 — HiFT conv-kernel F16 quantisation
reichert-dev Apr 29, 2026
2de1970
docs(PROGRESS): track F32 mul_mm + add(bias) fusion as §3.24 follow-up
reichert-dev Apr 29, 2026
c47c776
docs(s3gen): §3.25 — negative finding on conformer flow-encoder flash…
reichert-dev Apr 30, 2026
daae187
perf(s3gen): §3.26 — add kernel_mul_mv_f32_f16{,_4,_short} Metal vari…
reichert-dev Apr 30, 2026
52d184a
perf(s3gen): §3.27 — mul_mm + ADD(bias)[+ADD(residual)] fusion in ggm…
reichert-dev May 1, 2026
64c991d
perf(s3gen): §3.28 — extend mul_mm fusion to absorb GELU_ERF (CFM FF …
reichert-dev May 1, 2026
4633172
docs(s3gen): §3.29 — direct-store mul_mm fold-in negative finding, re…
reichert-dev May 1, 2026
145c822
test(metal): §3.30 — mul_mm fused-kernel parity harness + bias-only d…
reichert-dev May 1, 2026
0902381
test(bench): §3.31 — iOS-arm64 cross-build + scripts/bench-m4-validat…
reichert-dev May 1, 2026
4058ce7
docs: §3.24–§3.31 portfolio closeout — SUMMARY + README bench table
reichert-dev May 1, 2026
1,519 changes: 1,519 additions & 0 deletions PROGRESS.md

Large diffs are not rendered by default.

85 changes: 84 additions & 1 deletion README.md
@@ -25,18 +25,28 @@ reference wav (T3 + S3Gen + HiFT, warm runs, excludes model load):
| CPU (Mac Studio M3 Ultra, NEON) | 7 568 ms | 1.05 | 0.96× | 2.3× faster |
| Reference (ONNX Runtime, CPU Q4) | 6.4–17 s | 1.2–3.2 | 0.3–0.85× | — |

**Multilingual** (same Spanish prompt, seed 42, built-in voice;
ONNX reference uses `jfk.wav` via the [multilingual-bench][bench] script):

| Backend | Wall | `RTF` | vs real-time | vs ONNX Runtime |
|--------------------------------------|----------:|------:|-------------:|----------------:|
| **Metal (M3 Ultra, Q4_0, `--cfm-steps 7`)** | **1.05 s**| **0.30** | **3.3×** | **48.4× faster**¹ |
| Metal (M3 Ultra, Q4_0) | **1.22 s** | 0.35 | 2.9× | **42.0× faster**¹ |
| Metal (M3 Ultra, F16, `--cfm-steps 7`)| 1.16 s | 0.32 | 3.2× | **45.9× faster**¹ |
| Metal (M3 Ultra, F16) | 1.41 s | 0.38 | 2.6× | **37.5× faster**¹ |
| **Metal (M4, Q4_0)** | **3.0 s**| 1.37 | 0.73× | **10.6× faster**¹ |
| Metal (M4, F16) | 4.0 s | 1.65 | 0.61× | **14.2× faster**¹ |
| CPU (M4, 4t NEON, Q4_0) | 6.0 s | 2.69 | 0.37× | **5.4× faster**¹ |
| CPU (M4, 4t NEON, F16) | 7.8 s | 3.24 | 0.31× | **7.3× faster**¹ |
| Reference (ONNX Runtime, CPU 4t, q4) | 31.7 s |14.55 | 0.07× | — |
| Reference (ONNX Runtime, CPU 4t, fp16)|53.3 s |23.50 | 0.04× | — |

The M3 Ultra rows reflect the §3.21 optimisation pass — CFG cond+uncond
batched into one Metal forward (B=2) on T3, the new `--cfm-steps N` knob
on the standard 10-step CFM (N=7 is the recommended quality knee, log-mel
cosine vs N=10 = **0.995**), and `ggml_swiglu_split` on the Llama MLP.
The M4 rows are kept for continuity with §3.19/§3.20.
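
The `RTF` and "vs real-time" columns can be cross-checked from wall time and audio duration. The sketch below assumes the usual definition (RTF = wall time / audio duration, which the README does not spell out) and uses the 3.48 s Q4_0 audio length reported for this prompt in the M3 Ultra bench section; the function names are illustrative only.

```python
# Sketch: real-time factor from wall time and audio duration.
# Assumption: RTF = wall / audio; "vs real-time" is its reciprocal.

def rtf(wall_s: float, audio_s: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return wall_s / audio_s


def vs_real_time(wall_s: float, audio_s: float) -> float:
    """Seconds of audio produced per second of wall time."""
    return audio_s / wall_s


AUDIO_S = 3.48  # Q4_0 output duration reported for this prompt

for label, wall_s in [("Q4_0, --cfm-steps 7", 1.05), ("Q4_0, default N=10", 1.22)]:
    print(f"{label}: RTF {rtf(wall_s, AUDIO_S):.2f}, "
          f"{vs_real_time(wall_s, AUDIO_S):.1f}x real time")
```

With the two Q4_0 wall times from the table this reproduces the 0.30 / 3.3× and 0.35 / 2.9× entries.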

¹ ONNX Runtime's multilingual ONNX export ships **without** the
`text_emb_weight.bin` tensor and logs `CFG disabled` at load, so it's
running half the compute of the ggml pipeline (1 T3 forward per token
@@ -386,6 +396,14 @@ Extra MTL-only knobs: `--cfg-weight F` (default 0.5, must be ≥ 0),
intensity, in [0, 1]). `--reference-audio` works
the same way on both variants.

`--cfm-steps N` lowers the CFM Euler step count for non-streaming
synthesis (default 10 for Multilingual's standard CFM). N=7 saves ~22%
of S3Gen wall time at log-mel cosine 0.995 vs the N=10 reference and is
the recommended quality knee on M3 Ultra (see [`PROGRESS.md §3.21`](PROGRESS.md));
N=6 is too aggressive (cosine 0.990 right at the threshold, PCM cosine
drops to 0.88). Streaming chunks ignore this flag and use
`--stream-cfm-steps` instead.
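
To make the step-count trade-off concrete, here is a minimal sketch of fixed-step Euler integration plus the cosine metric used to compare an N-step result against the N=10 reference. It is not the S3Gen code: `toy_velocity` is a hypothetical stand-in for the CFM estimator, uniform time spacing is an assumption (the real pipeline may use a different schedule), and the vectors are toy data rather than log-mel frames.

```python
import math


def euler_integrate(x0, velocity, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in n_steps Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def toy_velocity(x, t):
    # Hypothetical vector field; the real one is a neural network.
    return [math.cos(3.0 * t + k) - xi for k, xi in enumerate(x)]


x0 = [0.5, -1.0, 2.0, 0.0]
reference = euler_integrate(x0, toy_velocity, 10)   # default step count
for n in (7, 6):
    out = euler_integrate(x0, toy_velocity, n)
    print(f"N={n}: cosine vs N=10 = {cosine(reference, out):.4f}")
```

Fewer steps means fewer estimator evaluations (the cost that `--cfm-steps` reduces) at the price of a larger integration error, which is what the cosine figure tracks.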

Everything is self-contained in the two `.gguf` files:

- `chatterbox-t3-turbo.gguf` embeds the BPE tokenizer (vocab + merges +
@@ -649,6 +667,71 @@ throughput, so CPU keeps the two-call path. See
[`PROGRESS.md §3.19`](PROGRESS.md) for the measurement and a discussion
of where the MTL slowdown lives relative to Turbo.

### Multilingual (Mac Studio M3 Ultra, after §3.21 optimisation pass)

Same Spanish prompt (`"Hola, esto es una demostración multilingüe."`,
`--language es`), `jfk.wav` voice, seed 42, greedy (`--temp 0 --top-k 1`),
3 warm runs averaged. T3 is now CFG-batched into a single Metal forward
(B=2, mirrors S3Gen's `use_b2`); MLP uses `ggml_swiglu_split` so the 30
SiLU+Mul element-wise pairs collapse into one fused Metal kernel per
layer. The new `--cfm-steps N` flag exposes the standard CFM step count
(default 10); N=7 is the recommended quality knee (log-mel cosine vs N=10
= **0.995**).

| Config | T3 infer | S3Gen infer | Audio | **RTF** |
|-------------------------------------|-------------------:|------------:|------:|--------:|
| MTL, Metal Q4_0, `--cfm-steps 7` | 478 ms / 84 tok | 576 ms | 3.48 s| 0.30 |
| MTL, Metal Q4_0 (default N=10) | 482 ms / 84 tok | 730 ms | 3.48 s| 0.35 |
| MTL, Metal F16, `--cfm-steps 7` | 579 ms / 89 tok | 586 ms | 3.68 s| 0.32 |
| MTL, Metal F16 (default N=10) | 613 ms / 89 tok | 752 ms | 3.68 s| 0.37 |
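
The SiLU+Mul fusion mentioned above the table can be illustrated in plain Python. This is a sketch of the arithmetic only, assuming the fused op computes `silu(gate) * up` element-wise; it does not call ggml, and the function names are illustrative.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_two_pass(gate, up):
    """Unfused form: one pass for SiLU, a second pass for the multiply."""
    activated = [silu(g) for g in gate]             # first element-wise kernel
    return [a * u for a, u in zip(activated, up)]   # second element-wise kernel


def swiglu_fused(gate, up):
    """Fused form: a single pass with no intermediate buffer."""
    return [silu(g) * u for g, u in zip(gate, up)]


gate = [-2.0, -0.5, 0.0, 0.5, 2.0]
up = [1.0, 2.0, 3.0, 4.0, 5.0]
assert swiglu_two_pass(gate, up) == swiglu_fused(gate, up)
```

Both forms perform the same floating-point operations per element, so the results match; the saving is one kernel launch and one intermediate tensor per layer.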

Compared to the M4 Q4_0 multilingual row above (RTF 1.37), the M3 Ultra
hits **RTF 0.30** on Q4_0 with `--cfm-steps 7`, a 4.6× speedup (3.9× at
the default N=10, RTF 0.35). CFG batching alone drops T3 by 42–45% (see
PROGRESS.md §3.21 for the full bench matrix and the negative results for
F16 KV cache and SwiGLU on F16).
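
As a rough illustration of what batching the conditional and unconditional passes means, the sketch below compares a two-call CFG step with a single B=2 call. `forward` is a toy stand-in for the T3 forward pass, and the combine rule `cond + w * (cond - uncond)` is an assumption about the CFG formula, not taken from the source; 0.5 is the documented `--cfg-weight` default.

```python
def forward(batch, weights):
    """Toy forward pass: one mat-vec per batch row (stand-in for T3)."""
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in weights]
            for row in batch]


def cfg_combine(cond, uncond, cfg_weight):
    # Assumed CFG rule: push the conditional logits away from the unconditional ones.
    return [c + cfg_weight * (c - u) for c, u in zip(cond, uncond)]


def step_two_calls(cond_in, uncond_in, weights, cfg_weight):
    """Two separate forwards, one per branch (B=1 each)."""
    cond = forward([cond_in], weights)[0]
    uncond = forward([uncond_in], weights)[0]
    return cfg_combine(cond, uncond, cfg_weight)


def step_batched(cond_in, uncond_in, weights, cfg_weight):
    """One forward over a batch of two rows (B=2), then the same combine."""
    cond, uncond = forward([cond_in, uncond_in], weights)
    return cfg_combine(cond, uncond, cfg_weight)


weights = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
cond_in, uncond_in = [1.0, 2.0, 3.0], [0.5, 0.0, -1.0]
assert step_two_calls(cond_in, uncond_in, weights, 0.5) == \
    step_batched(cond_in, uncond_in, weights, 0.5)
```

The outputs are identical; the gain on a GPU backend comes from reading the weights once for both rows and halving the number of graph submissions.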

### Multilingual (M3 Ultra, post §3.24–§3.31 Metal kernel portfolio)

Same prompt, voice, and seed as §3.21 above. This pass adds, on top of §3.21:

- **§3.24** — HiFT conv-kernel F16 quantisation (64 tensors).
- **§3.26** — `kernel_mul_mv_f32_f16{,_4,_short}` Metal kernel variants
to unblock 21 more HiFT `source_*` F16 tensors
(GGUF shrinks **754 → 747 MB**, WAV cos 1.000000 vs §3.24).
- **§3.27** — `kernel_mul_mm` + `ADD(bias)` [+ `ADD(residual)`] fusion
for the CFM transformer Q4_0 mat-muls (1820 saved `ggml_add`
dispatches per synth).
- **§3.28** — extends the fusion to absorb `GELU_ERF` (CFM FF ff0
activation path; 1120 additional saved dispatches).
- **§3.30** — `test-metal-ops` fused-mul_mm parity harness + bias-only
direct-store variant.
- **§3.31** — iOS-arm64 cross-build portability +
`scripts/bench-m4-validation.sh` for M4 hand-off.
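
The §3.27/§3.28 fusion can be pictured as folding the bias add, the optional residual add, and the optional erf-based GELU into the epilogue of the mat-mul instead of running them as separate dispatches. The following is a plain-Python reference sketch, not the Metal kernel: the order of operations and which combinations are actually fused are assumptions, and all names are illustrative.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def unfused(x, w, bias, residual=None, gelu=False):
    """Separate passes: mat-mul, add(bias), optional add(residual), optional GELU."""
    cols = len(w[0])
    y = [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
         for row in x]                                            # mul_mm
    y = [[v + bias[j] for j, v in enumerate(row)] for row in y]   # add(bias)
    if residual is not None:
        y = [[v + r for v, r in zip(row, res_row)]
             for row, res_row in zip(y, residual)]                # add(residual)
    if gelu:
        y = [[gelu_erf(v) for v in row] for row in y]             # unary op
    return y


def fused(x, w, bias, residual=None, gelu=False):
    """Single pass: the epilogue is applied before each element is stored."""
    out = []
    for i, row in enumerate(x):
        out_row = []
        for j in range(len(w[0])):
            acc = sum(row[k] * w[k][j] for k in range(len(w)))
            acc += bias[j]
            if residual is not None:
                acc += residual[i][j]
            if gelu:
                acc = gelu_erf(acc)
            out_row.append(acc)
        out.append(out_row)
    return out


def max_abs_diff(a, b):
    """Largest element-wise difference, the kind of figure a parity gate checks."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


x = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
w = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
bias = [0.05, -0.05]
residual = [[1.0, 1.0], [2.0, 2.0]]
print(max_abs_diff(unfused(x, w, bias, residual), fused(x, w, bias, residual)))
print(max_abs_diff(unfused(x, w, bias, gelu=True), fused(x, w, bias, gelu=True)))
```

Each folded-in op removes one graph dispatch per mat-mul, which is where the saved-dispatch counts in the list above come from.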

5-invocation averages (default N=10 CFM; compare to the §3.21 N=10 row):

| Config | T3 infer | S3Gen infer | Audio | **RTF** |
|-------------------------------------|--------------------:|------------:|------:|--------:|
| MTL, Metal Q4_0 + HiFT F16 v2 (§3.28) | 433 ms / 84 tok | 706 ms | 3.48 s| **0.33** |
| MTL, Metal Q4_0 baseline (§3.21 N=10) | 482 ms / 84 tok | 730 ms | 3.48 s| 0.35 |
| **Δ §3.21 → §3.28** | **−49 ms / −10.2 %** | **−24 ms / −3.3 %** | — | **−0.02** |

WAV is byte-exact deterministic across runs (md5
`d8a1b22375dbcb2259c686426a7d76c5` ×5). Parity harness
`test-metal-ops` passes 14 gates (3 base + 3 conv_transpose_1d + 8
fused `mul_mm`). Patch `patches/ggml-metal-chatterbox-ops.patch`
(1088 lines) applies cleanly on a fresh ggml clone at pinned
`58c38058`. All §3.24–§3.30 kernel changes cross-compile cleanly
for iOS-arm64 (portability verified; runtime measurement deferred
until an M4 / iPhone / iPad run of
[`scripts/bench-m4-validation.sh`](scripts/bench-m4-validation.sh)).
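
A byte-exactness check of this kind can be scripted in a few lines. The sketch below is one possible way to do it, not the project's own tooling; the helper names and any output paths passed to it are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path) -> str:
    """Hex MD5 digest of a file's full contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def byte_exact(paths, expected=None) -> bool:
    """True if every file has the same digest (and matches `expected`, if given)."""
    digests = {md5_of(p) for p in paths}
    if len(digests) != 1:
        return False
    return expected is None or digests == {expected}
```

For example, after writing five runs to hypothetical files `/tmp/cb_run1.wav` … `/tmp/cb_run5.wav`, `byte_exact(paths, "d8a1b22375dbcb2259c686426a7d76c5")` would confirm both run-to-run determinism and agreement with the digest quoted above.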

M3 Ultra CFM time specifically drops from 541.9 ms → 534.0 ms
(**−1.5 %**) — modest on this chip because per-dispatch overhead
is very low; expected to be larger on bandwidth-limited silicon
(M4 / A-series) where each saved `ggml_add` dispatch is worth more
relative to compute.
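
The deltas quoted in this section are plain before/after arithmetic; a small sketch (helper name illustrative) that reproduces them from the figures above:

```python
def delta(before_ms: float, after_ms: float):
    """Absolute change in ms and relative change in percent."""
    diff = after_ms - before_ms
    return diff, 100.0 * diff / before_ms


for name, before, after in [
    ("T3 infer", 482.0, 433.0),     # §3.21 N=10 row vs §3.28 row
    ("S3Gen infer", 730.0, 706.0),
    ("CFM", 541.9, 534.0),
]:
    diff, pct = delta(before, after)
    print(f"{name}: {diff:+.1f} ms ({pct:+.1f} %)")
```

This prints −49.0 ms (−10.2 %), −24.0 ms (−3.3 %) and −7.9 ms (−1.5 %), matching the table and the CFM figure.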

### Reference comparison vs onnxruntime (Multilingual, M4 CPU, F16)

Same prompt, seed, and reference audio fed through
95 changes: 95 additions & 0 deletions SUMMARY-3.24-3.31.md
@@ -0,0 +1,95 @@
# §3.24–§3.31 portfolio — closeout summary

**Branch**: `multilingual_merged`   |   **Last commit**: `0902381`   |   **Period**: Apr 30 – May 1, 2026

A compact summary of the §3.24 → §3.31 optimisation pass on top of
the §3.21 baseline. For the full chronological development journal
and every negative finding, see [`PROGRESS.md`](PROGRESS.md).

---

## What shipped (8 commits)

| Section | Commit | Nature | Net M3 Ultra | Net GGUF |
|--------:|:-------|:-------|:-------------|:---------|
| §3.24 | *(earlier)* | HiFT F16 conv kernels (64 tensors) | −3.6 ms HiFT | −33 MB |
| §3.25 | `c47c776` | FA flow-encoder — **negative finding**, reverted | docs | — |
| §3.26 | `daae187` | Missing `kernel_mul_mv_f32_f16{,_4,_short}` variants → 21 more HiFT F16 tensors | neutral | **−7.7 MB** |
| §3.27 | `52d184a` | `mul_mm + ADD(bias)[+residual]` fusion | neutral M3U (infra) | — |
| §3.28 | `64c991d` | + `GELU_ERF` fold-in (CFM FF ff0) | **−8.8 ms CFM** | — |
| §3.29 | `4633172` | Direct-store RMW — **negative finding**, reverted | docs | — |
| §3.30 | `145c822` | `test-metal-ops` fused-mul_mm harness + bias-only direct-store retry | neutral M3U (infra) | — |
| §3.31 | `0902381` | iOS-arm64 cross-build + `scripts/bench-m4-validation.sh` | infra | — |

**Net M3 Ultra**: CFM **541.9 → 534.0 ms (−7.9 ms / −1.5 %)**, S3Gen
**709 → 706 ms**, GGUF **754.4 → 746.7 MB (−7.7 MB)**. Per the table,
three sections deliver measurable change (§3.24, §3.26, §3.28); the
other five are documented negative findings (§3.25, §3.29) or
infrastructure work (§3.27, §3.30, §3.31) that de-risks future rounds.

## Parity guarantees

- **WAV byte-exact** across all 5 benched invocations on the shipping
config (Q4_0 + HiFT F16 v2 GGUF, ES prompt, seed 42, `--temp 0
--top-k 1 --n-gpu-layers 1`): md5 `d8a1b22375dbcb2259c686426a7d76c5`.
Matches the §3.26 baseline exactly; §3.27/§3.28/§3.30 don't drift
it by a single bit.
- **14 / 14 `test-metal-ops` gates PASS**:
`diag_mask_inf`, `pad_ext`, 4× `conv_transpose_1d` (HiFT upsamples
+ tiny edge), 8× `mul_mm_fused` (covers CFM attn Q/K/V/out, FF
gate/down, b=1, bc_out edge shapes, both bias and gelu fusion).
- **End-to-end smoke** across all 8 model pairs
(2 T3 × 4 S3Gen variants): all produce correct output.
- **Streaming mode** (25-token chunks): 4 chunks, 938 ms first-chunk
latency, no NaN/Inf.
- **Long-text** (309 tokens, 12.57 s audio): no NaN/Inf,
speech-healthy RMS 1233.
- **Patch portability**:
[`patches/ggml-metal-chatterbox-ops.patch`](patches/ggml-metal-chatterbox-ops.patch)
(1088 lines) and `patches/ggml-opencl-chatterbox-ops.patch`
(unmodified in this period) both apply cleanly via `git apply
--check` on a fresh ggml clone at pinned `58c38058`.
- **iOS-arm64 cross-build**: `libggml-metal.a` + `libtts-cpp.a`
compile clean for iOS 14.0+ arm64 with Xcode 16 / iOS 18.5 SDK —
structural proof the §3.26/§3.27/§3.28/§3.30 kernel work is
iOS-portable (no macOS-only intrinsics).
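
The streaming and long-text checks in the list above reduce to a few simple computations. The sketch below shows one way they could be expressed; it assumes the streaming run used the 84-token prompt and that RMS is taken over raw 16-bit PCM sample values, neither of which the summary states explicitly, and the helper names are illustrative.

```python
import math


def n_chunks(n_tokens: int, chunk_tokens: int) -> int:
    """Number of streaming chunks needed to cover n_tokens."""
    return math.ceil(n_tokens / chunk_tokens)


def all_finite(samples) -> bool:
    """True if no sample is NaN or +/-Inf."""
    return all(math.isfinite(s) for s in samples)


def rms(samples) -> float:
    """Root-mean-square level of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


print(n_chunks(84, 25))        # 84 speech tokens in 25-token chunks -> 4
print(all_finite([0.0, 1.5]))  # True
print(rms([1000, -1000]))      # 1000.0
```

An RMS near zero would indicate silence and a very large one clipping, which is presumably what "speech-healthy" guards against.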

## Open follow-ups (tracked in PROGRESS)

| Item | Effort | Expected gain | Status |
|:-----|:-------|:--------------|:-------|
| M4 / iPhone / iPad validation of §3.24/§3.27/§3.28/§3.30 on bandwidth-limited silicon | 0.5–2 h on hardware | predicted +5–15 ms S3Gen; untested | hand-off script shipped (`scripts/bench-m4-validation.sh`); awaiting test host |
| Residual + gelu direct-store retry (with §3.30 harness as safety net) | 2–3 h | potential +3–8 ms M3 Ultra CFM | deferred; §3.29 negative finding root-caused to cooperative-store memory ordering, needs Metal memory-model audit |
| Extend fusion to other unary sub-ops (SILU / GELU / RELU / GELU_QUICK) | ~15 LOC each | 0 ms chatterbox (not in graph); useful downstream infra | deferred as pure-infra |
| Q4_0 HiFT via 2-D-on-disk storage + `conv1d_f32` branch | 1–2 days | +4–8 ms HiFT, −30 MB GGUF | deferred (large surgery: converter + C++) |
| T3 speculative decoding | 2–5 days | −130 to −200 ms T3 (−10 to −15 % wall) | largest remaining lever; needs its own planning session |

## Final bench — shipping config

`./build-metal/chatterbox --model models/chatterbox-t3-mtl-q4_0.gguf --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0_hift_f16_v2.gguf --reference-audio /tmp/jfk.wav --text "Hola mundo, esta es una prueba multilingue." --language es --seed 42 --temp 0 --top-k 1 --n-gpu-layers 1 --out /tmp/cb.wav`

**M3 Ultra Metal, 5 invocations averaged:**

| Stage | Mean | Stdev |
|:------|-----:|------:|
| mel | 14.6 ms | 0.2 |
| `[encoder]` | 30.5 ms | 0.7 |
| `[cfm_total]` | **534.0 ms** | **1.3** |
| `[hift_decode]` | 121.1 ms | 0.6 |
| S3GEN_INFER_MS | **706.6 ms** | 4.5 |
| T3_INFER_MS | **432.6 ms** (84 tokens) | 2.2 |
| **Total inference** | **~1165 ms** | |
| **RTF** | **0.33** | |

Audio output: 3.48 s WAV from 84 speech tokens. Byte-exact and
deterministic.
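
The totals in the table can be sanity-checked from the per-stage means. This sketch uses only the figures above; it assumes RTF = inference time / audio duration, and the reading that the ~1165 ms total includes a small amount of time outside the two stage timers is a guess rather than something the summary states.

```python
STAGES_MS = {"mel": 14.6, "encoder": 30.5, "cfm_total": 534.0, "hift_decode": 121.1}
S3GEN_INFER_MS = 706.6
T3_INFER_MS = 432.6
AUDIO_S = 3.48


def rtf(total_ms: float, audio_s: float) -> float:
    """Real-time factor from a total in milliseconds."""
    return total_ms / 1000.0 / audio_s


stage_sum = sum(STAGES_MS.values())          # about 700.2 ms
unattributed = S3GEN_INFER_MS - stage_sum    # about 6.4 ms of S3Gen outside the four stages
timed_total = T3_INFER_MS + S3GEN_INFER_MS   # about 1139.2 ms

print(f"stage sum {stage_sum:.1f} ms, unattributed {unattributed:.1f} ms")
print(f"T3 + S3Gen {timed_total:.1f} ms, RTF {rtf(timed_total, AUDIO_S):.2f}")
print(f"RTF at ~1165 ms: {rtf(1165.0, AUDIO_S):.2f}")
```

Both the 1139.2 ms sum of the two timers and the ~1165 ms total round to the reported RTF of 0.33.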

## How to reproduce

```bash
# From the multilingual_merged branch HEAD
scripts/setup-ggml.sh # apply pinned ggml patches
cmake -S . -B build-metal -DGGML_METAL=ON -DGGML_BLAS=OFF -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-metal -j
./build-metal/test-metal-ops # all 14 gates should PASS
bash scripts/bench-m4-validation.sh # also works on M3 Ultra; prints Δ vs the reference baked into the script
```