GustavoA1604 · May 4, 2026 · Apr 28, 2026
diff --git a/PROGRESS.md b/PROGRESS.md
diff --git a/README.md b/README.md
@@ -25,18 +25,28 @@ reference wav (T3 + S3Gen + HiFT, warm runs, excludes model load):
 | CPU (Mac Studio M3 Ultra, NEON)      | 7 568 ms  | 1.05   | 0.96×        | 2.3× faster     |
 | Reference (ONNX Runtime, CPU Q4)     | 6.4–17 s  | 1.2–3.2 | 0.3–0.85×   | —               |
 
-**Multilingual** (same Spanish prompt, seed 42, M4 Mac, built-in voice;
+**Multilingual** (same Spanish prompt, seed 42, built-in voice;
 ONNX reference uses `jfk.wav` via the [multilingual-bench][bench] script):
 
 | Backend                              | Wall      | `RTF` | vs real-time | vs ONNX Runtime |
 |--------------------------------------|----------:|------:|-------------:|----------------:|
+| **Metal (M3 Ultra, Q4_0, `--cfm-steps 7`)** | **1.05 s**| **0.30** | **3.3×**     | **48.4× faster**¹ |
+| Metal (M3 Ultra, Q4_0)               |  **1.22 s** | 0.35  | 2.9×        | **42.0× faster**¹ |
+| Metal (M3 Ultra, F16, `--cfm-steps 7`)| 1.16 s   |  0.32  | 3.2×        | **45.9× faster**¹ |
+| Metal (M3 Ultra, F16)                |  1.41 s   |  0.38  | 2.6×        | **37.5× faster**¹ |
 | **Metal (M4, Q4_0)**                 |  **3.0 s**| 1.37  | 0.73×        | **10.6× faster**¹ |
 | Metal (M4, F16)                      |   4.0 s   | 1.65  | 0.61×        | **14.2× faster**¹ |
 | CPU (M4, 4t NEON, Q4_0)              |   6.0 s   | 2.69  | 0.37×        | **5.4× faster**¹  |
 | CPU (M4, 4t NEON, F16)               |   7.8 s   | 3.24  | 0.31×        | **7.3× faster**¹  |
 | Reference (ONNX Runtime, CPU 4t, q4) |  31.7 s   |14.55  | 0.07×        | —                |
 | Reference (ONNX Runtime, CPU 4t, fp16)|53.3 s   |23.50  | 0.04×        | —                |
 
+The M3 Ultra rows reflect the §3.21 optimisation pass — CFG cond+uncond
+batched into one Metal forward (B=2) on T3, the new `--cfm-steps N` knob
+on the standard 10-step CFM (N=7 is the recommended quality knee, log-mel
+cosine vs N=10 = **0.995**), and `ggml_swiglu_split` on the Llama MLP.
+The M4 rows are kept for continuity with §3.19/§3.20.
+
 ¹ ONNX Runtime's multilingual ONNX export ships **without** the
 `text_emb_weight.bin` tensor and logs `CFG disabled` at load, so it's
 running half the compute of the ggml pipeline (1 T3 forward per token
@@ -386,6 +396,14 @@ Extra MTL-only knobs: `--cfg-weight F` (default 0.5, must be ≥ 0),
 intensity, in [0, 1]).  `--reference-audio` works
 the same way on both variants.
 
+`--cfm-steps N` lowers the CFM Euler step count for non-streaming
+synthesis (default 10 for Multilingual's standard CFM).  N=7 saves ~22%
+of S3Gen wall time at log-mel cosine 0.995 vs the N=10 reference and is
+the recommended quality knee on M3 Ultra (see [`PROGRESS.md §3.21`](PROGRESS.md));
+N=6 is too aggressive (log-mel cosine 0.990 sits right at the threshold,
+and PCM cosine drops to 0.88).  Streaming chunks ignore this flag and use
+`--stream-cfm-steps` instead.
+
 Everything is self-contained in the two `.gguf` files:
 
 - `chatterbox-t3-turbo.gguf` embeds the BPE tokenizer (vocab + merges +
@@ -649,6 +667,71 @@ throughput, so CPU keeps the two-call path.  See
 [`PROGRESS.md §3.19`](PROGRESS.md) for the measurement and a discussion
 of where the MTL slowdown lives relative to Turbo.
 
+### Multilingual (Mac Studio M3 Ultra, after §3.21 optimisation pass)
+
+Same Spanish prompt (`"Hola, esto es una demostración multilingüe."`,
+`--language es`), `jfk.wav` voice, seed 42, greedy (`--temp 0 --top-k 1`),
+3 warm runs averaged.  T3 is now CFG-batched into a single Metal forward
+(B=2, mirrors S3Gen's `use_b2`); MLP uses `ggml_swiglu_split` so the 30
+SiLU+Mul element-wise pairs collapse into one fused Metal kernel per
+layer.  The new `--cfm-steps N` flag exposes the standard CFM step count
+(default 10); N=7 is the recommended quality knee (log-mel cosine vs N=10
+= **0.995**).
+
+| Config                              | T3 infer           | S3Gen infer | Audio | **RTF** |
+|-------------------------------------|-------------------:|------------:|------:|--------:|
+| MTL, Metal Q4_0, `--cfm-steps 7`    |  478 ms /  84 tok  |    576 ms   | 3.48 s|  0.30   |
+| MTL, Metal Q4_0 (default N=10)      |  482 ms /  84 tok  |    730 ms   | 3.48 s|  0.35   |
+| MTL, Metal F16, `--cfm-steps 7`     |  579 ms /  89 tok  |    586 ms   | 3.68 s|  0.32   |
+| MTL, Metal F16 (default N=10)       |  613 ms /  89 tok  |    752 ms   | 3.68 s|  0.37   |
+
+Compared to the M4 multilingual numbers above, the M3 Ultra hits
+**RTF 0.30** on Q4_0, a 4.6× speedup over the M4's RTF 1.37.  CFG
+batching alone cuts T3 time by 42–45% (see PROGRESS.md §3.21 for the full
+bench matrix and the negative results for F16 KV cache and SwiGLU on F16).
+
+### Multilingual (M3 Ultra, post §3.24–§3.31 Metal kernel portfolio)
+
+Same prompt, voice, and seed as §3.21 above.  On top of §3.21, this pass adds:
+
+- **§3.24** — HiFT conv-kernel F16 quantisation (64 tensors).
+- **§3.26** — `kernel_mul_mv_f32_f16{,_4,_short}` Metal kernel variants
+  to unblock 21 more HiFT `source_*` F16 tensors
+  (GGUF shrinks **754 → 747 MB**, WAV cos 1.000000 vs §3.24).
+- **§3.27** — `kernel_mul_mm` + `ADD(bias)` [+ `ADD(residual)`] fusion
+  for the CFM transformer Q4_0 mat-muls (1820 saved `ggml_add`
+  dispatches per synth).
+- **§3.28** — extends the fusion to absorb `GELU_ERF` (CFM FF ff0
+  activation path; 1120 additional saved dispatches).
+- **§3.30** — `test-metal-ops` fused-mul_mm parity harness + bias-only
+  direct-store variant.
+- **§3.31** — iOS-arm64 cross-build portability +
+  `scripts/bench-m4-validation.sh` for M4 hand-off.
+
+5-invocation averages at the default CFM step count (N=10); compare to the §3.21 N=10 row:
+
+| Config                              | T3 infer            | S3Gen infer | Audio | **RTF** |
+|-------------------------------------|--------------------:|------------:|------:|--------:|
+| MTL, Metal Q4_0 + HiFT F16 v2 (§3.28) |  433 ms / 84 tok  |    706 ms   | 3.48 s| **0.33** |
+| MTL, Metal Q4_0 baseline (§3.21 N=10) |  482 ms / 84 tok  |    730 ms   | 3.48 s|  0.35    |
+| **Δ §3.21 → §3.28**                   |  **−49 ms / −10.2 %** | **−24 ms / −3.3 %** | — | **−0.02** |
+
+WAV is byte-exact deterministic across runs (md5
+`d8a1b22375dbcb2259c686426a7d76c5` ×5).  Parity harness
+`test-metal-ops` passes 14 gates (3 base + 3 conv_transpose_1d + 8
+fused `mul_mm`).  Patch `patches/ggml-metal-chatterbox-ops.patch`
+(1088 lines) applies cleanly on a fresh ggml clone at pinned
+`58c38058`.  All §3.24–§3.30 kernel changes cross-compile cleanly
+for iOS-arm64 (portability verified; runtime measurement deferred
+until an M4 / iPhone / iPad run of
+[`scripts/bench-m4-validation.sh`](scripts/bench-m4-validation.sh)).
+
+On M3 Ultra, CFM time drops from 541.9 ms to 534.0 ms (**−1.5 %**).
+The gain is modest on this chip because per-dispatch overhead is very
+low; it is expected to be larger on bandwidth-limited silicon
+(M4 / A-series), where each saved `ggml_add` dispatch is worth more
+relative to compute.
+
 ### Reference comparison vs onnxruntime (Multilingual, M4 CPU, F16)
 
 Same prompt, seed, and reference audio fed through

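Reviewer note on the RTF column in the M3 Ultra tables of the README diff above: the values appear to follow from the per-stage timings if RTF is taken as (T3 infer + S3Gen infer) / audio duration. That definition is an assumption, not something the diff states, and the helper name `rtf` is illustrative only. A minimal sketch that recomputes the §3.21 table rows under that assumption:

```python
def rtf(t3_ms, s3gen_ms, audio_s):
    """Real-time factor, ASSUMED to be total inference time / audio duration."""
    return (t3_ms + s3gen_ms) / (audio_s * 1000.0)


# (T3 ms, S3Gen ms, audio s) copied from the §3.21 M3 Ultra table.
rows = {
    "Q4_0, --cfm-steps 7": (478, 576, 3.48),
    "Q4_0, default N=10": (482, 730, 3.48),
    "F16, --cfm-steps 7": (579, 586, 3.68),
    "F16, default N=10": (613, 752, 3.68),
}

for name, (t3, s3gen, audio) in rows.items():
    r = rtf(t3, s3gen, audio)
    # 1 / RTF is how many times faster than real time the run is.
    print(f"{name}: RTF {r:.2f}, {1 / r:.1f}x real-time")
```

Rounded to two decimals, these match the RTF column of the §3.21 per-stage table. The F16 default row in the headline table is derived from the 1.41 s wall time instead, so it can differ in the last digit.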
diff --git a/SUMMARY-3.24-3.31.md b/SUMMARY-3.24-3.31.md
@@ -0,0 +1,95 @@
+# §3.24–§3.31 portfolio — closeout summary
+
+**Branch**: `multilingual_merged`  &nbsp;&nbsp;|&nbsp;&nbsp; **Last commit**: `0902381`  &nbsp;&nbsp;|&nbsp;&nbsp; **Period**: Apr 30 – May 1, 2026
+
+A compact summary of the §3.24 → §3.31 optimisation pass on top of
+the §3.21 baseline.  For the full chronological development journal
+and every negative finding, see [`PROGRESS.md`](PROGRESS.md).
+
+---
+
+## What shipped (8 commits)
+
+| Section | Commit | Nature | Net M3 Ultra | Net GGUF |
+|--------:|:-------|:-------|:-------------|:---------|
+| §3.24 | *(earlier)* | HiFT F16 conv kernels (64 tensors) | −3.6 ms HiFT | −33 MB |
+| §3.25 | `c47c776` | FA flow-encoder — **negative finding**, reverted | docs | — |
+| §3.26 | `daae187` | Missing `kernel_mul_mv_f32_f16{,_4,_short}` variants → 21 more HiFT F16 tensors | neutral | **−7.7 MB** |
+| §3.27 | `52d184a` | `mul_mm + ADD(bias)[+residual]` fusion | neutral M3U (infra) | — |
+| §3.28 | `64c991d` | + `GELU_ERF` fold-in (CFM FF ff0) | **−8.8 ms CFM** | — |
+| §3.29 | `4633172` | Direct-store RMW — **negative finding**, reverted | docs | — |
+| §3.30 | `145c822` | `test-metal-ops` fused-mul_mm harness + bias-only direct-store retry | neutral M3U (infra) | — |
+| §3.31 | `0902381` | iOS-arm64 cross-build + `scripts/bench-m4-validation.sh` | infra | — |
+
+**Net M3 Ultra**: CFM **541.9 → 534.0 ms (−7.9 ms / −1.5 %)**, S3Gen
+**709 → 706 ms**, GGUF **754.4 → 746.7 MB (−7.7 MB)**.  Five
+commits deliver measurable change; three are documented negative
+findings or infrastructure work that de-risks future rounds.
+
+## Parity guarantees
+
+- **WAV byte-exact** across all 5 benched invocations on the shipping
+  config (Q4_0 + HiFT F16 v2 GGUF, ES prompt, seed 42, `--temp 0
+  --top-k 1 --n-gpu-layers 1`): md5 `d8a1b22375dbcb2259c686426a7d76c5`.
+  Matches the §3.26 baseline exactly; §3.27/§3.28/§3.30 don't drift
+  it by a single bit.
+- **14 / 14 `test-metal-ops` gates PASS**:
+  `diag_mask_inf`, `pad_ext`, 4× `conv_transpose_1d` (HiFT upsamples
+  + tiny edge), 8× `mul_mm_fused` (covers CFM attn Q/K/V/out, FF
+  gate/down, b=1, bc_out edge shapes, both bias and gelu fusion).
+- **End-to-end smoke** across all 8 model pairs
+  (2 T3 × 4 S3Gen variants): all produce correct output.
+- **Streaming mode** (25-token chunks): 4 chunks, 938 ms first-chunk
+  latency, no NaN/Inf.
+- **Long-text** (309 tokens, 12.57 s audio): no NaN/Inf,
+  speech-healthy RMS 1233.
+- **Patch portability**:
+  [`patches/ggml-metal-chatterbox-ops.patch`](patches/ggml-metal-chatterbox-ops.patch)
+  (1088 lines) and `patches/ggml-opencl-chatterbox-ops.patch`
+  (unmodified in this period) both apply cleanly via `git apply
+  --check` on a fresh ggml clone at pinned `58c38058`.
+- **iOS-arm64 cross-build**: `libggml-metal.a` + `libtts-cpp.a`
+  compile clean for iOS 14.0+ arm64 with Xcode 16 / iOS 18.5 SDK —
+  structural proof the §3.26/§3.27/§3.28/§3.30 kernel work is
+  iOS-portable (no macOS-only intrinsics).
+
+## Open follow-ups (tracked in PROGRESS)
+
+| Item | Effort | Expected gain | Status |
+|:-----|:-------|:--------------|:-------|
+| M4 / iPhone / iPad validation of §3.24/§3.27/§3.28/§3.30 on bandwidth-limited silicon | 0.5–2 h on hardware | predicted +5–15 ms S3Gen; untested | hand-off script shipped (`scripts/bench-m4-validation.sh`); awaiting test host |
+| Residual + gelu direct-store retry (with §3.30 harness as safety net) | 2–3 h | potential +3–8 ms M3 Ultra CFM | deferred; §3.29 negative finding root-caused to cooperative-store memory ordering, needs Metal memory-model audit |
+| Extend fusion to other unary sub-ops (SILU / GELU / RELU / GELU_QUICK) | ~15 LOC each | 0 ms chatterbox (not in graph); useful downstream infra | deferred as pure-infra |
+| Q4_0 HiFT via 2-D-on-disk storage + `conv1d_f32` branch | 1–2 days | +4–8 ms HiFT, −30 MB GGUF | deferred (large surgery: converter + C++) |
+| T3 speculative decoding | 2–5 days | −130 to −200 ms T3 (−10 to −15 % wall) | largest remaining lever; needs its own planning session |
+
+## Final bench — shipping config
+
+`./build-metal/chatterbox --model models/chatterbox-t3-mtl-q4_0.gguf --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0_hift_f16_v2.gguf --reference-audio /tmp/jfk.wav --text "Hola mundo, esta es una prueba multilingue." --language es --seed 42 --temp 0 --top-k 1 --n-gpu-layers 1 --out /tmp/cb.wav`
+
+**M3 Ultra Metal, 5 invocations averaged:**
+
+| Stage | Mean | Stdev |
+|:------|-----:|------:|
+| mel | 14.6 ms | 0.2 |
+| `[encoder]` | 30.5 ms | 0.7 |
+| `[cfm_total]` | **534.0 ms** | **1.3** |
+| `[hift_decode]` | 121.1 ms | 0.6 |
+| S3GEN_INFER_MS | **706.6 ms** | 4.5 |
+| T3_INFER_MS | **432.6 ms** (84 tokens) | 2.2 |
+| **Total inference** | **~1165 ms** | |
+| **RTF** | **0.33** | |
+
+Audio output: 3.48 s WAV from 84 speech tokens. Byte-exact and
+deterministic.
+
+## How to reproduce
+
+```bash
+# From the multilingual_merged branch HEAD
+scripts/setup-ggml.sh                      # apply pinned ggml patches
+cmake -S . -B build-metal -DGGML_METAL=ON -DGGML_BLAS=OFF -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
+cmake --build build-metal -j
+./build-metal/test-metal-ops               # all 14 gates should PASS
+bash scripts/bench-m4-validation.sh        # also works on M3 Ultra; prints Δ vs the reference baked into the script
+```
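Reviewer note on the byte-exact parity claim in the summary above: one way to check it independently is to hash the WAV written by each invocation and confirm that every digest is identical. The sketch below is not part of the repository; the function names and the idea of passing one output path per run are assumptions made for illustration.

```python
import hashlib
import sys
from pathlib import Path


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def runs_are_byte_exact(paths):
    """Return (all_identical, digests) for the files at `paths`.

    An empty list of paths is reported as not identical.
    """
    digests = {md5_hex(Path(p).read_bytes()) for p in paths}
    return len(digests) == 1, digests


if __name__ == "__main__":
    # Hypothetical usage: pass one WAV path per benchmark invocation.
    ok, digests = runs_are_byte_exact(sys.argv[1:])
    print("byte-exact" if ok else "outputs differ", sorted(digests))
```

If the outputs are byte-exact, the single digest printed should equal the md5 quoted in the summary for the shipping config.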