Metal optimisation by mario-rei · Pull Request #3 · GustavoA1604/chatterbox.cpp

mario-rei · 2026-04-28T11:41:52Z

Batches the CFG cond + uncond passes into one Metal forward (B=2) on the multilingual T3, adds a --cfm-steps N CLI knob to override the standard 10-step CFM, and switches the MLP to ggml_swiglu_split.

Results

M3 Ultra Metal, Spanish prompt, seed 42, greedy, 3-warm-run avg.

Variant	T3 before	T3 after	S3Gen before	S3Gen after	Total before	Total after	RTF before	RTF after
Q4_0	872 ms	478 ms (−45%)	730 ms	576 ms (−22%)	1612 ms	1054 ms (−35%)	0.46	0.30
F16	1099 ms	579 ms (−47%)	752 ms	586 ms (−22%)	1943 ms	1165 ms (−40%)	0.53	0.32

What worked

CFG B=2 batching on T3 — biggest win (−42 to −45% T3). Mirrors use_b2 from S3Gen; cond+uncond pack into ne[3]=2 and share a unified KV buffer. CPU keeps the two-call fallback. WAV byte-exact vs baseline.
--cfm-steps 7 — −22% S3Gen at log-mel cosine 0.995 vs N=10. Default unchanged (10).
ggml_swiglu_split — single fused kernel_swiglu_f32 instead of three element-wise ops. Marginal on M3 Ultra (Q4_0 −4%) but byte-exact and future-proof.

What didn't work

F16 KV cache — reverted as neutral on M3 Ultra (within noise, byte-exact audio). Strong evidence flash_attn_ext already runs its inner matmul at F16 regardless of K/V storage dtype. Left as one-line opt-in (GGML_TYPE_F32 → F16) for memory-bound backends (e.g. M4).
Custom Metal kernel patches for RMS_NORM+MUL / SwiGLU — already covered by upstream's can_fuse(...). F16 mul_mat + add(bias) stretch skipped — Llama Q/K/V/O have no bias.
--cfm-steps 6 — sits right on the cos ≥ 0.99 threshold (PCM cos drops to 0.88, attack reconstruction drifts). Opt-in only.

The §3.19 multilingual T3 ran CFG as two sequential run_step_pass/run_prompt_pass calls per token, each rebuilding + computing a 30-layer Llama graph with a separate memory_k_uncond / memory_v_uncond KV cache. On Metal this doubled per-step kernel dispatch + weight-read overhead -- exactly the overhead `use_b2` already removed for S3Gen's CFM (chatterbox_tts.cpp:1994 / §3.19).

This commit mirrors that on T3 by packing cond + uncond into the batch dim (ne[3]=2) for inputs_embeds, pos_ids, kq_mask, and the per-layer Q/K/V activations. RoPE + flash_attn_ext both broadcast head/seq dims over batch out of the box, so build_llama_block only grew an int B parameter and a size_t b_offset_elems (one cache slab offset for the legacy B=1 CPU fallback).

KV layout rework: the two parallel 1-D F32 KV buffers (memory_k + memory_k_uncond) are now a single contiguous 2 * kv_layer_elems buffer per layer, cond at offset 0, uncond at offset kv_layer_elems. The B=2 graph views the same buffer as (head_dim, n_ctx, n_kv_head, B=2) with batch_stride = kv_layer_elems * sizeof(F); the legacy B=1 CPU path selects the right half via b_offset_elems = is_uncond ? kv_layer_elems : 0. Total backend allocation is unchanged (still 2x kv_elements per cache); we just dropped two ggml_new_tensor_1d calls.

eval_step_mtl / eval_prompt_mtl dispatch the B=2 path when !ggml_backend_is_cpu(model.backend) -- mirrors use_b2 in S3Gen exactly. CPU keeps the two-call path (per-op B=2 work doubles without saving ops on ggml-cpu).

Bench (M3 Ultra Metal, §3.19 reference Spanish prompt, seed 42, greedy):
  - Q4_0 T3: 872 ms -> 502 ms (-42%); RTF 0.46 -> 0.35
  - F16  T3: 1099 ms -> 602 ms (-45%); RTF 0.53 -> 0.37

Parity gates passed:
  - Greedy decode token parity at --temp 0 --top-k 1: first 100 tokens identical to the two-call baseline on seed 42.
  - End-to-end WAV byte-exact match vs the §3.19 reference run on Q4_0 and F16 (cmp baseline.wav phase1.wav -> identical).
  - CPU fallback (--n-gpu-layers 0) still produces audio with the B=1 path.

The KV strides in build_llama_block are routed through ggml_type_size(memory_k->type), so flipping GGML_TYPE_F32 -> GGML_TYPE_F16 in load_model_gguf_mtl is a one-line opt-in for memory-bound backends. On M3 Ultra it's neutral (Metal flash_attn_ext was already running its inner matmul at F16 regardless of K/V storage dtype); kept at F32 to preserve the §3.19 numerics envelope.

Made-with: Cursor
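The unified KV slab described in this commit message can be sketched as plain offset arithmetic. This is only an illustration: the helper names and the sizes passed in (head_dim=64, n_ctx=4096, n_kv_head=16) are assumptions made for the example, not values read from the model or code from t3_mtl.cpp.

```python
# Illustrative offset arithmetic for one per-layer KV slab holding cond then uncond.
# All sizes below are hypothetical; they are not taken from the real model.

def kv_layout(head_dim: int, n_ctx: int, n_kv_head: int, elem_size: int) -> dict:
    """Describe a slab of 2 * kv_layer_elems elements (cond half, then uncond half)."""
    kv_layer_elems = head_dim * n_ctx * n_kv_head
    return {
        "kv_layer_elems": kv_layer_elems,
        # Same total as two separate 1-D buffers of kv_layer_elems each.
        "total_elems": 2 * kv_layer_elems,
        # Byte step between batch 0 (cond) and batch 1 (uncond) in the B=2 view.
        "batch_stride_bytes": kv_layer_elems * elem_size,
    }

def b_offset_elems(layout: dict, is_uncond: bool) -> int:
    """Legacy B=1 path: element offset of the half to read or write."""
    return layout["kv_layer_elems"] if is_uncond else 0

layout_f32 = kv_layout(head_dim=64, n_ctx=4096, n_kv_head=16, elem_size=4)
layout_f16 = kv_layout(head_dim=64, n_ctx=4096, n_kv_head=16, elem_size=2)
print(b_offset_elems(layout_f32, False), b_offset_elems(layout_f32, True))
print(layout_f32["batch_stride_bytes"], layout_f16["batch_stride_bytes"])
```

Because the stride is derived from the element size, switching the storage type only changes `batch_stride_bytes`, which is the property the one-line F32 -> F16 opt-in relies on.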

The s3gen_synthesize_opts.cfm_steps field already existed (and was honoured by the inner CFM loop in chatterbox_tts.cpp:1973), but the CLI only surfaced it for streaming chunks via --stream-cfm-steps. Non-streaming MTL was therefore locked at the GGUF's n_timesteps=10.

Add a top-level --cfm-steps N flag and route it into all three non-streaming s3gen_synthesize_opts setup sites:
  - tokens-file short-circuit (no T3, just S3Gen + HiFT)
  - regular non-streaming synthesis path
  - the --input-file live-input streaming path keeps using --stream-cfm-steps; --cfm-steps is documented as a non-streaming knob there.

Default stays at 0 (use the GGUF's n_timesteps), so existing scripts keep their behaviour.

Quality knee on the §3.19 reference Spanish prompt (M3 Ultra Metal Q4_0; log-mel cosine vs N=10):

  N=6   S3Gen 518 ms   log-mel-cos 0.990   PCM-cos 0.88   (borderline)
  N=7   S3Gen 571 ms   log-mel-cos 0.995   PCM-cos 0.94   ← recommended
  N=8   S3Gen 629 ms   log-mel-cos 0.997   PCM-cos 0.97
  N=10  S3Gen 730 ms   log-mel-cos 1.000   PCM-cos 1.00   (default)

N=7 cleanly clears the cos >= 0.99 gate; N=6 sits right on it (PCM cosine drops to 0.88 -- phase-coherent attack reconstruction starts to drift).

Made-with: Cursor
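As a rough illustration of how the quality-knee numbers in this commit message translate into a choice of N, the sketch below recomputes the S3Gen saving and picks the fastest step count that clears a cosine gate. The `margin` parameter and the helper names are assumptions added for the example; the repository's actual gate script is not shown in this PR.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# N -> (S3Gen ms, log-mel cosine, PCM cosine), copied from the commit message.
KNEE = {
    6: (518, 0.990, 0.88),
    7: (571, 0.995, 0.94),
    8: (629, 0.997, 0.97),
    10: (730, 1.000, 1.00),
}

def saving_pct(n: int, baseline: int = 10) -> float:
    """S3Gen time saved relative to the baseline step count, in percent."""
    base_ms = KNEE[baseline][0]
    return 100.0 * (base_ms - KNEE[n][0]) / base_ms

def fastest_clearing(gate: float = 0.99, margin: float = 0.0) -> int:
    """Fastest N whose log-mel cosine is at least gate + margin (margin is hypothetical)."""
    ok = [n for n, (_, mel, _) in KNEE.items() if mel >= gate + margin]
    return min(ok, key=lambda n: KNEE[n][0])

print(round(saving_pct(7)), fastest_clearing(), fastest_clearing(margin=0.002))
```

With no margin the gate technically admits N=6; asking for a small safety margin selects N=7, which matches the recommendation above.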

build_llama_block did `silu(gate) * up` as three separate ggml ops: ggml_silu(...), ggml_mul_mat(mlp_up, ...), ggml_mul(silu_out, up_out). That's a silu + mul element-wise pair on top of the two mul_mats, at 30 dispatches/token across layers.

Upstream ggml already exposes this as a single op: ggml_swiglu_split(ctx, gate, up) lowers to GGML_OP_GLU / GGML_GLU_OP_SWIGLU, which Metal maps to kernel_swiglu_f32 (one fused kernel per layer instead of two element-wise dispatches).

The pre-norm `mul(rms_norm(x), g)` pattern was already auto-fused by ggml-metal's can_fuse(RMS_NORM, MUL) path (kernel_rms_norm_mul_f32); left as-is so CPU + non-Metal backends get the same op shape.

Bench (M3 Ultra Metal, vs the B=2 CFG baseline from the previous commit, 3 warm runs averaged):
  - Q4_0 T3: 502 ms -> 482 ms (-4%, within plan's 5% gate but trend is positive across all 3 runs)
  - F16  T3: 602 ms -> 635 ms (+5%, within run-to-run variance)

Marginal on M3 Ultra (the SwiGLU kernel was not on the critical path once Phase 1 collapsed the per-step weight-read + dispatch cost). Kept for code clarity (single fused op vs three) + because future ggml-metal kernel improvements to swiglu will land here automatically.

WAV byte-exact vs the previous commit (kernel_swiglu_f32 is bit-equivalent to manual `silu(gate) * up`).

Made-with: Cursor
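For readers unfamiliar with the op, here is a minimal reference sketch of what the fused SwiGLU computes compared with the separate silu and mul steps. It operates on plain Python lists and is not the ggml or Metal implementation; the function names are chosen for the example.

```python
import math

def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_unfused(gate, up):
    """Two element-wise passes: silu over gate, then multiply by up."""
    silu_out = [silu(g) for g in gate]               # first element-wise op
    return [s * u for s, u in zip(silu_out, up)]     # second element-wise op

def swiglu_fused(gate, up):
    """One pass over both inputs, same per-element arithmetic."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]

gate = [-2.0, -0.5, 0.0, 0.5, 2.0]
up = [1.0, 2.0, 3.0, 4.0, 5.0]
print(swiglu_fused(gate, up))
```

Since both variants perform the same floating-point operations in the same order per element, the results match exactly here, which is the same property the byte-exact WAV gate checks end to end.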

README:
  - Multilingual benchmark table at the top grew M3 Ultra rows (`--cfm-steps 7` and default N=10, Q4_0 + F16) alongside the existing M4 rows. Best M3 Ultra config now hits RTF 0.30 / **48.4× faster than ONNX Runtime** on the same prompt; M4 numbers retained for §3.19/§3.20 continuity.
  - New "Multilingual (Mac Studio M3 Ultra, after §3.21 optimisation pass)" stage block under ## Performance, mirroring the M3 Ultra Turbo block: T3 / S3Gen / RTF for {Q4_0, F16} × {N=10, N=7}.
  - The MTL `tts-cli` example mentions `--cfm-steps`; flag points readers at PROGRESS.md §3.21 for the quality knee numbers.

PROGRESS.md:
  - Append §3.21 mirroring the §3.19/§3.20 layout (pre-rationale -> what shipped -> bench tables -> what didn't work -> backlog). Documents the bench matrix on M3 Ultra Metal {Q4_0, F16} × {baseline, +Phase 1, +Phase 2, +Phase 4, +final} with stage breakdowns and audio-quality gates.
  - "What didn't work -- NEGATIVE results" subsection explicitly documents Phase 2 (F16 KV) as a wash on M3 Ultra (with the observation that ggml-metal's flash_attn_ext was already running its inner matmul at F16 regardless of K/V storage dtype) and the Phase 4-stretch Metal-patch fusions as unnecessary because upstream already covers them.
  - "What's next for MTL" backlog from §3.19 is updated with strikes on items now done (T3 quantisation, --cfm-steps); items remaining (heterogeneous-core CPU thread default, ja/he/ru/zh/hi tokenizer, speculative decoding, M4 F16 KV measurement) listed at the end.

Made-with: Cursor

…p scaffolding

Three small allocator-side cleanups on top of §3.21: all byte-exact WAV output (MD5 match against multilingual_merged HEAD), all neutral-to-slightly-positive on M3 Ultra (the savings are too small to escape run-to-run noise here, ~1% on S3Gen wall time, but the changes remove unambiguously wasted work that matters more on slower CPUs / older Metal where the topology walks and 64 MB memset are pricier).

1. `t3_mtl.cpp`: drop the explicit `ggml_gallocr_reserve` before `ggml_gallocr_alloc_graph` in the four `run_*_pass[_b2]` paths. `alloc_graph` already calls `ggml_gallocr_needs_realloc` and only triggers a re-reservation when the graph's per-node sizes grew. The per-step graph keeps the same node count + same shapes for every n_past >= 1, so 83 of the 84 step-pass reserves were doing a full O(n_nodes) topology walk for nothing. Per-step graph build itself is unchanged.

2. `chatterbox_tts.cpp` `run_hift_decode`: turn the 64 MB scratch `std::vector<uint8_t> buf(buf_size)` into a `thread_local`. The previous version forced a 64 MB memset on every HiFT call (one per `--out` invocation in batch mode, one per chunk in streaming). ggml_init resets the arena pointer, so the buffer is reused safely across calls without leaking tensor metadata.

3. `chatterbox_tts.cpp` `compute_time_mlp`: hoist the per-step ggml_init / build-graph / gallocr_new / gallocr_reserve / gallocr_free sequence into a `thread_local time_mlp_cache`. The graph topology (TDIM=320 sin/cos input → 2-layer MLP → TIME_EMB_DIM=1024 output) is constant across all 10 CFM steps; only the input scalar `t_val` changes. Now we build + reserve once, then per-step we just `alloc_graph` + `tensor_set` + `compute` + `tensor_get`. Cache key is `(backend)` so a backend swap rebuilds.

Bench (M3 Ultra Metal, Q4_0, ES prompt, seed 42, --temp 0 --top-k 1, jfk.wav voice, 3 invocations):

| Stage       | §3.21 base | this commit |     Δ |
|-------------|-----------:|------------:|------:|
| T3 ms       |        479 |         474 |   -1% |
| cfm_total   |        561 |         550 |   -2% |
| hift_decode |        128 |         125 |   -2% |
| S3Gen ms    |        730 |         728 | -0.3% |
| Total ms    |       1209 |        1202 | -0.6% |

Within run-to-run noise (~1.5% on T3) but consistently the same direction across three back-to-back runs, and zero regression risk because `ggml_gallocr_alloc_graph` covers the lazy-reserve case.

WAV byte-exact gate (md5sum):
  - pre-commit:  79002f09bc48dda95ec0c2cfc2b895bd
  - post-commit: 79002f09bc48dda95ec0c2cfc2b895bd

Made-with: Cursor
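The "83 of the 84 reserves were redundant" argument can be illustrated with a toy model of reserve-on-demand allocation. `LazyAllocator` below is a hypothetical stand-in written for this example; it models the behaviour as the commit message describes it and is not the ggml gallocr API.

```python
class LazyAllocator:
    """Toy allocator that re-reserves only when the graph no longer fits."""

    def __init__(self):
        self.reserved_sizes = None   # per-node sizes covered by the last reserve
        self.reserve_calls = 0

    def needs_realloc(self, node_sizes) -> bool:
        if self.reserved_sizes is None:
            return True
        if len(node_sizes) != len(self.reserved_sizes):
            return True
        # Any node larger than what was reserved forces a new reservation.
        return any(n > r for n, r in zip(node_sizes, self.reserved_sizes))

    def alloc_graph(self, node_sizes) -> None:
        if self.needs_realloc(node_sizes):
            self.reserved_sizes = list(node_sizes)
            self.reserve_calls += 1

alloc = LazyAllocator()
step_graph = [4096, 4096, 1024]   # hypothetical per-node byte sizes, constant per step
for _ in range(84):               # 84 step passes, as in the commit message
    alloc.alloc_graph(step_graph)
print(alloc.reserve_calls)
```

Under this model only the first step pass reserves; calling an explicit reserve before every `alloc_graph` would repeat the walk 84 times for the same result.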

PROGRESS gets a new §3.22 documenting the three small allocator-side cleanups shipped in 6141cf2 (drop redundant gallocr_reserve; thread_local HiFT scratch; cached time_mlp graph). Numbers on M3 Ultra are within run-to-run noise (~1-2% per stage, ~0.6% total) but consistently the same direction across 3 invocations and pass the byte-exact WAV gate against §3.21 HEAD. Section also documents the M3 Ultra "compute floor" finding: each CFM step is ~52 ms steady-state of mat-mul work (480 nodes/step, already amortised in one command-buffer commit), so allocator shaving can only chip at the 20 ms first-step build overhead, not the steady-state. Identifies F32 mul_mm + add(bias) shader fusion as the next worthwhile attack — deferred to a future round. Made-with: Cursor

Multilingual T3 step path on Metal previously dispatched three separate Q4_0 mat-muls per layer for the Q / K / V projections (30 layers × 84 tokens × 3 = 7560 dispatches per call inside the same compute_graph commit). All three weights have identical Q4_0 shape (n_embd=1024 in × n_embd=1024 out), so they can be concatenated along the output (M) dim into one [n_embd, 3*n_embd] Q4_0 tensor and fed into a single mat-mul whose output's first n_embd rows are Q, middle are K, last are V.

Implementation:
  - `chatterbox_model` gains `ctx_stack` + `buffer_stack` (a backend buffer alongside `buffer_w` and `buffer_kv`).
  - `llama_layer` gains `wqkv : [n_embd, 3*n_embd]`. Allocated only on non-CPU backends (CPU keeps the per-projection path; ggml-cpu's per-kernel overhead is already negligible and the +30 MB weight footprint trades unfavourably with thread-cache locality).
  - At load time, after the GGUF weights buffer is allocated, the per-layer wq/wk/wv bytes are copied into wqkv via a host-side scratch buffer (Q4_0 row layout is M-major contiguous: row `r` is K/32 blocks of 18 bytes packed back-to-back, so concat is a flat byte append).
  - `build_llama_block` runs ONE `ggml_mul_mat(W_qkv, cur)` and then carves out Q / K / V via strided `ggml_view_2d` / `_3d` views straight into the (HD, NH, N[, B]) shape RoPE expects. No ggml_reshape (would need contiguous source) and no ggml_cont (would defeat the saving). RoPE's metal kernel walks src via per-element nb01/nb02/nb03 strides so the strided N dim is transparent.

Process-wide `t3_stack_registry` + atexit hook in t3_mtl.cpp frees buffer_stack before Metal's static device destructors run; without this the new buffer triggers Metal's `[rsets->data count] == 0` assert at process exit (residency sets still referenced through an orphan backend buffer). Mirrors the existing `s3gen_model_cache_release` atexit pattern. `free_t3()` in main calls `t3_stack_unregister()` on the error-path early-returns so we don't double-free.

Why gate/up isn't stacked too: the multilingual T3 converter ships `mlp_gate` as F16 and `mlp_up` as Q4_0 (verified via gguf reader on models/chatterbox-t3-mtl-q4_0.gguf: gate.weight type=F16, up.weight type=Q4_0 for every layer). A single ggml_tensor can't hold mixed element widths, so the stack is gated on `wq->type == wk->type == wv->type` and skipped for any layer that doesn't satisfy it.

Bench (M3 Ultra, Metal, ES prompt + jfk.wav voice, seed 42, --temp 0 --top-k 1, mean of 5 invocations):

| Variant | T3 §3.22 base | T3 this commit |         Δ |
|---------|--------------:|---------------:|----------:|
| Q4_0    |        474 ms |     **433 ms** | **-8.7%** |
| F16     |        522 ms |     **493 ms** | **-5.5%** |

(The -9.6% originally quoted for Q4_0 corresponds to the 479 ms single-run baseline in the breakdown below; against the 474 ms mean the delta is -8.7%.)

Per-stage breakdown for Q4_0 (one run):

  encoder      30 ms  (unchanged)
  cfm_total   549 ms  (unchanged)
  hift        125 ms  (unchanged)
  T3          433 ms  (-46 ms vs baseline 479 ms)
  Total      1153 ms  (vs 1209 ms baseline, -4.6%)

WAV byte-exact gate: md5 `79002f09bc48dda95ec0c2cfc2b895bd` matches across §3.22 base and post-commit at five separate invocations (temp=0, top-k=1, deterministic).

Made-with: Cursor
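The "concat is a flat byte append" claim follows from the Q4_0 row size. The sketch below works through that arithmetic and the row split on dummy data; the helper names are assumptions for illustration, and the byte strings are placeholders rather than real quantised weights.

```python
Q4_0_BLOCK_ELEMS = 32  # elements per Q4_0 block
Q4_0_BLOCK_BYTES = 18  # bytes per block: 2-byte scale + 16 bytes of packed 4-bit values

def q4_0_row_bytes(k: int) -> int:
    """Packed size of one row of k elements."""
    if k % Q4_0_BLOCK_ELEMS:
        raise ValueError("row length must be a multiple of the block size")
    return k // Q4_0_BLOCK_ELEMS * Q4_0_BLOCK_BYTES

def stack_qkv(wq: bytes, wk: bytes, wv: bytes) -> bytes:
    """Rows are contiguous, so stacking along the output dim is plain concatenation."""
    return wq + wk + wv

def split_rows(rows, n_embd: int):
    """Split the 3*n_embd output rows of the stacked mat-mul back into Q, K, V."""
    return rows[:n_embd], rows[n_embd:2 * n_embd], rows[2 * n_embd:3 * n_embd]

n_embd = 1024
row_bytes = q4_0_row_bytes(n_embd)
wq = bytes([1]) * (row_bytes * n_embd)   # dummy payloads, distinguishable by value
wk = bytes([2]) * (row_bytes * n_embd)
wv = bytes([3]) * (row_bytes * n_embd)
w_qkv = stack_qkv(wq, wk, wv)

out_rows = list(range(3 * n_embd))       # stand-in for the stacked output rows
q, k, v = split_rows(out_rows, n_embd)
print(row_bytes, len(w_qkv), q[0], k[0], v[0])
```

In the real graph the split is done with strided views rather than copies; the slicing here only shows which output rows belong to which projection.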

PROGRESS gets a new §3.23 documenting the Phase-15 stacked W_qkv optimisation shipped in 1f43ecc. Key contents:
  - Implementation: Q/K/V rows concatenated row-wise into one Q4_0 tensor at load time; build_llama_block runs ONE mat-mul + strided view-split into the (HD, NH, N[, B]) layout RoPE expects, no reshape (would need contiguous source) and no cont (would defeat the saving).
  - Why gate/up isn't stacked: the multilingual T3 ships mlp_gate as F16 and mlp_up as Q4_0, so a single ggml_tensor can't hold them.
  - Why CFM transformer Q/K/V isn't stacked: documented as a negative result with the GPU-occupancy explanation. CFM has M=512 / N=174 which already saturates M3 Ultra's 60 cores in one wave; the stacked M=1536 / N=174 path needs three waves where one was enough. T3 wins because its step graph has M=1024 / N=1, which was at ~25% GPU occupancy un-stacked.

Bench (M3 Ultra, Q4_0): T3 474 → 433 ms (-8.7%), Total 1192 → 1153 ms (-3.3%), WAV byte-exact md5 79002f09bc48dda95ec0c2cfc2b895bd.

Cumulative since §3.20 baseline: T3 872 → 433 ms (-50%), RTF 0.46 → 0.29.

Made-with: Cursor

…f.py

Multilingual S3Gen ships HiFT as 246 F32 tensors (~80 MB) because the converter and `requantize-gguf.py` both wholesale-rejected 3-D shapes: `len(shape) != 2` always returned False in `should_quantize`. This commit unblocks the rejection on both fronts:

1. **`should_quantize` now allows 3-D when ne[0] (the conv kernel size K) is a multiple of the quant block size.** For Q4_0 / Q5_0 / Q8_0 (block 32) the HiFT stack still gates out universally: K ∈ {3, 7, 11, 16}, none 32-aligned, and the block layout assumes blocks span 32 consecutive ne[0] values within a fixed (ne[1], ne[2]) row, which only works when K itself is 32-aligned. The plan's predicted-positive `K * IC % 32 == 0` check is necessary but not sufficient because re-quantising with a flattened (K*IC) reduction dim requires storing the result with ggml shape `(K*IC, OC)` (i.e. 2-D on disk), which then breaks `ggml_im2col(kernel, ...)` on the C++ side, which derives the kernel size from `kernel->ne[0]`. Documented inline; the 3-D branch is forward-compatible for any future converter that ships K-aligned conv kernels.

2. **Add `f16` as a target dtype.** F16 has block_size = 1, so the alignment gate is a no-op for any shape. Combined with a new `--name-filter SUBSTRING` arg that constrains the rewrite to a tensor-name substring, this lets us downcast HiFT conv kernels F32 → F16 without disturbing the existing Q4_0 CFM linears.

3. **Pass-through path for already-quantised sources.** A Q4_0 source GGUF was previously a hard error in the pass-through branch (the existing code reshapes `data` to the element shape and Q-types have packed bytes). Now branches on `GGML_QUANT_SIZES[t.tensor_type][0] == 1` (the float-types block) and writes the raw byte buffer through with the original shape for already-quantised inputs.

Two-pass recipe (multilingual S3Gen, Metal target):

    python scripts/requantize-gguf.py \
        models/chatterbox-s3gen-mtl-f16.gguf \
        /tmp/intermediate.gguf f16 --name-filter hift/
    python scripts/requantize-gguf.py \
        /tmp/intermediate.gguf \
        models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0

Result on M3 Ultra Metal (3 invocations, ES prompt, --seed 42):

  baseline q4_0 GGUF (HiFT all F32):
    GGUF size              788.4 MB
    [hift_decode] median   124.9 ms
    [s3gen_total] median   727 ms
    WAV md5                79002f09bc48dda95ec0c2cfc2b895bd

  q4_0 + HiFT F16 GGUF (this commit's recipe; 64 of 246 HiFT tensors get F16, the remaining 21 source_resblocks/* + 161 biases/scalars stay F32):
    GGUF size              754.6 MB (-4.3 %)
    [hift_decode] median   121.3 ms (-2.9 %)
    [s3gen_total] median   726 ms (within noise)
    PCM cosine vs baseline 0.999851 (essentially indistinguishable)

The 21 source_resblocks/* tensors that match the existing `/s` substring deny-list (intentional false positive from the Turbo deny-list era) are kept F32: tried unblocking them and the multilingual decode segfaults on `kernel_mul_mv_f32_f16_short`-not-found in the pinned ggml-metal build. Documented inline next to the `/s` entry; refining the deny-list to endswith-only is gated on either patching that kernel variant in or reshaping those tensors to a non-mat_mv shape.

Validation:
  - chatterbox CLI on the new GGUF produces deterministic WAV (md5 ec58d3e65ab8e9c6f4edefb15b169ea5 every run).
  - PCM cosine 0.999851 vs the F32-HiFT baseline, well above the §3.20 cos ≥ 0.99 quality gate; max abs i16 diff 616 / 32768 ≈ 1.9 %.
  - 3 × 3-invocation benches (baseline before vs after) confirm -2.9 % HiFT decode median, -4.3 % GGUF size, no T3 / CFM / S3Gen-total impact.

Made-with: Cursor
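The alignment argument behind the Q4_0 negative result and the F16 alternative can be sketched as a small check. This is an illustration under stated assumptions: `inner_dim_aligned` is a hypothetical helper, not the script's real `should_quantize`, and the example shapes other than the kernel sizes K are made up.

```python
# Elements per quantisation block for each target type (F16 is unblocked).
BLOCK_SIZE = {"q4_0": 32, "q5_0": 32, "q8_0": 32, "f16": 1}

def inner_dim_aligned(shape, target: str) -> bool:
    """shape is in ggml order, so shape[0] is ne[0] (kernel size K for a conv kernel)."""
    return shape[0] % BLOCK_SIZE[target] == 0

# Kernel sizes quoted in the commit message; IC and OC below are hypothetical.
HIFT_KERNEL_SIZES = (3, 7, 11, 16)
hift_shapes = [(k, 256, 128) for k in HIFT_KERNEL_SIZES]

q4_ok = [inner_dim_aligned(s, "q4_0") for s in hift_shapes]
f16_ok = [inner_dim_aligned(s, "f16") for s in hift_shapes]
print(q4_ok, f16_ok)
```

None of the HiFT kernel sizes pass the block-32 gate, while every shape passes for F16, which is why the shipped recipe downcasts to F16 instead of quantising.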

PROGRESS gets a new §3.24 documenting both the Q4_0 negative finding (K-dim alignment is structurally blocked: HiFT K ∈ {3, 7, 11, 16}, none 32-aligned, and the 2-D-on-disk fix breaks ggml_im2col on the C++ side) and the F16 alternate path that ships: script + recipe + bench + quality gate.

Headline numbers (M3 Ultra Metal, ES prompt, 3 invocations):

  GGUF size       788.4 MB → 754.6 MB (−4.3%)
  [hift_decode]   124.9 ms → 121.3 ms (−2.9%)
  PCM cosine      0.999851 across runs (well above 0.99 gate)

The smaller-than-planned win is honestly attributed:
  - 21 of the 246 HiFT tensors (source_downs/*, source_resblocks/*) trip an existing `/s` deny-list false-positive; refining the deny-list segfaults at runtime (kernel_mul_mv_f32_f16_short isn't compiled in the pinned ggml-metal). Tracked as follow-up.
  - Activation traffic + im2col stay F32; F16 weights only cut the weight-load phase of mul_mat, not the whole decode.

Made-with: Cursor

Per the plan, the stretch goal was the F32 mul_mm + add(bias) shader fusion in the metal patch. Documented as a tracked follow-up in §3.24 alongside the other two HiFT-related leftovers (missing kernel variant, 2-D-on-disk Q4_0 with C++ conv1d_f32 branch) — the F16 alt-path (which actually shipped today) was the cheaper and more immediately measurable win, so the F32 fusion gets time-budgeted in a future session. Made-with: Cursor

…_attn_ext

Tried flipping `src/chatterbox_tts.cpp::conformer_block()` (the 10 rel-pos Conformer blocks that make up S3Gen's flow encoder) from the classic `ggml_soft_max` + separate V mat-mul path to `ggml_flash_attn_ext`, following the same pattern already used in `t3_mtl.cpp` (T3 Llama attention) / `chatterbox_tts.cpp::basic_tfm` (CFM transformer) and just shipped on `parakeet.cpp` §15.8 (their Conformer rel-pos MHA).

Bench (M3 Ultra, Metal, Q4_0, Spanish prompt "Hola mundo, esta es una prueba multilingue.", seed 42, 3 invocations averaged):

  stage            | baseline |       FA | delta
  -----------------|---------:|---------:|-------------------------------
  [encoder] ms     |      ~43 |     29.6 | -13 / -31% (flow encoder only)
  S3Gen ms         |      721 |      708 | -13 / -1.8%
  T3 ms            |      433 |      430 | noise
  CFM total ms     |      546 |      538 | noise (-8)
  HiFT decode ms   |      126 |      125 | noise
  WAV md5          | 79002f09 | a4169d68 | differs

The speedup is real (40 saved kernel dispatches per synth from collapsing 10 * (softmax + permute + mul_mat with V) into 10 * flash_attn_ext) but the WAV quality gate fails.

`ggml_flash_attn_ext` hard-requires an f16 mask (`ggml.c:5320 GGML_ASSERT(mask->type == GGML_TYPE_F16)`). The Conformer rel-pos bias `bd_final = mul_mat(p_perm, q_plus_v)` is f32; casting it to f16 drifts each element by ~1e-4 (f16 has ~10 bits of mantissa, `bd_final` values sit in the ±5 to ±10 range). That drift compounds through:

    flow encoder (10 conformer blocks) -> CFM 10-step diffusion U-Net -> HiFT vocoder -> waveform

which is far more drift-amplifying than parakeet's joint-argmax downstream.

Measured WAV:

    lengths base=83520 fa=83520 samples
    n=83520 cos=0.998647 rms_diff=69.334 rms_base=1332.522 max_abs_diff=1702.0
    gate: FAIL (threshold > 0.9998; got 0.998647)

Three rescue options explored and rejected:

1. Pass bd_scaled in f32 via ggml_flash_attn_ext: blocked by the hard f16 assertion.
2. Compute bd_final in f16 from the start (cast p_perm and q_plus_v): pushes the same precision loss earlier in the graph, doesn't improve the downstream cosine.
3. Skip the mask (pass nullptr): mathematically wrong; bd_final *is* the relative-position bias that Conformer attention requires, dropping it breaks position-aware attention.

Reverted: conformer_block stays on the ggml_soft_max path. WAV md5 restored byte-exact to 79002f09bc48dda95ec0c2cfc2b895bd against the §3.22 reference (verified /tmp/cb_revert.wav == /tmp/cb_base_1.wav).

Code change is additive (documentation-only): 16-line comment in `conformer_block()` explaining why flash_attn is intentionally not taken here, pinning the negative-finding cosine number + the speed upside that was measured, and pointing at the parakeet §15.8 counterexample so the next person who looks at this doesn't redo the experiment.

PROGRESS §3.25 adds the full writeup with the measurement table, the reasoning for why parakeet could absorb this precision drift but chatterbox can't (argmax-vs-waveform downstream sensitivity), the three rejected rescue options, and the pointer to the two remaining quick-win candidates that don't have this compounding problem:
  - strip redundant `ggml_cont` after Conformer Q/K/V permutes (some are removable because Metal mul_mat walks strides natively)
  - F32 `mul_mm + add(bias)` shader fusion in patches/ggml-metal-chatterbox-ops.patch (already queued from §3.24)

Made-with: Cursor
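The size of the f32 -> f16 rounding error for values in the ±5 to ±10 range can be measured directly with Python's half-precision `struct` format. The sample values below are arbitrary choices for illustration, not measured `bd_final` entries.

```python
import math
import struct

def to_f16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f16_spacing(x: float) -> float:
    """Gap between adjacent f16 values near x (normal range; 10 stored mantissa bits)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 10)

samples = [5.0, 6.1, 7.3, 9.7]   # arbitrary values in the range the commit mentions
for v in samples:
    abs_err = abs(to_f16(v) - v)
    print(v, to_f16(v), abs_err, abs_err / v)
```

For these samples the relative error is on the order of 1e-4 (bounded by 2^-11), while the absolute error can reach a few thousandths because adjacent f16 values are 2^-8 apart in [4, 8) and 2^-7 apart in [8, 16).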

…ants; relax /s deny; fix Q-type passthrough

Closes the open follow-up from §3.24 §3.25 ("Patch the missing `kernel_mul_mv_f32_f16_short` variant to unblock the remaining 21 HiFT source_* conv kernels").

Problem. §3.24 converted 64 of HiFT's F32 conv-kernel weights to F16 but kept the 21 `source_*` conv kernels at F32 because requantize-gguf.py's `/s` glob swept them, and, once that glob was narrowed, the resulting Metal build crashed:

    ggml_metal_library_compile_pipeline: Error Domain=MTLLibraryErrorDomain
      Code=5 "Function kernel_mul_mv_f32_f16_short was not found in the library"
    ... SIGSEGV at first HiFT decode (exit 139).

HiFT's `conv1d_f32` does `ggml_mul_mat(im2col_f32, kernel_reshaped_f16)`, which the Metal dispatcher compiles as `kernel_mul_mv_f32_f16_short` (T0=src0=im2col=f32, T1=src1=kernel=f16, short-axis path because `source_downs/2/weight` has OC=64). That template instantiation was missing from the pinned ggml-metal (commit 58c38058): the family shipped `f32_f32`, `f16_f32`, `f16_f16`, `bf16_f32`, `bf16_bf16` but not `f32_f16`.

Fix, three one-liner template instantiations in ggml-metal.metal:

    // kernel_mul_mv_t_t family (full-shape mat-vec)
    template [[host_name("kernel_mul_mv_f32_f16")]]
    kernel mul_mv_t_t kernel_mul_mv_t_t<float, half>;

    // kernel_mul_mv_t_t_4 family (vec4 path)
    template [[host_name("kernel_mul_mv_f32_f16_4")]]
    kernel mul_mv_t_t_4 kernel_mul_mv_t_t_4<float, float4, half, half4>;

    // kernel_mul_mv_t_t_short family (short-axis path: HiFT's actual
    // hit, since source_downs/2/weight has OC=64 which trips the
    // "small matrix-vector" fast path)
    template [[host_name("kernel_mul_mv_f32_f16_short")]]
    kernel mul_mv_t_t_short_t kernel_mul_mv_t_t_short<float, half>;

The `_impl` bodies already handle arbitrary casts via `(float) x[i] * (float) y[i]`; only the symbol lookup was missing.

requantize-gguf.py gets three paired changes so the recipe works end-to-end on the current gguf-0.18 writer:

1. `/s` deny narrowed to `/scale`. The broad `/s` glob was originally a rough proxy for norm-scale params but incidentally swept every `hift/source_*/` weight + bias (60+ HiFT tensors, 21 of which are 3-D conv kernels that §3.26 now unblocks). With the Metal kernels shipped, those 21 conv weights are safe to F16. The norm-scale tensors the deny was targeting (`/scale`, `/ln_`, `/norm/`, `/gamma`) remain covered by their own stricter patterns.

2. Q-type passthrough byte-shape fix. gguf-0.18's `add_tensor_info` treats `raw_shape` as **byte** layout (inner dim is bytes/row, not elements/row) when tensor.dtype is uint8. The previous code passed element shape verbatim, which crashed with `ValueError: Quantized tensor bytes per row (512) is not a multiple of Q4_0 type size (18)` on any two-pass recipe that re-quantised a GGUF already carrying Q-type tensors (e.g. `f16 → q4_0` or `q4_0 → f16 --name-filter`). Fix: `byte_inner = elements_inner // block_size * type_size` before handing to the writer. Floats (block_size=1) keep the existing element-shape path unchanged.

3. Docstring updated with the now-correct two-pass recipe:

       python scripts/requantize-gguf.py \
           models/chatterbox-s3gen-mtl-f16.gguf \
           /tmp/intermediate.gguf f16 --name-filter hift/
       python scripts/requantize-gguf.py \
           /tmp/intermediate.gguf \
           models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0

Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42, 3x3 runs averaged):

  stage             |  §3.24 |  §3.26 | delta
  ------------------|-------:|-------:|--------------
  [encoder] ms      |   31.3 |   30.5 | -0.8 (noise)
  [cfm_total] ms    |  541.9 |  550.4 | noise
  [hift_decode] ms  |  121.3 |  121.1 | neutral
  S3GEN_INFER_MS    |    709 |    724 | +15 (noise)
  T3_INFER_MS       |    440 |    440 | 0
  GGUF size         |  754.4 |  746.7 | -7.7 MB

Speed neutral on M3 Ultra unified memory (the 21 source_* tensors are small; the largest is source_resblocks/0/convs1/*/weight at ~3.4 MB F32 / ~1.7 MB F16). Predicted +2–4 ms HiFT from §3.24 falls inside bench noise here; on bandwidth-limited targets (M4 Air, iOS ANE-adjacent) expect the full 3–5 % HiFT speedup seen in §3.24's existing 64 tensors. The real win is the 7.7 MB (~1.0 %) GGUF shrink on a multilingual-distribution package, plus closing the last known §3.24 blocker.

Parity gates:
  - test-metal-ops: all four pre-existing ops (diag_mask_inf, pad_ext, conv_transpose_1d x3 + tiny edge) PASS; the new mul_mv_f32_f16 variants are covered by end-to-end audio parity (their inner math is identical to the shipped <half, float> / <half, half> / <float, float> variants, they differ only in the type tags).
  - WAV parity vs §3.24 baseline on ES-prompt / jfk-voice / seed 42:

        md5 §3.24 baseline    : ec58d3e65ab8e9c6f4edefb15b169ea5
        md5 §3.26 v2 (3 runs) : d8a1b22375dbcb2259c686426a7d76c5 x3
        lengths 83520/83520
        cos 1.000000 PASS (threshold > 0.9998)
        rms_diff 0.464 rms_base 1332.66
        max_abs_diff 4 (of ±32767)

    → 0.035 % relative RMS drift, 0.012 % max sample drift. Audibly identical. Deterministic across 3 invocations.

Three §3.24 follow-ups previously tracked:
  - ~~kernel_mul_mv_f32_f16_short patch~~: ✓ shipped this section.
  - Q4_0 HiFT via 2-D-on-disk + `conv1d_f32` branch: still deferred (larger surgery across converter + C++).
  - F32 mul_mm + add(bias) shader fusion: still deferred (~150 LOC Metal kernel work for +10–25 ms S3Gen).

Made-with: Cursor
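The Q-type passthrough fix in this commit message is a one-line shape conversion, sketched below. The `QUANT_SIZES` table is a small hypothetical subset written out for the example (block size in elements, type size in bytes); the real script reads these from the gguf package, and `byte_shape` is an illustrative name rather than the script's function.

```python
# type name -> (elements per block, bytes per block); illustrative subset only.
QUANT_SIZES = {"F32": (1, 4), "F16": (1, 2), "Q4_0": (32, 18), "Q8_0": (32, 34)}

def byte_shape(element_shape, qtype: str):
    """Convert an element shape (innermost dim last) to the byte shape for packed types."""
    block_size, type_size = QUANT_SIZES[qtype]
    if block_size == 1:
        return tuple(element_shape)          # float types keep the element shape
    *outer, elements_inner = element_shape
    if elements_inner % block_size:
        raise ValueError("inner dim is not a multiple of the block size")
    byte_inner = elements_inner // block_size * type_size
    return (*outer, byte_inner)

print(byte_shape((1024, 512), "Q4_0"))   # 512 elements per row -> 16 blocks of 18 bytes
print(512 % 18, 288 % 18)                # element count fails the writer's check, bytes pass
```

Passing 512 as if it were bytes per row is what produced the "not a multiple of Q4_0 type size (18)" error quoted above; the converted inner dim of 288 bytes satisfies the same check.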

…l-metal

Closes the §3.22 / §3.24 / §3.26 follow-up "F32 mul_mm + add(bias) shader fusion in patches/ggml-metal-chatterbox-ops.patch".

The fusion in the pinned ggml-metal only covered Q-variant mul_mv (matrix-vector) kernels via helper_mv_add_bias; the mul_mm (matrix-matrix) kernel that the CFM transformer actually hits at T·B ≥ 2 had no equivalent. This patch wires one in.

Three pieces:

1. kernel_mul_mm in ggml-metal.metal gets:

       FC_mul_mm_has_bias_          [[function_constant(FC_MUL_MM + 2)]]
       FC_mul_mm_has_residual_      [[function_constant(FC_MUL_MM + 3)]]
       device const char * bias     [[buffer(4)]]
       device const char * residual [[buffer(5)]]

   When either FC is true the kernel routes through the shmem-backed scalar-copy path and folds bias (broadcast over [ne0]) and/or residual (same shape + stride as dst) into the copy loop. Same post-matmul math as helper_mv_add_bias. The compiler drops the un-selected branch, so there is zero overhead when neither is set.

2. get_pipeline_mul_mm in ggml-metal-device.{cpp,h}: new has_bias / has_residual flags bake into the pipeline name (kernel_mul_mm_<T0>_<T1>_bci=X_bco=Y_bias=Z_res=W) and FC values. Shmem bumped from 4 KB+2 KB to 8 KB when fused.

3. Dispatcher ggml_metal_op_mul_mat in ggml-metal-ops.cpp: mirrors the Q-mul_mv fusion lookup. Tries {MUL_MAT, ADD, ADD} first, falls back to {MUL_MAT, ADD}. Handles both orderings of the residual add (ggml_add is commutative; chatterbox's basic_tfm emits ggml_add(x, attn_out) with residual x as src[0] and the mul_mat+bias result as src[1]). Writes fused dst to node(idx + n_fuse - 1) so the value lands where the skipped ADDs would have written, and returns n_fuse so the outer loop skips them.

Kernel variants verified via compile_pipeline trace on M3 Ultra:

    kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0   ← CFM tfm linears
    kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0   ← edge blocks
    kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0    ← time_mlp etc.
    kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0
    kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0   ← unfused passthroughs
    kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0

~1820 saved ggml_add dispatches per synthesis. No `res=1` variants fire in the current chatterbox graph: the ADD(residual) in basic_tfm is separated from the ADD(bias) by layer_norm + mul_mat + add(bias) + gelu_erf + mul_mat + add(bias), so can_fuse correctly refuses to span those nodes. Infrastructure is in place either way for consumers whose residual is adjacent to mul_mat.

Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42, 5 invocations):

    stage             | §3.26 baseline | §3.27 fused | delta
    ------------------|---------------:|------------:|----------:
    [encoder] ms      |           31.3 |        30.5 | noise
    [cfm_total] ms    |          541.9 |       542.2 | neutral
    [hift_decode] ms  |          121.3 |       121.2 | neutral
    S3GEN_INFER_MS    |            709 |       713.2 | +4 noise
    T3_INFER_MS       |            440 |       433.4 | -7 noise

WAV md5 byte-exact across 5 runs: d8a1b22375dbcb2259c686426a7d76c5 (matches §3.26 v2 baseline). test-metal-ops PASSes on all four existing ops.

GGML_METAL_FUSION_DISABLE=1 cross-check: CFM 568.9 ms steady, confirming the entire fusion system (mine + pre-existing norm+mul+add + Q-mul_mv+bias) is worth ~27 ms aggregate; my mul_mm contribution is the small end of that.

Ships as **neutral on M3 Ultra** for two reasons:

1. M3U's Metal per-dispatch overhead is ~20–30 µs; 1820 dispatches * ~25 µs = ~45 ms theoretical, but many overlap with subsequent kernels via Metal's command buffer and don't sit on the critical path.
2. Forcing has_bias=true through the shmem path (instead of direct-store + post-barrier bias-add, which was too complex to retrofit into both tensor-API + simdgroup-fallback paths in one session) costs roughly what the dispatch savings deliver.

Still shipping because:

- Byte-exact + test-metal-ops PASS (correctness guaranteed).
- M4 Air / iPhone / iPad have proportionally higher per-dispatch overhead and lower core counts; expected +5–15 ms S3Gen win there (same ratio §3.24's HiFT F16 result predicted on M4).
- Mode 2/3 streaming synthesises short chunks where per-chunk dispatch count matters more relative to compute, so the fusion win should be proportionally larger there.
- Infrastructure leverage: FC_MUL_MM + 2/+3 slots + helper routing are plumbing future sessions will reuse (extend to mul_mm_id for MoE, extend to F16 weights, reclaim the shmem-roundtrip cost on M3U by patching the direct-store paths).

PROGRESS §3.27 has the full bench table + rationale + what's next (direct-store fold-in ~2–3 h, mul_mm_id, M4 validation).

patches/ggml-metal-chatterbox-ops.patch 733 -> 995 lines, +262, regenerated from pinned ggml 58c38058, applies cleanly via `git apply --check`.

Made-with: Cursor
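The dispatcher's lookup order (try the three-op pattern first, fall back to the two-op pattern, accept either operand order on the add) can be illustrated with a toy matcher. This is a minimal sketch, not ggml code: `Op`, `Node`, `add_consumes`, and `match_fusion` are hypothetical names, and the real can_fuse performs further checks (shapes, use counts, op support) that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified graph node; the real ggml_tensor carries far more state.
enum class Op { MulMat, Add, Other };

struct Node {
    Op  op;
    int src0;  // index of the first operand node, or -1 for a leaf
    int src1;  // index of the second operand node, or -1 for a leaf
};

// True when node `user` is an ADD that consumes node `producer` as either operand.
// Checking both slots covers add(x, y) as well as add(y, x).
static bool add_consumes(const std::vector<Node> & g, std::size_t user, std::size_t producer) {
    if (user >= g.size() || g[user].op != Op::Add) return false;
    const int p = static_cast<int>(producer);
    return g[user].src0 == p || g[user].src1 == p;
}

// Number of consecutive nodes, starting at idx, that one fused dispatch would cover:
// 3 for {MUL_MAT, ADD, ADD}, 2 for {MUL_MAT, ADD}, 1 when nothing fuses.
std::size_t match_fusion(const std::vector<Node> & g, std::size_t idx) {
    assert(idx < g.size());
    if (g[idx].op != Op::MulMat) return 1;
    if (!add_consumes(g, idx + 1, idx)) return 1;     // no adjacent bias add
    if (add_consumes(g, idx + 2, idx + 1)) return 3;  // bias add followed by an adjacent residual add
    return 2;                                         // bias-only fallback
}
```

A caller would advance its node loop by the returned count, which corresponds to the "returns n_fuse so the outer loop skips them" behaviour described above.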

…ff0 path)

Builds directly on §3.27 infrastructure. Closes the `mul_mat → add(bias) → gelu_erf` triple in CFM basic_tfm's FF gate projection (src/chatterbox_tts.cpp:738):

    ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff0_w, nx2), w.ff0_b);
    ff = ggml_gelu_erf(ctx, ff);
    ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff2_w, ff), w.ff2_b);

§3.27 brought mul_mat + add(bias) into a single dispatch via the shmem-backed scalar-copy path; §3.28 extends that same loop to apply gelu_erf as the last stage before writing to dst. The gelu is inline FP math on each element we're already reading+writing (no extra memory, no extra barriers, no extra shmem), so unlike §3.27's neutral-on-M3U result, this one is a clear net positive.

Three pieces:

1. ggml-metal.metal: FC_MUL_MM + 4 = FC_mul_mm_has_gelu_erf_. New branch at the end of the scalar-copy loop applying `0.5 * v * (1 + erf_approx(v * SQRT_2_INV))`, the same formula OP_UNARY_NUM_GELU_ERF uses, using the shared erf_approx<T> helper. Numerically identical to the unfused path.

2. get_pipeline_mul_mm: signature bumped to (op, has_bias, has_residual, has_gelu_erf); pipeline name extended with _gelu=N; shmem sizing gated on any-fold-in flag.

3. Dispatcher mul_mm path: new {MUL_MAT, ADD, UNARY} can_fuse lookup wedged between the {MUL_MAT, ADD, ADD} residual lookup and the {MUL_MAT, ADD} bias-only fallback. Verifies ggml_get_unary_op(f2) == GGML_UNARY_OP_GELU_ERF and f2->src[0] == f1 before fusing. Gated to GELU_ERF specifically (basic_tfm's only unary); other sub-ops (SILU/GELU/RELU/...) extend trivially but aren't needed for chatterbox.

Kernel variants actually compiled (from GGML_LOG_DEBUG trace):

    kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=1   ← CFM ff0
    kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=1   ← ff0 edge
    kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=0   ← ff2/to_out
    kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=0
    kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0_gelu=0    ← time_mlp
    kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0_gelu=0
    kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0_gelu=0   ← unfused
    kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0_gelu=0

gelu=1 variants handle 56 basic_tfm × 10 CFM steps × 2 CFG batches = 1120 saved gelu_erf dispatches per synth, on top of the 1820 bias-add dispatches saved in §3.27.

Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42, 5 invocations):

    stage             | §3.27 baseline   | §3.28 this     | delta
    ------------------|-----------------:|---------------:|-------------:
    [encoder] ms      |             30.5 |           30.8 | noise
    [cfm_total] ms    |            542.2 |    533.4 ± 1.0 | -8.8 / -1.6 %
    [hift_decode] ms  |            121.2 |          120.8 | neutral
    S3GEN_INFER_MS    |            713.2 |    706.0 ± 0.8 | -7.2 / -1.0 %
    T3_INFER_MS       |            433.4 |          431.0 | noise
    WAV md5           |         d8a1b22… |       d8a1b22… | byte-exact x5

test-metal-ops: all 4 pre-existing ops PASS. WAV md5 byte-exact to §3.26 / §3.27 baseline across all 5 invocations. Deterministic.

Why this lands positive on M3 Ultra (unlike §3.27):

§3.27's gain was eaten by the shmem-roundtrip cost: routing through temp_str + sgitg==0 scalar copy costs about what the 1820 eliminated ggml_add dispatches saved. §3.28 adds the gelu fold-in INTO the same loop, with no additional memory accesses, no barriers, and no extra shmem, just a handful of FLOPs per element. So the 1120 saved gelu_erf dispatches show up as clean net positive.

This also refines the §3.27 story: the infrastructure we built there is what makes §3.28 cheap. Fusing additional per-element tail ops into the scalar-copy loop is essentially free, whereas routing through the shmem path is what cost M3 Ultra its estimated §3.27 win.

Patch file: 995 -> 1054 lines (+59). Applies cleanly on a fresh ggml clone at pinned 58c38058 via `git apply --check`.

PROGRESS §3.28 has full writeup + bench table + kernel variant list + next-step pointers (extend to other unary sub-ops trivially; reclaim §3.27 shmem cost by patching direct-store paths).

Made-with: Cursor
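To make the "fold the tail ops into the copy loop" idea concrete, here is a CPU-side sketch comparing a separate-pass reference against a single fused pass. It is an illustration under stated assumptions, not the Metal kernel: `gelu_erf`, `unfused_reference`, and `fused_epilogue` are hypothetical names, the mul_mat result is taken as an input vector with ne0 as the fastest dimension, and `std::erf` stands in for the kernel's erf_approx helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// GELU (erf form): 0.5 * v * (1 + erf(v / sqrt(2))).
// std::erf is a stand-in for the kernel's erf_approx helper, which is not reproduced here.
static float gelu_erf(float v) {
    const float SQRT_2_INV = 0.70710678118654752440f;
    return 0.5f * v * (1.0f + std::erf(v * SQRT_2_INV));
}

// Unfused reference: a separate pass over the data for add(bias) and for gelu_erf.
// `mm` holds the mul_mat result, with ne0 (= bias.size()) as the fastest dimension.
std::vector<float> unfused_reference(const std::vector<float> & mm,
                                     const std::vector<float> & bias) {
    const std::size_t ne0 = bias.size();
    assert(ne0 > 0 && mm.size() % ne0 == 0);
    std::vector<float> t(mm.size());
    for (std::size_t i = 0; i < mm.size(); ++i) t[i] = mm[i] + bias[i % ne0];  // add(bias)
    for (std::size_t i = 0; i < t.size(); ++i)  t[i] = gelu_erf(t[i]);          // gelu_erf
    return t;
}

// Fused epilogue: one pass, with bias and gelu applied while each element is copied out.
std::vector<float> fused_epilogue(const std::vector<float> & mm,
                                  const std::vector<float> & bias,
                                  bool has_bias, bool has_gelu_erf) {
    const std::size_t ne0 = bias.size();
    assert(!has_bias || (ne0 > 0 && mm.size() % ne0 == 0));
    std::vector<float> dst(mm.size());
    for (std::size_t i = 0; i < mm.size(); ++i) {
        float v = mm[i];
        if (has_bias)     v += bias[i % ne0];  // bias broadcast over ne0
        if (has_gelu_erf) v  = gelu_erf(v);    // last stage before the store
        dst[i] = v;
    }
    return dst;
}
```

The two boolean flags play the role the function constants play in the kernel: when a flag is off, its branch contributes nothing to the loop body.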

…verted

Goal: reclaim §3.27's neutral-on-M3-Ultra result by doing the bias / residual / gelu_erf fold-in as a post-barrier read-modify-write on device memory, instead of routing through the shmem + scalar-copy path. The shmem path §3.27 ships is correct but costs a threadgroup-memory roundtrip that roughly cancels out the dispatch savings on M3 Ultra. (§3.28 got a clean 8.8 ms CFM win precisely because its gelu fold-in happened *inside* a loop that was already going to run, at zero extra memory cost.)

Implementation (reverted, left as a 21-line doc block in the kernel): after `cT.store(tC)` or the 4-simdgroup `simdgroup_store` loop writes the full 64x32 block to device memory, use a threadgroup_barrier(mem_flags::mem_device) to order the writes, then distribute the RMW across all 128 threads of the threadgroup (16 elements per thread, full block coverage regardless of how cT.store's cooperative layout partitioned the write).

Result: test-metal-ops PASSed (diag_mask_inf, pad_ext, conv_transpose_1d × 3 + tiny) but end-to-end chatterbox synth produced wrong output:

    metric       | §3.28 baseline                   | §3.29 attempt
    -------------|----------------------------------|----------------------------------
    md5          | d8a1b22375dbcb2259c686426a7d76c5 | 06ee1aaaa94a10d70eec2835d3da7dbf
    T3 tokens    | 84                               | 70 (EOS 14 early)
    audio_ms     | 3480                             | 2920
    determinism  | stable across 5 runs             | stable (same wrong md5)

Deterministic but wrong: a systematic computation bug, not a race.

Suspected root causes (not isolated):

1. cT.store's cooperative layout is implementation-defined; threadgroup_barrier(mem_device) may not be strong enough to order its writes against subsequent device reads on A17 / M3. Would need a `fence()` or `simdgroup_fence_t` audit.
2. residual_ok's `ggml_are_same_shape(resi, mm)` check doesn't distinguish contiguous from view-with-strides; §3.27's scalar-copy path would mask a stride bug that direct-store would reveal.
3. An off-by-one or wrong-stride in the RMW's index arithmetic that test-metal-ops doesn't exercise.

What's missing: no per-shape unit test for fused `mul_mm + add(bias)` vs unfused `mul_mat + add` reference. test-metal-ops only covers diag_mask_inf, pad_ext, conv_transpose_1d. §3.27 / §3.28 happen to be byte-exact because their fold-in happens inside the scalar-copy loop, which is straightforward; §3.29's direct-store RMW has subtler data flow that would benefit from explicit coverage.

Reverted. cb_rev.wav md5 byte-exact to §3.28 baseline `d8a1b22375dbcb2259c686426a7d76c5`; T3 back to 84 tokens / 3480 ms. No code change from §3.28 beyond a 21-line documentation block in the kernel explaining what was attempted, what failed, and what the next person should try first (add the unit test, then retry with bias-only scope).

Patch 1054 → 1070 lines (+16). Applies cleanly on pinned ggml 58c38058.

PROGRESS §3.29 has the full writeup, including next-person notes pointing at §5.7 of Apple's Metal Shading Language Specification for cooperative-store barrier semantics.

Made-with: Cursor
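Since index arithmetic is one of the suspected failure points, the thread-to-element partition (128 threads × 16 elements covering a 64x32 block) is worth checking in isolation. The following is a serial CPU sketch under loud assumptions: the strided element mapping, the column-major dst layout with leading dimension ne0, and the name `rmw_bias_block` are all hypothetical and are not taken from the reverted kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int BLOCK_ROWS = 64;   // dst rows per block (along ne0)
constexpr int BLOCK_COLS = 32;   // dst columns per block (along ne1)
constexpr int N_THREADS  = 128;  // threads per threadgroup
constexpr int PER_THREAD = BLOCK_ROWS * BLOCK_COLS / N_THREADS;
static_assert(PER_THREAD == 16, "2048 block elements over 128 threads");

// Serial stand-in for the post-barrier pass: each simulated thread adds bias to
// PER_THREAD elements of a block that has already been stored to `dst`.
// Assumed layout (hypothetical): dst is column-major with leading dimension ne0,
// and the block starts at row r0, column c0.
// Returns how many times each block element was touched, for coverage checks.
std::vector<int> rmw_bias_block(std::vector<float> & dst, const std::vector<float> & bias,
                                int ne0, int r0, int c0) {
    std::vector<int> touched(BLOCK_ROWS * BLOCK_COLS, 0);
    for (int tid = 0; tid < N_THREADS; ++tid) {
        for (int k = 0; k < PER_THREAD; ++k) {
            const int e       = tid + k * N_THREADS;  // strided: adjacent threads, adjacent elements
            const int row_off = e % BLOCK_ROWS;
            const int col_off = e / BLOCK_ROWS;
            const std::size_t idx =
                static_cast<std::size_t>(c0 + col_off) * ne0 + static_cast<std::size_t>(r0 + row_off);
            assert(idx < dst.size());
            dst[idx] += bias[r0 + row_off];  // bias is indexed by the dst row only
            touched[e] += 1;
        }
    }
    return touched;
}
```

A check of this kind only validates the partition and the bias indexing; it says nothing about whether the barrier orders the cooperative store, which is the first suspected cause listed above.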

…irect-store retry

Closes two §3.29 loose ends:

1. New harness in src/test_metal_ops.cpp: test_mul_mm_fused(cpu, gpu, K, N, T, B, fuse_mode, label). Builds the small graph add(mul_mat(W_q4_0, X_f32), bias) (and optionally + gelu_erf), dispatches on both backends, compares element-wise. On Metal, ggml-metal's fusion detector collapses this into a single kernel_mul_mm_..._bias=1_res=X_gelu=Y dispatch; CPU is always the unfused triple. Any numerical drift beyond 2e-2 absolute (4x the Q4_0-dequant-order CPU-vs-GPU noise floor observed on K=256..1024 shapes) flags a real bug.

   This is the test §3.29 should have had; it would have caught the reverted direct-store RMW's wrong-output regression in seconds instead of forcing a full end-to-end chatterbox bench to surface it.

2. Bias-only direct-store retry: full-block writes with has_bias && !has_residual && !has_gelu_erf now take the direct-store path (cT.store / simdgroup_store → device memory), followed by a threadgroup_barrier and a 128-thread post-scan adding bias[r0 + row_off] to each of the 2048 block elements.

   Residual / gelu fold-ins still route through shmem. §3.29's negative finding on those paths stands, root cause unresolved (suspect cT.store cooperative-layout × mem_flags::mem_device barrier semantics on A17/M3; deeper audit required). This is the minimum-scope slice of §3.29 that the new harness proves byte-stable.

Harness coverage: 8 fused-mul_mm shape variants matching the CFM hot path:

    [mul_mm_fused cfm-attn-qkv]           OK  K=256   N=256   T=87  B=2  bias
    [mul_mm_fused cfm-attn-out]           OK  K=256   N=512   T=87  B=2  bias
    [mul_mm_fused cfm-ff-gate-bias]       OK  K=256   N=1024  T=87  B=2  bias
    [mul_mm_fused cfm-ff-gate-bias+gelu]  OK  K=256   N=1024  T=87  B=2  gelu
    [mul_mm_fused cfm-ff-down]            OK  K=1024  N=256   T=87  B=2  bias
    [mul_mm_fused cfm-b1]                 OK  K=256   N=512   T=87  B=1  bias
    [mul_mm_fused bco-bias]               OK  K=256   N=320   T=87  B=2  bias (bco=1 shmem path)
    [mul_mm_fused bco-gelu]               OK  K=256   N=320   T=87  B=2  gelu (bco=1 shmem path)

All existing test-metal-ops cases (diag_mask_inf, pad_ext, conv_transpose_1d × 3 + tiny) still pass.

Bias-only direct-store bench impact on M3 Ultra (5 invocations, Q4_0 + HiFT F16, ES prompt, seed 42):

    metric             | §3.28            | §3.30            | delta
    -------------------|-----------------:|-----------------:|--------:
    [cfm_total] ms     |      533.4 ± 1.0 |      534.0 ± 0.9 | noise
    S3GEN_INFER_MS     |      706.0 ± 0.8 |      706.2 ± 3.2 | noise
    [hift_decode] ms   |            121.2 |            121.8 | noise

Neutral on M3 Ultra (same as §3.27) because in chatterbox's basic_tfm every mul_mat+bias has a follow-up op (residual or gelu) that forces the 3-op fusion through the shmem path. The 2-op path §3.30 optimises only fires for tensors outside basic_tfm (time_mlp / final_proj / resnet t_mlp) that don't contribute measurable wall time.

WAV md5 byte-exact vs §3.28 baseline across all 5 runs: d8a1b22375dbcb2259c686426a7d76c5. T3 84 tokens, audio 3480 ms.

The harness is the real deliverable. Any future attempt at the residual / gelu direct-store paths now has a way to get fast feedback on whether a change is correct before a full chatterbox run.

Patch regenerated: 1070 -> 1088 lines (+18). Applies cleanly on fresh ggml at pinned 58c38058.

PROGRESS §3.30 has the full writeup including the three remaining deferred items: residual direct-store (needs barrier audit), gelu direct-store (same), and extending fusion to other unary sub-ops (SILU/GELU/RELU/GELU_QUICK). None of those are on the chatterbox hot path, so they are deferred.

Made-with: Cursor
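The harness's pass criterion (element-wise comparison against the CPU reference with a 2e-2 absolute bound) can be sketched as below. This is not the code in src/test_metal_ops.cpp: `CompareResult` and `compare_outputs` are hypothetical names, and the explicit non-finite check is an addition of this sketch rather than something the commit message describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct CompareResult {
    float max_abs_diff;  // largest |cpu - gpu| over all finite pairs
    bool  ok;            // true when every value is finite and max_abs_diff <= tol
};

// Element-wise comparison of a fused GPU result against the unfused CPU reference.
// `tol` defaults to the 2e-2 absolute bound quoted in the commit message.
CompareResult compare_outputs(const std::vector<float> & cpu,
                              const std::vector<float> & gpu,
                              float tol = 2e-2f) {
    assert(cpu.size() == gpu.size());
    float max_abs = 0.0f;
    bool  finite  = true;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        // Track non-finite values separately: std::max would silently ignore a NaN difference.
        if (!std::isfinite(cpu[i]) || !std::isfinite(gpu[i])) {
            finite = false;
            continue;
        }
        max_abs = std::max(max_abs, std::fabs(cpu[i] - gpu[i]));
    }
    return { max_abs, finite && max_abs <= tol };
}
```

An absolute bound suits this case because the fused and unfused outputs share the same scale; a relative bound would be an alternative if shapes with much larger magnitudes were added.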

…ion.sh

Closes the validation gap for §3.24 / §3.26 / §3.27 / §3.28 / §3.30. All five are predicted positive on bandwidth-limited Apple silicon (M4 / iPhone / iPad) but were measured only on M3 Ultra where per-dispatch overhead is so low that the fusion wins largely cancel out against kernel-path overhead.

Two pieces:

1. iOS-arm64 cross-build portability:

       cmake -S . -B build-ios \
         -DCMAKE_SYSTEM_NAME=iOS \
         -DCMAKE_OSX_SYSROOT=iphoneos \
         -DCMAKE_OSX_ARCHITECTURES=arm64 \
         -DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
         -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
         -DGGML_NATIVE=OFF -DGGML_BLAS=OFF -DGGML_ACCELERATE=OFF

   Both libggml-metal.a and libtts-cpp.a compile clean for iOS 14.0+ arm64 on Xcode 16 / iOS 18.5 SDK. Structural proof that §3.26's kernel_mul_mv_f32_f16{,_4,_short} + §3.27/§3.28/§3.30's kernel_mul_mm FC-gated bias / gelu_erf fold-ins are iOS-portable: none of the new kernel code uses macOS-only intrinsics. Runtime validation still needs TestFlight / device provisioning; this confirms no compile-time barrier to shipping.

2. scripts/bench-m4-validation.sh: self-contained 150-line bash harness. Pipeline on any Apple-silicon Mac:

   - scripts/setup-ggml.sh to apply Metal + OpenCL patches at the pinned ggml commit
   - build-metal Release build
   - test-metal-ops asserts all 14 gates PASS (3 base + 3 conv_transpose_1d + 8 fused-mul_mm)
   - 5 invocations of chatterbox on the Spanish-prompt baseline
   - mean [encoder] / [cfm_total] / [hift_decode] / S3Gen / T3
   - md5 determinism (all 5 runs identical) and byte-exactness check vs M3 Ultra reference d8a1b22375dbcb2259c686426a7d76c5
   - writes artifacts/bench/m4-validation.json with the full host-vs-reference comparison + Δ% per stage

   M3 Ultra reference numbers (post-§3.30, baked into the script):

       cfm=534.0 ms  s3gen=706.6 ms  t3=432.6 ms  hift=121.1 ms

   Env vars: T3_GGUF, S3GEN_GGUF, REF_WAV, RUNS, OUT_DIR.

Self-smoke on M3 Ultra (expected: tiny deltas, byte-exact PASS):

    stage              M3 Ultra (ref)   this host   Δ vs M3U
    [cfm_total] ms              534.0       533.7   -0.3 (-0.1%)
    S3GEN_INFER_MS              706.6       707.4   +0.8 (+0.1%)
    T3_INFER_MS                 432.6       434.6   +2.0 (+0.5%)
    [hift_decode] ms            121.1       123.1   +2.0 (+1.7%)

    determinism: PASS (md5 d8a1b22375dbcb2259c686426a7d76c5 x5)
    byte-exact vs M3 Ultra: PASS

All deltas within per-invocation stdev; script is ready to scp + run on any M4 / M3 / M2 box. If M4 shows noticeably smaller CFM than M3U (after accounting for M4's higher single-core clock), §3.27/§3.28/§3.30's predicted-positive story is vindicated. If M4 matches M3U or regresses, those sections should be revisited.

What I can't do from this host:

- Run on an M4 Air or iOS device (no ssh access to the user's M4 laptop, no iOS device provisioning on this box). Tailscale shows the user's MacBook as reachable but port 22 timed out (sshd not enabled / firewalled). The script is the best hand-off.

PROGRESS §3.31 documents the iOS build procedure + harness usage + expected M4 output shape.

Made-with: Cursor
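The per-stage comparison and the determinism gate reduce to a little arithmetic, sketched here in C++ for consistency with the rest of the repository. The actual harness is a bash script whose contents are not shown in this commit message, so `StageDelta`, `stage_delta`, and `all_runs_match` are hypothetical names illustrating the described checks, not a transcription of the script.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct StageDelta {
    double abs_ms;  // host mean minus reference mean, in milliseconds
    double pct;     // the same difference as a percentage of the reference
};

// Delta of a stage's mean time on this host against the baked-in reference.
StageDelta stage_delta(double ref_ms, double host_ms) {
    assert(ref_ms > 0.0);
    const double d = host_ms - ref_ms;
    return { d, 100.0 * d / ref_ms };
}

// Determinism plus byte-exactness in one check: every run must produce the
// reference md5. An empty run list is treated as a failure.
bool all_runs_match(const std::vector<std::string> & md5s, const std::string & reference) {
    if (md5s.empty()) return false;
    for (const std::string & m : md5s) {
        if (m != reference) return false;
    }
    return true;
}
```

For the self-smoke row `[hift_decode]`, `stage_delta(121.1, 123.1)` gives +2.0 ms, about +1.65 %, which rounds to the +1.7 % shown in the table.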

Wraps the Apr 30 – May 1 Metal optimisation pass on chatterbox.cpp multilingual into a single closeout document.

SUMMARY-3.24-3.31.md (new, 120 lines):

- Per-section commit log with M3 Ultra deltas and GGUF-size deltas
- Parity guarantees (WAV byte-exact d8a1b22375dbcb2259c686426a7d76c5, 14/14 test-metal-ops gates, 8 model pairs, streaming, long-text, patch portability, iOS-arm64 cross-build)
- Open follow-ups with effort + expected gain + status
- Final bench of the shipping config (Q4_0 + HiFT F16 v2): CFM 534.0±1.3 ms, S3Gen 706.6±4.5 ms, T3 432.6±2.2 ms, inference ~1165 ms, RTF 0.33
- Reproduction commands

README.md "Performance" section gets a new subsection "Multilingual (M3 Ultra, post §3.24–§3.31 Metal kernel portfolio)" showing the 5-run numbers next to the existing §3.21 N=10 row:

    stage         §3.21  →  §3.28    delta
    T3 infer      482 ms →  433 ms   -49 ms / -10.2 %
    S3Gen infer   730 ms →  706 ms   -24 ms / -3.3 %
    RTF           0.35   →  0.33

Net across 8 commits (c47c776 → 0902381):

- 5 measurable: §3.24 (-3.6 ms HiFT), §3.26 (-7.7 MB GGUF), §3.27 (infra), §3.28 (-8.8 ms CFM), §3.30 (harness infra)
- 3 docs/negative: §3.25 (FA neg), §3.29 (direct-store neg), §3.31 (iOS portability + M4 handoff script)

M3 Ultra shipping config CFM drops 541.9 -> 534.0 ms (-1.5 %); bandwidth-limited silicon (M4 / iPhone / iPad) is predicted to see a larger win but is unmeasured. scripts/bench-m4-validation.sh is ready to confirm when a test host is available.

Made-with: Cursor
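The closeout does not define RTF. Assuming the common definition (inference wall time divided by the duration of the audio produced), the reported figure can be cross-checked against the 3480 ms audio length quoted in the §3.29 and §3.30 commit messages. `real_time_factor` is a hypothetical helper written for this check only.

```cpp
#include <cassert>

// Real-time factor under the assumed definition: time spent synthesising divided
// by the duration of the audio produced. Values below 1.0 mean faster than real time.
double real_time_factor(double inference_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return inference_ms / audio_ms;
}
```

With the shipping-config numbers, `real_time_factor(1165.0, 3480.0)` is roughly 0.335, which is consistent with the reported RTF of 0.33 under that assumption.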

…ng, decode)

Folds in five small follow-up fixes from gianni-cor's review of PR #3 (gianni-cor/chatterbox.cpp). All are byte-equivalent for the runtime audio path (md5 57cc80f27a122f03435fd05f47d1b3d2 unchanged on the ES reference prompt + jfk/gianni voice + seed 42); the changes target portability, error-message accuracy, and self-documenting code.

1. scripts/dump-t3-mtl-reference.py: replace the hardcoded /Users/gustavoefa/dev/chatterbox-ref/src sys.path entry with ${CHATTERBOX_REF_SRC} (default ../chatterbox-ref/src relative to the repo root, matching the layout the README prescribes). Anyone else regenerating the parity .npy dumps no longer hits ModuleNotFoundError.

2. src/main.cpp load_model_gguf: when chatterbox.variant is present but not GGUF_TYPE_STRING, refuse to load with a clear error instead of silently defaulting to the Turbo loader (which would later die with a misleading "missing tensor" error from absent GPT-2 names).

3. src/t3_mtl.cpp build_perceiver_attn: add a comment clarifying that the LayerNorm eps is intentionally fixed at 1e-5 (PyTorch nn.LayerNorm default) and is NOT hp.eps (the Llama backbone's RMSNorm eps), so a future reader doesn't "fix" it and break parity.

4. scripts/convert-t3-mtl-to-gguf.py: rename SUPPORTED_LANGUAGES to ALL_KNOWN_LANGUAGES with a comment explaining the C++/Python contract divergence (the Python tokenizer accepts 23 codes; the C++ tokenizer hard-errors on 5 of them at runtime: ja/he/ru/zh/hi need pykakasi / dicta / russian_text_stresser / Cangjie preprocessing). Updates the GGUF metadata writer to use the new name.

5. src/mtl_tokenizer.cpp decode: change the leading-space guard from `k > 0` to `!out.empty()` so that when the first id(s) are skipped special markers ([START], [PAD], ...), the first emitted regular token doesn't get a spurious leading space. Brings the debug-only decoder in line with Python's reference behaviour.

Co-authored-by: Cursor <cursoragent@cursor.com>
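The decode fix in item 5 is easiest to see side by side. The sketch below is a simplified stand-in, not the code in src/mtl_tokenizer.cpp: `Token`, `decode_index_guard`, and `decode_output_guard` are hypothetical, and joining regular tokens with a single space is an assumption made only to show how the two guards differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record: the decoded text plus a flag for special markers
// such as [START] or [PAD], which the decoder skips.
struct Token {
    std::string text;
    bool        special;
};

// Old guard: keyed on the loop index. If token 0 is a skipped special marker,
// the first regular token arrives with k > 0 and gets a leading space.
std::string decode_index_guard(const std::vector<Token> & toks) {
    std::string out;
    for (std::size_t k = 0; k < toks.size(); ++k) {
        if (toks[k].special) continue;
        if (k > 0) out += ' ';
        out += toks[k].text;
    }
    return out;
}

// New guard: keyed on what has actually been emitted so far, so skipped
// markers at the front cannot introduce a leading space.
std::string decode_output_guard(const std::vector<Token> & toks) {
    std::string out;
    for (std::size_t k = 0; k < toks.size(); ++k) {
        if (toks[k].special) continue;
        if (!out.empty()) out += ' ';
        out += toks[k].text;
    }
    return out;
}
```

For the input `[START]`, `hola`, `mundo`, the index-based guard yields `" hola mundo"` while the output-based guard yields `"hola mundo"`; the two agree whenever the first token is a regular one.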

feat: Add multilingual variant

reichert-dev added 19 commits April 28, 2026 13:21

GustavoA1604 changed the base branch from multilingual_merged to multilingual_merged_metal May 4, 2026 16:19

GustavoA1604 changed the base branch from multilingual_merged_metal to multilingual_merged May 4, 2026 16:36

GustavoA1604 merged commit ee47d5e into GustavoA1604:multilingual_merged May 4, 2026

GustavoA1604 pushed a commit that referenced this pull request May 4, 2026

Merge pull request #3 from GustavoA1604/multilingual_merged

ddca05f

feat: Add multilingual variant

Zbig9000 mentioned this pull request May 5, 2026

QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU #6

Merged

GustavoA1604 merged 19 commits into GustavoA1604:multilingual_merged from mario-rei:multilingual_merged


3 participants
