Metal optimisation #3

Merged
GustavoA1604 merged 19 commits into
GustavoA1604:multilingual_merged from
mario-rei:multilingual_merged
May 4, 2026
@mario-rei mario-rei commented Apr 28, 2026

CFG cond+uncond batched into one Metal forward (B=2) on the multilingual T3, plus --cfm-steps N CLI knob for the standard 10-step CFM, plus ggml_swiglu_split on the MLP.

Results

M3 Ultra Metal, Spanish prompt, seed 42, greedy, 3-warm-run avg.

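| Variant | T3 before | T3 after | S3Gen before | S3Gen after | Total before | Total after | RTF before | RTF after |
|---------|----------:|---------:|-------------:|------------:|-------------:|------------:|-----------:|----------:|
| Q4_0 | 872 ms | 478 ms (−45%) | 730 ms | 576 ms (−22%) | 1612 ms | 1054 ms (−35%) | 0.46 | 0.30 |
| F16 | 1099 ms | 579 ms (−47%) | 752 ms | 586 ms (−22%) | 1943 ms | 1165 ms (−40%) | 0.53 | 0.32 |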

What worked

  • CFG B=2 batching on T3: biggest win (−42 to −45% T3 from this change alone). Mirrors use_b2 from S3Gen; cond+uncond pack into ne[3]=2 and share a unified KV buffer. CPU keeps the two-call fallback. WAV byte-exact vs baseline.
  • --cfm-steps 7: −22% S3Gen at log-mel cosine 0.995 vs N=10. Default unchanged (10).
  • ggml_swiglu_split: a single fused kernel_swiglu_f32 replaces the separate silu and mul element-wise ops. Marginal on M3 Ultra (Q4_0 −4%) but byte-exact and future-proof.

What didn't work

  • F16 KV cache: reverted as neutral on M3 Ultra (within noise, byte-exact audio). Strong evidence that flash_attn_ext already runs its inner matmul at F16 regardless of K/V storage dtype. Left as a one-line opt-in (GGML_TYPE_F32 → GGML_TYPE_F16) for memory-bound backends (e.g. M4).
  • Custom Metal kernel patches for RMS_NORM+MUL / SwiGLU: already covered by upstream's can_fuse(...). The F16 mul_mat + add(bias) stretch goal was skipped because the Llama Q/K/V/O projections have no bias.
  • --cfm-steps 6: sits right on the cos ≥ 0.99 threshold (PCM cos drops to 0.88, attack reconstruction drifts). Opt-in only.

The §3.19 multilingual T3 ran CFG as two sequential
run_step_pass/run_prompt_pass calls per token, each rebuilding and
computing a 30-layer Llama graph with a separate memory_k_uncond /
memory_v_uncond KV cache. On Metal this doubled the per-step kernel
dispatch and weight-read overhead, exactly the cost that `use_b2`
already removed for S3Gen's CFM (chatterbox_tts.cpp:1994 / §3.19).

This commit mirrors that on T3 by packing cond + uncond into the batch
dim (ne[3]=2) for inputs_embeds, pos_ids, kq_mask, and the per-layer
Q/K/V activations. RoPE + flash_attn_ext both broadcast head/seq dims
over batch out of the box, so build_llama_block only grew an int B
parameter and a size_t b_offset_elems (one cache slab offset for the
legacy B=1 CPU fallback).

KV layout rework: the two parallel 1-D F32 KV buffers (memory_k +
memory_k_uncond) are now a single contiguous 2 * kv_layer_elems buffer
per layer, cond at offset 0, uncond at offset kv_layer_elems. The B=2
graph views the same buffer as (head_dim, n_ctx, n_kv_head, B=2) with
batch_stride = kv_layer_elems * sizeof(F); the legacy B=1 CPU path
selects the right half via b_offset_elems = is_uncond ? kv_layer_elems
: 0. Total backend allocation is unchanged (still 2x kv_elements per
cache); we just dropped two ggml_new_tensor_1d calls.
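
The offset arithmetic behind the unified buffer can be sketched as follows. This is a toy model, not the code in t3_mtl.cpp: the dimensions are made-up small values, the element size assumes F32, and `flat_index` is a hypothetical helper that assumes a fully contiguous view with head_dim as the fastest-varying axis.

```python
# Toy model of the unified per-layer KV buffer (cond half, then uncond half).
# All sizes are illustrative assumptions, not the model's real hyperparameters.
HEAD_DIM, N_CTX, N_KV_HEAD = 4, 8, 2
ELEM_SIZE = 4  # bytes per F32 element (assumed storage type)

kv_layer_elems = HEAD_DIM * N_CTX * N_KV_HEAD


def b_offset_elems(is_uncond: bool) -> int:
    """Element offset of the half used by the legacy B=1 path."""
    return kv_layer_elems if is_uncond else 0


def flat_index(d: int, t: int, h: int, b: int) -> int:
    """Flat element index in a contiguous (head_dim, n_ctx, n_kv_head, B=2) view."""
    return d + HEAD_DIM * (t + N_CTX * (h + N_KV_HEAD * b))


batch_stride_bytes = kv_layer_elems * ELEM_SIZE

# Batch 0 (cond) starts at element 0; batch 1 (uncond) starts exactly where
# the B=1 path's uncond offset points, so both paths address the same memory.
print(flat_index(0, 0, 0, 0), flat_index(0, 0, 0, 1), batch_stride_bytes)
```

The point of the layout is that the B=2 view and the two B=1 slabs alias the same bytes, so neither path needs a copy when switching between them.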

eval_step_mtl / eval_prompt_mtl dispatch the B=2 path when
!ggml_backend_is_cpu(model.backend) -- mirrors use_b2 in S3Gen
exactly. CPU keeps the two-call path (per-op B=2 work doubles without
saving ops on ggml-cpu).

Bench (M3 Ultra Metal, §3.19 reference Spanish prompt, seed 42, greedy):
- Q4_0 T3: 872 ms -> 502 ms (-42%); RTF 0.46 -> 0.35
- F16  T3: 1099 ms -> 602 ms (-45%); RTF 0.53 -> 0.37

Parity gates passed:
- Greedy decode token parity at --temp 0 --top-k 1: first 100 tokens
  identical to the two-call baseline on seed 42.
- End-to-end WAV byte-exact match vs the §3.19 reference run on Q4_0
  and F16 (cmp baseline.wav phase1.wav -> identical).
- CPU fallback (--n-gpu-layers 0) still produces audio with the B=1
  path.

The KV strides in build_llama_block are routed through
ggml_type_size(memory_k->type), so flipping GGML_TYPE_F32 ->
GGML_TYPE_F16 in load_model_gguf_mtl is a one-line opt-in for memory-
bound backends. On M3 Ultra it's neutral (Metal flash_attn_ext was
already running its inner matmul at F16 regardless of K/V storage
dtype); kept at F32 to preserve the §3.19 numerics envelope.

Made-with: Cursor
The s3gen_synthesize_opts.cfm_steps field already existed (and was
honoured by the inner CFM loop in chatterbox_tts.cpp:1973), but the CLI
only surfaced it for streaming chunks via --stream-cfm-steps.
Non-streaming MTL was therefore locked at the GGUF's n_timesteps=10.
Add a top-level --cfm-steps N flag and route it into the non-streaming
s3gen_synthesize_opts setup sites; the three setup sites are:

  - tokens-file short-circuit (no T3, just S3Gen + HiFT)
  - regular non-streaming synthesis path
  - the --input-file live-input streaming path keeps using
    --stream-cfm-steps; --cfm-steps is documented as a non-streaming
    knob there.

Default stays at 0 (use the GGUF's n_timesteps), so existing scripts
keep their behaviour. Quality knee on the §3.19 reference Spanish
prompt (M3 Ultra Metal Q4_0; log-mel cosine vs N=10):

  N=6   S3Gen 518 ms   log-mel-cos 0.990   PCM-cos 0.88  (borderline)
  N=7   S3Gen 571 ms   log-mel-cos 0.995   PCM-cos 0.94  ← recommended
  N=8   S3Gen 629 ms   log-mel-cos 0.997   PCM-cos 0.97
  N=10  S3Gen 730 ms   log-mel-cos 1.000   PCM-cos 1.00  (default)

N=7 cleanly clears the cos >= 0.99 gate; N=6 sits right on it (PCM
cosine drops to 0.88 -- phase-coherent attack reconstruction starts
to drift).
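
A minimal sketch of how such a cosine gate could be evaluated. The `cosine` helper and the `margin` parameter are illustrative assumptions, not part of the repository's tooling; the per-N values are copied from the table above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Log-mel cosine vs N=10, copied from the table above.
LOG_MEL_COS = {6: 0.990, 7: 0.995, 8: 0.997, 10: 1.000}


def smallest_n_clearing(gate, margin=0.0):
    """Smallest step count whose cosine is at least gate + margin.

    `margin` is a hypothetical safety margin, not something the CLI exposes.
    """
    passing = [n for n, c in sorted(LOG_MEL_COS.items()) if c >= gate + margin]
    return passing[0] if passing else None


print(smallest_n_clearing(0.99))         # 6: sits exactly on the gate
print(smallest_n_clearing(0.99, 0.002))  # 7: clears it with some headroom
```

With no margin N=6 technically passes, which is why the PCM cosine is worth reading alongside the log-mel figure before choosing a step count.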

Made-with: Cursor
build_llama_block did `silu(gate) * up` as three separate ggml ops:
ggml_silu(...), ggml_mul_mat(mlp_up, ...), ggml_mul(silu_out, up_out).
That's a silu + mul element-wise pair on top of the two mul_mats, at
30 dispatches/token across layers.

Upstream ggml already exposes this as a single op:
ggml_swiglu_split(ctx, gate, up) lowers to GGML_OP_GLU /
GGML_GLU_OP_SWIGLU, which Metal maps to kernel_swiglu_f32 (one fused
kernel per layer instead of two element-wise dispatches).
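
As a reference for the math only, here is a small sketch of SwiGLU written both ways. It models the element-wise part in plain Python and says nothing about how ggml or Metal implement the kernels; the function names are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(gate, up):
    """Two element-wise passes: silu over gate, then multiply by up."""
    silu_out = [silu(g) for g in gate]
    return [s * u for s, u in zip(silu_out, up)]


def swiglu_fused(gate, up):
    """One pass computing silu(gate) * up per element."""
    return [silu(g) * u for g, u in zip(gate, up)]


gate = [-2.0, -0.5, 0.0, 0.5, 2.0]
up = [1.0, 2.0, 3.0, 4.0, 5.0]
print(swiglu_fused(gate, up) == swiglu_unfused(gate, up))
```

Both variants perform the same floating-point operations in the same order per element, which is consistent with a fused kernel being able to match the manual form exactly.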

The pre-norm `mul(rms_norm(x), g)` pattern was already auto-fused by
ggml-metal's can_fuse(RMS_NORM, MUL) path (kernel_rms_norm_mul_f32);
left as-is so CPU + non-Metal backends get the same op shape.

Bench (M3 Ultra Metal, vs the B=2 CFG baseline from the previous
commit, 3 warm runs averaged):
- Q4_0 T3: 502 ms -> 482 ms (-4%, within plan's 5% gate but trend
  is positive across all 3 runs)
- F16  T3: 602 ms -> 635 ms (+5%, within run-to-run variance)

Marginal on M3 Ultra (the SwiGLU kernel was not on the critical path
once Phase 1 collapsed the per-step weight-read + dispatch cost).
Kept for code clarity (single fused op vs three) + because future
ggml-metal kernel improvements to swiglu will land here automatically.
WAV byte-exact vs the previous commit (kernel_swiglu_f32 is bit-
equivalent to manual `silu(gate) * up`).

Made-with: Cursor
README:
- Multilingual benchmark table at the top grew M3 Ultra rows
  (`--cfm-steps 7` and default N=10, Q4_0 + F16) alongside the
  existing M4 rows. Best M3 Ultra config now hits RTF 0.30 / **48.4×
  faster than ONNX Runtime** on the same prompt; M4 numbers retained
  for §3.19/§3.20 continuity.
- New "Multilingual (Mac Studio M3 Ultra, after §3.21 optimisation
  pass)" stage block under ## Performance, mirroring the M3 Ultra
  Turbo block: T3 / S3Gen / RTF for {Q4_0, F16} × {N=10, N=7}.
- The MTL `tts-cli` example mentions `--cfm-steps`; flag points
  readers at PROGRESS.md §3.21 for the quality knee numbers.

PROGRESS.md:
- Append §3.21 mirroring the §3.19/§3.20 layout (pre-rationale ->
  what shipped -> bench tables -> what didn't work -> backlog).
  Documents the bench matrix on M3 Ultra Metal {Q4_0, F16} ×
  {baseline, +Phase 1, +Phase 2, +Phase 4, +final} with stage
  breakdowns and audio-quality gates.
- "What didn't work -- NEGATIVE results" subsection explicitly
  documents Phase 2 (F16 KV) as a wash on M3 Ultra (with the
  observation that ggml-metal's flash_attn_ext was already running
  its inner matmul at F16 regardless of K/V storage dtype) and the
  Phase 4-stretch Metal-patch fusions as unnecessary because upstream
  already covers them.
- "What's next for MTL" backlog from §3.19 is updated with strikes
  on items now done (T3 quantisation, --cfm-steps); items remaining
  (heterogeneous-core CPU thread default, ja/he/ru/zh/hi tokenizer,
  speculative decoding, M4 F16 KV measurement) listed at the end.

Made-with: Cursor
…p scaffolding

Three small allocator-side cleanups on top of §3.21 — all byte-exact
WAV output (MD5 match against multilingual_merged HEAD), all neutral-
to-slightly-positive on M3 Ultra (the savings are too small to escape
run-to-run noise here, ~1% on S3Gen wall time, but the changes remove
unambiguously wasted work that matters more on slower CPUs / older
Metal where the topology walks and 64 MB memset are pricier).

1. `t3_mtl.cpp`: drop the explicit `ggml_gallocr_reserve` before
   `ggml_gallocr_alloc_graph` in the four `run_*_pass[_b2]` paths.
   `alloc_graph` already calls `ggml_gallocr_needs_realloc` and only
   triggers a re-reservation when the graph's per-node sizes grew —
   the per-step graph keeps the same node count + same shapes for
   every n_past >= 1, so 83 of the 84 step-pass reserves were doing a
   full O(n_nodes) topology walk for nothing. Per-step graph build
   itself is unchanged.

2. `chatterbox_tts.cpp` `run_hift_decode`: turn the 64 MB scratch
   `std::vector<uint8_t> buf(buf_size)` into a `thread_local`. The
   previous version forced a 64 MB memset on every HiFT call (one per
   `--out` invocation in batch mode, one per chunk in streaming).
   ggml_init resets the arena pointer, so the buffer is reused safely
   across calls without leaking tensor metadata.

3. `chatterbox_tts.cpp` `compute_time_mlp`: hoist the per-step
   ggml_init / build-graph / gallocr_new / gallocr_reserve /
   gallocr_free sequence into a `thread_local time_mlp_cache`. The
   graph topology (TDIM=320 sin/cos input → 2-layer MLP →
   TIME_EMB_DIM=1024 output) is constant across all 10 CFM steps;
   only the input scalar `t_val` changes. Now we build + reserve once,
   then per-step we just `alloc_graph` + `tensor_set` + `compute` +
   `tensor_get`. Cache key is `(backend)` so a backend swap rebuilds.
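
The build-once, reuse-per-step pattern from item 3 can be sketched like this. It is a simplified model with hypothetical names (`TimeMlpCache`, `_build`); the real code uses a C++ `thread_local` and ggml allocator calls, both of which are omitted here.

```python
class TimeMlpCache:
    """Caches a built graph, keyed on the backend it was built for."""

    def __init__(self):
        self.backend = None
        self.graph = None
        self.builds = 0  # counts how often the expensive build ran

    def get(self, backend):
        # Rebuild only on first use or when the backend changes.
        if self.graph is None or self.backend != backend:
            self.graph = self._build(backend)
            self.backend = backend
            self.builds += 1
        return self.graph

    @staticmethod
    def _build(backend):
        # Placeholder for "init context + build graph + reserve allocator".
        return ("time_mlp_graph", backend)


cache = TimeMlpCache()
for _step in range(10):   # ten CFM steps share one graph
    cache.get("metal")
print(cache.builds)       # 1
cache.get("cpu")          # a backend swap forces a rebuild
print(cache.builds)       # 2
```

Only the per-step input value changes between calls, so everything that depends on topology alone can live in the cached object.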

Bench (M3 Ultra Metal, Q4_0, ES prompt, seed 42, --temp 0 --top-k 1,
jfk.wav voice, 3 invocations):

| Stage      | §3.21 base | this commit | Δ       |
|------------|-----------:|------------:|--------:|
| T3 ms      |       479  |        474  |  -1%    |
| cfm_total  |       561  |        550  |  -2%    |
| hift_decode|       128  |        125  |  -2%    |
| S3Gen ms   |       730  |        728  |  -0.3%  |
| Total ms   |      1209  |       1202  |  -0.6%  |

Within run-to-run noise (~1.5% on T3) but consistently the same
direction across three back-to-back runs, and zero regression risk
because `ggml_gallocr_alloc_graph` covers the lazy-reserve case.

WAV byte-exact gate (md5sum):
- pre-commit: 79002f09bc48dda95ec0c2cfc2b895bd
- post-commit: 79002f09bc48dda95ec0c2cfc2b895bd

Made-with: Cursor
PROGRESS gets a new §3.22 documenting the three small allocator-side
cleanups shipped in 6141cf2 (drop redundant gallocr_reserve;
thread_local HiFT scratch; cached time_mlp graph). Numbers on M3
Ultra are within run-to-run noise (~1-2% per stage, ~0.6% total)
but consistently the same direction across 3 invocations and pass
the byte-exact WAV gate against §3.21 HEAD.

Section also documents the M3 Ultra "compute floor" finding: each
CFM step is ~52 ms steady-state of mat-mul work (480 nodes/step,
already amortised in one command-buffer commit), so allocator
shaving can only chip at the 20 ms first-step build overhead, not
the steady-state. Identifies F32 mul_mm + add(bias) shader fusion
as the next worthwhile attack — deferred to a future round.

Made-with: Cursor
Multilingual T3 step path on Metal previously dispatched three
separate Q4_0 mat-muls per layer for the Q / K / V projections (30
layers × 84 tokens × 3 ≈ 7560 dispatches per call inside the same
compute_graph commit). All three weights have identical Q4_0 shape
(n_embd=1024 in × n_embd=1024 out), so they can be concatenated
along the output (M) dim into one [n_embd, 3*n_embd] Q4_0 tensor and
fed into a single mat-mul whose output's first n_embd rows are Q,
middle are K, last are V.

Implementation:

- `chatterbox_model` gains `ctx_stack` + `buffer_stack` (a backend
  buffer alongside `buffer_w` and `buffer_kv`).
- `llama_layer` gains `wqkv : [n_embd, 3*n_embd]`.  Allocated only
  on non-CPU backends (CPU keeps the per-projection path; ggml-cpu's
  per-kernel overhead is already negligible and the +30 MB weight
  footprint trades unfavourably with thread-cache locality).
- At load time, after the GGUF weights buffer is allocated, the
  per-layer wq/wk/wv bytes are copied into wqkv via a host-side
  scratch buffer (Q4_0 row layout is M-major contiguous: row `r` is
  K/32 blocks of 18 bytes packed back-to-back, so concat is a flat
  byte append).
- `build_llama_block` runs ONE `ggml_mul_mat(W_qkv, cur)` and then
  carves out Q / K / V via strided `ggml_view_2d` / `_3d` views
  straight into the (HD, NH, N[, B]) shape RoPE expects.  No
  ggml_reshape (would need contiguous source) and no ggml_cont
  (would defeat the saving).  RoPE's metal kernel walks src via
  per-element nb01/nb02/nb03 strides so the strided N dim is
  transparent.
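
The equivalence the stacked weight relies on can be checked with a small sketch. It uses tiny integer matrices rather than Q4_0 data, and `matvec` and `q4_0_row_bytes` are hypothetical helpers; the block geometry (32 elements per 18-byte block) is the one quoted above.

```python
def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


n_embd = 4  # illustrative; the text above uses n_embd=1024
wq = [[r + c for c in range(n_embd)] for r in range(n_embd)]
wk = [[r * c for c in range(n_embd)] for r in range(n_embd)]
wv = [[r - c for c in range(n_embd)] for r in range(n_embd)]
x = [1, 2, 3, 4]

# Concatenate along the output dim: 3 * n_embd rows, same input width.
w_qkv = wq + wk + wv
qkv = matvec(w_qkv, x)
q, k, v = qkv[:n_embd], qkv[n_embd:2 * n_embd], qkv[2 * n_embd:]


def q4_0_row_bytes(k_in):
    """Bytes per Q4_0 row: k_in / 32 blocks of 18 bytes each."""
    return k_in // 32 * 18


print(q == matvec(wq, x), k == matvec(wk, x), v == matvec(wv, x))
print(q4_0_row_bytes(1024))    # 576
print(30 * 84 * 3, 30 * 84)    # 7560 2520
```

Because each output row depends only on its own weight row, slicing the stacked result is exact, and the projection dispatch count drops from 7560 to 2520 for the 30-layer, 84-token case quoted above.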

Process-wide `t3_stack_registry` + atexit hook in t3_mtl.cpp frees
buffer_stack before Metal's static device destructors run; without
this the new buffer triggers Metal's `[rsets->data count] == 0`
assert at process exit (residency sets still referenced through an
orphan backend buffer).  Mirrors the existing `s3gen_model_cache_release`
atexit pattern.  `free_t3()` in main calls `t3_stack_unregister()`
on the error-path early-returns so we don't double-free.

Why gate/up isn't stacked too: the multilingual T3 converter ships
`mlp_gate` as F16 and `mlp_up` as Q4_0 (verified via gguf reader on
models/chatterbox-t3-mtl-q4_0.gguf — gate.weight type=F16, up.weight
type=Q4_0 for every layer).  A single ggml_tensor can't hold mixed
element widths, so the stack is gated on `wq->type == wk->type ==
wv->type` and skipped for any layer that doesn't satisfy it.

Bench (M3 Ultra, Metal, ES prompt + jfk.wav voice, seed 42, --temp 0
--top-k 1, mean of 5 invocations):

| Variant | T3 §3.22 base | T3 this commit | Δ        |
|---------|--------------:|---------------:|---------:|
| Q4_0    |        474 ms |     **433 ms** | **-8.7%** |
| F16     |        522 ms |     **493 ms** | **-5.5%** |

Per-stage breakdown for Q4_0 (one run):
  encoder    30 ms     (unchanged)
  cfm_total 549 ms     (unchanged)
  hift      125 ms     (unchanged)
  T3        433 ms     (-46 ms vs baseline 479 ms)
  Total    1153 ms     (vs 1209 ms baseline, -4.6%)

WAV byte-exact gate: md5 `79002f09bc48dda95ec0c2cfc2b895bd` matches
across §3.22 base and post-commit at five separate invocations
(temp=0, top-k=1, deterministic).

Made-with: Cursor
PROGRESS gets a new §3.23 documenting the Phase-15 stacked W_qkv
optimisation shipped in 1f43ecc. Key contents:

- Implementation: Q/K/V rows concatenated row-wise into one Q4_0
  tensor at load time; build_llama_block runs ONE mat-mul + strided
  view-split into the (HD, NH, N[, B]) layout RoPE expects, no
  reshape (would need contiguous source) and no cont (would defeat
  the saving).
- Why gate/up isn't stacked: the multilingual T3 ships mlp_gate as
  F16 and mlp_up as Q4_0, so a single ggml_tensor can't hold them.
- Why CFM transformer Q/K/V isn't stacked: documented as a negative
  result with the GPU-occupancy explanation. CFM has M=512 / N=174
  which already saturates M3 Ultra's 60 cores in one wave; the
  stacked M=1536 / N=174 path needs three waves where one was
  enough. T3 wins because its step graph has M=1024 / N=1, which
  was at ~25% GPU occupancy un-stacked.

Bench (M3 Ultra, Q4_0): T3 474 → 433 ms (-8.7%), Total 1192 → 1153
ms (-3.3%), WAV byte-exact md5
79002f09bc48dda95ec0c2cfc2b895bd. Cumulative since §3.20 baseline:
T3 872 → 433 ms (-50%), RTF 0.46 → 0.29.

Made-with: Cursor
…f.py

Multilingual S3Gen ships HiFT as 246 F32 tensors (~80 MB) because
the converter and `requantize-gguf.py` both wholesale-rejected 3-D
shapes — `len(shape) != 2` always returned False in
`should_quantize`.  This commit unblocks the rejection on both
fronts:

1. **`should_quantize` now allows 3-D when ne[0] (the conv kernel
    size K) is a multiple of the quant block size.**  For Q4_0 /
    Q5_0 / Q8_0 (block 32) the HiFT stack still gates out
    universally — K ∈ {3, 7, 11, 16}, none 32-aligned, and the
    block layout assumes blocks span 32 consecutive ne[0] values
    within a fixed (ne[1], ne[2]) row, which only works when K
    itself is 32-aligned.  The plan's predicted-positive
    `K * IC % 32 == 0` check is necessary but not sufficient
    because re-quantising with a flattened (K*IC) reduction dim
    requires storing the result with ggml shape `(K*IC, OC)` —
    i.e. 2-D on disk — which then breaks `ggml_im2col(kernel, ...)`
    on the C++ side, which derives the kernel size from
    `kernel->ne[0]`.  Documented inline; the 3-D branch is
    forward-compatible for any future converter that ships
    K-aligned conv kernels.

2. **Add `f16` as a target dtype.**  F16 has block_size = 1, so
    the alignment gate is a no-op for any shape.  Combined with
    a new `--name-filter SUBSTRING` arg that constrains the
    rewrite to a tensor-name substring, lets us downcast HiFT
    conv kernels F32 → F16 without disturbing the existing Q4_0
    CFM linears.

3. **Pass-through path for already-quantised sources.**  A Q4_0
    source GGUF was previously a hard error in the pass-through
    branch (the existing code reshapes `data` to the element shape
    and Q-types have packed bytes).  Now branches on
    `GGML_QUANT_SIZES[t.tensor_type][0] == 1` (the float-types
    block) and writes the raw byte buffer through with the
    original shape for already-quantised inputs.
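
A minimal sketch of the alignment gate described in item 1, assuming shapes are given in ggml order (ne[0] first). `shape_is_quantizable` is a hypothetical stand-in and not the script's actual `should_quantize`, which also applies name-based deny-lists; the IC and OC values are made up.

```python
# Elements per block for each target type.
BLOCK_SIZE = {"q4_0": 32, "q5_0": 32, "q8_0": 32, "f16": 1}


def shape_is_quantizable(shape, target):
    """True when ne[0] is a whole number of quant blocks for a 2-D or 3-D tensor."""
    if len(shape) not in (2, 3):
        return False
    return shape[0] % BLOCK_SIZE[target] == 0


# HiFT conv kernels: K is one of these values; IC and OC are illustrative.
hift_kernels = [(k, 64, 128) for k in (3, 7, 11, 16)]

print([shape_is_quantizable(s, "q4_0") for s in hift_kernels])  # all False
print([shape_is_quantizable(s, "f16") for s in hift_kernels])   # all True
print(shape_is_quantizable((1024, 1024), "q4_0"))               # True
```

This shows why F16 is the only target that gets through for the current HiFT stack: its block size of 1 makes the check trivially true.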

Two-pass recipe (multilingual S3Gen, Metal target):

  python scripts/requantize-gguf.py \
      models/chatterbox-s3gen-mtl-f16.gguf \
      /tmp/intermediate.gguf f16 --name-filter hift/
  python scripts/requantize-gguf.py \
      /tmp/intermediate.gguf \
      models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0

Result on M3 Ultra Metal (3 invocations, ES prompt, --seed 42):
  baseline q4_0 GGUF (HiFT all F32):
    GGUF size               788.4 MB
    [hift_decode] median    124.9 ms
    [s3gen_total] median    727 ms
    WAV md5    79002f09bc48dda95ec0c2cfc2b895bd

  q4_0 + HiFT F16 GGUF (this commit's recipe; 64 of 246 HiFT
  tensors get F16, the remaining 21 source_resblocks/* + 161
  biases/scalars stay F32):
    GGUF size               754.6 MB  (-4.3 %)
    [hift_decode] median    121.3 ms  (-2.9 %)
    [s3gen_total] median    726 ms    (within noise)
    PCM cosine vs baseline  0.999851  (essentially indistinguishable)

The 21 source_resblocks/* tensors that match the existing `/s`
substring deny-list (intentional false positive from the Turbo
deny-list era) are kept F32: tried unblocking them and the
multilingual decode segfaults on
`kernel_mul_mv_f32_f16_short`-not-found in the pinned ggml-metal
build.  Documented inline next to the `/s` entry; refining the
deny-list to endswith-only is gated on either patching that
kernel variant in or reshaping those tensors to a non-mat_mv
shape.

Validation:
  - chatterbox CLI on the new GGUF produces deterministic WAV
    (md5 ec58d3e65ab8e9c6f4edefb15b169ea5 every run).
  - PCM cosine 0.999851 vs the F32-HiFT baseline, well above the
    §3.20 cos ≥ 0.99 quality gate; max abs i16 diff is
    616 / 32768 ≈ 1.9 %.
  - 3 × 3-invocation benches (baseline before vs after) confirm
    -2.9 % HiFT decode median, -4.3 % GGUF size, no T3 / CFM /
    S3Gen-total impact.

Made-with: Cursor
PROGRESS gets a new §3.24 documenting both the Q4_0 negative
finding (K-dim alignment is structurally blocked: HiFT K ∈ {3, 7,
11, 16}, none 32-aligned, and the 2-D-on-disk fix breaks
ggml_im2col on the C++ side) and the F16 alternate path that ships:
script + recipe + bench + quality gate.

Headline numbers (M3 Ultra Metal, ES prompt, 3 invocations):
  GGUF size      788.4 MB → 754.6 MB  (−4.3%)
  [hift_decode]  124.9 ms → 121.3 ms  (−2.9%)
  PCM cosine     0.999851 across runs (well above 0.99 gate)

The smaller-than-planned win has two causes:
- 21 of the 246 HiFT tensors (source_downs/*, source_resblocks/*)
  trip an existing `/s` deny-list false-positive; refining the
  deny-list segfaults at runtime (kernel_mul_mv_f32_f16_short
  isn't compiled in the pinned ggml-metal).  Tracked as follow-up.
- Activation traffic + im2col stay F32; F16 weights only cut the
  weight-load phase of mul_mat, not the whole decode.

Made-with: Cursor
Per the plan, the stretch goal was the F32 mul_mm + add(bias)
shader fusion in the metal patch.  Documented as a tracked
follow-up in §3.24 alongside the other two HiFT-related leftovers
(missing kernel variant, 2-D-on-disk Q4_0 with C++ conv1d_f32
branch) — the F16 alt-path (which actually shipped today) was
the cheaper and more immediately measurable win, so the F32 fusion
gets time-budgeted in a future session.

Made-with: Cursor
…_attn_ext

Tried flipping `src/chatterbox_tts.cpp::conformer_block()` (the 10 rel-pos
Conformer blocks that make up S3Gen's flow encoder) from the classic
`ggml_soft_max` + separate V mat-mul path to `ggml_flash_attn_ext`,
following the same pattern already used in `t3_mtl.cpp` (T3 Llama
attention) / `chatterbox_tts.cpp::basic_tfm` (CFM transformer) and just
shipped on `parakeet.cpp` §15.8 (their Conformer rel-pos MHA).

Bench (M3 Ultra, Metal, Q4_0, Spanish prompt "Hola mundo, esta es una
prueba multilingue.", seed 42, 3 invocations averaged):

  stage            | baseline | FA    | delta
  -----------------|---------:|------:|----------------:
  [encoder] ms     |     ~43  | 29.6  | -13 / -31%  (flow encoder only)
  S3Gen ms         |      721 |  708  | -13 / -1.8%
  T3 ms            |      433 |  430  | noise
  CFM total ms     |      546 |  538  | noise (-8)
  HiFT decode ms   |      126 |  125  | noise
  WAV md5          | 79002f09 | a4169d68 | differs

The speedup is real — 40 saved kernel dispatches per synth from
collapsing 10 * (softmax + permute + mul_mat with V) into 10 *
flash_attn_ext — but the WAV quality gate fails.

`ggml_flash_attn_ext` hard-requires an f16 mask
(`ggml.c:5320 GGML_ASSERT(mask->type == GGML_TYPE_F16)`). The Conformer
rel-pos bias `bd_final = mul_mat(p_perm, q_plus_v)` is f32; casting it
to f16 drifts each element by ~1e-4 (f16 has ~10 bits of mantissa,
`bd_final` values sit in the ±5 to ±10 range). That drift compounds
through:

  flow encoder (10 conformer blocks) -> CFM 10-step diffusion U-Net
                                     -> HiFT vocoder -> waveform

which is far more drift-amplifying than parakeet's joint-argmax
downstream. Measured WAV:

  lengths  base=83520  fa=83520
  samples  n=83520  cos=0.998647
  rms_diff=69.334   rms_base=1332.522
  max_abs_diff=1702.0
  gate: FAIL (threshold > 0.9998; got 0.998647)
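
To get a feel for the size of a single f32 to f16 rounding step on values in the quoted ±5 to ±10 range, the sketch below round-trips numbers through IEEE half precision using Python's `struct` format `e`. It only illustrates per-element rounding; it does not model how the error compounds through the encoder, CFM and vocoder, and the sample grid is arbitrary.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Arbitrary grid of sample values between 5 and just under 10.
values = [5.0 + 0.0137 * i for i in range(365)]

max_rel = max(abs(to_f16(v) - v) / abs(v) for v in values)
max_abs = max(abs(to_f16(v) - v) for v in values)

# Round-to-nearest with a 10-bit stored mantissa bounds the relative
# error by 2**-11 for normal numbers.
print(f"max relative error {max_rel:.2e} (bound {2 ** -11:.2e})")
print(f"max absolute error {max_abs:.2e} (bound {2 ** -8:.2e})")
```

The relative bound is fixed by the format, so the absolute error grows with the magnitude of the bias values being cast.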

Three rescue options explored and rejected:

  1. Pass bd_scaled in f32 via ggml_flash_attn_ext — blocked by the
     hard f16 assertion.
  2. Compute bd_final in f16 from the start (cast p_perm and
     q_plus_v) — pushes the same precision loss earlier in the
     graph, doesn't improve the downstream cosine.
  3. Skip the mask (pass nullptr) — mathematically wrong; bd_final
     *is* the relative-position bias that Conformer attention
     requires, dropping it breaks position-aware attention.

Reverted: conformer_block stays on the ggml_soft_max path. WAV md5
restored byte-exact to 79002f09bc48dda95ec0c2cfc2b895bd against the
§3.22 reference (verified /tmp/cb_revert.wav == /tmp/cb_base_1.wav).

Code change is additive (documentation-only): 16-line comment in
`conformer_block()` explaining why flash_attn is intentionally not
taken here, pinning the negative-finding cosine number + the speed
upside that was measured, and pointing at the parakeet §15.8
counterexample so the next person who looks at this doesn't redo
the experiment.

PROGRESS §3.25 adds the full writeup with the measurement table,
the reasoning for why parakeet could absorb this precision drift
but chatterbox can't (argmax-vs-waveform downstream sensitivity),
the three rejected rescue options, and the pointer to the two
remaining quick-win candidates that don't have this compounding
problem:

  - strip redundant `ggml_cont` after Conformer Q/K/V permutes
    (some are removable because Metal mul_mat walks strides natively)
  - F32 `mul_mm + add(bias)` shader fusion in
    patches/ggml-metal-chatterbox-ops.patch (already queued from §3.24)

Made-with: Cursor
…ants; relax /s deny; fix Q-type passthrough

Closes the open follow-up from §3.24 / §3.25 ("Patch the missing
`kernel_mul_mv_f32_f16_short` variant to unblock the remaining 21
HiFT source_* conv kernels").

Problem.  §3.24 converted 64 of HiFT's F32 conv-kernel weights to
F16 but kept the 21 `source_*` conv kernels at F32 because
requantize-gguf.py's `/s` glob swept them, and — once that glob
was narrowed — the resulting Metal build crashed:

    ggml_metal_library_compile_pipeline: Error Domain=MTLLibraryErrorDomain
    Code=5 "Function kernel_mul_mv_f32_f16_short was not found in the library"
    ... SIGSEGV at first HiFT decode (exit 139).

HiFT's `conv1d_f32` does
`ggml_mul_mat(im2col_f32, kernel_reshaped_f16)`, which the Metal
dispatcher compiles as `kernel_mul_mv_f32_f16_short` (T0=src0=
im2col=f32, T1=src1=kernel=f16, short-axis path because
`source_downs/2/weight` has OC=64).  That template instantiation
was missing from the pinned ggml-metal (commit 58c38058) — the
family shipped `f32_f32`, `f16_f32`, `f16_f16`, `bf16_f32`,
`bf16_bf16` but not `f32_f16`.

Fix: three one-liner template instantiations in ggml-metal.metal:

  // kernel_mul_mv_t_t family (full-shape mat-vec)
  template [[host_name("kernel_mul_mv_f32_f16")]] kernel mul_mv_t_t
      kernel_mul_mv_t_t<float, half>;
  // kernel_mul_mv_t_t_4 family (vec4 path)
  template [[host_name("kernel_mul_mv_f32_f16_4")]] kernel mul_mv_t_t_4
      kernel_mul_mv_t_t_4<float, float4, half, half4>;
  // kernel_mul_mv_t_t_short family (short-axis path — HiFT's actual
  // hit, since source_downs/2/weight has OC=64 which trips the
  // "small matrix-vector" fast path)
  template [[host_name("kernel_mul_mv_f32_f16_short")]] kernel mul_mv_t_t_short_t
      kernel_mul_mv_t_t_short<float, half>;

The `_impl` bodies already handle arbitrary casts via
`(float) x[i] * (float) y[i]` — only the symbol lookup was missing.

requantize-gguf.py gets three paired changes so the recipe works
end-to-end on the current gguf-0.18 writer:

  1. `/s` deny narrowed to `/scale`.  The broad `/s` glob was
     originally a rough proxy for norm-scale params but
     incidentally swept every `hift/source_*/` weight + bias
     (60+ HiFT tensors, 21 of which are 3-D conv kernels that
     §3.26 now unblocks).  With the Metal kernels shipped, those
     21 conv weights are safe to F16.  The norm-scale tensors
     the deny was targeting (`/scale`, `/ln_`, `/norm/`,
     `/gamma`) remain covered by their own stricter patterns.

  2. Q-type passthrough byte-shape fix.  gguf-0.18's
     `add_tensor_info` treats `raw_shape` as **byte** layout (inner
     dim is bytes/row, not elements/row) when tensor.dtype is
     uint8.  The previous code passed element shape verbatim,
     which crashed with
     `ValueError: Quantized tensor bytes per row (512) is not a
     multiple of Q4_0 type size (18)` on any two-pass recipe
     that re-quantised a GGUF already carrying Q-type tensors
     (e.g. `f16 → q4_0` or `q4_0 → f16 --name-filter`).  Fix:
     `byte_inner = elements_inner // block_size * type_size`
     before handing to the writer.  Floats (block_size=1) keep
     the existing element-shape path unchanged.

  3. Docstring updated with the now-correct two-pass recipe:

        python scripts/requantize-gguf.py \
            models/chatterbox-s3gen-mtl-f16.gguf \
            /tmp/intermediate.gguf f16 --name-filter hift/
        python scripts/requantize-gguf.py \
            /tmp/intermediate.gguf \
            models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0
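
The byte-shape conversion from item 2 can be illustrated in isolation. This is a sketch, not the script's code: `QUANT_SIZES` here is a small hand-written table of (block_size, type_size) pairs and `writer_inner_dim` is a hypothetical helper name.

```python
# (elements per block, bytes per block) for a few tensor types.
QUANT_SIZES = {"q4_0": (32, 18), "f16": (1, 2), "f32": (1, 4)}


def writer_inner_dim(elements_inner, dtype):
    """Inner dimension to hand to the writer for a pass-through tensor.

    Float types keep the element count; block-quantised types are
    converted to bytes per row.
    """
    block_size, type_size = QUANT_SIZES[dtype]
    if block_size == 1:
        return elements_inner
    if elements_inner % block_size != 0:
        raise ValueError("row length is not a whole number of blocks")
    return elements_inner // block_size * type_size


print(writer_inner_dim(512, "q4_0"))   # 288
print(512 % 18, 288 % 18)              # 8 0
print(writer_inner_dim(512, "f16"))    # 512
```

Passing the element count 512 unchanged is what produced the quoted error, since 512 is not a multiple of the 18-byte Q4_0 block; the converted value 288 is.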

Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42,
3x3 runs averaged):

  stage             | §3.24  | §3.26  | delta
  ------------------|-------:|-------:|-------------:
  [encoder] ms      |   31.3 |   30.5 | -0.8 (noise)
  [cfm_total] ms    |  541.9 |  550.4 | noise
  [hift_decode] ms  |  121.3 |  121.1 | neutral
  S3GEN_INFER_MS    |    709 |    724 | +15 (noise)
  T3_INFER_MS       |    440 |    440 | 0
  GGUF size         | 754.4  | 746.7  | -7.7 MB

Speed neutral on M3 Ultra unified memory (the 21 source_* tensors
are small — largest is source_resblocks/0/convs1/*/weight at
~3.4 MB F32 / ~1.7 MB F16).  Predicted +2–4 ms HiFT from §3.24
falls inside bench noise here; on bandwidth-limited targets
(M4 Air, iOS ANE-adjacent) expect the full 3–5 % HiFT speedup
seen in §3.24's existing 64 tensors.  The real win is the
7.7 MB (~1.0 %) GGUF shrink on a multilingual-distribution
package, plus closing the last known §3.24 blocker.

Parity gates:

  - test-metal-ops: all four pre-existing ops (diag_mask_inf,
    pad_ext, conv_transpose_1d x3 + tiny edge) PASS; the new
    mul_mv_f32_f16 variants are covered by end-to-end audio
    parity (their inner math is identical to the shipped
    <half, float> / <half, half> / <float, float> variants,
    they differ only in the type tags).
  - WAV parity vs §3.24 baseline on ES-prompt / jfk-voice / seed 42:

      md5 §3.24 baseline       : ec58d3e65ab8e9c6f4edefb15b169ea5
      md5 §3.26 v2 (3 runs)    : d8a1b22375dbcb2259c686426a7d76c5 x3

      lengths 83520/83520   cos 1.000000  PASS (threshold > 0.9998)
      rms_diff 0.464    rms_base 1332.66   max_abs_diff 4 (of ±32767)
      → 0.035 % relative RMS drift, 0.012 % max sample drift.
      Audibly identical.  Deterministic across 3 invocations.
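
A sketch of how the parity metrics above relate to each other. `pcm_metrics` is a hypothetical helper written for illustration, not the comparison script used for the gate; the final two lines simply redo the percentage arithmetic from the reported numbers.

```python
import math


def pcm_metrics(base, test):
    """Return (cosine, rms_diff, rms_base, max_abs_diff) for two PCM sequences."""
    n = len(base)
    dot = sum(b * t for b, t in zip(base, test))
    cos = dot / math.sqrt(sum(b * b for b in base) * sum(t * t for t in test))
    rms_diff = math.sqrt(sum((b - t) ** 2 for b, t in zip(base, test)) / n)
    rms_base = math.sqrt(sum(b * b for b in base) / n)
    max_abs = max(abs(b - t) for b, t in zip(base, test))
    return cos, rms_diff, rms_base, max_abs


# Percentages recomputed from the reported figures.
rel_rms_pct = 0.464 / 1332.66 * 100   # rms_diff relative to rms_base
rel_max_pct = 4 / 32767 * 100         # max_abs_diff relative to i16 full scale
print(f"{rel_rms_pct:.3f} %  {rel_max_pct:.3f} %")  # 0.035 %  0.012 %
```

Reporting the RMS figure relative to the signal level and the peak figure relative to full scale keeps the two numbers comparable across clips of different loudness.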

Three §3.24 follow-ups previously tracked:
  ~~kernel_mul_mv_f32_f16_short patch~~ — ✓ shipped this section.
  Q4_0 HiFT via 2-D-on-disk + `conv1d_f32` branch — still deferred
    (larger surgery across converter + C++).
  F32 mul_mm + add(bias) shader fusion — still deferred
    (~150 LOC Metal kernel work for +10–25 ms S3Gen).

Made-with: Cursor
…l-metal

Closes the §3.22 / §3.24 / §3.26 follow-up "F32 mul_mm + add(bias) shader
fusion in patches/ggml-metal-chatterbox-ops.patch". The fusion in the
pinned ggml-metal only covered Q-variant mul_mv (matrix-vector)
kernels via helper_mv_add_bias — the mul_mm (matrix-matrix) kernel
that the CFM transformer actually hits at T·B ≥ 2 had no equivalent.
This patch wires one in.

Three pieces:

1. kernel_mul_mm in ggml-metal.metal gets:
     FC_mul_mm_has_bias_     [[function_constant(FC_MUL_MM + 2)]]
     FC_mul_mm_has_residual_ [[function_constant(FC_MUL_MM + 3)]]
     device const char * bias     [[buffer(4)]]
     device const char * residual [[buffer(5)]]
   When either FC is true the kernel routes through the shmem-
   backed scalar-copy path and folds bias (broadcast over [ne0])
   and/or residual (same shape + stride as dst) into the copy
   loop.  Same post-matmul math as helper_mv_add_bias.  Compiler
   drops the un-selected branch — zero overhead when neither set.

2. get_pipeline_mul_mm in ggml-metal-device.{cpp,h}:
   new has_bias / has_residual flags bake into pipeline name
   (kernel_mul_mm_<T0>_<T1>_bci=X_bco=Y_bias=Z_res=W) and FC
   values.  Shmem bumped from 4 KB+2 KB to 8 KB when fused.

3. Dispatcher ggml_metal_op_mul_mat in ggml-metal-ops.cpp:
   mirrors the Q-mul_mv fusion lookup.  Tries
   {MUL_MAT, ADD, ADD} first, falls back to {MUL_MAT, ADD}.
   Handles both orderings of the residual add (ggml_add is
   commutative; chatterbox's basic_tfm emits
   ggml_add(x, attn_out) with residual x as src[0] and the
   mul_mat+bias result as src[1]).  Writes fused dst to
   node(idx + n_fuse - 1) so the value lands where the skipped
   ADDs would have written, returns n_fuse so the outer loop
   skips them.
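
The lookup order in piece 3 can be illustrated on a toy graph. This is a hedged Python sketch, not the ggml-metal dispatcher: `Node`, `chain_matches` and `plan_mul_mat_fusion` are hypothetical names, and the real can_fuse performs further checks (shapes, types, whether intermediates have other consumers) that are omitted here.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    op: str
    src: list = field(default_factory=list)
    name: str = ""

# Longest pattern first, then the bias-only fallback.
FUSION_PATTERNS = (("MUL_MAT", "ADD", "ADD"), ("MUL_MAT", "ADD"))

def chain_matches(nodes, idx, ops):
    """True if nodes[idx:] start with `ops` and each node consumes its predecessor."""
    if idx + len(ops) > len(nodes):
        return False
    for k, op in enumerate(ops):
        node = nodes[idx + k]
        if node.op != op:
            return False
        if k > 0 and not any(s is nodes[idx + k - 1] for s in node.src):
            return False
    return True

def other_operand(add_node, produced):
    """ADD is commutative: return whichever operand is not the fused intermediate."""
    a, b = add_node.src
    return b if a is produced else a

def plan_mul_mat_fusion(nodes, idx):
    """Return (n_fuse, bias, residual, dst_node) for the MUL_MAT at nodes[idx]."""
    for ops in FUSION_PATTERNS:
        if chain_matches(nodes, idx, ops):
            n_fuse = len(ops)
            bias = other_operand(nodes[idx + 1], nodes[idx])
            residual = (other_operand(nodes[idx + 2], nodes[idx + 1])
                        if n_fuse == 3 else None)
            # The fused result lands where the last skipped ADD would have written.
            return n_fuse, bias, residual, nodes[idx + n_fuse - 1]
    return 1, None, None, nodes[idx]

# Residual as src[0] and the mul_mat+bias result as src[1], as basic_tfm emits it.
x = Node("INPUT", name="x")
w = Node("WEIGHT", name="w")
b = Node("WEIGHT", name="b")
mm = Node("MUL_MAT", [w, x], "mm")
biased = Node("ADD", [mm, b], "mm+b")
resid = Node("ADD", [x, biased], "x+(mm+b)")
n_fuse, bias, residual, dst = plan_mul_mat_fusion([mm, biased, resid], 0)
```

In this toy case the planner reports three fused nodes with `b` as bias and `x` as residual, and the caller would advance its loop by `n_fuse`.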

Kernel variants verified via compile_pipeline trace on M3 Ultra:

  kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0  ← CFM tfm linears
  kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0  ← edge blocks
  kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0   ← time_mlp etc.
  kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0
  kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0  ← unfused passthroughs
  kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0

~1820 saved ggml_add dispatches per synthesis.  No `res=1` variants
fire in the current chatterbox graph — the ADD(residual) in
basic_tfm is separated from the ADD(bias) by layer_norm + mul_mat
+ add(bias) + gelu_erf + mul_mat + add(bias), so can_fuse correctly
refuses to span those nodes.  Infrastructure in place either way
for consumers whose residual is adjacent to mul_mat.

Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42, 5
invocations):

  stage             | §3.26 baseline | §3.27 fused | delta
  ------------------|---------------:|------------:|----------:
  [encoder] ms      |           31.3 |        30.5 | noise
  [cfm_total] ms    |          541.9 |       542.2 | neutral
  [hift_decode] ms  |          121.3 |       121.2 | neutral
  S3GEN_INFER_MS    |            709 |       713.2 | +4 noise
  T3_INFER_MS       |            440 |       433.4 | -7 noise

WAV md5 byte-exact across 5 runs: d8a1b22375dbcb2259c686426a7d76c5
(matches §3.26 v2 baseline).  test-metal-ops PASSes on all four
existing ops.  GGML_METAL_FUSION_DISABLE=1 cross-check: CFM 568.9 ms
steady, confirming the entire fusion system (mine + pre-existing
norm+mul+add + Q-mul_mv+bias) is worth ~27 ms aggregate; my mul_mm
contribution is the small end of that.

Ships as **neutral on M3 Ultra** for two reasons:

  1. M3U's Metal per-dispatch overhead is ~20–30 µs; 1820 dispatches
     * ~25 µs = ~45 ms theoretical, but many overlap with subsequent
     kernels via Metal's command buffer and don't sit on the
     critical path.
  2. Forcing has_bias=true through the shmem path (instead of
     direct-store + post-barrier bias-add — too complex to retrofit
     into both tensor-API + simdgroup-fallback paths in one session)
     costs roughly what the dispatch savings deliver.
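
The arithmetic behind reason 1 and the fusion-disable cross-check, written out as a small Python sketch. It uses only figures quoted in this message; treating the 20-30 µs range as a simple per-dispatch multiplier is the same simplification the estimate above makes.

```python
saved_dispatches = 1820           # ggml_add dispatches removed per synthesis
overhead_us = (20, 25, 30)        # quoted per-dispatch range, 25 as midpoint

# Ceiling on the win if every saved dispatch sat on the critical path.
ceiling_ms = [saved_dispatches * us / 1000.0 for us in overhead_us]

# Measured value of the whole fusion system (GGML_METAL_FUSION_DISABLE=1 run).
cfm_fusion_off_ms = 568.9
cfm_fusion_on_ms = 542.2
all_fusion_gain_ms = cfm_fusion_off_ms - cfm_fusion_on_ms

print(ceiling_ms, round(all_fusion_gain_ms, 1))
```

The measured aggregate for all fusions together comes out below the midpoint ceiling for the mul_mm bias fusion alone, which is consistent with many of the saved dispatches having overlapped with other work.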

Still shipping because:

  - Byte-exact + test-metal-ops PASS (correctness guaranteed).
  - M4 Air / iPhone / iPad have proportionally higher per-dispatch
    overhead and lower core counts — expected +5–15 ms S3Gen win
    there (same ratio §3.24's HiFT F16 result predicted on M4).
  - Mode 2/3 streaming synthesises short chunks where per-chunk
    dispatch count matters more relative to compute — fusion
    should be proportionally larger there.
  - Infrastructure leverage: FC_MUL_MM + 2/+3 slots + helper
    routing are plumbing future sessions will reuse (extend to
    mul_mm_id for MoE, extend to F16 weights, reclaim the
    shmem-roundtrip cost on M3U by patching the direct-store
    paths).

PROGRESS §3.27 has the full bench table + rationale + what's next
(direct-store fold-in ~2–3 h, mul_mm_id, M4 validation).

patches/ggml-metal-chatterbox-ops.patch 733 -> 995 lines, +262,
regenerated from pinned ggml 58c38058, applies cleanly via
`git apply --check`.

Made-with: Cursor
…ff0 path)

Builds directly on §3.27 infrastructure.  Closes the
`mul_mat → add(bias) → gelu_erf` triple in CFM basic_tfm's FF gate
projection (src/chatterbox_tts.cpp:738):

  ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff0_w, nx2), w.ff0_b);
  ff = ggml_gelu_erf(ctx, ff);
  ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff2_w, ff), w.ff2_b);

§3.27 brought mul_mat + add(bias) into a single dispatch via the
shmem-backed scalar-copy path; §3.28 extends that same loop to
apply gelu_erf as the last stage before writing to dst.  The gelu
is inline FP math on each element we're already reading+writing —
no extra memory, no extra barriers, no extra shmem — so unlike
§3.27's neutral-on-M3U result, this one is a clear net positive.

Three pieces:

1. ggml-metal.metal: FC_MUL_MM + 4 = FC_mul_mm_has_gelu_erf_.
   New branch at the end of the scalar-copy loop applying
   `0.5 * v * (1 + erf_approx(v * SQRT_2_INV))` — the same formula
   OP_UNARY_NUM_GELU_ERF uses, using the shared erf_approx<T>
   helper.  Numerically identical to the unfused path.

2. get_pipeline_mul_mm: signature bumped to
   (op, has_bias, has_residual, has_gelu_erf); pipeline name
   extended with _gelu=N; shmem sizing gated on any-fold-in flag.

3. Dispatcher mul_mm path: new {MUL_MAT, ADD, UNARY} can_fuse
   lookup wedged between the {MUL_MAT, ADD, ADD} residual lookup
   and the {MUL_MAT, ADD} bias-only fallback.  Verifies
   ggml_get_unary_op(f2) == GGML_UNARY_OP_GELU_ERF and
   f2->src[0] == f1 before fusing.  Gated to GELU_ERF specifically
   (basic_tfm's only unary); other sub-ops (SILU/GELU/RELU/...)
   extend trivially but aren't needed for chatterbox.
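
A scalar reference for the fused tail may help when reading piece 1. This Python sketch is an illustration under assumptions: it uses `math.erf` rather than the kernel's erf_approx helper, and the two boolean arguments merely stand in for the function constants.

```python
import math

SQRT_2_INV = 1.0 / math.sqrt(2.0)

def gelu_erf(v):
    """0.5 * v * (1 + erf(v / sqrt(2))), the formula quoted above."""
    return 0.5 * v * (1.0 + math.erf(v * SQRT_2_INV))

def unfused(acc, bias):
    """Separate passes: mul_mat result, then add(bias), then gelu_erf."""
    biased = [a + b for a, b in zip(acc, bias)]
    return [gelu_erf(v) for v in biased]

def fused_epilogue(acc, bias, has_bias=True, has_gelu_erf=True):
    """Single pass over the output; each element is touched once."""
    out = []
    for a, b in zip(acc, bias):
        v = a + b if has_bias else a
        if has_gelu_erf:
            v = gelu_erf(v)
        out.append(v)
    return out

acc = [-1.5, -0.25, 0.0, 0.75, 2.0]     # pretend matmul accumulators
bias = [0.1, -0.2, 0.3, -0.4, 0.5]
same = fused_epilogue(acc, bias) == unfused(acc, bias)
```

Because both paths apply the same floating-point operations in the same order per element, the results compare equal, which mirrors the byte-exact claim for the fold-in.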

Kernel variants actually compiled (from GGML_LOG_DEBUG trace):

  kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=1   ← CFM ff0
  kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=1   ← ff0 edge
  kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=0   ← ff2/to_out
  kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=0
  kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0_gelu=0    ← time_mlp
  kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0_gelu=0
  kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0_gelu=0   ← unfused
  kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0_gelu=0

gelu=1 variants handle 56 basic_tfm × 10 CFM steps × 2 CFG batches
= 1120 saved gelu_erf dispatches per synth, on top of the 1820
bias-add dispatches saved in §3.27.

Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42,
5 invocations):

  stage             | §3.27 baseline   | §3.28 this     | delta
  ------------------|-----------------:|---------------:|-------------:
  [encoder] ms      |            30.5  |           30.8 | noise
  [cfm_total] ms    |           542.2  |   533.4 ± 1.0  | -8.8 / -1.6 %
  [hift_decode] ms  |           121.2  |          120.8 | neutral
  S3GEN_INFER_MS    |           713.2  |   706.0 ± 0.8  | -7.2 / -1.0 %
  T3_INFER_MS       |           433.4  |          431.0 | noise
  WAV md5           |       d8a1b22…   |      d8a1b22…  | byte-exact x5

test-metal-ops: all 4 pre-existing ops PASS.  WAV md5 byte-exact
to §3.26 / §3.27 baseline across all 5 invocations.  Deterministic.

Why this lands positive on M3 Ultra (unlike §3.27):

§3.27's gain was eaten by the shmem-roundtrip cost: routing
through temp_str + sgitg==0 scalar copy costs about what the 1820
eliminated ggml_add dispatches saved.  §3.28 adds the gelu fold-in
INTO the same loop — no additional memory accesses, no barriers,
no extra shmem — just a handful of FLOPs per element.  So the
1120 saved gelu_erf dispatches show up as clean net positive.

This also refines the §3.27 story: the infrastructure we built
there is what makes §3.28 cheap.  Fusing additional per-element
tail ops into the scalar-copy loop is essentially free, whereas
routing through the shmem path is what cost M3 Ultra its estimated
§3.27 win.

Patch file: 995 -> 1054 lines (+59).  Applies cleanly on a fresh
ggml clone at pinned 58c38058 via `git apply --check`.

PROGRESS §3.28 has full writeup + bench table + kernel variant
list + next-step pointers (extend to other unary sub-ops trivially;
reclaim §3.27 shmem cost by patching direct-store paths).

Made-with: Cursor
…verted

Goal: turn §3.27's neutral-on-M3-Ultra result into a win by doing
the bias / residual / gelu_erf fold-in as a post-barrier
read-modify-write on device memory, instead of routing through
the shmem + scalar-copy path.  The shmem path §3.27 ships is
correct but costs a threadgroup-memory roundtrip that roughly
cancels out the dispatch savings on M3 Ultra.  (§3.28 got a
clean 8.8 ms CFM reduction precisely because its gelu fold-in
happened *inside* a loop that was already going to run, at zero
extra memory cost.)

Implementation (reverted, left as a 21-line doc block in the
kernel): after `cT.store(tC)` or the 4-simdgroup
`simdgroup_store` loop writes the full 64x32 block to device
memory, use a threadgroup_barrier(mem_flags::mem_device) to
order the writes, then distribute the RMW across all 128
threads of the threadgroup (16 elements per thread, full block
coverage regardless of how cT.store's cooperative layout
partitioned the write).
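
The thread-to-element split described above can be checked with a few lines of Python. This sketch covers the index arithmetic only. The strided walk, the `tile[col][row]` layout and the choice of the 64-wide dimension as the bias direction are assumptions, and a sequential loop says nothing about the barrier ordering question.

```python
BLOCK_ROWS = 64      # assumed: tile extent along the bias (ne0) direction
BLOCK_COLS = 32
THREADS = 128
PER_THREAD = BLOCK_ROWS * BLOCK_COLS // THREADS   # 2048 / 128 = 16

def thread_elements(tid):
    """(row, col) pairs one thread updates, via a strided walk over the tile."""
    pairs = []
    for k in range(PER_THREAD):
        flat = tid + k * THREADS
        pairs.append((flat % BLOCK_ROWS, flat // BLOCK_ROWS))
    return pairs

def apply_bias_rmw(tile, bias, r0):
    """Read-modify-write pass over an already stored tile."""
    for tid in range(THREADS):
        for row, col in thread_elements(tid):
            tile[col][row] += bias[r0 + row]
    return tile

# Every element of the 64x32 tile is visited exactly once.
visited = [p for tid in range(THREADS) for p in thread_elements(tid)]
full_coverage = len(visited) == len(set(visited)) == BLOCK_ROWS * BLOCK_COLS
```

Coverage here is independent of how the preceding store partitioned its writes, which is the property the paragraph above relies on.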

Result: test-metal-ops PASSed (diag_mask_inf, pad_ext,
conv_transpose_1d × 3 + tiny) but end-to-end chatterbox synth
produced wrong output:

  metric      | §3.28 baseline                       | §3.29 attempt
  ------------|--------------------------------------|------------------
  md5         | d8a1b22375dbcb2259c686426a7d76c5     | 06ee1aaaa94a10d70eec2835d3da7dbf
  T3 tokens   | 84                                   | 70  (EOS 14 early)
  audio_ms    | 3480                                 | 2920
  determinism | stable across 5 runs                 | stable (same wrong md5)

Deterministic but wrong — a systematic computation bug, not a
race.  Suspected root causes (not isolated):

  1. cT.store's cooperative layout is implementation-defined;
     threadgroup_barrier(mem_device) may not be strong enough
     to order its writes against subsequent device reads on
     A17 / M3.  Would need a `fence()` or `simdgroup_fence_t`
     audit.
  2. residual_ok's `ggml_are_same_shape(resi, mm)` check
     doesn't distinguish contiguous from view-with-strides;
     §3.27's scalar-copy path would mask a stride bug that
     direct-store would reveal.
  3. An off-by-one or wrong-stride in the RMW's index
     arithmetic that test-metal-ops doesn't exercise.

What's missing: no per-shape unit test for fused
`mul_mm + add(bias)` vs unfused `mul_mat + add` reference.
test-metal-ops only covers diag_mask_inf, pad_ext,
conv_transpose_1d.  §3.27 / §3.28 happen to be byte-exact
because their fold-in happens inside the scalar-copy loop
which is straightforward; §3.29's direct-store RMW has
subtler data flow that would benefit from explicit coverage.

Reverted.  cb_rev.wav md5 byte-exact to §3.28 baseline
`d8a1b22375dbcb2259c686426a7d76c5`; T3 back to 84 tokens /
3480 ms.  No code change from §3.28 beyond a 21-line
documentation block in the kernel explaining what was
attempted, what failed, and what the next person should try
first (add the unit test, then retry with bias-only scope).

Patch 1054 → 1070 lines (+16).  Applies cleanly on pinned
ggml 58c38058.

PROGRESS §3.29 has the full writeup, including next-person
notes pointing at §5.7 of Apple's Metal Shading Language
Specification for cooperative-store barrier semantics.

Made-with: Cursor
…irect-store retry

Closes two §3.29 loose ends:

1. New harness in src/test_metal_ops.cpp — test_mul_mm_fused(cpu,
   gpu, K, N, T, B, fuse_mode, label).  Builds the small graph
   add(mul_mat(W_q4_0, X_f32), bias) (and optionally + gelu_erf),
   dispatches on both backends, compares element-wise.  On Metal
   ggml-metal's fusion detector collapses this into a single
   kernel_mul_mm_..._bias=1_res=X_gelu=Y dispatch; CPU is always
   the unfused triple.  Any numerical drift beyond 2e-2 absolute
   (4x the Q4_0-dequant-order CPU-vs-GPU noise floor observed on
   K=256..1024 shapes) flags a real bug.  This is the test §3.29
   should have had; would have caught the reverted direct-store
   RMW's wrong-output regression in seconds instead of forcing a
   full end-to-end chatterbox bench to surface it.

2. Bias-only direct-store retry: full-block writes with
   has_bias && !has_residual && !has_gelu_erf now take the
   direct-store path (cT.store / simdgroup_store → device memory),
   followed by a threadgroup_barrier and a 128-thread post-scan
   adding bias[r0 + row_off] to each of the 2048 block elements.
   Residual / gelu fold-ins still route through shmem — §3.29's
   negative finding on those paths stands, root cause unresolved
   (suspect cT.store cooperative-layout × mem_flags::mem_device
   barrier semantics on A17/M3; deeper audit required).  This is
   the minimum-scope slice of §3.29 that the new harness proves
   byte-stable.
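
The comparison idea behind the harness in item 1, reduced to a Python sketch. It is not test_mul_mm_fused itself: the helper names are hypothetical, the weights are plain floats rather than Q4_0, and the "GPU" results are simulated. Only the 2e-2 absolute tolerance comes from the text above.

```python
TOL_ABS = 2e-2   # absolute tolerance quoted for the harness

def matmul_bias(w, x, bias):
    """Reference: out[t][n] = sum_k w[n][k] * x[t][k] + bias[n]."""
    return [[sum(wn[k] * xt[k] for k in range(len(xt))) + bias[n]
             for n, wn in enumerate(w)]
            for xt in x]

def max_abs_diff(ref, got):
    return max(abs(r - g) for rr, gr in zip(ref, got) for r, g in zip(rr, gr))

def check(label, ref, got, tol=TOL_ABS):
    diff = max_abs_diff(ref, got)
    return label, ("OK" if diff <= tol else "FAIL"), diff

w = [[0.5, -0.25, 0.125, 1.0],
     [1.0, 0.0, -1.0, 0.5],
     [0.25, 0.25, 0.25, 0.25]]
x = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -0.5, 0.5, -0.5]]
bias = [0.1, -0.2, 0.3]

ref = matmul_bias(w, x, bias)
# Simulated rounding-level noise stays inside the tolerance.
noisy = [[v + 1e-3 for v in row] for row in ref]
# Simulated indexing bug: bias applied one row off.
shifted = matmul_bias(w, x, bias[1:] + bias[:1])

results = [check("noise", ref, noisy), check("wrong-bias-row", ref, shifted)]
```

A systematic indexing error of the kind suspected in §3.29 shows up as a difference far above the tolerance, while rounding-level noise does not.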

Harness coverage — 8 fused-mul_mm shape variants matching CFM
hot path:

  [mul_mm_fused cfm-attn-qkv]          OK  K=256  N=256  T=87 B=2  bias
  [mul_mm_fused cfm-attn-out]          OK  K=256  N=512  T=87 B=2  bias
  [mul_mm_fused cfm-ff-gate-bias]      OK  K=256  N=1024 T=87 B=2  bias
  [mul_mm_fused cfm-ff-gate-bias+gelu] OK  K=256  N=1024 T=87 B=2  gelu
  [mul_mm_fused cfm-ff-down]           OK  K=1024 N=256  T=87 B=2  bias
  [mul_mm_fused cfm-b1]                OK  K=256  N=512  T=87 B=1  bias
  [mul_mm_fused bco-bias]              OK  K=256  N=320  T=87 B=2  bias  (bco=1 shmem path)
  [mul_mm_fused bco-gelu]              OK  K=256  N=320  T=87 B=2  gelu  (bco=1 shmem path)

All existing test-metal-ops cases (diag_mask_inf, pad_ext,
conv_transpose_1d × 3 + tiny) still pass.

Bias-only direct-store bench impact on M3 Ultra (5 invocations,
Q4_0 + HiFT F16, ES prompt, seed 42):

  metric             | §3.28            | §3.30            | delta
  -------------------|-----------------:|-----------------:|--------:
  [cfm_total] ms     |     533.4 ± 1.0  |     534.0 ± 0.9  | noise
  S3GEN_INFER_MS     |     706.0 ± 0.8  |     706.2 ± 3.2  | noise
  [hift_decode] ms   |        121.2     |        121.8     | noise

Neutral on M3 Ultra (same as §3.27) because in chatterbox's
basic_tfm every mul_mat+bias has a follow-up op (residual or
gelu) that forces the 3-op fusion through the shmem path.  The
2-op path §3.30 optimises only fires for tensors outside
basic_tfm (time_mlp / final_proj / resnet t_mlp) that don't
contribute measurable wall time.

WAV md5 byte-exact vs §3.28 baseline across all 5 runs:
d8a1b22375dbcb2259c686426a7d76c5.  T3 84 tokens, audio 3480 ms.

The harness is the real deliverable.  Any future attempt at the
residual / gelu direct-store paths now has a way to get fast
feedback on whether a change is correct before a full
chatterbox run.

Patch regenerated: 1070 -> 1088 lines (+18).  Applies cleanly
on fresh ggml at pinned 58c38058.

PROGRESS §3.30 has the full writeup including the three remaining
deferred items: residual direct-store (needs barrier audit), gelu
direct-store (same), and extending fusion to other unary sub-ops
(SILU/GELU/RELU/GELU_QUICK) — none of those are chatterbox hot-path
so deferred.

Made-with: Cursor
…ion.sh

Closes the validation gap for §3.24 / §3.26 / §3.27 / §3.28 / §3.30.
All five are predicted positive on bandwidth-limited Apple silicon
(M4 / iPhone / iPad) but were measured only on M3 Ultra where
per-dispatch overhead is so low that the fusion wins largely cancel
out against kernel-path overhead.  Two pieces:

1. iOS-arm64 cross-build portability:

     cmake -S . -B build-ios \
       -DCMAKE_SYSTEM_NAME=iOS \
       -DCMAKE_OSX_SYSROOT=iphoneos \
       -DCMAKE_OSX_ARCHITECTURES=arm64 \
       -DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
       -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
       -DGGML_NATIVE=OFF -DGGML_BLAS=OFF -DGGML_ACCELERATE=OFF

   Both libggml-metal.a and libtts-cpp.a compile clean for iOS
   14.0+ arm64 on Xcode 16 / iOS 18.5 SDK.  Structural proof that
   §3.26's kernel_mul_mv_f32_f16{,_4,_short} + §3.27/§3.28/§3.30's
   kernel_mul_mm FC-gated bias / gelu_erf fold-ins are iOS-portable
   — none of the new kernel code uses macOS-only intrinsics.
   Runtime validation still needs TestFlight / device provisioning;
   this confirms no compile-time barrier to shipping.

2. scripts/bench-m4-validation.sh — self-contained 150-line bash
   harness.  Pipeline on any Apple-silicon Mac:

     - scripts/setup-ggml.sh to apply Metal + OpenCL patches at
       the pinned ggml commit
     - build-metal Release build
     - test-metal-ops asserts all 14 gates PASS (3 base + 3
       conv_transpose_1d + 8 fused-mul_mm)
     - 5 invocations of chatterbox on the Spanish-prompt baseline
     - mean [encoder] / [cfm_total] / [hift_decode] / S3Gen / T3
     - md5 determinism (all 5 runs identical) and byte-exactness
       check vs M3 Ultra reference d8a1b22375dbcb2259c686426a7d76c5
     - writes artifacts/bench/m4-validation.json with the full
       host-vs-reference comparison + Δ% per stage

   M3 Ultra reference numbers (post-§3.30, baked into the script):
     cfm=534.0 ms  s3gen=706.6 ms  t3=432.6 ms  hift=121.1 ms

   Env vars: T3_GGUF, S3GEN_GGUF, REF_WAV, RUNS, OUT_DIR.

Self-smoke on M3 Ultra (expected: tiny deltas, byte-exact PASS):

  stage                 M3 Ultra (ref)       this host       Δ vs M3U
  [cfm_total] ms                 534.0           533.7    -0.3 (-0.1%)
  S3GEN_INFER_MS                 706.6           707.4    +0.8 (+0.1%)
  T3_INFER_MS                    432.6           434.6    +2.0 (+0.5%)
  [hift_decode] ms               121.1           123.1    +2.0 (+1.7%)
  determinism: PASS  (md5 d8a1b22375dbcb2259c686426a7d76c5 x5)
  byte-exact vs M3 Ultra: PASS

All deltas within per-invocation stdev; script is ready to
scp + run on any M4 / M3 / M2 box.  If M4 shows noticeably
smaller CFM than M3U (after accounting for M4's higher single-core
clock), §3.27/§3.28/§3.30's predicted-positive story is vindicated.
If M4 matches M3U or regresses, those sections should be revisited.
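
The host-vs-reference comparison can be sketched in Python as follows. The real harness is the bash script named above; the function names and return shapes here are assumptions, while the reference numbers and md5 are the ones quoted in this message.

```python
REFERENCE_MS = {"cfm_total": 534.0, "s3gen": 706.6, "t3": 432.6, "hift_decode": 121.1}
REFERENCE_MD5 = "d8a1b22375dbcb2259c686426a7d76c5"

def stage_deltas(host_ms, ref_ms=REFERENCE_MS):
    """Per-stage (delta in ms, delta in %) versus the M3 Ultra reference."""
    out = {}
    for stage, ref in ref_ms.items():
        delta = host_ms[stage] - ref
        out[stage] = (round(delta, 1), round(100.0 * delta / ref, 1))
    return out

def md5_verdict(run_md5s, reference=REFERENCE_MD5):
    """(deterministic, byte_exact): all runs identical, and equal to the reference."""
    deterministic = len(set(run_md5s)) == 1
    byte_exact = deterministic and run_md5s[0] == reference
    return deterministic, byte_exact

# Self-smoke numbers from the table above.
host = {"cfm_total": 533.7, "s3gen": 707.4, "t3": 434.6, "hift_decode": 123.1}
deltas = stage_deltas(host)
verdict = md5_verdict([REFERENCE_MD5] * 5)
```

On an M4 host the interesting output is the sign and size of the `cfm_total` delta, since that is the stage the fusion sections target.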

What I can't do from this host:
- Run on an M4 Air or iOS device (no ssh access to the user's
  M4 laptop, no iOS device provisioning on this box).  Tailscale
  shows the user's MacBook as reachable but port 22 timed out
  (sshd not enabled / firewalled).  The script is the best
  hand-off.

PROGRESS §3.31 documents the iOS build procedure + harness usage
+ expected M4 output shape.

Made-with: Cursor
Wraps the Apr 30 – May 1 Metal optimisation pass on chatterbox.cpp
multilingual into a single closeout document.

SUMMARY-3.24-3.31.md (new, 120 lines):

- Per-section commit log with M3 Ultra deltas and GGUF-size deltas
- Parity guarantees (WAV byte-exact d8a1b22375dbcb2259c686426a7d76c5,
  14/14 test-metal-ops gates, 8 model pairs, streaming, long-text,
  patch portability, iOS-arm64 cross-build)
- Open follow-ups with effort + expected gain + status
- Final bench of the shipping config (Q4_0 + HiFT F16 v2) —
  CFM 534.0±1.3 ms, S3Gen 706.6±4.5 ms, T3 432.6±2.2 ms,
  inference ~1165 ms, RTF 0.33
- Reproduction commands

README.md "Performance" section gets a new subsection
"Multilingual (M3 Ultra, post §3.24–§3.31 Metal kernel portfolio)"
showing the 5-run numbers next to the existing §3.21 N=10 row:

  stage                         §3.21  →  §3.28   delta
  T3 infer                      482 ms →  433 ms  -49 ms / -10.2 %
  S3Gen infer                   730 ms →  706 ms  -24 ms /  -3.3 %
  RTF                           0.35   →  0.33

Net across 8 commits (c47c776..0902381):
- 5 measurable: §3.24 (-3.6 ms HiFT), §3.26 (-7.7 MB GGUF),
  §3.27 (infra), §3.28 (-8.8 ms CFM), §3.30 (harness infra)
- 3 docs/negative: §3.25 (FA neg), §3.29 (direct-store neg),
  §3.31 (iOS portability + M4 handoff script)

M3 Ultra shipping config CFM drops 541.9 -> 534.0 ms (-1.5 %);
bandwidth-limited silicon (M4 / iPhone / iPad) predicted larger
win but unmeasured — scripts/bench-m4-validation.sh ready to
confirm when test host is available.

Made-with: Cursor
@GustavoA1604 GustavoA1604 changed the base branch from multilingual_merged to multilingual_merged_metal May 4, 2026 16:19
@GustavoA1604 GustavoA1604 changed the base branch from multilingual_merged_metal to multilingual_merged May 4, 2026 16:36
@GustavoA1604 GustavoA1604 merged commit ee47d5e into GustavoA1604:multilingual_merged May 4, 2026
GustavoA1604 added a commit that referenced this pull request May 4, 2026
…ng, decode)

Folds in five small follow-up fixes from gianni-cor's review of PR #3
(gianni-cor/chatterbox.cpp). All are byte-equivalent for the runtime
audio path (md5 57cc80f27a122f03435fd05f47d1b3d2 unchanged on the ES
reference prompt + jfk/gianni voice + seed 42); the changes target
portability, error-message accuracy, and self-documenting code.

1. scripts/dump-t3-mtl-reference.py: replace the hardcoded
   /Users/gustavoefa/dev/chatterbox-ref/src sys.path entry with
   ${CHATTERBOX_REF_SRC} (default ../chatterbox-ref/src relative to the
   repo root, matching the layout the README prescribes). Anyone else
   regenerating the parity .npy dumps no longer hits ModuleNotFoundError.

2. src/main.cpp load_model_gguf: when chatterbox.variant is present but
   not GGUF_TYPE_STRING, refuse to load with a clear error instead of
   silently defaulting to the Turbo loader (which would later die with a
   misleading "missing tensor" error from absent GPT-2 names).

3. src/t3_mtl.cpp build_perceiver_attn: add a comment clarifying that
   the LayerNorm eps is intentionally fixed at 1e-5 (PyTorch
   nn.LayerNorm default) and is NOT hp.eps (the Llama backbone's
   RMSNorm eps), so a future reader doesn't "fix" it and break parity.

4. scripts/convert-t3-mtl-to-gguf.py: rename SUPPORTED_LANGUAGES to
   ALL_KNOWN_LANGUAGES with a comment explaining the C++/Python contract
   divergence (Python tokenizer accepts 23 codes; C++ tokenizer hard-
   errors on 5 of them at runtime - ja/he/ru/zh/hi need pykakasi /
   dicta / russian_text_stresser / Cangjie preprocessing). Updates the
   GGUF metadata writer to use the new name.

5. src/mtl_tokenizer.cpp decode: change the leading-space guard from
   `k > 0` to `!out.empty()` so that when the first id(s) are skipped
   special markers ([START], [PAD], ...), the first emitted regular
   token doesn't get a spurious leading space. Brings the debug-only
   decoder in line with Python's reference behaviour.
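
The effect of fix 5 is easy to see in a simplified model. This Python sketch is not the C++ decoder: the special-token set and the space-joined output format are illustrative assumptions, and the `fixed` switch exists only to show both guards side by side.

```python
SPECIAL = {"[START]", "[STOP]", "[PAD]"}   # illustrative, not the real vocabulary

def decode(tokens, fixed=True):
    """Join regular tokens with spaces, skipping special markers."""
    out = ""
    for k, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue
        # Old guard keyed on the token index; new guard keys on emitted output.
        need_space = (out != "") if fixed else (k > 0)
        if need_space:
            out += " "
        out += tok
    return out

ids = ["[START]", "hola", "mundo", "[STOP]"]
before = decode(ids, fixed=False)
after = decode(ids, fixed=True)
```

With the index-based guard the first regular token follows a skipped marker at index 0 and therefore picks up a leading space; keying on whether anything has been emitted avoids that.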

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 pushed a commit that referenced this pull request May 4, 2026