Metal optimisation #3
Merged
GustavoA1604 merged 19 commits on May 4, 2026
The §3.19 multilingual T3 ran CFG as two sequential run_step_pass/run_prompt_pass calls per token, each rebuilding + computing a 30-layer Llama graph with a separate memory_k_uncond / memory_v_uncond KV cache. On Metal this doubled per-step kernel dispatch + weight-read overhead -- exactly the regression `use_b2` already paid off for S3Gen's CFM (chatterbox_tts.cpp:1994 / §3.19).

This commit mirrors that on T3 by packing cond + uncond into the batch dim (ne[3]=2) for inputs_embeds, pos_ids, kq_mask, and the per-layer Q/K/V activations. RoPE + flash_attn_ext both broadcast head/seq dims over batch out of the box, so build_llama_block only grew an int B parameter and a size_t b_offset_elems (one cache slab offset for the legacy B=1 CPU fallback).

KV layout rework: the two parallel 1-D F32 KV buffers (memory_k + memory_k_uncond) are now a single contiguous 2 * kv_layer_elems buffer per layer, cond at offset 0, uncond at offset kv_layer_elems. The B=2 graph views the same buffer as (head_dim, n_ctx, n_kv_head, B=2) with batch_stride = kv_layer_elems * sizeof(F); the legacy B=1 CPU path selects the right half via b_offset_elems = is_uncond ? kv_layer_elems : 0. Total backend allocation is unchanged (still 2x kv_elements per cache); we just dropped two ggml_new_tensor_1d calls.

eval_step_mtl / eval_prompt_mtl dispatch the B=2 path when !ggml_backend_is_cpu(model.backend) -- mirrors use_b2 in S3Gen exactly. CPU keeps the two-call path (per-op B=2 work doubles without saving ops on ggml-cpu).

Bench (M3 Ultra Metal, §3.19 reference Spanish prompt, seed 42, greedy):
- Q4_0 T3: 872 ms -> 502 ms (-42%); RTF 0.46 -> 0.35
- F16 T3: 1099 ms -> 602 ms (-45%); RTF 0.53 -> 0.37

Parity gates passed:
- Greedy decode token parity at --temp 0 --top-k 1: first 100 tokens identical to the two-call baseline on seed 42.
- End-to-end WAV byte-exact match vs the §3.19 reference run on Q4_0 and F16 (cmp baseline.wav phase1.wav -> identical).
- CPU fallback (--n-gpu-layers 0) still produces audio with the B=1 path.

The KV strides in build_llama_block are routed through ggml_type_size(memory_k->type), so flipping GGML_TYPE_F32 -> GGML_TYPE_F16 in load_model_gguf_mtl is a one-line opt-in for memory-bound backends. On M3 Ultra it's neutral (Metal flash_attn_ext was already running its inner matmul at F16 regardless of K/V storage dtype); kept at F32 to preserve the §3.19 numerics envelope.

Made-with: Cursor
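The single-buffer KV layout described in this commit reduces to plain offset arithmetic, sketched below. This is an illustration only, not the code in t3_mtl.cpp: the dimension values and the helper name `kv_offset_elems` are assumptions made up for the example; only the cond/uncond offsets and the batch-stride formula come from the commit message.

```python
# Sketch of the per-layer KV slab layout. Dimension values are illustrative assumptions.
HEAD_DIM, N_CTX, N_KV_HEAD = 64, 4096, 16   # hypothetical model dimensions
F32_SIZE = 4                                # bytes per element while the cache stays F32

# Elements in one half of the slab (either cond or uncond).
kv_layer_elems = HEAD_DIM * N_CTX * N_KV_HEAD


def kv_offset_elems(is_uncond: bool) -> int:
    """Element offset the legacy B=1 path uses to pick its half of the slab."""
    return kv_layer_elems if is_uncond else 0


# The B=2 graph views the same buffer as (head_dim, n_ctx, n_kv_head, 2):
# stepping one batch entry moves exactly one half-slab forward.
batch_stride_bytes = kv_layer_elems * F32_SIZE

# Same total as the two separate buffers it replaces.
total_bytes = 2 * kv_layer_elems * F32_SIZE

print(kv_offset_elems(False), kv_offset_elems(True), batch_stride_bytes, total_bytes)
```

Because the uncond half starts exactly one batch stride after the cond half, the B=1 and B=2 paths can share the buffer without any copy.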
The s3gen_synthesize_opts.cfm_steps field already existed (and was
honoured by the inner CFM loop in chatterbox_tts.cpp:1973), but the CLI
only surfaced it for streaming chunks via --stream-cfm-steps.
Non-streaming MTL was therefore locked at the GGUF's n_timesteps=10.
Add a top-level --cfm-steps N flag and handle it at all three
s3gen_synthesize_opts setup sites:
- tokens-file short-circuit (no T3, just S3Gen + HiFT): honours
--cfm-steps
- regular non-streaming synthesis path: honours --cfm-steps
- the --input-file live-input streaming path: keeps using
--stream-cfm-steps; --cfm-steps is documented as a non-streaming
knob there.
Default stays at 0 (use the GGUF's n_timesteps), so existing scripts
keep their behaviour. Quality knee on the §3.19 reference Spanish
prompt (M3 Ultra Metal Q4_0; log-mel cosine vs N=10):
  N=6   S3Gen 518 ms   log-mel-cos 0.990   PCM-cos 0.88   (borderline)
  N=7   S3Gen 571 ms   log-mel-cos 0.995   PCM-cos 0.94   ← recommended
  N=8   S3Gen 629 ms   log-mel-cos 0.997   PCM-cos 0.97
  N=10  S3Gen 730 ms   log-mel-cos 1.000   PCM-cos 1.00   (default)
N=7 cleanly clears the cos >= 0.99 gate; N=6 sits right on it (PCM
cosine drops to 0.88 -- phase-coherent attack reconstruction starts
to drift).
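The "0 means use the GGUF value" default can be sketched as follows, together with the S3Gen time saved at each step count from the table above. The helper name `resolve_cfm_steps` is hypothetical and is not the CLI's actual code; the timings are copied from the table.

```python
def resolve_cfm_steps(cli_steps: int, gguf_n_timesteps: int) -> int:
    """Hypothetical helper: 0 (the default) falls back to the GGUF's n_timesteps."""
    return cli_steps if cli_steps > 0 else gguf_n_timesteps


# S3Gen wall time per step count, copied from the table (M3 Ultra Metal, Q4_0).
S3GEN_MS = {6: 518, 7: 571, 8: 629, 10: 730}

for cli in (0, 7):
    n = resolve_cfm_steps(cli, gguf_n_timesteps=10)
    saved = S3GEN_MS[10] - S3GEN_MS[n]
    print(f"--cfm-steps {cli}: runs {n} steps, saves {saved} ms of S3Gen vs N=10")
```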
Made-with: Cursor
build_llama_block did `silu(gate) * up` as three separate ggml ops: ggml_silu(...), ggml_mul_mat(mlp_up, ...), ggml_mul(silu_out, up_out). That's a silu + mul element-wise pair on top of the two mul_mats, at 30 dispatches/token across layers.

Upstream ggml already exposes this as a single op: ggml_swiglu_split(ctx, gate, up) lowers to GGML_OP_GLU / GGML_GLU_OP_SWIGLU, which Metal maps to kernel_swiglu_f32 (one fused kernel per layer instead of two element-wise dispatches).

The pre-norm `mul(rms_norm(x), g)` pattern was already auto-fused by ggml-metal's can_fuse(RMS_NORM, MUL) path (kernel_rms_norm_mul_f32); left as-is so CPU + non-Metal backends get the same op shape.

Bench (M3 Ultra Metal, vs the B=2 CFG baseline from the previous commit, 3 warm runs averaged):
- Q4_0 T3: 502 ms -> 482 ms (-4%, within plan's 5% gate but trend is positive across all 3 runs)
- F16 T3: 602 ms -> 635 ms (+5%, within run-to-run variance)

Marginal on M3 Ultra (the SwiGLU kernel was not on the critical path once Phase 1 collapsed the per-step weight-read + dispatch cost). Kept for code clarity (single fused op vs three) + because future ggml-metal kernel improvements to swiglu will land here automatically.

WAV byte-exact vs the previous commit (kernel_swiglu_f32 is bit-equivalent to manual `silu(gate) * up`).

Made-with: Cursor
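The equivalence between the two-pass and the fused formulation can be checked on scalars. The sketch below uses plain Python floats as a stand-in for the Metal kernels, so it shows only the arithmetic, not the dispatch saving; the function names are made up for the example.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(gate, up):
    activated = [silu(g) for g in gate]             # element-wise pass 1: silu
    return [a * u for a, u in zip(activated, up)]   # element-wise pass 2: mul


def swiglu_fused(gate, up):
    # One pass: activation and multiply happen per element before the store.
    return [silu(g) * u for g, u in zip(gate, up)]


gate = [-2.0, -0.5, 0.0, 0.5, 2.0]
up = [1.0, -1.0, 3.0, 0.25, -4.0]
assert swiglu_unfused(gate, up) == swiglu_fused(gate, up)
```

Both versions perform the same floating-point operations in the same order per element, which is why a fused kernel can stay bit-equivalent to the manual form.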
README:
- Multilingual benchmark table at the top grew M3 Ultra rows
(`--cfm-steps 7` and default N=10, Q4_0 + F16) alongside the
existing M4 rows. Best M3 Ultra config now hits RTF 0.30 / **48.4×
faster than ONNX Runtime** on the same prompt; M4 numbers retained
for §3.19/§3.20 continuity.
- New "Multilingual (Mac Studio M3 Ultra, after §3.21 optimisation
pass)" stage block under ## Performance, mirroring the M3 Ultra
Turbo block: T3 / S3Gen / RTF for {Q4_0, F16} × {N=10, N=7}.
- The MTL `tts-cli` example mentions `--cfm-steps`; flag points
readers at PROGRESS.md §3.21 for the quality knee numbers.
PROGRESS.md:
- Append §3.21 mirroring the §3.19/§3.20 layout (pre-rationale ->
what shipped -> bench tables -> what didn't work -> backlog).
Documents the bench matrix on M3 Ultra Metal {Q4_0, F16} ×
{baseline, +Phase 1, +Phase 2, +Phase 4, +final} with stage
breakdowns and audio-quality gates.
- "What didn't work -- NEGATIVE results" subsection explicitly
documents Phase 2 (F16 KV) as a wash on M3 Ultra (with the
observation that ggml-metal's flash_attn_ext was already running
its inner matmul at F16 regardless of K/V storage dtype) and the
Phase 4-stretch Metal-patch fusions as unnecessary because upstream
already covers them.
- "What's next for MTL" backlog from §3.19 is updated with strikes
on items now done (T3 quantisation, --cfm-steps); items remaining
(heterogeneous-core CPU thread default, ja/he/ru/zh/hi tokenizer,
speculative decoding, M4 F16 KV measurement) listed at the end.
Made-with: Cursor
…p scaffolding

Three small allocator-side cleanups on top of §3.21 -- all byte-exact WAV output (MD5 match against multilingual_merged HEAD), all neutral-to-slightly-positive on M3 Ultra (the savings are too small to escape run-to-run noise here, ~1% on S3Gen wall time, but the changes remove unambiguously wasted work that matters more on slower CPUs / older Metal where the topology walks and 64 MB memset are pricier).

1. `t3_mtl.cpp`: drop the explicit `ggml_gallocr_reserve` before `ggml_gallocr_alloc_graph` in the four `run_*_pass[_b2]` paths. `alloc_graph` already calls `ggml_gallocr_needs_realloc` and only triggers a re-reservation when the graph's per-node sizes grew. The per-step graph keeps the same node count + same shapes for every n_past >= 1, so 83 of the 84 step-pass reserves were doing a full O(n_nodes) topology walk for nothing. Per-step graph build itself is unchanged.
2. `chatterbox_tts.cpp` `run_hift_decode`: turn the 64 MB scratch `std::vector<uint8_t> buf(buf_size)` into a `thread_local`. The previous version forced a 64 MB memset on every HiFT call (one per `--out` invocation in batch mode, one per chunk in streaming). ggml_init resets the arena pointer, so the buffer is reused safely across calls without leaking tensor metadata.
3. `chatterbox_tts.cpp` `compute_time_mlp`: hoist the per-step ggml_init / build-graph / gallocr_new / gallocr_reserve / gallocr_free sequence into a `thread_local time_mlp_cache`. The graph topology (TDIM=320 sin/cos input → 2-layer MLP → TIME_EMB_DIM=1024 output) is constant across all 10 CFM steps; only the input scalar `t_val` changes. Now we build + reserve once, then per-step we just `alloc_graph` + `tensor_set` + `compute` + `tensor_get`. Cache key is `(backend)` so a backend swap rebuilds.

Bench (M3 Ultra Metal, Q4_0, ES prompt, seed 42, --temp 0 --top-k 1, jfk.wav voice, 3 invocations):

| Stage       | §3.21 base | this commit |       Δ |
|-------------|-----------:|------------:|--------:|
| T3 ms       |        479 |         474 |     -1% |
| cfm_total   |        561 |         550 |     -2% |
| hift_decode |        128 |         125 |     -2% |
| S3Gen ms    |        730 |         728 |   -0.3% |
| Total ms    |       1209 |        1202 |   -0.6% |

Within run-to-run noise (~1.5% on T3) but consistently the same direction across three back-to-back runs, and zero regression risk because `ggml_gallocr_alloc_graph` covers the lazy-reserve case.

WAV byte-exact gate (md5sum):
- pre-commit: 79002f09bc48dda95ec0c2cfc2b895bd
- post-commit: 79002f09bc48dda95ec0c2cfc2b895bd

Made-with: Cursor
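The build-once / reuse-per-step idea behind the `time_mlp_cache` in item 3 can be sketched as below. The class and its fields are hypothetical stand-ins, not the C++ structure; thread-locality is omitted for brevity and the "graph" is a placeholder tuple rather than a real ggml graph.

```python
class TimeMlpCache:
    """Hypothetical stand-in for a cache keyed on the backend."""

    def __init__(self):
        self.backend = None
        self.graph = None
        self.builds = 0

    def get(self, backend):
        # Rebuild only on first use or when the backend changes.
        if self.graph is None or self.backend != backend:
            self.graph = ("time_mlp_graph", backend)  # placeholder for build + reserve
            self.backend = backend
            self.builds += 1
        return self.graph


cache = TimeMlpCache()
for step in range(10):           # 10 CFM steps, identical topology each time
    cache.get("metal")
builds_after_ten_steps = cache.builds

cache.get("cpu")                 # a backend swap forces one rebuild
builds_after_swap = cache.builds

print(builds_after_ten_steps, builds_after_swap)
```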
PROGRESS gets a new §3.22 documenting the three small allocator-side cleanups shipped in 6141cf2 (drop redundant gallocr_reserve; thread_local HiFT scratch; cached time_mlp graph). Numbers on M3 Ultra are within run-to-run noise (~1-2% per stage, ~0.6% total) but consistently the same direction across 3 invocations and pass the byte-exact WAV gate against §3.21 HEAD. Section also documents the M3 Ultra "compute floor" finding: each CFM step is ~52 ms steady-state of mat-mul work (480 nodes/step, already amortised in one command-buffer commit), so allocator shaving can only chip at the 20 ms first-step build overhead, not the steady-state. Identifies F32 mul_mm + add(bias) shader fusion as the next worthwhile attack — deferred to a future round. Made-with: Cursor
Multilingual T3 step path on Metal previously dispatched three separate Q4_0 mat-muls per layer for the Q / K / V projections (30 layers × 84 tokens × 3 = 7560 dispatches per call inside the same compute_graph commit). All three weights have identical Q4_0 shape (n_embd=1024 in × n_embd=1024 out), so they can be concatenated along the output (M) dim into one [n_embd, 3*n_embd] Q4_0 tensor and fed into a single mat-mul whose output's first n_embd rows are Q, middle are K, last are V.

Implementation:
- `chatterbox_model` gains `ctx_stack` + `buffer_stack` (a backend buffer alongside `buffer_w` and `buffer_kv`).
- `llama_layer` gains `wqkv : [n_embd, 3*n_embd]`. Allocated only on non-CPU backends (CPU keeps the per-projection path; ggml-cpu's per-kernel overhead is already negligible and the +30 MB weight footprint trades unfavourably with thread-cache locality).
- At load time, after the GGUF weights buffer is allocated, the per-layer wq/wk/wv bytes are copied into wqkv via a host-side scratch buffer (Q4_0 row layout is M-major contiguous: row `r` is K/32 blocks of 18 bytes packed back-to-back, so concat is a flat byte append).
- `build_llama_block` runs ONE `ggml_mul_mat(W_qkv, cur)` and then carves out Q / K / V via strided `ggml_view_2d` / `_3d` views straight into the (HD, NH, N[, B]) shape RoPE expects. No ggml_reshape (would need contiguous source) and no ggml_cont (would defeat the saving). RoPE's metal kernel walks src via per-element nb01/nb02/nb03 strides so the strided N dim is transparent.

Process-wide `t3_stack_registry` + atexit hook in t3_mtl.cpp frees buffer_stack before Metal's static device destructors run; without this the new buffer triggers Metal's `[rsets->data count] == 0` assert at process exit (residency sets still referenced through an orphan backend buffer). Mirrors the existing `s3gen_model_cache_release` atexit pattern. `free_t3()` in main calls `t3_stack_unregister()` on the error-path early-returns so we don't double-free.

Why gate/up isn't stacked too: the multilingual T3 converter ships `mlp_gate` as F16 and `mlp_up` as Q4_0 (verified via gguf reader on models/chatterbox-t3-mtl-q4_0.gguf: gate.weight type=F16, up.weight type=Q4_0 for every layer). A single ggml_tensor can't hold mixed element widths, so the stack is gated on `wq->type == wk->type && wk->type == wv->type` and skipped for any layer that doesn't satisfy it.

Bench (M3 Ultra, Metal, ES prompt + jfk.wav voice, seed 42, --temp 0 --top-k 1, mean of 5 invocations):

| Variant | T3 §3.22 base | T3 this commit |         Δ |
|---------|--------------:|---------------:|----------:|
| Q4_0    |        474 ms |     **433 ms** | **-8.7%** |
| F16     |        522 ms |     **493 ms** | **-5.5%** |

Per-stage breakdown for Q4_0 (one run):
  encoder   30 ms (unchanged)
  cfm_total 549 ms (unchanged)
  hift      125 ms (unchanged)
  T3        433 ms (-46 ms vs baseline 479 ms)
  Total     1153 ms (vs 1209 ms baseline, -4.6%)

WAV byte-exact gate: md5 `79002f09bc48dda95ec0c2cfc2b895bd` matches across §3.22 base and post-commit at five separate invocations (temp=0, top-k=1, deterministic).

Made-with: Cursor
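The "flat byte append" claim follows from the Q4_0 row size, and can be checked with a small sketch. The weight blobs below are dummy byte strings, not real tensor data, and `proj_slice` is a hypothetical helper; the block size (32 elements, 18 bytes) and n_embd=1024 are the values quoted in the commit message.

```python
QK4_0 = 32              # elements per Q4_0 block
Q4_0_BLOCK_BYTES = 18   # 2-byte scale + 16 bytes of packed 4-bit values
N_EMBD = 1024

row_bytes = N_EMBD // QK4_0 * Q4_0_BLOCK_BYTES   # bytes in one output row
proj_bytes = N_EMBD * row_bytes                  # one n_embd x n_embd projection

# Dummy stand-ins for the raw wq / wk / wv byte buffers.
wq = bytes([1]) * proj_bytes
wk = bytes([2]) * proj_bytes
wv = bytes([3]) * proj_bytes

# Rows are contiguous, so stacking along the output dim is a plain append.
wqkv = wq + wk + wv


def proj_slice(index: int) -> slice:
    """Byte range of Q (0), K (1) or V (2) inside the stacked buffer."""
    return slice(index * proj_bytes, (index + 1) * proj_bytes)


dispatches_before = 30 * 84 * 3   # one mat-mul per projection
dispatches_after = 30 * 84        # one stacked mat-mul per layer and token
print(row_bytes, dispatches_before, dispatches_after)
```

The same slicing idea applies to the mat-mul output, where the real code uses strided views instead of byte ranges.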
PROGRESS gets a new §3.23 documenting the Phase-15 stacked W_qkv optimisation shipped in 1f43ecc. Key contents:
- Implementation: Q/K/V rows concatenated row-wise into one Q4_0 tensor at load time; build_llama_block runs ONE mat-mul + strided view-split into the (HD, NH, N[, B]) layout RoPE expects, no reshape (would need contiguous source) and no cont (would defeat the saving).
- Why gate/up isn't stacked: the multilingual T3 ships mlp_gate as F16 and mlp_up as Q4_0, so a single ggml_tensor can't hold them.
- Why CFM transformer Q/K/V isn't stacked: documented as a negative result with the GPU-occupancy explanation. CFM has M=512 / N=174 which already saturates M3 Ultra's 60 cores in one wave; the stacked M=1536 / N=174 path needs three waves where one was enough. T3 wins because its step graph has M=1024 / N=1, which was at ~25% GPU occupancy un-stacked.

Bench (M3 Ultra, Q4_0): T3 474 → 433 ms (-8.7%), Total 1192 → 1153 ms (-3.3%), WAV byte-exact md5 79002f09bc48dda95ec0c2cfc2b895bd.

Cumulative since §3.20 baseline: T3 872 → 433 ms (-50%), RTF 0.46 → 0.29.

Made-with: Cursor
…f.py
Multilingual S3Gen ships HiFT as 246 F32 tensors (~80 MB) because
the converter and `requantize-gguf.py` both wholesale-rejected 3-D
shapes — `len(shape) != 2` always returned False in
`should_quantize`. This commit unblocks the rejection on both
fronts:
1. **`should_quantize` now allows 3-D when ne[0] (the conv kernel
size K) is a multiple of the quant block size.** For Q4_0 /
Q5_0 / Q8_0 (block 32) the HiFT stack still gates out
universally — K ∈ {3, 7, 11, 16}, none 32-aligned, and the
block layout assumes blocks span 32 consecutive ne[0] values
within a fixed (ne[1], ne[2]) row, which only works when K
itself is 32-aligned. The plan's predicted-positive
`K * IC % 32 == 0` check is necessary but not sufficient
because re-quantising with a flattened (K*IC) reduction dim
requires storing the result with ggml shape `(K*IC, OC)` —
i.e. 2-D on disk — which then breaks `ggml_im2col(kernel, ...)`
on the C++ side, which derives the kernel size from
`kernel->ne[0]`. Documented inline; the 3-D branch is
forward-compatible for any future converter that ships
K-aligned conv kernels.
2. **Add `f16` as a target dtype.** F16 has block_size = 1, so
the alignment gate is a no-op for any shape. Combined with
a new `--name-filter SUBSTRING` arg that constrains the
rewrite to a tensor-name substring, this lets us downcast HiFT
conv kernels F32 → F16 without disturbing the existing Q4_0
CFM linears.
3. **Pass-through path for already-quantised sources.** A Q4_0
source GGUF was previously a hard error in the pass-through
branch (the existing code reshapes `data` to the element shape
and Q-types have packed bytes). Now branches on
`GGML_QUANT_SIZES[t.tensor_type][0] == 1` (the float-types
block) and writes the raw byte buffer through with the
original shape for already-quantised inputs.
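The alignment rule from items 1 and 2 can be sketched as a small predicate. This is not the script's actual `should_quantize`; the function name `alignment_ok` and the channel counts (256) are assumptions for illustration, while the block sizes and the HiFT kernel sizes come from the text above.

```python
# Block sizes in elements; F16 is treated as block size 1, as stated above.
BLOCK_SIZE = {"q4_0": 32, "q5_0": 32, "q8_0": 32, "f16": 1}


def alignment_ok(shape_ne, target: str) -> bool:
    """Hypothetical gate: 2-D or 3-D, and ne[0] a whole number of quant blocks."""
    if len(shape_ne) not in (2, 3):
        return False
    return shape_ne[0] % BLOCK_SIZE[target] == 0


HIFT_KERNEL_SIZES = (3, 7, 11, 16)   # K values listed in the commit message

# ggml order: ne[0] is the conv kernel size K; channel counts are made up.
q4_ok = [alignment_ok((k, 256, 256), "q4_0") for k in HIFT_KERNEL_SIZES]
f16_ok = [alignment_ok((k, 256, 256), "f16") for k in HIFT_KERNEL_SIZES]
print(q4_ok, f16_ok)
```

Under these assumptions every HiFT conv kernel fails the block-32 gate and passes the F16 one, which matches the outcome the commit reports.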
Two-pass recipe (multilingual S3Gen, Metal target):
python scripts/requantize-gguf.py \
models/chatterbox-s3gen-mtl-f16.gguf \
/tmp/intermediate.gguf f16 --name-filter hift/
python scripts/requantize-gguf.py \
/tmp/intermediate.gguf \
models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0
Result on M3 Ultra Metal (3 invocations, ES prompt, --seed 42):
baseline q4_0 GGUF (HiFT all F32):
GGUF size 788.4 MB
[hift_decode] median 124.9 ms
[s3gen_total] median 727 ms
WAV md5 79002f09bc48dda95ec0c2cfc2b895bd
q4_0 + HiFT F16 GGUF (this commit's recipe; 64 of 246 HiFT
tensors get F16, the remaining 21 source_resblocks/* + 161
biases/scalars stay F32):
GGUF size 754.6 MB (-4.3 %)
[hift_decode] median 121.3 ms (-2.9 %)
[s3gen_total] median 726 ms (within noise)
PCM cosine vs baseline 0.999851 (essentially indistinguishable)
The 21 source_resblocks/* tensors that match the existing `/s`
substring deny-list (intentional false positive from the Turbo
deny-list era) are kept F32: tried unblocking them and the
multilingual decode segfaults on
`kernel_mul_mv_f32_f16_short`-not-found in the pinned ggml-metal
build. Documented inline next to the `/s` entry; refining the
deny-list to endswith-only is gated on either patching that
kernel variant in or reshaping those tensors to a non-mat_mv
shape.
Validation:
- chatterbox CLI on the new GGUF produces deterministic WAV
(md5 ec58d3e65ab8e9c6f4edefb15b169ea5 every run).
- PCM cosine 0.999851 vs the F32-HiFT baseline, well above the
§3.20 cos ≥ 0.99 quality gate; max abs i16 diff is
616 / 32768 ≈ 1.9 %.
- 3 × 3-invocation benches (baseline before vs after) confirm
-2.9 % HiFT decode median, -4.3 % GGUF size, no T3 / CFM /
S3Gen-total impact.
Made-with: Cursor
PROGRESS gets a new §3.24 documenting both the Q4_0 negative
finding (K-dim alignment is structurally blocked: HiFT K ∈ {3, 7,
11, 16}, none 32-aligned, and the 2-D-on-disk fix breaks
ggml_im2col on the C++ side) and the F16 alternate path that ships:
script + recipe + bench + quality gate.
Headline numbers (M3 Ultra Metal, ES prompt, 3 invocations):
GGUF size 788.4 MB → 754.6 MB (−4.3%)
[hift_decode] 124.9 ms → 121.3 ms (−2.9%)
PCM cosine 0.999851 across runs (well above 0.99 gate)
The smaller-than-planned win has two causes:
- 21 of the 246 HiFT tensors (source_downs/*, source_resblocks/*)
trip an existing `/s` deny-list false-positive; refining the
deny-list segfaults at runtime (kernel_mul_mv_f32_f16_short
isn't compiled in the pinned ggml-metal). Tracked as follow-up.
- Activation traffic + im2col stay F32; F16 weights only cut the
weight-load phase of mul_mat, not the whole decode.
Made-with: Cursor
Per the plan, the stretch goal was the F32 mul_mm + add(bias) shader fusion in the metal patch. Documented as a tracked follow-up in §3.24 alongside the other two HiFT-related leftovers (missing kernel variant, 2-D-on-disk Q4_0 with C++ conv1d_f32 branch) — the F16 alt-path (which actually shipped today) was the cheaper and more immediately measurable win, so the F32 fusion gets time-budgeted in a future session. Made-with: Cursor
…_attn_ext
Tried flipping `src/chatterbox_tts.cpp::conformer_block()` (the 10 rel-pos
Conformer blocks that make up S3Gen's flow encoder) from the classic
`ggml_soft_max` + separate V mat-mul path to `ggml_flash_attn_ext`,
following the same pattern already used in `t3_mtl.cpp` (T3 Llama
attention) / `chatterbox_tts.cpp::basic_tfm` (CFM transformer) and just
shipped on `parakeet.cpp` §15.8 (their Conformer rel-pos MHA).
Bench (M3 Ultra, Metal, Q4_0, Spanish prompt "Hola mundo, esta es una
prueba multilingue.", seed 42, 3 invocations averaged):
stage | baseline | FA | delta
-----------------|---------:|------:|----------------:
[encoder] ms | ~43 | 29.6 | -13 / -31% (flow encoder only)
S3Gen ms | 721 | 708 | -13 / -1.8%
T3 ms | 433 | 430 | noise
CFM total ms | 546 | 538 | noise (-8)
HiFT decode ms | 126 | 125 | noise
WAV md5 | 79002f09 | a4169d68 | differs
The speedup is real — 40 saved kernel dispatches per synth from
collapsing 10 * (softmax + permute + mul_mat with V) into 10 *
flash_attn_ext — but the WAV quality gate fails.
`ggml_flash_attn_ext` hard-requires an f16 mask
(`ggml.c:5320 GGML_ASSERT(mask->type == GGML_TYPE_F16)`). The Conformer
rel-pos bias `bd_final = mul_mat(p_perm, q_plus_v)` is f32; casting it
to f16 drifts each element by ~1e-4 (f16 has ~10 bits of mantissa,
`bd_final` values sit in the ±5 to ±10 range). That drift compounds
through:
flow encoder (10 conformer blocks) -> CFM 10-step diffusion U-Net
-> HiFT vocoder -> waveform
which is far more drift-amplifying than parakeet's joint-argmax
downstream. Measured WAV:
lengths base=83520 fa=83520
samples n=83520 cos=0.998647
rms_diff=69.334 rms_base=1332.522
max_abs_diff=1702.0
gate: FAIL (threshold > 0.9998; got 0.998647)
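The three numbers in the measurement above (cosine, RMS difference, max absolute difference) can be computed as in the sketch below. It is an assumption about how such a gate might be implemented, not the project's actual comparison tool; the function name `wav_parity` is made up, and only the 0.9998 threshold is taken from the text.

```python
import math


def wav_parity(base, test, threshold=0.9998):
    """Cosine similarity, RMS difference and max abs difference of two PCM lists."""
    if len(base) != len(test):
        raise ValueError("length mismatch")
    dot = sum(b * t for b, t in zip(base, test))
    norm_b = math.sqrt(sum(b * b for b in base))
    norm_t = math.sqrt(sum(t * t for t in test))
    cos = dot / (norm_b * norm_t)
    diffs = [t - b for b, t in zip(base, test)]
    rms_diff = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    max_abs = max(abs(d) for d in diffs)
    return {"cos": cos, "rms_diff": rms_diff, "max_abs_diff": max_abs,
            "pass": cos > threshold}


base = [1000, -2000, 1500, -500]               # toy i16-style samples
same = wav_parity(base, base)                  # identical signals
flipped = wav_parity(base, [1000, -2000, 1500, 500])   # one sample sign-flipped
print(same["pass"], flipped["pass"])
```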
Three rescue options explored and rejected:
1. Pass bd_scaled in f32 via ggml_flash_attn_ext — blocked by the
hard f16 assertion.
2. Compute bd_final in f16 from the start (cast p_perm and
q_plus_v) — pushes the same precision loss earlier in the
graph, doesn't improve the downstream cosine.
3. Skip the mask (pass nullptr) — mathematically wrong; bd_final
*is* the relative-position bias that Conformer attention
requires, dropping it breaks position-aware attention.
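The per-element precision loss of the f32 → f16 mask cast can be gauged outside ggml. The sketch below uses Python's `struct` half-precision format as a stand-in for the cast (assuming round-to-nearest); the sample values are made up to fall in the ±5 to ±10 range mentioned above and are not taken from a real `bd_final` tensor.

```python
import struct


def to_f16_and_back(x: float) -> float:
    """Round a Python float to IEEE half precision and widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


samples = [5.0013, -7.3121, 9.8765, -6.0004]   # illustrative values only
for v in samples:
    r = to_f16_and_back(v)
    abs_err = abs(r - v)
    print(f"{v:+.4f} -> {r:+.4f}  abs_err={abs_err:.2e}  rel_err={abs_err / abs(v):.2e}")

# For normal half-precision values, round-to-nearest keeps the relative
# error within 2**-11.
max_rel_err = max(abs(to_f16_and_back(v) - v) / abs(v) for v in samples)
```

Each element's error is small on its own; the failure described above comes from that error compounding through the encoder, CFM and vocoder stages.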
Reverted: conformer_block stays on the ggml_soft_max path. WAV md5
restored byte-exact to 79002f09bc48dda95ec0c2cfc2b895bd against the
§3.22 reference (verified /tmp/cb_revert.wav == /tmp/cb_base_1.wav).
Code change is additive (documentation-only): 16-line comment in
`conformer_block()` explaining why flash_attn is intentionally not
taken here, pinning the negative-finding cosine number + the speed
upside that was measured, and pointing at the parakeet §15.8
counterexample so the next person who looks at this doesn't redo
the experiment.
PROGRESS §3.25 adds the full writeup with the measurement table,
the reasoning for why parakeet could absorb this precision drift
but chatterbox can't (argmax-vs-waveform downstream sensitivity),
the three rejected rescue options, and the pointer to the two
remaining quick-win candidates that don't have this compounding
problem:
- strip redundant `ggml_cont` after Conformer Q/K/V permutes
(some are removable because Metal mul_mat walks strides natively)
- F32 `mul_mm + add(bias)` shader fusion in
patches/ggml-metal-chatterbox-ops.patch (already queued from §3.24)
Made-with: Cursor
…ants; relax /s deny; fix Q-type passthrough
Closes the open follow-up from §3.24 / §3.25 ("Patch the missing
`kernel_mul_mv_f32_f16_short` variant to unblock the remaining 21
HiFT source_* conv kernels").
Problem. §3.24 converted 64 of HiFT's F32 conv-kernel weights to
F16 but kept the 21 `source_*` conv kernels at F32 because
requantize-gguf.py's `/s` glob swept them, and — once that glob
was narrowed — the resulting Metal build crashed:
ggml_metal_library_compile_pipeline: Error Domain=MTLLibraryErrorDomain
Code=5 "Function kernel_mul_mv_f32_f16_short was not found in the library"
... SIGSEGV at first HiFT decode (exit 139).
HiFT's `conv1d_f32` does
`ggml_mul_mat(im2col_f32, kernel_reshaped_f16)`, which the Metal
dispatcher compiles as `kernel_mul_mv_f32_f16_short` (T0=src0=
im2col=f32, T1=src1=kernel=f16, short-axis path because
`source_downs/2/weight` has OC=64). That template instantiation
was missing from the pinned ggml-metal (commit 58c38058) — the
family shipped `f32_f32`, `f16_f32`, `f16_f16`, `bf16_f32`,
`bf16_bf16` but not `f32_f16`.
Fix, three one-liner template instantiations in ggml-metal.metal:
// kernel_mul_mv_t_t family (full-shape mat-vec)
template [[host_name("kernel_mul_mv_f32_f16")]] kernel mul_mv_t_t
kernel_mul_mv_t_t<float, half>;
// kernel_mul_mv_t_t_4 family (vec4 path)
template [[host_name("kernel_mul_mv_f32_f16_4")]] kernel mul_mv_t_t_4
kernel_mul_mv_t_t_4<float, float4, half, half4>;
// kernel_mul_mv_t_t_short family (short-axis path — HiFT's actual
// hit, since source_downs/2/weight has OC=64 which trips the
// "small matrix-vector" fast path)
template [[host_name("kernel_mul_mv_f32_f16_short")]] kernel mul_mv_t_t_short_t
kernel_mul_mv_t_t_short<float, half>;
The `_impl` bodies already handle arbitrary casts via
`(float) x[i] * (float) y[i]` — only the symbol lookup was missing.
requantize-gguf.py gets three paired changes so the recipe works
end-to-end on the current gguf-0.18 writer:
1. `/s` deny narrowed to `/scale`. The broad `/s` glob was
originally a rough proxy for norm-scale params but
incidentally swept every `hift/source_*/` weight + bias
(60+ HiFT tensors, 21 of which are 3-D conv kernels that
§3.26 now unblocks). With the Metal kernels shipped, those
21 conv weights are safe to F16. The norm-scale tensors
the deny was targeting (`/scale`, `/ln_`, `/norm/`,
`/gamma`) remain covered by their own stricter patterns.
2. Q-type passthrough byte-shape fix. gguf-0.18's
`add_tensor_info` treats `raw_shape` as **byte** layout (inner
dim is bytes/row, not elements/row) when tensor.dtype is
uint8. The previous code passed element shape verbatim,
which crashed with
`ValueError: Quantized tensor bytes per row (512) is not a
multiple of Q4_0 type size (18)` on any two-pass recipe
that re-quantised a GGUF already carrying Q-type tensors
(e.g. `f16 → q4_0` or `q4_0 → f16 --name-filter`). Fix:
`byte_inner = elements_inner // block_size * type_size`
before handing to the writer. Floats (block_size=1) keep
the existing element-shape path unchanged.
3. Docstring updated with the now-correct two-pass recipe:
python scripts/requantize-gguf.py \
models/chatterbox-s3gen-mtl-f16.gguf \
/tmp/intermediate.gguf f16 --name-filter hift/
python scripts/requantize-gguf.py \
/tmp/intermediate.gguf \
models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0
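The byte-shape fix in item 2 is a one-line conversion, sketched here for the row length from the error message. The table below is a hand-written stand-in for gguf's `GGML_QUANT_SIZES` (block size in elements, type size in bytes) and the function name is hypothetical; float types are left out because they keep the element-shape path.

```python
# (block_size in elements, type_size in bytes) for a few quantised types.
QUANT_SIZES = {"Q4_0": (32, 18), "Q5_0": (32, 22), "Q8_0": (32, 34)}


def byte_inner(elements_inner: int, qtype: str) -> int:
    """Bytes per row for a quantised tensor with `elements_inner` elements per row."""
    block_size, type_size = QUANT_SIZES[qtype]
    if elements_inner % block_size != 0:
        raise ValueError("inner dim is not a whole number of blocks")
    return elements_inner // block_size * type_size


row_elems = 512
print(byte_inner(row_elems, "Q4_0"))   # bytes/row to hand to the writer
print(row_elems % 18)                  # non-zero: why passing 512 verbatim was rejected
```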
Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42,
3x3 runs averaged):
stage | §3.24 | §3.26 | delta
------------------|-------:|-------:|-------------:
[encoder] ms | 31.3 | 30.5 | -0.8 (noise)
[cfm_total] ms | 541.9 | 550.4 | noise
[hift_decode] ms | 121.3 | 121.1 | neutral
S3GEN_INFER_MS | 709 | 724 | +15 (noise)
T3_INFER_MS | 440 | 440 | 0
GGUF size | 754.4 | 746.7 | -7.7 MB
Speed neutral on M3 Ultra unified memory (the 21 source_* tensors
are small — largest is source_resblocks/0/convs1/*/weight at
~3.4 MB F32 / ~1.7 MB F16). Predicted +2–4 ms HiFT from §3.24
falls inside bench noise here; on bandwidth-limited targets
(M4 Air, iOS ANE-adjacent) expect the full 3–5 % HiFT speedup
seen in §3.24's existing 64 tensors. The real win is the
7.7 MB (~1.0 %) GGUF shrink on a multilingual-distribution
package, plus closing the last known §3.24 blocker.
Parity gates:
- test-metal-ops: all four pre-existing ops (diag_mask_inf,
pad_ext, conv_transpose_1d x3 + tiny edge) PASS; the new
mul_mv_f32_f16 variants are covered by end-to-end audio
parity (their inner math is identical to the shipped
<half, float> / <half, half> / <float, float> variants,
they differ only in the type tags).
- WAV parity vs §3.24 baseline on ES-prompt / jfk-voice / seed 42:
md5 §3.24 baseline : ec58d3e65ab8e9c6f4edefb15b169ea5
md5 §3.26 v2 (3 runs) : d8a1b22375dbcb2259c686426a7d76c5 x3
lengths 83520/83520 cos 1.000000 PASS (threshold > 0.9998)
rms_diff 0.464 rms_base 1332.66 max_abs_diff 4 (of ±32767)
→ 0.035 % relative RMS drift, 0.012 % max sample drift.
Auditorily identical. Deterministic across 3 invocations.
Three §3.24 follow-ups previously tracked:
- ~~kernel_mul_mv_f32_f16_short patch~~: ✓ shipped this section.
- Q4_0 HiFT via 2-D-on-disk + `conv1d_f32` branch: still deferred
(larger surgery across converter + C++).
- F32 mul_mm + add(bias) shader fusion: still deferred
(~150 LOC Metal kernel work for +10–25 ms S3Gen).
Made-with: Cursor
…l-metal
Closes the §3.22 / §3.24 / §3.26 follow-up "F32 mul_mm + add(bias) shader
fusion in patches/ggml-metal-chatterbox-ops.patch". The fusion in the
pinned ggml-metal only covered Q-variant mul_mv (matrix-vector)
kernels via helper_mv_add_bias — the mul_mm (matrix-matrix) kernel
that the CFM transformer actually hits at T·B ≥ 2 had no equivalent.
This patch wires one in.
Three pieces:
1. kernel_mul_mm in ggml-metal.metal gets:
FC_mul_mm_has_bias_ [[function_constant(FC_MUL_MM + 2)]]
FC_mul_mm_has_residual_ [[function_constant(FC_MUL_MM + 3)]]
device const char * bias [[buffer(4)]]
device const char * residual [[buffer(5)]]
When either FC is true the kernel routes through the shmem-
backed scalar-copy path and folds bias (broadcast over [ne0])
and/or residual (same shape + stride as dst) into the copy
loop. Same post-matmul math as helper_mv_add_bias. Compiler
drops the un-selected branch — zero overhead when neither set.
2. get_pipeline_mul_mm in ggml-metal-device.{cpp,h}:
new has_bias / has_residual flags bake into pipeline name
(kernel_mul_mm_<T0>_<T1>_bci=X_bco=Y_bias=Z_res=W) and FC
values. Shmem bumped from 4 KB+2 KB to 8 KB when fused.
3. Dispatcher ggml_metal_op_mul_mat in ggml-metal-ops.cpp:
mirrors the Q-mul_mv fusion lookup. Tries
{MUL_MAT, ADD, ADD} first, falls back to {MUL_MAT, ADD}.
Handles both orderings of the residual add (ggml_add is
commutative; chatterbox's basic_tfm emits
ggml_add(x, attn_out) with residual x as src[0] and the
mul_mat+bias result as src[1]). Writes fused dst to
node(idx + n_fuse - 1) so the value lands where the skipped
ADDs would have written, returns n_fuse so the outer loop
skips them.
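The longest-pattern-first lookup in item 3 can be sketched on a flat op list. This is a simplification with assumed names (`match_fusion`, `count_dispatches`): the real can_fuse check also verifies data dependencies between the nodes, which the sketch leaves out.

```python
def match_fusion(ops, idx):
    """Return how many nodes starting at ops[idx] fold into one mul_mm dispatch."""
    patterns = (
        ("MUL_MAT", "ADD", "ADD"),   # bias + residual, tried first
        ("MUL_MAT", "ADD"),          # bias-only fallback
    )
    for pattern in patterns:
        if tuple(ops[idx:idx + len(pattern)]) == pattern:
            return len(pattern)
    return 1   # no fusion: the mul_mat is dispatched on its own


def count_dispatches(ops):
    idx, dispatches = 0, 0
    while idx < len(ops):
        n_fuse = match_fusion(ops, idx) if ops[idx] == "MUL_MAT" else 1
        idx += n_fuse          # the outer loop skips the ADD nodes that were fused
        dispatches += 1
    return dispatches


graph = ["NORM", "MUL_MAT", "ADD", "GELU_ERF", "MUL_MAT", "ADD", "ADD"]
print(count_dispatches(graph), "dispatches for", len(graph), "nodes")
```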
Kernel variants verified via compile_pipeline trace on M3 Ultra:
kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0 ← CFM tfm linears
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0 ← edge blocks
kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0 ← time_mlp etc.
kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0 ← unfused passthroughs
kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0
~1820 saved ggml_add dispatches per synthesis. No `res=1` variants
fire in the current chatterbox graph — the ADD(residual) in
basic_tfm is separated from the ADD(bias) by layer_norm + mul_mat
+ add(bias) + gelu_erf + mul_mat + add(bias), so can_fuse correctly
refuses to span those nodes. Infrastructure in place either way
for consumers whose residual is adjacent to mul_mat.
Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42, 5
invocations):
stage | §3.26 baseline | §3.27 fused | delta
------------------|---------------:|------------:|----------:
[encoder] ms | 31.3 | 30.5 | noise
[cfm_total] ms | 541.9 | 542.2 | neutral
[hift_decode] ms | 121.3 | 121.2 | neutral
S3GEN_INFER_MS | 709 | 713.2 | +4 noise
T3_INFER_MS | 440 | 433.4 | -7 noise
WAV md5 byte-exact across 5 runs: d8a1b22375dbcb2259c686426a7d76c5
(matches §3.26 v2 baseline). test-metal-ops PASSes on all four
existing ops. GGML_METAL_FUSION_DISABLE=1 cross-check: CFM 568.9 ms
steady, confirming the entire fusion system (mine + pre-existing
norm+mul+add + Q-mul_mv+bias) is worth ~27 ms aggregate; my mul_mm
contribution is the small end of that.
Ships as **neutral on M3 Ultra** for two reasons:
1. M3U's Metal per-dispatch overhead is ~20–30 µs; 1820 dispatches
* ~25 µs = ~45 ms theoretical, but many overlap with subsequent
kernels via Metal's command buffer and don't sit on the
critical path.
2. Forcing has_bias=true through the shmem path (instead of
direct-store + post-barrier bias-add — too complex to retrofit
into both tensor-API + simdgroup-fallback paths in one session)
costs roughly what the dispatch savings deliver.
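The arithmetic behind reason 1 can be laid out next to the fusion-disabled cross-check quoted earlier. All inputs below are numbers from this commit message; nothing is newly measured.

```python
saved_dispatches = 1820
per_dispatch_us = (20, 25, 30)    # low / mid / high estimate of dispatch overhead
theoretical_ms = [saved_dispatches * us / 1000 for us in per_dispatch_us]

cfm_fusion_off_ms = 568.9         # CFM with GGML_METAL_FUSION_DISABLE=1
cfm_fusion_on_ms = 542.2          # CFM with all fusions enabled (§3.27)
all_fusions_ms = cfm_fusion_off_ms - cfm_fusion_on_ms

print(theoretical_ms, round(all_fusions_ms, 1))
```

The theoretical saving for the mul_mm fusion alone is larger than the measured gain from every fusion combined, which is consistent with the claim that most of the saved dispatches were not on the critical path.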
Still shipping because:
- Byte-exact + test-metal-ops PASS (correctness guaranteed).
- M4 Air / iPhone / iPad have proportionally higher per-dispatch
overhead and lower core counts — expected +5–15 ms S3Gen win
there (same ratio §3.24's HiFT F16 result predicted on M4).
- Mode 2/3 streaming synthesises short chunks where per-chunk
dispatch count matters more relative to compute — fusion
should be proportionally larger there.
- Infrastructure leverage: FC_MUL_MM + 2/+3 slots + helper
routing are plumbing future sessions will reuse (extend to
mul_mm_id for MoE, extend to F16 weights, reclaim the
shmem-roundtrip cost on M3U by patching the direct-store
paths).
PROGRESS §3.27 has the full bench table + rationale + what's next
(direct-store fold-in ~2–3 h, mul_mm_id, M4 validation).
patches/ggml-metal-chatterbox-ops.patch 733 -> 995 lines, +262,
regenerated from pinned ggml 58c38058, applies cleanly via
`git apply --check`.
Made-with: Cursor
…ff0 path)
Builds directly on §3.27 infrastructure. Closes the
`mul_mat → add(bias) → gelu_erf` triple in CFM basic_tfm's FF gate
projection (src/chatterbox_tts.cpp:738):
ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff0_w, nx2), w.ff0_b);
ff = ggml_gelu_erf(ctx, ff);
ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff2_w, ff), w.ff2_b);
§3.27 brought mul_mat + add(bias) into a single dispatch via the
shmem-backed scalar-copy path; §3.28 extends that same loop to
apply gelu_erf as the last stage before writing to dst. The gelu
is inline FP math on each element we're already reading+writing —
no extra memory, no extra barriers, no extra shmem — so unlike
§3.27's neutral-on-M3U result, this one is a clear net positive.
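A CPU-side sketch of why the gelu fold-in adds no memory traffic: the fused tail touches each element once, where the unfused form makes two passes. The function names are illustrative, `std::erf` stands in for the kernel's `erf_approx<T>` helper, and the bias is assumed to broadcast along the contiguous output dimension.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// erf-based GELU: 0.5 * v * (1 + erf(v / sqrt(2))).
inline float gelu_erf(float v) {
    const float SQRT_2_INV = 0.70710678118654752440f;
    return 0.5f * v * (1.0f + std::erf(v * SQRT_2_INV));
}

// Unfused reference: one full pass for add(bias), a second for gelu_erf.
std::vector<float> bias_gelu_unfused(std::vector<float> x,
                                     const std::vector<float>& bias) {
    const std::size_t n = bias.size();
    assert(n > 0 && x.size() % n == 0);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += bias[i % n];
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = gelu_erf(x[i]);
    return x;
}

// Fused tail: bias add and gelu_erf applied in the same per-element step,
// so each value is read once and written once.
std::vector<float> bias_gelu_fused(std::vector<float> x,
                                   const std::vector<float>& bias) {
    const std::size_t n = bias.size();
    assert(n > 0 && x.size() % n == 0);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = gelu_erf(x[i] + bias[i % n]);
    return x;
}
```

Both functions perform the same float operations in the same order per element, which is the property the byte-exact WAV check relies on in the real kernel.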
Three pieces:
1. ggml-metal.metal: FC_MUL_MM + 4 = FC_mul_mm_has_gelu_erf_.
New branch at the end of the scalar-copy loop applying
`0.5 * v * (1 + erf_approx(v * SQRT_2_INV))` — the same formula
OP_UNARY_NUM_GELU_ERF uses, using the shared erf_approx<T>
helper. Numerically identical to the unfused path.
2. get_pipeline_mul_mm: signature bumped to
(op, has_bias, has_residual, has_gelu_erf); pipeline name
extended with _gelu=N; shmem sizing gated on any-fold-in flag.
3. Dispatcher mul_mm path: new {MUL_MAT, ADD, UNARY} can_fuse
lookup wedged between the {MUL_MAT, ADD, ADD} residual lookup
and the {MUL_MAT, ADD} bias-only fallback. Verifies
ggml_get_unary_op(f2) == GGML_UNARY_OP_GELU_ERF and
f2->src[0] == f1 before fusing. Gated to GELU_ERF specifically
(basic_tfm's only unary); other sub-ops (SILU/GELU/RELU/...)
extend trivially but aren't needed for chatterbox.
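The lookup order in piece 3 can be illustrated with a toy matcher. This is a sketch under stated assumptions: `Node`, `Op`, `Fuse` and `classify` are invented for illustration and are not ggml's `ggml_tensor` or its `can_fuse` API, and only `src0` is checked where the real code may inspect more.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { MulMat, Add, GeluErf, Silu };
enum class Fuse { None, Bias, BiasResidual, BiasGeluErf };

// Toy graph node: the op plus the index of its first input
// (-1 when that input is a graph leaf such as a weight).
struct Node { Op op; int src0; };

// Decide which fused mul_mm variant (if any) starts at node i. Three-op
// patterns are tried first; bias-only is the fallback.
Fuse classify(const std::vector<Node>& g, std::size_t i) {
    if (i + 1 >= g.size() || g[i].op != Op::MulMat) return Fuse::None;
    const Node& f1 = g[i + 1];
    if (f1.op != Op::Add || f1.src0 != static_cast<int>(i)) return Fuse::None;
    if (i + 2 < g.size()) {
        const Node& f2 = g[i + 2];
        // The third op only fuses when it consumes the bias-add result.
        const bool consumes_f1 = f2.src0 == static_cast<int>(i + 1);
        if (f2.op == Op::Add && consumes_f1) return Fuse::BiasResidual;
        if (f2.op == Op::GeluErf && consumes_f1) return Fuse::BiasGeluErf;
    }
    return Fuse::Bias;  // other unary ops (e.g. Silu) are left unfused
}
```

A unary op other than `GeluErf` drops through to the bias-only case, which corresponds to the gating described above.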
Kernel variants actually compiled (from GGML_LOG_DEBUG trace):
kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=1 ← CFM ff0
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=1 ← ff0 edge
kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=0 ← ff2/to_out
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=0
kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0_gelu=0 ← time_mlp
kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0_gelu=0
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0_gelu=0 ← unfused
kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0_gelu=0
gelu=1 variants handle 56 basic_tfm × 10 CFM steps × 2 CFG batches
= 1120 saved gelu_erf dispatches per synth, on top of the 1820
bias-add dispatches saved in §3.27.
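Each flag combination in the list above is a separately compiled pipeline, keyed by its name. A minimal sketch of how such a key could be assembled follows; `mul_mm_pipeline_name` is a hypothetical helper and not the real `get_pipeline_mul_mm`, and the flag order is copied from the names in the trace.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a pipeline key with the same shape as the
// kernel variant names logged above.
std::string mul_mm_pipeline_name(const std::string& src0_type,
                                 const std::string& src1_type,
                                 bool bci, bool bco,
                                 bool has_bias, bool has_residual,
                                 bool has_gelu_erf) {
    // Assumption of this sketch: gelu_erf is only folded in on top of a
    // bias add, matching the {MUL_MAT, ADD, UNARY} pattern.
    assert(!has_gelu_erf || has_bias);
    return "kernel_mul_mm_" + src0_type + "_" + src1_type +
           "_bci="  + std::to_string(int(bci)) +
           "_bco="  + std::to_string(int(bco)) +
           "_bias=" + std::to_string(int(has_bias)) +
           "_res="  + std::to_string(int(has_residual)) +
           "_gelu=" + std::to_string(int(has_gelu_erf));
}
```

For example, `mul_mm_pipeline_name("q4_0", "f32", false, false, true, false, true)` yields the CFM ff0 variant name from the list.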
Bench (M3 Ultra, Metal, Q4_0 + HiFT F16, ES prompt, seed 42,
5 invocations):
stage | §3.27 baseline | §3.28 this | delta
------------------|-----------------:|---------------:|-------------:
[encoder] ms | 30.5 | 30.8 | noise
[cfm_total] ms | 542.2 | 533.4 ± 1.0 | -8.8 / -1.6 %
[hift_decode] ms | 121.2 | 120.8 | neutral
S3GEN_INFER_MS | 713.2 | 706.0 ± 0.8 | -7.2 / -1.0 %
T3_INFER_MS | 433.4 | 431.0 | noise
WAV md5 | d8a1b22… | d8a1b22… | byte-exact x5
test-metal-ops: all 4 pre-existing ops PASS. WAV md5 byte-exact
to §3.26 / §3.27 baseline across all 5 invocations. Deterministic.
Why this lands positive on M3 Ultra (unlike §3.27):
§3.27's gain was eaten by the shmem-roundtrip cost: routing
through temp_str + sgitg==0 scalar copy costs ~= what the 1820
eliminated ggml_add dispatches saved. §3.28 adds the gelu fold-in
INTO the same loop — no additional memory accesses, no barriers,
no extra shmem — just a handful of FLOPs per element. So the
1120 saved gelu_erf dispatches show up as clean net positive.
This also refines the §3.27 story: the infrastructure we built
there is what makes §3.28 cheap. Fusing additional per-element
tail ops into the scalar-copy loop is essentially free, whereas
routing through the shmem path is what cost M3 Ultra its estimated
§3.27 win.
Patch file: 995 -> 1054 lines (+59). Applies cleanly on a fresh
ggml clone at pinned 58c38058 via `git apply --check`.
PROGRESS §3.28 has full writeup + bench table + kernel variant
list + next-step pointers (extend to other unary sub-ops trivially;
reclaim §3.27 shmem cost by patching direct-store paths).
Made-with: Cursor
…verted
Goal: recover the win §3.27 left on the table on M3 Ultra by doing the
bias / residual / gelu_erf fold-in as a post-barrier
read-modify-write on device memory, instead of routing through
the shmem + scalar-copy path. The shmem path §3.27 ships is
correct but costs a threadgroup-memory roundtrip that
roughly cancels out the dispatch savings on M3 Ultra. (§3.28 got
a clean 8.8 ms CFM reduction precisely because its gelu fold-in
happened *inside* a loop that was already going to run, at zero
extra memory cost.)
Implementation (reverted, left as a 21-line doc block in the
kernel): after `cT.store(tC)` or the 4-simdgroup
`simdgroup_store` loop writes the full 64x32 block to device
memory, use a threadgroup_barrier(mem_flags::mem_device) to
order the writes, then distribute the RMW across all 128
threads of the threadgroup (16 elements per thread, full block
coverage regardless of how cT.store's cooperative layout
partitioned the write).
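The intended work split (2048 block elements over 128 threads, 16 each) can be simulated on the CPU to check coverage. This sketch covers only the bias case and assumes the 64-row dimension is contiguous within the block, so that `idx % 64` picks the bias row. That layout is an assumption, not something read from the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK_ROWS  = 64;
constexpr std::size_t BLOCK_COLS  = 32;
constexpr std::size_t N_THREADS   = 128;
constexpr std::size_t BLOCK_ELEMS = BLOCK_ROWS * BLOCK_COLS;   // 2048
constexpr std::size_t PER_THREAD  = BLOCK_ELEMS / N_THREADS;   // 16

// CPU simulation of the post-barrier read-modify-write: "thread" tid handles
// elements tid, tid + 128, tid + 256, ... so that every element of the block
// is visited exactly once, whichever lane originally stored it.
void rmw_bias_block(std::vector<float>& block, const std::vector<float>& bias) {
    assert(block.size() == BLOCK_ELEMS && bias.size() == BLOCK_ROWS);
    for (std::size_t tid = 0; tid < N_THREADS; ++tid) {
        for (std::size_t k = 0; k < PER_THREAD; ++k) {
            const std::size_t idx = tid + k * N_THREADS;
            const std::size_t row = idx % BLOCK_ROWS;  // assumed layout
            block[idx] += bias[row];
        }
    }
}
```

A check of this kind validates the index arithmetic only. It says nothing about whether the device-memory barrier orders the cooperative store before the reads, which is the other open question.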
Result: test-metal-ops PASSed (diag_mask_inf, pad_ext,
conv_transpose_1d × 3 + tiny) but end-to-end chatterbox synth
produced wrong output:
metric | §3.28 baseline | §3.29 attempt
------------|--------------------------------------|------------------
md5 | d8a1b22375dbcb2259c686426a7d76c5 | 06ee1aaaa94a10d70eec2835d3da7dbf
T3 tokens | 84 | 70 (EOS 14 early)
audio_ms | 3480 | 2920
determinism | stable across 5 runs | stable (same wrong md5)
Deterministic but wrong — a systematic computation bug, not a
race. Suspected root causes (not isolated):
1. cT.store's cooperative layout is implementation-defined;
threadgroup_barrier(mem_device) may not be strong enough
to order its writes against subsequent device reads on
A17 / M3. Would need a `fence()` or `simdgroup_fence_t`
audit.
2. residual_ok's `ggml_are_same_shape(resi, mm)` check
doesn't distinguish contiguous from view-with-strides;
§3.27's scalar-copy path would mask a stride bug that
direct-store would reveal.
3. An off-by-one or wrong-stride in the RMW's index
arithmetic that test-metal-ops doesn't exercise.
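Suspected cause 2 is easy to demonstrate in isolation: two tensors can agree on shape while disagreeing on strides. The descriptor below is a toy and not `ggml_tensor`; `residual_direct_store_ok` is a hypothetical stricter guard, shown only to make the distinction concrete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy 2-D descriptor: ne = extents in elements, nb = strides in bytes.
struct View2D {
    std::int64_t ne[2];
    std::size_t  nb[2];
};

// Shape-only comparison: looks at extents and ignores strides entirely.
bool same_shape(const View2D& a, const View2D& b) {
    return a.ne[0] == b.ne[0] && a.ne[1] == b.ne[1];
}

// True only when F32 rows are densely packed one after another.
bool is_contiguous_f32(const View2D& t) {
    return t.nb[0] == sizeof(float) &&
           t.nb[1] == t.nb[0] * static_cast<std::size_t>(t.ne[0]);
}

// Hypothetical stricter guard for a path that indexes the residual with
// the same flat offsets as the matmul output.
bool residual_direct_store_ok(const View2D& resi, const View2D& mm) {
    return same_shape(resi, mm) && is_contiguous_f32(resi) && is_contiguous_f32(mm);
}
```

A 256x87 view cut out of a 512-wide parent passes `same_shape` against a dense 256x87 tensor but fails `is_contiguous_f32`, so flat indexing into it would read the wrong rows.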
What's missing: no per-shape unit test for fused
`mul_mm + add(bias)` vs unfused `mul_mat + add` reference.
test-metal-ops only covers diag_mask_inf, pad_ext,
conv_transpose_1d. §3.27 / §3.28 happen to be byte-exact
because their fold-in happens inside the scalar-copy loop
which is straightforward; §3.29's direct-store RMW has
subtler data flow that would benefit from explicit coverage.
Reverted. cb_rev.wav md5 byte-exact to §3.28 baseline
`d8a1b22375dbcb2259c686426a7d76c5`; T3 back to 84 tokens /
3480 ms. No code change from §3.28 beyond a 21-line
documentation block in the kernel explaining what was
attempted, what failed, and what the next person should try
first (add the unit test, then retry with bias-only scope).
Patch 1054 → 1070 lines (+16). Applies cleanly on pinned
ggml 58c38058.
PROGRESS §3.29 has the full writeup, including next-person
notes pointing at §5.7 of Apple's Metal Shading Language
Specification for cooperative-store barrier semantics.
Made-with: Cursor
…irect-store retry
Closes two §3.29 loose ends:
1. New harness in src/test_metal_ops.cpp:
test_mul_mm_fused(cpu, gpu, K, N, T, B, fuse_mode, label).
Builds the small graph add(mul_mat(W_q4_0, X_f32), bias) (and
optionally + gelu_erf), dispatches on both backends, compares
element-wise. On Metal ggml-metal's fusion detector collapses
this into a single kernel_mul_mm_..._bias=1_res=X_gelu=Y
dispatch; CPU is always the unfused triple. Any numerical drift
beyond 2e-2 absolute (4x the Q4_0-dequant-order CPU-vs-GPU noise
floor observed on K=256..1024 shapes) flags a real bug. This is
the test §3.29 should have had; would have caught the reverted
direct-store RMW's wrong-output regression in seconds instead
of forcing a full end-to-end chatterbox bench to surface it.
2. Bias-only direct-store retry: full-block writes with
has_bias && !has_residual && !has_gelu_erf now take the
direct-store path (cT.store / simdgroup_store → device memory),
followed by a threadgroup_barrier and a 128-thread post-scan
adding bias[r0 + row_off] to each of the 2048 block elements.
Residual / gelu fold-ins still route through shmem: §3.29's
negative finding on those paths stands, root cause unresolved
(suspect cT.store cooperative-layout × mem_flags::mem_device
barrier semantics on A17/M3; deeper audit required). This is
the minimum-scope slice of §3.29 that the new harness proves
byte-stable.
Harness coverage: 8 fused-mul_mm shape variants matching CFM
hot path:
[mul_mm_fused cfm-attn-qkv] OK K=256 N=256 T=87 B=2 bias
[mul_mm_fused cfm-attn-out] OK K=256 N=512 T=87 B=2 bias
[mul_mm_fused cfm-ff-gate-bias] OK K=256 N=1024 T=87 B=2 bias
[mul_mm_fused cfm-ff-gate-bias+gelu] OK K=256 N=1024 T=87 B=2 gelu
[mul_mm_fused cfm-ff-down] OK K=1024 N=256 T=87 B=2 bias
[mul_mm_fused cfm-b1] OK K=256 N=512 T=87 B=1 bias
[mul_mm_fused bco-bias] OK K=256 N=320 T=87 B=2 bias (bco=1 shmem path)
[mul_mm_fused bco-gelu] OK K=256 N=320 T=87 B=2 gelu (bco=1 shmem path)
All existing test-metal-ops cases (diag_mask_inf, pad_ext,
conv_transpose_1d × 3 + tiny) still pass.
Bias-only direct-store bench impact on M3 Ultra (5 invocations,
Q4_0 + HiFT F16, ES prompt, seed 42):
metric | §3.28 | §3.30 | delta
-------------------|-----------------:|-----------------:|--------:
[cfm_total] ms | 533.4 ± 1.0 | 534.0 ± 0.9 | noise
S3GEN_INFER_MS | 706.0 ± 0.8 | 706.2 ± 3.2 | noise
[hift_decode] ms | 121.2 | 121.8 | noise
Neutral on M3 Ultra (same as §3.27) because in chatterbox's
basic_tfm every mul_mat+bias has a follow-up op (residual or
gelu) that forces the 3-op fusion through the shmem path. The
2-op path §3.30 optimises only fires for tensors outside
basic_tfm (time_mlp / final_proj / resnet t_mlp) that don't
contribute measurable wall time.
WAV md5 byte-exact vs §3.28 baseline across all 5 runs:
d8a1b22375dbcb2259c686426a7d76c5. T3 84 tokens, audio 3480 ms.
The harness is the real deliverable. Any future attempt at the
residual / gelu direct-store paths now has a way to get fast
feedback on whether a change is correct before a full
chatterbox run.
Patch regenerated: 1070 -> 1088 lines (+18). Applies cleanly on
fresh ggml at pinned 58c38058.
PROGRESS §3.30 has the full writeup including the three
remaining deferred items: residual direct-store (needs barrier
audit), gelu direct-store (same), and extending fusion to other
unary sub-ops (SILU/GELU/RELU/GELU_QUICK). None of those are
chatterbox hot-path so deferred.
Made-with: Cursor
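The pass criterion the harness applies (maximum absolute difference within 2e-2 between the unfused CPU result and the fused GPU result) can be sketched in a few lines. The function names here are illustrative and are not taken from src/test_metal_ops.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two equally sized buffers.
float max_abs_diff(const std::vector<float>& ref, const std::vector<float>& got) {
    assert(ref.size() == got.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        worst = std::max(worst, std::fabs(ref[i] - got[i]));
    }
    return worst;
}

// Illustrative gate: reject NaNs explicitly (std::max would silently drop
// them), then require the drift to stay within the tolerance.
bool fused_matches_reference(const std::vector<float>& cpu_unfused,
                             const std::vector<float>& gpu_fused,
                             float tol = 2e-2f) {
    for (float v : gpu_fused) {
        if (std::isnan(v)) return false;
    }
    return max_abs_diff(cpu_unfused, gpu_fused) <= tol;
}
```

An absolute tolerance suits this case because the two backends dequantise Q4_0 in different orders, so small drift is expected even when the fusion is correct.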
…ion.sh
Closes the validation gap for §3.24 / §3.26 / §3.27 / §3.28 / §3.30.
All five are predicted positive on bandwidth-limited Apple silicon
(M4 / iPhone / iPad) but were measured only on M3 Ultra where
per-dispatch overhead is so low that the fusion wins largely cancel
out against kernel-path overhead. Two pieces:
1. iOS-arm64 cross-build portability:
cmake -S . -B build-ios \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_SYSROOT=iphoneos \
-DCMAKE_OSX_ARCHITECTURES=arm64 \
-DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
-DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
-DGGML_NATIVE=OFF -DGGML_BLAS=OFF -DGGML_ACCELERATE=OFF
Both libggml-metal.a and libtts-cpp.a compile clean for iOS
14.0+ arm64 on Xcode 16 / iOS 18.5 SDK. Structural proof that
§3.26's kernel_mul_mv_f32_f16{,_4,_short} + §3.27/§3.28/§3.30's
kernel_mul_mm FC-gated bias / gelu_erf fold-ins are iOS-portable
— none of the new kernel code uses macOS-only intrinsics.
Runtime validation still needs TestFlight / device provisioning;
this confirms no compile-time barrier to shipping.
2. scripts/bench-m4-validation.sh — self-contained 150-line bash
harness. Pipeline on any Apple-silicon Mac:
- scripts/setup-ggml.sh to apply Metal + OpenCL patches at
the pinned ggml commit
- build-metal Release build
- test-metal-ops asserts all 14 gates PASS (3 base + 3
conv_transpose_1d + 8 fused-mul_mm)
- 5 invocations of chatterbox on the Spanish-prompt baseline
- mean [encoder] / [cfm_total] / [hift_decode] / S3Gen / T3
- md5 determinism (all 5 runs identical) and byte-exactness
check vs M3 Ultra reference d8a1b22375dbcb2259c686426a7d76c5
- writes artifacts/bench/m4-validation.json with the full
host-vs-reference comparison + Δ% per stage
M3 Ultra reference numbers (post-§3.30, baked into the script):
cfm=534.0 ms s3gen=706.6 ms t3=432.6 ms hift=121.1 ms
Env vars: T3_GGUF, S3GEN_GGUF, REF_WAV, RUNS, OUT_DIR.
Self-smoke on M3 Ultra (expected: tiny deltas, byte-exact PASS):
stage M3 Ultra (ref) this host Δ vs M3U
[cfm_total] ms 534.0 533.7 -0.3 (-0.1%)
S3GEN_INFER_MS 706.6 707.4 +0.8 (+0.1%)
T3_INFER_MS 432.6 434.6 +2.0 (+0.5%)
[hift_decode] ms 121.1 123.1 +2.0 (+1.7%)
determinism: PASS (md5 d8a1b22375dbcb2259c686426a7d76c5 x5)
byte-exact vs M3 Ultra: PASS
All deltas within per-invocation stdev; script is ready to
scp + run on any M4 / M3 / M2 box. If M4 shows noticeably
smaller CFM than M3U (after accounting for M4's higher single-core
clock), §3.27/§3.28/§3.30's predicted-positive story is vindicated.
If M4 matches M3U or regresses, those sections should be revisited.
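The per-stage comparison and the determinism gate the script reports reduce to two small computations. The script itself is bash; the version below is a C++ sketch with invented names (`stage_delta`, `deterministic`) that only illustrates the arithmetic.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct StageDelta {
    double abs_ms;  // host minus reference, in milliseconds
    double pct;     // same difference as a percentage of the reference
};

// Delta of one stage against the baked-in reference timing.
StageDelta stage_delta(double ref_ms, double host_ms) {
    assert(ref_ms > 0.0);
    const double d = host_ms - ref_ms;
    return {d, 100.0 * d / ref_ms};
}

// Determinism gate: every run must have produced the same WAV hash.
bool deterministic(const std::vector<std::string>& md5s) {
    if (md5s.empty()) return false;
    for (const std::string& h : md5s) {
        if (h != md5s.front()) return false;
    }
    return true;
}
```

With the self-smoke numbers above, `stage_delta(534.0, 533.7)` gives about -0.3 ms, which rounds to the -0.1% shown in the table.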
What I can't do from this host:
- Run on an M4 Air or iOS device (no ssh access to the user's
M4 laptop, no iOS device provisioning on this box). Tailscale
shows the user's MacBook as reachable but port 22 timed out
(sshd not enabled / firewalled). The script is the best
hand-off.
PROGRESS §3.31 documents the iOS build procedure + harness usage
+ expected M4 output shape.
Made-with: Cursor
Wraps the Apr 30 – May 1 Metal optimisation pass on chatterbox.cpp
multilingual into a single closeout document.
SUMMARY-3.24-3.31.md (new, 120 lines):
- Per-section commit log with M3 Ultra deltas and GGUF-size deltas
- Parity guarantees (WAV byte-exact
d8a1b22375dbcb2259c686426a7d76c5, 14/14 test-metal-ops gates,
8 model pairs, streaming, long-text, patch portability,
iOS-arm64 cross-build)
- Open follow-ups with effort + expected gain + status
- Final bench of the shipping config (Q4_0 + HiFT F16 v2):
CFM 534.0±1.3 ms, S3Gen 706.6±4.5 ms, T3 432.6±2.2 ms,
inference ~1165 ms, RTF 0.33
- Reproduction commands
README.md "Performance" section gets a new subsection
"Multilingual (M3 Ultra, post §3.24–§3.31 Metal kernel portfolio)"
showing the 5-run numbers next to the existing §3.21 N=10 row:
stage | §3.21 → §3.28 | delta
------------|-------------------|------------------:
T3 infer | 482 ms → 433 ms | -49 ms / -10.2 %
S3Gen infer | 730 ms → 706 ms | -24 ms / -3.3 %
RTF | 0.35 → 0.33 |
Net across 8 commits (c47c776 → 0902381):
- 5 measurable: §3.24 (-3.6 ms HiFT), §3.26 (-7.7 MB GGUF),
§3.27 (infra), §3.28 (-8.8 ms CFM), §3.30 (harness infra)
- 3 docs/negative: §3.25 (FA neg), §3.29 (direct-store neg),
§3.31 (iOS portability + M4 handoff script)
M3 Ultra shipping config CFM drops 541.9 -> 534.0 ms (-1.5 %);
bandwidth-limited silicon (M4 / iPhone / iPad) predicted larger
win but unmeasured; scripts/bench-m4-validation.sh ready to
confirm when test host is available.
Made-with: Cursor
GustavoA1604
added a commit
that referenced
this pull request
May 4, 2026
…ng, decode)
Folds in five small follow-up fixes from gianni-cor's review of
PR #3 (gianni-cor/chatterbox.cpp). All are byte-equivalent for the
runtime audio path (md5 57cc80f27a122f03435fd05f47d1b3d2 unchanged
on the ES reference prompt + jfk/gianni voice + seed 42); the
changes target portability, error-message accuracy, and
self-documenting code.
1. scripts/dump-t3-mtl-reference.py: replace the hardcoded
/Users/gustavoefa/dev/chatterbox-ref/src sys.path entry with
${CHATTERBOX_REF_SRC} (default ../chatterbox-ref/src relative to
the repo root, matching the layout the README prescribes).
Anyone else regenerating the parity .npy dumps no longer hits
ModuleNotFoundError.
2. src/main.cpp load_model_gguf: when chatterbox.variant is
present but not GGUF_TYPE_STRING, refuse to load with a clear
error instead of silently defaulting to the Turbo loader (which
would later die with a misleading "missing tensor" error from
absent GPT-2 names).
3. src/t3_mtl.cpp build_perceiver_attn: add a comment clarifying
that the LayerNorm eps is intentionally fixed at 1e-5 (PyTorch
nn.LayerNorm default) and is NOT hp.eps (the Llama backbone's
RMSNorm eps), so a future reader doesn't "fix" it and break
parity.
4. scripts/convert-t3-mtl-to-gguf.py: rename SUPPORTED_LANGUAGES
to ALL_KNOWN_LANGUAGES with a comment explaining the C++/Python
contract divergence (Python tokenizer accepts 23 codes; C++
tokenizer hard-errors on 5 of them at runtime - ja/he/ru/zh/hi
need pykakasi / dicta / russian_text_stresser / Cangjie
preprocessing). Updates the GGUF metadata writer to use the new
name.
5. src/mtl_tokenizer.cpp decode: change the leading-space guard
from `k > 0` to `!out.empty()` so that when the first id(s) are
skipped special markers ([START], [PAD], ...), the first emitted
regular token doesn't get a spurious leading space. Brings the
debug-only decoder in line with Python's reference behaviour.
Co-authored-by: Cursor <cursoragent@cursor.com>
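The difference between the two guards in fix 5 shows up as soon as the first id is a skipped special marker. The sketch below uses a toy `Token` type and joins regular tokens with single spaces; the real decoder in src/mtl_tokenizer.cpp is assumed to handle more than this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy token: its text plus whether it is a special marker such as [START].
struct Token {
    std::string text;
    bool special;
};

// Old guard: keyed on the input position, so a skipped leading marker
// still causes a space before the first emitted token.
std::string decode_index_guard(const std::vector<Token>& toks) {
    std::string out;
    for (std::size_t k = 0; k < toks.size(); ++k) {
        if (toks[k].special) continue;
        if (k > 0) out += ' ';
        out += toks[k].text;
    }
    return out;
}

// New guard: keyed on what has been emitted so far.
std::string decode_output_guard(const std::vector<Token>& toks) {
    std::string out;
    for (std::size_t k = 0; k < toks.size(); ++k) {
        if (toks[k].special) continue;
        if (!out.empty()) out += ' ';
        out += toks[k].text;
    }
    return out;
}
```

For the sequence `[START]`, `hola`, `mundo`, the index guard returns `" hola mundo"` and the output guard returns `"hola mundo"`.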
GustavoA1604
pushed a commit
that referenced
this pull request
May 4, 2026
feat: Add multilingual variant
CFG cond+uncond batched into one Metal forward (B=2) on the multilingual T3, plus
a `--cfm-steps N` CLI knob for the standard 10-step CFM, plus `ggml_swiglu_split` on the MLP.
Results
M3 Ultra Metal, Spanish prompt, seed 42, greedy, 3-warm-run avg.
What worked
- `use_b2` from S3Gen; cond+uncond pack into `ne[3]=2` and share a unified KV buffer. CPU keeps the two-call fallback. WAV byte-exact vs baseline.
- `--cfm-steps 7`: −22% S3Gen at log-mel cosine 0.995 vs N=10. Default unchanged (10).
- `ggml_swiglu_split`: single fused `kernel_swiglu_f32` instead of three element-wise ops. Marginal on M3 Ultra (Q4_0 −4%) but byte-exact and future-proof.
What didn't work
- `flash_attn_ext` already runs its inner matmul at F16 regardless of K/V storage dtype. Left as one-line opt-in (`GGML_TYPE_F32` → `F16`) for memory-bound backends (e.g. M4).
- `RMS_NORM+MUL`/`SwiGLU`: already covered by upstream's `can_fuse(...)`. F16 `mul_mat + add(bias)` stretch skipped: Llama Q/K/V/O have no bias.
- `--cfm-steps 6`: sits right on the cos ≥ 0.99 threshold (PCM cos drops to 0.88, attack reconstruction drifts). Opt-in only.